SLE712 Assignment 3
Due Friday 29th May (week 11)

● This assignment consists of two bioinformatics exercises.
● You may work individually or in groups of two or three, so long as there is evidence of contribution to the code repository by all members.
● Your submission will consist of a written report AND one GitHub repository.
● The report will be submitted as one PDF document to the CloudDeakin dropbox. Submitting in a different format will result in a 5% deduction. There is a maximum word count of 1000 words.
● The report should include a cover sheet with names, student numbers, unit code, date of submission and assignment title.
● If you describe ideas and works that are not your own, you must reference your sources with in-text citations and a list of references according to the Harvard style: https://www.deakin.edu.au/students/studying/study-support/referencing/harvard
● For any further questions, please contact me by discussion board or email: firstname.lastname@example.org
● This assignment is worth 20% of your total grade for the unit. A breakdown of the marks is provided:

Marks given (total=100)
Part 1 (50 marks)
Code works (2 marks per qn)
Code documentation (README and comments)
Evidence of team coding (source control)
Written answers (1 mark per qn)
Part 2 (50 marks)
Code works for points 1 – 4; 2 marks each
Code works for points 5 – 6; 6 marks each
Code documentation (README and comments)
Evidence of team coding (source control)
Written answers for points 1-4; 1 mark each
Written answers for points 5-6; 3 marks each

Marking criteria (for each criterion, descriptors are listed from the highest to the lowest standard):
● Code works
○ The code provided on GitHub executes without errors and generates the correct answer.
○ The code provided on GitHub has a slight mistake which gives an incorrect answer, but there is evidence that the student has used the learning materials and made an attempt.
○ The code yields an error or there is a major mistake, or no GitHub repository was provided.
● Code documentation
○ The repository has a detailed README that accurately describes the contents. The code contains enough comments to describe what each chunk of code is doing.
○ There is a README and some comments, but they are not detailed enough or contain inaccurate information.
○ There was no attempt to document the repository.
● Evidence of team coding
○ All group members made numerous contributions including code, issues and documentation versioning.
○ Each member made one contribution to the repository.
○ Only one commit was made, or no repo was provided.
● Written answers
○ Addresses the question accurately and is consistent with the code provided. Student provided a clear description of how the problem was solved.
○ The question was answered accurately but the method used to solve the problem was not given. Minor inconsistencies between answer and code. Minor grammar or spelling errors.
○ Student response did not answer the question or there are major inconsistencies between code and answer. Major grammatical and spelling errors.

Part 1: Importing files, data wrangling, mathematical operations, plots and saving code on GitHub

The purpose of this exercise is for you to develop skills in problem solving, R coding, and working together as a team using RStudio and GitHub. You will be provided with two data files to work with: “gene_expression.tsv” and “growth_data.csv”, which are available from this URL*:
https://github.com/markziemann/SLE712_files/tree/master/bioinfo_asst3_part1_files
* To download a file with R, click on “view raw”, copy the URL from the address bar and then use the download.file command in R.
________________________________________________________________________________
● For points 1-10 below
○ Describe how you solved the problem.
○ Provide the answer as directed. The answer could be descriptive, numerical, categorical, a table or a chart.
● Provide a link to a GitHub repository with the following:
○ The code should run without errors, and yield answers to points 1-10 below.
○ If working in a group, there needs to be evidence that all group members have made contributions to the code repository. This means that there need to be “commits” and “issues” from each group member.
○ A README that describes the purpose of each script and their inputs and outputs.
○ The code should contain sufficient comments so that someone else can understand what each line or chunk of code is trying to achieve.
________________________________________________________________________________
The file “gene_expression.tsv” contains RNA-seq count data for two samples of interest.
1. Read in the file, making the gene accession numbers the row names. Show a table of values for the first six genes.
2. Make a new column which is the mean of the other columns. Show a table of values for the first six genes.
3. List the 10 genes with the highest mean expression.
4. Determine the number of genes with a mean <10.
5. Make a histogram plot of the mean values in png format and paste it into your report.

The file “growth_data.csv” contains measurements of tree circumference at two sites, a control site and a treatment site, which were planted 20 years ago.
6. Import this csv file into an R object. What are the column names?
7. Calculate the mean and standard deviation of tree circumference at the start and end of the study at both sites.
8. Make a box plot of tree circumference at the start and end of the study at both sites.
9. Calculate the mean growth over the past 10 years at each site.
10. Use the t.test and wilcox.test functions to estimate the p-value that the 10-year growth is different at the two sites.

Part 2: Determine the limits of BLAST

In class you will be shown how to
● Download and unzip files
● Perform simple manipulations and analyses with sequence data
● Use a provided function to incorporate point mutations into a sequence
● Use provided functions to perform a BLAST search and interpret results

In this assignment we will be testing your ability to use supplied functions to perform an analysis into the limits of BLAST. Your group will be allocated one E. coli gene sequence found in the file:
https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part2_files/sample.fa
For example, if your RStudio username is student71 then your sequence is 71. Each group selects just 1 sequence. Next, you will need the whole set of E. coli genes, which can be downloaded from this link:
ftp://ftp.ensemblgenomes.org/pub/bacteria/release-42/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/cds/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz
________________________________________________________________________________
● For points 1-6 below
○ Describe how you solved the problem.
○ Provide the answer as directed. The answer could be numerical, categorical, a table or a chart.
● Provide a link to a GitHub repository with the following:
○ The code should run without errors, and yield answers to questions 1-6 below.
○ If working in a group, there needs to be evidence that all group members have made contributions to the code repository. This means that there need to be “commits” and “issues” from each group member.
○ A README that describes the purpose of each script and their inputs and outputs.
○ The code should contain sufficient comments so that someone else can understand what each line or chunk of code is trying to achieve.
________________________________________________________________________________
1. Download the whole set of E. coli gene DNA sequences and use gunzip to decompress. Use the makeblast() function to create a blast database. How many sequences are present in the E. coli set?
2. Download the sample fasta sequences and read them in as above. For your allocated sequence, determine the length (in bp) and the proportion of GC bases.
3. You will be provided with R functions to create BLAST databases and perform blast searches. Use blast to identify what E. coli gene your sequence matches best. Show a table of the top 3 hits including percent identity, E-value and bit scores.
4. You will be provided with a function that enables you to make a set number of point mutations to your sequence of interest. Run the function and write R code to check the number of mismatches between the original and mutated sequence.
5. Using the provided functions for mutating and BLASTing a sequence, determine the number and proportion of sites that need to be altered to prevent the BLAST search from matching the gene of origin. Because the mutation is random, you may need to run this test multiple times to get a reliable answer.
6. Provide a chart or table that shows how the increasing proportion of mutated bases reduces the ability of BLAST to match the gene of origin. Summarise the results in 1 to 2 sentences.
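
The download tip in Part 1 (click “view raw”, copy the URL from the address bar, then call download.file) could be wrapped as in the minimal R sketch below. The helper name fetch_file and its arguments are hypothetical and are not part of the supplied course functions. It skips the download when a local copy already exists, so re-running a script does not fetch the file again.

```r
# Hypothetical helper (an assumption, not one of the provided course functions):
# download a file once and reuse the local copy on later runs.
fetch_file <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # download.file() returns 0 (invisibly) when the transfer succeeds
    status <- download.file(url, destfile = destfile, quiet = TRUE)
    if (status != 0) {
      stop("Download failed for: ", url)
    }
  }
  # Return the local path so it can be passed straight to a reader function
  destfile
}
```

For example, fetch_file(raw_url, "gene_expression.tsv"), where raw_url holds the address copied after clicking “view raw”, returns the local path, which can then be passed to read.table or read.csv.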