Gene 760 â Problem Set 4 The purpose of this problem set is to
Transcription
Gene 760 â Problem Set 4 The purpose of this problem set is to
Gene 760 – Problem Set 4 The purpose of this problem set is to familiarize students with exome sequence analysis in the context of human disorders. By the end of this problem set we hope students will have learned the following skills: Familiarity with the methods of exome sequencing Mechanisms by which sporadic disorders can occur in humans Familiarity with FASTQ, BAM and VCF formats Writing Python scripts to process large datasets and using external modules (pysam) Using test files to design a sequencing analysis pipeline using GATK/BWA/Picard To teach these skills we will be using high-throughput sequencing data (Illumina HiSeq 2000) generated from a targeted exome capture (Nimblegen EZ2). The data is generated from a single family from the Simons Simplex Collection in which the proband (11114.p1) is affected with autism, while the sibling (11114.s1), mother (11114.mo), and father (11114.fa) are unaffected. Students are to submit a gzipped tarball named [YourNetID]_PS0.tar.gz containing: [YourNetID]_PS0_answers.txt : Text file with the responses to the questions below [YourNetID]_denovo_variants.py: Commented Python script used to answer the final problem SeattleSeq annotated variant files for the proband and sibling to ‘DROPBOX/PS4’ by 9:00PM on Sunday, April 12th. 1) Generate symbolic links from your Problem_Set_4 directory to the eight FASTQ files in: DATA/PS4. In addition to read sequences, the files contain quality data for each basepair of sequence, represented by ASCII characters. A description of the FASTQ format is given here: http://en.wikipedia.org/wiki/FASTQ_format; by looking at the first few lines of each file work out if these data are encoded in Illumina 1.5+ or Illumina 1.8+ format. ANS: State the command you used to generate the symbolic links ANS: State the format of these quality data and how you can know this. If you wanted to be absolutely sure how could you check? ANS: What is the read length (nt) of these data? 2) Detecting variants in genomic data requires multiple steps; to design a pipeline for doing this it is best to start with a smaller, but representative, data set. We will generate a test file using the paternal dataset which prints out 1 in every 1000 FASTQ reads from ‘Samp_11114.fa_Pair_1.fq’ and ‘Samp_11114.fa_Pair_2.fq’ (this is called subsampling). Use the following awk command for Pair_1 and repeat it for Pair_2 awk 'NR%4000 == 13{c=3;{print}next}c-->0' Samp_11114.fa_Pair_1.fq > Test_Pair_1.fq ANS: State the number of reads in the full FASTQ files for 11114.fa, and the test files 3) To build this exome sequencing pipeline, you will use BWA (v0.7.3a), Picard Tools, a custom python module, and GATK (Genome Analysis Toolkit). These are installed for you courtesy of the .bashrc file you copied in PS0. Note that you cannot add a directory of .jar files (java executables, used by Picard and GATK) to your PATH without using special commands. Jar files are executables that require the command structure “java file.jar [arguments]” in order to be run on the cluster. You will need to run the following utilities directly from the paths below, or copy/link the executable .jar files into your working directory. /home2/tgj2/gene760/TOOLS/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar /home2/tgj2/gene760/TOOLS/picard-tools-1.88/SortSam.jar /home2/tgj2/gene760/TOOLS/picard-tools-1.88/SamFormatConverter.jar /home2/tgj2/gene760/TOOLS/picard-tools-1.88/MarkDuplicates.jar IMPORTANT: To ensure smooth operation start each java command with: ‘java -Xmx2g -Djava.io.tmpdir=`pwd`/tmp file.jar -jar command.jar’ ANS: What do the above java arguments mean? ANS: BWA contains three alignment algorithms. State which one is most appropriate to this dataset, and why. 4) Using the two test files, align the data with BWA and output to a .sam file; ensuring that the output is compatible with Picard and GATK. You can do this by: - Adding a read group header that is independent for each sample (e.g. for the father, ‘@RG\tID:HiSeq.EZ2\tSM:11114.fa'. - Using the reference: ‘DATA/PS4/human_g1k_v37.fasta’. - Marking shorter split hits as secondary - Your command should run with 4 threads ANS: Write the BWA command and a description of what each argument specifies/does. ANS: In the read group header what do ID and SM stand for? What sort of information should be included? 5) Using the output from BWA, use Picard to convert the SAM file into a BAM file. ANS: Write the command. ANS: What is the advantage of using a BAM file over a SAM file? Are there any disadvantages? 6) Some commands can be joined together using the ‘|’ (pipe) symbol. The output of the first command needs to be written to STDOUT (this often happens automatically, the ‘>’ directs STDOUT into a file) and the input of the second command needs to capture STDIN. In Picard STDIN can be read by setting the input file as ‘/dev/stdin’. Combine the BWA and Picard commands into a single command using the pipe. ANS: Write the command and a description of how it works ANS: What are the advantages of using a pipe rather than simply doing each stage separately? 7) Use Picard to sort the resulting BAM file (use SORT_ORDER=coordinate) Use Picard to mark duplicates in the resulting file and create an index at the same time. (use CREATE_INDEX=true METRICS_FILE=file_metrics ; where file_metrics is the metrics file created by the output of the previous commands) ^ Note that these commands cannot be used with pipes since they need to read multiple lines at the same time. ANS: Write the commands and explain the command arguments used ANS: Why do we mark duplicates? 8) Use GATK (TOOLS/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar) to call variants with the HaplotypeCaller method, output the results to a .vcf file (add the option ‘-rf BadCigar’ to filter out any reads with incorrect Cigar strings) ANS: Write the command you used to do this and explain the command arguments used 9) By running the commands from parts 6, 7, and 8 you have developed a simple exome sequencing pipeline. Once this is functional without errors, you could apply it to the full FASTQ files BUT this would take approximately 24 hours per sample!! (rough guide per sample: BWA to BAM 3.5hrs; Sort BAM 1.5hrs; MarkDups 1.5hrs; Variant Calling 17hrs). Thus, to save you computational resources and time, we will provide sorted bam files with marked duplicates to you so that you only have to execute the final variant calling step (8). These files are in REFERENCE/PS4_REFERENCE/*MD.ba* ANS: Write the commands used ANS: State the number of variants in each sample (maternal, paternal, proband, sibling) 10) Write a python script to find de novo variants. Specifically filter the VCF file (preserving its format) to identify substitution variants that are: heterozygous (35-65% variant read frequency) with at least 8 variant reads in the child; absent in both parents; covered by at least 10 reads in both parents (use the pysam module to count reads in specific regions). Annotate the VCF file using SeattleSeq (http://snp.gs.washington.edu/SeattleSeqAnnotation141/) on the default settings. IMPORTANT: To run the python file, use ‘py27 file.py’ rather than ‘python file.py’. py27 is an alias in your .bashrc that links to a version of python on the cluster with the pySAM module installed. Pysam documentation can be found here: http://pysam.readthedocs.org/en/latest/ ANS: How many variants were present after filtering by the Python script? ANS: Save your python script (NetID_denovo_variants.py) and SeattleSeq-annotated variant files in the proband and sibling ANS: Do any de novo variants stand out as potential disease candidates? Why? ANS: Outline a strategy to identify other variants in this family with a potentially high (Odds ratio > 2) contribution to autism risk. Optional programming challenges For those of you with some programming experience looking to practice your python skills, here are a couple optional challenges. These will NOT be worth more than a point of extra credit (negligible to your grade), and you should not complete them if you haven’t finished the problem set already. If you decide to work on these, please use a separate file from your main script when handing in. Medium: Create a script that searches for possible recessive LOF alleles that may be involved in the disease state. In this case, you should look for heterozygous SNPs in both parents that are homozygous in the affected proband but not in the unaffected sibling.