Practical RNA-seq Analysis - Bioinformatics and Research Computing
Practical RNA-seq Analysis
Yanmei Huang
BaRC Hot Topics – March 17, 2015
Bioinformatics and Research Computing, Whitehead Institute
http://barc.wi.mit.edu/hot_topics/

Outline
• RNA-seq overview
• Before sequencing
• QC (Hot Topic on Feb 12, 2015)
• Mapping (Hot Topic on Feb 12, 2015)
• Quantifying (today’s focus)
• Differential expression (today’s focus)
• Presenting the results

RNA-seq Overview
What is RNA-seq?
- High-throughput sequencing of RNA-derived cDNAs
  ‘Regular’ RNA-seq: total RNA, mRNA, ncRNA
  Ribosome profiling: ribosome-protected RNA fragments
  RIP-seq, CLIP-seq, PAR-CLIP, iCLIP: protein-bound RNA fragments
Purpose?
- To identify and quantify (same as microarray)
Advantages? (compared to microarray)
Identify: single-nucleotide resolution
- RNA editing
- alternative splicing
- novel transcripts
Quantify:
- Bigger dynamic range
- More robust and repeatable (arguably)

Before sequencing
Experimental design
• Replication (especially biological replicates)
• Multiplexing
• Read length
• Read depth
• Paired or unpaired
• Stranded or unstranded
Lab practice
• The “garbage in, garbage out” rule applies!
• Use only high-quality input RNA for library prep

Data analysis pipeline
1. Quality control
2. Mapping
3. Quantifying
4. Differential expression
5. Presenting the results

Quality control
• Check the quality of the raw reads file with FastQC
  http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  See handout for the fastqc command (step 1)
• Respond to the QC analysis:
  – Filter poor-quality reads
  – Trim poor-quality positions
  – Trim adapter and/or other vector sequence
• Check the quality of the modified reads file
• See previous Hot Topic: NGS: Quality Control and Mapping Reads (Feb 12, 2015)

Responding to quality issues
• Method 1:
  – Keep all reads as is
  – Map as many as possible
• Method 2:
  – Drop all poor-quality reads
  – Trim poor-quality bases
  – Map only good-quality bases
• Which makes more sense for your experiment?
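To make "Method 2" concrete, here is a minimal sketch of trimming poor-quality bases from the 3' end of a read and dropping reads that end up too short. It assumes Phred+33 quality encoding (check the FastQC report for your data), and the threshold of 20 and minimum length of 25 are illustrative assumptions, not recommendations from the slides. The function names are hypothetical; in practice a dedicated trimming tool would do this step.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Remove trailing bases whose quality is below min_quality (assumed cutoff)."""
    scores = phred33_scores(quality_string)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < min_quality:
        keep -= 1
    return sequence[:keep], quality_string[:keep]


def passes_length_filter(sequence, min_length=25):
    """Keep a read only if enough bases survive trimming (assumed cutoff)."""
    return len(sequence) >= min_length


# Hypothetical read: eight good bases ('I' = Q40) followed by two poor ones.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIIII#!")
print(trimmed_seq, trimmed_qual)  # ACGTACGT IIIIIIII
print(passes_length_filter(trimmed_seq))  # False: only 8 bases remain
```

Method 1 corresponds to skipping both functions and handing every read to the mapper unchanged.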

Mapping Reads
Various mappers for spliced alignment:
TopHat (recommended), STAR, GSNAP, MapSplice, PALMapper, ReadsMap, GEM, PASS, GSTRUCT, BAGET…
Pär G Engström et al., Nature Methods, Vol. 10 No. 12, December 2013

Spliced alignment with tophat2
http://ccb.jhu.edu/software/tophat/manual.shtml
Command: tophat [options] <genome_index_base> <reads.fastq>
- see handout “step 2” for sample tophat commands
Commonly considered parameters:
  quality score encoding    Find out from the FastQC result
  --segment-length          Shortest length of a spliced read that can map to one side of the junction (default: 25)
  --no-novel-juncs          Only look at reads across junctions in the supplied GFF file
  -G <GTF file>             Map reads to a virtual transcriptome (from the GTF file) first
  -N                        Max. number of mismatches in a read (default: 2)
  -o/--output-dir           default: tophat_out
  --library-type            fr-unstranded, fr-firststrand, or fr-secondstrand
  -I/--max-intron-length    default: 500000

Mapping reads
• Inspect the mapping result in a genome browser
• Remap if necessary

Quantifying Expression
• Different ways of quantifying expression level for different purposes:
  – Normalized count
  – RPKM / FPKM: Reads (Fragments) Per Kilobase per Million mapped reads
(Figure: Gene 1 and Gene 2 in Sample 1 and Sample 2)

Counting methods
• htseq-count (recommended)
  http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
  – Output is raw counts
• Cufflinks
  http://cufflinks.cbcb.umd.edu
  – Output is FPKM and related statistics
• Bedtools (intersectBed; coverageBed)
  http://code.google.com/p/bedtools/
  – Output is raw counts (but may need post-processing)

Using htseq-count
Command: htseq-count [options] <alignment_file> <gff_file>
(Figure: htseq-count "modes")
Frequently considered options:
  option                  default
  -r <order>              name
  -s <yes/no/reverse>     yes
  -t <feature type>       exon
  -m <mode>               union
  -f, -i, -o, -q, -h      …
Keep in mind: only unique alignments are counted!
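The RPKM definition given under Quantifying Expression (reads per kilobase per million mapped reads) can be written out as a short calculation. This is a minimal sketch; the function name and the example numbers are hypothetical and chosen only to make the arithmetic easy to follow. For paired-end data the same formula with fragments in place of reads gives FPKM.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    length_kb = gene_length_bp / 1_000
    mapped_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * mapped_millions)


# Hypothetical gene: 500 reads, 2,000 bp long, library of 20 million mapped reads.
example_rpkm = rpkm(500, 2_000, 20_000_000)
print(example_rpkm)  # 12.5

# Same read count on a gene twice as long gives half the RPKM.
longer_gene_rpkm = rpkm(500, 4_000, 20_000_000)
print(longer_gene_rpkm)  # 6.25
```

Because RPKM divides by gene length, it suits comparing different genes within a sample, whereas tools that test for differential expression on counts (such as DESeq) expect the raw counts that htseq-count produces.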

Differential Expression Analysis
• Normalization
• Estimating variability (dispersion)

Differential Expression Analysis
• Normalization
  - Problem with simply normalizing to total reads
    (Figure: samples with total reads of 200 and 400)
• Variation (dispersion of data)

Preferred normalization method: geometric (implemented in DESeq, cuffdiff)
1. Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples:
   \left( \prod_{i=1}^{n} x_i \right)^{1/n}
2. Calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample.
3. Take the median of all the quotients to get the relative depth of the library and use it as the scaling factor.

Example counts (normalized counts for Gene 1 in parentheses):
          Sample 1      Sample 2     Sample 3     Sample 4
  Gene 1  2135 (447)    615 (586)    128 (346)    161 (288)
  Gene 2  600           58           103          189
  Gene 3  3150          1346         68           88
  Gene 4  378           187          11           22

  > cds = estimateSizeFactors(cds)
  > sizeFactors(cds)
   Sample 1  Sample 2  Sample 3  Sample 4
  4.7772242 1.0490870 0.3697529 0.5590669

Differential Expression Analysis
Dispersion affects the ability to differentiate
(Figure: x1 and x2 compared in two panels)

Methods for estimating dispersion implemented by DESeq
  pooled          Each replicated condition is used to build a model, then these models are averaged to provide a single global model for all conditions in the experiment. (Default)
  per-condition   Each replicated condition receives its own model. Only available when all conditions have replicates.
  blind           All samples are treated as replicates of a single global "condition" and used to build one model.
  pooled-CR       Fit models according to modelFormula and estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL).

Interpreting DESeq output
Column guide:
  id                              Gene ID (from GTF file)
  baseMean                        Mean normalized counts
  baseMean.CEU, baseMean.YRI      Mean normalized counts for each group
  YRI/CEU                         Fold change
  log2(YRI/CEU)                   Log2 (fold change)
  pval                            Raw p-value
  padj                            FDR p-value
  CEU_NA07357 … YRI_NA19200       Raw counts
  *_norm                          Normalized counts = raw / (size factor)

Statistics:
  id               baseMean  baseMean.CEU  baseMean.YRI  YRI/CEU  log2(YRI/CEU)  pval      padj
  ENSG00000213442  74.71     149.06        0.36          0.00     -8.69          8.13E-18  1.55E-13
  ENSG00000138061  83.15     159.78        6.52          0.04     -4.62          5.61E-12  5.36E-08
  ENSG00000198618  11.71     23.42         0.00          0.00     -Inf           1.51E-06  0.00
  ENSG00000134184  8.17      0.00          16.34         Inf      Inf            0.00021   0.11
  -Inf is displayed as #NAME? in the spreadsheet; Inf results from division by 0.

Raw counts:
  id               CEU_NA07357  CEU_NA11881  YRI_NA18502  YRI_NA19200
  ENSG00000213442  169          145          1            0
  ENSG00000138061  184          153          10           4
  ENSG00000198618  31           19           0            0
  ENSG00000134184  0            0            15           15

Normalized counts:
  id               CEU_NA07357_norm  CEU_NA11881_norm  YRI_NA18502_norm  YRI_NA19200_norm
  ENSG00000213442  148.28            149.85            0.72              0
  ENSG00000138061  161.44            158.12            7.2               5.83
  ENSG00000198618  27.2              19.64             0                 0
  ENSG00000134184  0                 0                 10.8              21.88

sizeFactors (from DESeq):
  CEU_NA07357  CEU_NA11881  YRI_NA18502  YRI_NA19200
  1.1397535    0.9676201    1.3891488    0.6856959

Presenting results
• What do you want to show?
• All-gene scatterplots can be helpful to
  – See level and fold-change ranges
  – Identify sensible thresholds
  – Hint at data or analysis problems
• Heatmaps are useful if many conditions are being compared, but only for gene subsets
• Output normalized read counts with the same method used for the DE statistics
• Whenever one gene is especially important, look at the mapped reads in a genome browser

Scatterplots
(Figure: standard scatterplot; MA (ratio-intensity) plot)

Clustering and heatmap
Software:
• Cluster 3.0: log-transform, mean center, cluster
• Java TreeView: visualize, export
• Illustrator: assemble
Liang et al.
BMC Genomics 2010, 11:173

Further mining of the data
Find enrichment in descriptive terms, regulatory pathways and networks:
Commercial
• Ingenuity Pathway Analysis
• Pathway Studio
• GeneGo
• …
Public domain
• DAVID
• GSEA
• …

Summary
• Experimental design
• Quality control (fastqc)
• Mapping spliced reads (tophat)
• Counting gene levels (htseq-count)
• Identifying "differentially expressed" genes (DESeq R package)
• Exploring your results

Slides and exercise materials are available at
http://jura.wi.mit.edu/bio/education/hot_topics/
More practical commands and procedures are available at
http://barcwiki.wi.mit.edu/wiki/SOPs