ChimPipe Documentation Release v0.8.0 Bernardo Rodríguez-Martín, Emilio Palumbo and Sarah Djebali
November 03, 2014

Contents
1.1 ChimPipe for chimera detection
1.2 Installation
1.3 Manual
1.4 FAQ
1.5 Downloads
1.6 Contact

ChimPipe is a pipeline for the detection of chimeric transcripts from paired-end RNA-seq data, developed at the Computational Biology of RNA Processing group of the Centre for Genomic Regulation (CRG) in Barcelona, Spain. ChimPipe's source code is freely available from our GitHub repository.

1.1 ChimPipe for chimera detection

1.1.1 Biological importance of chimeras

Chimeras are transcripts whose sequence is encoded in two or more different genes. The study of these transcripts is relevant in two different contexts:

• Cancer genomics. It is very well known that the generation of fusion genes through chromosomal rearrangements is a major driver in certain types of cancer. These are hybrid genes formed from two previously separate genes that encode altered proteins with abnormal activity. Thus, the identification and study of chimeric transcripts in cancer can help detect such alterations in cancer genomes and eventually lead to new diagnoses and therapeutic targets.
• Genome biology.
Understanding how information is encoded in the genome is one of the challenges of current biology. In recent years, an increasing number of chimeric transcripts have been reported in non-cancer samples from different species. Although their function and the mechanism through which they arise are not well understood, their existence could reveal new dimensions in how information is encoded in the genome. One could hypothesize that combining sequences from different genes is a way of increasing the information content of genomes. Several transcriptional events have been proposed to explain how these transcripts could arise:
  – Trans-splicing. Splicing between two pre-mRNAs from different genes, producing a chimera.
  – Readthrough or conjoined transcription. Combined transcription of two consecutive genes on the same chromosome and strand. Typically, such chimeras begin at the promoter of the upstream gene and end at the termination point of the downstream gene.
  – SHS-transcriptional slippage. Pairing of short homologous sequences (SHS) between two different transcripts that leads to a recombination producing a chimeric transcript.

1.1.2 ChimPipe's principle

Briefly, ChimPipe uses the GEMtools RNA-seq pipeline to contiguously map the read pairs to the genome and the annotated transcriptome, and segmentally maps them to the genome to find de novo splice junctions on the same chromosome and strand. Then, in a second mapping step, it uses the GEM RNA mapper to segmentally map the reads that could not be mapped this way, this time allowing de novo splice junctions that connect different chromosomes and strands. ChimPipe then aggregates the segmentally mapped reads whose two blocks map to two different genes into chimeric junctions, and applies several filters, such as the number of reads spanning the chimeric junction and the number of read pairs encompassing it.
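The support-based filtering mentioned above can be pictured with a small sketch. This is only an illustration of the idea, not ChimPipe's actual implementation: the three-column input layout, the function name filter_by_support and the threshold values are all assumptions made for the example.

```shell
# Illustrative sketch only; not ChimPipe code.
# Input (hypothetical layout, tab-separated, read from stdin):
#   column 1: junction identifier
#   column 2: number of reads spanning the junction
#   column 3: number of read pairs encompassing the junction
# Arguments (hypothetical thresholds, not ChimPipe defaults):
#   $1: minimum spanning reads   $2: minimum encompassing read pairs
filter_by_support() {
    awk -F'\t' -v s="$1" -v p="$2" '$2 >= s && $3 >= p { print $1 }'
}

# Example call: keep junctions with at least 2 spanning reads and 1 read pair.
# printf 'juncA\t3\t2\njuncB\t1\t0\n' | filter_by_support 2 1
```

A junction is kept only when both kinds of evidence reach their threshold, which is the sense in which the two filters are described as complementary.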
The ChimPipe paper is currently being written.

1.1.3 ChimPipe's features

• Can handle both unstranded and stranded data
• Provides the precise chimeric junction coordinates
• Outputs many pieces of information for each chimeric junction
• Has a good balance between sensitivity and specificity (according to our benchmark)

1.2 Installation

1.2.1 System requirements

Hardware:
• 64-bit CPU
• RAM: ~40G for 100 million human PE reads of 52 bases.

Software:
• 64-bit Linux system
• Bedtools v2.20.1 or higher
• Samtools v0.1.19 or higher
• Blast v2.2.29+ or higher (only if you want to generate your own similarity text files, see "Gene pair similarity file" in the Manual section)

1.2.2 Downloading ChimPipe

You can download ChimPipe from its GitHub repository in two different ways:

1. Press the Download ZIP button, located at the bottom right of the page, to download the most recent version of the code as a zip archive.
2. Clone the Git repository if you want the code under the Git version control system (contributions to the project are not accepted yet, but they will be in the future):

    # Move into the folder in which you want to clone the repository.
    $ cd /users/rg/brodriguez/bin
    # Clone it.
    $ git clone https://github.com/Chimera-tools/ChimPipe.git

1.2.3 Setting up ChimPipe's path

Finally, go to your newly created ChimPipe directory and type make to set up the path to ChimPipe in your system.

    # Move into ChimPipe.
    $ cd /users/rg/brodriguez/bin/ChimPipe
    # Type make to set up the path to ChimPipe in your system.
    $ make

1.3 Manual

This section describes ChimPipe's input files, how to generate them, how to run ChimPipe, and how to interpret its outputs.

1.3.1 Input files

ChimPipe takes as input 4 mandatory files and 1 optional file.
Mandatory:
• Paired-end (PE) RNA-seq reads
• Genome index
• Genome annotation
• Transcriptome index

Optional:
• Gene pair similarity file

Paired-end (PE) RNA-seq reads

ChimPipe has been designed to deal with Illumina paired-end RNA sequencing data. It takes as input two FASTQ files, one for the first mates and one for the second mates of the read pairs. These files need to be located in the same directory and named according to this convention:

• Mate 1. "SampleId + [.1|_1] + [.fastq|.txt] (+ .gz if compressed)"
• Mate 2. "SampleId + [.2|_2] + [.fastq|.txt] (+ .gz if compressed)"

E.g. BERGER_1.fastq.gz and BERGER_2.fastq.gz would be the compressed FASTQ files for mate 1 and mate 2 in the BERGER sample.

Warning: The read identifier should follow the Illumina convention to specify which member of the pair the read is. See our FAQ section for more information.

Warning: Make sure the 2 FASTQ files are in the same directory and that you use the same convention for both mates. E.g. if mate 1 is BERGER_1.fastq.gz, mate 2 cannot be BERGER.2.txt.gz but has to be BERGER_2.fastq.gz.

Genome index

An indexed reference genome in GEM format has to be provided for the mapping steps. We provide some pre-generated genome indices for human, mouse and drosophila in the Downloads section. If you have a different genome, you just need to run the GEMtools indexer (at ChimPipe/bin/gemtools-1.7.1-i3/gemtools) on your genome in FASTA format to produce the index:

    $ gemtools index -i genome.fa

It will produce 3 files in the directory where the genome FASTA file is located:

• genome.gem – indexed genome file in GEM format (needed for running ChimPipe).
• genome.hash – hash table with the genome (not needed).
• genome.log – indexer log file (not needed).

Tip: If your machine has more than one CPU, it is recommended to run the indexer with multiple threads. Use the option -t <threads>, where threads is the number of available CPUs.
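Coming back to the read file naming convention described under "Paired-end (PE) RNA-seq reads", the following sketch derives the expected mate 2 file name from a mate 1 file name, so that a mismatch can be caught before launching a run. The helper name mate2_name is hypothetical and is not part of ChimPipe; the pattern simply encodes the convention as written above.

```shell
# Hypothetical helper, not part of ChimPipe.
# Prints the mate 2 file name implied by the naming convention:
#   SampleId + [.1|_1] + [.fastq|.txt] (+ .gz if compressed)
mate2_name() {
    # \1 sample id, \2 separator (. or _), \3 extension, \4 optional .gz;
    # in the replacement, \2 is followed by a literal 2.
    out=$(printf '%s\n' "$1" |
        sed -E -n 's/^(.+)([._])1(\.fastq|\.txt)(\.gz)?$/\1\22\3\4/p')
    if [ -z "$out" ]; then
        echo "mate 1 name does not follow the convention: $1" >&2
        return 1
    fi
    printf '%s\n' "$out"
}

# Example call: check that the mate 2 file sits next to mate 1.
# m2=$(mate2_name /path/to/BERGER_1.fastq.gz) && [ -f "$m2" ] || echo "mate 2 missing"
```

Because the separator and extension are copied from the mate 1 name, the derived name always uses the same convention for both mates, which is what the second warning above asks for.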
Genome annotation

ChimPipe also takes as input a genome annotation file in GTF format (including annotated exons). It can contain features other than exons, e.g. introns or UTRs, but they will not be considered by ChimPipe in the chimera detection process.

This annotation needs to contain at least two (tag,value) pairs in the attribute field, with the gene_id and the transcript_id tags. Two optional (tag,value) pairs will be taken into account by ChimPipe if they are provided: gene_name and gene_type. E.g.:

    # This is an example of an annotated exon with an appropriate format.
    # The attributes are the gene_id, transcript_id (mandatory), the gene type and gene name (optional),
    # plus some additional (tag,value) pairs that will not be considered by ChimPipe.
    chr1 HAVANA exon 69091 70008 . + . gene_id "ENSG00000186092.4"; transcri
    transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number
    tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT0

Note: ChimPipe has been benchmarked with the Gencode v10 and UCSC Known Genes human gene annotations. It displayed better sensitivity with Gencode v10 and a similar false positive rate with both annotations. Thus, it is advisable to use the Gencode annotation, since it is a richer annotation that increases the sensitivity of the chimera detection process.

Transcriptome index

A transcriptome index in GEM format has to be provided as input to ChimPipe, in the same directory as the genome annotation GTF file, in order to find reads spanning annotated splice junctions.
We provide some pre-generated transcriptome indices for human, mouse and drosophila annotations in the Downloads section. If your genome annotation or your genome is different, you will need to run the GEMtools transcriptome indexer (at ChimPipe/bin/gemtools-1.7.1-i3/gemtools) on your previously generated GEM indexed genome and your annotation in GTF format, as indicated below.

    $ gemtools t-index -i genome.gem -a annotation.gtf

It will produce 5 files in your current working directory:

• annotation.gtf.junctions – annotated splice junction coordinates (not needed)
• annotation.gtf.junctions.fa – annotated splice junction sequences (not needed)
• annotation.gtf.junctions.gem – transcriptome index in GEM format (needed)
• annotation.gtf.junctions.keys – keys to convert from transcriptome to genome (needed)
• annotation.gtf.junctions.log – indexer log file (not needed)

Tip: If your machine has more than one CPU, it is recommended to run the indexer with multiple threads. Use the option -t <threads>, where threads is the number of available CPUs.

Warning: The transcriptome index has to be placed in the same folder as the genome annotation to be used by ChimPipe.

Gene pair similarity file (Optional)

One of ChimPipe's filtering steps to discard artefactual chimeras is to filter out chimeric junctions that connect genes encoding transcripts with high sequence homology. To do this, ChimPipe produces a matrix with information about the sequence similarity between annotated gene pairs. This step is done internally and takes around 1 h 30 min. So, if you want to run many samples with the same annotation, it is recommended to provide the matrix with the option --similarity-gene-pairs <MATRIX TEXT FILE>, to avoid recomputing it for every sample.
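As a minimal sketch of that recommendation, the helper below prints one ChimPipe command per sample, all pointing at the same matrix. It only prints the commands and runs nothing. The function name chimpipe_batch_cmds and every file name are placeholders chosen for the example; the options -i, -g, -a and --similarity-gene-pairs are the ones documented in this manual.

```shell
# Hypothetical helper, not part of ChimPipe: dry run that prints the commands.
# Arguments: genome index, annotation, similarity matrix, then mate 1 files.
chimpipe_batch_cmds() {
    genome="$1"; annot="$2"; matrix="$3"
    shift 3
    for mate1 in "$@"; do
        printf 'bash ChimPipe.sh -i %s -g %s -a %s --similarity-gene-pairs %s\n' \
            "$mate1" "$genome" "$annot" "$matrix"
    done
}

# Example call with placeholder file names:
# chimpipe_batch_cmds genome.gem annotation.gtf matrix.txt A_1.fastq.gz B_1.fastq.gz
```

Printing the commands first makes it easy to review them, or to hand them to a job scheduler, before spending compute time on the mapping steps.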
You can download our pre-generated matrices for human, mouse and drosophila annotations from the Downloads section, or you can produce your own matrix with the script ChimPipe/src/bash/similarity_bt_gnpairs.sh as follows:

    $ bash similarity_bt_gnpairs.sh annot.gtf genome.gem

Please check out our FAQ section for more information about how the script works.

Warning: Make sure you run ChimPipe with a similarity matrix generated from the annotation and genome you are using.

1.3.2 Executing ChimPipe

1. Setting up the environment

As explained in the Installation section, you need to have BEDtools, SAMtools and Blast installed on your system to execute ChimPipe. If you do not have them, you can download and install them from their web pages. Once they are installed, you have to export the path to their binaries. Please check out our FAQ section if you have any problem.

2. Running ChimPipe

Once you have generated the genome and the transcriptome indices, and you know the quality offset and the library type of your PE RNA-seq reads, you can run ChimPipe as follows:

    bash ChimPipe.sh -i reads_1.fastq -g genome.gem -a annotation.gtf

All the files and parameters given as input to ChimPipe above are mandatory arguments. Please see their description below:

    -i|--input reads_1.fastq - First mate sequencing reads. ChimPipe deals with paired-end data. Please make sure the second mate file is in the same directory as the first one, and the files are named according to the same convention. E.g. the second mate of "reads_1.fastq" should be "reads_2.fastq".
    -g|--genome-index genome.gem - Index for the reference genome in GEM format.
    -a|--annotation annotation.gtf - Reference genome annotation file in GTF format. The transcriptome index has to be in the same directory as the annotation.

Optional arguments. Run ChimPipe.sh -h or --help to see a short help with the main options.
You can also run ChimPipe.sh --full-help to see all the possible options.

Tip: If your machine has more than one CPU, it is recommended to run ChimPipe with multiple threads (at least 4). It will speed up the mapping steps a lot. Use the option -t|--threads <threads>, where threads is the number of CPUs available.

Note: The pipeline is restartable: if ChimPipe fails and you run it again, it will skip the steps that have already been completed. You just need to make sure that you have removed the files generated in the step where the pipeline failed.

1.3.3 Output

By default, ChimPipe produces 3 output files:

• First mapping BAM file
• Second mapping MAP file
• Chimeric junctions file

Tip: If you want to keep intermediate output files, run ChimPipe with the --no-cleanup option.

First mapping BAM file

BAM file containing the reads mapped to the genome, transcriptome and de novo transcriptome with the GEMtools RNA-seq pipeline. BAM is the standard format for next-generation sequencing alignments, so most analysis tools work with it. The BAM file produced can therefore be used for other downstream analyses, such as gene and transcript quantification or differential gene expression analysis.

Second mapping MAP file

MAP file containing the reads segmentally mapped to the genome, allowing for interchromosomal, different-strand and unexpected genomic order mappings.

Chimeric junction file

Tabular text file containing the chimeric junctions detected in your RNA-seq dataset. Each row corresponds to a chimeric junction and has 19 fields containing information about it. Here is a brief description of the 19 fields:

1. juncId - Chimeric junction identifier. It is a string encoding the position of the chimeric junction in the genome as follows: chrA_breakpointA_strandA:chrB_breakpointB_strandB. E.
g., "chr4_90653092_+:chr17_22023757_+" is a chimeric junction between position 90653092 of chromosome 4 on the plus strand and position 22023757 of chromosome 17 on the plus strand.
2. nbstag - Number of staggered reads supporting the chimera.
3. nbtotal - Total number of reads supporting the chimera.
4. maxbeg - Maximum beginning of the chimeric junction. The starting position at which
5. maxEnd - Maximum end of the junction.
6. samechr - Flag specifying whether the connected gene pairs are on the same chromosome (1) or not (0).
7. samestr - Flag specifying whether the connected gene pairs are on the same strand (1) or not (0); NA if the samechr field is 0.
8. dist - Distance between the two breakpoints; NA if the samestr field is 0.
9. ss1 - Splice donor site sequence.
10. ss2 - Splice acceptor site sequence.
11. gnlist1 - List of genes overlapping the first part of the chimera.
12. gnlist2 - List of genes overlapping the second part of the chimera.
13. gnname1 - Name of the genes in the field gnlist1, "." if unknown.
14. gnname2 - Name of the genes in the field gnlist2, "." if unknown.
15. bt1 - Biotype of the genes in the field gnlist1, "." if unknown.
16. bt2 - Biotype of the genes in the field gnlist2, "." if unknown.
17. PEsupport - Total number of read pairs supporting the chimera, "." if there is no paired-end support. It is a string containing the number of read pairs supporting the connection between the involved gene pairs, as follows: geneA1-geneA2:nbReadPairs,geneB1-geneB2:nbReadPairs. E.g. "1-1:1,3-1:2" means that the connection between gene 1 in gnlist1 and gene 1 in gnlist2 is supported by 1 read pair, and the connection between gene 3 in gnlist1 and gene 1 in gnlist2 is supported by 2 read pairs.
18.
maxSim - Maximum percentage of similarity in the BLAST alignment between the transcripts with the longest BLAST alignment, "." if no BLAST hit was found.
19. maxLgal - Maximum length of the BLAST alignment between all the transcripts of the gene pairs connected by the chimeric junction, "." if no BLAST hit was found.

Example

    chr1_121115975_+:chr1_206566046_+ 1 1 121115953 206566073 1 1 85450071 GC AG SRGAP2D, SRGAP2, SRGAP2D, SRGAP2 . . 1-1:2, 99.44 1067

1.4 FAQ

1.4.1 Does ChimPipe consider read quality in the mapping step?

The quality score (Q) measures the probability that a base is called incorrectly by the sequencing machine. Within your FASTQ files, quality scores are given in the fourth line of each read as an ASCII character string (each character corresponds to the Q score of one base in the sequencing read). The correspondence between each ASCII character and the Q score is based on an offset. This offset varies depending on the sequencing platform (Illumina machines from CASAVA v1.8 onwards use 33, while older ones use 64).

Yes, ChimPipe deals with both the 33 and 64 quality encodings. It automatically detects the encoding from your reads, so you do not need to specify it as a parameter.

1.4.2 Which sequencing library protocols are supported?

Different protocols can be used to generate an RNA-seq library. Currently, ChimPipe can handle data generated with the following protocols:

• Non strand-specific protocols (unstranded data). The information about which strand the transcript is transcribed from is not available.
• Strand-specific protocols (stranded data):
  – MATE1_SENSE. Mates 1 are sequenced from the transcript sequence (they will map on the same strand as the transcript), and mates 2 are sequenced from the reverse complement of the transcript sequence (they will map on the strand that is the opposite of the transcript strand).
  – MATE2_SENSE.
Mates 1 are sequenced from the reverse complement of the transcript sequence (they will map on the strand that is the opposite of the transcript strand), and mates 2 are sequenced from the transcript sequence (they will map on the same strand as the transcript).

ChimPipe is able to infer the protocol used to produce your data. To do so, it takes a subset of 1 million mapped reads and compares their mapping strand with the strand of the annotated gene where they map. Optionally, you can supply this information and skip this step with the option -l|--seq-library <library>.

1.4.3 Read identifier format

The FASTQ format uses four lines per sequencing read. You need to check the format of the first line of each read, which begins with the '@' character and is followed by a read identifier. This identifier should meet one of the two Illumina standards to specify which member of the pair the read is:

• CASAVA lower than v1.8. The identifier has to be a single string ending in /1 or /2 for mate 1 and mate 2, respectively. E.g.:

    # Mate 1
    @SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868#0/1
    GTAACATATTCACAGACATGTCAGTGTGCACTCAGGAACACTGGTTTCATT
    +SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868#0/1
    IIIIIIIIIIIIIII>=CII=8=H032/-++D+'.@)2+4/+)1'4.#"*.
    # Mate 2
    @SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868#0/2
    CAGATGGCATTGCAGAACTAAAAAAGCAAGCTGAATCAGTGTTAGATCTCC
    +SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868#0/2
    IIIGIIIIIIIIIFI:>DID<AFH=>78@I41I055549.?+.42-'%**'

• CASAVA v1.8 or higher. The identifier consists of two strings separated by a space, with the second one specifying the mate in its first digit. E.
g:

    # Mate 1
    @seq.1_set1:140:D034KACXX:1:2105:18345:62400 1:N:0:
    CCCAGCCTGTTTCCCTGCCTGGGAAACTAGAAAGAGGGCTTCTTTCTCCT
    +
    IJJJIIIIIGIGIEBHHHGFFFF=CDEEEEEDDDDCCCDDA?55><CBBB
    # Mate 2
    @seq.1_set1:140:D034KACXX:1:2105:18345:62400 2:N:0:
    GCACCCTTCACTCCCTCCCTTGGGCGCCTCCCTCCCGAGGGTAGGGACCC
    +
    FFHIJJCHIIHBHIIIAHFFFFFCDEDEECDBB;??@CD?CCCCCCC@CC

In case your FASTQ files do not meet this format, you should modify the identifier. Awk is a perfect tool for this kind of problem. E.g. I downloaded a dataset from the NCBI Sequence Read Archive:

    # Mate 1 read without a proper identifier. It has three strings as identifier and does not end with "
    @SRR018259.1 BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868 length=51
    GTAACATATTCACAGACATGTCAGTGTGCACTCAGGAACACTGGTTTCATT
    +SRR018259.1 BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868 length=51
    IIIIIIIIIIIIIII>=CII=8=H032/-++D+'.@)2+4/+)1'4.#"*.
    # No worries, I can use awk to fix it.
    $ awk '{if (NR%2==1){print $1"_"$2"_"$3"#0/1"} else {print $0}}' dataset_1.fastq
    @SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868_length=51#0/1
    GTAACATATTCACAGACATGTCAGTGTGCACTCAGGAACACTGGTTTCATT
    +SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868_length=51#0/1
    IIIIIIIIIIIIIII>=CII=8=H032/-++D+'.@)2+4/+)1'4.#"*.
    # Finally, I apply the same procedure to mate 2.

1.4.4 How can I export the path to the dependencies?

To export the path to the bedtools, samtools and blast (if needed) binaries, you just need to type:

    $ export PATH=<BEDTOOLS_BINARIES_PATH>:<SAMTOOLS_BINARIES_PATH>:<BLAST_BINARIES_PATH>:$PATH
    # E.g. export bedtools and samtools on my system
    $ export PATH=/users/rg/brodriguez/bin/bedtools2-2.20.1/bin:/users/rg/brodriguez/bin/samtools-0.1.19:

1.4.5 How does the script to compute gene pair similarity work?

This script produces a matrix containing gene pair similarity information in 4 steps:

1. Extract the cDNA sequence of each transcript in the annotation.
2.
Make a BLAST database out of the transcript sequences.
3. Run BLAST on all transcripts against all transcripts to detect local similarity between transcripts.
4. Produce an 8-field matrix where each row corresponds to a gene pair and contains information about the alignment between the pair of transcripts of these two genes with the maximum alignment similarity and length. Here is a brief description of the 8 fields:
  (a) Gene id A
  (b) Gene id B
  (c) Transcript alignment similarity
  (d) Transcript alignment length
  (e) Transcript name A
  (f) Transcript name B
  (g) Transcript A exonic length
  (h) Transcript B exonic length

Note that the script expects the BLAST binaries to be in your PATH.

Example

    ENSG00000000003.10 ENSG00000003402.15 91.43 70 ENST00000373020.4 ENST00000309955.3 2206 14672

1.5 Downloads

1.5.1 GEM genome indices

Homo sapiens
• GRCh37/hg19 (February 2009)

Mus musculus
• NCBI37/mm9 (July 2007)

Drosophila melanogaster
• FlyBase genome v5.46 (July 2012)

1.5.2 GEM transcriptome indices

H. sapiens
• UCSC known genes (July 2007, hg19)
• Gencode v19 (July 2013, hg19) long transcripts.
• Gencode v10 (July 2011, hg19) long transcripts.

Note: The long transcript group excludes ribosomal, micro, transfer, small cytoplasmic, small nucleolar and small nuclear RNAs. They were removed since they are not relevant for chimera detection and could be a source of artefactual chimeras due to sequence similarity between them, especially in ribosomal RNAs.

M. musculus
• Ensembl v65 (December 2011, mm9) long transcripts.

D. melanogaster
• FlyBase annotation v5.46 (July 2012, v5.46)

1.5.3 Similarity between gene pair matrices

H. sapiens
• UCSC known genes similarity matrix February 2009 (GRCh37/hg19).
• Gencode v19 similarity matrix for long transcripts (GRCh37/hg19).
• Gencode v10 similarity matrix for long transcripts (GRCh37/hg19).

M.
musculus
• Ensembl v65 similarity matrix for long transcripts (NCBI37/mm9).

D. melanogaster
• FlyBase v5.46 similarity matrix

1.6 Contact

Please feel free to join the ChimPipe mailing list if you have a question, want to report an issue or want to request a feature. You can also contact us directly at: chimpipe.pipeline@gmail.com