Johan Dahlberg: Primary data analysis provided by NGI
Primary data analysis provided by NGI
Johan Dahlberg, NGI
johan.dahlberg@medsci.uu.se

Why?
[Figure: Sequencing output]

Centralize
● Processing this much data is difficult … and boring
● Better usage of existing resources
● Don’t do what others can do for you
● Make your needs known
● Know the limits of automation!

How does it work?

The NGI pipeline
[Diagram: NGI_Pipeline. Components: Trigger, Engine (Piper), Helper, Datastore (Charon). Arrow labels: "Manual commands", "Fetches info from", "Automatically triggers".]

Piper
Best practice analysis workflow (WGS)
● Data:
o Raw data (fastq)
o Genotyping data
o Processed aligned reads (bam)
o Variant calls (vcf)
o Quality control data
● Processing steps:
o Map to genome (bwa + samtools)
o Alignment quality control (Qualimap)
o Indel realignment (GATK)
o Duplicate marking (Picard)
o Base quality recalibration (GATK)
o Variant calling (GATK)
o Variant quality recalibration and evaluation (GATK)
o Variant annotation (SnpEff)
o Verify sample identity (GATK)

References and versions
● bwa: 0.7.5a
● samtools: 0.1.19
● qualimap: v2.0
● snpEff: 4.0
● gatk: 3.3-0-geee94ec
● reference: human_g1k_v37
● resources: GATK bundle 2.8

What will you get?
● For each sample:
o Raw sequencing data in fastq format
o Genotyping data in vcf and idat format (optional)
o Processed alignments in bam format
o Variant calls in gvcf and vcf format
o Quality control data (alignment statistics, GC content, etc.)
● Per project:
o Project quality control summary statistics

Where will you get it?
● Delivered to an Uppmax resource (at the moment Milou)

The team
● NGI Stockholm: Francesco Vezzi, Per Kraulis, Mario Giovacchini, Guillermo Carrasco, Denis Moreno, Pär Lundin, Pelin Akan, Phil Ewels
● NGI Uppsala: Jessica Nordlund, Patrik Smeds, Per Lundmark, Pontus Larsson, Johan Dahlberg

It’s all out there, let us know what you think!
github.com/NationalGenomicsInfrastructure
johan.dahlberg@medsci.uu.se

Questions?
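
Appendix (illustrative sketch, not part of the original slides)
The Piper workflow steps and the tool versions listed in the slides can be written down as plain data, which makes it easy to check which tools have a version listed. The following is a minimal Python sketch under stated assumptions: the names Step, WORKFLOW, VERSIONS, describe and unpinned_tools are hypothetical and are not taken from Piper or NGI_Pipeline, and the step order is reconstructed from a flattened diagram, so it may differ from the order Piper actually uses. It does not run any of the tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One workflow step and the tools the slides attach to it (hypothetical type)."""
    name: str
    tools: tuple


# Versions exactly as listed on the "References and versions" slide.
VERSIONS = {
    "bwa": "0.7.5a",
    "samtools": "0.1.19",
    "qualimap": "v2.0",
    "snpEff": "4.0",
    "gatk": "3.3-0-geee94ec",
}

# ASSUMPTION: this ordering is reconstructed from the flattened workflow
# diagram; the slides do not state it explicitly.
WORKFLOW = [
    Step("Map to genome", ("bwa", "samtools")),
    Step("Alignment quality control", ("qualimap",)),
    Step("Indel realignment", ("gatk",)),
    Step("Duplicate marking", ("picard",)),
    Step("Base quality recalibration", ("gatk",)),
    Step("Variant calling", ("gatk",)),
    Step("Variant quality recalibration and evaluation", ("gatk",)),
    Step("Variant annotation", ("snpEff",)),
    Step("Verify sample identity", ("gatk",)),
]


def unpinned_tools(workflow, versions):
    """Return tools used in the workflow that have no version listed."""
    known = {name.lower() for name in versions}
    missing = []
    for step in workflow:
        for tool in step.tools:
            if tool.lower() not in known and tool not in missing:
                missing.append(tool)
    return missing


def describe(workflow, versions):
    """Return one human-readable line per step, with tool versions."""
    lookup = {name.lower(): version for name, version in versions.items()}
    lines = []
    for number, step in enumerate(workflow, start=1):
        tools = ", ".join(
            f"{tool} {lookup.get(tool.lower(), '(version not listed)')}"
            for tool in step.tools
        )
        lines.append(f"{number}. {step.name}: {tools}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe(WORKFLOW, VERSIONS)))
    print("No version listed for:", unpinned_tools(WORKFLOW, VERSIONS))
```

Keeping the steps and versions as data rather than prose makes gaps visible: in this sketch, Picard is the only tool named in the workflow without a version on the slide.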