Big data in cancer research: DNA sequencing and personalised medicine
Philippe Hupé
BIGDATA conference, 04/04/2013

Deciphering the cancer genome with high-throughput technologies
[Figure: normal karyotype versus cancer karyotype]
Cancer is a gene disease
Sequence the cancer genome (i.e. read its DNA sequence) to:
→ Understand the molecular mechanisms of tumoral progression
→ Tailor the therapy to each patient individually
Use high-throughput sequencing methods (next-generation sequencing, NGS)

30 years ago... the era of DNA sequencing
Walter Gilbert, Harvard, Nobel Laureate, 1980
Developed a DNA sequencing method (with Allan Maxam) in 1977, the same year as Frederick Sanger's eponymous method
“I expect that within a few years, our technology will be able to sequence one megabase/technician-year. At that rate 100 technicians could sequence the genome in 30 years. An effort to improve the technology over a 10-year period should raise the rate by a factor of 10.”
The Scientist, October 20, 1986

Evolution of sequencing technologies and decreasing cost
Year   Genome   Cost ($)        Duration     Technology   Nb. of scientists
2003   HGP      2,700,000,000   13 years     Sanger       2,800
2007   Venter   100,000,000     4 years      Sanger       31
2008   Watson   1,500,000       4.5 months   Roche 454    27
2009            50,000          4 weeks      Helicos      3
Sources: Pushkarev et al. (2009), Wadman et al. (2008)
[Figure: sequencing platforms: Roche 454, Illumina, SOLiD, Helicos]
In 2013, it takes around $5,000 to sequence a human genome in one week with one technician (about 1,500 times faster than the 30 years of Gilbert's prediction)
→ Towards the $1,000 genome

Data tsunami in cancer research
Low-cost sequencing + availability to every lab = data tsunami
Cost is divided by 2 every:
● CPU - Moore's law: 18 months
● Storage - Kryder's law: 12-14 months
● Network - Butters' law: 9 months
● NGS' law: 5 months
→ computing challenges

Next-generation sequencing... some figures...
● Sequencing with Illumina HiSeq 2500:
– 6 billion sequences
– 1 sequence = 100 bases (A, T, C, G)
– 1 experiment = 600 billion bases = 200,000 “Les Misérables”
– 1 Tb of data (per week)
● Human genome = 3 billion bases = 1,000 “Les Misérables”
● Reference human genome (known sequence) = dictionary
● Cancer genome = wrong copy of the dictionary
● In cancer, genes (= words) contain mutations (= mistakes): gene1 = GIRAFFE → gene1 = GILAFFE
● Cancer creates new words = fusion genes: gene1 = GIRAFFE, gene2 = ZEBRA → new gene = GIBRA
→ The 6 billion sequences are compared to the reference genome to find the mutations and fusion genes, taking into account the fact that the sequencer itself makes errors when reading the sequence

Extraction of the biological signal from the raw data
● Development of algorithms and statistical methods
● Interdisciplinary work with bioinformaticians, computer scientists, biologists, mathematicians, statisticians and algorithmists
● HPC infrastructure
[Figure: pieces of the cancer genome (short reads such as CGAGCTG, ACGAGCT, TCCTAGC) aligned against the reference genome sequence (= dictionary); mismatching positions are highlighted as mutations]

Visualisation of the significant fusions
[Figure: intra-chromosome fusions]
Source: MCF-7 breast cancer cell line, Hampton et al., Genome Research 2009

Application to personalised medicine: the SHIVA clinical trial
molecularly targeted therapy >?
conventional therapy
[Figure: molecular profile → molecular abnormality → targeted agent, versus chemotherapy]
→ Compare the efficacy of molecularly targeted therapy based on tumor molecular profiling versus conventional therapy in patients with refractory cancer

SHIVA clinical trial: the workflow
[Figure: workflow diagram, with a duration of 4 weeks indicated. Boxes shown: patient's inclusion; biopsy clinic; shipment to CRB (biological resource centre); shipment to pathology; IHC for ER/PR/AR (RO/RP/RA); DNA extraction; shipment of DNA to Affymetrix platform; Affymetrix Cytoscan HD; shipment of DNA to sequencing platform; Ion Torrent sequencing; bioinformatics analysis: detection of amplified/deleted genes; list of amplified/deleted genes; validation of amplified/deleted genes by IHC; bioinformatics analysis: detection of mutated genes; bioinformatics data integration; elaboration of a report that is sent to the Molecular Biology Board; therapeutic decision]

Therapeutic decision
The therapeutic decision is based on a report with the list of molecular abnormalities
Simple decision rules:
● If STK11 is mutated, targeted therapy = everolimus
● Other simple rules are used for other targeted therapies
→ Cancer biology is much more complex and these “naive” rules need to be improved

Cancer is a complex disease
[Figure: multiple biological layers; interactions between chemical species]
The multidimensional nature of cancer (genome, proteome, epigenome, kinome, etc.) has to be considered to unravel the complexity of the disease. Mathematical models and computational systems biology are definitely needed to improve current decision rules and understand the emergent properties of cancer cells.
→ In order to perform such integrative analyses with sophisticated mathematical models, this multidimensional information must first be integrated within an efficient information system.
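
The simple decision rules from the "Therapeutic decision" slide can be pictured as a lookup from mutated gene to targeted therapy. The following is a minimal sketch, not the trial's actual software: the names `MUTATION_RULES` and `propose_targeted_therapies` are hypothetical, and only the STK11 rule stated in the slides is encoded, since the other rules are not given.

```python
from typing import Dict, List

# Rule table: mutated gene -> targeted therapy.
# Only the STK11 rule appears in the slides; the other rules are not
# stated there, so they are left out rather than guessed.
MUTATION_RULES: Dict[str, str] = {
    "STK11": "everolimus",
}


def propose_targeted_therapies(mutated_genes: List[str]) -> List[str]:
    """Return the targeted therapies triggered by a report's mutated genes."""
    therapies: List[str] = []
    for gene in mutated_genes:
        therapy = MUTATION_RULES.get(gene.upper())
        # Skip genes without a rule and avoid listing a therapy twice.
        if therapy is not None and therapy not in therapies:
            therapies.append(therapy)
    return therapies


if __name__ == "__main__":
    print(propose_targeted_therapies(["TP53", "STK11"]))  # ['everolimus']
    print(propose_targeted_therapies(["TP53"]))           # []
```

A flat table like this makes the "naive" character of the rules visible: each gene is considered in isolation, with no interaction between abnormalities, which is exactly the limitation the slides point out.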
Data integration is a major challenge in cancer research
[Figure: data sources]
● Private data: medical images, copy number data, clinical data, NGS data, MS data, gene expression data, phenotyping data, biobank data, RPPA data
● Public data: Reactome, TCGA, CCLE, ICGC
A large Volume of patients' data is scattered across a large Variety of databases, which grow in size at a huge Velocity. In order to extract most of the hidden Value from these data, we must face challenges at:
→ the technical level: develop a powerful computing architecture
→ the organisational and management levels: define the procedures to collect data with the highest confidence and quality
→ the scientific level: create sophisticated mathematical models to predict the disease evolution and the patient's risk
→ At Institut Curie we are currently building an information system to fully integrate all the molecular, biological and clinical data

Can we dream of an online prediction system to help therapeutic prediction?
[Figure: private data (NGS data, RPPA data, gene expression data, each with its LIMS and wrapper; clinical data) and public data (Reactome, ..., each with its wrapper) feed a centralised bioinformatics database / virtual database, which supports the therapeutic decision]
● Every day, for several patients, information is collected:
– pathological complete response (pCR)
– survival
– response to therapy
– molecular profiles
– etc.
● Integrative analysis aims at building signatures to predict disease evolution (e.g. risk of metastasis)
● Re-evaluate prediction rules in real time, taking into account this new information
● Apply online machine learning techniques
[Figure: timeline: for each new patient, mathematical models are trained and pCR is predicted; the prediction is later confronted with the observed pCR]

Towards P4 medicine
● The term P4 medicine was coined by Leroy Hood (president of the Institute for Systems Biology)
● The practice of medicine is mainly reactive, i.e. the physician reacts to the disease state of the patient and little is done to prevent the occurrence of the disease.
● Predictive medicine was first introduced by Jean Dausset (Nobel prize in medicine, 1980).
P4 medicine:
– Predictive: consider the genetic background of the individual and his environment
– Preventive: adapting lifestyle, taking preventive drugs
– Personalised: tailor the treatment to the unique features of the individual (such as the patient's genetic background, the tumour's genetic and epigenetic landscape, life environment)
– Participatory: many healthcare options, which require in-depth exchanges between the individual and his physician
→ P4 medicine = manage the patient's health instead of managing a patient's disease

Big basket with a large variety of data
Data integration + mathematical models → leverage new information

Welcome to GATTACA
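
As a closing illustration of the online prediction idea raised earlier (predict pCR for a new patient, then re-evaluate the rule once the observed pCR is known), here is a minimal sketch. It assumes a logistic model updated by stochastic gradient descent, which is one possible online learning technique; the slides do not name a method. The class `OnlineLogisticModel`, the function `run_stream`, the two features and all numeric values are hypothetical toy data, not patient data.

```python
import math
from typing import List, Sequence, Tuple


class OnlineLogisticModel:
    """Logistic regression updated one patient at a time (assumed technique)."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features: Sequence[float]) -> float:
        """Predicted probability of pCR for one patient."""
        score = self.bias + sum(w * x for w, x in zip(self.weights, features))
        score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features: Sequence[float], observed_pcr: int) -> None:
        """One gradient step once the observed pCR (0 or 1) is available."""
        error = self.predict_proba(features) - observed_pcr
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


def run_stream(
    model: OnlineLogisticModel,
    stream: Sequence[Tuple[Sequence[float], int]],
) -> List[float]:
    """Predict first, then learn from the outcome, patient by patient."""
    predictions = []
    for features, observed_pcr in stream:
        predictions.append(model.predict_proba(features))  # before the update
        model.update(features, observed_pcr)
    return predictions


if __name__ == "__main__":
    # Invented toy stream: (two made-up features, observed pCR).
    toy_stream = [
        ([0.9, 0.1], 1),
        ([0.1, 0.8], 0),
        ([0.8, 0.3], 1),
        ([0.2, 0.9], 0),
    ] * 200
    model = OnlineLogisticModel(n_features=2)
    run_stream(model, toy_stream)
    print(model.predict_proba([0.9, 0.1]))
    print(model.predict_proba([0.1, 0.8]))
```

Recording each prediction before the corresponding update mirrors the timeline in the slides: the model is judged on patients it has not yet learned from, and the rule is revised as soon as the outcome arrives.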