Taming the Big Data Dragon
Taming the Big Data Dragon
John Quackenbush
Winter School
9 July 2014

Every revolution in science, from the Copernican heliocentric model to the rise of statistical and quantum mechanics, from Darwin’s theory of evolution and natural selection to the theory of the gene, has been driven by one and only one thing: access to data.
–John Quackenbush

Disease Progression and Personalized Care
[Figure labels: Birth, Treatment, Natural History of Disease, Clinical Care, Environment + Lifestyle, Outcomes, Treatment Options, Disease Staging, Patient Stratification, Early Detection, Genetic Risk, Biomarkers, Quality Of Life, Death]

Turning the vision into a reality
• Assure access to samples and rational consent
• Develop a technology platform
• Make information integration a central mission
• Present data and information to the research community
• Enable research beyond your own
• Engage corporate partners
• Communicate the mission to the community
• Conduct research as a vital component

Costs of Generating Data Have Plummeted
What about the cost of analysis?
Genome Med. 2010 Nov 26;2(11):84. doi: 10.1186/gm205.

The Precision Medicine Ecoverse
[Figure: Cost of Analysis ($10^2 to $10^5) against Number of Genes (10^0 to 10^4), with regions labeled Clinical Medicine; Springfield Diagnostic Labs]

Precision Medicine Demands Simplicity
Springfield Diagnostic Labs

The Challenges of Big Data

NRC on Big Data
The National Research Council’s Committee on Massive Data Analysis concluded in its 2013 “Frontiers of Massive Data Analysis” report that the challenges associated with massive data go far beyond the technical aspects of data management (although those are not to be ignored). The NRC consensus report noted that the key element in meeting Big Data’s challenges was the development of rigorous quantitative and statistical methods.
http://www.nap.edu/catalog.php?record_id=18374

NRC on Big Data
The challenges for massive data go beyond the storage, indexing, and querying that have been the province of classical database systems (and classical search engines) and, instead, hinge on the ambitious goal of inference. Statistical rigor is necessary to justify the inferential leap from data to knowledge, and many difficulties arise in attempting to bring statistical principles to bear on massive data. Overlooking this foundation may yield results that are, at best, not useful, or harmful at worst.
http://www.nap.edu/catalog.php?record_id=18374

NRC on Big Data
In any discussion of massive data and inference, it is essential to be aware that it is quite possible to turn data into something resembling knowledge when actually it is not. Moreover, it can be quite difficult to know that this has happened.
http://www.nap.edu/catalog.php?record_id=18374

Transforming Medicine?
• New technologies, from surveillance and exposure to genomics and imaging, from electronic health data to survey-based longitudinal studies, are providing unprecedented data that have opened new avenues of investigation, transforming biomedical research into an information science.
• This is the era of Big Data in biomedical research, with increases in the Three V’s: Volume, Velocity, and Variety.
• The challenge is to bring this information together with other information to better understand fundamental problems, including a wide range of problems in health and biomedical research.
Key Challenges in Big Data
• Preprocessing (Normalization) and Hot Spot Detection: need methods to compare measurements across sources and to rapidly identify salient features
• Data Integration: need methods that can combine data from various sources where there are hidden correlations in the data
• Reproducible Research: need to leverage the volume and velocity of the data to provide opportunities for validation of findings
• Network Methods: need to move beyond correlations in studying relationships in data

Normalization and Batch Effects
Batch Effects must be normalized
http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html

Integrative Analysis

Variety: A blessing and a curse
• New technologies allow us to generate multiple, independent data types on individual samples, such as genome sequencing, RNA-seq, ChIP-seq, and proteomics data
• Making sense of any ’omics data requires that we have extensive phenotypic (clinical) data
• Electronic Medical Records are not designed to capture the appropriate data
• There are multiple, hidden dependencies in ’omic data, and which of them we want to use or ignore often depends on our application.

Reproducible Research

Have Robust Phenotypes
• Two large studies, CCLE and CGP
• 471 cell lines in common
• 15 drugs in common
Are CCLE and CGP consistent?
• Gene expression data are highly concordant
• Phenotypes are not

What is an ’omic biomarker?
• It is a “feature set” (such as a “gene set”) that has been identified through a careful statistical analysis contrasting distinct phenotypic groups in a dataset
  This can be confounded by a failure to recognize mixed phenotypes in the analysis
• It is an “algorithm” that can use the values of the elements in the feature set to assign new patients to each of the phenotypic groups
  The same gene set with different algorithms can produce different classifications

Reproducible and robust biomarkers
• A biomarker (gene set and algorithm) should be reproducible in the sense that anyone can use the same data and algorithm and reproduce the same classification of samples.
• A biomarker (gene set and algorithm) should be robust in the sense that it can be reproduced using independent data sets, producing a similar prediction of clinically relevant outcome.

The ’omic problem with biomarkers
• Breast cancer is the single best ’omic-ly characterized disease with established molecular subtypes, although there is no clear consensus on how to classify new patients
• There have been hundreds of predictive biomarkers published in breast cancer, few of which have made it into clinical practice, all of which are of questionable value
• Recent work by Venet underscores the problem: Venet D, Dumont JE, Detours V (2011) Most random gene expression signatures are significantly associated with breast cancer outcome.
PLoS Comput Biol 7: e1002240

The ’omic problem with biomarkers
• We need to empirically benchmark potential biomarker gene sets against random gene signatures to establish their robustness
• We need to validate gene signatures and the associated algorithms in independent data sets
Andrew Beck, Benjamin Haibe-Kains

Subtype Classification Models
Benjamin Haibe-Kains, Christos Sotiriou

Subtype Classification Models (SCMs)
• Mixture of Gaussians in ER/HER2 gene expression space to identify the main subtypes
• Computation of the maximum posterior probability of a tumor belonging to a subtype
Benjamin Haibe-Kains, Christos Sotiriou

Subtype Classification Models (SCMs)
Different classification algorithms provide different classifications for the same samples, and these depend on the training set!
Benjamin Haibe-Kains, Christos Sotiriou

Subtype Classification Models (SCMs)
Classification accuracy for classifiers trained with different subsets of a consolidated dataset of 5175 breast cancer patients
Benjamin Haibe-Kains, Christos Sotiriou

Significance Analysis of Prognostic Signatures
Andrew Beck, Benjamin Haibe-Kains

Requirements for “reproducible and robust biomarkers”
• Access to the primary data used to derive the signatures we use as biomarkers.
• Access to the sample annotations that are associated with those primary data.
• Access to the software code used to build predictive models.
• Rigorous validation of biomarker signatures in independent datasets, including assessment compared to random gene sets.
Andrew Beck, Benjamin Haibe-Kains

Requirements for “reproducible and robust biomarkers”
• We need a training dataset and a (blinded) test dataset for understanding the power of our feature selection approach
• We should benchmark our feature selection against random feature sets
• We need (multiple) independent training and (blinded) test datasets for our algorithms to test the robustness of the methods.
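
The benchmark against random feature sets called for above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used by Beck and Haibe-Kains: the function names (`pearson`, `signature_score`, `random_signature_pvalue`) are hypothetical, the signature score is taken to be the per-sample mean expression of the signature genes, and the clinical outcome is simplified to a single continuous value per sample scored by absolute Pearson correlation. A real analysis would more likely use a survival model.

```python
import random
from statistics import fmean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def signature_score(expr, genes):
    """Per-sample mean expression over the genes in a signature.

    expr maps gene name -> list of expression values, one per sample.
    """
    n_samples = len(next(iter(expr.values())))
    return [fmean(expr[g][j] for g in genes) for j in range(n_samples)]


def random_signature_pvalue(expr, outcome, signature, n_random=1000, seed=0):
    """Compare a signature's association with outcome to random gene sets.

    Returns (observed association, empirical p-value). The p-value is the
    fraction of same-size random signatures whose association is at least
    as strong as the observed one, with the usual +1 correction.
    """
    rng = random.Random(seed)
    all_genes = sorted(expr)
    observed = abs(pearson(signature_score(expr, signature), outcome))
    hits = 0
    for _ in range(n_random):
        rand_sig = rng.sample(all_genes, len(signature))
        if abs(pearson(signature_score(expr, rand_sig), outcome)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_random + 1)
```

A signature is only interesting if its empirical p-value is small, that is, if few random gene sets of the same size do as well. The same function can be run unchanged on an independent test dataset to check robustness.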
Andrew Beck, Benjamin Haibe-Kains

Additional Thought
Context is everything – we cannot think about developing meaningful biomarkers without the associated metadata.

Network Methods

What can we learn from networks?
[Figure panels: Normal Tissue Network, Chemosensitive Tumor, Chemoresistant Tumor]

Regulation of Transcription
[Figure labels: regulatory sequences, promoter, specific transcription factors]

A Simple Idea: Message Passing
• Transcription Factor: the TF is Responsible for communicating with its Target
• Downstream Target: the Target must be Available to respond to the TF
Kimberly Glass, GC Yuan

Message-Passing Networks: PANDA (Passing Attributes between Networks for Data Assimilation)
Use Message Passing to find a consensus among the networks
[Figure labels: Genomic Data, Initial Network Information, Protein-protein interactions, Network Representation, Cooperation between TFs, Potential Regulatory Events, Gene Expression, Potential CoRegulatory Events, genes, Protein-DNA interactions, Learned Network Information, Message Passing]
Glass et al. “Passing Messages Between Biological Networks to Refine Predicted Interactions.” PLoS One. 2013 May 31;8(5):e64832.
Code and related material available on sourceforge: http://sourceforge.net/projects/panda-net/

Message-Passing Networks: PANDA
[Figure labels: Motif Data, Network0, PPI0, Expression0, Responsibility, Availability, PPI1, Network1, Expression1]
Kimberly Glass, GC Yuan

PANDA: Integrative Network Models
[Figure labels: Expression data (Angiogenic), Genes, Conditions, Network for Angiogenic Subtype; Expression data (Non-angiogenic), Genes, Conditions, Network for Non-angiogenic Subtype; Compare/Identify Differences]
Kimberly Glass, GC Yuan

Network Differences are captured in Edges
• 15735 unique edges, including 49 TFs targeting 4419 genes
• 12631 unique edges, including 56 TFs targeting 4081 genes
Kimberly Glass, GC Yuan

[Figure legend]
• Inner ring: key TFs, colored by Edge Enrichment (A or N)
• Outer ring: genes, colored by Differential Expression (A or N)
• Interring Connections: colored by Subnetwork (A or N)
• Ticks: genes annotated to “angiogenesis” in GO

Ten “Key” Transcription Factors
[Indicator columns whose values were not captured in the transcript: TF differential Expression, Target differential Expression, TF differential Methylation, Target differential Methylation]

TF        Potential Connection with Angiogenesis                              Publication(s) PMID
NFKB1     important chromatin remodeler in angiogenesis                       20203265
ARID3A    required for hematopoietic development                              21199920
SOX5      involved in prostate cancer progression, responsive to estrogen     19173284, 16636675
TFAP2A    increases MMP2 expression and angiogenesis in melanoma              11423987
NKX2-5    regulates heart development                                         10021345
PRRX2     deletion causes vascular anomalies                                  10664157
AHR       knock-out impairs angiogenesis                                      19617630
SPIB      inhibits plasma cell differentiation                                18552212
MZF1      represses MMP-2 in cervical cancer                                  22846578
BRCA1     inhibits VEGF and represses IGF1 in breast cancer                   12400015, 22739988

Regulatory Patterns suggest Therapies
Kimberly Glass, GC Yuan

Other disease datasets provide validation
Sorafenib (a bi-aryl urea) is a small-molecule inhibitor of several tyrosine protein kinases (VEGFR and PDGFR) and Raf kinases (more avidly C-Raf than B-Raf).

Message-Passing Networks: PANDA 2.0
[Figure labels: miRNA targets, Genetics, PPI0, Motif Data, Methylation, Network0, Expression0, Metabolomics, Responsibility, PPI1, Availability, Network1, Expression1]

eQTL Analysis
• Use genome-wide SNP data and gene expression data together
• Treat gene expression as a quantitative trait
• Ask, “Which SNPs are correlated with the degree of gene expression?”
• Most people concentrate on cis-acting SNPs
• What about trans-acting SNPs?
John Platig

eQTL Networks: A simple idea
• eQTLs should group together with core SNPs regulating particular cellular functions
• Perform a “standard eQTL” analysis: Y = β0 + β1 ADD + ε, where Y is the quantitative trait and ADD is the allele dosage of a genotype.
John Platig, Fah Sathirapongsasuti

Which SNPs affect function?
• Many strong eQTLs are found near the target gene.
• But what about multiple SNPs that are correlated with multiple genes?
• Can a network of SNP-gene associations inform the functional roles of these SNPs?
[Figure labels: SNPs, Genes]
John Platig

eQTL Networks: A simple idea
• Create a bipartite graph where SNPs and genes are nodes and significant eQTL associations are edges.
• Use “leading eigenvector” clustering to find “communities” in the graph
John Platig, Fah Sathirapongsasuti

A bipartite network has 2 types of node
• Links only connect different node types
• Node types: SNPs, Genes
[Figure labels: Correlation, SNPs, Genes]
John Platig

Background
• A quantity x obeys a power law if it is drawn from a probability distribution p(x) ∝ x^(−α)
• Scale-free networks emerge through: (1) expansion through addition of new vertices; (2) new vertices attach preferentially to sites that are already well-connected
• Hubs dominate the topology of scale-free networks
• eQTL hotspots are genomic regions that play an important role in regulating gene expression

Results: COPD
Can we use this network to identify groups of SNPs and genes that play functional roles in the cell?
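
The two eQTL steps described above (fit Y = β0 + β1 ADD + ε for each SNP-gene pair, then keep significant associations as edges of a bipartite graph) can be sketched as follows. This is a rough illustration, not the pipeline used in the talk. The names `fit_eqtl` and `eqtl_bipartite_edges` are hypothetical, the p-value uses a normal approximation to the t distribution (reasonable only for large sample sizes), and the Bonferroni threshold is an assumption, since the slides do not say how significance was defined.

```python
from statistics import NormalDist, fmean


def fit_eqtl(dosage, expression):
    """Ordinary least squares fit of Y = b0 + b1 * ADD + e for one SNP-gene pair.

    dosage: allele dosages (0, 1 or 2), one per sample.
    expression: expression values of one gene, one per sample.
    Returns (b0, b1, p), where p is a two-sided p-value for b1 from a normal
    approximation, or None if the SNP is monomorphic or there are fewer than
    three samples.
    """
    n = len(dosage)
    x_bar, y_bar = fmean(dosage), fmean(expression)
    sxx = sum((x - x_bar) ** 2 for x in dosage)
    if n < 3 or sxx == 0:
        return None
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosage, expression))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(dosage, expression))
    se = (rss / (n - 2) / sxx) ** 0.5  # standard error of the slope
    if se == 0:
        return b0, b1, (0.0 if b1 != 0 else 1.0)
    z = abs(b1) / se
    p = 2 * (1 - NormalDist().cdf(z))
    return b0, b1, p


def eqtl_bipartite_edges(snp_dosages, gene_expression, alpha=0.05):
    """Build the edge list of a SNP-gene bipartite graph.

    snp_dosages maps SNP id -> dosages; gene_expression maps gene id ->
    expression values. A (snp, gene) edge is kept when the slope p-value
    passes a Bonferroni-corrected threshold (an assumed choice).
    """
    threshold = alpha / (len(snp_dosages) * len(gene_expression))
    edges = []
    for snp, dosage in snp_dosages.items():
        for gene, expr in gene_expression.items():
            fit = fit_eqtl(dosage, expr)
            if fit is not None and fit[2] < threshold:
                edges.append((snp, gene))
    return edges
```

Every edge joins a SNP to a gene and never two nodes of the same type, so the resulting edge list is bipartite by construction and can be handed to a community-detection routine.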
Try clustering the nodes into ‘communities’ based on the network structure
John Platig

eQTL Networks: A simple idea

Communities are groups of highly intraconnected nodes
• Community structure algorithms group nodes such that the number of links within a community is higher than expected by chance
• Formally, they assign nodes to communities such that the modularity, Q, is optimized
[Equation for Q, with annotated terms: fraction of network links in community i; fraction of links expected by chance]
John Platig
Newman 2006 (PNAS)

Communities are groups of highly intraconnected nodes
• Community structure algorithms group nodes such that the number of links within a community is higher than expected by chance.
• Bipartite networks require a different null model
• Implement “BRIM” algorithm to find communities
John Platig
Newman 2006 (PNAS)

BRIM produces GO enriched Communities
ATP6V1G2, ATRNL1, HLA-DQA2, HLA-DQB1, HLA-DQB2, HLA-DRA, HLA-DRB1, HLA-DRB4, HLA-DRB5, MAGEA2B, MICB, NCR3, PLEKHG6, PSORS1C1, TAP2
John Platig

Calculate Local Connectivity
[Equations, with annotated terms: modularity of node i; modularity of community c]
John Platig

Community Structure Matters
Are “disease” SNPs skewed towards the top of my SNP list as ranked by the overall out degree?
No! The highest-degree SNPs are devoid of disease-related SNPs. Highly deleterious SNPs that affect many processes are probably removed by evolutionary sweeps.
John Platig

Community Structure Matters
Are “disease” SNPs skewed towards the top of my SNP list as ranked by the community core score (Q_ic)?
Yes! KS test yields p < 10^-16, Wilcoxon rank-sum yields p < 10^-9
John Platig

The future is here. It's just not widely distributed yet.
- William Gibson

Before I came here I was confused about this subject. After listening to your lecture, I am still confused but at a higher level.
- Enrico Fermi (1901-1954)

Acknowledgments

Array Software Hit Team
Eleanor Howe, John Quackenbush, Dan Schlauch

Gene Expression Team
Fieda Abderazzaq, Stefan Bentink, Aedin Culhane, Benjamin Haibe-Kains, Jessica Mar, Melissa Merritt, Megha Padi, Renee Rubio

<johnq@jimmy.harvard.edu>

Center for Cancer Computational Biology
Dustin Holloway, Lan Hui, Lev Kuznetsov, Yaoyu Wang, John Quackenbush
http://cccb.dfci.harvard.edu

Students and Postdocs
Martin Aryee, Kimberly Glass, Marieke Kuijjer, Kaveh Maghsoudi, Jess Mar, Megha Padi, John Platig, Alejandro Qiuiroz, J. Fah Sathirapongsasuti

Systems Support
Stas Alekseev, Sys Admin

Administrative Support
Julianna Coraccio

University of Queensland
Christine Wells, Lizzy Mason

http://compbio.dfci.harvard.edu