Taming the Big Data Dragon

John Quackenbush
Winter School
9 July 2014
Every revolution in science — from
the Copernican heliocentric model to the rise of
statistical and quantum mechanics, from
Darwin’s theory of evolution and natural
selection to the theory of the gene — has
been driven by one and only one thing:
access to data.
–John Quackenbush
Disease Progression and
Personalized Care
Birth
Treatment
Natural History of Disease
Clinical Care
Environment + Lifestyle
Outcomes
Treatment Options
Disease Staging
Patient Stratification
Early Detection
Genetic Risk
Biomarkers
Quality of Life
Death
Turning the vision into a reality
Assure access to samples and rational consent
Develop a technology platform
Make information integration a central mission
Present data and information to the research community
Enable research beyond your own
Engage corporate partners
Communicate the mission to the community
Conduct research as a vital component.
Costs of Generating Data Have Plummeted
What about the cost of analysis?
Genome Med. 2010 Nov 26;2(11):84. doi: 10.1186/gm205.
The Precision Medicine Ecoverse
[Figure: Cost of Analysis (y-axis, $10^2 to $10^5) versus Number of Genes (x-axis, 10^0 to 10^4), with regions labeled “Clinical Medicine”]
Springfield Diagnostic Labs
Precision Medicine Demands Simplicity
Springfield Diagnostic Labs
The Challenges of Big Data
NRC on Big Data
The National Research Council’s Committee on
Massive Data Analysis concluded in its 2013
“Frontiers of Massive Data Analysis” report that
the challenges associated with massive data go
far beyond the technical aspects of data
management (although those are not to be
ignored).
The NRC consensus report noted that the key element
in meeting Big Data’s challenges was the
development of rigorous quantitative and
statistical methods.
http://www.nap.edu/catalog.php?record_id=18374
NRC on Big Data
The challenges for massive data go beyond the
storage, indexing, and querying that have been
the province of classical database systems (and
classical search engines) and, instead, hinge on
the ambitious goal of inference.
Statistical rigor is necessary to justify the
inferential leap from data to knowledge, and many
difficulties arise in attempting to bring statistical
principles to bear on massive data.
Overlooking this foundation may yield results that
are, at best, not useful, or harmful at worst.
http://www.nap.edu/catalog.php?record_id=18374
NRC on Big Data
In any discussion of massive data and inference, it
is essential to be aware that it is quite possible to
turn data into something resembling knowledge
when actually it is not. Moreover, it can be quite
difficult to know that this has happened.
http://www.nap.edu/catalog.php?record_id=18374
Transforming Medicine?
New technologies, from surveillance and exposure to
genomics and imaging, from electronic health data to
survey-based longitudinal studies, are providing
unprecedented data that have opened new avenues of
investigation, transforming biomedical research into an
information science.
This is the era of Big Data in biomedical research with
increases in the Three V’s: Volume, Velocity, and Variety.
The challenge is to bring this information together with
other information to better understand fundamental
problems, including a wide range of problems in health
and biomedical research.
Key Challenges in Big Data
Preprocessing (Normalization) and Hot Spot Detection
Need methods to compare measurements across
sources and to rapidly identify salient features
Data Integration
Need methods that can combine data from various
sources where there are hidden correlations in the data
Reproducible Research
Need to leverage the volume and velocity of the data to
provide opportunities for validation of findings
Network Methods
Need to move beyond correlations in studying
relationships in data
Normalization
and Batch Effects
Batch Effects must be normalized
http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html
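One simple way to see what "normalizing" a batch effect can mean is per-batch mean-centering of a single feature. The sketch below is an illustration under that assumption only; it is not the method of the review cited above, and the function name and toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def center_by_batch(values, batches):
    """Subtract each batch's mean from its measurements (one feature)."""
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: mean(vs) for b, vs in by_batch.items()}
    return [v - batch_mean[b] for v, b in zip(values, batches)]

# Toy data: one gene measured in two batches, batch B shifted upward by 2.0
expression = [5.0, 6.0, 7.0, 7.0, 8.0, 9.0]
batch = ["A", "A", "A", "B", "B", "B"]
centered = center_by_batch(expression, batch)
```

Centering removes the additive shift between batches, but it also removes any real biological difference that happens to be confounded with batch, which is why study design matters as much as the correction.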
Integrative Analysis
Variety: A blessing and a curse
New technologies allow us to generate multiple,
independent data types on individual samples, such as
genome sequencing, RNA-seq, ChIP-seq, and
proteomics data
Making sense of any ‘omics data requires that we have
extensive phenotypic (clinical) data
Electronic Medical Records are not designed to capture
the appropriate data
There are multiple, hidden dependencies in ’omic data,
and which ones we want to use or ignore often depends on
the application.
Reproducible Research
Have Robust Phenotypes
• Two large studies, CCLE and CGP
• 471 cell lines in common
• 15 drugs in common
Are CCLE and CGP consistent?
Gene expression data are highly concordant
Phenotypes are not
What is an ‘omic biomarker?
It is a “feature set” (such as a “gene set”) that has
been identified through a careful statistical
analysis contrasting distinct phenotypic groups in
a dataset
This can be confounded by a failure to
recognize mixed phenotypes in the analysis
It is an “algorithm” that can use the values of the
elements in the feature set to assign new patients
to each of the phenotypic groups
The same gene set with different algorithms
can produce different classifications
Reproducible and robust biomarkers
A biomarker (gene set and algorithm) should be
reproducible in the sense that anyone can use
the same data and algorithm and reproduce the
same classification of samples.
A biomarker (gene set and algorithm) should be
robust in the sense that it can be reproduced
using independent data sets, producing a similar
prediction of clinically-relevant outcome.
The ‘omic problem with biomarkers
Breast cancer is the single best ‘omic-ly
characterized disease, with established molecular
subtypes, although there is no clear consensus on how
to classify new patients
There have been hundreds of predictive
biomarkers published in breast cancer, few of
which have made it into clinical practice, all of
which are of questionable value
Recent work by Venet underscores the problem:
Venet D, Dumont JE, Detours V (2011) Most
random gene expression signatures are
significantly associated with breast cancer
outcome. PLoS Comput Biol 7: e1002240
The ‘omic problem with biomarkers
We need to empirically benchmark potential
biomarker gene sets against random gene
signatures to establish their robustness
We need to validate gene signatures and the
associated algorithms in independent data sets
Andrew Beck, Benjamin Haibe-Kains
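Benchmarking against random signatures can be sketched as an empirical p-value: score the candidate gene set, score many random gene sets of the same size, and count how often the random sets do at least as well. The scoring rule (absolute correlation between mean signature expression and a continuous outcome), the function names, and the synthetic data are all assumptions made for illustration; a real analysis would use a survival model and real cohorts.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def signature_score(expr, genes, outcome):
    """|correlation| between the mean expression of `genes` and the outcome."""
    sig = [mean(expr[g][s] for g in genes) for s in range(len(outcome))]
    return abs(pearson(sig, outcome))

def random_signature_pvalue(expr, genes, outcome, n_random=200, seed=0):
    """Fraction of same-size random gene sets scoring at least as well."""
    rng = random.Random(seed)
    observed = signature_score(expr, genes, outcome)
    all_genes = sorted(expr)
    hits = sum(
        signature_score(expr, rng.sample(all_genes, len(genes)), outcome) >= observed
        for _ in range(n_random)
    )
    return (1 + hits) / (1 + n_random)

# Synthetic toy data: 50 genes, 30 samples, 5 genes tied to the outcome
rng = random.Random(1)
outcome = [rng.gauss(0, 1) for _ in range(30)]
expr = {f"g{i}": [rng.gauss(0, 1) for _ in range(30)] for i in range(50)}
informative = [f"g{i}" for i in range(5)]
for g in informative:
    expr[g] = [o + 0.5 * rng.gauss(0, 1) for o in outcome]
p_value = random_signature_pvalue(expr, informative, outcome)
```

A signature is only interesting if its empirical p-value is small, meaning random gene sets rarely match it.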
Subtype Classification Models
Benjamin Haibe-Kains, Christos Sotiriou
Subtype Classification Models (SCMs)
Mixture of Gaussians in ER/HER2 gene expression
space to identify the main subtypes
Computation of maximum posterior probability of a
tumor belonging to a subtype
Benjamin Haibe-Kains, Christos Sotiriou
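The maximum-posterior step can be sketched as follows, assuming the mixture has already been fitted. The component weights, means, variances, subtype labels, and the diagonal-covariance simplification are hypothetical values chosen for illustration, not the published SCM parameters.

```python
import math

def gaussian_pdf_2d(x, mu, var):
    """Density of a 2-D Gaussian with diagonal covariance (variances `var`)."""
    d = 1.0
    for xi, mi, vi in zip(x, mu, var):
        d *= math.exp(-((xi - mi) ** 2) / (2 * vi)) / math.sqrt(2 * math.pi * vi)
    return d

def subtype_posteriors(x, components):
    """Posterior probability of each mixture component given point x."""
    weighted = {
        name: w * gaussian_pdf_2d(x, mu, var)
        for name, (w, mu, var) in components.items()
    }
    total = sum(weighted.values())
    return {name: v / total for name, v in weighted.items()}

# Hypothetical components: (weight, mean, variance) in (ER, HER2) space
components = {
    "ER-/HER2-": (0.20, (-1.5, -0.5), (0.5, 0.3)),
    "HER2+":     (0.15, (-0.5, 2.0), (0.8, 0.5)),
    "ER+/HER2-": (0.65, (1.0, -0.5), (0.5, 0.3)),
}
tumor = (1.2, -0.4)  # hypothetical (ER, HER2) module scores for one tumor
post = subtype_posteriors(tumor, components)
call = max(post, key=post.get)  # subtype with the maximum posterior
```

Keeping the full posterior, not just the call, shows how ambiguous a tumor sitting between components is.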
Subtype Classification Models (SCMs)
Different classification algorithms provide
different classifications for the same
samples, and these depend on the training set!
Benjamin Haibe-Kains, Christos Sotiriou
Subtype Classification Models (SCMs)
Classification accuracy for classifiers trained with
different subsets of a consolidated dataset of 5175
breast cancer patients
Benjamin Haibe-Kains, Christos Sotiriou
Significance Analysis of Prognostic Signatures
Andrew Beck, Benjamin Haibe-Kains
Requirements for
“reproducible and robust biomarkers”
Access to primary data used to derive the
signatures we use as biomarkers.
Access to the sample annotations that are
associated with those primary data.
Access to the software code used to build
predictive models.
Rigorous validation of biomarker signatures in
independent datasets, including assessment
compared to random gene sets.
Andrew Beck, Benjamin Haibe-Kains
Requirements for
“reproducible and robust biomarkers”
We need a training dataset and a (blinded) test
dataset for understanding the power of our
feature selection approach
We should benchmark our feature selection
against random feature sets
We need (multiple) independent training and
(blinded) test datasets for our algorithms to test
the robustness of the methods.
Andrew Beck, Benjamin Haibe-Kains
Additional Thought
Context is everything – we cannot think about
developing meaningful biomarkers without the
associated metadata.
Network Methods
What can we learn from networks?
Normal Tissue
Network
Chemosensitive
Tumor
Chemoresistant
Tumor
Regulation of Transcription
regulatory
sequences
promoter
Specific transcription factors
A Simple Idea: Message Passing
Transcription Factor
The TF is Responsible for
communicating with its Target
Downstream Target
The Target must be Available
to respond to the TF
Kimberly Glass, GC Yuan
Message-Passing Networks: PANDA
(Passing Attributes between Networks for Data Assimilation)
Genomic
Data
Use Message Passing to find a
consensus among the networks
Initial Network
Information
Protein-protein
interactions
Network
Representation
Cooperation
between TFs
Potential
Regulatory Events
Gene Expression
Potential CoRegulatory Events
genes
Protein-DNA
interactions
Learned Network
Information
Message
Passing
Glass et al. “Passing Messages Between Biological Networks to Refine Predicted Interactions.” PLoS One. 2013 May 31;8(5):e64832.
Code and related material available on SourceForge: http://sourceforge.net/projects/panda-net/
Message-Passing Networks:
PANDA
Motif Data
Network0
PPI0
Responsibility
PPI1
Expression0
Availability
Network1
Kimberly Glass, GC Yuan
Expression1
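A single responsibility/availability update can be sketched as below. This is a simplified illustration, not the published PANDA implementation: it updates only the regulatory network, skips normalization, and the similarity function, update rate `alpha`, and toy matrices are assumptions.

```python
import math

def similarity(x, y):
    """Tanimoto-style agreement between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = math.sqrt(sum(a * a for a in x) + sum(b * b for b in y) - abs(dot))
    return dot / denom if denom > 0 else 0.0

def column(m, j):
    return [row[j] for row in m]

def message_passing_step(W, P, C, alpha=0.1):
    """One update of the TF-by-gene network W from PPI (P) and co-expression (C)."""
    new_W = []
    for i in range(len(W)):
        row = []
        for j in range(len(W[0])):
            # Responsibility: do TF i's protein partners also target gene j?
            responsibility = similarity(P[i], column(W, j))
            # Availability: is gene j co-expressed with TF i's other targets?
            availability = similarity(W[i], column(C, j))
            row.append((1 - alpha) * W[i][j]
                       + alpha * 0.5 * (responsibility + availability))
        new_W.append(row)
    return new_W

# Toy inputs: 2 TFs, 3 genes; genes 0 and 1 are strongly co-expressed
W = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]            # motif-based prior
P = [[1.0, 0.0],
     [0.0, 1.0]]                 # protein-protein interactions
C = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]            # gene co-expression
updated = message_passing_step(W, P, C)
```

In this toy case the edge from TF 0 to gene 1, absent from the motif prior, gains weight because gene 1 is co-expressed with a known target, while the unsupported edge to gene 2 stays at zero.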
PANDA: Integrative Network Models
Conditions
Genes
Network for
Angiogenic Subtype
Expression data
(Angiogenic)
Genes
Conditions
Compare/Identify Differences
Network for
Non-angiogenic Subtype
Expression data
(Non-angiogenic)
Kimberly Glass, GC Yuan
Network Differences are captured in
Edges
15735 unique edges,
Including 49 TFs
Targeting 4419 genes
12631 unique edges,
Including 56 TFs
Targeting 4081 genes
Kimberly Glass, GC Yuan
Inner ring: key TFs
Colored by Edge
Enrichment (A or N)
Outer ring: genes
Colored by Differential
Expression (A or N)
Interring Connections
Colored by
Subnetwork (A or N)
Ticks – genes
annotated to
“angiogenesis” in GO
Ten “Key” Transcription Factors
Slide columns: TF differential Expression, Target differential Expression, TF differential Methylation, Target differential Methylation (no text survives for these four), Potential Connection with Angiogenesis, Publication(s) PMID

TF | Potential Connection with Angiogenesis | Publication(s) PMID
NFKB1 | important chromatin remodeler in angiogenesis | 20203265
ARID3A | required for hematopoietic development | 21199920
SOX5 | involved in prostate cancer progression, responsive to estrogen | 19173284, 16636675
TFAP2A | increases MMP2 expression and angiogenesis in melanoma | 11423987
NKX2-5 | regulates heart development | 10021345
PRRX2 | deletion causes vascular anomalies | 10664157
AHR | knock-out impairs angiogenesis | 19617630
SPIB | inhibits plasma cell differentiation | 18552212
MZF1 | represses MMP-2 in cervical cancer | 22846578
BRCA1 | inhibits VEGF and represses IGF1 in breast cancer | 12400015, 22739988
Regulatory Patterns suggest Therapies
Kimberly Glass, GC Yuan
Other disease datasets provide validation
Sorafenib (a bi-aryl urea) is a small-molecule inhibitor of
several tyrosine protein kinases (VEGFR and PDGFR)
and Raf kinases (more avidly C-Raf than B-Raf).
Message-Passing Networks:
PANDA 2.0
miRNA targets
Genetics
PPI0
Motif Data
Methylation
Network0
Expression0
Metabolomics
Responsibility
PPI1
Availability
Network1
Expression1
eQTL Analysis
Use genome-wide SNP data and gene expression
data together
Treat gene expression as a quantitative trait
Ask, “Which SNPs are correlated with the degree
of gene expression?”
Most people concentrate on cis-acting SNPs
What about trans-acting SNPs?
John Platig
eQTL Networks: A simple idea
eQTLs should group together with core SNPs
regulating particular cellular functions
Perform a “standard eQTL” analysis:
Y = β0 + β1 ADD + ε
where Y is the quantitative trait and ADD is the
allele dosage of a genotype.
John Platig, Fah Sathirapongsasuti
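The model above is an ordinary least-squares fit for one SNP-gene pair. A minimal sketch, with made-up dosages and expression values and a hypothetical function name, might look like this; significance testing and multiple-testing correction are left out.

```python
from statistics import mean

def fit_eqtl(dosage, expression):
    """Least-squares estimates (beta0, beta1) for Y = beta0 + beta1*ADD + eps."""
    mx, my = mean(dosage), mean(expression)
    sxx = sum((x - mx) ** 2 for x in dosage)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosage, expression))
    beta1 = sxy / sxx
    beta0 = my - beta1 * mx
    return beta0, beta1

# Toy data: allele dosage (0, 1, or 2 copies) and expression for six samples
add = [0, 0, 1, 1, 2, 2]
y = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
beta0, beta1 = fit_eqtl(add, y)
```

Here beta1 is the estimated change in expression per additional copy of the allele. A genome-wide analysis repeats this fit for every SNP-gene pair tested.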
Which SNPs affect function?
Many strong eQTLs are found near the target
gene. But what about multiple SNPs that are
correlated with multiple genes?
SNPs
Genes
John Platig
Can a network of SNP-gene associations
inform the functional
roles of these SNPs?
eQTL Networks: A simple idea
Create a bipartite graph where SNPs and genes
are nodes and significant eQTL associations are
edges.
Use “leading eigenvector” clustering to find
“communities” in the graph
John Platig, Fah Sathirapongsasuti
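Building the bipartite graph can be sketched as a filter over eQTL results followed by two adjacency maps. The SNP and gene identifiers, p-values, and the 0.05 threshold are hypothetical; the clustering step is not shown.

```python
from collections import defaultdict

def build_bipartite(associations, alpha=0.05):
    """Keep (snp, gene) pairs whose (adjusted) p-value passes the threshold."""
    snp_to_genes = defaultdict(set)
    gene_to_snps = defaultdict(set)
    for snp, gene, pval in associations:
        if pval < alpha:
            snp_to_genes[snp].add(gene)
            gene_to_snps[gene].add(snp)
    return dict(snp_to_genes), dict(gene_to_snps)

# Toy eQTL results: (SNP, gene, adjusted p-value)
associations = [
    ("rs1", "GENE_A", 1e-8),
    ("rs1", "GENE_B", 1e-6),
    ("rs2", "GENE_B", 0.001),
    ("rs3", "GENE_C", 0.4),   # not significant, so no edge
]
snp_to_genes, gene_to_snps = build_bipartite(associations)
degree = {snp: len(genes) for snp, genes in snp_to_genes.items()}
```

Edges only ever join a SNP to a gene, so each node's degree counts its significant associations.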
A bipartite network has 2 types of node
Links only connect different node types
Node types: SNPs, Genes
Correlation
SNPs
John Platig
Genes
Background
• A quantity x obeys a power law if it is drawn from a
probability distribution p(x) ∝ x^(-α)
• Scale-free networks emerge through:
  (1) expansion through addition of new vertices
  (2) new vertices attaching preferentially to sites that are
already well-connected
• Hubs dominate the topology of scale-free networks
• eQTL hotspots are genomic regions that play an
important role in regulating gene expression
Results: COPD
Can we use this network to
identify groups of SNPs and
genes that play functional roles
in the cell?
Try clustering the nodes into
‘communities’ based on the
network structure
John Platig
eQTL Networks: A simple idea
Communities are groups of highly intraconnected nodes
• Community structure algorithms group nodes
such that the number of links within a community
is higher than expected by chance
• Formally, they assign nodes to communities such
that the modularity, Q, is optimized
Fraction of network
links in community i
Fraction of
links expected
by chance
John Platig
Newman 2006
(PNAS)
Communities are groups of highly intraconnected nodes
Community structure algorithms group nodes such
that the number of links within a community is
higher than expected by chance.
Bipartite networks require a different null model
Implement “BRIM” algorithm
to find communities
John Platig
Newman 2006
(PNAS)
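For a bipartite network the "expected by chance" term compares each SNP-gene pair to the product of the two degrees. The sketch below evaluates one common bipartite modularity (Barber's form, Q = (1/m) Σ (A_ij - k_i d_j / m) over same-community SNP-gene pairs) for a given assignment. It scores a partition; it does not perform the BRIM optimization, and the function name and toy network are hypothetical.

```python
def bipartite_modularity(edges, snp_comm, gene_comm):
    """Bipartite modularity of a community assignment (no duplicate edges)."""
    m = len(edges)
    snp_deg, gene_deg = {}, {}
    for s, g in edges:
        snp_deg[s] = snp_deg.get(s, 0) + 1
        gene_deg[g] = gene_deg.get(g, 0) + 1
    edge_set = set(edges)
    q = 0.0
    for s, ks in snp_deg.items():
        for g, dg in gene_deg.items():
            if snp_comm[s] == gene_comm[g]:
                observed = 1.0 if (s, g) in edge_set else 0.0
                q += observed - ks * dg / m   # observed minus expected by chance
    return q / m

# Toy network: two clearly separated SNP-gene blocks
edges = [("s1", "g1"), ("s1", "g2"), ("s2", "g1"), ("s2", "g2"),
         ("s3", "g3"), ("s4", "g3")]
snp_comm = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
gene_comm = {"g1": 0, "g2": 0, "g3": 1}
q_split = bipartite_modularity(edges, snp_comm, gene_comm)
q_single = bipartite_modularity(
    edges, {s: 0 for s in snp_comm}, {g: 0 for g in gene_comm}
)
```

Putting every node in one community gives a modularity of zero, while the two-block assignment scores higher; an optimizer searches for the assignment that maximizes this quantity.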
BRIM produces GO enriched
Communities
John Platig
BRIM produces GO enriched
Communities
ATP6V1G2
ATRNL1
HLA-DQA2
HLA-DQB1
HLA-DQB2
HLA-DRA
HLA-DRB1
HLA-DRB4
HLA-DRB5
MAGEA2B
MICB
NCR3
PLEKHG6
PSORS1C1
TAP2
John Platig
Calculate Local Connectivity
Modularity of node i
Modularity of
community c
John Platig
Community Structure Matters
• Are “disease” SNPs skewed towards the
top of my SNP list as ranked by the
overall out degree?
• No!
• The highest-degree SNPs are devoid of
disease-related SNPs
• Highly deleterious SNPs that affect many
processes are probably removed by
evolutionary sweeps.
John Platig
Community Structure Matters
• Are “disease” SNPs skewed towards the
top of my SNP list as ranked by the
community core score (Qic)?
• Yes!
• KS test yields p < 10^-16
• Wilcoxon rank-sum test yields p < 10^-9
John Platig
The future is here.
It's just not widely distributed yet.
- William Gibson
Before I came here I was confused
about this subject.
After listening to your lecture,
I am still confused but at a higher level.
- Enrico Fermi, (1901-1954)
Acknowledgments
Array Software Hit Team
Eleanor Howe
John Quackenbush
Dan Schlauch
Gene Expression Team
Fieda Abderazzaq
Stefan Bentink
Aedin Culhane
Benjamin Haibe-Kains
Jessica Mar
Melissa Merritt
Megha Padi
Renee Rubio
<johnq@jimmy.harvard.edu>
Center for Cancer
Computational Biology
Dustin Holloway
Lan Hui
Lev Kuznetsov
Yaoyu Wang
John Quackenbush
http://cccb.dfci.harvard.edu
Students and Postdocs
Martin Aryee
Kimberly Glass
Marieke Kuijjer
Kaveh Maghsoudi
Jess Mar
Megha Padi
John Platig
Alejandro Quiroz
J. Fah Sathirapongsasuti
Systems Support
Stas Alekseev, Sys Admin
Administrative Support
Julianna Coraccio
University of Queensland
Christine Wells
Lizzy Mason
http://compbio.dfci.harvard.edu