Introduction to Genetics and Applications of Statistics in Genetics
Tomasz Burzykowski
Hasselt University & International Drug Development Institute (IDDI), Belgium
tomasz.burzykowski@uhasselt.be
Outline
♦ Elementary genetics
♦ “Omics”
♦ A cautionary case study
♦ Statistics for “omics” technologies
DNA (deoxyribonucleic acid)
♦ The hereditary material in a cell
is coded in the sequence of the
nucleotides of DNA.
• Human cells normally contain 46 chromosomes (23 pairs), each consisting of one double-stranded DNA molecule.
• The complete set is called the genome.
♦ Prior to cell division, the DNA
material must be duplicated so
that after cell division, each new
cell contains the full amount of
DNA material. The process is
usually called replication.
• Replication is semiconservative: each new DNA molecule contains one strand of the original DNA and one newly synthesized strand.
DNA and RNA
DNA Replication
♦ The double helix of DNA unwinds. Each DNA strand
serves as a template to guide the
synthesis of its complementary
strand of DNA.
♦ Template #2 guides the formation
of a new complementary #1
strand: A → T, C → G, T → A, etc.
Exactly the opposite reaction
occurs using template #1.
♦ The new sequences are checked
by two different polymerase
enzymes. Mismatched nucleotides
are hydrolyzed and cut out and
new correct ones are inserted.
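
The pairing rule on this slide (A → T, C → G, T → A, G → C) can be sketched in a few lines of Python. This is only an illustration: the function name and the example fragment are assumptions made here, and the sketch ignores strand direction (real strands are antiparallel) and the proofreading step.

```python
# Watson-Crick pairing used when a template strand guides synthesis:
# A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(template: str) -> str:
    """Return the base-by-base complement of a DNA template strand."""
    return "".join(COMPLEMENT[base] for base in template.upper())


# Hypothetical short fragment used as "template #2".
template_2 = "CATCGGCTTATC"
strand_1 = complementary_strand(template_2)
print(strand_1)  # GTAGCCGAATAG

# Using the new strand as a template gives back the original sequence.
print(complementary_strand(strand_1) == template_2)  # True
```

Applying the function twice returns the starting sequence, which is the "exactly the opposite reaction" described for template #1.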
Genes (1)
♦ You can think of a cell as a protein factory.
♦ Proteins are the basic building blocks of life.
• Some proteins are the fundamental, structural components of tissue; others (enzymes) are catalysts for chemical reactions.
♦ Each gene is a blueprint for a protein, which gets manufactured in the cell, and then goes and does some job elsewhere in the body, or maybe in the same cell.
♦ A gene specifies how to make a specific protein, using the materials typically found inside the cell (amino acids, AAs).
Genes
CATCGGCTTATCTAGCTAATCGAGCTCTCTGAAGAGAAATATCATCTACGACTACTACGACACACATCGACGAGGCATC
♦ A gene is a contiguous section of a chromosome that encodes information to build a protein or an RNA (ribonucleic acid) molecule.
♦ In humans, a gene is composed of about 10,000 bp.
♦ A chromosome contains genes and contiguous sections that are not part of any gene.
Proteins, Peptides, Amino Acids
♦ Proteins are large molecules composed of one or more AA chains (polypeptides), arranged in a biologically functional way.
♦ Peptides (Greek: "digested") are short chains of AAs. They are distinguished (arbitrarily) from proteins based on size (typically, a peptide has fewer than 50 AAs).
• dipeptides (two AAs), tripeptides, tetrapeptides, etc.
♦ A polypeptide is a long, continuous, and unbranched peptide.
Protein Structure
Amino Acids
The Genetic Code
♦ AAs are coded by triplets of nucleotides
♦ Redundancy: there are 20 basic AAs and 4^3 = 64 triplets
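
The redundancy can be checked by enumerating all triplets. The sketch below assumes the standard genetic code, written as a compact 64-letter string in U, C, A, G order with "*" for stop codons; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order ("*" = stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# All 4 x 4 x 4 ordered triplets of the four RNA bases.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
codon_table = dict(zip(codons, AMINO_ACIDS))

# Count how many codons map to each amino acid, then set the stops aside.
codons_per_aa = Counter(codon_table.values())
n_stop = codons_per_aa.pop("*")

print(len(codon_table))    # 64
print(len(codons_per_aa))  # 20
print(n_stop)              # 3
print(codons_per_aa["L"], codons_per_aa["M"])  # 6 1
```

With 61 coding triplets for 20 AAs, most AAs have several codons: leucine has six, while methionine has only one.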
Transcription: DNA → mRNA
♦ The genetic code is “read” from a type of RNA called messenger RNA (mRNA).
• DNA needs to be transcribed into mRNA.
Translation: mRNA → protein
♦ After mRNA has been produced, it leaves the nucleus to allow protein synthesis.
♦ In the cytoplasm, ribosomal RNA (rRNA) and protein combine to form a ribosome. It serves as the site and carries the enzymes necessary for protein synthesis.
♦ Transfer RNA (tRNA) contains about 75 nucleotides, 3 of which form the anticodon, and carries one AA. The tRNA reads the mRNA codon by using the anticodon and carries the AA to be incorporated into the protein. There are at least 20 different tRNAs, one for each AA.
Translation: gene → protein
RNA transcript: CUAGCUCGAUGCUCUGAGUACGUCUAG
Amino acids: L A R C S E Y V [stop]
The ribosome “translates” each 3-letter codon into a specific AA.
Human Genome Project
♦ A 13-year project coordinated by the U.S. Department of Energy and the National Institutes of Health
♦ Project goals were to:
• identify all the approximately 20,000-25,000 genes in human DNA,
• determine the sequences of the 3 billion chemical base pairs that make up human DNA,
• store this information in databases,
• improve tools for data analysis,
• transfer related technologies to the private sector, and
• address the ethical, legal, and social issues (ELSI) that may arise from the project.
♦ It was completed in 2003:
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
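
Returning to the transcription and translation slides: the two steps can be sketched in Python using the RNA transcript shown on the "Translation: gene → protein" slide. The sketch assumes the standard genetic code; the DNA string is simply that transcript with T in place of U, and the function names are illustrative. It reads codons from the first base, as the slide does, and ignores the start codon and other details of real translation.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order ("*" = stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """mRNA carries the coding-strand sequence with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Assumed coding-strand DNA corresponding to the slide's RNA transcript.
mrna = transcribe("CTAGCTCGATGCTCTGAGTACGTCTAG")
print(mrna)             # CUAGCUCGAUGCUCUGAGUACGUCUAG
print(translate(mrna))  # LARCSEYV
```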
We Know the Genome, but...
♦ The knowledge of the genome is only a start. We want to be able to answer questions like:
• What proteins do the genes code?
• What do the proteins do?
• In what processes are the genes/proteins involved?
• Can we modify the code contained in a gene? If so, how?
• ...
♦ This information cannot be obtained simply from the DNA sequence. Intensive biological experimenting is needed, using sophisticated technologies. This results in the need for suitable data-processing and analysis methods.
Genomics
♦ Genome: the set of genes.
♦ Genomics: “Any attempt to analyze or compare the entire genetic complement of a species or species (plural).”
• It is of course possible to compare genomes by comparing more-or-less representative subsets of genes within genomes.
♦ Genome (humans): 2.3 × 10^4
Transcriptomics
♦ Transcriptome: the set of expressed mRNA molecules.
♦ Transcriptomics: the study of the transcriptome.
♦ Transcriptome (humans): ~ 10^6
Proteomics
♦ Proteome: the set of proteins encoded by the genome.
♦ Proteomics: The study of the proteome; evokes not only all the proteins in any given cell, but also the set of all protein isoforms and modifications, the interactions between them, the structural description of proteins and their higher-order complexes.
♦ Proteome (humans): ~ 10^8
Metabolomics
♦ The study of the metabolome, the collection of all metabolites in a biological cell, tissue, organ or organism, which are the end products of cellular processes.
• Glycomics: study of glycomes (the entire complement of sugars)
• Lipidomics: study of lipids
Many other “omics”
♦ Genomics
• Cognitive Genomics, Comparative Genomics, Functional genomics, Metagenomics, ...
• Nutrigenomics
• Pharmacogenomics
• Toxicogenomics
♦ Proteomics:
• Immunoproteomics, Nutriproteomics, Proteogenomics
♦ ...
Tragicomics (Tragicomix)
Aims of “Omics” Experiments
♦ Class discovery
• E.g., gene- or protein-signatures to find new disease sub-types
“unsupervised learning”
♦ Class comparison
• E.g., comparison of protein abundance between biological conditions
“differential expression analysis”
♦ Class prediction
• E.g., gene or protein-signatures to be used for diagnostic purposes
“supervised learning”
High-throughput Technologies
♦ Genomics: genome-sequencing
♦ Gene-expression: microarrays, SAGE, RNA-seq
♦ Proteomics: mass-spectrometry, protein chips
♦ Metabolomics: mass-spectrometry, NMR
♦ …
“Omics” Technologies are Sophisticated and Impressive...
♦ Based on advanced scientific principles
♦ Use complex instrumentation
♦ Produce massive amounts of data (“high-throughput”)
... Which Makes Them Vulnerable
♦ Highly sensitive; systematic effects due to time, place, reagents, personnel, … can be visible
♦ Reproducibility can easily be compromised
♦ Variability can be considerable
♦ Naïve data analysis can lead to erroneous conclusions
Mass Spectrometry: A Case Study
Classification Using Mass Spectra (1)
♦ Use of mass spectra to discriminate between ovarian cancer and normal samples:
• Petricoin et al., Lancet 2002; 359: 572-577
• Conrads et al., Endocr Relat Cancer 2004; 11: 163-178
• Baggerly et al., Bioinformatics 2004; 20: 777-785
• Sorace and Zhan, BMC Bioinformatics 2003; 4:24
Classification Using Mass Spectra (2)
Lancet 2002; 359: 572-577
♦ 100 ovarian cancer patients; 100 normal controls; 16 patients with “benign disease” (216 in total)
♦ Method: 50 cancer and 50 normal spectra used to train a classifier; the algorithm tested on the remaining samples.
♦ Results:
• Correctly classified 50/50 of the “test” ovarian cancer cases (100% sensitivity).
• Correctly classified 63/66 of the “test” non-malignant cases (95% specificity).
Classification Using Mass Spectra (3)
♦ July 2004: samples processed with the original SELDI technology and with a higher resolution instrument (QqTOF)
♦ Attention paid to QA/QC
♦ The results indicate 100% sensitivity and 100% specificity for distinguishing cancer from normal samples
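
The sensitivity and specificity quoted for the Lancet 2002 test set follow directly from the counts on the slide. A minimal Python sketch of the arithmetic, with illustrative variable names:

```python
# Counts taken from the slide on the Lancet 2002 study.
n_total = 216   # 100 cancer + 100 normal + 16 benign disease
n_train = 100   # 50 cancer + 50 normal spectra used for training
n_test = n_total - n_train

test_cancer, cancer_correct = 50, 50        # cancer cases in the test set
test_non_malignant, non_malignant_correct = 66, 63  # 50 normal + 16 benign

# Sensitivity: fraction of true cancer cases classified as cancer.
sensitivity = cancer_correct / test_cancer
# Specificity: fraction of non-malignant cases classified as non-malignant.
specificity = non_malignant_correct / test_non_malignant

print(n_test)                 # 116
print(f"{sensitivity:.1%}")   # 100.0%
print(f"{specificity:.1%}")   # 95.5%
```

The test set of 116 samples is therefore made up of the 50 remaining cancer cases and the 66 remaining non-malignant cases.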
Classification Using Mass Spectra (4)
♦ Re-analysis of three datasets:
(1) described in Petricoin et al., 2002 (216 spectra)
(2) the same 216 samples run on the Ciphergen WCX2 ProteinChip array
(3) a new set of 253 spectra (91 normal and 162 cancer samples), run on the WCX2 array.
Classification Using Mass Spectra (5)
One can find a separation in dataset 3…
… but not using 5 features (peaks) from dataset 2.
Something’s gone wrong. What?
Classification Using Mass Spectra (6)
♦ Focus on their Figures 6 and 7
[Figure: panels labelled Day 1, Day 2, Day 3]
“… 32 spectra that were lesser quality (…) were all generated at the end of experimental run, suggesting that a deviation in the process had occurred.”
What Happened? Bias due to Confounding
[Figure legend: Cancer samples, Control samples]
Cancers were processed mainly on day 1...
... controls on days 2-3…
… but there were quality problems occurring on day 3...
… so what do we discriminate between?
          Day 1   Day 2   Day 3
Cancer    100%    0%      0%
Control   0%      ~50%    ~50%
[Diagram: three linked quantities. Sample status (cancer/control) and Day of measurement: association created by the study design. Day of measurement and MS intensity measurements: association created by chance. Sample status (cancer/control) and MS intensity measurements: observed association (induced).]
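
A small simulation can show how such a design induces an association. The Python sketch below is a toy model and not a reconstruction of the actual study: it assumes there is no true difference between cancer and control intensities, that day 3 shifts all measurements by a fixed amount, and that group sizes, noise level, and shift are arbitrary illustrative values. For comparison it also runs a design in which the day is assigned to every sample at random.

```python
import random
from statistics import mean


def simulate(design: str, n_per_group: int = 300,
             day3_shift: float = 5.0, seed: int = 1) -> float:
    """Return mean(control) - mean(cancer) when there is NO true group effect."""
    rng = random.Random(seed)
    intensities = {"cancer": [], "control": []}
    for status in intensities:
        for _ in range(n_per_group):
            if design == "confounded":
                # Cancers on day 1; controls split over days 2 and 3.
                day = 1 if status == "cancer" else rng.choice([2, 3])
            else:
                # Randomized: each sample gets day 1, 2 or 3 with equal probability.
                day = rng.choice([1, 2, 3])
            # Same baseline for both groups; only day 3 distorts the signal.
            value = rng.gauss(10.0, 1.0) + (day3_shift if day == 3 else 0.0)
            intensities[status].append(value)
    return mean(intensities["control"]) - mean(intensities["cancer"])


diff_confounded = simulate("confounded")
diff_randomized = simulate("randomized")
print(f"confounded design: {diff_confounded:.2f}")
print(f"randomized design: {diff_randomized:.2f}")
```

Under these assumptions the confounded design shows a clear group difference (about half the day-3 shift) even though status has no effect on intensity, whereas the randomized design gives a difference close to zero.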
How Could the Problem Have Been Prevented?
♦ By randomizing the order of processing the samples
• Measurement days (“interventions”) assigned to each of the samples with equal probability
♦ It would balance the distribution of days within the groups
• The association between the day and group would be eliminated
Discrimination Using Mass Spectra (7)
♦ Re-analysis of the third dataset (253 spectra, WCX2 array)
♦ Found perfect classification rules, using only two m/z features
Discrimination Using Mass Spectra (8)
Common Features of Technology and Data
♦ Sophisticated instrumentation
♦ Experimental techniques highly sensitive; systematic effects
due to time, place, reagents, personnel, … can be visible
♦ Reproducibility can easily be compromised
♦ Large amount of data, many measurements/sample (10^3 - 10^6)
♦ Highly structured/complex data (correlation, variability, etc.)
Common Features of Analyses
♦ Sophisticated instrumentation
• needs understanding
♦ Experimental techniques highly sensitive
• pre-processing (removal of artifacts, normalization, ...)
♦ Many measurements per sample (10^3 - 10^6)
• Multiple testing adjustment
• Automated analyses preferred
• Computational time needs to be considered
♦ Data structure/complexity
• Modelling preferred...
• ... but taking into account assumptions and computational time
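
To illustrate the multiple testing adjustment mentioned above: with 10^4 features tested at level 0.05 and no real effects, about 500 would be declared significant by chance alone. The Python sketch below shows two common adjustments, Bonferroni and Benjamini-Hochberg. The slides do not say which method is meant, so both are shown as examples, and the p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone in the original ordering.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four features.
pvals = [0.001, 0.01, 0.02, 0.5]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Both functions run in at most O(m log m) time, which matters when the analysis has to be automated over many features.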