Document 6533808
in vivo, in vitro, in silico: The Frontier of Computational Biology and Functional Genomics
11th NTT Science Forum, 6 April 2000

“NIH Urged to Train Biologists on Computers”
Headline in The Washington Post, Monday, June 7, 1999
Recommendation of Federal Advisory Panel to NIH Director Varmus: Establish 20 new U.S. centers to teach computer-based biomedical research at a cost of US$8M per center per year.
[Photo: Dr. Harold Varmus]

Why?
“It’s sink or swim as a tidal wave of data approaches”
Nature 399:517, 10 June 1999

Scientific literature continues to accumulate at a rapid rate
Now over 10 million articles in MEDLINE®. 400,000 new articles added each year.
[Figure: number of MEDLINE articles (0 to 12,000,000) by year, 1965 to 1995. National Library of Medicine]

Molecular Genetics articles are accumulating even faster than general scientific literature
Over 1 million Molecular Biology and Genetics articles in MEDLINE®.
[Figure: number of Molecular Biology and Genetics articles (0 to 1,200,000) by year, 1965 to 1995.]

The rate at which DNA sequences are accumulating is exponential
Over 6 million sequence entries in GenBank. Over 5 billion bases from 50,000 species.
[Figure: number of GenBank sequence entries (0 to 6,000,000) by year, 1965 to 2000. Annotations: "Rapid DNA sequencing invented", "Human Genome Project begun".]

How do we bridge the gap between sequence and function?
[Figure: publications and DNA sequences (0 to 6,000,000) by year, 1975 to 2000. Annotations: "DNA Sequencing Invented", "Human Genome Project Begun", "The Gap".]
Science (Genome Issue), 15 Oct. 1999

Gene Mapping Milestones
1996: 15,000 genes
1998: 30,000 genes
1999: 35,000 genes
Most mapped genes are anonymous: their locations are known but their functions await discovery.
The Gene Map has helped to identify genes responsible for inherited diseases
Polymeropoulos MH, et al. Mutation in the alpha-synuclein gene identified in families with Parkinson's disease. Science 276, 2045-2047 (1997).
[Figure label: Parkinson’s disease gene]

The Accelerating Human Genome Project
Nature (September, 1998) and Science (October, 1998): Waterston; Collins
Nature (March, 1999): Gibbs
Science (March, 1999): Lander

Sequenced Regions of Human Chromosomes
To date, more than 800 million bases of DNA sequence have been produced.

State of the Genome, April 1999 and May 2000
Chr.7: 49% finished
Chr.X: 42% finished
Chr.21: 74% finished
Done! 33 Mb, 2 Dec 1999

The Human Genome Project is an International Effort
USA 50%
UK 26%
France 12%
Japan 6% (JFCR, JST Corp., Keio University, RIKEN GSC, Tokai University)
Germany 4%
other 2%

Human Genome Project data available now on the Internet … for use by researchers prior to genome completion

Details of Chromosome 22

Sequence, Biology and Medicine

Computational Biology: Performing biological experiments in silico

Gene > DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA
ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA
TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG
GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA
CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC
TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA
ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG
TAAGAAGATCGCGAACATCTAGTAGA

Biological structure & function > Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI
DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK
KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE
PDEAEQDCIEFGKKIANI

The power of computing on the data!
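The step from the DNA sequence to the protein sequence on the "in silico" slide can be reproduced in a few lines. The sketch below is illustrative only and is not taken from the talk: the names `CODON_TABLE` and `translate_first_orf` are hypothetical, and it assumes the standard genetic code, the forward strand, and that translation starts at the first ATG and ends at the first in-frame stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the slide's gene uses the standard code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# DNA sequence as shown on the slide.
DNA = (
    "AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC"
    "TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA"
    "TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA"
    "ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG"
    "TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA"
    "TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG"
    "GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA"
    "CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC"
    "TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA"
    "ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG"
    "TAAGAAGATCGCGAACATCTAGTAGA"
)

# Protein sequence as shown on the slide.
PROTEIN = (
    "MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI"
    "DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK"
    "KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE"
    "PDEAEQDCIEFGKKIANI"
)


def translate_first_orf(dna):
    """Translate from the first ATG up to (not including) the first in-frame stop."""
    start = dna.find("ATG")
    if start < 0:
        return ""
    residues = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon ends the reading frame
            break
        residues.append(aa)
    return "".join(residues)


if __name__ == "__main__":
    print(translate_first_orf(DNA) == PROTEIN)
```

Under these assumptions the translation of the slide's DNA sequence reproduces the slide's protein sequence exactly. A real gene finder would also have to consider the reverse strand, alternative start sites, and introns, none of which this sketch handles.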
Ataxia-telangiectasia gene: 18 years and 5 minutes
New England Journal of Medicine 333:645-7; 1995

Comparative Analysis of Genes
Cell, Vol. 75, 1027-1038, December 3, 1993, Copyright © 1993 by Cell Press
The Human Mutator Gene Homolog MSH2 and its Association with Hereditary Nonpolyposis Colon Cancer
Richard Fishel,* Mary Kay Lescoe,* M. R. S. Rao,§ Neal G. Copeland,† Nancy Jenkins,† Judy Garber,‡ Michael Kane,§ and Richard Kolodner§
*Department of Microbiology and Molecular Genetics, Markey Center for Molecular Genetics, University of Vermont Medical School
[Image: excerpt from the first page of the article; the body text is only partially legible.]

Similarity to bacterial and yeast genes sheds new light on human disease process
Human   638 RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC 697
Yeast   657 RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC 716
E.coli  584 RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA 642
(portion of DNA mismatch repair protein sequence)

Comparative Analysis of Genomes
“What is true for E. coli is also true for elephant.” Jacques Monod, c. 1961
“What is true for yeast is also true for human.” David Botstein, 1988
The importance of “model organisms”

Mouse Genes are closely related to Human Genes
[Figure: pairwise DNA sequence identity among Human, Rat, and Mouse; values shown: 86%, 85%, 93%.]
DNA sequence identity was computed for more than 1000 pairs of orthologous human, mouse, and rat genes.

“Homology ... is the central concept for all of biology. Whenever we say that a mammalian hormone is the ‘same’ hormone as a fish hormone, that a human gene sequence is the ‘same’ as a sequence in a chimp or a mouse, that a HOX gene is the ‘same’ in a mouse, a fruit fly, a frog, and a human -- even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition -- we have made a bold and direct statement about homology.
The aggressive confidence of modern biomedical science implies that we know what we are talking about.” David B. Wake

“Dinosaur DNA”

BLAST Search with “Dinosaur DNA”
NCBI BLASTN 1.1.7MP [23-Nov-90]
Query: "Dinosaur DNA" from Crichton's JURASSIC PARK, p. 103, nt 1-1200
Database: GenBank Release 65.0 (complete), October 1990
39,533 sequences; 49,179,285 total residues.

Sequences producing high-scoring segment pairs:
>Plasmid pBR322, complete genome. length = 4361
(Slide annotation: A common piece of DNA used in every molecular biology laboratory!)

Score = 328, Matches = 95% (68/71), Query strand = Plus
Expect = 9.7e-18, Poisson P = 9.7e-18

Query:  721 CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA 780
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 2581 CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA 2640

Query:  781 GCGCTCTCCTG 791
            || | ||| ||
Sbjct: 2641 GCTCCCTCGTG 2651

Score = 320, Matches = 93% (68/73), Query strand = Plus
Expect = 4.5e-17, Poisson P = 4.5e-38

Query:  530 GCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGG 589
            || || ||  ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1026 GCATCGGGATGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGG 1085

Dot matrix analysis of Dinosaur DNA
Window Size = 35; Min. % Score = 100; Scoring Matrix: DNA Matrix
[Figure: dot plot of Dinosaur DNA (vertical axis, 0 to 1200) against pBR322, complete genome (horizontal axis, 0 to 4000); matching segments labelled A, C1, C2, and B.]

Rejected by Science
Rejected by Nature
Rejected by Cell
Published in BioTechniques 12(5):668-9; 1992

Dr. Crichton’s reply:

Another Dinosaur gene. Or is it?
Crichton, The Lost World

Database Search with “Lost World” DNA
Similar to chicken and frog DNA!
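The dot matrix analysis of the "Dinosaur DNA" against pBR322 reports two settings, a window size of 35 and a minimum score of 100%. A minimal sketch of how such a comparison could work is given below. It is not the program used for the slide: the function name `dot_plot` is hypothetical, the scoring is plain identity (1 for a match, 0 for a mismatch), and the brute-force loops favour clarity over speed.

```python
def dot_plot(seq_a, seq_b, window=35, min_percent=100.0):
    """Return (i, j) pairs where the windows starting at seq_a[i] and
    seq_b[j] reach at least min_percent identity.

    Each returned pair corresponds to one dot in the plot; a run of
    pairs along a diagonal indicates a shared segment.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical positions inside the two windows.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if 100.0 * matches / window >= min_percent:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy sequences (made up for illustration) sharing the segment "ACGTACG".
    print(dot_plot("ACGTACGTTT", "GGACGTACGG", window=5, min_percent=100.0))
```

With the slide's settings (`window=35`, `min_percent=100.0`) only stretches of at least 35 consecutive identical bases produce dots, so the labelled segments in the figure would correspond to long exact matches between the two sequences.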
Query: Sequence from THE LOST WORLD, page 135 (1435 bases)
Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
316,522 sequences; 481,803,458 total letters
Searching...done

Sequences producing significant alignments:                                  High Score  E Value
gb|M26209|CHKRERYF1   Chicken erythroid-specific transcription fa...         783         0.0
gb|M76564|XELGATAC    X.laevis GATA-binding protein (XGATA-2) gene...        670         0.0
gb|M76563|XELGATAB    X.laevis GATA-binding protein (XGATA-1B ) ge...        248         1e-63
dbj|D13518|RATGATA1   Rat mRNA for transcription factor GATA-1, c...         71.9        2e-10
emb|X95701|HSGATA6PR  H.sapiens mRNA for GATA-6 DNA-binding protein...       65.9        1e-08
gb|U66075|HSU66075    Human transcription factor hGATA-6 mRNA, com...        65.9        1e-08
gb|U91328|HSU91328    Human hereditary haemochromatosis region, hi...        60.0        6e-07
emb|X00257|SCCDC28    Yeast CDC28 (cell division control) gene...            60.0        6e-07
emb|X99254|PFPRIMSSU  P.falciparum gene encoding primase, small...           60.0        6e-07
emb|Z36028|SCYBR159W  S.cerevisiae chromosome II reading frame O...          60.0        6e-07

A Secret Message in the Dinosaur DNA
Score = 607 bits (1637), Expect = e-174
Identities = 304/318 (95%), Positives = 304/318 (95%), Gaps = 14/318 (4%)

QUERY    1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60
           MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG
P17678   1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60

QUERY   61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 120
           TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV    NCGAT
P17678  61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116

QUERY  121 ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 180
           ATPLWRRDGTGHYLCN   ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS    NCQT
P17678 117 ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169

QUERY  181 STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 240
           STTTLWRRSPMGDPVCN   ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
P17678 170 STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226

The residues present only in the query, at the four gaps, spell: MARK WAS HERE NIH
Mark Boguski

Cloning of the mouse “Obesity Gene”
The gene encodes a protein called “leptin.” Sequence homology searching revealed nothing about the possible function of this protein.
Zhang et al. (1994). Nature 372, 425-432.

Computed prediction of leptin’s 3D structure
[Figure panels: IL-2 structure / IL-2 sequence; IL-2 structure / leptin sequence]
The protein sequence of leptin is compatible with the protein structure of interleukin-2 (IL-2), suggesting that the two may have similar mechanisms of action.
Madej et al. (1995). FEBS Lett. 373, 13-18.

The structure of leptin is now known
As predicted, it is a member of the long-chain helical cytokine family, like IL-2.
Zhang et al. (1997). Nature 387, 206-209.

How do we bridge the gap between sequence and function?
[Figure: publications and DNA sequences (0 to 6,000,000) by year, 1975 to 2000. Annotations: "DNA Sequencing Invented", "Human Genome Project Begun", "The Gap".]
Science (Genome Issue), 15 Oct. 1999

“Functional Genomics”
...refers to the development and application of global (genome-wide or system-wide) experimental approaches to assess gene function by making use of the information and reagents provided by genome projects. It is characterized by high throughput or large scale experimental methodologies combined with statistical and computational analysis of the results. The fundamental strategy in a functional genomics approach is to expand the scope of biological investigation from studying single genes or proteins to studying all genes or proteins at once in a systematic fashion.
Hieter & Boguski (1997). Science 278: 601-602.

Tens of thousands of genes can be studied in a single microarray experiment
V. R. Iyer et al. (1999). Science 283, 83-87.
Gene Expression Profiling using DNA Microarrays
[Figure: microarray image; labelled spots: Plasminogen activator inhibitor-2, HMG CoA reductase.]
Each spot corresponds to a single gene.
Signal color and intensity reveal changes in gene activity.

“Anticipated advances in computer speed will be unable to keep up with the growing [DNA] sequence databases and the demand for homology searches of the data.”
Charles DeLisi, 1988, U.S. Department of Energy

Luckily, DeLisi’s dire prediction has not (yet) come true
Moore’s Law vs. Growth of GenBank
[Figure: Transistors/chip and DNA Sequences on a logarithmic vertical axis (1 to 100,000,000) by year, 1970 to 2000.]

Computational model of heart failure
Model based on aberrant behaviour of cardiac ion transporter genes.
Computation requires days of time on a large, multiprocessor computer.

$$\frac{\partial v(x,t)}{\partial t} = \frac{1}{C_m}\left[ -I_{ion}\big(v(x,t)\big) - I_{app}(x,t) + \frac{1}{\beta}\left(\frac{\kappa}{\kappa+1}\right)\nabla \cdot \big(M_i(x)\,\nabla v(x,t)\big) \right], \quad \forall x \in H$$

Terms labelled on the slide: Total Membrane Current; Coupling Current.

ELSI

National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health