Kevin Drew Thesis Intro - Bonneau Lab
Predicting old functions and designing new ones: Genome scale protein annotation & Peptidomimetic design

by Kevin Drew

A dissertation submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy
Department of Basic Medical Science
New York University
May 2013

Richard Bonneau, PhD

Introduction Only

© Kevin Drew, All Rights Reserved, 2013

Introduction

Erwin Schrödinger, in his 1944 book What is Life?, described life as a system capable of exporting entropy [1]. Simply, if the living system is unable to retain its own order and export entropy to the environment, it will reach equilibrium and die. Conversely, if it is capable of transferring entropy from inside the system to outside while retaining order, it will continue living. This description of life is neat and concise. The mechanisms that achieve such a task are, however, enormously complex. Proteins, polymeric chains of amino acids, are responsible for much of this sophisticated functionality. This thesis is, broadly, about protein function.

The study of protein function, as with nearly all biological fields, depends on the intimate relationship between the concept of the gene and its end product, most notably the protein. Genes are passed from cell to cell, organism to organism, generation to generation, and most often encode biochemically active proteins that interact with each other (and with other molecules) to carry out the functions of the cell. Understanding genes, their resulting proteins, and their interactions within the cell is perhaps the fundamental focus of modern biology. There are many ways to go about this understanding, but ultimately the interactions of life are chemical and best described at the atomic and molecular level. The goal of this atomic and molecular approach is to identify, for example, the specific atoms or amino acid residues responsible for a specific function of a protein produced by a gene that is linked to a phenotype of interest (such as a disease state).
If we have a clear understanding of protein function at this level of resolution, we can then begin to manipulate those functions to, for instance, cure diseases. These challenges are enormous, but this thesis addresses two specific avenues (among many) for using atomic and molecular approaches to better describe and engineer biology, allowing greater understanding and utility. The two avenues revolve around the computational software Rosetta, which has the ability to both model and design atomic structure.∗ The first approach uses the modeling aspect of Rosetta as part of a genome annotation pipeline that assigns structure and function annotations to proteins of unknown function. The second approach uses the design aspect of Rosetta to design specific molecular inhibitors of protein interactions, thus modulating their function. These two approaches address important problems in the field of protein function.

∗ “If the only tool you have is a hammer, you tend to see every problem as a nail.” - Abraham Maslow

0.1 Roots of the field

The roots of this research rely heavily on work that began in the middle of the last century, when the concept of the gene was nonphysical, the genetic material was still unknown, and molecular biology had yet to evolve from genetics and biochemistry (see timeline in figure 1). In 1941, George Beadle and Edward Tatum, attempting to understand how genes control biochemical function through experiments with the model system Neurospora, conclusively linked the gene to a biochemically active protein [2]. Using X-ray mutagenesis, they found strains incapable of synthesizing vitamin B6, a defect that was due to the lack of a functional protein. These effects were inherited by further generations of the strains, suggesting that the gene held the information necessary for a specific biochemical function. This discovery was the pivot point on which the fields of genetics and biochemistry would eventually merge.
In 1944, Erwin Schrödinger, a quantum physicist, published a series of lectures, What is Life? [1], in which he contemplated the physical description of life. This book inspired many researchers to think about the chemical nature of the gene. Common thought of the day was that protein was the genetic material and DNA was just a structural component of the cell. Oswald Avery conducted an experiment in 1944 that suggested DNA, not protein, was the chemical component of the gene [3], and this work was furthered by Hershey and Chase in 1952, who isotopically labeled protein and DNA to show that DNA was definitively the chemical component of the gene [4]. In 1953, Watson and Crick determined the structure of DNA, providing the greatest evidence that DNA and the genetic material were one and the same [5]. The next dozen or so years provided the “Central Dogma” [6], the genetic code [7, 8] and a near complete description of the relationship of genetic codons to amino acids [9]. Interspersed among these events was Linus Pauling’s discovery in 1949 that a genetic mutation caused a molecular change in the protein hemoglobin that altered its function [10].

Figure 1: Timeline of important protein structure and function discoveries

Besides being foundational discoveries of molecular biology, these events provide the basis for functional annotation of the proteins in a genome. If one knows the DNA sequence of a gene, one will know the protein sequence that the gene encodes. Knowing this, along with Pauling’s observation that a protein’s sequence is related to its function, provides a tremendously powerful insight into determining the function of proteins. Fast forward four decades (with a brief stopover in 1977 to mention Frederick Sanger’s sequencing technology and the publication of the first sequenced genome, bacteriophage phi X174 [11]) to 2001, and the complete sequence of human DNA [12, 13] is known along with those of the model organisms E. coli [14], yeast [15], C.
elegans [16], Drosophila melanogaster [17] and Arabidopsis thaliana [18]. With this vast amount of sequence data for genes and their resulting proteins, and with the observation that a gene’s sequence is related to its function, we have many of the pieces necessary to determine a gene’s function.

0.2 Anfinsen’s dogma

Unfortunately, however, uncovering general principles that determine a protein’s function from its sequence has been more complex than uncovering the code that translates DNA sequence into protein sequence. To understand the relationship between protein sequence and protein function, we must think more about the configuration of atoms that is determined by a protein’s sequence, and ultimately about its shape. A protein’s function is determined by the protein’s sequence of amino acids, but only indirectly. More directly, a protein’s function is determined by the arrangement of atoms in three dimensional space. For example, the enzyme Ribonuclease A consists of 124 amino acids (372 DNA coding bases) and functions to cleave single stranded RNA. Christian Anfinsen, in his studies of Ribonuclease A, showed that denaturing the protein and reducing its disulfide bonds destroyed its ability to cleave RNA. More importantly, however, he showed that when the disulfides are re-oxidized, the protein’s function returns, albeit at a lower percentage of the native protein’s activity [19, 20]. This showed that the protein could be chemically unfolded and then recover its shape, and therefore its function, without any external factors. In other words, all the information necessary for the protein’s fold and function was in its sequence. Unbeknownst to him at the time, the reduction of disulfide bonds in Ribonuclease A and the subsequent unfolding disrupted the arrangement of residues important for catalytic activity.
Interestingly, none of the residues important for Ribonuclease A function, His12, Lys41, His119 and Asp121, are near each other in linear sequence space; it is their relative orientation in three dimensional space that allows each residue to carry out its chemical function (figure 2).

Figure 2: Ribonuclease A: a model for protein folding. Protein structure (pdbid: 7RSA) of Ribonuclease A, showing residues important for function (His12, Lys41, His119 and Asp121) and residues important for conformational stability (Cys26, Cys81, Cys58, Cys110).

Anfinsen’s work on Ribonuclease A laid the foundation for the field of protein folding.

0.3 Evolutionary relationships of protein structure

Another observation about the relationship between protein sequence and structure came through examining the first experimental structures of proteins. It was noticed almost immediately after the x-ray crystal structure of the tetrameric hemoglobin had been solved that each chain resembled the structure of myoglobin, a protein whose structure was solved just a few years earlier [21, 22]. At the time, the sequences of these proteins were not fully known, so further comparisons of the sequence-structure relationship were limited. But once protein sequencing technologies improved and more sequences became available, Perutz and Kendrew noted that only 9 residues out of 140 total amino acids in hemoglobins from various vertebrates were identical even though their overall three dimensional structures were similar [23]. Later, in 1986, Chothia and Lesk analyzed the protein structures available at the time by comparing both sequences and structures of homologous protein pairs [24]. They discerned a trend where protein structure was more conserved than protein sequence.
This observation generalized Perutz and Kendrew’s observation of the globin family sequence-structure relationship and suggested that, when available, protein structure similarity is a better metric for inferring homology than protein sequence similarity.

Further studies of the relationship between protein structure and protein sequence continued as the number of protein sequences and structures grew. In contrast to the early days of protein crystallography, when the structures of some proteins were known and their sequences were not, protein sequences soon became available in abundance while protein crystallography continued at a slow pace. Because of the difficulty of experimentally solving protein structures, it became of interest to infer protein structure from sequence alone. One notable analysis by Chris Sander, later improved by Burkhard Rost, laid out parameters of protein sequence similarity (i.e. residue similarity and length of alignment) that could be used to accurately identify similar structures between pairs of proteins [25, 26]. This allowed one to infer the structure of one protein from another if their sequences were similar enough. Interestingly, there seems to be a realm of sequence space, coined the “Twilight Zone of protein sequence alignment”, where a pair of proteins shows no detectable sequence similarity but has nearly identical three dimensional folds. In other words, two proteins in this region that share structure similarity but not sequence similarity are still likely to have a common evolutionary origin (and possibly a conserved molecular function).

From this one can reasonably infer that pairs of proteins with high sequence similarity may be predicted to have similar functions because of the likely similarity of their structures. But what about protein pairs in the “Twilight Zone”? Is it possible to find relationships among proteins whose sequences have diverged beyond the detectable limits? Without sequence similarity, we would have to rely on structure similarity.
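The two sequence-similarity parameters mentioned above (residue identity and alignment length) can be made concrete with a small sketch. It assumes the two sequences have already been aligned and uses `-` as the gap character; the function name `percent_identity` and the toy sequences are illustrative assumptions, and no published threshold from [25, 26] is reproduced here.

```python
from typing import Tuple


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> Tuple[float, int]:
    """Return (percent identity, aligned length) for two pre-aligned sequences.

    Columns where either sequence has a gap are skipped, so the aligned
    length counts only residue-residue columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0, 0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns), len(columns)


# Toy alignment (made-up sequences, not real proteins).
identity, length = percent_identity("MKT-AYIAKQR", "MRTLAYLA-QR")
print(f"{identity:.1f}% identity over {length} aligned residues")
```

Both numbers matter together: a given percent identity over a short alignment is much weaker evidence of structural similarity than the same identity over a long one.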
Unfortunately, experimentally determining protein structure is notoriously expensive and laborious. We are left with a need to theoretically predict the three dimensional structure of a protein from its sequence alone. This is what is known as the protein folding problem.

0.4 Protein folding problem

In lecturing about the protein folding problem in 1969, Cyrus Levinthal calculated that for a 100 residue protein, there would be 10^300 separate conformations available [27]. Essentially, a sequence can exhibit 10^300 different ‘folds’ and it is the theoretician’s job to pick the one observed in nature. Levinthal went on to note that nature surely did not sample all conformations, so there must be identifiable rules that limit the search space of protein folding. One such rule is that local structure formation of the protein backbone nucleates the rest of folding (another is the observation that hydrophobic packing in the core of the protein is a main driving force of folding [28]).

Rosetta is one of many computational algorithms∗ that take advantage of local structure formation prior to global folding while attempting to predict the three dimensional structure of a protein [29]. To do this, Rosetta combines many small fragments of experimentally determined structures whose sequences match portions of the sequence of interest. This approximation removes a large number of degrees of freedom from the system and therefore makes the problem more tractable. Rosetta then uses a Monte Carlo simulated annealing approach to sample the remaining degrees of freedom. After each change of a degree of freedom, the new conformation is compared to the old conformation using an energy-based score function and is accepted or rejected according to the Metropolis criterion. The score function comprises several terms, including one that favors hydrophobic residues in the core of the protein and hydrophilic residues exposed to solvent.
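The accept/reject rule described above can be sketched in a few lines. This is a minimal illustration of Metropolis Monte Carlo with simulated annealing on a made-up score, not Rosetta's implementation: `toy_energy`, the geometric cooling schedule, the 15 degree step size, and all parameter values are assumptions chosen only to keep the example small.

```python
import math
import random


def metropolis_accept(delta_e: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis criterion: always accept moves that do not raise the score;
    accept uphill moves with probability exp(-delta_e / temperature)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def toy_energy(angles):
    """Hypothetical score: zero when every torsion angle is -60 degrees."""
    return sum(1.0 - math.cos(math.radians(a + 60.0)) for a in angles)


def anneal(n_angles=10, steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Return (starting score, final score) of one annealing trajectory."""
    rng = random.Random(seed)
    angles = [rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
    energy = toy_energy(angles)
    start_energy = energy
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end (an assumption).
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n_angles)
        old = angles[i]
        angles[i] = old + rng.gauss(0.0, 15.0)  # perturb one degree of freedom
        new_energy = toy_energy(angles)
        if metropolis_accept(new_energy - energy, temperature, rng):
            energy = new_energy
        else:
            angles[i] = old  # reject: restore the previous conformation
    return start_energy, energy
```

Calling `anneal()` runs one trajectory; early on, the high temperature lets the search escape poor conformations, and as the temperature drops, uphill moves become increasingly unlikely to be accepted.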
Using the Rosetta algorithm, Bonneau et al. [30] achieved a significant milestone in protein folding by correctly predicting the folds of several proteins at the 2000 CASP4 competition (the Critical Assessment of protein Structure Prediction is a biennial competition which attempts to determine the state of the art among protein folding algorithms) [31]. Building on this result, they observed that the Rosetta structure predictions were accurate enough to compare to known structures in the Protein Data Bank (PDB) and find proteins with similar structures [32]. The Rosetta predicted structure of one such target, Bacteriocin AS-48, was compared to all known structures and matched PDB entry 1NKL, an NK-lysin protein (the experimental structures match to 3 Å RMSD, see figure 3). Both are functionally related as lysins but, similar to Perutz and Kendrew’s observation of the globin family described above, when the NK-lysin and Bacteriocin AS-48 protein sequences are compared, only 14% of residues are identical. The low sequence similarity and the high structural similarity of this pair put it squarely in the “Twilight Zone” of protein sequence alignment. Additionally, the fact that a predicted structure was accurate enough to match a highly similar structure in the PDB showed promise that this procedure would be useful for functionally annotating proteins on a genome scale.

∗ “Computer Science is no more about computers than astronomy is about telescopes.” - E. W. Dijkstra

Figure 3: Bacteriocin AS-48 CASP4 target. The Rosetta structure prediction from CASP4 matched 1NKL in the PDB [30]. The proteins have low sequence similarity, high structural similarity, and related functions.

0.5 Scaling Up

These results from Rosetta’s CASP4 predictions provided a proof of concept for further studies into genome annotation using the method of comparing predicted structures to experimentally determined structures to infer function.
The computational cost of the Rosetta de novo algorithm is quite large, and prediction for a single protein domain is estimated to take over a year on a single CPU. Moreover, there are four thousand protein domains in the human proteome that are suitable for de novo structure prediction, which would therefore take four thousand years on a single-CPU computer. This cost is a large barrier to any project that attempts to use structure predictions to annotate genomes. Fortunately, there have been tremendous advances in distributed computing, specifically grid computing. The premise behind grid computing is that the use of individual computers is sporadic, with short periods of intense use and long periods during which the computer sits idle. If we construct programs so that they run on computers while they sit idle and pause when the computers are in use, then we can utilize these idle CPU cycles. IBM’s World Community Grid (WCG) provides the infrastructure to construct programs in this way so that we can distribute Rosetta structure predictions across thousands of idle computers. Hundreds of thousands of volunteers all over the world download the WCG screen saver, which includes the Rosetta executable, onto their personal computers. While their computers are idle, the screen saver downloads a protein sequence from WCG servers and runs our structure prediction algorithm. It is this ability to distribute computation across many computers that allows structure predictions to be obtained in a timely manner.

0.6 Proteins and their interactions

Determining the molecular function of individual proteins is an important goal, but even if all proteins were fully annotated, it would not complete our understanding of the cell. This is because proteins do not act in isolation but rather generally interact with other proteins to perform their molecular functions.
Therefore studying the partners with which a single protein interacts is as important as determining the molecular function of the individual protein.∗ One way to study a specific protein interaction is to prevent it from occurring and observe the effect on the cell (or organism) when the interaction is absent. This is similar to genetic knockout studies, in which a gene is removed from a genome, often uncovering a function or process for which the gene is required. Modulating a protein interaction will provide a similar tool for studying the effect of that interaction on processes in the cell. Additionally, having a way to modulate a protein interaction would be beneficial for drug development because many diseases are caused by misregulated protein interactions. Disrupting an interaction may provide a way to resolve this misregulation [33].

A good starting point toward understanding how to modulate protein interactions is to look at the chemical makeup of the interface. Soon after the first crystal structures were determined, Chothia and Janin (1975) [34] studied three structures of protein-protein interactions. They noted three points: 1) amino acid side chains at the interfaces were well packed, 2) interfaces spanned around 1500 Å², and 3) a large portion of the free energy of binding came from hydrophobic residues at the interaction interface. Once it became feasible to mutate protein sequences in an efficient manner, Bogan and Thorn analyzed mutagenesis data at protein interfaces [35]. They collected a database of interface alanine point mutations along with the change in binding free energy (ΔΔG) caused by each mutation. They observed that the majority of binding affinity was due to just a few amino acids, termed “hotspot” residues, while the identities of other residues at the interface had little effect on affinity. Later studies of experimental structures of protein interactions deposited in the PDB by Bullock et al.
[36] show that many hotspot residues in protein interactions are on an α helix at the interface. From these observations, one can begin to develop a rational approach to creating molecular antagonists of protein interactions that will inhibit their occurrence in the cell. A starting point for disrupting a protein interaction is to mimic the hotspot residues of the interface so that the target will recognize the mimic with high affinity and the mimic will outcompete the native partner.

∗ It should be noted that some Gene Ontology Molecular Function terms do in fact describe a protein’s function as interacting with another protein (e.g. GO:0005515, “Protein Binding”).

0.7 Peptidomimetics

Using this idea, a naïve approach might be to excise an α helix containing hotspot residues from one side of the interaction and use that as an inhibitor. This, in theory, will be recognized by the target protein, but in practice there is no guarantee the native α helical conformation will be a low energy conformation for the isolated peptide. Additionally, a short peptide the size of an α helix is subject to degradation by proteases in the cell. Peptidomimetics are a class of molecular scaffolds that address these concerns; they are stable in solution, resistant to proteolysis, and have functional groups that mimic the side chains on protein secondary structures. A subclass of peptidomimetics that has been used successfully to inhibit protein interactions is the helical mimetics. The Gellman lab has created a helical mimetic antagonist of the Bcl-2 family of anti-apoptotic proteins using alpha-beta peptides, which form stable helices [37]. An alternative method used by the Arora group is to nucleate a helix by converting a hydrogen bond between the i and i+4 residues to a covalent linker. Using this approach they have created an inhibitor of the Hif1α/p300 interaction, which is involved in the angiogenesis pathway [38].
One final example of a successful helical mimetic used for protein interaction inhibition is that of Ernst et al., who used the simple molecular scaffold of a terphenyl to disrupt oligomerization of gp41, a protein necessary for HIV entry into the host cell [38]. These successes show the promise of using peptidomimetics for inhibiting protein interactions.

0.8 Rosetta Design

Using peptidomimetics to mimic secondary structures at protein interfaces is a logical first step in developing interaction inhibitors. It is often the case, however, that this strategy alone will produce only a mediocre binder to the target protein, and further optimization is necessary to produce a better binder. The molecular modeling software Rosetta is well suited for this optimization. Rosetta has been shown to design novel protein folds [39], redesign protein interfaces [40] and enzymes [41], and more recently to design a protein binder to the hemagglutinin of influenza which disrupts its function [42]. The design algorithm iterates between a conformational optimization step and a sequence optimization step. During the conformational optimization step, random changes are made to the protein’s degrees of freedom (e.g. phi and psi angles) in an attempt to lower the score evaluated by Rosetta’s energy function. This step is generally the same as the one described above for de novo structure prediction. During the sequence optimization step, random substitutions are made to residues in the protein from a library of potential amino acids (generally the 20 canonical amino acids) in an attempt, again, to lower the Rosetta energy function score. Substitutions are accepted if they lower the score; otherwise they are accepted with a probability that decreases as the energy increase grows. Iterating between these steps often leads to a conformation and sequence that are lower in energy than the ones used to begin.
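The alternation between the two steps can be illustrated with a toy sketch. It is not Rosetta's design protocol: the `score` function (a burial term plus a torsion term), the simplified `HYDROPHOBIC` grouping, the fixed temperature, and every parameter value are assumptions made only so the loop is small and runnable.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids
HYDROPHOBIC = set("AFILMVWY")          # simplified grouping (an assumption)


def score(sequence, angles, buried):
    """Toy score (lower is better); a stand-in, not Rosetta's energy function."""
    # Penalize hydrophilic residues at buried positions and hydrophobic ones at exposed positions.
    seq_term = sum(
        0.0 if (aa in HYDROPHOBIC) == is_buried else 1.0
        for aa, is_buried in zip(sequence, buried)
    )
    # Penalize torsion angles that deviate from a hypothetical ideal of -60 degrees.
    conf_term = sum(1.0 - math.cos(math.radians(a + 60.0)) for a in angles)
    return seq_term + conf_term


def accept(delta, temperature, rng):
    """Accept improvements; accept increases with probability exp(-delta / T)."""
    return delta <= 0.0 or rng.random() < math.exp(-delta / temperature)


def design(buried, rounds=200, moves_per_step=20, temperature=0.1, seed=0):
    """Return (designed sequence, starting score, final score)."""
    rng = random.Random(seed)
    n = len(buried)
    sequence = [rng.choice(AMINO_ACIDS) for _ in range(n)]
    angles = [rng.uniform(-180.0, 180.0) for _ in range(n)]
    current = score(sequence, angles, buried)
    start = current
    for _ in range(rounds):
        # Conformational optimization step: perturb one torsion angle at a time.
        for _ in range(moves_per_step):
            i = rng.randrange(n)
            old_angle = angles[i]
            angles[i] = old_angle + rng.gauss(0.0, 15.0)
            trial = score(sequence, angles, buried)
            if accept(trial - current, temperature, rng):
                current = trial
            else:
                angles[i] = old_angle
        # Sequence optimization step: substitute one residue from the library.
        for _ in range(moves_per_step):
            i = rng.randrange(n)
            old_residue = sequence[i]
            sequence[i] = rng.choice(AMINO_ACIDS)
            trial = score(sequence, angles, buried)
            if accept(trial - current, temperature, rng):
                current = trial
            else:
                sequence[i] = old_residue
    return "".join(sequence), start, current
```

A call such as `design([True, False] * 5)` designs a ten-residue toy sequence with alternating buried and exposed positions. Swapping `AMINO_ACIDS` for a larger alphabet is the only change needed to widen the substitution library in this sketch.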
There is a limit, however, to the sequence diversity provided by a substitution library containing just the 20 canonical amino acids. Certain applications that have more flexibility in their chemical synthesis (e.g. solid phase synthesis instead of expressing recombinant DNA in a cell) have the advantage of creating molecules out of any arbitrary amino acid rather than just the canonical 20. Renfrew et al. [43] exploited this fact when they built a library of noncanonical amino acids within Rosetta and designed a binder to the calpain-1 protein. Synthesis of peptidomimetics often uses solid phase techniques, and peptidomimetics therefore make excellent design candidates for these noncanonical libraries. Many of the noncanonical amino acids in the library are analogues of one of the canonical amino acids, for instance a phenylalanine in which a hydrogen atom on the ring is replaced by a fluorine atom. These analogues provide additional amino acids for Rosetta to suggest as substitutions at hotspot residues in an attempt to increase the binding affinity of mediocre binders. Using a peptidomimetic scaffold in combination with Rosetta design and the chemically diverse noncanonical amino acids offers a powerful approach to developing high affinity protein interaction inhibitors.

0.9 The structure of this thesis

In this thesis, I describe advancements in two areas: protein structure/function prediction and the computational design of protein interaction inhibitors. During my graduate work I have also had excellent collaborations which were tangential to, or fall outside, the focus of my thesis, and which I will briefly describe here.

Several projects have grown out of the genome scale protein structure and function prediction pipeline project. First, I was involved in work in collaboration with Mike Boxem, which experimentally determined the minimal regions of protein domains required to interact with their partners.
Our database of protein structure annotations was important for the analysis of these regions, showing that they lie in folded, well-ordered regions of the proteins. This work was published in Cell [44]. Additionally, our database was used to help the Eichenberger lab at NYU define domains important in the spore coat formation of Bacillus subtilis. This work was published in Molecular Microbiology [45]. I collaborated with the Purugganan lab at NYU to use protein structure as a canvas to map sites of positive selection derived from evolutionary analysis of plant species. This work has been published in Genome Biology and Evolution, along with a web server that lets users view three dimensional structures highlighted at sites important to the evolution of the protein [46]. Another collaboration, with the Landthaler lab at MDC-Berlin, resulted in a publication in Molecular Cell and involved finding unexpectedly enriched superfamilies in a set of novel mRNA binding proteins for which we had structure predictions [47]. This paper also uses structure predictions to predict the function of these novel mRNA binding proteins. This extends the idea that structure predictions are a valuable predictor of protein function, something that I describe in the body of this thesis. Noah Youngs in our lab has applied this idea further to several genomes using sophisticated machine learning techniques, and preliminary work is currently in review. Finally, I have developed a web based interface to the protein structure annotations using protein-protein interactions as an entry point. This interface, BioNetBuilder, is a Cytoscape [48] plugin which builds protein interaction networks based on public interaction data for easy visual display and access to additional data such as structure and function annotations. This work was published in Bioinformatics [49], and a follow up paper describes a redesign of the software and the construction of the chicken interactome [50].
Collaborations relating to modeling of noncanonical backbones in Rosetta include participation in the CAPRI (Critical Assessment of PRediction of Interactions) challenge with the Gray lab at Johns Hopkins. This involved predicting the binding mode of an oligosaccharide to a protein and, although the challenge results have yet to be made public, the work has been submitted for presentation at the CAPRI meeting. Also in collaboration with the Gray lab at Johns Hopkins is the development of web server infrastructure for Rosetta applications, which will be submitted for publication soon.

I was also involved in a project to predict temperature sensitive mutations in proteins of interest using structure modeling by Rosetta as a predictor. This work was published in the 2011 Rosetta Special Collection in PLoS ONE [51].

0.9.1 Chapter description

In the first chapter of this thesis, I describe structure annotation of protein domains in over 100 genomes. Building on the 2000 CASP4 results by Bonneau et al. [30] and the utility of IBM’s World Community Grid, using protein structure predictions to annotate proteins of unknown function on the genome scale is now feasible. I have applied the Rosetta de novo structure prediction algorithm to protein domains, run on the grid. The resulting structure predictions were then compared to proteins with known structures to classify the subject protein domain into a SCOP superfamily, in which all the members are inferred to have a common evolutionary origin and therefore likely a similar function. Using a double blind benchmark to evaluate the error in our classifier, we correctly predict the SCOP superfamily nearly 50% of the time over the whole benchmark and reach nearly 80% for predictions deemed high confidence.
In the second chapter, I describe how SCOP superfamily classifications of protein domains were then combined with Gene Ontology (GO) cellular component and biological process annotations in a naïve Bayes framework for GO molecular function prediction. This function prediction method using structure information outperformed the method without structure information, suggesting structure is a valuable predictor of molecular function. The third chapter of this thesis describes the development of a framework within the Rosetta molecular modeling suite to model and rationally design helical mimetics. I focus on the peptidomimetic molecular scaffold oligooxopiperazine (OOP), which mimics one face of an alpha helix. In the fourth chapter, I describe how, within the Rosetta framework, I made several designs of OOPs targeting the p53-MDM2 protein interaction, a relevant cancer drug target. These designs have been synthesized and experimentally validated to target MDM2 and competitively inhibit the p53-MDM2 protein interaction. Finally, I discuss in the fifth chapter future perspectives of my work.

Bibliography

[1] Erwin Schrödinger. What is life?: The physical aspect of the living cell. The University Press, Cambridge, 1944.
[2] G W Beadle and E L Tatum. Genetic control of biochemical reactions in Neurospora. Proc Natl Acad Sci U S A, 27(11):499–506, Nov 1941.
[3] O T Avery, C M Macleod, and M McCarty. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: Induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. J Exp Med, 79(2):137–58, Feb 1944.
[4] A D Hershey and M Chase. Independent functions of viral protein and nucleic acid in growth of bacteriophage. J Gen Physiol, 36(1):39–56, May 1952.
[5] J D Watson and F H Crick. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356):737–8, Apr 1953.
[6] F H Crick. On protein synthesis.
Symp Soc Exp Biol, 12:138–63, 1958.
[7] F H Crick, L Barnett, S Brenner, and R J Watts-Tobin. General nature of the genetic code for proteins. Nature, 192:1227–32, Dec 1961.
[8] P Lengyel, J F Speyer, and S Ochoa. Synthetic polynucleotides and the amino acid code. Proc Natl Acad Sci U S A, 47:1936–42, Dec 1961.
[9] M Nirenberg, P Leder, M Bernfield, R Brimacombe, J Trupin, F Rottman, and C O’Neal. RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc Natl Acad Sci U S A, 53(5):1161–8, May 1965.
[10] L Pauling and H A Itano. Sickle cell anemia, a molecular disease. Science, 109(2835):443, Apr 1949.
[11] F Sanger, G M Air, B G Barrell, N L Brown, A R Coulson, C A Fiddes, C A Hutchison, P M Slocombe, and M Smith. Nucleotide sequence of bacteriophage phi X174 DNA. Nature, 265(5596):687–95, Feb 1977.
[12] E S Lander, L M Linton, B Birren, C Nusbaum, M C Zody, J Baldwin, K Devon, K Dewar, M Doyle, W FitzHugh, R Funke, D Gage, K Harris, A Heaford, J Howland, L Kann, J Lehoczky, R LeVine, P McEwan, K McKernan, J Meldrim, J P Mesirov, C Miranda, W Morris, J Naylor, C Raymond, M Rosetti, R Santos, A Sheridan, C Sougnez, N Stange-Thomann, N Stojanovic, A Subramanian, D Wyman, J Rogers, J Sulston, R Ainscough, S Beck, D Bentley, J Burton, C Clee, N Carter, A Coulson, R Deadman, P Deloukas, A Dunham, I Dunham, R Durbin, L French, D Grafham, S Gregory, T Hubbard, S Humphray, A Hunt, M Jones, C Lloyd, A McMurray, L Matthews, S Mercer, S Milne, J C Mullikin, A Mungall, R Plumb, M Ross, R Shownkeen, S Sims, R H Waterston, R K Wilson, L W Hillier, J D McPherson, M A Marra, E R Mardis, L A Fulton, A T Chinwalla, K H Pepin, W R Gish, S L Chissoe, M C Wendl, K D Delehaunty, T L Miner, A Delehaunty, J B Kramer, L L Cook, R S Fulton, D L Johnson, P J Minx, S W Clifton, T Hawkins, E Branscomb, P Predki, P Richardson, S Wenning, T Slezak, N Doggett, J F Cheng, A Olsen, S Lucas, C Elkin, E Uberbacher, M Frazier, R A Gibbs, D M Muzny, S E Scherer, J B
Bouck, E J Sodergren, K C Worley, C M Rives, J H Gorrell, M L Metzker, S L Naylor, R S Kucherlapati, D L Nelson, G M Weinstock, Y Sakaki, A Fujiyama, M Hattori, T Yada, A Toyoda, T Itoh, C Kawagoe, H Watanabe, Y Totoki, T Taylor, J Weissenbach, R Heilig, W Saurin, F Artiguenave, P Brottier, T Bruls, E Pelletier, C Robert, P Wincker, D R Smith, L Doucette-Stamm, M Rubenfield, K Weinstock, H M Lee, J Dubois, A Rosenthal, M Platzer, G Nyakatura, S Taudien, A Rump, H Yang, J Yu, J Wang, G Huang, J Gu, L Hood, L Rowen, A Madan, S Qin, R W Davis, N A Federspiel, A P Abola, M J Proctor, R M Myers, J Schmutz, M Dickson, J Grimwood, D R Cox, M V Olson, R Kaul, C Raymond, N Shimizu, K Kawasaki, S Minoshima, G A Evans, M Athanasiou, R Schultz, B A Roe, F Chen, H Pan, J Ramser, H Lehrach, R Reinhardt, W R McCombie, M de la Bastide, N Dedhia, H Blöcker, K Hornischer, G Nordsiek, R Agarwala, L Aravind, J A Bailey, A Bateman, S Batzoglou, E Birney, P Bork, D G Brown, C B Burge, L Cerutti, H C Chen, D Church, M Clamp, R R Copley, T Doerks, S R Eddy, E E Eichler, T S Furey, J Galagan, J G Gilbert, C Harmon, Y Hayashizaki, D Haussler, H Hermjakob, K Hokamp, W Jang, L S Johnson, T A Jones, S Kasif, A Kaspryzk, S Kennedy, W J Kent, P Kitts, E V Koonin, I Korf, D Kulp, D Lancet, T M Lowe, A McLysaght, T Mikkelsen, J V Moran, N Mulder, V J Pollara, C P Ponting, G Schuler, J Schultz, G Slater, A F Smit, E Stupka, J Szustakowski, D Thierry-Mieg, J Thierry-Mieg, L Wagner, J Wallis, R Wheeler, A Williams, Y I Wolf, K H Wolfe, S P Yang, R F Yeh, F Collins, M S Guyer, J Peterson, A Felsenfeld, K A Wetterstrand, A Patrinos, M J Morgan, P de Jong, J J Catanese, K Osoegawa, H Shizuya, S Choi, Y J Chen, J Szustakowki, and International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, Feb 2001.
[13] J C Venter, M D Adams, E W Myers, P W Li, R J Mural, G G Sutton, H O Smith, M Yandell, C A Evans, R A Holt, J D Gocayne, P Amanatides, R M Ballew, D H Huson, J R Wortman, Q Zhang, C D Kodira, X H Zheng, L Chen, M Skupski, G Subramanian, P D Thomas, J Zhang, G L Gabor Miklos, C Nelson, S Broder, A G Clark, J Nadeau, V A McKusick, N Zinder, A J Levine, R J Roberts, M Simon, C Slayman, M Hunkapiller, R Bolanos, A Delcher, I Dew, D Fasulo, M Flanigan, L Florea, A Halpern, S Hannenhalli, S Kravitz, S Levy, C Mobarry, K Reinert, K Remington, J Abu-Threideh, E Beasley, K Biddick, V Bonazzi, R Brandon, M Cargill, I Chandramouliswaran, R Charlab, K Chaturvedi, Z Deng, V Di Francesco, P Dunn, K Eilbeck, C Evangelista, A E Gabrielian, W Gan, W Ge, F Gong, Z Gu, P Guan, T J Heiman, M E Higgins, R R Ji, Z Ke, K A Ketchum, Z Lai, Y Lei, Z Li, J Li, Y Liang, X Lin, F Lu, G V Merkulov, N Milshina, H M Moore, A K Naik, V A Narayan, B Neelam, D Nusskern, D B Rusch, S Salzberg, W Shao, B Shue, J Sun, Z Wang, A Wang, X Wang, J Wang, M Wei, R Wides, C Xiao, C Yan, A Yao, J Ye, M Zhan, W Zhang, H Zhang, Q Zhao, L Zheng, F Zhong, W Zhong, S Zhu, S Zhao, D Gilbert, S Baumhueter, G Spier, C Carter, A Cravchik, T Woodage, F Ali, H An, A Awe, D Baldwin, H Baden, M Barnstead, I Barrow, K Beeson, D Busam, A Carver, A Center, M L Cheng, L Curry, S Danaher, L Davenport, R Desilets, S Dietz, K Dodson, L Doup, S Ferriera, N Garg, A Gluecksmann, B Hart, J Haynes, C Haynes, C Heiner, S Hladun, D Hostin, J Houck, T Howland, C Ibegwam, J Johnson, F Kalush, L Kline, S Koduru, A Love, F Mann, D May, S McCawley, T McIntosh, I McMullen, M Moy, L Moy, B Murphy, K Nelson, C Pfannkoch, E Pratts, V Puri, H Qureshi, M Reardon, R Rodriguez, Y H Rogers, D Romblad, B Ruhfel, R Scott, C Sitter, M Smallwood, E Stewart, R Strong, E Suh, R Thomas, N N Tint, S Tse, C Vech, G Wang, J Wetter, S Williams, M Williams, S Windsor, E Winn-Deen, K Wolfe, J Zaveri, K Zaveri, J F Abril, R Guigó, M J Campbell, K V
Sjolander, B Karlak, A Kejariwal, H Mi, B Lazareva, T Hatton, A Narechania, K Diemer, A Muruganujan, N Guo, S Sato, V Bafna, S Istrail, R Lippert, R Schwartz, B Walenz, S Yooseph, D Allen, A Basu, J Baxendale, L Blick, M Caminha, J Carnes-Stine, P Caulk, Y H Chiang, M Coyne, C Dahlke, A Mays, M Dombroski, M Donnelly, D Ely, S Esparham, C Fosler, H Gire, S Glanowski, K Glasser, A Glodek, M Gorokhov, K Graham, B Gropman, M Harris, J Heil, S Henderson, J Hoover, D Jennings, C Jordan, J Jordan, J Kasha, L Kagan, C Kraft, A Levitsky, M Lewis, X Liu, J Lopez, D Ma, W Majoros, J McDaniel, S Murphy, M Newman, T Nguyen, N Nguyen, M Nodell, S Pan, J Peck, M Peterson, W Rowe, R Sanders, J Scott, M Simpson, T Smith, A Sprague, T Stockwell, R Turner, E Venter, M Wang, M Wen, D Wu, M Wu, A Xia, A Zandieh, and X Zhu. The sequence of the human genome. Science, 291(5507):1304–51, Feb 2001.
[14] F R Blattner, G Plunkett, 3rd, C A Bloch, N T Perna, V Burland, M Riley, J Collado-Vides, J D Glasner, C K Rode, G F Mayhew, J Gregor, N W Davis, H A Kirkpatrick, M A Goeden, D J Rose, B Mau, and Y Shao. The complete genome sequence of Escherichia coli K-12. Science, 277(5331):1453–62, Sep 1997.
[15] A Goffeau, B G Barrell, H Bussey, R W Davis, B Dujon, H Feldmann, F Galibert, J D Hoheisel, C Jacq, M Johnston, E J Louis, H W Mewes, Y Murakami, P Philippsen, H Tettelin, and S G Oliver. Life with 6000 genes. Science, 274(5287):546, 563–7, Oct 1996.
[16] C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282(5396):2012–8, Dec 1998.
[17] M D Adams, S E Celniker, R A Holt, C A Evans, J D Gocayne, P G Amanatides, S E Scherer, P W Li, R A Hoskins, R F Galle, R A George, S E Lewis, S Richards, M Ashburner, S N Henderson, G G Sutton, J R Wortman, M D Yandell, Q Zhang, L X Chen, R C Brandon, Y H Rogers, R G Blazej, M Champe, B D Pfeiffer, K H Wan, C Doyle, E G Baxter, G Helt, C R Nelson, G L Gabor, J F Abril, A Agbayani, H J An, C Andrews-Pfannkoch, D Baldwin, R M Ballew, A Basu, J Baxendale, L Bayraktaroglu, E M Beasley, K Y Beeson, P V Benos, B P Berman, D Bhandari, S Bolshakov, D Borkova, M R Botchan, J Bouck, P Brokstein, P Brottier, K C Burtis, D A Busam, H Butler, E Cadieu, A Center, I Chandra, J M Cherry, S Cawley, C Dahlke, L B Davenport, P Davies, B de Pablos, A Delcher, Z Deng, A D Mays, I Dew, S M Dietz, K Dodson, L E Doup, M Downes, S Dugan-Rocha, B C Dunkov, P Dunn, K J Durbin, C C Evangelista, C Ferraz, S Ferriera, W Fleischmann, C Fosler, A E Gabrielian, N S Garg, W M Gelbart, K Glasser, A Glodek, F Gong, J H Gorrell, Z Gu, P Guan, M Harris, N L Harris, D Harvey, T J Heiman, J R Hernandez, J Houck, D Hostin, K A Houston, T J Howland, M H Wei, C Ibegwam, M Jalali, F Kalush, G H Karpen, Z Ke, J A Kennison, K A Ketchum, B E Kimmel, C D Kodira, C Kraft, S Kravitz, D Kulp, Z Lai, P Lasko, Y Lei, A A Levitsky, J Li, Z Li, Y Liang, X Lin, X Liu, B Mattei, T C McIntosh, M P McLeod, D McPherson, G Merkulov, N V Milshina, C Mobarry, J Morris, A Moshrefi, S M Mount, M Moy, B Murphy, L Murphy, D M Muzny, D L Nelson, D R Nelson, K A Nelson, K Nixon, D R Nusskern, J M Pacleb, M Palazzolo, G S Pittman, S Pan, J Pollard, V Puri, M G Reese, K Reinert, K Remington, R D Saunders, F Scheeler, H Shen, B C Shue, I Sidén-Kiamos, M Simpson, M P Skupski, T Smith, E Spier, A C Spradling, M Stapleton, R Strong, E Sun, R Svirskas, C Tector, R Turner, E Venter, A H Wang, X Wang, Z Y Wang, D A Wassarman, G M Weinstock, J Weissenbach, S M Williams, T Woodage, K C Worley, D Wu, S Yang, Q A Yao, J Ye, R F Yeh, J S
Zaveri, M Zhan, G Zhang, Q Zhao, L Zheng, X H Zheng, F N Zhong, W Zhong, X Zhou, S Zhu, X Zhu, H O Smith, R A Gibbs, E W Myers, G M Rubin, and J C Venter. The genome sequence of Drosophila melanogaster. Science, 287(5461):2185–95, Mar 2000.
[18] Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408(6814):796–815, Dec 2000.
[19] M Sela, F H White, Jr, and C B Anfinsen. Reductive cleavage of disulfide bridges in ribonuclease. Science, 125(3250):691–2, Apr 1957.
[20] C B Anfinsen and E Haber. Studies on the reduction and re-formation of protein disulfide bonds. J Biol Chem, 236:1361–3, May 1961.
[21] J C Kendrew, G Bodo, H M Dintzis, R G Parrish, H Wyckoff, and D C Phillips. A three-dimensional model of the myoglobin molecule obtained by X-ray analysis. Nature, 181(4610):662–6, Mar 1958.
[22] M F Perutz, M G Rossmann, A F Cullis, H Muirhead, G Will, and A C North. Structure of haemoglobin: a three-dimensional Fourier synthesis at 5.5-Å resolution, obtained by X-ray analysis. Nature, 185(4711):416–22, Feb 1960.
[23] M F Perutz, J C Kendrew, and H C Watson. Structure and function of haemoglobin: II. Some relations between polypeptide chain configuration and amino acid sequence. Journal of Molecular Biology, 13(3):669–678, October 1965.
[24] C Chothia and A M Lesk. The relation between the divergence of sequence and structure in proteins. EMBO J, 5(4):823–6, Apr 1986.
[25] C Sander and R Schneider. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9(1):56–68, 1991.
[26] B Rost. Twilight zone of protein sequence alignments. Protein Engineering, 12(2):85–94, February 1999.
[27] Cyrus Levinthal. How to fold graciously. In Mössbauer Spectroscopy in Biological Systems Proceedings, number 41 in 67, pages 22–24. Univ. of Illinois Bulletin, 1969.
[28] W Kauzmann. Some factors in the interpretation of protein denaturation. Adv Protein Chem, 14:1–63, 1959.
[29] Carol A Rohl, Charlie E M Strauss, Kira M S Misura, and David Baker. Protein structure prediction using Rosetta. Methods Enzymol, 383:66–93, 2004.
[30] R Bonneau, J Tsai, I Ruczinski, D Chivian, C Rohl, C E Strauss, and D Baker. Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins, Suppl 5:119–26, 2001.
[31] A M Lesk, L Lo Conte, and T J Hubbard. Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, and interresidue contacts. Proteins, Suppl 5:98–118, 2001.
[32] R Bonneau, J Tsai, I Ruczinski, and D Baker. Functional inferences from blind ab initio protein structure predictions. J Struct Biol, 134(2-3):186–90, 2001.
[33] Michelle R Arkin and James A Wells. Small-molecule inhibitors of protein-protein interactions: progressing towards the dream. Nat Rev Drug Discov, 3(4):301–17, Apr 2004.
[34] C Chothia and J Janin. Principles of protein-protein recognition. Nature, 256(5520):705–8, Aug 1975.
[35] A A Bogan and K S Thorn. Anatomy of hot spots in protein interfaces. J Mol Biol, 280(1):1–9, Jul 1998.
[36] Brooke N Bullock, Andrea L Jochim, and Paramjit S Arora. Assessing helical protein interfaces for inhibitor design. J Am Chem Soc, 133(36):14220–3, Sep 2011.
[37] Melissa D Boersma, Holly S Haase, Kimberly J Peterson-Kaufman, Erinna F Lee, Oliver B Clarke, Peter M Colman, Brian J Smith, W Seth Horne, W Douglas Fairlie, and Samuel H Gellman. Evaluation of diverse α/β-backbone patterns for functional α-helix mimicry: analogues of the Bim BH3 domain. J Am Chem Soc, 134(1):315–23, Jan 2012.
[38] Laura K Henchey, Swati Kushal, Ramin Dubey, Ross N Chapman, Bogdan Z Olenyuk, and Paramjit S Arora. Inhibition of hypoxia inducible factor 1-transcription coactivator interaction by a hydrogen bond surrogate alpha-helix. J Am Chem Soc, 132(3):941–3, Jan 2010.
[39] Brian Kuhlman, Gautam Dantas, Gregory C Ireton, Gabriele Varani, Barry L Stoddard, and David Baker.
Design of a novel globular protein fold with atomic-level accuracy. Science, 302(5649):1364–8, Nov 2003.
[40] Tanja Kortemme, Lukasz A Joachimiak, Alex N Bullock, Aaron D Schuler, Barry L Stoddard, and David Baker. Computational redesign of protein-protein interaction specificity. Nat Struct Mol Biol, 11(4):371–9, Apr 2004.
[41] Lin Jiang, Eric A Althoff, Fernando R Clemente, Lindsey Doyle, Daniela Röthlisberger, Alexandre Zanghellini, Jasmine L Gallaher, Jamie L Betker, Fujie Tanaka, Carlos F Barbas, 3rd, Donald Hilvert, Kendall N Houk, Barry L Stoddard, and David Baker. De novo computational design of retro-aldol enzymes. Science, 319(5868):1387–91, Mar 2008.
[42] Sarel J Fleishman, Timothy A Whitehead, Damian C Ekiert, Cyrille Dreyfus, Jacob E Corn, Eva-Maria Strauch, Ian A Wilson, and David Baker. Computational design of proteins targeting the conserved stem region of influenza hemagglutinin. Science, 332(6031):816–21, May 2011.
[43] P Douglas Renfrew, Eun Jung Choi, Richard Bonneau, and Brian Kuhlman. Incorporation of noncanonical amino acids into Rosetta and use in computational protein-peptide interface design. PLoS One, 7(3):e32637, 2012.
[44] Mike Boxem, Zoltan Maliga, Niels Klitgord, Na Li, Irma Lemmens, Miyeko Mana, Lorenzo de Lichtervelde, Joram D Mul, Diederik van de Peut, Maxime Devos, Nicolas Simonis, Muhammed A Yildirim, Murat Cokol, Huey-Ling Kao, Anne-Sophie de Smet, Haidong Wang, Anne-Lore Schlaitz, Tong Hao, Stuart Milstein, Changyu Fan, Mike Tipsword, Kevin Drew, Matilde Galli, Kahn Rhrissorrakrai, David Drechsel, Daphne Koller, Frederick P Roth, Lilia M Iakoucheva, A Keith Dunker, Richard Bonneau, Kristin C Gunsalus, David E Hill, Fabio Piano, Jan Tavernier, Sander van den Heuvel, Anthony A Hyman, and Marc Vidal. A protein domain-based interactome network for C. elegans early embryogenesis. Cell, 134(3):534–45, Aug 2008.
[45] Katherine H Wang, Anabela L Isidro, Lia Domingues, Haig A Eskandarian, Peter T McKenney, Kevin Drew, Paul Grabowski, Ming-Hsiu Chua, Samantha N Barry, Michelle Guan, Richard Bonneau, Adriano O Henriques, and Patrick Eichenberger. The coat morphogenetic protein SpoVID is necessary for spore encasement in Bacillus subtilis. Mol Microbiol, 74(3):634–49, Nov 2009.
[46] M M Pentony, P Winters, D Penfold-Brown, K Drew, A Narechania, R DeSalle, R Bonneau, and M D Purugganan. The plant proteome folding project: structure and positive selection in plant protein families. Genome Biol Evol, 4(3):360–71, 2012.
[47] Alexander G Baltz, Mathias Munschauer, Björn Schwanhäusser, Alexandra Vasile, Yasuhiro Murakawa, Markus Schueler, Noah Youngs, Duncan Penfold-Brown, Kevin Drew, Miha Milek, Emanuel Wyler, Richard Bonneau, Matthias Selbach, Christoph Dieterich, and Markus Landthaler. The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol Cell, 46(5):674–90, Jun 2012.
[48] Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13(11):2498–504, Nov 2003.
[49] I Avila-Campillo, K Drew, J Lin, D J Reiss, and R Bonneau. BioNetBuilder: automatic integration of biological networks. Bioinformatics, 23:392–393, Feb 2007.
[50] Jay H Konieczka, Kevin Drew, Alex Pine, Kevin Belasco, Sean Davey, Tatiana A Yatskievych, Richard Bonneau, and Parker B Antin. BioNetBuilder2.0: bringing systems biology to chicken and other model organisms. BMC Genomics, 10 Suppl 2:S6, 2009.
[51] Christopher S Poultney, Glenn L Butterfoss, Michelle R Gutwein, Kevin Drew, David Gresham, Kristin C Gunsalus, Dennis E Shasha, and Richard Bonneau. Rational design of temperature-sensitive alleles using computational structure prediction. PLoS One, 6(9):e23947, 2011.
[52] Kevin Drew, Patrick Winters, Glenn L Butterfoss, Viktors Berstis, Keith Uplinger, Jonathan Armstrong, Michael Riffle, Erik Schweighofer, Bill Bovermann, David R Goodlett, Trisha N Davis, Dennis Shasha, Lars Malmström, and Richard Bonneau. The proteome folding project: proteome-scale prediction of structure and function. Genome Res, 21(11):1981–94, Nov 2011.
[53] D T Jones and J J Ward. Prediction of disordered regions in proteins from position specific score matrices. Proteins-Structure Function and Bioinformatics, 53(6):573–578, 2003.
[54] D T Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195–202, September 1999.