Asthma - copsac
Translating inter-individual genetic variation to biological function in complex phenotypes

Rachita Yadav
14th April, 2014

"A grain in the balance will determine which individual shall live and which shall die - which variety or species shall increase in number, and which shall decrease, or finally become extinct."
Charles Darwin, The Origin of Species

Preface

This thesis was prepared at the Center for Biological Sequence Analysis (CBS), Department of Systems Biology, at the Technical University of Denmark (DTU), under the supervision of Associate Professor Ramneek Gupta. This thesis is a partial fulfilment of the requirements for acquiring the Ph.D. degree. The Ph.D. was funded by the Danish Council for Strategic Research and DTU.

This thesis is based on work carried out at CBS in collaboration with Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), The Faculty of Health Sciences, University of Copenhagen; Sino-Danish Breast Cancer Research Centre at the Faculty of Life Sciences, University of Copenhagen; Department of Biology, University of Copenhagen; and UCSF Diabetes Center and Department of Cell and Tissue Biology, University of California, San Francisco.

This thesis presents five main projects and one auxiliary project based on the common theme of understanding variations in biological data. Due to the lack of space in the co-author statements, my contributions to the multidisciplinary projects are further explained in the introductions to chapter 2 and chapter 8.

Contents

Preface
Contents
Abstract
Dansk resumé
Acknowledgements
Papers included in the thesis
Papers not included in the thesis
Abbreviations
I Introduction
1 Tools, Techniques and Data Analysis
1.1 Genomics
    Past, Present and Future
    The Pioneer: Microarrays
    The Exciting Present: Next Generation Sequencing
    The Promising Future
    Applications of Sequencing
1.2 Processing of Sequencing Data
1.3 Genome Variation Analysis
    Variant Calling
    Genome Wide Association Study
    Imputation
    Targeted Sequencing
1.4 Gene Expression Profiling
    Microarray Based Expression Profiling
    Sequencing Based Expression Profiling
1.5 Epigenetics
1.6 Proteomics
1.7 Machine Learning
1.8 Translating High Throughput Variation Data to Function
1.9 Effects of Genomic Variations
    Enrichment Analysis
    Pathway Analysis
    Integrative Analysis
    Pathway Based Prediction Tool
    Challenges of Next Generation Sequencing
    Complex Phenotypes
II Childhood Asthma
Asthma Aetiology
2 Paper I - Genome-wide association analysis of childhood asthma
3 Childhood asthma candidate gene study
4 Paper II - Machine learning based prediction of childhood asthma
III Obesity
Obesity Aetiology
5 Paper III - Brown to white adipose tissue transition
6 Paper IV - Epigenetic changes in obesity
IV Genotype to Phenotype
7 Discovering phenotypes
7.1 Danish Pan-genome
7.2 Ancient Genome
V Epilogue
Summary and perspectives
VI Appendix
8 Paper V - Role of TIMP-1 in chemotherapy resistant breast cancer
Bibliography

Abstract

The key objectives of this thesis work are to decipher and prioritise observed variations among different phenotypes. With advancements in high throughput technology leading to a surge in biological data, it is imperative to analyse and interpret this information. Consequently, this thesis work examines epigenetic, genetic, transcriptomic and proteomic variations within different multifactorial diseases, and this pivotal information is then annotated and associated with its corresponding phenotype. Childhood asthma and obesity are the two main phenotypic themes in this thesis.

In the first section, chapter 1 provides an introduction to the various methodologies utilised in this thesis work. Subsequently, chapters 2, 3 and 4 in the second section address finding causal variations in childhood asthma. Chapter 2 focuses on a genome wide association study (GWAS) performed on an asthma exacerbation case cohort. This study reports a new susceptibility locus within the gene CDHR3 for the exacerbation phenotype of childhood asthma. Chapter 3 of the thesis presents a pilot study, which aims at designing a candidate gene panel for childhood asthma to identify the causal variants from known asthma genes.
Chapter 4 describes an artificial neural network (ANN) based methodology for selecting genetic and clinical features with predictive power for childhood asthma. The goal of these studies is to understand the complex genetics of childhood asthma.

The third part of this thesis (chapters 5 and 6) focuses on various mechanisms involved in adipose depots; adipose is a major tissue implicated in obesity. Chapter 5 sheds light on different mechanisms that result in the replacement of metabolically efficient brown fat with storage-type white fat in large mammals (including humans), especially within the first few months following birth. The project work discussed in chapter 6 aims to understand the underlying differences in obesity responses in fat cells from different white adipose tissue depots under diet-induced and genetic obesity by decoding the global epigenetic modifications.

The fourth section of this thesis work (chapter 7) comprises two studies that are aimed at genotype to phenotype mapping. The first section of chapter 7 details the use of variations from the Danish pan-genome pilot project to comprehend the common phenotypes of the population and to attempt to establish its kinship with European populations. The second portion of this chapter describes a personalised genome study of an ancient genome, which was conducted by calculating genetic risk scores to unravel phenotypes.

The appendix (chapter 8) comprises an integrative functional analysis study of the changing proteome and phospho-proteome in chemotherapy resistant breast cancer cell lines with high TIMP-1 gene expression.

In summary, this thesis work demonstrates applications of various "omic" variations at different levels of complexity and their integration using systems biology based methodologies to associate them with multifactorial phenotypes.
These studies help in revealing pivotal mechanistic details concerning the phenotypes, which can be further utilised in drug design and disease management.

Dansk resumé (Danish summary)

The main aim of this thesis is to decode and prioritise the observed variations among different phenotypes. The considerable advances in high-throughput technologies in recent years have led to an explosion in the amount of biological data generated from many different sources. To be able to decode the biological phenotype from the molecular data, it is important to combine data from different sources in the analyses. This thesis therefore deals with how genetic, epigenetic, transcriptomic and proteomic variations affect multifactorial diseases. These variations are annotated and associated with different biological phenotypes. The thesis focuses primarily on the phenotypes childhood asthma and obesity.

Chapter I of the thesis gives a general introduction to the different methods used in this thesis. Chapters II-IV, in the second part of the thesis, deal with the identification of causal variations in childhood asthma. Chapter II focuses on analyses of genome-wide association study (GWAS) data from a cohort of children with exacerbated asthma. This study identified a new high-risk locus located in the gene CDHR3, which increased the risk of exacerbated asthma. Chapter III presents a pilot study that aims to design a panel of genes that can identify the causal variants among genes known to cause asthma. Chapter IV describes an artificial neural network (ANN) based method for selecting genetic and clinical factors that can predict the disease course of childhood asthma. These studies are designed to increase the understanding of the mechanisms behind the disease course of childhood asthma, which can lead to improved prognostic tools as well as better treatment of the disease.

The third part of the thesis comprises chapters V-VI, which focus on the different mechanisms involved in the distribution of fat depots, which have a large influence on overweight. Chapter V sheds light on how different mechanisms in larger mammals, including humans, can result in the metabolically efficient brown fat being replaced by white storage fat, especially within the first few months after birth. Chapter VI deals with the underlying differences in the obesity response in fat cells from different white fat depots, in both diet-related and genetic obesity, through decoding of global epigenetic changes.

The fourth part of the thesis (chapter VII) consists of two studies aimed at genotype-to-phenotype mapping. The first part of chapter VII describes a personalised study of an ancient genome, carried out by calculating the genetic risk score. The second part of this chapter details how variations mined from the Danish pan-genome project can be used to understand the common phenotypes in the population and to examine how the Danish population is related to the populations of other European nations.

The appendix, described in chapter VIII, is an integrated analysis of how the proteome and phospho-proteome change in chemotherapy-resistant breast cancer cell lines with high expression of the TIMP-1 gene.

This thesis describes different methods for working with "omics" data on a large scale and at different degrees of complexity, and how the different data types can be integrated using systems biology methods to associate them with multifactorial phenotypes. These studies help to reveal central mechanisms that are important for the development or progression of different disease phenotypes, which can be of great importance in the future development of new types of medicine as well as in general disease treatment.
Acknowledgements

I take the opportunity to express my sincere thanks to all the people who have directly or indirectly inspired and helped me during my PhD. I would like to express my gratitude to my supervisor Ramneek Gupta for being supportive and encouraging, and for giving me freedom of thought and work. It has been a great learning journey.

I have been very fortunate to collaborate with many different groups, namely the Copenhagen Prospective Studies on Asthma in Childhood, Sino-Danish Breast Cancer Research and the Department of Biology, University of Copenhagen. The work presented in this thesis was possible because of your expertise in the field and your critical assessments. I would like to express my special gratitude towards Hans Bisgaard, Klaus Bønnelykke, Eskil Kreiner-Møller, Karsten Kristiansen, Jacob B. Hansen and Si Brask Sonne. It has been an extreme pleasure to work with all of you.

It has been a pleasure to be surrounded by many helpful people from CBS who always engaged in scientific discussions and provided me with many helpful insights and guidance. A special thanks to Thomas Nordahl Petersen, Thomas Sicheritz-Ponten, Simon Rasmussen, Aron Eklund and Nicolai Juul Birkbak. A special thanks to DTU Multi-Assay Core (DMAC) and especially to Marlene Damsgaard for all the experimental work, often on tight schedules. The CBS system administration team has always been forthcoming with technical support. Thanks to John Damm Sørensen, Peter Wad Sackett, Kristoffer Rapacki and Hans Henrik Stærfeldt. The CBS administration never hesitated in helping with any official work. Thank you for your help - Lone Boesen, Dorthe Kjœrsgaard, Martin Lund, Marlene Beck, Annette Vibeke Uldall and Karina Sreseli.

Special thanks to the members and guest members of the Functional Human Variation group. I have enjoyed all our scientific discussions as well as the teambuilding events.
I would also like to thank all the people who gave invaluable comments on my thesis or its parts, especially Tammi Vest, Kirstine Belling and Henrik M. Geertz-Hansen. It has been a pleasure to share the office space with Dave Ussery and his group. I would like to thank my PhD colleagues, particularly Asli, Kalliopi, Bent, Ali, Ida, Agata and Dhany, and my late-lunch companions Arcadio, David, Khoa and Grace for all the laughs and gossip. Thanks to all other former and present colleagues for contributing to the friendly working environment and great parties.

Finally, I would like to thank all my friends, especially a few old ones, Bhanu and Rounak, for keeping me company though miles apart. This thesis would not have been possible without the support from my mamma and daddy, who always had faith in me and supported me. Special thanks to Mohita, who is best at the art of infusing positive enthusiasm during difficult times. A special thanks to the special person of my life, Sudhir, for all the support, encouragement and patience, and also for copy-editing the thesis.

Papers included in the thesis

• Klaus Bønnelykke∗, Patrick Sleiman∗, Kasper Nielsen∗, Eskil Kreiner-Møller, Josep M Mercader, Danielle Belgrave, Herman T den Dekker, Anders Husby, Astrid Sevelsted, Grissel Faura-Tellez, Li Juel Mortensen, Lavinia Paternoster, Richard Flaaten, Anne Mølgaard, David E Smart, Philip F Thomsen, Morten A Rasmussen, Silvia Bonàs-Guarch, Claus Holst, Ellen A Nohr, Rachita Yadav, Michael E March, Thomas Blicher, Peter M Lackie, Vincent W V Jaddoe, Angela Simpson, John W Holloway, Liesbeth Duijts, Adnan Custovic, Donna E Davies, David Torrents, Ramneek Gupta, Mads V Hollegaard, David M Hougaard, Hakon Hakonarson, Hans Bisgaard. A genome-wide association study identifies CDHR3 as a susceptibility locus for early childhood asthma with severe exacerbations. Nat Genet. 2014 Jan;46(1):51-5.
• Rachita Yadav, Thomas Nordahl Petersen, Eskil Kreiner-Møller, Hans Bisgaard, Klaus Bønnelykke, Ramneek Gupta.
Ranking genetic and clinical features for prediction of asthma at age 7. Manuscript in preparation.
• Astrid L. Basse∗, Karen Dixen∗, Rachita Yadav∗, Malin P. Tygesen, Klaus Qvortrup, Karsten Kristiansen, Bjørn Quistorff, Ramneek Gupta, Jun Wang, Jacob B. Hansen. Global gene expression profiling of brown to white adipose tissue transformation in sheep reveals novel transcriptional components linked to adipose remodeling. Manuscript submitted.
• Rachita Yadav, Si Brask Sonne, Yin Guangliang, Ramneek Gupta, Jun Wang, Karsten Kristiansen, Shingo Kajimura. Adipose-depot specific gene regulation by DNA-methylation in obesity. Manuscript in preparation.
• Omid Hekmat∗, Stephanie Munk∗, Louise Fogh∗, Rachita Yadav, Chiara Francavilla, Heiko Horn, Sidse Ørnbjerg Würtz, Anne-Sofie Schrohl, Britt Damsgaard, Maria Unni Rømer, Kirstine C. Belling, Niels Frank Jensen, Irina Gromova, Dorte B. Bekker-Jensen, José M. Moreira, Lars J. Jensen, Ramneek Gupta, Ulrik Lademann, Nils Brünner, Jesper V. Olsen, Jan Stenvang. TIMP-1 Increases Expression and Phosphorylation of Proteins Associated with Drug Resistance in Breast Cancer Cells. J. Proteome Res., 2013, 12 (9), pp 4136-4151.

∗ These authors contributed equally.

Papers not included in the thesis

• Christina Bjerre, Lena Vinther, Kirstine C. Belling, Sidse Ø. Würtz, Rachita Yadav, Ulrik Lademann, Olga Rigina, Khoa Nguyen Do, Henrik J. Ditzel, Anne E. Lykkesfeldt, Jun Wang, Henrik Bjørn Nielsen, Nils Brünner, Ramneek Gupta, Anne-Sofie Schrohl, Jan Stenvang. TIMP1 overexpression mediates resistance of MCF-7 human breast cancer cells to fulvestrant and down-regulates progesterone receptor expression. Tumor Biology, December 2013, Volume 34, Issue 6, pp 3839-3851.
• Morten Rasmussen, Sarah L. Anzick, Michael R. Waters, Pontus Skoglund, Michael DeGiorgio, Thomas W. Stafford Jr, Simon Rasmussen, Ida Moltke, Anders Albrechtsen, Shane M. Doyle, G.
David Poznik, Valborg Gudmundsdottir, Rachita Yadav, Anna-Sapfo Malaspinas, Samuel Stockton White V, Morten E. Allentoft, Omar E. Cornejo, Kristiina Tambets, Anders Eriksson, Peter D. Heintzman, Monika Karmin, Thorfinn Sand Korneliussen, David J. Meltzer, Tracey L. Pierre, Jesper Stenderup, Lauri Saag, Vera M. Warmuth, Margarida C. Lopes, Ripan S. Malhi, Søren Brunak, Thomas Sicheritz-Ponten, Ian Barnes, Matthew Collins, Ludovic Orlando, Francois Balloux, Andrea Manica, Ramneek Gupta, Mait Metspalu, Carlos D. Bustamante, Mattias Jakobsson, Rasmus Nielsen, Eske Willerslev. The genome of a Late Pleistocene human from a Clovis burial site in western Montana. Nature 506, 225-229 (13 February 2014).
• The Genome Denmark Consortium. Deep whole-genome sequencing of Danish parent-offspring trios determines private variation, de novo mutation rates and allows population wide de novo assembly. Manuscript in preparation.
• Qin Hao, Rachita Yadav, Sidsel Petersen, Si B. Sonne, Simon Rasmussen, Qianhua Zhu, Zhike Lu, Jun Wang, Karine Audouse, Ramneek Gupta, Lise Madsen, Karsten Kristiansen and Jacob B. Hansen. Transcriptome profiling of brown and white adipose tissues during cold exposure provides evidence for extensive regulation of glucose metabolism in brown adipocytes. Manuscript in preparation.
Abbreviations

Adj.p-value   Adjusted p-value
AF            Allele frequency
ANN           Artificial neural network
ATP           Adenosine triphosphate
CDHR3         Cadherin-related family member 3
cDNA          Complementary DNA
ChIP          Chromatin immunoprecipitation
DNA           Deoxyribonucleic acid
ENCODE        Encyclopedia of DNA Elements
EST           Expressed sequence tag
EUR           European
FAIRE         Formaldehyde-assisted isolation of regulatory elements
FDR           False discovery rate
GO            Gene ontology
GRS           Genetic risk scores
GWAS          Genome wide association study
HFD           High fat diet
HGNC          HUGO (Human Genome Organisation) gene nomenclature committee
Indels        Insertions and deletions
KEGG          Kyoto Encyclopedia of Genes and Genomes
LD            Linkage disequilibrium
LOF           Loss-of-function
MAPQ          Mapping quality
MCC           Matthews correlation coefficient
MeDIP-Seq     Methylated DNA immunoprecipitation sequencing
mRNA          messenger RNA
miRNA         microRNA
NGS           Next generation sequencing
OR            Odds ratio
PCA           Principal component analysis
PCC           Pearson's correlation coefficient
PCR           Polymerase chain reaction
PPI           Protein-protein interactions
PTM           Post-translational modification
RD            Regular diet
RNA           Ribonucleic acid
RNA-seq       Ribonucleic acid sequencing
RNAi          RNA interference
SNP           Single nucleotide polymorphism
SNV           Single nucleotide variation
T2D           Type 2 diabetes
Tag-seq       Tag sequencing
TF            Transcription factor
WBC           White blood cell
WGA           Whole genome amplified

Part I Introduction

Chapter 1 Tools, Techniques and Data Analysis

All living organisms are made of smaller units called cells, which are governed by a central rule called "the central dogma of molecular biology". The central dogma was first articulated by Francis Crick in 1958 [1] and restated in an article published in Nature in 1970 [2]. According to the originally proposed central dogma, information in biological systems flows only from DNA to RNA to proteins. However, later developments showed that RNA can also be converted to DNA (Figure 1.1).
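The flow of information described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the example sequence and the deliberately partial codon table are illustrative assumptions, and the DNA string is assumed to be the coding strand, so transcription reduces to replacing T with U.

```python
# Illustrative sketch of the central dogma (DNA -> RNA -> protein).
# Assumption: the input is the coding strand, read 5' to 3'.

# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {"AUG": "M", "GCU": "A", "UGG": "W", "AAA": "K", "UAA": "*"}


def transcribe(coding_dna):
    """DNA -> RNA: replace thymine with uracil."""
    return coding_dna.upper().replace("T", "U")


def reverse_transcribe(rna):
    """RNA -> DNA: the reverse flow mentioned in the text."""
    return rna.upper().replace("U", "T")


def translate(mrna):
    """RNA -> protein: read codons in frame until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTTGGAAATAA"
mrna = transcribe(dna)
print(mrna, translate(mrna), reverse_transcribe(mrna))
```

A complete implementation would need the full 64-codon table and handling of reading frames and strands; the sketch only shows the direction of information transfer.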
The central dogma provides a framework to understand biological information and the relationship between different biological components and mechanisms. A living cell is a heterogeneous mixture of polymers, which are sequential arrangements of repeating monomer units. The three most important biological polymers that govern and regulate all cellular mechanisms are deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and proteins. Nucleotides and amino acids are the monomers of DNA/RNA and proteins, respectively. In biology, information is stored and transferred in the form of these three sequential molecules. The central dogma defines the transfer of information between these sequential polymers, and thus they are responsible for the existence of life. A lot has been discovered about these polymers and their role in the development, growth and sustainability of organisms. As these polymers regulate all biological mechanisms, any variation in them from the steady state leads to changes in the vital status of the organism. These changes are reflected in the phenotype, and some crucial differences result in diseases. The total content of DNA, RNA and protein of a cell makes up the genome, the transcriptome and the proteome, respectively.

Figure 1.1. Adapted from Crick's version of the "central dogma" of biology [2].

Almost 150 years ago, Gregor Johann Mendel discovered the basis of genetic heredity, which could explain the genetic basis of many diseases running in families. He described these genetic discoveries in a set of three laws known as Mendel's laws. Although these laws were sufficient to explain diseases that are caused by a single gene, they fell short in explaining phenotypes that result from either the accumulation of multiple genetic defects or genetic changes occurring in response to external stimuli. The changes in DNA causing these defects are called genetic variations.
These defects can be as small as a change of a single nucleotide in DNA, called a single nucleotide polymorphism (SNP), or as large as insertions or deletions of bigger chunks, called chromosomal aberrations. Interactions between the genetics of the organism and the environment lead to complex phenotypes.

All the molecules within cells, and the cells themselves, are interacting complex systems, and it is hard to predict the properties of individual systems separately. To understand them, it is necessary to quantitatively measure the behaviour of these groups and their interacting partners. The systematic measurement technologies that measure these individual components of the cell are called genomics, transcriptomics and proteomics. Based on these measurements, systems biology methods use mathematical and computational models to imitate the cell components and their interactions. This includes interactions of genes with each other, genes with proteins, RNA with DNA, RNA with proteins, and the interactions between the cellular components and the environment. Accordingly, combining the genetic information with the transcriptome and proteome information will lead to a deeper understanding of the basic biological mechanisms and also provide new insights into disease states.

1.1 Genomics

Past, Present and Future

In order to understand the complex genetic mechanisms that result in or regulate complex phenotypes, a branch of genetics referred to as genomics evolved in the late 1980s. The term genomics was coined by Dr. Tom Roderick, and it describes the comprehensive study of the entire genetic material of an organism [3]. Genomics also provides a new scientific basis for studying complex diseases, which may result in new possibilities for therapies and treatments for some diseases, as well as new diagnostic methods [http://www.genome.gov/19016904].
Genomics is a relatively new field of science that originated with the description of the structure of the DNA helix by James D. Watson and Francis H. C. Crick in 1953 [4]. It was also discovered that the sequences of the two strands define the structure of the DNA molecule and also its function. Later technological advances led to the development of DNA sequencing and the polymerase chain reaction, which were extended to other molecules like RNA. DNA, RNA and proteins harbour a sophisticated and unique code in their sequences, which facilitates accurate deciphering and transformation of the coded information. This in turn allows them to control and administer the different activities of a cell. Therefore, to understand how these molecules control cellular activities, it is necessary to unravel their actual primary sequence; this process is called "sequencing". Once the genetic sequence of an organism is decoded, it can be compared with that of individuals from the same species or from different species. The process of comparing the genetic code of different species or individuals to determine genetic variants is called genotyping. Finding the genotype reveals the specific alleles inherited by an individual, which is particularly useful when more than one genotypic combination drives the clinical manifestations in organisms.

The Pioneer: Microarrays

A microarray is a solid platform used to assay the molecular contents of biological samples. A microarray, also called a chip, has numerous wells, each acting as an experiment in itself. In a microarray, a fluorescently tagged nucleic acid sample (target) is hybridised (annealing of two complementary sequences) to the probes, which are attached to a solid surface. The fluorescence generated by this hybridisation is used to determine variations in, or expression of, genes.
The first microarrays were introduced in 1995 [5] to compare the messenger RNA (mRNA) content of cells in order to find differences in gene expression between two cells.

Genotyping arrays

Genotyping arrays are DNA microarrays used to detect genetic variations in an organism. DNA microarrays can be used to identify genotypic differences between individuals or between normal and diseased states. These differences can be assessed by several means, one of the most informative being single nucleotide polymorphism (SNP) microarrays. These arrays have probes that can bind to the different alleles of a SNP, and the hybridisation of these two probes to the genome gives the allele counts for the SNPs. They are designed to capture genome wide polymorphisms, assuming a uniform distribution of variations throughout all chromosomes. To examine the functional regions of the genome, exome arrays are designed to capture the variation in the coding sections of genes. Chips can also be custom designed to capture low frequency variation (minor allele frequency (MAF) 0.5-5%), variation known in a pathway, or previously known SNPs in either a disease or drug metabolism. SNP arrays can be applied to detect very small variations between individuals, which can be further used to determine disease susceptibility and to assess genetic variation linked to the efficacy and toxicity of drugs. In chapters 2 and 4 of this thesis, we have used genotyping data from asthma cohorts to ascertain childhood asthma risks.

Although SNP chips offer this flexibility of customisation, they are still unable to capture rare (MAF < 0.5%) or novel variation. Therefore, since its advent in the last decade, Next Generation Sequencing (NGS) has been successfully applied to assess SNPs in large population studies [6, 7, 8].
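To make the frequency classes above concrete, the following Python sketch computes the minor allele frequency of a biallelic SNP from genotype counts and assigns it to a class. The thresholds for "rare" (MAF < 0.5%) and "low frequency" (0.5-5%) follow the text; the label "common" for MAF above 5%, the function names and the example counts are assumptions made for illustration.

```python
def minor_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """MAF of a biallelic SNP from counts of the three genotype classes."""
    n_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if n_alleles == 0:
        raise ValueError("no genotyped individuals")
    # Each heterozygote carries one alternative allele, each alt homozygote two.
    alt_freq = (n_het + 2 * n_alt_hom) / n_alleles
    return min(alt_freq, 1 - alt_freq)


def frequency_class(maf):
    """Classify a MAF using the thresholds quoted in the text."""
    if maf < 0.005:
        return "rare"
    if maf <= 0.05:
        return "low-frequency"
    return "common"  # assumed label for MAF above 5%


# Hypothetical counts: (ref homozygotes, heterozygotes, alt homozygotes)
for counts in [(999, 1, 0), (980, 20, 0), (90, 9, 1)]:
    maf = minor_allele_frequency(*counts)
    print(counts, round(maf, 4), frequency_class(maf))
```

Under these assumptions, a variant seen once in 1000 individuals falls in the rare class that, according to the text, SNP chips are unable to capture.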
The Exciting Present: Next Generation Sequencing

Sequencing to decode the order of bases in DNA molecules was first developed in 1977 by Frederick Sanger and colleagues [9]. It works on the principle of terminating synthesis at each possible base. All possible DNA fragments are synthesised by selective incorporation of modified chain-terminating dideoxynucleotides [10]. These fragments are sorted by size on a gel, and the sequence is decoded by reading the terminating base in ascending order of fragment size.

The revolution in sequencing started with the advent of NGS, in which one can sequence tens of thousands of molecules in parallel. The method starts with enrichment of the molecule of interest (DNA or RNA) from samples by fragmenting it and concentrating the fragments either in solution or on an array. These fragments are amplified by polymerase chain reaction (PCR) to increase the number of copies of each fragment being sequenced. The amplified molecules are attached to a solid surface called the flow cell and subjected to sequencing (Figure 1.2).

Figure 1.2. Next generation sequencing (NGS) workflow. Adapted from [11].

There are multiple methods for sequencing these amplified molecules:

1. Sequencing by Synthesis: Sequencing by synthesis uses DNA polymerase to extend many DNA strands in parallel by incorporating fluorescently labelled modified nucleotides. These incorporated modified nucleotides do not allow further extension and thus serve as terminators of polymerisation [12]. The fluorescent dye is then imaged to identify the added base. The terminating group is then cleaved from the last base, which allows incorporation of the next nucleotide, and this process is repeated until the end of the sequence. Base calls are made directly from the signal intensities measured during each cycle, and this greatly reduces raw error rates when compared to other technologies.
Illumina sequencers such as the HiSeq and MiSeq use this technology.

2. Pyrosequencing: The single-stranded sequencing library fragments are captured onto beads, and these beads are immobilised on a solid support. The stationary DNA is flushed with nucleotides, and the incorporation of a nucleotide into the DNA by DNA polymerase results in the release of a pyrophosphate [13]. This pyrophosphate is converted into a light signal by the enzymes ATP sulfurylase (sulfate adenylyltransferase) and luciferase. The light is captured by a camera, and the signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing is used in the Roche GS FLX 454 machine.

3. Sequencing by Ligation (SBL): Instead of using DNA polymerase, the SBL technology uses DNA ligase to decode the sequence of the fragment of interest, with four fluorescent dyes encoding all 16 possible two-base combinations [14]. The amplified library undergoes multiple cycles of probe hybridisation, ligation, imaging and analysis. The use of oligonucleotide probes increases the accuracy of sequencing, but because the data are produced in offset steps, interpretation of the raw data is complicated. The Applied Biosystems SOLiD sequencer uses this method, and the vendor provides the software LifeScope for data analysis.

The Promising Future

All these methods differ in the PCR amplification applied to the library, the read lengths they produce, the time required for sequencing and their raw accuracies. Current research in the field is producing new technologies that aim to solve some problems of second generation sequencing, such as short reads, the requirement for library amplification and cost. This is termed third generation sequencing. Multiple methods are at different stages of development, with a few already launched commercially. Ion Torrent™ Technology directly translates chemically encoded information (A, C, G, T) into digital information (0, 1) on a semiconductor chip.
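As a brief aside on the two-base encoding mentioned under sequencing by ligation above, the following Python sketch shows how four colours can cover all 16 dinucleotides. It assumes the commonly described colour-space scheme in which each base is given a 2-bit code (A=0, C=1, G=2, T=3) and the colour of a dinucleotide is the XOR of the two codes; the function names and the example read are illustrative and not taken from the thesis or from any vendor software.

```python
from collections import Counter
from itertools import product

# Assumed 2-bit base codes; the colour of a dinucleotide is the XOR of its bases.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}


def encode_colors(seq):
    """Translate a base sequence into colours, one per overlapping dinucleotide."""
    return [BASE_BITS[a] ^ BASE_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Recover the bases; a known first base is needed to resolve the colours."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_BASE[BASE_BITS[bases[-1]] ^ color])
    return "".join(bases)


# Each of the four colours stands for exactly four of the 16 dinucleotides.
colour_usage = Counter(encode_colors(a + b)[0] for a, b in product("ACGT", repeat=2))

read = "ATGGC"
colors = encode_colors(read)
print(colors, decode_colors(read[0], colors), dict(colour_usage))
```

Because every base is interrogated in two overlapping dinucleotides, a single miscalled colour changes all downstream bases on decoding, which is one way to see why interpretation of the raw data is more involved for this chemistry.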
The Single Molecule Real Time (SMRT™) sequencing technology from Pacific Biosciences enables faster results and longer read lengths, and thus easier alignment. Nanopore DNA sequencing uses an exonuclease to cleave nucleotides from DNA.

Applications of Sequencing

In recent years, high throughput technologies that produce millions of short sequence reads have been routinely applied to genomes, transcriptomes and epigenomes. In this thesis, three different types of sequencing data have been used (Figure 1.3).

1. DNA sequencing (DNA-seq)
2. RNA sequencing (RNA-seq)
3. Epigenetic mark sequencing (MeDIP-seq)

Figure 1.3. Various applications of sequencing technologies used in different projects in this thesis. [Figure labels: Genomics, Transcriptomics, Epigenetics; DNA-seq, RNA-seq, MeDIP-seq; transcript quantification; differentially methylated regions.]

DNA Sequencing

DNA sequencing is the process of determining the nucleotide order of a given DNA fragment. The first method of DNA sequencing was developed in 1977 using the chain termination method [10]. As the technology advances, the price of DNA sequencing is falling, and sequencing is likely to become an integral part of regular clinical diagnosis and treatment in the near future. In this thesis, whole genome DNA sequencing has been applied in the pangenome and ancient genome projects.

RNA Sequencing

RNA sequencing is performed to detect and quantify all RNAs in a given cell. This technology was proposed by Nagalakshmi et al. in 2008, who used it to define the transcriptional landscape of the yeast genome [15]. RNA-seq quantitatively determines steady-state RNA in a sample by generating cDNA and subjecting it to massively parallel sequencing.

Epigenome Sequencing

Similarly, sequencing can be used to find variations in epigenetic marks such as DNA methylation (methylated DNA immunoprecipitation sequencing (MeDIP-seq), reduced representation bisulfite sequencing (RRBSeq)), transcription factor binding sites (chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq)) and chromatin structure (DNase I hypersensitive sites sequencing (DNase-seq), Formaldehyde-Assisted Isolation of Regulatory Elements sequencing (FAIRE-seq)). MeDIP-seq [16] uses an antibody specific to 5-methylcytosine and retrieves the methylated regions of the genome for sequencing, whereas RRBSeq relies on chemical (bisulfite) modification of cytosines to identify methylated regions [17]. ChIP-seq enables researchers to identify protein binding sites across the entire genome [18]. DNase-seq is used to identify the location of open chromatin regions based on the activity of the DNase I enzyme [19]. Similarly, FAIRE-seq is used for determining open genomic regions [20]. RNA-seq and MeDIP-seq are discussed in more detail in sections 1.4 and 1.5 of this thesis, respectively.

1.2 Processing of Sequencing Data

Over the past few years, there has been a huge increase in the use of NGS, and as the price falls further, more and more sequencing data will be generated. All NGS platforms generate millions of small strings of nucleotide sequence called reads. These reads are then either mapped individually to a reference genome or assembled into continuous sequences, to understand the variations in the target genome or transcriptome.

Quality Control

Sequencers are not perfectly accurate, and random errors occur during sequencing. The accuracy of the downstream data analysis depends on the quality of the reads. Thus, the first step in NGS data analysis is quality control.
Each base called by the sequencer is assigned a score that reflects its quality. Developed by a group at Washington University in 1990, Phred quality values [21] express the probability of error at each base:

$Q_{\mathrm{PHRED}} = -10 \log_{10} P(\mathrm{error})$

These scores are used to filter out poor quality reads and as quality checks in further analysis. During sequencing, the sequencing adapters and primers are also sequenced, and they need to be removed from the real read. If the quality check reveals that the quality scores of bases towards the end of the read are below the accepted threshold, it is recommended to remove these poor quality bases by trimming the reads. Standard data analysis protocols suggest investigating per-base quality, k-mer representation and GC content to assess the overall quality of the data (Figure 1.4). These quality controls keep a check on sample contamination and prevent alignment problems.

Alignment

The quality controlled reads are mapped back to the reference genome using specialised sequence alignment methods such as Bowtie [22] or BWA [23] (Figure 1.4). Most genomes contain repetitive regions, so some reads will map to multiple places in the genome. Hence, it is advisable to fine-tune the multiple-mapping parameters of read aligners. Another source of false positive mappings is PCR duplicates, which are artefacts of amplification during library preparation. Most sequencing pipelines recommend removing or marking such reads, and it is better to remove them when the analysis is quantitative. Picard's "MarkDuplicates" (http://picard.sourceforge.net) or samtools "rmdup" [24] can be used to mark or delete PCR duplicates. The mapping tools calculate the probability that an alignment is correct overall, which is denoted by the mapping quality (MAPQ). Misaligned regions are either realigned or marked with per-base alignment qualities called Base Alignment Quality (BAQ).
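As a minimal illustration of the Phred relation given in the quality control section, the following Python sketch converts between quality scores and error probabilities and trims low quality bases from the 3' end of a read. The function names, the ASCII offset of 33 (an assumption that matches Sanger-style FASTQ encoding) and the threshold of 20 are illustrative choices and not part of the pipeline used in this thesis.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 is an assumption; older encodings used 64.
    """
    return [ord(ch) - offset for ch in qual]


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their score is below min_q."""
    scores = decode_quality_string(qual, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

For example, a score of 20 corresponds to one expected error in 100 base calls, and a score of 30 to one in 1000. Real trimming tools apply more elaborate rules (sliding windows, adapter matching) than this end-trimming sketch.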
MAPQ and BAQ are used during subsequent steps such as variant, indel and copy number calling and expression analysis.

Figure 1.4. Flowchart of data analysis steps for sequencing data. [Flowchart steps: input data from sequencing (raw reads in FASTQ file); FASTQC/FASTX: filtered and trimmed reads in FASTQ file; BWA/Bowtie: assembled or aligned reads in BAM/SAM files; Samtools/PICARD: filtered reads based on mapping quality in BAM/SAM files; GATK: SNP and structural variation calling; DESeq/EdgeR: differential gene expression; MEDIPS: differential methylation.]

1.3 Genome Variation Analysis

Variant Calling

After quality control, the filtered reads are aligned, and the aligned reads are used for calling genotypic variants. Variant calling means finding mismatches in the alignments with respect to the reference genome. Most variant callers find variants by comparing the probabilities of the bases observed at a position in the mapped reads with the probabilities of the bases in the reference genome. When a single genome is analysed, genotyping and variant calling are more or less the same, with non-reference homozygous or heterozygous calls being treated as SNPs [25]. However, when there are multiple genomes, joint posterior probabilities or likelihood ratio tests are used for SNP calling. The variant callers assume diploid individuals and take into account Hardy-Weinberg equilibrium and linkage disequilibrium (LD) information, as well as prior information about the SNPs present in the species and their allele frequencies. SAMtools [24] and GATK [26] are commonly used SNP callers (Figure 1.4). When bases in the sequence are deleted or added, the variations are called insertions and deletions (indels). Large indels that disrupt functional protein domains or regulatory regions are called structural variations.
There is no clear distinction between indels and structural variations. Owing to several technical and analytical artefacts, all variations need to be filtered to avoid false positives. Variant calling artefacts are minimised by checking the quality score of the variation and the sequencing depth of the region. Variant calling tools report results in the variant call format (VCF), in which the information available about each variant is presented on a single line. VCFtools is widely used to manipulate these files, e.g. merging them or extracting regions or selected SNPs [27]. Even after extensive filtering, the number of variants called from sequencing data is overwhelming, and thus automated annotation is required. Details of assigning function to SNPs and of downstream analysis are described in section 1.8 of this thesis.

Genome Wide Association Study

The completion of the human genome in 2003 and a pilot project genotyping healthy individuals, called the HapMap project, in 2005 gave researchers an opportunity to find combinations of genetic markers that can define and segregate individuals from each other. It is of special interest to find differences in the genetic constitution of healthy and diseased individuals. To find such discriminating genetic traits, a comprehensive genome wide association study (GWAS) is required. A GWAS is an approach that involves scanning multiple markers across the genomes of many individuals to find genetic variations associated with a phenotype. In principle, the studies are designed to associate variations with disease by comparing allele frequencies between case and control groups. Such studies help to determine which loci differ significantly between the two groups and which allele is significantly associated with the phenotype. Different models, for example additive, multiplicative, recessive and dominant, can be used to determine the risk-related alleles.
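The comparison of allele frequencies between cases and controls can be sketched as a chi-square test on a 2x2 table of allele counts for a single SNP. This is a simplified illustration in plain Python, not the procedure implemented in PLINK: the function name and the counts in the usage note are hypothetical, no covariates are adjusted for, and the p-value uses the identity that, for one degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test of independence on a 2x2 allele count table.

    Rows are cases and controls, columns are alternate and reference
    allele counts. Assumes every row and column total is non-zero.
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Calling `allelic_chi_square(60, 40, 40, 60)` on these made-up counts compares an alternate allele frequency of 0.6 in cases with 0.4 in controls. In a real GWAS the same test is repeated for every SNP, which is why multiple testing correction is needed.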
To minimise random SNP-phenotype associations, it is imperative to apply robust statistical methods. For case-control studies, association testing is done using logistic regression or the contingency table method. The contingency table method tests for deviation from independence, whereas logistic regression predicts the probability of case status given a genotype class. Other factors, such as sex, age, race, ethnicity and disease severity, influence the SNP-phenotype association, so the GWAS scores need to be adjusted for them. Since all SNPs detected in a GWAS experiment are tested for association with the phenotype, the calculated test statistic, generally a p-value, needs to be corrected for multiple testing [28]. The multiple testing corrections applied in GWAS include the Bonferroni correction and the Benjamini-Hochberg false discovery rate (FDR) procedure. Several software packages have been developed to perform GWAS; frequently used ones include PLINK [29], TATES [30] and SNPtest [31]. PLINK has been used in this thesis to perform GWAS in the asthma exacerbation study described in chapter 2.

Imputation

Current SNP arrays assay a dense set of markers, but to cover the whole genome the markers need to be evenly distributed across it. Since most of the genome is non-protein coding, GWAS tend to find associations between a trait and SNPs that lie within these regions, and it is therefore difficult to assign a function to these SNPs. To overcome the problem of non-functional SNPs, exome chips are designed to account for the functional aspect of the detected SNPs. However, they still fail to detect the causative variant and miss SNPs located in non-coding regions of the genome. By Mendel's laws, all sites on a chromosome would undergo recombination and segregate separately, but that is not totally true.
Chromosomes are mosaics, and different loci have different recombination rates. SNPs are not independent: there is an association between closely located SNPs, leading to the co-inheritance of certain alleles more often than would be expected by chance. This phenomenon is called linkage disequilibrium (LD). LD gradually declines with distance, so the farther apart two SNPs are, the smaller the chance that they depend on each other. Based on this principle, a statistical method called imputation was designed [32]. This method learns the combinatorial patterns of variation from known datasets and can accurately estimate the genotype of an unobserved SNP based on the neighbouring SNPs present on the array. Prediction of genotypes by imputation is fairly accurate; it provides a detailed view of the associated region to facilitate follow-up studies, and it also allows correction and validation of the genotyped data [31]. Several software packages are available for imputation, e.g. PHASE [33], IMPUTE [34] and BEAGLE [35]. Imputing genotypes while combining different datasets can lead to the identification of susceptibility loci, but it requires rigorous quality checks at the pre- and post-analysis stages [36]. In chapter 2, regional imputation was carried out for the regions with top hits in the initial GWAS. This made it possible to check whether any variations missed in the genotyping data could be associated with the phenotype.

Targeted Sequencing

GWAS in past years have suggested that common variants explain only a modest percentage of the total heritability of diseases. The remaining heritability may be explained by rare or novel variants. However, genome wide capture of these variants is still not in routine use because it requires large cohorts.
Also, since multiple studies generally find different variations in the same gene associated with the same phenotype, it remains to be explored whether one of them is the real causal variant or a proxy for it. Thus, a study can be designed to sequence only selected regions of the genome and detect phenotype-associated variations in these target regions. There are multiple methods of target capture; PCR-based procedures have been the most widely used [37]. However, the limitation that PCR requires individual primers for each selected target led to the development of other methods, such as hybridisation to microarrays using capture probes [38, 39, 40] or the use of de novo synthesised microfluidic DNA chips [41]. The latter methods are cost effective, flexible and specific. Recent advances have reduced the amount of starting DNA required. At the same time, multiple samples can be sequenced simultaneously by multiplexing. To do so, the sequencing libraries from different samples are tagged with barcodes, so that they can be recognised and separated after sequencing [42]. Chapter 3 of this thesis describes a pilot study of targeted sequencing of samples from asthma cases, using a small quantity of DNA and multiplexing with custom designed barcodes.

1.4 Gene Expression Profiling

The "transcriptome" consists of the complete set of transcripts, which include both coding and non-coding RNAs. The coding RNAs that are translated into proteins are called mRNAs, and there are various types of non-coding RNAs. Quantification of the RNAs in a biological sample is called transcriptomics or expression profiling. The abundance of a given RNA depends on the balance between transcription of the gene and RNA degradation. As only the coding RNAs are translated into proteins, the expression of coding RNAs should be proportional to the protein content of the cell.
However, multiple factors govern the amount of protein translated from the mRNA, and there is also a constant opposing process of protein degradation. Thus, mRNA quantification is only an indication of the protein content of the cell. The quantity of different transcripts in a cell also varies between developmental stages and physiological conditions. Estimating transcript abundance benefits the understanding of the functional elements of the genome and reveals the molecular state of the cell. Expression profiling at different stages of cell life or under different conditions, such as control and disease, can be used to study regulatory gene defects in diseases, cellular responses to the environment, variations in the cell cycle, etc. There are various methods for expression profiling. The two major high-throughput approaches are microarrays and sequencing.

Microarray Based Expression Profiling

Microarray methods involve incubating fluorescently labelled complementary DNA (cDNA) with custom-made microarrays or commercially available high-density oligo microarrays. This technique was developed in 1995 and has been under constant development since [43]. Various commercial technologies and microarray platforms are available, which differ in probe design, implementation of probes, probe density, RNA isolation and labelling [44]. Agilent arrays were used in this thesis and are therefore discussed in detail in the following section.

Microarray Experiment

Agilent arrays have long probes (60-mers) [45], which provide high hybridisation potential and are more tolerant to mismatches [46]. However, this reduces the space available on the array, and thus Agilent arrays have a lower probe density than other platforms [45]. Microarrays are used to detect the quantity of transcripts in cell lysates. In the first step, total mRNA is extracted from the cell lysate, using the poly-A tail as the marker for mRNA [44] in eukaryotes.
Oligo(dT) primers are employed for cDNA synthesis [47]. The captured mRNA is amplified by reverse transcription polymerase chain reaction (RT-PCR) and PCR. This generates a library of cDNAs, which are labelled so that the reader can recognise them when they hybridise to the complementary probes on the array.

Microarray Data Analysis

Analysis of microarray data compares the signal intensities from multiple arrays, for example cases versus controls, different states of the cell or different time points. The multiple processes involved in a microarray experiment can introduce noise into the data. Thus, the intensity values from a microarray experiment need to be corrected for background noise and normalised within and between arrays, so that genes within an array and across arrays are comparable. We used the Limma [48] package in R for microarray data analysis. For each chip, negative probe correction was used for within-chip background correction. Negative probe correction subtracts the negative background intensities from the foreground intensities. This is done prior to background correction, which in turn is based on an exponential model fitted to the intensities of the negative probes on the array. To normalise across samples, non-linear quantile normalisation was applied between the arrays.

Statistical Testing of Differential Expression

Limma fits multiple linear models by generalised or weighted least squares. The coefficients of the fitted models describe the differences between hybridisation of the probes in two experimental conditions. The result of the linear model fit is the log fold change of each gene between conditions, together with a moderated t-statistic [48] based on the standard error. The p-value is obtained from the t-statistic and adjusted for multiple testing.
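One widely used adjustment for multiple testing is the Benjamini-Hochberg step-up procedure, which can be sketched in a few lines of Python. This is an illustrative re-implementation and not the code used by Limma; the function name is hypothetical. Each p-value is multiplied by the number of tests divided by its rank, and a running minimum from the largest p-value downwards keeps the adjusted values monotone.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

Genes whose adjusted p-value falls below a chosen threshold, for example 0.05, are then reported as differentially expressed while the expected proportion of false discoveries among them is controlled at that level.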
The most common form of adjustment is "FDR", which is Benjamini and Hochberg's method to control the false discovery rate. In the study presented in chapter 4, an Agilent whole genome microarray for mouse was used to find gene expression differences in inguinal and epididymal tissues between mice fed a regular diet (RD) and mice fed a high fat diet (HF).

Sequencing Based Expression Profiling

In contrast to microarray-based methods, sequence-based approaches directly determine the cDNA sequence. In the early development of sequencing-based expression analysis, cDNA or EST libraries [49] were subjected to Sanger sequencing. However, this approach was low throughput and non-quantitative. A more recent method of determining gene expression is high throughput sequencing of RNA using NGS technologies. Nagalakshmi et al. described this technology for the first time in a landmark article published in 2008 [15]. RNA-seq quantitatively determines steady-state RNA in a sample by generating cDNA and subjecting it to massively parallel sequencing to generate short reads. The sequencing, quality control and mapping of the generated reads follow the same principles as for DNA sequencing, which are described in section 1.2. The reads are either mapped to an annotated reference genome or assembled de novo. After successful read alignment, expression is quantified per gene by counting the number of reads mapped to each gene.

Tag-seq Based Expression Profiling

Midway between microarrays and RNA-seq are the tag-based quantitative methods. These include serial analysis of gene expression (SAGE) [50], cap analysis of gene expression (CAGE) [51] and massively parallel signature sequencing (MPSS) [52]. These methods are collectively called digital gene expression (DGE).
In the SAGE methods, the short sequences were concatenated into long clones for sequencing, which led to high cost, low sequencing throughput and complications related to the cloning step. Tag-seq is a tag-based variant of LongSAGE in which only 17 bases, called the tag, are sequenced from each transcript; however, tag-seq does not require tag concatenation and cloning as in SAGE [53, 54, 55, 56]. Tag-seq has been used in chapter 5 of this thesis and is therefore discussed in the next section.

Tag-seq Experiment

Total RNA is extracted from the sample tissue, and mRNA is isolated by capturing its poly(A) tail using magnetic oligo(dT) beads. The captured mRNA is subjected to restriction enzyme digestion, resulting in 17 nucleotide long tags. These tags are PCR amplified and subjected to high throughput sequencing (Figure 1.5).

Tag-seq Data Analysis

The data analysis of tag-seq follows the same principles of quality control, adapter removal, trimming and mapping as other sequencing methods. The reads lack the 4 nucleotides of the restriction enzyme recognition site, so the 4-base string "CATG" was added to each 17 nucleotide read, giving a total of 21 nucleotides, which helps in specific mapping to the reference genome [58]. The number of reads mapping to each gene is counted using HT-seq [59] or CuffDiff [60], and these counts can be used in different DGE packages in R [61] for finding differentially expressed genes. In chapter 5 of this thesis, we used DESeq [62] to identify differentially expressed genes. In brief, DESeq estimates the variance in count data from high-throughput sequencing assays and tests for differential expression based on the negative binomial distribution. Comparing the negative binomial fits of each gene in the two conditions in question yields a set of differentially expressed genes, which can be subjected to further downstream analysis.
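The tag restoration step described above can be illustrated with a short Python sketch that adds the "CATG" motif to each 17 nucleotide read and counts identical 21 nucleotide tags before mapping. The function names are hypothetical, and placing the motif at the 5' end of the tag is an assumption, since the text only states that the four bases were added. Reads of the wrong length or containing ambiguous bases are simply dropped here, which is a simplification.

```python
from collections import Counter

ANCHOR = "CATG"   # restriction site motif missing from the sequenced tag
TAG_LENGTH = 17   # length of the sequenced tag in nucleotides


def restore_tag(read):
    """Return the 21-nt sequence for a 17-nt tag, or None if unusable."""
    read = read.strip().upper()
    if len(read) != TAG_LENGTH or set(read) - set("ACGT"):
        return None
    # Assumption: the motif lies at the 5' end of the tag.
    return ANCHOR + read


def count_tags(reads):
    """Count identical restored tags across an iterable of reads."""
    restored = (restore_tag(read) for read in reads)
    return Counter(tag for tag in restored if tag is not None)
```

The resulting tag counts would then be mapped to the reference and summed per gene, which is the step the text assigns to tools such as HT-seq.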
Figure 1.5. Procedure of tag-seq after the RNA is extracted and bound to the beads. The last product from the bead-attached processing steps undergoes sequencing using the sequencing primers annealed to it [57]. [Figure labels: total RNA from cell lysate; sequencing; data analysis.]

Differential Gene Expression of Sequencing

One of the most useful applications of transcriptomics, using either microarrays or high throughput RNA-seq, is the comparison of transcript expression levels between conditions or over different time points to identify differentially expressed genes. Differential gene expression (DGE) in a normal physiological condition drives a number of biological mechanisms that define basic cellular functions, e.g. differentiation, growth, migration and cell death, and is thus important for normal development. However, any divergence of gene expression from the normal state usually leads to disease. The regulation of gene expression is essential for normal functioning and is controlled by many factors. Differential gene expression is not necessarily caused by loss or gain of genetic material, but more often by differential regulation of transcription, which is mediated by transcription factors, cofactors, genome accessibility and epigenetic regulators. Several programs have been developed for the analysis of such high throughput transcriptomics data and are widely used. For differential gene expression analysis, packages such as EdgeR [63], DESeq [62] and cuffDiff [60] have been designed, which are all based on the common principles of normalisation, background correction and significance calculation. However, they differ in their accepted input data, experimental design and underlying statistics.
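The normalisation principle shared by these packages can be illustrated with the median-of-ratios size factor, the approach associated with DESeq. The Python sketch below is a simplified re-implementation for illustration only, with a hypothetical function name; it is not the package code and omits dispersion estimation and testing. Each sample's counts are compared with a per-gene geometric mean across samples, and the median ratio is taken as that sample's size factor.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts per sample.
    Genes with a zero count in any sample are skipped because their
    geometric mean is zero. At least one gene must remain.
    """
    n_samples = len(counts[0])
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in gene) / n_samples
                     for gene in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(gene[j]) - log_mean
                      for gene, log_mean in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors
```

Dividing each sample's counts by its size factor puts samples sequenced to different depths on a common scale before expression levels are compared.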
Concluding Remarks for DGE

Studies have found high concordance between microarray data and high throughput sequencing data; thus microarray technology still holds good when the study is aimed at known genes and transcripts [64]. However, sequencing-based transcriptome analysis has several advantages, such as absolute quantification, low background noise, a larger dynamic range, suitability for non-model organisms and high sensitivity. Tag-seq has a major drawback due to its short reads, which result in unspecific and multiple mapping. The difficulties that RNA-seq poses include library preparation for different types of RNA, fragment length limitations and coverage of the transcriptome. There are some limitations common to gene expression profiling methods. Most of them require PCR amplification, which is found to be the major source of noise in the data. High abundance of some transcripts in RNA-seq may lead to skewed results. In all gene expression experiments, replicates are vital, as they allow variability to be estimated and statistical significance to be assessed. In chapter 5 of this thesis, tag-seq has been employed to study differential gene expression between seven adipose tissue samples from sheep. Since sheep is not a model organism, a sequencing-based method was well suited for gene expression profiling. Tag-seq was used because it substantially lowered the cost compared to RNA-seq without compromising too much on biological information.

1.5 Epigenetics

Factors other than the nucleotide sequence that affect the genome of the cell, and which act above ("epi") genetics, are collectively termed epigenetic factors. Epigenetics is therefore the study of these changes that occur above the genome and of the factors influencing them. Epigenetics is involved in normal cellular processes such as cell differentiation, proliferation and maintenance of steady state. Many epigenetic factors control the expression of genes by altering DNA folding and compactness.
Examples of such epigenetic factors are methylation of DNA; acetylation, methylation and phosphorylation of histones; RNA-induced silencing; and nucleosome positioning.

Figure 1.6. Different epigenetic events that occur in the nucleus of a cell [65].

In most cells, epigenetic marks are established at the time of differentiation and maintained throughout the life of the cell. Under certain conditions the epigenetic marks become dynamic and reversible. They are also influenced by environmental factors, which might lead to the development of an abnormal phenotype. In higher organisms, it is mostly cytosine residues in DNA that are modified, to 5-methylcytosine (Figure 1.6). Global hypomethylation has been observed in multiple cancers [66], while site-specific hypermethylation occurs in CpG islands in gene regions [67]. One of the environmental factors with epigenetic effects is diet. An investigation by Wolff et al. revealed that maternal diet could alter the coat colour of the offspring in mice [68]. Disruption of epigenetic mechanisms causes several pathologies, including cancer, mental retardation, obesity and diabetes [69]. In chapter 6 of this thesis, we have studied DNA methylation changes between tissues from lean and obese mice. This epigenetic mark is therefore discussed in detail in the following section.

DNA methylation is a complex process in terms of regulation, and it depends on time, tissue, DNA sequence, region of the genome and a combination of other regulatory enzymes and proteins. DNA methylation is one of the most highly studied epigenetic marks, and it controls gene activity specifically during development and differentiation [70]. The extent of DNA methylation changes in an orchestrated way during mammalian development, starting with a wave of demethylation during cleavage, followed by genome-wide de novo methylation after implantation [71].
Different DNA methyltransferases (DNMTs) are responsible for the methylation of DNA, each having a specific function [72]. Multiple methods are available to map DNA methylation on a genome scale. These methods combine the methylation analysis of DNA with either microarrays (methylation chips) or sequencing. Chip-based methods work on the same hybridisation principles as expression microarrays, but use two probes, one capturing the methylated and the other the unmethylated state of a site. The ratio of these two probe signals indicates whether a base is methylated or not. The Infinium HumanMethylation BeadChip from Illumina is a widely used array platform for humans [73]. The different methods available for preparing an enriched library for methylation sequencing are:

• MeDIP-seq - uses an antibody specific to 5-methylcytosine and retrieves the methylated regions of the genome for sequencing [16]
• MethylCap-seq - uses methyl-binding domain proteins to capture the methylated regions [74]
• MRE-seq - uses methylation-sensitive restriction enzymes to enrich DNA for sequencing [75]
• MethylC-seq or BS-seq - uses the bisulfite chemical reaction to convert unmethylated cytosines into uracils, thus introducing methylation-specific single nucleotide polymorphisms, so that unmethylated cytosines can be differentiated from methylated CpGs [76]

The enriched DNA from these methods is subjected to sequencing and data analysis as described in section 1.2. The methods differ in accuracy, coverage and resolution. Bisulfite methods have higher accuracy than any of the enrichment methods and are free of CpG bias [77]. Reduced representation bisulfite sequencing (RRBSeq) is a type of bisulfite sequencing that gives more coverage with less sequencing [17]. MethylCap-seq and MeDIP-seq give higher coverage of the genome. All of them are equally efficient at detecting differentially methylated regions.
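The bisulfite principle listed above can be made concrete with a small in-silico sketch in Python. It is a toy model under strong assumptions: a single read from the same strand as the reference, no sequencing errors, no genuine C-to-T variants, and converted uracils read as thymines after amplification. The function names are hypothetical and do not correspond to any tool used in this thesis.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment followed by amplification.

    Unmethylated cytosines are read as T; methylated cytosines
    (0-based positions in methylated_positions) remain C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(sequence)
    )


def call_methylation(reference, converted_read):
    """Infer the methylation state of each reference cytosine.

    Returns {position: True if methylated, False if unmethylated}.
    Positions where the read shows neither C nor T are left uncalled.
    """
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, converted_read)):
        if ref == "C":
            if obs == "C":
                calls[i] = True
            elif obs == "T":
                calls[i] = False
    return calls
```

In practice many reads cover each cytosine, and the methylation level of a site is estimated as the fraction of reads that retain a C at that position.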
Thus, good practice would be to use an enrichment or chip-based method to find associations and to validate the findings with bisulfite methods. Epigenetic marks and their transgenerational inheritance are also seen as an answer to the missing causality in complex traits. A study examining the short- and long-term effects of a dietary supplement on genetically identical mice suggests that diet supplements induce small but widespread epigenetic changes in exposed mice [78]. Comparing DNA methylation patterns of high and low responders to a hypo-caloric diet has identified novel potential epigenetic biomarkers for weight loss [79]. Based on these findings, a study was designed to elucidate how DNA methylation affects weight gain in diet-induced obese mice and how it differs from genetic obesity. We used MeDIP-seq to find the methylation changes between lean and obese mice. MeDIP-seq provides high-quality whole genome methylation status at a resolution of typically 100 to 300 bp, and its cost is comparable to other capture-based techniques. As obesity studies have seen the effect of environmental factors like diet across generations [80], epigenetic marks could support risk prediction as well as interventions for obesity. Established and potential psychopharmacological drugs are already known to influence epigenetic mechanisms [81]. Although the field is in the early stages of understanding the complex epigenetic regulatory mechanisms, preliminary evidence suggests that epigenetic therapies could be developed for some disorders.

1.6 Proteomics

The proteome is the set of proteins expressed by a genome, and the study of the structures and functions of the proteins expressed at a given time in a sample is called proteomics. Proteomics studies play an important role in medicine and biology, as they link genetics to the active molecules in cells under normal and pathophysiological states.
Mass spectrometry (MS) based quantitative methods attempt to quantify the constituent proteins in a sample [82]. Quantitative proteomics is important for disease biology as well as drug discovery, because the abundance of an expressed mRNA may not reflect the quantity of the corresponding protein [83]; protein quantification therefore gives a more direct indication of the biological state of the cell. Proteins can be modified even after they have been translated from mRNA. Post-translational modifications (PTMs) of proteins include the addition and removal of small chemical groups such as acetyl and methyl. One very prevalent modification is the addition of a phosphate group, called phosphorylation. Protein phosphorylation on serine, threonine and tyrosine residues occurs on more than one-third of all cellular proteins [84]. Proteins called kinases transfer phosphate groups from a donor to proteins [85], while phosphatases [86] are the family of proteins that remove phosphate groups. Kinases and phosphatases are part of signalling processes and are themselves regulated by signalling processes. Therefore, along with the expression of a protein, post-translational modifications also influence its shape, function and cellular localisation. Differences in the abundance of proteins and PTMs between disease and non-disease samples point to the cellular processes and pathways perturbed by the disease.

Stable Isotope Labelling by Amino acids in Cell culture (SILAC) is a method of MS-based quantitative proteomics. In this method, the total cellular proteome is labelled with a non-radioactive heavy isotope by supplementing the medium with a labelled amino acid, which is incorporated into the proteome through the normal biological process of protein synthesis [87]. These heavy amino acids can be distinguished in MS, and when a labelled sample is compared to a control sample, the difference gives the relative quantification of the proteome.
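The relative quantification described above is commonly expressed as a log2 ratio of heavy to light intensities. The sketch below assumes that peptide intensities have already been extracted from the spectra and that a protein is summarised by the median of its peptide ratios; the summarisation rule, the function name and the intensities are assumptions made for illustration only.

```python
import math
from statistics import median


def protein_log2_ratios(peptides):
    """Summarise peptide intensities into one log2(heavy/light) ratio per protein.

    `peptides` is an iterable of (protein_id, light_intensity, heavy_intensity).
    Peptides with a non-positive intensity are skipped because the ratio is undefined.
    """
    per_protein = {}
    for protein_id, light, heavy in peptides:
        if light <= 0 or heavy <= 0:
            continue
        per_protein.setdefault(protein_id, []).append(math.log2(heavy / light))
    return {pid: median(values) for pid, values in per_protein.items()}


# Hypothetical intensities: (protein, light = control, heavy = labelled sample)
peptides = [
    ("P1", 100.0, 400.0),
    ("P1", 50.0, 200.0),
    ("P2", 80.0, 80.0),
    ("P3", 0.0, 10.0),  # skipped: no light signal
]
ratios = protein_log2_ratios(peptides)
```

A ratio of 0 means equal abundance in both samples, while a ratio of 2 corresponds to a four-fold higher abundance in the labelled sample.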
The study of the proteome and the phosphoproteome in the same sample assists in determining the effect of phosphorylation on the expressed protein set. This can further illustrate how signalling processes are controlled through the phosphorylation of proteins by kinases [88]. Both kinases and phosphatases recognise their substrates by motif recognition. The methods described here have been employed to find differentially expressed as well as differentially phosphorylated proteins in a chemotherapy-resistant breast cancer cell line (see appendix chapter 8). The study aims at finding the effect of high TIMP1 expression on the global proteome and phosphoproteome in the resistant cells.

1.7 Machine Learning

With the large volumes of data generated in multiple fields, analysing them is a complex process. Automated methods of data analysis are needed, and machine learning helps in designing them. Machine learning can be defined as a set of methods that automatically detect patterns in data and recognise them when they are seen again. The goal of machine learning is to learn the rules that map a set of inputs to a set of outputs. A method that learns from one set of data and applies the knowledge to classify another dataset is called predictive or supervised. In contrast, descriptive or unsupervised learning tries to find patterns within the same dataset.

Pattern classification and knowledge discovery require a subset of features that best represents the pattern in question. The selection of these features determines the performance of the classifier. Genetic algorithms offer an attractive approach to finding a near-optimal set of descriptors by generating multiple combinations and testing their accuracy [89]. However, the time required to converge to the best combination is long.
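The idea of generating feature combinations and testing them can be sketched as a very small genetic algorithm. This is a generic illustration and not the algorithm designed for chapter 4: the selection scheme, one-point crossover, mutation rate and the toy fitness function are all assumptions. In practice the fitness would be the accuracy of a classifier trained on the selected features.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.1, seed=0):
    """Search for a high-fitness feature mask (tuple of 0/1) with a simple genetic algorithm."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)  # remember the best mask seen
    return best


# Toy fitness (an assumption for illustration): features 0, 3 and 5 are informative,
# every other selected feature costs a small penalty.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    selected = {i for i, bit in enumerate(mask) if bit}
    return len(selected & INFORMATIVE) - 0.5 * len(selected - INFORMATIVE)


best_mask = genetic_feature_selection(n_features=8, fitness=toy_fitness)
```

Every fitness evaluation in a real application involves training a classifier, which is why convergence is slow, as noted above.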
Other popular machine learning methods for classification include support vector machines (SVM), artificial neural networks (ANN), and classification and regression trees. An ANN combined with a combinatorial approach to feature selection has been used in a supervised model for predicting asthma, as presented in chapter 4 of this thesis, and is therefore discussed in detail.

Artificial Neural Networks

An ANN is an artificial intelligence method inspired by the functioning of the human brain. The method was originally proposed in 1943 [90] and has been applied extensively to solve non-linear problems. A typical ANN comprises three types of layers made up of nodes (denoting neurons): an input layer that passes the input data to the other layers, an output layer that captures the classification outcome, and hidden layers, which take the data from the previous layer and pass the processed data to the next layer (Figure 1.7). The nodes of different layers are connected by edges (equivalent to synapses in the brain), which carry weights. An ANN design consists of three cycles: learning, testing and decision making. A learning strategy is applied to change the weights in order to minimise the error. An ANN recognises patterns in a known dataset called the training data, and the main goal of the network is to make predictions on novel inputs, called the test data. During the learning cycle, a function is optimised to maximise the capture of the positive and the rejection of the negative data points. Over a number of iterations called “epochs”, every data point in the training data is fed to the ANN one after the other. The error in prediction is calculated and the weights are updated.

Figure 1.7. Simplified artificial neural network presenting the three layers (input layer X1, X2, X3, ..., Xi; hidden layer; output layer Y) and the edges.
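The learning cycle described above (feed each data point, compute the error, update the weights, repeat over epochs) can be sketched with a network of one hidden layer written in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in chapter 4: sigmoid units, a squared-error function, online gradient descent, the learning rate and the toy data are all illustrative choices.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyANN:
    """One hidden layer, sigmoid units, trained by online gradient descent."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight vector ends with a bias weight.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in self.hidden]
        o = sigmoid(sum(w * v for w, v in zip(self.output, h + [1.0])))
        return h, o

    def predict(self, x):
        return self.forward(x)[1]

    def train_epoch(self, data, rate=0.5):
        """Feed every training point once and update the weights after each one."""
        for x, target in data:
            h, o = self.forward(x)
            delta_o = (o - target) * o * (1.0 - o)  # error gradient at the output node
            delta_h = [delta_o * self.output[j] * h[j] * (1.0 - h[j])
                       for j in range(len(h))]
            for j, value in enumerate(h + [1.0]):
                self.output[j] -= rate * delta_o * value
            for j, ws in enumerate(self.hidden):
                for i, value in enumerate(x + [1.0]):
                    ws[i] -= rate * delta_h[j] * value

    def error(self, data):
        """Half the sum of squared differences between targets and outputs."""
        return 0.5 * sum((t - self.predict(x)) ** 2 for x, t in data)


# Toy training data (logical OR), purely for illustration.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = TinyANN(n_inputs=2, n_hidden=3)
error_before = net.error(data)
for epoch in range(2000):
    net.train_epoch(data)
error_after = net.error(data)
```

In a real application the loop over epochs would not run for a fixed number of cycles but would be stopped according to the performance on a separate test set.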
The error function used for a binary classification problem is half the sum of the squared differences between the predicted output of the ANN and the target (known) value. This can be written as:

\[ E = \frac{1}{2} \sum_{i} (t_i - o_i)^2 \]

where t_i is the target value and o_i is the output from the ANN for data point i. The most widely used stopping criterion for the learning cycles is attainment of the highest test correlation coefficient. Two correlation coefficients are used: (i) Matthews correlation coefficient (MCC) for binary classification and (ii) Pearson correlation coefficient (PCC) for a continuous output variable. The weights from the cycle with the highest test correlation coefficient are used for later classifications. In chapter 4, we have used an ANN for predicting disease outcome, which is a binary variable, and thus I used MCC as the stopping criterion. MCC is calculated as:

\[ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]

where:
TP: number of True Positives (Predicted = True, Actual = True)
TN: number of True Negatives (Predicted = False, Actual = False)
FP: number of False Positives (Predicted = True, Actual = False)
FN: number of False Negatives (Predicted = False, Actual = True)

Sensitivity and specificity are two more prediction accuracy parameters commonly used in machine learning. Sensitivity measures the ability of the model to predict positives as positives, while specificity measures the accuracy of the model in rejecting the negatives.

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP} \]

Figure 1.8. Four-fold cross validation of the training data. The training data are divided into 4 parts, each of which acts once as the test set for stopping the training. (Panel labels: test set, training set, evaluation set, average evaluation of the classifier.)

The output from the ANN is generally a probability.
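The accuracy measures defined above translate directly into code. The sketch below uses hypothetical function names and toy predictions; returning an MCC of 0 when the denominator is zero is a convention assumed here, not something stated in the text.

```python
import math


def confusion_counts(predicted, actual):
    """Count TP, TN, FP and FN from two equally long sequences of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Toy predictions (hypothetical)
predicted = [True, True, True, False, False, False, False, True]
actual = [True, True, False, False, False, False, True, True]
tp, tn, fp, fn = confusion_counts(predicted, actual)
print(mcc(tp, tn, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```

For these toy data the counts are TP = 3, TN = 3, FP = 1 and FN = 1, giving an MCC of 0.5 and a sensitivity and specificity of 0.75 each.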
Probabilistic predictions [91] are suitable for classification as they assign a probability value to each data point. This probability is an estimate of the data point belonging to one class. It is converted to a positive or negative classification by applying a threshold value. All the above-mentioned measures depend on this threshold, and the threshold used in this work is 0.5. Pushing the accuracy of the model on the training data too high may lead to an over-fitted model. Such a model cannot tolerate minor variations in the data when it is applied to the test set, which results in inaccurate predictions. To avoid this pitfall of over-training, not every minor variation in the training dataset needs to be modelled, and thus a method predicting a few false positives is acceptable. If the dataset is small and is partitioned into training and test sets, there may not be enough data to both train and test the models. A simple but popular solution to this problem is cross validation (Figure 1.8). In this method, the total data are randomly partitioned into X parts. One part is used for testing while the remaining (X-1) parts are combined to form the training set. This is repeated X times, each time using a different part as the test set. When all parts have been used, the results are averaged to give a single measure of performance. Since the test set is itself part of the training data, it is better to also have an external validation set, never seen by the model, for selecting the model with the best complexity. In the field of biology, ANNs have been successfully applied to prediction problems; some well-known examples are secondary structure [92], post-translational modifications [93], epitope prediction [94] and, more recently, disease outcome [95]. ANNs have been used to detect associations between disease outcome and multiple marker genotypes, which provides a simple and practical method that allows multiple markers to be analysed simultaneously.
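The thresholding and cross validation steps described in this section can be sketched as follows. The function names are hypothetical, `train_and_score` stands for whatever model training and scoring routine is being evaluated, and treating a probability exactly equal to the threshold as positive is an assumption.

```python
import random


def classify(probabilities, threshold=0.5):
    """Convert predicted probabilities into positive (True) or negative (False) calls."""
    return [p >= threshold for p in probabilities]


def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition range(n_samples) into k parts of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, k, train_and_score, seed=0):
    """Average the score over k rounds; each part is the test set exactly once.

    `train_and_score(train_idx, test_idx)` is a user-supplied callback that
    trains a model on train_idx and returns its score on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


# Four-fold partition of ten samples; the callback here only reports the test set size.
folds = k_fold_indices(10, 4)
mean_test_size = cross_validate(10, 4, lambda train_idx, test_idx: len(test_idx))
calls = classify([0.2, 0.5, 0.9])
```

An external evaluation set, as recommended above, would be set aside before calling `cross_validate` and never passed to it.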
1.8 Translating High Throughput Variation Data to Function

All individuals differ from each other, and these differences are encoded in the variations of their genetic or epigenetic state. Most studies in disease biology aim to find what makes patients more susceptible to disease than controls. These differences might be due to underlying genomic variation or to variation in the mechanisms controlling expression, referred to as cellular signalling. In association studies, a test is made to find which of these variations are more related to the phenotypic condition than others. Once a set of suspected variations is discovered, the next step is to find their mechanism of action in light of the observed phenotype. This is done by applying functional analyses to these variations to uncover their mode of action. There are two principal ways of doing this. The first looks at the action of the variation on the gene or gene product, thus analysing each variation individually. The second is a cumulative method, in which all variations are mapped to genes and the functional analysis is done on this gene set, also taking into account the interacting partners of the genes and the proteins coded by them (Figure 1.9).

Effects of Genomic Variations

Genomic variations are spread throughout the genome, and since 97% of the human genome is non-protein-coding [96], finding the effect of a variation in these regions is difficult. Therefore, different methods need to be applied for annotating coding and non-coding variations. The best-annotated share of variations lies in the protein-coding regions, where annotation is based on evolutionary and biochemical evidence. These variations are classified depending on whether an amino acid is altered, a stop codon is gained or lost, or the coding frame is changed.
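The classification of coding variations described above can be illustrated with the standard genetic code. This is a simplified sketch: it looks at a single codon or a single indel in isolation, ignores transcript structure and splicing, and the function names and category labels are chosen here for illustration rather than taken from any annotation tool.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; "*" denotes a stop codon.
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def classify_substitution(ref_codon, alt_codon):
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


def classify_indel(ref_allele, alt_allele):
    """An indel changes the reading frame unless its net length is a multiple of three."""
    if (len(alt_allele) - len(ref_allele)) % 3 == 0:
        return "inframe_indel"
    return "frameshift"


examples = {
    "GAG>GTG": classify_substitution("GAG", "GTG"),  # Glu to Val
    "TGG>TGA": classify_substitution("TGG", "TGA"),  # Trp to stop
    "TAA>CAA": classify_substitution("TAA", "CAA"),  # stop to Gln
    "CTG>CTA": classify_substitution("CTG", "CTA"),  # Leu to Leu
}
```

Because one gene can give rise to several transcripts, a real annotation has to repeat this kind of classification for every transcript that overlaps the variation.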
In a nutshell, the variations are annotated according to their effect on the protein. As a single gene can be transcribed into multiple isoforms, the effects of these variations need to be analysed at the transcript level [97]. There are computational tools that use the location of a variation, its biochemical effect and its evolutionary history in different organisms to predict the effect of a polymorphism on the protein, as well as whether the polymorphism is harmful for the organism. The most popular ones include SIFT [98] and PolyPhen-2 [99]. There are also meta-analysis tools that take the results from multiple predictors and produce a consensus score for each SNP, e.g. ANNOVAR [100], Condel [101], SnpEff [102] and Variant Effect Predictor (VEP) [97]. These tools vary in the number of information sources they use and the statistics they apply. ANNOVAR uses six different scores [103], while the others mainly rely on PolyPhen and SIFT. SNPs are annotated for their effects, and there are databases storing the predicted effects of common SNPs as well as their known associations with disease traits. These include the Short Genetic Variations database (dbSNP) [104], Ensembl [105], the Human Gene Mutation Database (HGMD) [106] and ClinVar [107]. This knowledge base helps in filtering the pre-annotated SNPs before going to the prediction phase. These databases are still far from complete, even for the known variations, and thus rechecking the top hits manually and validating them experimentally is always advised.

Figure 1.9. Annotations of SNPs, their relation to pathways, roles in diseases and comparative genomics [108]. The green lines show the solved translations of variations, whereas the red connections show the areas under development.

Since coding variations make up only 1% of total genome variations, the major portion of variations lies in the non-coding regions.
Evolutionary studies have found that even these non-coding regions are conserved, and many such conserved sequences are involved in regulating the expression of neighbouring genes [109]. As a result, variations in these regions can have a functional effect. Since most GWAS are carried out on genome-wide SNP chips, a large proportion of GWAS hits are non-coding. Such SNPs exert regulatory effects either by lying in sequences coding for microRNAs (miRNA) and long non-coding RNAs (lncRNA), or by lying in transcription factor binding sites and regions that regulate expression by modulating chromatin architecture. Therefore, it is very important to associate such SNPs with their function. The most prominent effort in annotating non-coding variations is carried out by the Encyclopedia of DNA Elements (ENCODE) Consortium [110].

Although population or cohort based studies are designed on the “common disease, common variant” principle, they have only been successful in explaining a modest fraction of the genetic component of common human diseases. One reason is that there exist rare variants, with a frequency of less than 1% but still polymorphic in certain human populations, and a few of these have been found to be associated with common diseases [111]. These variations can be detected by whole genome sequencing of the affected individuals along with their family members, combined with detailed phenotypic knowledge.

The above-mentioned methods of annotating variations from genomic data have been applied in chapter 2. Applications of these methods to the Danish pan-genome and an ancient genome project are explained in chapter 7. In brief, these methods were used to classify the variations into functional clusters. These clusters were further subjected to functional or pathway enrichment analyses.
Enrichment Analysis

When a set of genes is found to carry variations in a genetic study, or to be differentially expressed in a transcriptome study, there is a need to find the biological functions enriched in the group. Enrichment analysis explores the common features that cover a large portion of the set, rather than studying all gene products individually. Functional knowledge about a gene is obtained either from experiments or from sequence similarity approaches. Gene Ontology (GO) is a resource that classifies genes by their known functions using a systematic vocabulary [112]. GO is a hierarchical classification of gene functions in which the lowest nodes represent the most specific known function of a gene. GO terms are classified into three major categories: cellular component, molecular function and biological process. A number of tools have been developed to test gene sets for enrichment of GO classes, e.g. AmiGO [113], GOrilla [114], EasyGO [115], Gene Set Enrichment Analysis (GSEA) [116] and DAVID [117]. To ascertain the significance of enrichment, p-values are calculated and corrected for multiple testing.

Proteins do not work independently in the cell, and the majority of them interact physically with each other for proper functioning. The techniques applied to detect protein-protein interactions (PPIs) in a cell include immunoprecipitation, selective protease digestion, western blotting, phage display and two-hybrid analysis. There are numerous databases storing known PPIs, for example the Database of Interacting Proteins (DIP) [118], the Molecular INTeraction database (MINT) [119], IntAct [120], the Biomolecular Interaction Network Database (BIND) [121], the General Repository for Interaction Datasets (GRID) [122] and the Human Protein Reference Database (HPRD) [123]. These databases store the interaction information as interacting pairs.
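Returning to the enrichment p-values mentioned earlier in this section, one common way of computing them is a one-sided hypergeometric test followed by a Bonferroni correction. Both choices are assumptions made for this sketch; the tools listed above differ in the tests and corrections they apply, and the function names and numbers below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in a random gene set.

    population: number of genes in the background
    annotated:  number of background genes carrying the GO term
    sample:     size of the gene set under study
    hits:       number of genes in the set carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy example: all 5 genes of a set carry a term found in 5 of 20 background genes.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=5, hits=5)
corrected = bonferroni([p_value, 0.04, 0.5])
```

The Bonferroni correction is conservative; when many GO terms are tested, a false discovery rate procedure is a frequently used alternative.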
Mutations in different members of a protein complex lead to comparable phenotypes. Based on this, collections of protein complexes can be associated with known diseases, organs and GO classes [124].

Transcription factors (TFs) are regulatory proteins required for the activation or deactivation of transcription, which they achieve by binding to specific DNA sequences called TF binding sites. TF binding is sequence specific, with each TF having its own binding motif. The JASPAR CORE [125] and TRANSFAC [126] databases contain curated, non-redundant sets of TF profiles from experimentally derived TF binding sites. ChEA [127], CistromeMap [128], CTCFBSDB [129] and ChIPBase [130] are databases with genome-scale maps of TF binding. TFSEARCH [131], PROMO [132], the MEME suite [133], P-match [134] and SiTAR [135] are computational tools that can be used to predict TF binding sites. TF binding site prediction tools were used to identify TF enrichment in the differentially regulated gene sets reported in chapters 5 and 6.

Pathway Analysis

A biological pathway is a series of events occurring among the molecules within a cell that leads to a change in cell physiology or morphology. Pathway analysis gives insight into the underlying biology of differentially expressed or polymorphic genes (Figure 1.9). Grouping a gene set into biological pathways reduces complexity and assists in identifying mechanisms [136]. The biological pathway databases mainly used in the functional analysis performed in this thesis are the Kyoto Encyclopedia of Genes and Genomes (KEGG) [137], Reactome [138] and the Database of Cell Signaling (http://stke.sciencemag.org/cm/).

Integrative Analysis

Complementing the analysis of a gene set with different types of data improves the functional relevance of the gene set [139]. This is because not all genes are affected at the same time, and not all changes can be captured by a single experiment.
Thus, augmenting results from one data type with another helps in filling these gaps. For example, differentially regulated gene sets can be further subjected to PPI analysis using tools like Ingenuity Pathway Analysis (IPA) (Ingenuity® Systems, www.ingenuity.com), Explain [140] (http://www.biobaseinternational.com/product/explain), GeneMANIA [141], STRING [142] and Enrichr [143]. Visualisation tools like Cytoscape also help in better interpretation of these interactions [144].

All the tools mentioned in this section can be complementary to each other, with some having a few advantages over the others. The GO enrichment tools are good for getting a general idea of the functional impact of the gene sets. Tools like Explain, which provide manually curated GO annotations from published functional studies, have higher confidence but less coverage. DAVID, GSEA, IPA, GeneMANIA, Explain and Enrichr have a wide range of background data sets against which the test set can be queried. GeneMANIA provides an integrative network based on multiple sources. IPA and Explain are commercial tools with high-confidence, manually curated data. They provide information about signalling and transcriptional networks and specific cancer pathways. GSEA also has multiple sources and tools with varied functionalities, available as modules in GenePattern [145].

Pathway Based Prediction Tool

Pathway-based methods group the variations or genes into pre-selected subsets, allowing joint effects to be tested. Pathway-based GWAS have higher power than several other approaches to find pathway and disease associations [146]. This approach of combining SNPs into pathway subsets is utilised in the asthma risk prediction tool presented in chapter 4 of the thesis.
With the hypothesis that a specific combination of genetic factors, when integrated with certain clinical or environmental features, increases the disease risk [147], the genetic and clinical features were tested in combinations for childhood asthma prediction. For this prediction tool, a variation of a genetic algorithm coupled with an ANN was designed for feature selection and prediction. The results from this pathway-based approach are further discussed in the manuscript following the chapter. These pathway-based approaches are used to complement single-SNP studies to uncover the underlying biological mechanisms. However, they suffer from the drawback that the pathway knowledge base is not yet fully developed. A few genes are very well studied, while others are still to be included in any pathway (e.g. ARID5B). No pathway resource is complete, as they are developed from different perspectives. Thus, they complement each other, but individually they lack a comprehensive understanding of all biological processes. Therefore, the success of pathway-based methods depends on the future development of pathway resources: there is a need to increase the resolution of the databases and to complete and correct the information in them [108]. On the methodology side, additional and precise benchmark datasets generated from real biological sets would increase the sensitivity and specificity of pathway-based enrichment analysis methods.

Challenges of Next Generation Sequencing

As most of the data used in the projects come from high-throughput sequencing, it is important to discuss the difficulties faced with this type of data. Base calling is the most critical step in NGS data interpretation. Technology differences between platforms and the use of different base-calling algorithms lead to platform-specific errors.
Given the raw sequencing error rates, accurate mapping of the reads is a major bottleneck. During mapping, reads that map to multiple locations are either discarded or reported as one randomly chosen alignment or as up to a user-defined maximum number of alignments. The raw error rates and the possibility of multiple alignments introduce misalignments. These misalignments can be efficiently reduced by using longer or paired-end reads. Methodological differences between aligners and variant callers also affect the variant calls. These differences do not affect the robust calls, but a small portion of the variant calls may still turn out to be false [148]. To reduce false calls, it has been suggested to use variants called by multiple variant calling pipelines [148]. Using multigenerational familial data also increases the accuracy of de novo variant calls [148]. Finally, it is recommended to validate a SNP or indel by another method that substantiates the call made by the program. Above all this, sequencing generates massive amounts of data, which poses a big bioinformatics challenge for the storage, quality control, alignment, assembly and annotation of all these millions and billions of reads.

Along with the technological concerns, there are biological concerns when sequencing samples that are treated in non-standard ways. For example, formalin-fixed, paraffin-embedded (FFPE) samples, a common form of storage in hospitals, are prone to degradation during sample preparation. Degraded samples lead to high error rates in sequencing and low coverage. Thus, new NGS technologies as well as data analysis methods need to take these effects into account. Sequencing tumour samples poses another problem, as tumours are very heterogeneous. Even if the sequencing is done on a single sample, it is generally a population of non-identical cells that has been sequenced.
More precise results can be obtained for such samples with the development of single-cell sequencing. However, it is still in a developing phase, and there are no specialised data analysis tools for such sequencing. There is a vast variety of NGS technologies and tools available, generally tied together to form an NGS pipeline; the choice among them depends on the biological problem in question.

1.9 Complex Phenotypes

The internally coded heritable information called the “genotype”, found in all living organisms, contains the instructions regarding the structures and processes of the organism. These instructions are interpreted by the cellular machinery to manifest the external appearance and other complex phenomena like metabolism, tissues, organs, functions and behaviours, which are collectively called the “phenotype”. At the cellular level, phenotype can be defined as the observable physical and/or biochemical characteristics resulting from the genes expressed within a cell. It is known that phenotype is the result of genotype interacting with the environment. Thus, phenotypes can be predicted from genotypes and vice versa. The interactions of DNA, RNA and proteins active inside the cell affect the observable traits of the cell. Most phenotypes are complex, as they are the combined result of multiple interactions between different cellular components. When the balance between these complex interactions within a cell or an organism is disturbed, it leads to a disorder or disease state. Diseases that have multiple causative factors and present different symptoms in different individuals are termed “complex diseases”. These diseases do not obey the standard Mendelian patterns of inheritance. The disease-causing factors can be genetic, environmental or a combination of both [149].
Some examples of well-known complex diseases are Alzheimer’s disease, scleroderma, asthma, Parkinson’s disease, multiple sclerosis, diabetes, obesity and cancer. These diseases differ in their symptoms between individuals and can thus be divided into sub-diseases with overlapping symptoms, called endophenotypes [150], which are also found to differ in their causal factors. Some individuals are predisposed to certain diseases. Genetic predisposition means that the genetic makeup makes a person susceptible to the disease, but it does not mean the person will develop the disease. The gene products interact with the environment at the molecular level. Similarly, environmental factors that could potentially lead to a condition might not affect an individual, because the macromolecules within the cells do not support the action of the environment on the body. Thus, gene-environment coordination plays an important role in determining the course of disease and, as we cannot change our genes, environmental and lifestyle changes may help in preventing some diseases. Studies involving two complex diseases, childhood asthma and obesity, are part of this thesis and are thus discussed in further detail in Parts II and III respectively.

Part II Childhood Asthma

Asthma

Introduction

Asthma is one of the most common non-communicable diseases. According to the WHO, in 2013 approximately 235 million people were suffering from asthma. Asthma is one of the most common chronic diseases of childhood and the most frequent reason for paediatric hospitalisations [151]. It is characterised by recurrent attacks of breathlessness and wheezing. The majority of asthmatics are also atopic, being allergic to aeroallergens and food components. IgE, the central player in the allergic response, is found to be elevated in asthmatic individuals. Asthma shows significant heterogeneity in phenotypes, which has led to multiple classifications.
The phenotypic classification of asthma into early, transient, late-onset and persistent wheeze by the Tucson group has been widely popular [152]. In asthma, airway inflammation contributes to airway hyperresponsiveness and airflow limitation due to mucus hypersecretion or smooth muscle hypertrophy. Evidence also suggests a key role for respiratory infections in these processes.

Childhood asthma

Asthma occurring before the age of 18 years is treated as childhood asthma, and asthma occurring in infants is called early-onset asthma. It is known that sensitisation by microbial infections in early life reduces the risk of asthma [154]. The rise in asthma cases over the last few decades has been attributed to the absence of multiple infections at an early age [155]. Asthma in children is hard to diagnose, although it manifests symptoms similar to those in adults. Multiple risk factors have been found to be associated with childhood asthma [156]. Exposure of the fetus to maternal smoking [157], maternal atopy, preeclampsia and hypertension are associated with asthma and similar phenotypes in newborns [158]. Early sensitisation to aeroallergens has been shown to have predictive power for wheeze, bronchial responsiveness and loss of lung function [159]. Low birth weight is also a risk factor for asthma. Racial disparities prevail in asthma, with other socio-economic factors like size of the family, size of the house, poverty and mother’s age at the time of birth being associated with an increased risk of childhood asthma [160]. Environmental factors play protective as well as causative roles in childhood asthma [161]. Exposure to microbes in the early stages of life may be sufficient to stimulate the pattern-recognition receptors of the innate immune response, thus sensitising the body in the form of memory T cells and protecting it from more severe microbial infections, which could lead to asthma.

Figure I.1. Genome-wide spread of asthma-related genes [153].
Environmental conditions like air pollution, tobacco smoke and dampness in the house have adverse effects on childhood asthma. Multiple studies have associated more than 100 genetic factors with asthma, based on GWAS carried out on samples of different ethnicities (Figure I.1). Some of the strong associations have been replicated with the same power in separate studies. The 17q21 locus encoding orosomucoid-like 3 (ORMDL3) and gasdermin B (GSDMB) has been associated with childhood asthma in ethnically diverse subjects from Europe, North America and Asia [162]. Multiple IgE-controlling genes, e.g. FCER1A, are found to mediate asthma [163]. Other genes replicated in multiple studies as associated with asthma are DENND1B [164], the locus containing IL1RL1 [162] and IL18R1 [162], HLA-DQ [162], IL33 [162] and SMAD3 [162]. Stress has also been found to play a critical role in asthmatic attacks in children, connected to epigenetic and genetic alterations in the ADCYAP1R1 gene [165]. Replications of asthma GWAS have identified multiple loci associated with pulmonary function [166] and lung function [167], thus indicating a common cause behind related phenotypes. Asthma heritability is estimated to be 70-90%, and GWAS have so far identified a limited number of loci explaining a small percentage of this, leaving a larger portion still to be revealed. In a complex disease like asthma, which has a vast variety of symptoms, phenotypic classification can help in an effective search for the causative variants. Exacerbation, one of the asthma phenotypes, is defined as frequent admissions to hospital with asthma symptoms. Chapter 2 of the thesis includes a manuscript titled “A genome-wide association study identifies CDHR3 as a susceptibility locus for early childhood asthma with severe exacerbations”. In this study, GWAS was carried out on a cohort of children with exacerbations and normal adult controls.
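For a single SNP, the comparison between cases and controls that underlies such a study can be sketched as an allelic test on a 2x2 table of allele counts. This is a generic textbook illustration with hypothetical counts, not the statistical model used in the manuscript of chapter 2; the function name is likewise an assumption.

```python
import math


def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Odds ratio, Pearson chi-square (1 d.f.) and p-value for a 2x2 allele count table."""
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    n = case_alt + case_ref + control_alt + control_ref
    chi2 = (n * (case_alt * control_ref - case_ref * control_alt) ** 2
            / ((case_alt + case_ref) * (control_alt + control_ref)
               * (case_alt + control_alt) * (case_ref + control_ref)))
    # Survival function of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Hypothetical allele counts for one SNP
odds_ratio, chi2, p_value = allelic_association(
    case_alt=60, case_ref=40, control_alt=40, control_ref=60)
```

In a genome-wide setting this test is repeated for every genotyped SNP, so the resulting p-values are compared against a genome-wide significance threshold rather than the usual 0.05.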
Genotyping was performed with SNP arrays to identify loci associated with exacerbation. Because different SNPs from the same gene are reported as disease-associated in multiple GWAS with similar odds ratios, it is likely that not all causal variants have been found. Thus, there is a need to identify the missing causal variation within these genes. This can be done by selectively sequencing the candidate genes, which reveals not only single-point variation in these loci but also structural variation. The study in chapter 3 is based on sequencing of 16 candidate regions that have been associated with asthma and related phenotypes. Gene-environment interactions are important in the development and expression of asthma. Thus, clinical features need to be included along with genetic features when predicting asthma. Chapter 4 of this thesis aims to combine genotypes and clinical features using an artificial neural network (ANN) to predict asthma at age 7 years.

Chapter 2
Paper I - Genome-wide association analysis of childhood asthma

Prelude

Asthma has multiple dimensions, ranging from frequent wheeze to repeated hospitalisations, referred to as exacerbations. Amongst all the asthma phenotypes, exacerbations have the greatest impact on health care and treatment costs [168]. Schatz et al. [169] reported that exacerbation clusters separately from daily symptoms and lung function in discriminant analysis, suggesting that the factors responsible for exacerbation risk may differ from those for other asthma phenotypes. To find genetic factors for exacerbation, a GWAS was designed in a cohort of children. The cohort was stratified by number of exacerbations to discern variations responsible for differences in the severity of the phenotype. To test the robustness of the significant GWAS hits, the study was replicated in two birth cohorts of European ancestry. Replication in a cohort of mixed ancestry was done to examine causal variants across ethnicities.
Regional imputation, as described in section 1.3, was performed for the top hits to examine variants missed in genotyping. A novel SNP in CDHR3 (rs6967330) was analysed further. The study showed the importance of specific phenotyping when using small cohorts in GWAS.

My contribution to the project

My contribution to this project was to investigate the impact of the novel GWAS hit, rs6967330, on the CDHR3 gene product and its influence on asthma exacerbation outcome. Different databases were searched to collect the known literature about CDHR3 and related proteins. CDHR3 is a transmembrane protein with six extracellular cadherin domains, belonging to a family of membrane adhesion proteins. The members of the cadherin family mediate Ca2+-dependent cell-cell adhesion in all solid tissues. These proteins also modulate a wide variety of processes, including cell polarisation and migration [170] [171]. According to the UniProt Knowledgebase, these proteins preferentially interact within the protein family in a homophilic manner when connecting cells, and are thus suggested to contribute to the sorting of heterogeneous cell populations. Other members of the cadherin family, E-cadherin [172] and protocadherin-1 [173], have previously been associated with asthma-related traits.

Because not all genes are expressed in every cell of the body, the first aim was to find out whether CDHR3 is expressed in any asthma-related tissue. Gene expression data sets from GEO were curated, and those containing CDHR3 probes were queried to retrieve expression values for CDHR3. These expression values were normalised with respect to the other probes on the array. CDHR3 was found to be differentially over-expressed in lungs [174]. In another study of human post-mortem tissue samples, CDHR3 was found to be over-expressed in bronchi, trachea and lungs [175].
Gene expression profiling of the human hematopoietic system showed high expression of CDHR3 in B-lymphocytes compared with other white blood cells from healthy individuals [176]. CDHR3 was also found to be tenfold up-regulated in differentiating epithelial cells, a process involved in the development of the airway epithelium [177]. These findings were used in hypothesis generation and in designing the experiments to find the effect of the SNP.

SNP rs6967330 (G>A) is a non-synonymous variation in which cysteine, a medium-sized polar amino acid, is replaced by tyrosine, a large aromatic amino acid. The SNP lies in cadherin domain 5 of the protein. The ancestral allele "A" codes for tyrosine and is the frequent allele in mammals other than humans. The SNP is predicted to be deleterious by the SNP effect prediction tool Condel [101]. Based on these facts about the SNP and the gene, functional studies were designed to compare the expression of the non-mutated and mutated proteins. The homology model of mutated CDHR3 suggests that the SNP interferes with protein stabilisation and folding, which is in accordance with the experimental results.

LETTERS
© 2013 Nature America, Inc. All rights reserved.
A genome-wide association study identifies CDHR3 as a susceptibility locus for early childhood asthma with severe exacerbations Klaus Bønnelykke1,24,25, Patrick Sleiman2,24, Kasper Nielsen3,24, Eskil Kreiner-Møller1, Josep M Mercader4, Danielle Belgrave5,6, Herman T den Dekker7–9, Anders Husby1,10, Astrid Sevelsted1, Grissel Faura-Tellez11,12, Li Juel Mortensen1, Lavinia Paternoster13, Richard Flaaten1, Anne Mølgaard1, David E Smart10, Philip F Thomsen14, Morten A Rasmussen15, Silvia Bonàs-Guarch4, Claus Holst16, Ellen A Nohr17,18, Rachita Yadav3, Michael E March2, Thomas Blicher19, Peter M Lackie11, Vincent W V Jaddoe7,9,20, Angela Simpson5, John W Holloway11, Liesbeth Duijts8,9,21, Adnan Custovic5, Donna E Davies10, David Torrents4,22, Ramneek Gupta3, Mads V Hollegaard23, David M Hougaard23, Hakon Hakonarson2,25 & Hans Bisgaard1,25 Asthma exacerbations are among the most frequent causes of hospitalization during childhood, but the underlying mechanisms are poorly understood. We performed a genome-wide association study of a specific asthma phenotype characterized by recurrent, severe exacerbations occurring between 2 and 6 years of age in a total of 1,173 cases and 2,522 controls. Cases were identified from national health registries of hospitalization, and DNA was obtained from the Danish Neonatal Screening Biobank. We identified five loci with genome-wide significant association. Four of these, GSDMB, IL33, RAD50 and IL1RL1, were previously reported as asthma susceptibility loci, but the effect sizes for these loci in our cohort were considerably larger than in the previous genome-wide association studies of asthma. We also obtained strong evidence for a new susceptibility gene, CDHR3 (encoding cadherin-related family member 3), which is highly expressed in airway epithelium. These results demonstrate the strength of applying specific phenotyping in the search for asthma susceptibility genes. 
1 Copenhagen Prospective Studies on Asthma in Childhood, Health Sciences, University of Copenhagen & Danish Pediatric Asthma Center, Copenhagen University Hospital, Gentofte, Denmark.
2 Center for Applied Genomics, Children’s Hospital of Philadelphia (CHOP), Philadelphia, Pennsylvania, USA.
3 Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark.
4 Joint Institute for Research in Biomedicine and Barcelona Supercomputing Center (IRB-BSC) Program on Computational Biology, Barcelona Supercomputing Center, Barcelona, Spain.
5 Centre for Respiratory Medicine and Allergy, Institute of Inflammation and Repair, University of Manchester and University Hospital of South Manchester, Manchester, UK.
6 Centre for Health Informatics, Institute of Population Health, University of Manchester, Manchester, UK.
7 Generation R Study Group, Erasmus Medical Center, Rotterdam, The Netherlands.
8 Department of Pediatrics, Division of Respiratory Medicine, Erasmus Medical Center, Rotterdam, The Netherlands.
9 Department of Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands.
10 Brooke Laboratory, Clinical and Experimental Sciences, Faculty of Medicine, University of Southampton, University Hospital Southampton, Southampton, UK.
11 Faculty of Medicine, University of Southampton, Southampton General Hospital, Southampton, UK.
12 Pediatric Pulmonology and Pediatric Allergology, University of Groningen, University Medical Center Groningen, Beatrix Children’s Hospital, Groningen Research Institute for Asthma and COPD, Groningen, The Netherlands.
13 Integrative Epidemiology Unit, School of Social & Community Medicine, University of Bristol, Bristol, UK.
14 Center for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark.
15 Department of Food Science, University of Copenhagen, Copenhagen, Denmark.
16 Institute of Preventive Medicine, Copenhagen University Hospital, Copenhagen, Denmark.
17 Institute of Clinical Research, University of Southern Denmark, Aarhus, Denmark.
18 Department of Public Health, Section for Epidemiology, Aarhus University, Aarhus, Denmark.
19 Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
20 Department of Pediatrics, Erasmus Medical Center, Rotterdam, The Netherlands.
21 Department of Pediatrics, Division of Neonatology, Erasmus Medical Center, Rotterdam, The Netherlands.
22 Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain.
23 Danish Centre for Neonatal Screening, Department of Clinical Biochemistry and Immunology, Statens Serum Institut (SSI), Copenhagen, Denmark.
24 These authors contributed equally to this work.
25 These authors jointly directed this work.
Correspondence should be addressed to K.B. (kb@copsac.com).

Received 27 May; accepted 28 October; published online 17 November 2013; doi:10.1038/ng.2830
NATURE GENETICS ADVANCE ONLINE PUBLICATION

Acute asthma exacerbations are among the most frequent causes of hospitalization during childhood and are responsible for large healthcare expenditures1–4. Available treatment options for prevention and treatment of asthma exacerbations are inadequate5, suggesting that asthma with severe exacerbations may represent a distinct subtype of disease and demonstrating a need for improved understanding of its pathogenesis.

Asthma heritability is estimated to be 70–90% (refs. 6,7), but only a limited number of susceptibility loci have been verified in genome-wide association studies (GWAS)8–13. Larger GWAS may identify new susceptibility loci with smaller effects, but, owing to the large heterogeneity in asthma14, an alternative strategy is to increase phenotype specificity in genome-wide analyses. A specific phenotype is likely to be more closely related to a specific pathogenetic mechanism, and focusing on a particular phenotype may increase the power of genetic studies. We aimed to increase understanding of the genetic background of early childhood asthma with severe exacerbations by conducting a GWAS of this particular asthma phenotype.

Figure 1 Manhattan plot for the discovery genome-wide association analysis (x axis: chromosome; y axis: –log10 (P value); labelled peaks: GSDMB, IL33, RAD50, IL1R1 and CDHR3). The horizontal line indicates the genome-wide significance threshold (P < 5 × 10−8).

We identified children with recurrent acute hospitalizations for asthma occurring between 2 and 6 years of age (cases) from the Danish National Patient Register. We then extracted and amplified DNA from dried blood spot samples isolated from the Danish Neonatal Screening Biobank, as previously described15,16, before genome-wide array genotyping (Affymetrix Axiom CEU array). Case criteria were fulfilled for 2,029 of 1.7 million children born in Denmark between 1982 and 1995 (1.1/1,000 children). The final case cohort (Copenhagen Prospective Studies on Asthma in Childhood exacerbation cohort, COPSACexacerbation) after genotyping and quality control comprised 1,173 children (Supplementary Fig. 1). Compared to the general population, cases were more often boys (67 versus 51%) and more often had mothers who smoked during pregnancy (32 versus 15%) (Supplementary Tables 1 and 2). Controls consisted of 2,511 individuals of Danish descent without asthma who were previously genotyped (Illumina Human610-Quad v1.0 BeadChip). We analyzed association between disease and 124,514 SNPs genotyped in both cases and controls, and we accounted for population stratification by multidimensional scaling. The genomic inflation factor was 1.04. The genome-wide association analysis detected an excess of association signals beyond those expected by chance (Supplementary Fig. 2), and SNPs from five regions reached genome-wide significance (P < 5 × 10−8; Fig. 1 and Supplementary Fig.
3). The top SNPs from the five loci were rs2305480 in GSDMB (odds ratio (OR) = 2.28, P = 1.3 × 10−48), rs928413 near IL33 (OR = 1.50, P = 4.2 × 10−13), rs6871536 in RAD50 (OR = 1.44, P = 1.7 × 10−9), rs1558641 in IL1RL1 (OR = 1.56, P = 6.6 × 10−9) and rs6967330 in CDHR3 (OR = 1.45, P = 1.4 × 10−8) (Table 1). Validation of results for the top SNPs by regenotyping of cases and use of an alternative control population gave similar results (Supplementary Tables 3 and 4). Association analyses in the discovery cohort stratified on number of asthma-related hospitalizations showed higher OR with increasing number of hospitalizations for all five SNPs (Table 2). There was no significant interaction between the top SNPs and no effect modification by sex. We first sought replication in the childhood-onset stratum (with onset before 16 years of age) from a previous GWAS of asthma including 14,503 individuals conducted by the GABRIEL Consortium11 (Supplementary Table 5), which showed evidence of association for all 5 of the genome-wide significant loci reported here (Table 1). The CDHR3 locus was the only locus that had not previously been associated with asthma or any other atopic trait. We therefore followed up the top SNP from this locus (rs6967330) by further replication in a total of 3,975 children from 2 birth cohorts of European ancestry (COPSAC2000 and the Manchester Asthma and Allergy Study (MAAS)) and in 1 cohort with a population of mixed ancestry (Generation R). There was evidence for association with asthma before the age of 6 years in combined analyses of the three birth cohorts and in the combined replication sets (Table 1, Supplementary Fig. 4 and Supplementary Table 6), as well as in a subsample including the 980 individuals with nonEuropean ancestry (Supplementary Table 6). 
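The genomic inflation factor of 1.04 reported for the discovery analysis is conventionally obtained by dividing the median of the observed 1-d.f. association χ² statistics by the median of the χ² distribution with 1 d.f. (about 0.455). The sketch below illustrates that convention on simulated statistics only. The simulated data, the random seed and the function name `genomic_inflation` are illustrative assumptions; the exact computation used in the study is not described in this text.

```python
import random
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Median of the observed 1-d.f. chi-square statistics over the null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


# Simulated null statistics (squared standard normals), one per SNP.
# 124,514 mirrors the number of SNPs analysed; the values are NOT study data.
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(124_514)]
lam_null = genomic_inflation(null_stats)

# Uniformly inflating every statistic by 4% scales lambda by the same factor.
inflated_stats = [1.04 * s for s in null_stats]
lam_inflated = genomic_inflation(inflated_stats)

print(f"lambda (null):     {lam_null:.3f}")
print(f"lambda (inflated): {lam_inflated:.3f}")
```

A value close to 1 indicates that the bulk of the test statistics follows the null distribution, which is why a value such as 1.04 is usually read as little residual population stratification.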
Table 1 Discovery and replication results for the five genome-wide significant loci in the discovery analyses

Chr. | SNP effect allele | Nearest gene | Distance to gene (bp) | Effect allele frequency | Stage | OR (95% CI) | P value (fixed-effects model)a | P value (random-effects model) | P heterogeneity
17 | rs2305480[G] | GSDMB | 0 | 0.60 | Discovery | 2.28 (2.04–2.55) | 1.3 × 10−48 | – | –
 | | | | | Replication 1 | 1.32 (1.23–1.39) | 6.4 × 10−23 | 6.4 × 10−23 | 0.86
9 | rs928413[G] | IL33 | 2,418 | 0.28 | Discovery | 1.50 (1.34–1.67) | 4.2 × 10−13 | – | –
 | | | | | Replication 1 | 1.24 (1.17–1.32) | 8.8 × 10−13 | 2.5 × 10−6 | 0.007
5 | rs6871536[C] | RAD50 | 0 | 0.22 | Discovery | 1.44 (1.28–1.62) | 1.8 × 10−9 | – | –
 | | | | | Replication 1 | 1.17 (1.10–1.25) | 7.6 × 10−7 | 7.6 × 10−7 | 0.54
2 | rs1558641[G] | IL1R1 | 0 | 0.85 | Discovery | 1.56 (1.34–1.81) | 6.6 × 10−9 | – | –
 | | | | | Replication 1 | 1.11 (1.04–1.19) | 0.003 | 0.003 | 0.75
7 | rs6967330[A] | CDHR3 | 0 | 0.19 | Discovery | 1.45 (1.28–1.66) | 1.4 × 10−8 | – | –
 | | | | | Replication 1 | 1.18 (1.10–1.27) | 3.0 × 10−6 | 1.3 × 10−4 | 0.04
 | | | | | Replication 2 | 1.40 (1.16–1.67) | 3.2 × 10−4 | 3.2 × 10−4 | 0.87
 | | | | | Replications 1 + 2 | 1.21 (1.13–1.29) | 1.6 × 10−8 | 2.6 × 10−6 | 0.05
 | | | | | Discovery + replications 1 + 2 | 1.26 (1.18–1.33) | 2.7 × 10−14 | 2.7 × 10−7 | 0.02

Replication P values are shown in bold if significant after Bonferroni correction for the five loci tested (P < 0.01). Replication 1 results are from a previously published large-scale GWAS of asthma (asthma onset before 16 years; subanalysis of ref. 11). Replication 2 results are from the COPSAC2000, MAAS and Generation R cohorts (asthma onset before 6 years). Chr., chromosome.
aA fixed-effects model was not applied in the discovery analysis.

Table 2 Association results for the five genome-wide significant and replicated top SNPs stratified on number of hospitalizations for asthma or acute bronchitis from 0–6 years of age in the discovery cohort

SNP effect allele | Nearest gene | 2 hospitalizations (n = 272): OR (95% CI) | P value | 3 hospitalizations (n = 228): OR (95% CI) | P value | 4–5 hospitalizations (n = 277): OR (95% CI) | P value | 6 or more hospitalizations (n = 358): OR (95% CI) | P value | Association between number of hospitalizations and genotype: P valuea
rs2305480[G] | GSDMB | 1.87 (1.54–2.26) | 1.5 × 10−10 | 2.24 (1.81–2.78) | 2.1 × 10−13 | 2.24 (1.83–2.73) | 1.7 × 10−15 | 2.72 (2.26–3.28) | 3.5 × 10−27 | 0.002
rs928413[G] | IL33 | 1.32 (1.09–1.61) | 0.005 | 1.22 (0.98–1.50) | 0.07 | 1.47 (1.21–1.79) | 8.5 × 10−5 | 1.91 (1.61–2.26) | 6.2 × 10−14 | 2.4 × 10−4
rs6871536[C] | RAD50 | 1.31 (1.06–1.61) | 0.01 | 1.26 (1.00–1.59) | 0.05 | 1.45 (1.18–1.78) | 3.6 × 10−4 | 1.58 (1.31–1.89) | 1.3 × 10−6 | 0.09
rs1558641[G] | IL1R1 | 1.53 (1.16–2.02) | 0.002 | 1.20 (0.91–1.57) | 0.20 | 1.32 (1.02–1.71) | 0.04 | 2.19 (1.66–2.90) | 3.2 × 10−8 | 0.02
rs6967330[A] | CDHR3 | 1.23 (0.98–1.56) | 0.07 | 1.37 (1.07–1.75) | 0.01 | 1.42 (1.13–1.78) | 0.003 | 1.63 (1.33–1.97) | 1.6 × 10−6 | 0.04

Only the 1,135 children with full follow-up were included. The number of controls was 2,511 for all analyses.
aMantel-Haenszel test for linear association.

Phenotype-specific replication was possible in the COPSAC2000 and MAAS birth cohorts with prospective registration of acute asthma hospitalizations and exacerbations from birth to 6 years of age in a total of 1,091 children. The rs6967330 risk allele (A) was associated with greater risk of asthma hospitalizations (hazards ratio (HR) = 1.7 (95% confidence interval (CI) = 1.2–2.4), P = 0.002) and severe exacerbations (HR = 1.4 (95% CI = 1.1–1.9), P = 0.007) in combined analyses (Fig. 2, Supplementary Fig. 5 and Supplementary Table 6). In COPSAC2000, we observed a trend in the direction of increased neonatal bronchial responsiveness associated with the rs6967330 risk allele (P = 0.10) (Supplementary Table 7). There was no association with eczema in any of the three birth cohorts, and data on allergic sensitization were inconsistent (Supplementary Table 6).
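The combined replication estimate for rs6967330 in Table 1 can be approximately re-derived from the two replication rows with a standard inverse-variance fixed-effects meta-analysis. The sketch below is an illustration under stated assumptions, not the authors' pipeline: it recovers each log odds ratio and standard error from the rounded OR and 95% CI printed in the table, assumes normal-theory (Wald) intervals, and uses hypothetical function names.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval


def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Recover log(OR) and its standard error from a reported 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z95)
    return math.log(odds_ratio), se


def fixed_effects(estimates):
    """Inverse-variance weighted pooling of (log_or, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def two_sided_p(beta, se):
    """Two-sided P value of a Wald z test."""
    return math.erfc(abs(beta / se) / math.sqrt(2))


# rs6967330 (CDHR3), values as printed in Table 1 (already rounded).
replication_1 = log_or_and_se(1.18, 1.10, 1.27)
replication_2 = log_or_and_se(1.40, 1.16, 1.67)

beta, se = fixed_effects([replication_1, replication_2])
pooled_or = math.exp(beta)
ci = (math.exp(beta - Z95 * se), math.exp(beta + Z95 * se))
p = two_sided_p(beta, se)

print(f"pooled OR = {pooled_or:.2f} ({ci[0]:.2f}-{ci[1]:.2f}), P = {p:.1e}")
```

With these rounded inputs the pooled odds ratio and interval agree with the "Replications 1 + 2" row, 1.21 (1.13–1.29); the P value lands in the same range as the tabulated one but is not expected to match exactly, because the inputs are rounded.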
The top SNP at the CDHR3 locus (rs6967330) is a nonsynonymous coding SNP, where the risk allele (A), corresponding to the minor allele, results in an amino acid change from cysteine to tyrosine at position 529. This SNP is the only known nonsynonymous variant in this linkage disequilibrium (LD) region, but there are other variants located within Encyclopedia of DNA Elements (ENCODE)predicted regulatory regions that are in moderate to high LD (r2 > 0.5) with the sentinel SNP (Supplementary Table 8). Two SNPs with partial LD (r2 = 0.71 and 0.58) were also associated with asthma in the discovery analysis but with less statistical significance. A similar association pattern with rs6967330 as the top SNP was observed in the GABRIEL (replication) study (Supplementary Fig. 6) and in the Generation R (replication) subsample of individuals with nonEuropean ancestry (Supplementary Fig. 7), suggesting that rs6967330 might be the causal gene variant at this locus. We investigated the potential functional consequences of the top variant in CDHR3 (rs6967330; p.Cys529Tyr) by generating an expression construct encoding tagged human CDHR3 and introducing the mutation encoding p.Cys529Tyr (A allele at rs6967330 resulting in mutation of cysteine 529 to tyrosine) by site-directed mutagenesis. We transfected the constructs for wild-type and mutant CDHR3 into 293T cells. Consistent results from six independent experiments involving flow cytometry (n = 3) (Supplementary Fig. 8) and immunofluorescence staining (n = 3) (Supplementary Fig. 9) showed that the wild-type protein was expressed at very low levels at the cell surface, whereas the Cys529Tyr mutant showed a marked increase in cell surface expression (Supplementary Note). These results support the possibility that rs6967330 represents the causal variant at this locus. 
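The cysteine-to-tyrosine change described above follows directly from the standard genetic code: both cysteine codons (TGT, TGC) become tyrosine codons (TAT, TAC) when the middle base changes from G to A. The sketch below shows this with a deliberately minimal codon table. Which of the two cysteine codons encodes residue 529, and the strand on which the variant is reported, are not stated in the text, so both codons are shown as an assumption; the function name `apply_snv` is hypothetical.

```python
# Subset of the standard genetic code needed for this illustration.
CODON_TO_AA = {"TGT": "Cys", "TGC": "Cys", "TAT": "Tyr", "TAC": "Tyr"}


def apply_snv(codon, position, alt_base):
    """Return the codon with the base at `position` (0-based) replaced."""
    return codon[:position] + alt_base + codon[position + 1:]


results = {}
for ref_codon in ("TGT", "TGC"):
    # G>A at the second codon position, written on the coding strand.
    alt_codon = apply_snv(ref_codon, 1, "A")
    results[ref_codon] = (alt_codon, CODON_TO_AA[ref_codon], CODON_TO_AA[alt_codon])
    print(f"{ref_codon} ({CODON_TO_AA[ref_codon]}) -> {alt_codon} ({CODON_TO_AA[alt_codon]})")
```

Either way, a single G>A substitution is enough to produce p.Cys529Tyr, which is the change introduced by site-directed mutagenesis in the expression construct.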
A recent study17 reported that a SNP (rs17152490) in high LD (r2 = 0.69) with our top SNP was associated with lung expression of CDHR3, further supporting a functional role for this locus.

CDHR3 is a transmembrane protein with six extracellular cadherin domains. Protein structure modeling showed that the risk-associated alteration (p.Cys529Tyr) was located at the interface between two membrane-proximal cadherin domains, D5 and D6 (Fig. 3). Interestingly, Cys592 and Cys566, which are expected to form a disulfide bridge within D6, are close to Cys529 in D5, and the short distance between them could allow disulfide rearrangement (for the wild-type, non-risk cysteine variant). The location of the variant residue at the domain interface suggests that the variant residue may interfere with interdomain stabilization, overall protein stability, folding or conformation, in agreement with the observation in our experimental studies of altered cell surface expression.

Figure 2 Cumulative risk of asthma hospitalization during the first 6 years of life stratified on CDHR3 (rs6967330) genotype (x axis: age in years, 0–6; y axis: risk of hospitalization, 0–0.4; curves: AA, AG and GG). Data are from combined analysis of the COPSAC2000 and MAAS birth cohorts (replication), including a total of 1,091 children, of whom 92 were hospitalized for asthma. Genotype distribution was as follows: AA, 30 individuals; AG, 312 individuals; GG, 749 individuals. The P value for the association between genotype and risk of hospitalization was 0.002 (Cox regression analysis using an additive genetic model).

Figure 3 Overview of the CDHR3 protein model (panel labels: membrane, extracellular, intracellular, model, D1–D6). The model covers cadherin domains 2–6 (D2–D6) and is based on the structure of the entire mouse N-cadherin ectodomain (Protein Data Bank (PDB) 3Q2W; domains 1–5). The location of the alteration at position 529 is indicated with a blue star. The distance between residue 529 and the disulfide bridge in D6 (between residues 566 and 592) is approximately 20 Å.

The biological function of CDHR3 is unknown, but it belongs to the cadherin family of transmembrane proteins involved in homologous cell adhesion and important for several cellular processes, including epithelial polarity, cell-cell interaction and differentiation18. Other members of the cadherin family have been associated with asthma and related traits, including E-cadherin19 and protocadherin-1 (ref. 20). We demonstrated protein expression of CDHR3 in bronchial epithelium from adults and in fetal lung tissue (Supplementary Fig. 10). CDHR3 was previously found to be highly expressed in normal human lung tissue21 and specifically in the bronchial epithelium22. CDHR3 (probe 235650_at) was upregulated by tenfold in differentiating epithelial cells (with a rank of 123 out of more than 47,000 transcripts ranked by magnitude of upregulation)23 and seems to be highly expressed in the developing human lung24.

There is an increasing focus on the role of the airway epithelium in asthma pathogenesis. Structural or functional abnormalities in the epithelium may increase susceptibility to environmental stimuli by exaggerating immune responses and structural changes in underlying tissues and increasing airway reactivity25. Epithelial integrity is dependent on the interaction of proteins in cell-cell junction complexes, including adhesion molecules. Studies have shown impaired tight junction function26 and reduced E-cadherin expression27 in the airway epithelium of individuals with asthma.
CDHR3 is a plausible candidate gene for asthma because of its high level of expression in the airway epithelium and the known role of cadherins in cell adhesion and interaction. Most asthma exacerbations in children are caused by respiratory infections, predominantly common viral infections such as rhinovirus28, but bacterial infection may also have a role29, as well as exposure to air pollution30. It is therefore plausible that CDHR3 variation increases susceptibility to respiratory infections or other airway irritants through impaired epithelial integrity and/or disordered repair processes. Interestingly, the CDHR3 asthma risk allele is the ancestral allele. Public data from protein databases suggest that humans are unique among 36 other vertebrate species in having the derived (non-risk) allele resulting in a cysteine at position 529 (Supplementary Table 9), which is now the wild-type allele in most human populations (Human Genome Diversity Project (HGDP) selection browser; see URLs). This finding suggests that the risk (ancestral) allele, associated with increased surface expression of CDHR3, may have been advantageous during early human evolution. This phenomenon in which the ancestral allele is the risk allele is known for other common diseases and may reflect a shift from a beneficial to a deleterious effect for a particular allele as a result of a changing environment31. The CDHR3 variant seems to be associated with an asthma phenotype of early onset, as demonstrated by the strongest replication of association in the GABRIEL stratum with asthma onset before 16 years of age (Supplementary Table 10) and in the second replication including children with asthma onset before 6 years of age (Table 1). Increased risk was already demonstrated in the first year of life (Fig. 2), particularly in children who were homozygous for the risk allele (A). 
This finding is in line with the tendency toward association of increased airway reactivity in neonates with the risk allele and findings of CDHR3 expression in the fetal lung. CDHR3 variation also seems to be more strongly associated with an asthma phenotype with exacerbations (Supplementary Table 6), particularly with recurrent exacerbations (Table 2 and Supplementary Table 6). The top locus in this study, on chromosome 17q12-21, has consistently been associated with childhood-onset asthma11,13. The effect size in the present study is remarkably high, with an OR of 2.3 that increases to 2.7 for the children with the highest number of exacerbations. This finding suggests a key role for this locus in severe exacerbations in early childhood, in line with a previous report from the COPSAC2000 birth cohort study32. Genome-wide significant association with asthma has previously been shown for variants in or near IL33, RAD50-IL13 and IL1RL1 (refs. 11,33). The fact that the top loci in our study were generally shared with previous GWAS of asthma suggests that early-onset asthma with severe exacerbations is at least partly driven by multiple common variants in the same genes that contribute to asthma without severe exacerbations. The sample size of the present GWAS was less than one-fifth that of the largest published GWAS of asthma (GABRIEL)11, and, yet, we found a similar number of genome-wide significant loci, similar statistical significance and considerably larger effect estimates. Further increasing phenotypic specificity by stratified analysis in the 358 children with the highest number of exacerbations resulted in an additional increase in effect estimates, with ORs between 1.6 and 2.7 per risk allele, and strong statistical significance. Effect estimates were also higher than previously reported when replicating the exact top SNP from the GABRIEL study (Supplementary Table 11).
This finding demonstrates that specific phenotyping is a helpful approach in the search for asthma susceptibility genes. The narrow age criteria (2–6 years) for disease may be an important phenotypic characteristic, as heritability has been demonstrated to be higher for early-onset asthma34. The method of case identification through national registries allowed us to define a specific and rare phenotype of repeated acute hospitalizations in young children from 2 to 6 years of age, which, to our knowledge, has not previously been done in a GWAS. One limitation of this study is that we had relatively poor genome-wide coverage (approximately 125,000 SNPs).

In conclusion, our results demonstrate the strength of specific phenotyping in genetic studies of asthma. Future research focusing on understanding the role of CDHR3 variants in the development of asthma and severe exacerbations may increase understanding and improve treatment of this clinically important disease entity.

URLs. HGDP selection browser data for rs6967330, http://hgdp.uchicago.edu/cgi-bin/alfreqs.cgi?pos=105445687&chr=chr7&rs=rs6967330&imp=false.

METHODS
Methods and any associated references are available in the online version of the paper.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

ACKNOWLEDGMENTS
A full list of acknowledgments for each study is given in the Supplementary Note.

AUTHOR CONTRIBUTIONS
K.B. was the main author responsible for designing the study, analyzing and interpreting data, writing the manuscript and directing the work. He had full access to the data and final responsibility for the decision to submit the work for publication. H.B. contributed to design of the study, analysis of data and writing of the manuscript. P.S. and H.H. contributed to design of the study and analysis of data in relation to whole-genome genotyping.
K.N. performed the GWAS analysis and contributed to regional imputation. E.K.-M., A. Sevelsted, M.A.R., R.Y. and R.G. contributed to data analysis. J.M.M., S.B.-G. and D.T. directed and contributed to regional imputation and data analyses. M.V.H. and D.M.H. were responsible for subject identification, collection of dried blood spots and DNA extraction and amplification. K.B., E.K.-M., L.J.M., R.F. and A.M. contributed to data acquisition. T.B. performed modeling of the CDHR3 protein structure. L.P., C.H. and E.A.N. were responsible for data from the discovery control cohort. H.H. and M.E.M. were responsible for the functional studies of the CDHR3 variant involving flow cytometry. A.H., D.E.S. and D.E.D. were responsible for the experimental studies involving immunofluorescence staining. A. Simpson, A.C. and D.B. were responsible for data from the MAAS cohort. H.T.d.D., L.D. and V.W.V.J. were responsible for data from the Generation R cohort. G.F.-T., P.M.L. and J.W.H. were responsible for the studies of lung tissue. P.F.T. studied the evolutionary aspects of the CDHR3 risk variant (rs6967330). All coauthors provided important intellectual input to the study and approved the final version of the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online version of the paper.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Kocevar, V.S. et al. Variations in pediatric asthma hospitalization rates and costs between and within Nordic countries. Chest 125, 1680–1684 (2004). 2. Lozano, P., Sullivan, S.D., Smith, D.H. & Weiss, K.B. The economic burden of asthma in US children: estimates from the National Medical Expenditure Survey. J. Allergy Clin. Immunol. 104, 957–963 (1999). 3. Matterne, U., Schmitt, J., Diepgen, T.L. & Apfelbacher, C.
Children and adolescents’ health-related quality of life in relation to eczema, asthma and hay fever: results from a population-based cross-sectional study. Qual. Life Res. 20, 1295–1305 (2011). 4. Smith, D.H. et al. A national estimate of the economic costs of asthma. Am. J. Respir. Crit. Care Med. 156, 787–793 (1997). 5. Bush, A. Practice imperfect—treatment for wheezing in preschoolers. N. Engl. J. Med. 360, 409–410 (2009). 6. Duffy, D.L., Martin, N.G., Battistutta, D., Hopper, J.L. & Mathews, J.D. Genetics of asthma and hay fever in Australian twins. Am. Rev. Respir. Dis. 142, 1351–1358 (1990). NATURE GENETICS ADVANCE ONLINE PUBLICATION 7. van Beijsterveldt, C.E. & Boomsma, D.I. Genetics of parentally reported asthma, eczema and rhinitis in 5-yr-old twins. Eur. Respir. J. 29, 516–521 (2007). 8. Ferreira, M.A. et al. Identification of IL6R and chromosome 11q13.5 as risk loci for asthma. Lancet 378, 1006–1014 (2011). 9. Gudbjartsson, D.F. et al. Sequence variants affecting eosinophil numbers associate with asthma and myocardial infarction. Nat. Genet. 41, 342–347 (2009). 10. Himes, B.E. et al. Genome-wide association analysis identifies PDE4D as an asthmasusceptibility gene. Am. J. Hum. Genet. 84, 581–593 (2009). 11. Moffatt, M.F. et al. A large-scale, consortium-based genomewide association study of asthma. N. Engl. J. Med. 363, 1211–1221 (2010). 12. Sleiman, P.M. et al. Variants of DENND1B associated with asthma in children. N. Engl. J. Med. 362, 36–44 (2010). 13. Torgerson, D.G. et al. Meta-analysis of genome-wide association studies of asthma in ethnically diverse North American populations. Nat. Genet. 43, 887–892 (2011). 14. Anderson, G.P. Endotyping asthma: new insights into key pathogenic mechanisms in a complex, heterogeneous disease. Lancet 372, 1107–1119 (2008). 15. Hollegaard, M.V. et al. Genome-wide scans using archived neonatal dried blood spot samples. BMC Genomics 10, 297 (2009). 16. Hollegaard, M.V. et al. 
Robustness of genome-wide scanning using archived dried blood spot samples as a DNA source. BMC Genet. 12, 58 (2011). 17. Hao, K. et al. Lung eQTLs to help reveal the molecular underpinnings of asthma. PLoS Genet. 8, e1003029 (2012). 18. Hulpiau, P. & van Roy, F. Molecular evolution of the cadherin superfamily. Int. J. Biochem. Cell Biol. 41, 349–369 (2009). 19. Nawijn, M.C., Hackett, T.L., Postma, D.S., van Oosterhout, A.J. & Heijink, I.H. E-cadherin: gatekeeper of airway mucosa and allergic sensitization. Trends Immunol. 32, 248–255 (2011). 20. Koppelman, G.H. et al. Identification of PCDH1 as a novel susceptibility gene for bronchial hyperresponsiveness. Am. J. Respir. Crit. Care Med. 180, 929–935 (2009). 21. Yanai, I. et al. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics 21, 650–659 (2005). 22. McCall, M.N., Uppal, K., Jaffee, H.A., Zilliox, M.J. & Irizarry, R.A. The Gene Expression Barcode: leveraging public data repositories to begin cataloging the human and murine transcriptomes. Nucleic Acids Res. 39, D1011–D1015 (2011). 23. Ross, A.J., Dailey, L.A., Brighton, L.E. & Devlin, R.B. Transcriptional profiling of mucociliary differentiation in human airway epithelial cells. Am. J. Respir. Cell Mol. Biol. 37, 169–185 (2007). 24. Kho, A.T. et al. Transcriptomic analysis of human lung development. Am. J. Respir. Crit. Care Med. 181, 54–63 (2010). 25. Holgate, S.T. The sentinel role of the airway epithelium in asthma pathogenesis. Immunol. Rev. 242, 205–219 (2011). 26. Xiao, C. et al. Defective epithelial barrier function in asthma. J. Allergy Clin. Immunol. 128, 549–556 (2011). 27. de Boer, W.I. et al. Altered expression of epithelial junctional proteins in atopic asthma: possible role in inflammation. Can. J. Physiol. Pharmacol. 86, 105–112 (2008). 28. Johnston, S.L. et al. Community study of role of viral infections in exacerbations of asthma in 9–11 year old children. Br. Med. J. 
310, 1225–1229 (1995). 29. Bisgaard, H. et al. Association of bacteria and viruses with wheezy episodes in young children: prospective birth cohort study. Br. Med. J. 341, c4978 (2010). 30. Iskandar, A. et al. Coarse and fine particles but not ultrafine particles in urban air trigger hospital admission for asthma in children. Thorax 67, 252–257 (2012). 31. Di Rienzo, A. & Hudson, R.R. An evolutionary framework for common diseases: the ancestral-susceptibility model. Trends Genet. 21, 596–601 (2005). 32. Bisgaard, H. et al. Chromosome 17q21 gene variants are associated with asthma and exacerbations but not atopy in early childhood. Am. J. Respir. Crit. Care Med. 179, 179–185 (2009). 33. Li, X. et al. Genome-wide association study of asthma identifies RAD50-IL13 and HLA-DR/DQ regions. J. Allergy Clin. Immunol. 125, 328–335 (2010). 34. Thomsen, S.F., Duffy, D.L., Kyvik, K.O. & Backer, V. Genetic influence on the age at onset of asthma: a twin study. J. Allergy Clin. Immunol. 126, 626–630 (2010).

© 2013 Nature America, Inc. All rights reserved.

ONLINE METHODS
The individual studies are described in further detail in the Supplementary Note. COPSAC exacerbation cohort (GWAS). This is a register-based cohort of children with asthma who were identified and characterized from national health registries. The study was approved by the Ethics Committee for Copenhagen (H-B-2998-103) and the Danish Data Protection Agency (2008-41-2622). According to Danish law, research ethics committees can grant exemption from obtaining informed consent for research projects based on biobank material under certain circumstances. For this study, such an exemption was granted (H-B-2998-103). Case selection. Children with repeated acute hospitalizations (cases) were identified in the Danish National Patient Register covering all diagnoses of discharges from Danish hospitals35. Information on birth-related events was obtained from the national birth register.
Inclusion criteria were at least two acute hospitalizations for asthma (ICD8-codes 493, ICD-10 codes J45-46) from 2 to 6 years of age (both years included). Duration of hospitalization had to be more than 1 d, and two hospitalizations had to be separated by at least 6 months. Exclusion criteria were side diagnosis during hospitalization, registered chronic diagnosis considered to affect risk of hospitalization for asthma, low birth weight (<2.5 kg) or gestational age of under 36 weeks at birth. Cases were further characterized with respect to the number of hospitalizations from asthma and acute bronchitis and for concurrent atopy. DNA sampling and genotyping of cases. DNA was obtained from blood spots sampled as part of the Danish neonatal screening program and stored in the Danish Neonatal Screening Biobank36. Two disks, each 3.2 mm in diameter, were punched from each blood spot. DNA was extracted, and the whole genome for each individual sample was amplified in triplicate as previously described15,16. Cases were genotyped on the Affymetrix Axiom CEU array (567,090 SNPs). Top SNPs from the five genome-wide significant loci were regenotyped with the PCR KASPar genotyping system (KBiosciences) to validate the results (Supplementary Table 3). Two additional SNPs in the proximity of the newly discovered CDHR3 variant were genotyped for further exploration of the region encompassing it. Controls. The control population was randomly drawn from two large Danish cohorts: the Danish National Birth Cohort (females) and the Copenhagen draft board examinations (males). Individuals who indicated in a questionnaire that they had physician-diagnosed asthma were excluded. Genome-wide genotyping had previously been performed as part of the Genomics of Overweight in Young Adults (GOYA) study37 on the Illumina Human610-Quad v1.0 BeadChip (545,350 SNPs). 
Potential bias introduced by differences in chemistry between the different platforms used for cases and controls (Affymetrix and Illumina, respectively) was investigated by also using control data from the Wellcome Trust Case Control Consortium 2 (WTCCC2) project that performed genotyping on an Affymetrix platform (Affymetrix 6.0) (Supplementary Table 4). Replication in a previously published GWAS. Replication of the five genome-wide significant loci from the discovery analysis was sought in publicly available data from a GWAS performed by the GABRIEL Consortium11. This replication included 19 studies of childhood-onset asthma (onset before 16 years of age) with a total of 6,783 cases and 7,720 controls. Replication in birth cohorts for the CDHR3 top SNP. The COPSAC2000 replication cohort. Replication and phenotypic characterization of the CDHR3 risk locus were sought in the COPSAC2000 cohort, a prospective clinical study of a birth cohort of 411 children. This cohort does not overlap with the COPSAC exacerbation discovery study. The COPSAC2000 cohort study was approved by the Ethics Committee for Copenhagen (KF 01-289/96) and the Danish Data Protection Agency (2008-41-1754), and informed consent was obtained from both parents of each child. All mothers had a history of a doctor’s diagnosis of asthma after 7 years of age. Newborns were enrolled in the first month of life, as previously described in detail38–40. This cohort is characterized by deep phenotyping during close clinical follow-up. Doctors employed in the clinical research unit were acting primary physicians for the children from the cohort and diagnosed and treated respiratory and skin symptoms, and asthmatic symptoms were recorded in daily diaries41.
Acute, severe exacerbations from birth to 6 years of age were defined as requiring the use of oral prednisolone or high-dose inhaled corticosteroid for wheezy symptoms, prescribed at the discretion of the doctor in the clinical research unit, or by acute hospitalization at a local hospital for such symptoms32. Asthma from birth to 7 years of age was diagnosed on the basis of predefined algorithms of symptoms and response to treatment, as previously described40. Neonatal spirometry and analysis of neonatal bronchial responsiveness to methacholine were carried out by 4 weeks of age, applying the raised volume, rapid thoracic compression technique. Lung function was measured by spirometry in the child’s seventh year of life. Specific airway resistance (sRaw) was measured at 4 and 6 years by whole-body plethysmography. Bronchial responsiveness at ages 4 and 6 years was determined as the relative change in sRaw after hyperventilation of cold, dry air. Allergic sensitization against common inhalant allergens was determined at 6 years of age by measurement of serum-specific IgE levels. Atopic dermatitis was diagnosed using the Hanifin-Rajka criteria42 from birth to 7 years of age. High-throughput genome-wide SNP genotyping was performed using the Illumina Infinium II HumanHap550 v1, v3 or Quad BeadChip platform at the Children’s Hospital of Philadelphia’s Center for Applied Genomics. We excluded SNPs with call rate of <95%, minor allele frequency (MAF) of <1% or Hardy-Weinberg equilibrium P value of <1 × 10−5. rs6967330 was a genotyped SNP on this array. MAAS replication cohort. The Manchester Asthma and Allergy Study is a population-based birth cohort described in detail elsewhere43. Subjects were recruited prenatally and were followed prospectively. The study was approved by the local research ethics committee (South Manchester, reference 03/SM/400). Parents gave written informed consent. Participants attended follow-up at ages 1, 3 and 5 years of age. 
For asthma, validated questionnaires were administered by interviewers to collect information on parentally reported symptoms, physician-diagnosed asthma and treatments received. ‘Current wheeze and asthma treatment’ was defined as parentally reported wheeze in the past 12 months. ‘Asthma ever’ was defined as positive if, at any given time point, two of three responses were positive to the following questions: “Has your child wheezed within the past 12 months?”, “Does your child currently take asthma medication?” or “Has a doctor ever told you that your child has asthma?” Controls were defined as children with none of these symptoms. For exacerbations, a pediatrician extracted data from primary-care medical records, including information on diagnosis with wheeze and/or asthma, all prescriptions (including inhaled corticosteroids (ICS) and β2 agonists), unscheduled visits and hospital admissions for asthma and/or wheeze during the first 8 years of life. Following American Thoracic Society guidelines, we defined asthma exacerbations by either admission to a hospital or an emergency department visit and/or by receipt of oral corticosteroids for at least 3 d44. DNA samples were genotyped on the Illumina Human610-Quad BeadChip. Genotypes were called using the Illumina GenCall application, following the manufacturer’s instructions. Quality control criteria for samples included call rate of greater than 97%, exclusion of samples with outlier autosomal heterozygosity and sex validation. We excluded SNPs with call rate of <95%, Hardy-Weinberg equilibrium P value of >5.9 × 10⁻⁷ and MAF of <0.005. We then performed a look-up for SNP rs6967330, which showed a genotyping success rate of 100% and a Hardy-Weinberg equilibrium P value of 0.4164. Generation R replication cohort. The Generation R Study is a population-based prospective cohort study of pregnant women and their children from fetal life onward in Rotterdam, The Netherlands45.
The study protocol was approved by the Medical Ethical Committee of the Erasmus Medical Center, Rotterdam (MEC 217.595/2002/20). Written informed consent was obtained from all mothers and biological fathers or legal guardians. Information on wheezing, asthma and eczema was collected for the children by questionnaires at the ages of 1 to 4 and 6 years46. Questions about wheezing included: “Has your child had problems with a wheezing chest during the last year? (never, 1–3 times, >4 times) (age 1 to 4 years)” and “Did your child ever suffer from chest wheezing? (never, 1–3 times, >4 times) (age 6 years).” Questions about asthma included: “Has a doctor diagnosed your child as having asthma during the past year? (yes, no) (age 2 and 4 years)” and “Was your child ever diagnosed with asthma by a doctor? (yes, no) (age 3 and 6 years).” On the basis of the last obtained questionnaire, we grouped children as having ‘asthma ever before 6 years of age’. Reported asthma at 2, 3 or 4 years of age was used to reclassify children included in this group where appropriate. We then recategorized children as those with an asthma diagnosis before 3 years of age and at 3 years of age or older. Reported numbers of wheezing episodes at 1 and 2 years of age and at 3 to 6 years of age, respectively, were used to reclassify asthma diagnosis before and at 3 years of age into ‘asthma diagnosis or ≥3 episodes of wheezing before 3 years of age’. Questions about eczema included: “Has a doctor diagnosed your child as having eczema during the past year? (yes, no) (age 1 to 4 years)” and “Was your child ever diagnosed with eczema by a doctor? (yes, no) (6 years).” As with asthma, we grouped children into those with ‘eczema ever before 6 years of age’ on the basis of the last obtained questionnaire and used reported eczema at 1 or 4 years of age to reclassify children included in this group where appropriate.
Samples were genotyped using Illumina Infinium II HumanHap610 Quad arrays, following standard manufacturer’s protocols. Intensity files were analyzed using BeadStudio Genotyping Module software v.3.2.32, and genotypes were called using default cluster files. Any sample with a call rate of less than 97.5%, excess autosomal heterozygosity (F < mean – 4 s.d.) or mismatch between called and phenotypic sex was excluded. rs6967330 was a genotyped SNP in this set. Individuals identified as genetic outliers by identity-by-state (IBS) clustering analysis (>3 s.d. away from the mean for the HapMap CEU population (Utah residents of Northern and Western European ancestry)) were considered to have non-European ancestry. Ancestry determination analysis included genomic data from all Generation R individuals merged with data for three reference panels from Phase 2 of the HapMap Project (YRI (Yoruba from Ibadan, Nigeria), CHB + JPT (Han Chinese in Beijing, China, and Japanese in Tokyo, Japan) and CEU). Analysis of association between an asthma or eczema phenotype and GWAS SNPs was carried out using a regression framework, adjusting for population stratification in the Generation R cohort using MACH2QTL, as implemented in GRIMP. Ten genomic principal components obtained after the application of SNP quality exclusion criteria and LD pruning were used to adjust for population substructure in the combined population; four principal components were used for the European subpopulation and eight principal components were used for the non-European subpopulation. Individuals were grouped as having European (n = 1,962; 64.5%) or non-European (n = 1,078; 35.5%) ancestry on the basis of genetic ancestry. On the basis of information on the country of birth of parents and grandparents obtained by questionnaires, the largest non-European ancestry groups included individuals of Turkish (5.4%), Surinamese (4.6%), Dutch Antillean (4.0%), Moroccan (2.9%) and Cape Verdean (2.3%) origin.
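The sample-level exclusion rules described for the Generation R genotype data (call rate below 97.5%, F below the mean minus 4 s.d., or a sex mismatch) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the record layout (`id`, `call_rate`, `f`, `called_sex`, `reported_sex`) and the function name are hypothetical, and computing the mean and s.d. of F over all samples in the batch is an assumption.

```python
from statistics import mean, stdev


def failing_samples(samples, min_call_rate=0.975, n_sd=4.0):
    """Return the IDs of samples that fail any of the three exclusion rules.

    Each sample is a dict with hypothetical keys: id, call_rate, f,
    called_sex and reported_sex.
    """
    f_values = [s["f"] for s in samples]
    # Excess heterozygosity shows up as an unusually low inbreeding
    # coefficient F, so the cutoff is the mean minus n_sd standard deviations.
    f_floor = mean(f_values) - n_sd * stdev(f_values)

    failed = []
    for s in samples:
        low_call_rate = s["call_rate"] < min_call_rate
        excess_heterozygosity = s["f"] < f_floor
        sex_mismatch = s["called_sex"] != s["reported_sex"]
        if low_call_rate or excess_heterozygosity or sex_mismatch:
            failed.append(s["id"])
    return failed
```

Because the F cutoff depends on the batch mean and s.d., a single extreme sample can only exceed 4 s.d. when the batch is reasonably large; in practice such thresholds are applied to hundreds or thousands of samples.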
Statistical analyses. Genome-wide association analysis. Quality control was carried out separately on cases and controls. This included filtering on SNP call rate (>99%) and sample call rate (>98%) and tests for excess heterozygosity, deviation from Hardy-Weinberg equilibrium, sex mismatch and familial relatedness. Non-European individuals were excluded on the basis of deviation from the HapMap CEU reference panel (release 22). Indication of population stratification or genotyping bias was tested by multidimensional scaling (MDS) after quality control. This analysis showed evidence of association with disease status for the first seven MDS components, and these were therefore included as covariates in the association analysis. Additional analyses including the first 100 MDS components did not materially alter the results. Merged data for SNPs present on both arrays after quality control were used for association testing with PLINK (v. 1.07) using a logistic additive model, adjusting for the first seven MDS components. Additional quality control was performed for genome-wide significant SNPs after association analysis, including a test for genotyping batch effects, resulting in the removal of one genome-wide significant SNP with strong evidence of batch-related genotyping error. Functional annotation for the SNPs in LD (r² > 0.5) with the CDHR3 top SNP (rs6967330) was obtained from the RefSeq track downloaded from the UCSC Genome Browser. SNPs were associated with regulatory elements by HaploReg47 in terms of predicted ENCODE chromatin state (promoter and enhancer histone modification signals) and DNase I hypersensitivity (Supplementary Table 8). Regional imputation was performed to describe the identified loci from the discovery analysis (Supplementary Fig. 3) as well as reported loci from the previous largest published GWAS (GABRIEL)11 (Supplementary Table 11). We used two-step genotype imputation as described48.
We used the SHAPEIT algorithm to prephase the haplotypes49 and then used IMPUTEv2 software for the imputation of unknown genotypes50 separately in cases and controls. We used the 1000 Genomes Project reference panel51 (April 2012 version). We used a strict cutoff (info of 0.88), which, according to our analyses, provides an allelic dosage R² correlation between real and imputed genotypes of greater than 0.8 and shows an optimal balance between sufficient accuracy and power52. We then compared the resulting allelic frequencies using SNPTEST 2.4.1 (ref. 53). CDHR3 protein expression in experimental models. The top SNP at the CDHR3 locus is a nonsynonymous SNP (encoding p.Cys529Tyr). To determine the functional consequences of the p.Cys529Tyr variant, we generated expression constructs encoding tagged human CDHR3 protein, and the mutation encoding the p.Cys529Tyr alteration was introduced by site-directed mutagenesis. Plasmids encoding wild-type or mutant CDHR3 or empty vector were transfected into 293T cells, and cells were monitored for surface and intracellular expression of CDHR3 by flow cytometry. 293T cells were from the American Type Culture Collection (ATCC), catalog number CRL-3216. They were recently tested for mycoplasma contamination but were not authenticated. For protein blotting, cells expressing CDHR3 proteins were lysed, and whole-cell lysates were separated by SDS-PAGE under reducing or non-reducing conditions, transferred to PVDF membranes and blotted for Flag (anti-Flag antibody, clone M2 (Agilent Technologies, 200470-21) at a dilution of 1:2,000). For immunofluorescence and confocal microscopy, 293T cells were grown on glass coverslips in DMEM with 3 mM glutamine and 10% heat-inactivated FBS at 37 °C and 5% CO2 before and for 2 d after transfection with expression constructs for Flag-tagged wild-type CDHR3 and CDHR3 Cys529Tyr using TransIT 2020 reagent according to a standard protocol (Mirus Bio).
Cells were obtained and used at a low passage from ATCC and had recently been tested for mycoplasma. Cells were incubated in 10% serum-containing culture medium plus primary anti-Flag mouse antibodies (F3165, Sigma; 1:300 dilution) for 1 h at 37 °C before being washed briefly with culture medium. Cells were then stained with secondary rabbit anti-mouse antibodies (F0261, Dako; 1:600 dilution) conjugated with fluorescein isothiocyanate (FITC) with incubation at 37 °C for 30 min and washed with culture medium before PBS. Afterward, cells were fixed in 2% paraformaldehyde for 15 min, washed with PBS and permeabilized in 0.2% Triton X-100 in PBS for 5 min, washed and incubated with Cy3-conjugated mouse anti-Flag antibody (Cy3-labeled F3165, Sigma; 1:300 dilution). Finally, cells were mounted with ProLong Gold antifade reagent with DAPI (Invitrogen). Images were acquired using a Leica DMI 6000-B confocal microscope (Leica Microsystems) with 40× magnification and were processed in Photoshop (Adobe Systems). Experiments were performed in triplicate (independent transfections) for both flow cytometry and immunofluorescence staining. Data presented (Supplementary Figs. 8 and 9) were chosen as being representative of the repeated experiments. CDHR3 protein structure modeling. A homology model of CDHR3 domains 2–6 (residues 141–681) was generated using the HHpred server54. The model was based on the structure of mouse N-cadherin (PDB 3Q2W) domains 1–5. A disulfide bridge was manually introduced in the final model between the structurally adjacent residues Cys566 and Cys592, as this corresponds to a disulfide bridge commonly observed in cadherin domains.

35. Lynge, E., Sandegaard, J.L. & Rebolj, M. The Danish National Patient Register. Scand. J. Public Health 39, 30–33 (2011). 36. Nørgaard-Pedersen, B. & Hougaard, D.M. Storage policies and use of the Danish Newborn Screening Biobank. J. Inherit. Metab. Dis. 30, 530–536 (2007). 37. Paternoster, L. et al.
Genome-wide population-based association study of extremely overweight young adults—the GOYA study. PLoS One 6, e24303 (2011). 38. Bisgaard, H. The Copenhagen Prospective Study on Asthma in Childhood (COPSAC): design, rationale, and baseline data from a longitudinal birth cohort study. Ann. Allergy Asthma Immunol. 93, 381–389 (2004). 39. Bisgaard, H., Hermansen, M.N., Loland, L., Halkjaer, L.B. & Buchvald, F. Intermittent inhaled corticosteroids in infants with episodic wheezing. N. Engl. J. Med. 354, 1998–2005 (2006). 40. Bisgaard, H. et al. Childhood asthma after bacterial colonization of the airway in neonates. N. Engl. J. Med. 357, 1487–1495 (2007). 41. Bisgaard, H., Pipper, C.B. & Bonnelykke, K. Endotyping early childhood asthma by quantitative symptom assessment. J. Allergy Clin. Immunol. 127, 1155–1164 (2011). 42. Hanifin, J.M. & Rajka, G. Diagnostic features of atopic dermatitis. Acta Derm. Venereol. 92, 44–47 (1980). 43. Lowe, L. et al. Specific airway resistance in 3-year-old children: a prospective cohort study. Lancet 359, 1904–1908 (2002). 44. Reddel, H.K. et al. An official American Thoracic Society/European Respiratory Society statement: asthma control and exacerbations: standardizing endpoints for clinical asthma trials and clinical practice. Am. J. Respir. Crit. Care Med. 180, 59–99 (2009). 45. Jaddoe, V.W. et al. The Generation R Study Biobank: a resource for epidemiological studies in children and their parents. Eur. J. Epidemiol. 22, 917–923 (2007). 46. Jaddoe, V.W. et al. The Generation R Study: design and cohort update 2012. Eur. J. Epidemiol. 27, 739–756 (2012). 47. Ward, L.D. & Kellis, M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res. 40, D930–D934 (2012). 48. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G.R. Fast and accurate genotype imputation in genome-wide association studies through prephasing. Nat. Genet. 44, 955–959 (2012). 49. Delaneau, O., Marchini, J. & Zagury, J.F. A linear complexity phasing method for thousands of genomes. Nat. Methods 9, 179–181 (2012). 50. Howie, B., Marchini, J. & Stephens, M. Genotype imputation with thousands of genomes. G3 1, 457–470 (2011). 51. Abecasis, G.R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012). 52. Auer, P.L. et al. Imputation of exome sequence variants into population-based samples and blood-cell-trait-associated loci in African Americans: NHLBI GO Exome Sequencing Project. Am. J. Hum. Genet. 91, 794–808 (2012). 53. Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010). 54. Söding, J., Biegert, A. & Lupas, A.N. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res. 33, W244–W248 (2005).

NATURE GENETICS doi:10.1038/ng.2830

Chapter 3
Candidate gene study of childhood asthma

This chapter describes the design strategy for a candidate-gene-based resequencing study of asthma exacerbation cases. The study was designed to sequence genes associated with asthma and related phenotypes. Sequencing of the samples was still in progress at the time of submission of this thesis. The chapter describes the candidate gene design, the multiplexing strategy, the sample preparation and the capture method.

Prelude
According to a WHO report, childhood asthma has become epidemic worldwide [155]. Childhood asthma ranges from mild to severe, depending on the number of asthma events and acute asthma attacks. The age of onset of asthma has a significant effect on prognosis and its implications, as early onset increases the risk of severity and persistence later in life [178].
Childhood asthma has high phenotypic heterogeneity; that is, different individuals exhibit different phenotypes, which are also thought to differ in their causal mechanisms. Exacerbation is one of the severe phenotypes of asthma. Asthma exacerbation is marked by changes in lung volume and pleural pressure, which significantly affect cardiopulmonary interactions [179]. Genetic factors, along with environmental factors, are known to be associated with childhood asthma. Multiple genes have been identified as associated with childhood asthma and related phenotypes. The 17q21 locus on chromosome 17 is strongly associated with asthma and has been found in multiple studies [180, 181]. Similarly, the loci 9p24, 2q12 and 6p21 have appeared to be robust across ethnicities in their association with asthma.

Different GWAS detect distinct variations, with low replication in other independent studies. It is also hard to replicate GWAS SNPs with a consistent effect size and direction. Most GWAS are done on commercially available genome-wide arrays. These SNP arrays try to maximise genome coverage by distributing the SNP probes evenly across the genome and minimising the number of SNPs probed within high-LD regions. The probe designs differ between array suppliers, so different variations may be assayed within the same genomic region. Because of the LD patterns in a region, different polymorphisms may be found associated with the disease, although only one of them is the causal variant. Therefore, comparing results from different studies, even of comparable sample sizes, is difficult.

GWAS findings pave the way for more detailed candidate gene studies. Candidate gene studies focus on the plausibility of a gene being involved in disease pathogenesis. Focusing on genomic regions containing known disease genes, called candidate genes, would assist in detecting the causal variations.
This will also help in finding the functional alterations leading to the phenotype. These studies are relatively fast and less costly, and they require less DNA and smaller sample sizes. Candidate gene studies augment array-based GWAS by maximising the variation coverage in these genes. They are suitable for detecting variation underlying common and more complex diseases where the effect size is small. The additional information on the variations and their function in the gene would help in discovering the biological mechanism leading to the asthma phenotype. To capture the genomic regions of interest from DNA samples prior to sequencing, target enrichment is carried out in these studies [182]. Reducing the region to be sequenced makes multiplexing possible.

Candidate Gene Selection
Candidate genes were selected through a literature survey of loci strongly associated with asthma and asthma risk factors. Sixteen SNPs with corresponding regions were selected (Table 3.1). The selected regions include the TCR α/δ region on chromosome 14q, which contains V, J and D coding segments. Rearrangements in this region give rise to an α or a δ chain of the T cell receptor. The TCR α/δ region has been associated with IgE responses [183] and is known for its variability and tight linkage disequilibrium [184]. Thus, baits were designed for this region. Two independent loss-of-function variations in the gene encoding filaggrin (FLG) have been found to be a very strong predisposing factor for atopic dermatitis [185]. Locus 1q31 has been implicated in asthma susceptibility in North American children of European ancestry and in African-American children. The region has also been suggested to influence the age of onset of asthma. The implicated region contains two genes, CRB1 and DENND1B. CRB1 has restricted expression in the retina and brain, whereas DENND1B encodes a protein expressed in immune dendritic cells and has been associated with susceptibility to asthma [164].
Accordingly, the DENND1B region as well as the SNP rs2786098 were included in the design. A set of genes, namely IL1RL1/IL18R1, HLA-DQ, IL33, SMAD3, IL2RB, RORA, SLC22A5 and the ORMDL3/GSDMB locus, has been associated with asthma in a large-scale cohort study [162]. Additionally, since a region on chromosome 17 has been associated with asthma and exacerbation in early childhood, the whole region including GSDMB, ORMDL3, ERBB2 and GSDMA is part of the sequencing panel [180].

Chromosome | Locus | SNP | Reference
14 | TCRA | rs227870 | Moffatt MF, Hum Mol Gen 2000
1 | FLG | multiple SNPs | Rodríguez E, J Allergy Clin Immunol 2009
1 | DENND1B, CRB1 | rs2786098 | Sleiman PM, N Engl J Med 2009
5 | SLC22A5 | rs2073643 | Moffatt MF, N Engl J Med 2010
5 | RAD50-IL13 | rs1295686 | Moffatt MF, N Engl J Med 2010
6 | HLA-DQ | rs9273349 | Moffatt MF, N Engl J Med 2010
9 | IL33 | rs1342326 | Moffatt MF, N Engl J Med 2010
15 | RORA | rs11071559 | Moffatt MF, N Engl J Med 2010
15 | SMAD3 | rs744910 | Moffatt MF, N Engl J Med 2010
17 | GSDMB | rs2305480 | Moffatt MF, N Engl J Med 2010
22 | IL2RB | rs2284033 | Moffatt MF, N Engl J Med 2010
1 | IL6R | rs4129267 | Ferreira MA, Lancet 2011
5 | TSLP | rs1837253 | Hirota T, Nat Genet 2011; Torgerson DG, Nat Genet 2011
11 | C11ORF30, LRRC32 | rs7130588 | Xingnan Li, J Allergy Clin Immunol 2012
Novel
2 | IL1R1, IL18R1 | rs1558641 | Bønnelykke K, Nat Genet 2013
4 | TLR1/6/10 | rs17616434 | Bønnelykke K, Nat Genet 2013
5 | RAD50-IL13 | rs6871536 | Bønnelykke K, Nat Genet 2013
7 | CDHR3 | rs6967330 | Bønnelykke K, Nat Genet 2013

Table 3.1. List of candidate genes selected for targeted sequencing, with known asthma-associated SNPs and corresponding publications.

Variants in IL6R were identified as associated with asthma, along with the 11q13.5 locus [186]. The variations in IL6R are of great interest, as they support the hypothesis that genetic alteration of cytokine signalling increases asthma risk and can be used as a target for genotype-specific therapeutics [186].
Variants of thymic stromal lymphopoietin (TSLP) are associated with asthma in multi-ethnic cohorts of the North American population as well as in the Japanese population. Thus, testing its association with asthma in a Danish cohort would add to the evidence for this gene and help establish it as a cross-ethnic asthma gene [187, 188]. Two independent signals in chromosome 11 open reading frame 30 (C11orf30) and leucine rich repeat containing 32 (LRRC32) have been associated with total serum IgE levels, a risk factor for asthma [189], and the SNP rs7927894, which lies between C11orf30 and LRRC32, has been reported to be associated with atopic asthma [190]. Thus, this region is of interest when looking for variants associated with asthma phenotypes. The SNP rs17616434 lies in a region of the human genome that also encodes multiple Toll-like receptors (TLRs), which have recently been associated with allergic sensitization [191]; thus, the full region was included in the target sequence design. SNPs in the RAD50-IL13 region as well as in the 3′ untranslated region of HLA-DQB1 have been found to be associated with asthma [192]. CDHR3 is a novel finding in the Danish cohort; it has been associated with asthma exacerbations in children and has been replicated in multiple Danish as well as cross-ethnicity cohorts [193].

Capture Region
The targets were designed to capture the gene boundaries and the promoter region (2 kb upstream of the transcription start site) for all genes associated with the SNPs in the respective studies. To capture the specific SNPs from the reference studies, ±50 bases around the position of each SNP (Table 3.1) were also sequenced. RNA baits to densely capture these regions were designed using the Agilent custom design service.
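The capture rules above (gene body plus 2 kb of promoter, and 50 bases on either side of each reference SNP) can be sketched as interval arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and placing the promoter upstream according to gene strand is an assumption, since the text does not state how strand was handled. It does not reproduce the Agilent bait design itself.

```python
def gene_target(start, end, strand, promoter_bp=2000):
    """Gene body extended by the promoter region upstream of the TSS."""
    if strand == "+":
        # TSS is at the gene start; upstream means smaller coordinates.
        return (max(1, start - promoter_bp), end)
    if strand == "-":
        # TSS is at the gene end; upstream means larger coordinates.
        return (start, end + promoter_bp)
    raise ValueError(f"unknown strand: {strand!r}")


def snp_target(position, flank=50):
    """Window of `flank` bases on each side of a SNP position."""
    return (max(1, position - flank), position + flank)


def merge_intervals(intervals):
    """Merge overlapping intervals on one chromosome into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merging matters because many of the reference SNPs lie inside or next to their genes, so the SNP windows and the gene targets would otherwise be counted twice when estimating the total capture size.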
Samples and DNA Extraction

A total of 24 samples with the most severe symptoms of exacerbation were selected from the Danish national birth registry with acute hospitalizations for asthma (ICD-8 code 493, ICD-10 codes J45-46) from 2 to 6 years of age (both years included). The inclusion criteria required more than one day of hospitalization, and two hospitalizations had to be separated by 6 months. DNA for these samples was obtained from blood spots collected as a part of the Danish neonatal screening program and stored at the Danish Neonatal Screening Biobank [194]. Genomic DNA was thereafter extracted using the Extract-N-Amp kit (Sigma-Aldrich). Whole genome amplification was carried out in triplicate using the REPLI-g mini kit (Qiagen) and quantifications were performed as described previously [195, 196].

Library Preparation

DNA shearing and library preparations were performed according to the SureSelect XT Target Enrichment System protocol version 1.6 2013 (Agilent Technologies, Santa Clara, CA, USA) with minor modifications. 200 ng of whole genome amplified DNA was sheared on a Covaris E210 System using a 10% duty cycle, an intensity of 5 and 200 cycles per burst for 360 sec to create 150 bp fragments. Then end-repair was performed (by applying T4 DNA polymerase, T4 polynucleotide kinase and Klenow fragment enzyme) and 3′ A-overhangs were produced (by applying Klenow 3′ to 5′ exo minus).

Barcodes

In the study, 25 customised barcodes of five bases each (Table 3.2) were designed based on primers from Agilent, NimbleGen and Illumina. They were ligated to primers having a thymidine (T) as the last base, which is necessary for ligation to the DNA fragments carrying a 3′ adenosine (A) overhang for sequencing. High diversity among these barcodes was required for the sequencing machine to distinguish the samples. The logo diagram shows the percentage of each base in the forward strand of the barcodes (Figure 3.1). Custom-made adapters containing the unique barcodes were then prepared.
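As a rough illustration of what barcode diversity means in practice, the sketch below computes the per-position base frequencies (the quantity a logo diagram displays) and the minimum pairwise Hamming distance for the forward-strand barcodes listed in Table 3.2. The helper names are hypothetical, and this is not the procedure that was used to design the barcodes.

```python
from collections import Counter
from itertools import combinations

# Forward-strand barcodes as listed in Table 3.2.
BARCODES = [
    "gctta", "acagt", "cggta", "tcgta", "acgct", "ccgta", "acgta",
    "cagta", "gtcta", "tgcta", "ggcta", "cgcta", "agcta", "gccta",
    "gacta", "cgata", "gcata", "cttga", "tctga", "gctga", "ggtca",
    "ttgca", "gctaa", "ggact", "agtca",
]


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def position_frequencies(barcodes):
    """Fraction of each base at every barcode position."""
    n = len(barcodes)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*barcodes)
    ]


min_distance = min(hamming(a, b) for a, b in combinations(BARCODES, 2))
freqs = position_frequencies(BARCODES)

print("minimum pairwise Hamming distance:", min_distance)
for position, f in enumerate(freqs, start=1):
    print(position, {base: round(v, 2) for base, v in sorted(f.items())})
```

A minimum distance of d means that at least d base substitutions are needed to turn one barcode into another.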
Figure 3.1. Logo block diagram for the frequency of the four bases at different positions in the barcodes used in the study.

Pooling, Target Enrichment and Sequencing

The complementary oligos (DNA Technology A/S, Risskov, Denmark) were dissolved in nuclease-free water to a final concentration of 300 µM. Complementary oligonucleotide pairs were mixed in a 1:1 ratio in 1X annealing buffer (10X buffer contained 100 mM Tris-HCl pH 8.1; 0.5 M NaCl). The barcoded adapter mix was heated to 90 °C for 2 minutes, then cooled down to 30 °C at a rate of 2 °C per minute, and diluted to a working concentration of 1.5 µM. The DNA libraries were amplified with an initial denaturation of 30 seconds at 98 °C, followed by 10 cycles of denaturation at 98 °C for 30 seconds, annealing at 65 °C for 30 seconds and extension at 72 °C for 1 minute according to the protocol. The final extension was performed at 72 °C for 5 minutes. DNA quantity and quality were checked on a NanoDrop ND-1000 UV-VIS Spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and an Agilent 2100 Bioanalyzer using the Bioanalyzer High Sensitivity DNA assay (Agilent Technologies), respectively (Figure 3.2 [A]). The DNA libraries were mixed in groups of 25 in equimolar ratios to yield a final concentration of 221 ng/µL of each pooled library. The pooled libraries were hybridized with our custom designed SureSelect oligo capture library for 16 hours according to the manufacturer's instructions. After incubation, the selected hybrids were amplified using custom-made primers for paired-end sequencing and Herculase II Fusion DNA Polymerase (Stratagene, Agilent Technologies). The PCR reaction was performed with an initial denaturation of 2 minutes at 98 °C, followed by 12 cycles of denaturation at 98 °C for 30 seconds, annealing at 57 °C for 30 seconds and extension at 72 °C for 30 seconds. The final extension was performed at 72 °C for 10 minutes.
After purification, DNA quantity and quality were checked on a NanoDrop ND-1000 UV-VIS Spectrophotometer and an Agilent 2100 Bioanalyzer using the Bioanalyzer High Sensitivity DNA assay, respectively (Figure 3.2 [B]). DNA quantity and quality of the libraries were checked again before sequencing, using different markers to be sure about the capture (Figure 3). Sequencing was performed as a 100 bp paired-end run on a HiSeq (Illumina Inc., San Diego, CA, USA) at the BGI facility in Copenhagen following the manufacturer's recommendations.

Figure 3.2. DNA qualities of the samples as measured by length post multiplexing and before capturing [A]. DNA qualities of the samples as measured by length post capturing [B].

Exploration of method

The methods presented in this chapter describe the pilot study of 24 samples, in which the 16 loci with known childhood asthma associations were resequenced in a second attempt. During the first attempt at the resequencing study, a different set of samples was used. Also, a larger panel of genes covering 7 Mb of the genome was designed. This earlier design included a higher number of genes associated with asthma and related phenotypes, as well as the interaction partners of these genes. The design was very suitable for a hypothesis-driven candidate gene study, because pathway-based gene and SNP selection helps in discovering the underlying biological mechanisms [197]. Studies based on pathways [198] and PPIs [199, 200] have found that genes other than the central genes can have an effect on the phenotype, and thus they can also be therapeutic targets. As we were interested in testing the association of variations in these interacting genes, those were also included in the design. Unfortunately, the sequencing of this highly exploratory design failed due to bad DNA quality. The samples had a very low yield during target capture.

Table 3.2. List of forward strand barcodes designed for this study. The design was aimed at maximising the variability at each base of these barcodes.

Name | Barcode
PE2_F | gctta
PE3_F | acagt
PE4_F | cggta
PE5_F | tcgta
PE6_F | acgct
PE7_F | ccgta
PE8_F | acgta
PE9_F | cagta
PE10_F | gtcta
PE11_F | tgcta
PE12_F | ggcta
PE13_F | cgcta
PE14_F | agcta
PE15_F | gccta
PE16_F | gacta
PE17_F | cgata
PE18_F | gcata
PE19_F | cttga
PE21_F | tctga
PE22_F | gctga
PE26_F | ggtca
PE27_F | ttgca
PE28_F | gctaa
PE29_F | ggact
PE30_F | agtca

The samples used in the second attempt were whole genome amplified (WGA), and to increase the yield in the capture we reduced the capture size. Also, the capture kit was upgraded to a newer target capture technology, which can work with lower amounts of DNA (200 ng). We now aim at identifying potentially causal mutations in the proximity of known GWAS hits. The total region of the design was 2.482 Mbp, which was captured by 39499 probes with an average coverage of 82.4%. The SNP regions were 100% covered, while the promoter regions were amongst the least covered regions. Selective sequencing of WGA DNA has been successful in discovering the majority of variants and achieves high concordance with the corresponding arrays [201].

Selective sequencing and cases-only sequencing are useful tools for discovering disease-related variants. They focus the analysis on rare variants and elucidate their effect on the phenotype by avoiding the dilution in effect size caused by the collective testing of cases and controls [202]. Deep resequencing of the GWAS loci associated with inflammatory bowel disease (IBD) has previously resulted in functional confirmation of a known susceptibility gene, nucleotide-binding oligomerization domain-containing protein 2 (NOD2), as well as in the finding of a protective effect of an isoform of caspase recruitment domain-containing protein 9 (CARD9) [203].
Also, the newly discovered risk alleles in the study explain more risk variance in the overall population than the original common variant known from GWAS analysis. Thorough sequencing of significantly associated regions in GWAS not only expands the variance explained, but also identifies specific alleles that may be substantially important for understanding the functional role of each gene [203].

Candidate gene studies are criticised for their low replication in independent studies and for being “hypothesis-driven” [204]. Lack of replication of a variation across studies does not necessarily imply non-causality; it might instead indicate population differences and differences in LD structure [205]. The strength of candidate gene studies depends on the selection of the target regions. These fine-mapping studies are based on the prior findings of multiple studies, and it is an advantage to select candidate genes from loci found in the same population or in cross-ethnic studies [206] to minimise the risk of false positives. It is also beneficial to have functional effects for the variations [207], and thus sometimes only the coding regions of the candidate genes (exomes) are sequenced. With advancements in the annotation of non-coding variations, these regions are also gaining importance in disease association studies. To include regulatory variations in the study, the promoter regions of the candidate genes as well as the introns were included in the design [208]. The success of a “hypothesis-driven” candidate gene study depends on the choice of hypothesis. Integrative systems biology based methods, using information extracted from public databases as well as automated data mining, would support better candidate gene selection for disease and drug studies [209, 210]. The success of the sequencing in this pilot study would eventually lead to the re-sequencing of samples from the total cohort.
We aim at analysing the sequencing data with state-of-the-art methods to find the causal variations as well as variations that could be used to stratify the cases in the study.

Chapter 4

Paper II - Machine learning based prediction of childhood asthma

Prelude

This chapter describes a neural network based discovery tool for selecting the genetic and clinical features, available up to the age of two years, that can predict asthma outcome at the age of seven years. Genotyping of the two cohorts used in this study was done using SNP arrays. Deep phenotyping of the study cohort COPSAC2000 includes several longitudinal phenotypes such as recurrent wheeze, eczema, asthma and exacerbation. To supplement the GWAS, where a single SNP is associated with the phenotype, we tested groups of SNPs. Since SNPs do not always have additive effects and might interact with each other in a variety of ways, we used the non-linear method of artificial neural networks to test these associations. To use the information about asthma risk factors available in the early stages of life, we included the clinical information about allergy, eczema, white blood cell (WBC) count, lung function and presence of bacteria in the hypopharyngeal region as predictors of asthma later in life. The pregnancy and birth conditions also play an important role in asthma risk and were thus also used as risk features. Not all SNPs in the human genome interact with each other, and SNPs mapping to genes within a pathway have higher chances of affecting each other's activity. The SNPs were therefore grouped based on signalling pathways from the database of cell signaling. We used genotyping data from the discovery cohort for selecting pathways with high association to childhood asthma. The second and more informative cohort, COPSAC2000, with eighteen clinical features, was used to reduce the SNP features within these selected pathways. The study tests the predictive power of SNPs and clinical features individually; combinations of the two feature types were then tested to determine whether combining them adds any predictive value. The number of SNPs genotyped in the two datasets was high, and even after grouping them into pathways, trying all possible combinations, whose number increases exponentially with every added feature, was not feasible in reasonable time. A brute force version of a genetic algorithm was employed to train and test all possible combinations of three SNPs for association with the phenotype. The method and results are described in the attached manuscript.

Manuscript

Ranking genetic and clinical features for prediction of asthma at age 7

Rachita Yadav1, Thomas Nordahl Petersen1, Eskil Kreiner-Møller2, Hans Bisgaard2, Klaus Bønnelykke2 and Ramneek Gupta∗1

1 Center for Biological Sequence Analysis, The Technical University of Denmark, Copenhagen, Denmark
2 Copenhagen Prospective Studies on Asthma in Childhood, Health Sciences, University of Copenhagen and Copenhagen University Hospital, Gentofte, Copenhagen, Denmark.

ABSTRACT

Background
Asthma is one of the most common chronic diseases of childhood and the most frequent reason for paediatric hospitalisation. Several genetic and environmental risk factors are known for childhood asthma. The study aims at prioritising clinical and genetic features that are predictive of childhood asthma. The goal is to prioritise a set of genetic and clinical markers that should be replicated in other studies.

Results
We present an artificial neural network based approach using genetic data, in the form of single nucleotide polymorphisms from genotyping, along with clinical features before the age of 2 to predict asthma at the age of 7 years.
The methodology designed for this prediction performs feature selection on SNP groups based on biological pathways. The estrogen receptor pathway was shown to be associated with asthma at age 7, with a Matthews correlation coefficient of 0.71. Other highly ranked pathways are the insulin signaling pathway, the mitochondrial pathway of apoptosis and the phosphoinositide 3-kinase pathway. Several of the pathways have known asthma associations; this method allows the prioritisation of the genes within these pathways.

Conclusions
The method prioritises 11 pathways carrying predictive value towards asthma. Inclusion of 10 selected clinical features out of 18 added further predictive value. This method prioritises pathways associated with childhood asthma rather than single SNPs, and additionally combines the clinical and genetic features in the same models. It helps in identifying pathways and variations that can be studied in more detail in upcoming asthma studies, with replication and functional studies. Prognosis of asthma at an early stage of life would help in earlier treatment and management of the asthma disorder in children.

KEY WORDS – Childhood asthma, artificial neural network, prediction, GWAS, SNPs, pathway based

Introduction

Asthma is one of the most common chronic diseases in childhood. The definition of childhood asthma is a topic of constant debate. Asthma can be characterised as an inflammatory disease with difficulties in airflow, due to the narrowing of lung airways and hypersensitivity of the mucous membrane caused by inflammation. The symptoms of asthma include coughing, wheezing, shortness of breath and tightness of the chest. Asthma heritability is estimated to be 70-90% [12, 46]. There are multiple childhood asthma susceptibility loci that have been verified in genome-wide association studies (GWAS) [14, 28, 29, 41].
These loci have variable effect sizes, and since asthma is a heterogeneous disease it is still hard to explain asthma based on hereditary features alone. Along with genetics, there are racial, social and environmental risks involved in childhood asthma susceptibility [49]. It has been found that 30% of children with preschool wheezing develop asthma at later stages [33]. Multiple phenotypes within asthma have been observed, depending upon the symptoms, the duration of the event and the age of the child. Asthma and related phenotypes share overlapping symptoms, so the difficult part is distinguishing between these endo-phenotypes with similar symptoms.

∗ Corresponding author: Ramneek Gupta, Center for Biological Sequence Analysis, The Technical University of Denmark, Copenhagen, Denmark. Telephone +45 25252425, e-mail: ramneek@cbs.dtu.dk

Figure 1: The overall method for integrating heterogeneous features for prediction of asthma, using artificial neural network based feature selection and prediction.

Multiple clinical and environmental features, independently as well as in combination, have been found to be involved in the cause of childhood asthma. Neonates colonized in the hypopharyngeal region with S. pneumoniae, H. influenzae or M. catarrhalis, or with a combination of these organisms, have an increased risk of recurrent wheeze and asthma early in life [4]. Atopic sensitization plays a major role in the development of asthma. A study testing asthma and allergy at age 8 found that children born by caesarean section are more prone to asthma and allergy [24]. Also, a modest association has been found between very low birth weight and asthma [8]. When the influence of maternal and paternal asthma and atopy on children's asthma at the age of 7 years was analyzed separately, persistently sensitized children with asthmatic mothers were at 10 times higher risk of having current asthma at the age of 7 years [20].
Children born to asthmatic mothers are at high risk of developing asthma, but it is not a strict rule that all children born to asthmatic women are asthmatic. Smoking by mothers during pregnancy puts the children in a high asthma risk group, and only a small fraction of the effect seems to be mediated through fetal growth [22]. Use of antibiotics during pregnancy, particularly in the third trimester, also increases the risk of asthma in the child. Studies have been designed to uncover gene-gene and gene-environment interactions that may occur between different pathophysiological pathways in asthma, which led to the discovery of genes related to home dampness, an environmental risk factor for asthma [42]. Various asthma susceptibility studies have identified several genetic and environmental factors implicated in asthma pathogenesis. Several prognosis tools have been designed for childhood asthma, such as the Asthma Predictive Index (API) [9], the modified API [17], the cumulative risk score of the Isle of Wight birth cohort [26], the severity score for obstructive airway disease [10], an extension of the severity score [27] and the PRIMA score [18]. These tools use clinical features like allergies, bronchial obstruction and lung function, but none of the tools use the genetic constitution of the individuals or a combination of genetic and clinical features. This study aims at selecting a set of clinical features together with genetic markers to predict asthma outcome at age 7 years (Figure 1). Pathway-based methods complement conventional association analysis and offer additional insight. Several methods have been developed to study pathway-mediated effects in GWAS studies. The most widely used of these have been described in the review by Wang et al. 2010 [48].
To take biological mechanisms into account, the available features were grouped based on biological pathways. Pathway selection followed by feature reduction was performed using machine learning methods to discover the most predictive features and assess their predictive performance.

Material and methods

Discovery cohort

The discovery cohort includes 2,029 individuals selected from the Danish national birth registry with acute hospitalizations for asthma (ICD-8 code 493, ICD-10 codes J45-46) from 2 to 6 years of age (both years included). Details of the selection of individuals and the approvals for the study have been described previously [7]. DNA samples for these cases were obtained from the blood spots stored in the Danish Newborn Screening Biobank as a part of the neonatal screening program [32]. The cases were genotyped on the Affymetrix Axiom CEU array (567,090 SNPs) [7]. The controls are a combined set of two population-based cohorts, the Danish National Birth Cohort (females) and the Copenhagen draft board examinations (males). Individuals answering negatively to the questionnaire item on physician-diagnosed asthma were included as controls in the study. These individuals were previously genotyped on the Illumina Human610-Quad v1.0 BeadChip array (545,350 SNPs) as part of the Genetics of Overweight Young Adults (GOYA) study [31]. Quality control measures were applied separately to the cases and controls included in the discovery cohort. A sample call rate of >97.5% was used as the inclusion criterion, along with removal of individuals with excess heterozygosity, gender mismatch and familial relatedness. An ethnicity check was performed using the HapMapII CEU reference panel and non-Danish samples were removed. A SNP call rate of 100% was required, as the applied machine learning method cannot deal with missing data. After the quality filtering, the discovery cohort consisted of 1,173 asthma cases and 2,522 controls.
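The two call-rate filters can be sketched as below. This is a minimal pure-Python illustration that assumes genotypes are stored as one list per individual with `None` marking a missing call; the function name and toy data are hypothetical, and the other quality control steps (heterozygosity, gender, relatedness and ethnicity checks) are not covered.

```python
def qc_filter(genotypes, min_sample_call_rate=0.975):
    """Apply a sample call-rate filter, then keep only SNPs without missing calls.

    genotypes: list of per-individual lists; None marks a missing genotype call.
    Returns the filtered genotype rows and the indices of the retained SNPs.
    """
    n_snps = len(genotypes[0])
    # Keep individuals whose fraction of called SNPs exceeds the threshold.
    kept_rows = [
        row for row in genotypes
        if sum(g is not None for g in row) / n_snps > min_sample_call_rate
    ]
    # Among the kept individuals, require a 100% SNP call rate.
    kept_snps = [
        j for j in range(n_snps)
        if all(row[j] is not None for row in kept_rows)
    ]
    return [[row[j] for j in kept_snps] for row in kept_rows], kept_snps


# Toy data: three individuals, 100 SNPs, genotypes coded 0/1/2.
base = [i % 3 for i in range(100)]
s0 = base[:]                      # complete
s1 = base[:]
s1[3] = None                      # call rate 0.99, passes the sample filter
s2 = base[:]
for j in range(10, 15):
    s2[j] = None                  # call rate 0.95, fails the sample filter

clean, kept_snps = qc_filter([s0, s1, s2])
print(len(clean), "individuals,", len(kept_snps), "SNPs retained")
```

Filtering individuals first matters: SNPs that are missing only in a removed individual are not discarded.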
The overlapping SNPs between cases and controls were selected based on SNP position and were mapped to Ensembl [15] genes using the Ensembl API version 62 with the nearest gene function. SNPs not mapping to any gene were excluded from further steps. There were 124,514 SNPs present in both the case and the control genotyping. Of these, 92,012 SNPs could be mapped to a total of 13,737 genes.

COPSAC2000 cohort

The COPSAC2000 cohort consists of children born between 1998 and 2001 to mothers having a history of asthma diagnosed after 7 years of age. Newborns were enrolled in the study during the first month after birth, and the cohort is characterized by deep phenotyping during close clinical follow-up [3–5]. Doctors employed in the clinical research unit acted as primary physicians for the children in the cohort. The diagnosed and treated respiratory and skin symptoms, and asthmatic symptoms, were recorded in daily diaries [6]. Predefined algorithms for symptoms and responses were deployed to diagnose asthma from birth to 7 years of age, as previously described [4]. The COPSAC2000 cohort study was approved by the Ethics Committee for Copenhagen (KF 01-289/96) and the Danish Data Protection Agency (2008-41-1754), and informed consent was obtained from both parents of each child.

[Figure 2 flowchart boxes: PKU cohort genetic features; GFs grouped in 59 pathways; 11 top pathways selected; COPSAC2000 cohort genetic features; COPSAC2000 cohort clinical features; GFs grouped into top 11 pathways; selected 10 best clinical features; best GF combinations for 11 pathways selected; best GFs + CFs combinations for 11 pathways selected; evaluated the power of each pathway with selected GFs; evaluated the power of each pathway with selected GFs + CFs.]

Figure 2: Flowchart for the datasets used and methods applied. GF = genetic features, i.e. SNPs, and CF = clinical features.
The flowchart shows the flow of pathway selection using the PKU cohort and feature selection using the COPSAC2000 cohort, with evaluation of the two sets of selected features (only SNPs and SNPs + CFs) for the 11 top pathways on the COPSAC2000 cohort.

Clinical data

Pregnancy conditions such as smoking history, antibiotics usage, type of birth, newborn birth weight and weight 2 weeks after birth were recorded for all participants enrolled in the COPSAC2000 cohort. The presence of any microbial growth in the airway was checked on growth media [3]. Lung function for all individuals was measured at the age of 1 month using the raised volume rapid thoracoabdominal compression technique [45]. Allergic sensitization against common inhalant and food allergens was determined at the age of six and eighteen months by the skin prick test, measuring ring diameter [3]. Atopic dermatitis was diagnosed using the Hanifin-Rajka criteria [19] from birth to 7 years of age. Two mutations detected outside the genotyping array, the ORMDL3 and filaggrin mutations, were included in the clinical data. Accordingly, the 18 clinical features used for the COPSAC2000 cohort were neonatal lung functions (FEF50 and PD15), airway bacteria presence in the 1st month, allergy at 6 months, allergy at 18 months, birth type (natural or C-section), eczema in the 1st year of age, WBC counts at 6 months, WBC counts at 18 months, weight at birth, weight at two weeks of age, exacerbation, wheeze and asthma before 2 years, antibiotics intake by the mother in the third trimester, smoking history of the mother, ORMDL3 mutation and filaggrin mutations. WBC counts and lung functions were converted to z-scores and the remaining features were binary encoded.

Genotyping data

For the COPSAC2000 cohort, genome-wide SNP genotyping was performed using the Illumina Infinium II HumanHap550 v1, v3 or Quad BeadChip SNP array platform at the Children's Hospital of Philadelphia's Center for Applied Genomics.
SNPs with a minor allele frequency (MAF) of <1% or a Hardy-Weinberg equilibrium p-value of <10e-5 were excluded from the analysis. SNPs passing the filtering criteria were mapped to Ensembl genes using the Ensembl API version 62 with the nearest gene function. SNPs that did not map to any gene or had missing genotype values were excluded from further steps. Two hundred thirty-six participants from the COPSAC2000 cohort had a complete set of 18 clinical features and 411534 SNPs. Out of the 411534 SNPs in the complete set, 271706 SNPs mapped to 20026 genes in the COPSAC2000 cohort.

Because a high number of SNPs were genotyped in the two datasets, it was not feasible to try all SNP combinations in the feature selection - even after grouping them into pathways. Thus, a combinatorial approach was designed to train and test each pathway using 3-fold cross-validated ANNs in combinations of up to three features. 59 different pathway sets were created from the 92,012 SNPs in the discovery cohort data. Each pathway was trained and tested independently to rank SNPs and SNP combinations based on the average 3-fold cross-validated test Matthews correlation coefficient (MCC). A set of combinations with the best predictive values was selected: combinations were retained if their MCC differed by less than 0.1 from the best MCC (Figure 3). The top combination was used as a seed, and for each SNP from the list in descending order a new ANN was trained and tested. The average MCC was calculated for this new combination using 3-fold cross-validation. If the SNP increased the average MCC, it was added to the combination; otherwise the next SNP on the list was tested (Figure 3). The pathways were ranked by the MCCs of their respective best combinations. The top pathways with MCC >= 0.3 were selected from the discovery cohort. These were used for feature reduction and for training and testing using ANNs on the COPSAC2000 cohort data, both with and without the clinical features.
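The combinatorial selection described above can be sketched as follows. This is one possible reading of the procedure rather than the implementation used in the study: the `score` callback stands in for training and testing a 3-fold cross-validated ANN and returning its average MCC, and the toy weights used to drive the example are invented.

```python
from itertools import combinations
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def select_features(features, score, max_size=3, tolerance=0.1):
    """Rank small combinations, then greedily extend the best one.

    score(combo) must return the average cross-validated MCC for a tuple
    of features (hypothetical callback supplied by the caller).
    """
    # 1. Score every combination of up to `max_size` features.
    ranked = sorted(
        ((score(c), c)
         for k in range(1, max_size + 1)
         for c in combinations(features, k)),
        key=lambda item: item[0],
        reverse=True,
    )
    best_score, seed = ranked[0]
    # 2. Keep the combinations whose MCC is within `tolerance` of the best.
    shortlist = [c for s, c in ranked if s > best_score - tolerance]
    # 3. Use the top combination as seed and try the remaining shortlisted
    #    features in ranked order, keeping each one only if the MCC increases.
    current, current_score = list(seed), best_score
    for combo in shortlist[1:]:
        for feature in combo:
            if feature in current:
                continue
            trial = score(tuple(current + [feature]))
            if trial > current_score:
                current.append(feature)
                current_score = trial
    return current, current_score


# Toy scoring function with invented weights, standing in for ANN training.
WEIGHTS = {"snp_a": 30, "snp_b": 20, "snp_c": 12, "snp_d": 4, "snp_e": -5}


def toy_score(combo):
    return sum(WEIGHTS[f] for f in combo) / 100


selected, selected_score = select_features(list(WEIGHTS), toy_score)
print(selected, selected_score)
```

Restricting the exhaustive step to combinations of at most three features keeps the number of networks to train polynomial in the pathway size, while the greedy step allows larger combinations to emerge.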
In the sets comprising only genetic features, all combinations of up to three features (singletons, pairs and triplets) were exhaustively tested to rank the feature combinations. The combinations were tested and selected using the combinatorial approach as described in the previous section. All possible combinations of the available 18 clinical features were tested to select a set of best discriminating features based on the 3-fold cross-validated ANNs and average MCC. In order to find the genetic features adding power to the selected clinical features within the top pathways, the genetic features were tested in combination with the selected clinical features. The set of selected clinical features was used as a constant set to train all possible combinations of the genetic features, singly and in pairs, with 3-fold cross-validated ANNs and the highest average MCC as the selection criterion.

The total COPSAC2000 dataset of 236 individuals was divided into a training-test set of 200 and a small evaluation set of 36 with a balanced case-control division. A 4-fold cross-validation was performed on the selected clinical features to test their power independently. The selected combinations of genetic features in the top pathways, with and without the clinical features, were used for training 4-fold cross-validated ANNs using the training data set of 200 individuals. These trained ANNs were evaluated for their predictive power on the 36 individuals using MCC and AUC as the measures. The arithmetic mean of the four trained networks was used to evaluate the evaluation set of 36 individuals for each pathway. The GWAS to test the association of single SNPs of interest in the dataset was carried out using PLINK [34].

Artificial neural network

A dataset with 3000 individuals, both cases and controls, from the discovery cohort having genotyping data for 124,514 SNPs was used for feature selection and ranking of pathways using a machine learning algorithm. Feed-forward fully connected artificial neural networks (ANNs) with a single hidden layer using a standard back-propagation procedure [37] were used for pathway selection and to assess the predictive performance of the included genetic features. Pathway sets of the SNPs were created based on the pathway definitions from the database of cell signaling (http://stke.sciencemag.org/cm/). Each SNP was encoded by three binary input neurons: 100 for homozygous reference, 010 for heterozygous and 001 for homozygous non-reference. The pathways solely discovered in non-mammalian species were excluded from the analysis. The genes or gene sets defined as tokens by the database of cell signaling were manually mapped to human homologues using HGNC nomenclature for ease of SNP mapping using the Ensembl API. This resulted in 59 mammalian pathways used for the selection of features.

Figure 3: Schematic diagram of the combinatorial approach. [Flowchart boxes: train and test combinations (in 1's, 2's and 3's); ranked feature combinations; combinations if MCC > (MCC_BEST - 0.1); top combinations; add feature to the top combination; if the MCC increases, add the new feature, otherwise reject the new feature; best feature combination for pathway.]

Results

Feature selection

SNPs from the genotyping of the discovery cohort were grouped into 59 pathways from the database of cell signaling using the genes they were mapped to by Ensembl. When trained and tested with 3-fold cross-validation, the 59 tested pathways resulted in eleven top pathways with MCC >= 0.3 (Table 1). These top pathways were reduced in size by 90% on average, in terms of the number of features selected for the best performance.

Feature reduction

Features grouped in pathways

The SNPs from the COPSAC2000 cohort genotyping data were mapped to the genes from the top 11 pathways with the best performance in the PKU cohort. When used for feature selection, these sets led to an average 97% reduction in the size of the pathway to arrive at the best performing SNPs. The feature selection performance of these 11 pathways showed an increase in MCC in the COPSAC2000 cohort as compared to the PKU cohort (Table 2).

Clinical feature selection

Out of the 18 clinical features for COPSAC2000, 10 features were selected from all possible combinations as having the best discriminative value for cases and controls, with an MCC of 0.6418. They were allergy at 6 months of age, allergy at 18 months, birth type (natural or C-section), eczema in the 1st year, exacerbation before 2 years, filaggrin mutation, WBC counts at 18 months of age, WBC counts at 6 months, weight at birth, weight at 2 weeks of age and wheeze before 2 years. When these selected clinical features were combined with the genetic features, grouped by the top 11 pathways selected from the PKU cohort, the performance increased with fewer SNPs per pathway (Table 3).

Training and evaluation

The dataset of the COPSAC2000 cohort was divided into a train-test set of 200 and an evaluation set of 36. The networks were trained on the 200 data points with 4-fold cross-validation, using the maximum MCC as the selection criterion for the best performing network. The selected 10 clinical features with 4-fold cross-validation gave an MCC of 0.62 on the evaluation set of 36 individuals. The MCCs for seven pathway sets (estrogen receptor pathway, differentiation pathway in PC12 cells, insulin signaling pathway, B cell antigen receptor, mitochondrial pathway of apoptosis (caspases), PI3K pathway and FAS signaling pathway) were higher than for the clinical features used alone (Table 4). The test correlation coefficient for three pathways (FAS signaling pathway, IL-1 pathway and B cell antigen receptor) is less than the MCC obtained from clinical features alone and thus cannot be compared with the other pathways.
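The genotype encoding with three binary input neurons per SNP and the averaging of the four cross-validated networks can be sketched as below. The sketch assumes genotypes are coded as the number of non-reference alleles (0, 1 or 2) and that a cut-off of 0.5 is applied to the averaged network output; both the coding and the cut-off are assumptions made for illustration, as are the function names and the example values.

```python
# Three binary input neurons per SNP, keyed by non-reference allele count.
ENCODING = {
    0: (1, 0, 0),  # homozygous reference
    1: (0, 1, 0),  # heterozygous
    2: (0, 0, 1),  # homozygous non-reference
}


def encode_genotypes(genotypes):
    """Flatten a list of genotypes into the ANN input vector."""
    return [bit for g in genotypes for bit in ENCODING[g]]


def ensemble_predict(network_outputs, threshold=0.5):
    """Average the outputs of several trained networks per individual.

    network_outputs: one list of output values per network, aligned by individual.
    Returns 1 (asthma) or 0 (no asthma) for every individual.
    """
    n_networks = len(network_outputs)
    means = [sum(column) / n_networks for column in zip(*network_outputs)]
    return [int(m >= threshold) for m in means]


inputs = encode_genotypes([0, 1, 2])

# Invented outputs of four networks for three individuals.
outputs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [0.7, 0.3, 0.5],
    [0.6, 0.2, 0.3],
]
predictions = ensemble_predict(outputs)
print(inputs, predictions)
```

With this encoding a pathway of n SNPs yields 3n input neurons, which is why reducing the number of SNPs per pathway directly reduces the size of the network.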
Pathway Name | Total SNPs | Test MCC | No of selected SNPs
FAS signaling pathway | 113 | 0.3356 | 17
Interleukin 1 (IL-1) pathway | 157 | 0.3203 | 10
Insulin signaling pathway | 155 | 0.3121 | 9
Differentiation pathway in PC12 cells | 109 | 0.306 | 8
Mitochondrial pathway of apoptosis (BH3-only Bcl-2 family) | 136 | 0.3046 | 13
G alpha 12 pathway | 85 | 0.3041 | 10
B cell antigen receptor pathway | 70 | 0.3016 | 8
Estrogen receptor pathway | 193 | 0.3015 | 12
PI3K pathway | 83 | 0.3007 | 10
PI3K class IB pathway in neutrophils | 157 | 0.2997 | 9
Mitochondrial pathway of apoptosis (Caspases) | 103 | 0.2927 | 9

Table 1: Selected pathways from the PKU cohort using genetic data. The table documents the total number of SNPs used for feature selection, the MCC of feature selection and the number of features selected.

Pathway Name | Total SNPs | Test MCC | No of selected SNPs
Mitochondrial pathway of apoptosis (BH3-only Bcl-2 Family) | 511 | 0.6394 | 21
PI3K class IB pathway in neutrophils | 525 | 0.623 | 20
B cell antigen receptor pathway | 236 | 0.5575 | 10
Differentiation pathway in PC12 cells | 451 | 0.5432 | 17
Insulin signaling pathway | 546 | 0.5247 | 10
IL-1 pathway | 553 | 0.5229 | 13
Estrogen receptor pathway | 579 | 0.5104 | 16
PI3K pathway | 288 | 0.502 | 13
FAS signaling pathway | 713 | 0.482 | 11
Mitochondrial pathway of apoptosis (Caspases) | 664 | 0.4662 | 6
G alpha 12 pathway | 312 | 0.3981 | 7

Table 2: Feature reduction for the top 11 pathways in the COPSAC2000 cohort using genetic data. The table documents the total number of SNPs used for feature selection, the MCC for feature selection and the number of features selected.

When prediction accuracy is checked for the individual pathways, the PI3K pathway is more accurate in predicting true positives, along with the G alpha 12 pathway, the mitochondrial pathway of apoptosis (caspases) and the B cell antigen receptor pathway.
The overall performance of the B cell antigen receptor pathway is the worst amongst the tested pathways, as the features of that pathway do not describe the negatives precisely. Similarly, the FAS signaling pathway and the PI3K class IB pathway in neutrophils are better at assigning the non-asthmatic class to controls. The estrogen receptor pathway emerges as the pathway with the best predictive value, as it is good at both assigning the asthmatic class to cases and the non-asthmatic class to controls.

Discussion

Genome-wide association analyses relating variations to various asthma phenotypes have been carried out in the discovery cohort and the COPSAC2000 cohort as part of different studies. The GWAS results from the discovery cohort showed associations of single SNPs with asthma exacerbation, replicating the previously known loci IL-33, RAD50/IL13, HLA-DQ and IL1RL1 and also discovering a new asthma-associated gene, CDHR3 [7]. The GWAS analysis of the 411 children of the COPSAC2000 cohort was recently published, where variation in PCDH1 was shown to increase the risk of early asthma as well as atopic dermatitis in early childhood [30]. In this study, we search for genetic and clinical feature combinations that are associated with the development of asthma before the age of 7 years. As all individuals in COPSAC2000 are born to asthmatic mothers, it might be speculated that all of them will develop asthma or related phenotypes, but that is not the case in our cohort. Other studies have also observed that a family history of asthma is not a strong predictor of asthma outcome in children, whereas its absence better predicts that the child will not develop asthma [25].
Pathway Name | Total SNPs | Test MCC | Selected features
Mitochondrial pathway of apoptosis (BH3-only Bcl-2 Family) | 10 CF + 511 SNPs | 0.7068 | Selected 10 CFs, rs538874, rs11596895
Mitochondrial pathway of apoptosis (Caspases) | 10 CF + 579 SNPs | 0.6928 | Selected 10 CFs, rs838457, rs4647693
Insulin signaling pathway | 10 CF + 288 SNPs | 0.6897 | Selected 10 CFs, rs4151657, rs12051769
PI3K class IB pathway in neutrophils | 10 CF + 525 SNPs | 0.6766 | Selected 10 CFs, rs7566856, rs2617815
PI3K pathway | 10 CF + 553 SNPs | 0.6702 | Selected 10 CFs, rs7713645, rs7566856
FAS signaling pathway | 10 CF + 451 SNPs | 0.6686 | Selected 10 CFs, rs905238, rs3027623
G alpha 12 pathway | 10 CF + 236 SNPs | 0.6678 | Selected 10 CFs, rs4965238, rs16951319
Estrogen receptor pathway | 10 CF + 713 SNPs | 0.6658 | Selected 10 CFs, rs2163800, rs1529711
IL-1 pathway | 10 CF + 664 SNPs | 0.6655 | Selected 10 CFs, rs12051769, rs10242595
Differentiation pathway in PC12 cells | 10 CF + 312 SNPs | 0.6583 | Selected 10 CFs, rs2281976, rs11076787
B cell antigen receptor pathway | 10 CF + 546 SNPs | 0.6568 | Selected 10 CFs, rs3027623, rs11997038

Table 3: Feature selection in the COPSAC2000 cohort using the 10 selected clinical features and genetic data from the top 11 pathways. The table documents the total number of SNPs used for feature selection, the MCC of feature selection and the selected features.

A longitudinal study investigating the multilocus profile of genetic risk for asthma in a cohort with a family history of asthma found that multiple GWAS discoveries are associated with childhood-onset asthma [2]. Thus, we designed a methodology that uses combinations of multiple loci and tests them for association with the trait. Environmental factors cannot be ignored in designing studies concerning complex diseases. A multi-ethnic study found that caregiver reports of physician diagnosis of asthma (CRPDA), when augmented with assessment of bronchial hyperresponsiveness (BHR), result in precise identification of children with asthma at age 7 [43].
Similarly, we also found that none of the clinical features alone has better predictive power than the combination of 10 clinical features. The article by Bisgaard et al. describes the details of the clinical data collection, the phenotyping of the children and the long follow-up that give us the high-quality clinical data [6] used as input in this study. The combination of genetic and clinical data has more predictive value than either of them alone, as shown by the increase in MCC (Table 3), which supports the idea of gene-environment interaction. Multifactor dimensionality reduction analysis suggests that gene-gene interactions may occur between different physiological pathways, as well as between genes and environmental factors like dampness, in childhood asthma [1]. When testing at multiple loci, defining and evaluating the genomic risk profile is valuable. The biological pathway-based combination of SNPs gives clues to the biological mechanisms involved in the pathophysiology of the disease. This study does not involve any environmental factors such as the surroundings, climate, and living and economic conditions, which might add more power to the tool. The risk of fracture is determined by genetic as well as non-genetic factors and a combined model of genes and clinical features is 45% accurate than the genetic model with 41% specificity [44]. In vague situations like the primary evaluation of head trauma patients based on clinical data, an ANN outperformed logistic regression [13]. Similarly, in this study the ANN-based prediction using clinical risk features and genetic data outperforms the single data type predictors. The ANN models in our study use only one continuous variable (WBC counts); the rest of the features are binary variables.
It is known that binary data are less powerful than continuous data types for detecting association between feature and outcome, but binary variables have earlier proved their ability to represent genetic data used to ascertain an a priori score for the predisposition to coronary infarct [51]. Not all SNPs in the dataset were used, as the current architecture was unable to handle missing data.

Pathway Name | Features for selection | Test MCC | Selected features
Estrogen receptor pathway | 0.72193 | 0.711698 |
Differentiation pathway in PC12 cells | 0.72193 | 0.693989 |
Insulin signaling pathway | 0.72193 | 0.678929 |
Mitochondrial pathway of apoptosis (Caspases) | 0.6742 | 0.663075 |
PI3K pathway | 0.83666 | 0.638754 |
Mitochondrial pathway of apoptosis (BH3-only Bcl-2 Family) | 0.72193 | 0.546179 |
PI3K Class IB pathway in neutrophils | 0.6742 | 0.542781 |
G alpha 12 pathway | 0.657376 | 0.531513 |
FAS Signaling pathway | 0.542326 | 0.633754 |
IL-1 pathway | 0.512392 | 0.603481 |
B Cell Antigen Receptor pathway | 0.4 | 0.663875 |

Table 4: Evaluation results for pathways. The 11 selected pathways with the features used for training the 4-fold cross-validated networks, and the evaluation MCC and AUC on the evaluation set.

[Figure 4 chart: one horizontal bar per pathway (B Cell Antigen Receptor, Interleukin 1 (IL-1) Pathway, Mitochondrial Pathway of Apoptosis (Caspases), Fas Signaling Pathway, G alpha 12 Pathway, Mitochondrial Pathway of Apoptosis (BH3-only Bcl-2 Family), Insulin Signaling Pathway, Differentiation Pathway in PC12 Cells, PI3K Class IB Pathway in Neutrophils, Estrogen Receptor Pathway, PI3K Pathway); axis from 0 to 35; series: False predictions, True negatives, True positives]

Figure 4: The accuracy of different pathways in correctly predicting asthmatic and non-asthmatic individuals. True positives are the individuals with asthma at age 7 who are predicted as asthmatic, while true negatives are controls predicted as non-asthmatic. False predictions are the count of mispredictions, that is, asthmatic individuals predicted as non-asthmatic and non-asthmatic individuals predicted as asthmatic.
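The evaluation quantities used here (the arithmetic mean of the four cross-validated networks, the true positive, true negative and false prediction counts of Figure 4, and the MCC) can be illustrated with a short sketch. It is only an illustration under assumptions: the function names are hypothetical, the 0.5 decision threshold is assumed rather than taken from the study, and the example outputs are made up.

```python
import math
from typing import List, Sequence, Tuple


def average_outputs(network_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Arithmetic mean of the network outputs, one value per individual."""
    n_networks = len(network_outputs)
    return [sum(values) / n_networks for values in zip(*network_outputs)]


def confusion_counts(y_true: Sequence[int], y_score: Sequence[float],
                     threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN); 1 = asthmatic case, 0 = control.

    The threshold of 0.5 is an assumption made for this sketch.
    """
    tp = tn = fp = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 if the denominator is 0."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


# Made-up outputs of four networks for six individuals (three cases, three controls).
example_truth = [1, 1, 1, 0, 0, 0]
example_outputs = [
    [0.9, 0.8, 0.4, 0.2, 0.6, 0.1],
    [0.8, 0.7, 0.3, 0.3, 0.7, 0.2],
    [0.7, 0.9, 0.6, 0.1, 0.4, 0.3],
    [0.8, 0.6, 0.3, 0.2, 0.7, 0.2],
]
```

For the made-up example, averaging the four networks and thresholding gives two true positives, two true negatives and two false predictions (one false positive plus one false negative). The MCC is then (2*2 - 1*1) / sqrt(3*3*3*3) = 1/3.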
Since genotyping is never 100% complete for all samples, using genotype probabilities can overcome this problem and would increase the coverage of SNPs and pathways that can be tested with the method.

The top 11 pathways found to be associated with asthma include pathways that, either as a whole or through their gene components, have been associated with asthma. Variation in estrogen receptors, the key molecules of the best performing pathway, has been associated with different asthma-like phenotypes [11], and reduced ER-alpha receptor has been reported in the mitochondria of fatal asthma cases. This indicates a function of ER-alpha during the inflammation of airways and a crucial role in the pathophysiology of asthma [39]. The increasing links between asthma, obesity and diabetes are not only due to mechanical pulmonary disadvantage; there are also molecular connections between these phenotypes. Multiple studies suggest that insulin affects the lungs and airway smooth muscle and that PI3K/Akt signalling is a downstream pathway of insulin [40]. PI3K is an intracellular signalling pathway that is important in apoptosis. A common SNP detected in the two pathways, rs7566856, is mapped to the inositol polyphosphate-5-phosphatase, 145 kDa (INPP5D/SHIP-1) gene, which acts as a positive regulator in Th2 cells in the adaptive immune response to aeroallergen [36]. It has been found that inhibitors targeting PI3K isoforms can serve as therapeutic agents for the treatment of asthma and chronic obstructive pulmonary disease [21]. This study finds two phosphoinositide 3-kinase (PI3K) pathways to be associated with the asthma outcome, only one of which performs well in the evaluation phase. Thus, the selection of the insulin signalling and PI3K pathways over the other pathways in the database points towards a link between the genetic features from these pathways, their interaction with each other, and asthma.
Mitochondria-mediated apoptosis has been found to affect atopic asthma through delayed cell death of neutrophils, which contributes to neutrophilic inflammation in asthma [38]. Thus, the inflammatory reactions occurring in the phenotype, and their role in the prediction of asthma, can be mediated through these mitochondria-mediated apoptosis pathways. The pathway ranking second in the evaluation list is the differentiation pathway in PC12 cells, which has been detected in tumor cells of the adrenal glands. These cells are known to be under the control of different growth factors [47]. The pathway definition contains the genes PI3K, protein kinase B (AKT) and cAMP response element-binding proteins (CREB), which are common to different pathways. Thus, the selection of the pathways to be tested and the definition of the pathways play a crucial role in the success of this method. Taking into account the expression of genes in asthma-related tissues would help avoid false positive hits.

Though the other 6 pathways do not increase the performance of the 10 selected clinical features, they still have predictive power and a role in the asthma phenotype. G alpha 12 pathway activity leads to transformation, regulation of mitogenesis, regulation of survival and induction of stress fibers, and is under the control of thrombin [35]. A higher concentration of thrombin is found in the sputum of asthmatic patients and is relevant to airway tissue remodelling during the disease [16]. FAS signalling, the top ranker in the discovery cohort, belongs to the death receptor subgroup of the TNF receptor superfamily. The three pathways with both the test and the evaluation MCC lower than that of the clinical features alone show that the selected genetic features do not correlate with the clinical features. Asthmatic cell lines have been found to be resistant to apoptosis signalled by first apoptosis signal (FAS). FAS signal transduction was also suggested to contribute to T-cell-dependent immunoinflammation in asthma [23].
So, the genetic features in the FAS signalling pathway are relevant to asthma but might not present a coordinated effect with the selected clinical risk features. The other 2 pathways that lower the MCC of the clinical features in the test set only are the B cell antigen receptor and IL-1 pathways. These are known candidates in the immune response in asthmatic conditions. Thus, all of the top 11 pathways selected by the discovery method are asthma associated.

This method helps to reduce the complexity from thousands of SNPs and hundreds of genes to a few best predictive SNPs, which, when combined with the clinical features, can be used as a predictive tool for asthma. Earlier prediction studies have been simplistic in the sense that they were based on a small number of variants explaining only a fraction of the genetic variability. A high predictive value offers the chance of discovering interventions, whereas low predictive results still provide a discovery tool and risk predictions. The aim of these combination studies is to define a genetic risk score that can be used to predict the disease risk of healthy people who currently have no symptoms of the disease [50]. This would be of great advantage for the management and treatment of diseases like asthma.

This method tries to maximise the exploration of the SNP space, and thus the gene space, but it is still not complete. The same SNPs are not present on all arrays and not all SNPs are detected in sequencing data, so it is difficult to find a one-to-one replication set. Due to the small size of the cohort available with complete clinical data, we were not able to use an external evaluation set, which might have led to the problem of overfitting. A replication of the method in an independent cohort would also add more confidence to the pathway selection and predictions. Functional studies based on the replication results would lead to the causative variations within pathways.
The method could be improved if we were able to try more combinations by introducing more parallelization. The definition of the pathways also plays an important role in this method, and all the resources of pathway information are under development, meaning that we might be missing some genes and connections in our background data.

Conclusion

This method allows the selection of pathways along with the selection of features within these pathways. The identified pathways and variations can serve as a basis for upcoming asthma studies, to be studied in more detail. This study prioritises clinical risk features as well as genetic features with predictive power for asthma, which can lead to further functional studies. This study shows the advantage and success of combining multiple data sources for better predictive power.

References

[1] K. C. Barnes, “Gene-environment and gene-gene interaction studies in the molecular genetic analysis of asthma and atopy”, Clin Exp Allergy, Vol. 29 Suppl 4, pp. 47–51, 1999. [2] D. D. W. Belsky, P. M. R. Sears, R. J. Hancox, H. Harrington, R. Houts, P. T. E. Moffitt, K. Sugden, B. Williams, P. R. Poulton, and P. A. Caspi, “Polygenic risk and the development and course of asthma: an analysis of data from a four-decade longitudinal study”, The Lancet Respiratory Medicine, Vol. 1, No. 6, pp. 453–461, August 2013. [3] H. Bisgaard, “The Copenhagen Prospective Study on Asthma in Childhood (COPSAC): design, rationale, and baseline data from a longitudinal birth cohort study”, Ann Allergy Asthma Immunol, Vol. 93, No. 4, pp. 381–9, 2004. [4] H. Bisgaard, M. N. Hermansen, F. Buchvald, L. Loland, L. B. Halkjaer, K. Bonnelykke, M. Brasholt, A. Heltberg, N. H. Vissing, S. V. Thorsen, M. Stage, and C. B. Pipper, “Childhood asthma after bacterial colonization of the airway in neonates”, N Engl J Med, Vol. 357, No. 15, pp. 1487–95, 2007. [5] H. Bisgaard, M. N. Hermansen, L. Loland, L. B. Halkjaer, and F.
Buchvald, “Intermittent inhaled corticosteroids in infants with episodic wheezing”, N Engl J Med, Vol. 354, No. 19, pp. 1998–2005, 2006. [6] H. Bisgaard, C. B. Pipper, and K. Bonnelykke, “Endotyping early childhood asthma by quantitative symptom assessment”, J Allergy Clin Immunol, Vol. 127, No. 5, pp. 1155–64 e2, 2011. [7] K. Bonnelykke, P. Sleiman, K. Nielsen, E. Kreiner-Moller, J. M. Mercader, D. Belgrave, H. T. den Dekker, A. Husby, A. Sevelsted, G. Faura-Tellez, L. J. Mortensen, L. Paternoster, R. Flaaten, A. Molgaard, D. E. Smart, P. F. Thomsen, M. A. Rasmussen, S. Bonas-Guarch, C. Holst, E. A. Nohr, R. Yadav, M. E. March, T. Blicher, P. M. Lackie, V. W. Jaddoe, A. Simpson, J. W. Holloway, L. Duijts, A. Custovic, D. E. Davies, D. Torrents, R. Gupta, M. V. Hollegaard, D. M. Hougaard, H. Hakonarson, and H. Bisgaard, “A genome-wide association study identifies CDHR3 as a susceptibility locus for early childhood asthma with severe exacerbations”, Nat Genet, Vol. 46, No. 1, pp. 51–5, 2014. [8] A. M. Brooks, R. S. Byrd, M. Weitzman, P. Auinger, and J. T. McBride, “Impact of low birth weight on early childhood asthma in the United States”, Arch Pediatr Adolesc Med, Vol. 155, No. 3, pp. 401–6, 2001. [9] J. A. Castro-Rodriguez, C. J. Holberg, A. L. Wright, and F. D. Martinez, “A clinical index to define risk of asthma in young children with recurrent wheezing”, Am J Respir Crit Care Med, Vol. 162, No. 4 Pt 1, pp. 1403–6, 2000. [10] H. G. M.-K. M. P. M. M. P. e. a. Devulapalli CS, Carlsen KC, “Severity of obstructive airways disease by age 2 years predicts asthma at 10 years of age”, Thorax, Vol. 63, pp. 8–13, 2008. [11] A. Dijkstra, T. D. Howard, J. M. Vonk, E. J. Ampleford, L. A. Lange, E. R. Bleecker, D. A. Meyers, and D. S. Postma, “Estrogen receptor 1 polymorphisms are associated with airway hyperresponsiveness and lung function decline, particularly in female subjects with asthma”, J Allergy Clin Immunol, Vol. 117, No. 3, pp. 604–11, 2006. [12] D. L. Duffy, N. G. 
Martin, D. Battistutta, J. L. Hopper, and J. D. Mathews, “Genetics of asthma and hay fever in Australian twins”, Am Rev Respir Dis, Vol. 142, No. 6 Pt 1, pp. 1351–8, 1990. [13] B. Eftekhar, K. Mohammad, H. E. Ardebili, M. Ghodsi, and E. Ketabchi, “Comparison of artificial neural network and logistic regression models for prediction of mortality in head trauma based on initial clinical data”, BMC Med Inform Decis Mak, Vol. 5, p. 3, 2005. [14] M. e. a. Ferreira, “Identification of IL6R and chromosome 11q13.5 as risk loci for asthma.”, Lancet, Vol. 378, pp. 1006–1014, 2011. [15] P. Flicek, M. R. Amode, D. Barrell, K. Beal, S. Brent, D. Carvalho-Silva, P. Clapham, G. Coates, S. Fairley, S. Fitzgerald, L. Gil, L. Gordon, M. Hendrix, T. Hourlier, N. Johnson, A. K. Kahari, D. Keefe, S. Keenan, R. Kinsella, M. Komorowska, G. Koscielny, E. Kulesha, P. Larsson, I. Longden, W. McLaren, M. Muffato, B. Overduin, M. Pignatelli, B. Pritchard, H. S. Riat, G. R. Ritchie, M. Ruffier, M. Schuster, D. Sobral, Y. A. Tang, K. Taylor, S. Trevanion, J. Vandrovcova, S. White, M. Wilson, S. P. Wilder, B. L. Aken, E. Birney, F. Cunningham, I. Dunham, R. Durbin, X. M. Fernandez-Suarez, J. Harrow, J. Herrero, T. J. Hubbard, A. Parker, G. Proctor, G. Spudich, J. Vogel, A. Yates, A. Zadissa, and S. M. Searle, “Ensembl 2012”, Nucleic Acids Res, Vol. 40, No. Database issue, pp. D84–90, 2012. [16] E. C. Gabazza, O. Taguchi, S. Tamaki, H. Takeya, H. Kobayashi, H. Yasui, T. Kobayashi, O. Hataji, H. Urano, H. Zhou, K. Suzuki, and Y. Adachi, “Thrombin in the airways of asthmatic patients”, Lung, Vol. 177, No. 4, pp. 253–62, 1999. [17] T. W. Guilbert, W. J. Morgan, M. Krawiec, J. Lemanske, R. F., C. Sorkness, S. J. Szefler, G. Larsen, J. D. Spahn, R. S. Zeiger, G. Heldt, R. C. Strunk, L. B. Bacharier, G. R. Bloomberg, V. M. Chinchilli, S. J. Boehmer, E. A. Mauger, D. T. Mauger, L. M. Taussig, and F. D. Martinez, “The Prevention of Early Asthma in Kids study: design, rationale and methods for the Childhood Asthma Research and Education network”, Control Clin Trials, Vol. 25, No. 3, pp. 286–310, 2004. [18] E. Hafkamp-de Groen, H. F. Lingsma, D. Caudri, D. Levie, A. Wijga, G. H. Koppelman, L. Duijts, V. W. Jaddoe, H. A. Smit, M. Kerkhof, H. A. Moll, A. Hofman, E. W. Steyerberg, J. C. de Jongste, and H. Raat, “Predicting asthma in preschool children with asthma-like symptoms: Validating and updating the PIAMA risk score”, J Allergy Clin Immunol, Vol. 132, No. 6, pp. 1303–1310 e6, 2013. [19] G. Hanifin, J.M. & Rajka, “Diagnostic features of atopic dermatitis”, Acta Derm. Venereol., Vol. 92, pp. 44–47, 1980. [20] S. Illi, E. von Mutius, S. Lau, R. Nickel, B. Niggemann, C. Sommerfeld, and U. Wahn, “The pattern of atopic sensitization is associated with the development of asthma in childhood”, J Allergy Clin Immunol, Vol. 108, No. 5, pp. 709–14, 2001. [21] K. Ito, G. Caramori, and I. M. Adcock, “Therapeutic potential of phosphatidylinositol 3-kinase inhibitors in inflammatory respiratory disease”, J Pharmacol Exp Ther, Vol. 321, No. 1, pp. 1–8, 2007. [22] J. J. Jaakkola and M. Gissler, “Maternal smoking in pregnancy, fetal development, and childhood asthma”, Am J Public Health, Vol. 94, No. 1, pp. 136–40, 2004. [23] S. Jayaraman, M. Castro, M. O’Sullivan, M. J. Bragdon, and M. J. Holtzman, “Resistance to Fas-mediated T cell apoptosis in asthma”, J Immunol, Vol. 162, No. 3, pp. 1717–22, 1999. [24] O. Kolokotroni, N. Middleton, M. Gavatha, D. Lamnisos, K. N. Priftis, and P. K. Yiallouros, “Asthma and atopy in children born by caesarean section: effect modification by family history of allergies - a population based cross-sectional study”, BMC Pediatr, Vol. 12, p. 179, 2012. [25] G. H. Koppelman, G. J. te Meerman, and D. S. Postma, “Genetic testing for asthma”, Eur Respir J, Vol. 32, No. 3, pp. 775–82, 2008. [26] H. S. A. S.
Kurukulaaratchy RJ, Matthews S, “Predicting persistent disease among children who wheeze during early life”, Eur Respir Journal, Vol. 22, pp. 767–71, 2003. [27] M. P. H. G. P. M. M. K. M. e. a. Lodrup Carlsen KC, Soderstrom L, “Severity of obstructive airways disease by age 2 years predicts asthma at 10 years of age”, Allergy, Vol. 65, pp. 1134–40, 2010. [28] A. L. Marat and P. S. McPherson, “Variants of DENND1B associated with asthma in children”, N Engl J Med, Vol. 363, No. 10, pp. 988–9; author reply 989, 2010. [29] M. F. Moffatt, I. G. Gut, F. Demenais, D. P. Strachan, E. Bouzigon, S. Heath, E. von Mutius, M. Farrall, M. Lathrop, and W. O. Cookson, “A largescale, consortium-based genomewide association study of asthma”, N Engl J Med, Vol. 363, No. 13, pp. 1211–21, 2010. [30] L. J. Mortensen, E. Kreiner-Moller, H. Hakonarson, K. Bonnelykke, and H. Bisgaard, “The PCDH1 gene and asthma in early childhood”, Eur Respir J, Vol. 43, No. 3, pp. 792–800, 2014. [31] E. A. Nohr, N. J. Timpson, C. S. Andersen, G. Davey Smith, J. Olsen, and T. I. A. Sorensen, “Severe obesity in young women and reproductive health: the Danish National Birth Cohort”, PloS one, Vol. 4, No. 12, p. e8444, 2009. [32] B. Norgaard-Pedersen and D. M. Hougaard, “Storage policies and use of the Danish Newborn Screening Biobank”, Journal of inherited metabolic disease, Vol. 30, No. 4, pp. 530–6, 2007. [33] G. K. D. P. O.E. Savenije, M. Kerkhof, “Predicting who will have asthma at school age among preschool children”, J Allergy Clin Immunol, Vol. 130, No. 2, p. 325331, 2012-. [34] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. de Bakker, M. J. Daly, and P. C. Sham, “PLINK: a tool set for wholegenome association and population-based linkage analyses”, Am J Hum Genet, Vol. 81, No. 3, pp. 559–75, 2007. [35] S. R. N. Ravi Iyengar, Prahlad Ram, “G alpha 12 Pathway”, G alpha 12 Pathway. [36] S. Roongapinun, S. Y. Oh, F. Wu, A. Panthong, T. Zheng, and Z. 
Zhu, “Role of SHIP-1 in the adaptive immune responses to aeroallergen in the airway”, PLoS One, Vol. 5, No. 11, p. e14174, 2010. [37] D. Rumelhart, G. Hinton, and R. Williams, Learning internal representations by error propagation, Parallel Distributed Processing, vol. 1, 318-362, MIT Press, Cambridge, 1986. [38] A. S. Saffar, M. P. Alphonse, L. Shan, K. T. Hayglass, F. E. Simons, and A. S. Gounni, “IgE modulates neutrophil survival in asthma: role of mitochondrial pathway”, J Immunol, Vol. 178, No. 4, pp. 2535–41, 2007. [39] D. C. Simoes, A. M. Psarra, T. Mauad, I. Pantou, C. Roussos, C. E. Sekeris, and C. Gratziou, “Glucocorticoid and estrogen receptors are reduced in mitochondria of lung epithelial cells in asthma”, PLoS One, Vol. 7, No. 6, p. e39183, 2012. [40] S. Singh, Y. S. Prakash, A. Linneberg, and A. Agrawal, “Insulin and the Lung: Connecting Asthma and Metabolic Syndrome”, J Allergy (Cairo), Vol. 2013, p. 627384, 2013. [41] P. M. Sleiman, J. Flory, M. Imielinski, J. P. Bradfield, K. Annaiah, S. A. Willis-Owen, K. Wang, N. M. Rafaels, S. Michel, K. Bonnelykke, H. Zhang, C. E. Kim, E. C. Frackelton, J. T. Glessner, C. Hou, F. G. Otieno, E. Santa, K. Thomas, R. M. Smith, W. R. Glaberson, M. Garris, R. M. Chiavacci, T. H. Beaty, I. Ruczinski, J. S. Orange, J. Allen, J. M. Spergel, R. Grundmeier, R. A. Mathias, J. D. Christie, E. von Mutius, W. O. Cookson, M. Kabesch, M. F. Moffatt, M. M. Grunstein, K. C. Barnes, M. Devoto, M. Magnusson, H. Li, S. F. Grant, H. Bisgaard, and H. Hakonarson, “Variants of DENND1B associated with asthma in children”, N Engl J Med, Vol. 362, No. 1, pp. 36–44, 2010. [42] M. W. Su, K. Y. Tung, P. H. Liang, C. H. Tsai, N. W. Kuo, and Y. L. Lee, “Gene-gene and gene-environmental interactions of childhood asthma: a multifactor dimension reduction approach”, PLoS One, Vol. 7, No. 2, p. e30694, 2012. [43] G. P. Tamesis, R. A. Covar, M. Strand, A. H. Liu, S. J. Szefler, and M. D. Klinnert, “Predictors for asthma at age 7 years for low-income children enrolled in the Childhood Asthma Prevention Study”, J Pediatr, Vol. 162, No. 3, pp. 536–542 e2, 2013. [44] B. N. Tran, N. D. Nguyen, V. X. Nguyen, J. R. Center, J. A. Eisman, and T. V. Nguyen, “Genetic profiling and individualized prognosis of fracture”, J Bone Miner Res, Vol. 26, No. 2, pp. 414–9, 2011. [45] D. J. Turner, S. M. Stick, K. L. Lesouef, P. D. Sly, and P. N. Lesouef, “A new technique to generate and assess forced expiration from raised lung volume in infants”, Am J Respir Crit Care Med, Vol. 151, No. 5, pp. 1441–50, 1995. [46] C. E. van Beijsterveldt and D. I. Boomsma, “Genetics of parentally reported asthma, eczema and rhinitis in 5-yr-old twins”, Eur Respir J, Vol. 29, No. 3, pp. 516–21, 2007. [47] A. M. Vignola, G. Chiappara, P. Chanez, A. M. Merendino, E. Pace, M. Spatafora, J. Bousquet, and G. Bonsignore, “Growth factors in asthma”, Monaldi Arch Chest Dis, Vol. 52, No. 2, pp. 159–69, 1997. [48] K. Wang, M. Li, and H. Hakonarson, “Analysing biological pathways in genome-wide association studies”, Nat Rev Genet, Vol. 11, No. 12, pp. 843–54, 2010. [49] M. Weitzman, S. Gortmaker, and A. Sobol, “Racial, social, and environmental risks for childhood asthma”, Am J Dis Child, Vol. 144, No. 11, pp. 1189–94, 1990. [50] N. R. Wray, M. E. Goddard, and P. M. Visscher, “Prediction of individual genetic risk to disease from genome-wide association studies”, Genome Res, Vol. 17, No. 10, pp. 1520–8, 2007. [51] N. Yiannakouris, A. Trichopoulou, V. Benetou, T. Psaltopoulou, J. M. Ordovas, and D. Trichopoulos, “A direct assessment of genetic contribution to the incidence of coronary infarct in the general population Greek EPIC cohort”, Eur J Epidemiol, Vol. 21, No. 12, pp. 859–67, 2006.

Part III

Obesity

Obesity aetiology

Introduction

Obesity is another endemic complex disease growing at high rates in the developed as well as developing parts of the world [211].
Obesity is caused when energy intake exceeds energy expenditure [212] and is influenced by genetics, diet, age and lifestyle [213]. Physiologically, obesity presents when abnormal amounts of triglycerides are stored in adipose tissue and later released from adipose tissue as free fatty acids (FFA), with detrimental effects on other organs. Obesity can lead to other chronic conditions, including cardiovascular diseases, type II diabetes mellitus, osteoarthritis of the lower extremities and mobility disorders, and it increases overall mortality. The goal of ongoing obesity research is to elucidate the pathways and mechanisms that control obesity and to improve prevention, management and therapy [214].

Adipose tissue plays a major role in nutrient homeostasis by serving as the energy storage organ and as the source of energy during fasting, which makes it important in the pathophysiology of obesity. Adipose tissue is a mesh of different cells, such as adipocytes (commonly called fat cells), stromal cells, vessels and nerves, held together by elements of the extracellular matrix. Adipose tissue is also regarded as an endocrine organ, as it secretes factors like adipsin, TNF-α and leptin, which are known to affect the activity of other organs. Adipose tissues also differ in size and function, and their potential contribution to disease depends on their type and location within the body. In humans, adipose tissue can be broadly classified into subcutaneous (below the skin) and visceral (around the organs). In mice, the adipose tissue is made up of two subcutaneous depots, called the inguinal, and several visceral depots near multiple organs. For example, the fat near the kidneys is called perirenal and the fat near the epididymis is called epididymal.

Adipose tissue

Traditionally, adipocytes have been classified on the basis of their morphology into two types: brown adipocytes and white adipocytes (Figure II.1).
Brown adipocytes contain numerous fat droplets and are specialised to dissipate stored chemical energy in the form of heat. They make up the brown adipose tissue (BAT). Brown adipocytes are characterised by high expression of uncoupling protein-1 (UCP-1), which lets protons pass across the inner mitochondrial membrane without driving adenosine triphosphate (ATP) synthesis, so that the energy is released as heat. White adipocytes, on the other hand, are present in the white adipose tissue (WAT) and are unilocular cells known for storing energy and increasing weight. Large mammals like humans are born with brown fat that disappears in the first few months after birth and is later replaced by WAT. Another, temporary or intermediate, form of adipose tissue consists of “beige” or “brite” adipocytes. They are UCP-1 positive cells with a brown fat-like morphology within white fat depots.

[Figure II.1 panels: White fat cell; Brown fat cell]

Figure II.1 White adipose cell and brown adipose cell. M = Mitochondria, LV = Lipid vesicles.

Adipocytes develop from mesenchyme, but there is disagreement in the field about the origin of brown and white fat cells and their replacement by each other. A recent review by Rosen et al. [215] discusses the state-of-the-art knowledge about the different adipocytes and their mechanisms of survival. BAT is more efficient in energy utilisation and is seen as a prospective key to preventing obesity. Adult humans have a small amount of functional BAT, and the detection of brown adipose tissue correlates inversely with age [216]. Mechanisms to increase the BAT content of the body, or to make it work more efficiently, have been sought as therapeutic methods to overcome obesity. Knowledge of the mechanism by which BAT is converted to WAT in early life in mammals is one piece of the puzzle, and it is important for understanding BAT-WAT interconversions.

Chapter 5 presents the work in the field of adipose biology where the conversion of brown fat tissue to white adipose tissue has been modelled in another precocial mammal, i.e.
sheep, to replicate what happens in humans. Precocial mammals have a long gestation period and are born with UCP-1-expressing brown fat, which is later replaced by white fat [217]. This replacement follows the immediate start of non-shivering thermogenesis at birth [217]. The project aims at identifying the factors responsible for the replacement of BAT by WAT through transcriptome profiling of adipose tissue by RNA-seq at time points from 2 days before birth to 14 days after birth.

Figure II.2 The transformation of white adipose tissue during obesity. In obesity, adipose tissue undergoes hypertrophy followed by inflammation by invading macrophages [215].

Obesity is basically adipose tissue growth involving the enlargement of adipocytes, called hypertrophy (Figure II.2), as well as an increase in the number of adipocytes by the recruitment of new adipocytes, called hyperplasia. Hypertrophy usually precedes hyperplasia. The conversion of pre-adipocytes to adipocytes during hyperplasia is influenced by neural inputs and by hormones secreted either by other endocrine organs or by adipose tissue itself. As the adipose tissue increases in size, neurovasculature also develops to supply blood and nervous signals to the enlarged tissue and to drain the lymph. Sometimes the growth of adipose tissue exceeds the development of the vasculature, which leads to oedema. An enhanced pro-inflammatory status is observed in obesity, with elevated levels of adipokines, e.g. leptin, and cytokines like tumor necrosis factor TNF-α (Figure II.2). Accumulating evidence from epidemiological studies of obesity implies a role for “metabolic memory” in fat cells. For example, malnutrition in the prenatal stage or childhood obesity significantly increases the risk of adult-onset obesity, including insulin resistance, which is referred to as “fetal programming” [218]. On the other hand, obesity is also known as a lifestyle disease, and junk food and fats are being blamed for the increased rate of obesity not only in adults but also in children. Thus, current research in obesity tries to explore the genetics as well as the effect of diet.
Obesity and Epigenetics

Obesity has a genetic component, and the human obesity gene map in 1996 collected 127 genes from various studies linked to obesity phenotypes [219]. The obesity gene atlas identified 1,515 protein-coding genes and 221 miRNAs compiled from studies in four different mammals: human, cattle, rat and mouse [220]. Along with genetics, alterations in epigenetic marks like DNA and histone methylation are also connected to body weight and weight loss [221]. Thus, epigenetic mark profiling can be used to predict the susceptibility to obesity, and with the implementation of weight loss programs and other therapeutic approaches the negative outcome can be prevented. Leptin and TNF-α methylation levels can be used as epigenetic biomarkers for weight loss as well as for comorbidities like diabetes and hypertension [222]. Environmental exposures are likely to have an epigenetic effect on complex diseases, as it is known that tobacco smoke modifies gene expression by DNA hypermethylation [223]. Body weight homeostasis is regulated through a complex mechanism involving genetics and epigenetics, which are influenced by dietary intake and physical activity [224]. In a human study, dietary folic acid intake has been associated with DNA methylation, with a transgenerational effect [225]. Thus, the food we eat directly or indirectly affects cells and their epigenome. Leptin-deficient (ob/ob) mice are a widely used mouse model of genetic obesity and diabetes, as they have hyperglycemia and obesity. High-fat foods are highly prevalent in modern society and are a main contributor to human obesity today. The high-fat-diet-fed mouse model closely mimics high-fat-diet-induced obesity in humans and thus serves as an important model to study obesity caused by high-fat foods. These mice have elevated blood glucose and impaired glucose tolerance, and subsequently acquire insulin resistance.
With the knowledge of the epigenetic biomarkers and the exposome for obesity, two models of obesity, genetic and diet-induced, were designed. For the genetic model, DNA methylation levels were compared between ob/ob and wild type mice. In the diet-induced obesity model, mice were fed a high-fat diet for 15 weeks and methylation was compared to that of mice fed a regular diet. Gene expression data were also used to support the methylation data in the diet model, as it is of more clinical relevance for diet-induced obesity in humans. To find out how different adipose tissues react to genetic and diet-induced obesity, both inguinal and epididymal fat depots were examined and compared. The results from this study are presented in chapter 6.

Chapter 5
Paper III - Brown to white adipose tissue transition

Prelude

In all mammals (including humans) there are two types of adipose tissue, BAT and WAT. BAT is specialised in energy dissipation and the generation of heat by oxidation of glucose and fatty acids, whereas WAT is wired for energy storage. This project addresses the postnatal transformation of the innate brown adipose tissue to white adipose tissue in sheep (Ovis aries). From earlier studies it is known that this transformation takes place within about two weeks after birth. As sheep is a large mammal, it is likely that the transformation in sheep mimics the postnatal brown-to-white adipose conversion occurring in newborn human babies. To find out the underlying mechanism of this transformation, adipose tissue was collected from sheep at multiple time points. These include time points before birth (day -2) and shortly after birth (until day 14). Tag-RNA sequencing, as described in chapter 1.4, was performed on the samples to evaluate differential gene expression at the different time points. The seven time points were clustered based on gene expression data into three classes representing brown adipose tissue, a transition state and white adipose tissue.
The differentially expressed genes between these three states represent the changes occurring when one type of adipose tissue is transformed into another. Gene ontology and pathway enrichment analyses were done on the significantly changing gene clusters to uncover the underlying biological mechanism during the transformation. Functional analysis was carried out to reveal novel TFs linked to the adipose transformation process in large mammals.

Submitted manuscript

Global gene expression profiling of brown to white adipose tissue transformation in sheep reveals novel transcriptional components linked to adipose remodeling

Astrid L. Basse1,2,**, Karen Dixen1,2,**, Rachita Yadav3,**, Malin P. Tygesen4, Klaus Qvortrup2, Karsten Kristiansen1, Bjørn Quistorff2, Ramneek Gupta3,∗, Jun Wang1,5,6,7 and Jacob B. Hansen1,†

1 Department of Biology, University of Copenhagen, DK-2100 Copenhagen, Denmark
2 Department of Biomedical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
3 Center for Biological Sequence Analysis, The Technical University of Denmark, Copenhagen, Denmark
4 Department of Veterinary Clinical and Animal Sciences, University of Copenhagen, DK-1870 Frederiksberg, Denmark
5 BGI-Shenzhen, Shenzhen 518083, China
6 Princess Al Jawhara Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Jeddah 21589, Saudi Arabia
7 Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau 999078, China
** Joint first authorship

ABSTRACT

Background
Large mammals are capable of thermoregulation shortly after birth due to the presence of brown adipose tissue (BAT). The majority of BAT disappears after birth and is replaced by white adipose tissue (WAT).

Results
We analyzed the postnatal transformation of adipose in sheep with a time course study of the perirenal adipose depot.
We observed changes in tissue morphology, gene expression and metabolism within the first two weeks of postnatal life consistent with the expected transition from BAT to WAT. The transformation was characterized by massively decreased mitochondrial abundance and down-regulation of gene expression related to mitochondrial function and oxidative phosphorylation. Global gene expression profiling demonstrated that the time points grouped into three phases: a brown adipose phase, a transition phase and a white adipose phase. Between the brown adipose and the transition phase 170 genes were differentially expressed, and 717 genes were differentially expressed between the transition and the white adipose phase. Thirty-eight genes were shared among the two sets of differentially expressed genes. We identified a number of regulated transcription factors, including NR1H3, MYC, KLF4, ESR1, RELA and BCL6, which were linked to the overall changes in gene expression during the adipose tissue remodeling. Finally, the perirenal adipose tissue expressed both brown and brite/beige adipocyte marker genes at birth, the expression of which changed substantially over time.

Conclusions
Using global gene expression profiling of the postnatal BAT to WAT transformation in sheep, we provide novel insight into adipose tissue plasticity in a large mammal, including identification of novel transcriptional components linked to adipose tissue remodeling. Moreover, our data set provides a useful resource for further studies in adipose tissue plasticity.

KEY WORDS – BAT, brite/beige adipose tissue, global gene expression profiling, mitochondrial number, sheep, tag-based sequencing, transcription factors, UCP1, WAT

∗ Corresponding author: Ramneek Gupta, e-mail: ramneek@cbs.dtu.dk
† Corresponding author: Jacob B. Hansen, e-mail: jacob.hansen@bio.ku.dk

Background

Two types of adipose tissue exist based on morphological appearance and biological function.
White adipose tissue (WAT) stores energy in the form of triacylglycerol (TAG) for later release and use by other tissues, whereas brown adipose tissue (BAT) metabolizes fatty acids and glucose for heat production. Thermogenesis through uncoupled respiration in BAT depends on a high mitochondrial density and expression of uncoupling protein 1 (UCP1) [11]. Larger mammals such as primates and ruminants are born fully developed and able to thermoregulate minutes after birth due to the presence of relatively large amounts of functional BAT, which becomes activated at birth. The majority of this BAT disappears after birth and is replaced by WAT [11, 12, 21]. Contrary to larger mammals, rodents are born with immature BAT that matures only postnatally and is largely retained throughout life [11, 36]. It is relevant to understand the brown to white adipose tissue remodeling in large mammals, as it is likely to mimic the transition occurring in human infants. The most frequently studied adipose tissue transition in a large mammal is the postnatal transformation of the perirenal adipose tissue in sheep. Around the time of birth, all visceral adipose depots in lambs are brown in nature [20, 21]. Lambs are normally born with approx. 30 g perirenal adipose tissue, constituting 80 % of all their adipose tissue. The brown characteristics of the perirenal adipose depot change dramatically to a white adipose phenotype within a few weeks after birth [20, 21]. Although some gene expression details have been reported [33, 40], relatively little is known about this transition at the molecular level. Here we report a comprehensive time course analysis of the postnatal BAT to WAT transformation process of the perirenal adipose depot in lambs, including histological, biochemical and molecular examination as well as analyses of global gene expression profiles.
We provide evidence for dramatic changes in mitochondrial function and fatty acid metabolism during the adipose remodeling, and we identified a number of transcriptional components linked to this adipose tissue transformation process.

Results

Characterization of the postnatal brown to white adipose transformation

At birth (designated day 0) the perirenal adipose tissue macroscopically appeared dark brown. The brown color faded steadily during the time course, and the tissue ended up being white in appearance at postnatal days 30 and 60 (data not shown). Accompanying the “whitening”, the volume of the tissue gradually increased (data not shown). HE stained sections were prepared from all lambs, and representative sections from days 0, 2, 4, 14 and 30 are presented in Figure 1A. In the first week of life, the tissue was an apparent mixture of brown adipocytes with multilocular lipid droplets and white adipocytes with large unilocular lipid droplets. By appearance, the perirenal adipose tissue contained mostly brown adipocytes at early ages (days 0 to 4), whereas white adipocytes were predominant from day 14. To approach the BAT to WAT transformation in molecular terms, we measured mRNA and protein levels of selected marker genes by RT-qPCR and immunoblotting, respectively (Figure 1B and 1C). Expression of UCP1, the brown adipocyte-specific key thermogenic factor, was high and relatively stable until day 4, after which it became nearly undetectable. The BAT-enriched factors type II iodothyronine deiodinase (DIO2) and peroxisome proliferator-activated receptor γ (PPARG) co-activator 1α (PPARGC1A) were also highly expressed at days -2 and 0, but displayed a faster and stepwise decrease in expression, being considerably reduced already at days 0.5 and 1 and poorly expressed after day 4 (Figure 1B). In summary, at the level of macroscopic, microscopic and molecular analyses, we observed the expected postnatal transformation of BAT to WAT.
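The relative RT-qPCR expression values used in this section are reported as normalized to β-actin (ACTB). The short sketch below illustrates one common way such a normalization can be computed, assuming the 2^-ΔCt method with equal amplification efficiency for both genes; the manuscript does not state which formula was used, and the function names and Ct values are hypothetical.

```python
import math


def relative_expression(ct_target, ct_reference):
    """Target expression relative to a reference gene, as 2^-(Ct_target - Ct_reference).

    Assumption: both amplicons double per cycle (100 % efficiency).
    """
    return 2.0 ** -(ct_target - ct_reference)


def mean_and_sem(values):
    """Return the mean and the standard error of the mean of a list of values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, math.sqrt(variance / n)


# Hypothetical (Ct target gene, Ct ACTB) pairs for three animals at one time point.
ct_pairs = [(25.0, 20.0), (26.0, 20.0), (25.0, 21.0)]
ratios = [relative_expression(target, ref) for target, ref in ct_pairs]
mean_ratio, sem_ratio = mean_and_sem(ratios)
print(f"relative expression: {mean_ratio:.4f} +/- {sem_ratio:.4f} (mean, SEM)")
```

A lower Ct for the target gene relative to ACTB gives a higher relative expression, so a gene such as UCP1 that becomes nearly undetectable would show a steadily growing ΔCt over the time course.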
Mitochondrial density declined during brown to white adipose transformation

The ultra-structure of the perirenal adipose tissue was investigated at selected days by TEM (Additional file 1). TEM confirmed the mixed presence of multilocular and unilocular adipocytes at days 0 to 4 and the predominant presence of the latter at day 14. Adipocyte mitochondrial density was very high at days 0 to 4 and appeared lower at day 14. To estimate mitochondrial density quantitatively, we determined mtDNA content by qPCR as the ratio of mtDNA to nDNA (Figure 2A). This ratio decreased approx. 7-fold between days 0 and 60, indicating that the number of mitochondria per cell diminished during the BAT to WAT transformation. The ultra-structural observations and the mtDNA/nDNA ratios prompted us to investigate more carefully gene expression of relevance for mitochondrial abundance and function. The mRNA levels of the tricarboxylic acid (TCA) cycle enzyme CS decreased gradually during the time course (Figure 2B). CS activity, on the other hand, was high and stable until day 4, after which it dropped (Figure 2C). Two other mitochondrial genes were analyzed: the electron transporter cytochrome c1 (CYC1) and ATP5B. Levels of CYC1 mRNA (Figure 2D) and ATP5B protein (Figure 1C) displayed a time profile similar to that of CS activity. A number of nuclear transcription factors regulate expression of genes encoding mitochondrial proteins. These nuclear transcription factors include PGC-1 family members, nuclear respiratory factor 1 (NRF1) and a number of nuclear receptors. In addition to PPARGC1A (see Figure 1B), we measured the expression of PPARGC1B, PPARA (also known as NR1C1), estrogen-related receptor α (ERRA, also known as NR3B1) and NRF1 by RT-qPCR (Additional file 1). The expression pattern of PPARGC1B and
The expression pattern of PPARGC1B and Submitted manuscript 3 Figure 2 ANOVA p< 0.0001 mtDNA / nDNA 2000 1500 1000 * * 500 0 C -2 0 0.5 1 2 4 14 30 60 Age (days) 2.5 CS U / mg protein * ANOVA p< 0.0001 2.0 1.5 1.0 0.5 0.0 * * 14 30 * -2 0 0.5 1 2 4 Age (days) 60 B Relative CS mRNA 2500 1.6 ANOVA p< 0.0001 1.2 0.8 0.4 0.0 D Relative CYC1 mRNA A -2 0 0.5 * * * 1 2 4 * * * 14 30 60 Age (days) 0.12 ANOVA p< 0.0001 0.09 0.06 0.03 * 0.00 -2 0 0.5 1 2 4 14 30 * 60 Age (days) Figure 2: Mitochondrial density declines during brown to white Figure 1: Characterization of the postnatal brown to white adipose transformation. (A) Hematoxylin-eosin (HE) staining of perirenal adipose tissue at postnatal days 0, 2, 4, 14 and 30. Representative HE-stained sections are shown for the indicated time points (n = 5). (B) Total RNA was isolated from perirenal adipose tissue and used for RT-qPCR analysis. Relative expression was measured for uncoupling protein 1 (UCP1), type II iodothyronine deiodinase (DIO2) and peroxisome proliferatoractivated receptor γ (PPARG) co-activator 1α (PPARGC1A). The mRNA expression levels were normalized to expression of β-actin (ACTB). Data are mean +SEM (n =4-5); *, p<0.05 vs. day 0. (C) The level of uncoupling protein 1 (UCP1) and ATP synthase subunit β (ATP5B) was determined by immunoblotting on protein pools, one for each day during the time course. Transcription factor IIB (TFIIB) was used as a loading control. ERRA was similar to CYC1, with stable expression until day 4, followed by lower expression at subsequent time points. Expression of PPARA and NRF1 transiently decreased after birth (Additional file 1). Accordingly, based on ultra-structure, relative mtDNA measurements, expression and activity of key mitochondrial enzymes as well as expression of nuclear transcription factors controlling levels of mitochondrial factors, we concluded that mitochondrial density and function declined remarkably during the transition from BAT to WAT. 
adipose transformation. (A) Total DNA was isolated from perirenal adipose tissues and analyzed by qPCR with primers specific for mtDNA (cytochrome c oxidase I (MT-CO1)) and nDNA (suppression of tumorigenicity 7 (ST7)). The relative mtDNA copy number was obtained as the ratio of MT-CO1 to ST7 levels. (B) Total RNA was isolated from perirenal adipose tissue and used for RT-qPCR analysis. Relative expression was measured for citrate synthase (CS). The mRNA expression levels were normalized to expression of β-actin (ACTB). (C) Enzyme activity (U) of CS was determined and normalized to protein content. (D) Relative expression of cytochrome c1 (CYC1) was measured by RT-qPCR as described in panel B. Data are mean +SEM (n = 4-5); *, p <0.05 vs. day 0. Global gene expression analysis of postnatal brown to white adipose transformation To obtain a global view of gene expression changes during the BAT to WAT transformation in the perirenal adipose depot, tag-based sequencing was performed on pools of mRNA from days -2, 0, 0.5, 1, 2, 4 and 14. The resulting reads were mapped to the sheep genome (v3.1). We were able to map approx. 20 % of the total reads from all 7 time points to 13,963 annotated genes in the sheep genome. The relatively low number of mapped reads accounts for the low annotation coverage available for the sheep genome. The number of mapped reads per annotated gene was counted, and normalized read counts for the 7 time points were calculated (Additional file 2). To facilitate downstream analyses, the human genes homologous to the sheep genes were mined, and the results from the gene expression analysis are discussed using the human protein symbols and names. Next, we compared the expression of the genes measured by RT-qPCR in Figure 1, 2 and Additional file 1 to their expression in the tag-based sequencing data. Expression A. 
of UCP1, DIO2, PPARGC1A, PPARGC1B, CS, CYC1 and ERRA decreased from day 0 to day 14 in both the sequencing data and when measured by RT-qPCR (Figure 1, 2, Additional file 1 and 2). In general, there was a relatively high correlation in the expression data obtained by the two methods. The most highly expressed gene at both day 0 and day 14 was fatty acid-binding protein 4 (FABP4), a gene known to be strongly enriched in adipocytes. Among the 20 genes with the highest expression level at day 0 and day 14 were several genes encoding ribosomal proteins, FABP5, the fatty acid transporter cluster of differentiation 36 (CD36), the glycolytic enzyme aldolase A (ALDOA) and regulator of cell cycle (RGCC), a cell cycle regulator and kinase modulating protein. Genes highly expressed at day 14 included the pentose phosphate pathway enzyme transaldolase (TALDO1) and catalase (CAT).

Figure 3: Identification of the brown adipose phase, transition phase and white adipose phase. (A) Principal component analysis (PCA) plot for the expression data from seven time points showing the clustering of time points in the first two components. (B) Heatmap showing the hierarchical clustering based on Euclidean distances between the time points. (C) Allocation of the different time points to the three phases and summary of numbers of induced and repressed genes between phases. [Panel C: brown (days -2 and 0), transition (days 0.5, 1, 2 and 4), white (day 14); 73 genes up-regulated and 97 genes down-regulated between the brown and transition phases; 378 genes up-regulated and 339 genes down-regulated between the transition and white phases.]
When comparing the 20 most highly expressed genes, genes related to fatty acid oxidation, electron transport chain and ATP synthase activity were more prevalent at day 0 compared to day 14.

Identification of three phases in the brown to white adipose tissue transformation process

To analyze the distribution of gene expression, a PCA was performed on the total gene expression data set (Figure 3A). The PCA plot indicated that total gene expression at the different time points clustered into three groups: a group including days -2 and 0, a second group including days 0.5, 1, 2 and 4, and a third group comprising day 14. Hierarchical clustering of the total gene expression data set clustered the 7 time points into the same three groups (Figure 3B). We interpreted the three clusters as distinct phases in the BAT to WAT transition (Figure 3C). At days -2 and 0 the tissue is in the brown adipose phase. The tissue is in a transition phase at days 0.5, 1, 2 and 4, where gene expression starts to change, e.g. illustrated by the decrease in PPARGC1A expression from day 0 to day 0.5 (see Figure 1B). Day 14 represents the white adipose phase, as was also suggested by tissue morphology, mitochondrial numbers and function as well as expression level of UCP1 (Figure 1 and 2).

Table 1: GO enrichment analysis of the 170 genes differentially expressed from the brown adipose phase to the transition phase.

GO Term                                            Number of genes   p-value
Enrichment from upregulated genes
Muscle cell migration                              4                 1.3774E-06
Negative regulation of adaptive immune response    3                 3.38951E-05
Enrichment from down-regulated genes
Organic acid metabolic process                     16                1.26078E-06
Isocitrate metabolic process                       3                 1.47831E-06
Small molecule metabolic process                   28                1.0335E-06

The expression of 170 genes changed significantly (p-value < 0.1) between the brown adipose phase and the transition phase (Additional file 3). Of these, 73 genes were up-regulated and 97 genes were down-regulated (Figure 3C).
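The grouping of time points into phases by hierarchical clustering on Euclidean distances can be illustrated with a small, self-contained sketch. It assumes average-linkage agglomerative clustering, which is a stand-in because the manuscript does not name the linkage method, and the three-gene expression profiles are hypothetical toy values rather than data from the study.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_time_points(profiles, n_clusters):
    """Average-linkage agglomerative clustering of named profiles.

    Starts with one cluster per time point and repeatedly merges the two
    clusters with the smallest mean pairwise distance until n_clusters remain.
    """
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            dists = [euclidean(profiles[a], profiles[b])
                     for a in clusters[i] for b in clusters[j]]
            avg = sum(dists) / len(dists)
            if best is None or avg < best[0]:
                best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i (i < j)
        del clusters[j]
    return clusters


# Hypothetical normalized expression of three genes at the seven time points.
profiles = {
    "day -2": [10.0, 9.0, 1.0],
    "day 0": [10.0, 8.5, 1.2],
    "day 0.5": [6.0, 5.0, 3.0],
    "day 1": [5.5, 5.2, 3.1],
    "day 2": [5.8, 4.8, 3.3],
    "day 4": [5.0, 4.5, 3.6],
    "day 14": [1.0, 1.0, 8.0],
}
phases = cluster_time_points(profiles, 3)
for members in phases:
    print(sorted(members))
```

With these toy profiles the three resulting clusters correspond to the brown (days -2 and 0), transition (days 0.5 to 4) and white (day 14) groups; on real data the outcome depends on the normalization and on the linkage method chosen.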
A heatmap with Euclidean distances for the 170 genes is shown in Figure 4A. GO enrichment analysis on the 73 up-regulated genes revealed that they were enriched for genes related to “negative regulation of adaptive immune response” and “smooth muscle cell migration”, whereas the 97 down-regulated genes were enriched for genes related to “organic acid metabolic processes” (Table 1). Between the transition phase and the white adipose phase, the expression of 717 genes changed significantly, of which 378 genes were up-regulated and 339 were down-regulated (Figure 3C and Additional file 4). These differentially expressed genes are presented in a heatmap in Figure 4B. A GO enrichment analysis demonstrated that the 378 up-regulated genes were enriched for genes related to “cell death” and “negative regulation of cell death” (Table 2). The 339 down-regulated genes were enriched for genes related to “metabolic process”, including “fatty acid beta-oxidation” (Table 2).

Figure 4: Gene expression changes in the two phase shifts. (A) Heatmap of the 170 genes differentially expressed from the brown adipose phase to the transition phase. (B) Heatmap of the 717 genes differentially expressed from transition to the white phase.

Table 2: GO enrichment analysis of the 717 genes differentially expressed from transition to the white adipose phase.
GO terms enriched among up-regulated genes: Enzyme linked receptor protein signaling pathway; Negative regulation of cell death; Cell death.
GO terms enriched among down-regulated genes: Generation of precursor metabolites and energy; Oxidation-reduction process; Cellular respiration; Mitochondrial ATP synthesis coupled electron transport; Metabolic process (242 genes); Fatty acid beta-oxidation; Lipid modification; Mitochondrion organization; Monocarboxylic acid transport.
[The "Number of genes" and "p-value" columns cannot be assigned to rows reliably. Number of genes: 36, 80, 67, 38, 9, 242, 11, 13, 18, 33, 19, 49. p-values: 3.11131E-06, 7.74765E-05, 6.11349E-05, 9.08327E-48, 3.85639E-05, 9.96388E-15, 2.26018E-11, 8.84412E-07, 4.64066E-06, 9.55025E-28, 2.16333E-21, 1.20894E-31.]

The changes in gene expression related to the GO term “fatty acid beta-oxidation” were investigated in more detail by RT-qPCR (Figure 5). Of note, the white adipose phase included samples from days 14, 30 and 60 for RT-qPCR measurements, whereas the white adipose phase for the global gene expression analysis included samples from day 14 only (see Figure 3). Two key enzymes in β-oxidation are carnitine palmitoyltransferase 1B (CPT1B) and the hydroxyacyl-CoA dehydrogenase complex (HADH). The relative mRNA expression levels of both CPT1B and the catalytic subunit α of HADH (HADHA) decreased from the brown adipose phase to the transition phase and from the transition phase to the white adipose phase (Figure 5A). We also measured expression of two genes involved in fatty acid synthesis by RT-qPCR: acetyl-CoA carboxylase 1 (ACACA) and fatty acid synthase (FASN). Expression of both tended to increase during the postnatal adipose transformation (Figure 5B), suggesting a higher rate of fatty acid synthesis in WAT compared to BAT.

Figure 5: Expression of selected metabolic enzymes related to fatty acid metabolism. Total RNA was isolated from perirenal adipose tissue and used for RT-qPCR analysis. Relative expression was measured for: (A) carnitine palmitoyltransferase 1b (CPT1B) and hydroxyacyl-CoA dehydrogenase subunit α (HADHA); (B) acetyl-CoA carboxylase (ACACA) and fatty acid synthase (FASN); (C) diacylglycerol O-acyltransferase 1 (DGAT1) and DGAT2. The mRNA expression levels were normalized to expression of β-actin (ACTB). Data are mean +SEM (brown, n = 9; transition, n = 20; white, n = 15); *, p<0.05.

The GO term “metabolic process” included 242 genes, a number of which have been measured by RT-qPCR, including UCP1, CYC1 and CS. The two isoforms of the TAG synthesis enzyme diacylglycerol O-acyltransferase, DGAT1 and DGAT2, were also among the regulated metabolic genes. RT-qPCR measurements confirmed a decreased expression of DGAT1 and DGAT2 in the white adipose phase compared to the transition phase (Figure 5C).

Of the 170 genes differentially expressed between the brown adipose and transition phase and the 717 genes differentially expressed between the transition and the white adipose phase, 38 genes were in common. A Venn diagram of the 849 regulated genes is shown in Additional file 5. Fifteen of the 38 common genes were down-regulated at both phase shifts, whereas 9 genes were up-regulated at both phase shifts. Among the 15 consistently down-regulated genes were several mitochondrial genes, e.g. the TCA cycle enzyme isocitrate dehydrogenase 3α (IDH3A), and the two transcription factors myeloid leukemia factor 1 (MLF1) and autoimmune regulator (AIRE) (Additional file 6). Among the consistently up-regulated genes were two receptors involved in cellular lipid uptake: low density lipoprotein receptor-related protein 1 (LRP1) and macrophage scavenger receptor 1 (MSR1). The 38 genes also included 9 genes transiently up-regulated and 5 genes transiently down-regulated during the transition phase (Additional file 6). Among the genes up-regulated during the transition phase were two enzymes involved in TAG synthesis: the mitochondrial glycerol-3-phosphate acyltransferase (GPAM) and 1-acylglycerol-3-phosphate O-acyltransferase 9 (AGPAT9).

Transcriptional components regulated between the three phases of adipose tissue transformation

Expression of 17 transcription factors changed significantly between the brown adipose and the transition phase, of which 7 were up-regulated (Additional file 7). Between the transition and the white adipose phase, 74 transcription factors were differently expressed, with 48 being up-regulated (Additional file 7). Four transcription factors exhibited differential expression at both phase shifts. Two of the 17 transcription factors differently expressed between the brown adipose and the transition phase had consensus putative response elements in an enriched set of the 170 genes displaying altered expression in the same phase shift. These were nuclear receptor subfamily 1, group H, member 3 (NR1H3, also called LXRA), and v-myc avian myelocytomatosis viral oncogene homolog (MYC). Of the 717 differently expressed genes between the transition and the white adipose phase, an enriched set of genes contained consensus putative response elements for six transcription factors that were themselves regulated in the same phase shift. These transcription factors are NR1H3, MYC, B-cell lymphoma 6 (BCL6), estrogen receptor 1 (ESR1, also called NR3A1 or ESRA), v-rel reticuloendotheliosis viral oncogene homolog A (RELA) and krüppel-like factor 4 (KLF4). The expression levels of these six transcription factors in the three phases were validated by RT-qPCR (Figure 6). Expression of NR1H3 and MYC was significantly increased and decreased, respectively, in the transition phase compared to both the brown and the white adipose phase (Figure 6). Expression of the three transcription factors ESR1, RELA and KLF4 was up-regulated between the transition and white adipose phase, whereas BCL6 was significantly down-regulated from the transition to the white adipose phase (Figure 6).

Figure 6: Validation of expression patterns of the transcriptional components regulated between the three phases of adipose tissue transformation and having a consensus putative response element in an enriched set of regulated genes. Total RNA was isolated from perirenal adipose tissue and used for RT-qPCR analysis. Relative expression was measured for nuclear receptor subfamily 1, group H, member 3 (NR1H3), v-myc avian myelocytomatosis viral oncogene homolog (MYC), B-cell lymphoma 6 (BCL6), estrogen receptor 1 (ESR1), v-rel reticuloendotheliosis viral oncogene homolog A (RELA) and krüppel-like factor 4 (KLF4). The mRNA expression levels were normalized to expression of β-actin (ACTB). Data are mean +SEM (brown, n = 9; transition, n = 20; white, n = 15); *, p<0.05.

Figure 7A lists genes with differential expression between the brown adipose and transition phase that have a consensus putative response element for either MYC or NR1H3. Among the up-regulated genes that were potentially regulated by MYC from the brown adipose to the transition phase were the adhesion protein thrombospondin 2 (THBS2) and the Rab1 GTPase activator TBC1 domain family member 20 (TBC1D20). Among the down-regulated genes from the brown adipose phase to the transition phase potentially regulated by MYC were a 9-cis-retinoic acid synthesizing enzyme, aldehyde dehydrogenase 8 family member A1 (ALDH8A1), and the transcription factor basic helix-loop-helix family member E40 (BHLHE40).

[...] protein 2 (ANGPTL2), and one anti-angiogenic factor, serpin peptidase inhibitor F1 (SERPINF1) (Additional file 8).

Figure 7, panel A:
MYC. Up-regulated genes: CCDC158, LDHA, TBC1D20, THBS2. Down-regulated genes: ALDH8A1, BHLHE40, C1ORF51, OLFM1, SHISA3.
NR1H3. Up-regulated genes: ABCD2, AIMP1, IFT52, NR1H3, PI16, SIK2, ITPR2. Down-regulated genes: COQ10A, GCDH, IDH1, IMPDH2, SEC14L3.

Figure 7: Genes with altered expression that potentially are controlled by transcription factors regulated between the three phases of adipose tissue transformation. (A) List of genes regulated from the brown adipose phase to the transition phase, which have consensus putative response elements for nuclear receptor subfamily 1, group H, member 3 (NR1H3) and v-myc avian myelocytomatosis viral oncogene homolog (MYC).
(B) Subcellular localization of differentially expressed genes from the transition phase to the white adipose phase that have consensus putative response elements for NR1H3, MYC, B-cell lymphoma 6 (BCL6), estrogen receptor 1 (ESR1), krüppel-like factor 4 (KLF4) and v-rel reticuloendotheliosis viral oncogene homolog A (RELA). Red nodes indicate down-regulated genes in the white adipose phase as compared to the transition phase and green nodes indicate up-regulated genes in the white adipose phase as compared to the transition phase. The corresponding gene names are listed in Additional file 8. Figure 7B depicts the subcellular distribution of the 288 genes potentially regulated by MYC, ESR1, RELA, BCL6, KLF4 and NR1H3 between the transition and the white adipose phase. Gene names corresponding to Figure 7B are presented in Additional file 8. Forty of the regulated genes were mitochondrial genes, 37 of which were downregulated. This is in accordance with the decreased activity and amount of mitochondria in the white adipose phase (see Figure 2 and Additional file 1). Of notice, half of the down-regulated mitochondrial genes have been described to be regulated by RELA. Nineteen genes encoded secreted proteins, 11 of which were up-regulated, including two pro-angiogenic factors; vascular endothelial growth factor B (VEGFB) and angiopoietin-related Expression of white, brite/ beige and brown adipose markers in the three phases of adipose tissue transformation To study the three phases in the transformation process in more detail, we measured a number of brown and white adipose marker genes by RT-qPCR. As evident from Figure 1, we observed a down-regulation of UCP1 between the transition and white adipose phase, and a stepwise decrease in expression of DIO2 and PPARGC1A through the three phases of the transformation (Additional file 9). 
Expression of two transcription factors promoting white adipogenesis, nuclear receptor-interacting protein 1 (NRIP1, also called RIP140) and retinoblastoma 1 (RB1), was increased between the transition and the white adipose phase (Figure 8A). The typical white adipose marker gene leptin (LEP) displayed decreased expression in the transition phase compared to both the brown and white adipose phase (Figure 8A). Expression of the key transcriptional driver of brown adipogenesis, PR domain containing 16 (PRDM16), was not changed significantly between the phases (Figure 8B). Overall, these measurements supported the brown to white adipose transformation occurring in the sheep perirenal adipose tissue within the first two weeks after birth.

To address if the sheep perirenal adipose tissue qualified as being brown, brite/beige or a mixture of brown and brite/beige at birth, and whether this status of the tissue changed over time, we measured a number of recently proposed marker genes selectively expressed in brown versus brite/beige adipose tissue and adipocytes [44, 46, 53, 56, 57]. We determined the expression of the classical brown adipose marker genes solute carrier family 29 member 1 (SLC29A1), LIM homeobox 8 (LHX8), myelin protein zero-like 2 (MPZL2, also called EVA1) and zinc finger protein 1 (ZIC1). SLC29A1 was expressed at birth and showed a stepwise down-regulation through the two adipose phase shifts (Figure 8B). MPZL2 expression did not change significantly. In contrast, LHX8 mRNA increased steadily through the three phases (Figure 8B). ZIC1 was not detected in any of the perirenal adipose samples, but was easily detected in sheep brain (data not shown). We also measured the expression of the three brite/beige marker genes homeobox C8 (HOXC8), HOXC9 and tumor necrosis factor receptor superfamily member 9 (TNFRSF9, also called CD137). The expression of all three genes increased from the transition phase to the white adipose phase (Figure 8C).
Of note, HOXC8 and HOXC9 are marker genes for both WAT and brite/beige adipose tissue [53, 57]. However, expression of the brite/beige marker genes transmembrane protein 26 (TMEM26) and T-box protein 1 (TBX1) did not change (Figure 8C). In summary, three out of four markers of classical BAT were detectable, and two out of these three changed expression over time. All five measured brite/beige markers were detectable; the expression of three increased and two remained unchanged from the transition to the white adipose phase. Thus, markers of both brown and brite/beige adipose tissue were expressed in perirenal adipose tissue from sheep, and most of these markers displayed altered expression over time.

[Figure 8 panels: (A) relative NRIP1, RB1 and LEP mRNA; (B) relative PRDM16, SLC29A1, LHX8 and MPZL2 mRNA; (C) relative HOXC8, HOXC9, TNFRSF9, TMEM26 and TBX1 mRNA; each in the brown, transition and white phases]

Figure 8: Expression of genetic markers for brown and brite/beige adipose tissue during the three phases of the postnatal perirenal adipose tissue transformation. Gene expression was determined by RT-qPCR. (A) Relative levels of genes associated with white adipocytes: nuclear receptor-interacting protein 1 (NRIP1), retinoblastoma 1 (RB1) and leptin (LEP). (B) Relative levels of the brown adipose associated and marker genes: PR domain containing 16 (PRDM16), solute carrier family 29 member 1 (SLC29A1), LIM homeobox 8 (LHX8) and myelin protein zero-like 2 (MPZL2). Zinc finger protein 1 (ZIC1) was not detectable in any of the adipose samples. (C) Relative levels of the brite/beige adipose markers: homeobox C8 (HOXC8), HOXC9, tumor necrosis factor receptor superfamily member 9 (TNFRSF9), transmembrane protein 26 (TMEM26) and T-box 1 (TBX1). The mRNA expression levels were normalized to expression of β-actin (ACTB). Data are mean + SEM (brown, n = 9; transition, n = 20; white, n = 15); *, p < 0.05.

Discussion

Plasticity of adipose tissues is important for adaptation to changing physiological conditions [48]. In response to prolonged cold exposure, subcutaneous WAT depots of rodents undergo a transformation process during which numerous brite/beige adipocytes appear, thereby increasing overall thermogenic capacity of the animal [23, 28, 48]. In large mammals, a substantial part of the BAT present in the newborn converts to WAT after birth, which may reflect that the need for endogenous thermogenesis drops after the early postnatal period. As little molecular insight into this conversion in large mammals is available, we have in the present study conducted a detailed analysis of the postnatal transformation of perirenal adipose tissue in sheep. We chose this particular tissue, as it is the most frequently studied BAT depot in large mammals. Moreover, we reasoned that the transformation of this depot would suitably model the postnatal brown to white adipose transformation in humans.

The postnatal transformation process from BAT to WAT in perirenal adipose tissue occurred within the first two weeks after birth as determined by changes in tissue morphology, gene expression and mitochondrial density. Adipocyte morphology changed from being mainly multilocular to unilocular and the amount of mitochondria decreased. The expression of brown adipocyte-selective genes, e.g.
UCP1, DIO2 and PPARGC1A, declined, as did expression of additional genes encoding mitochondrial proteins. To understand the adipose transformation in more detail, we performed a global gene expression analysis with seven time points ranging from approx. two days before birth to two weeks after birth. By two independent analyses of the gene expression data, we determined that the transformation clustered into three phases: a brown adipose phase, a transition phase and a white adipose phase.

Regulated transcription factors

Between the brown adipose and the transition phase, 170 genes were differentially expressed, including 17 transcription factors, 10 of which were down-regulated (Additional files 3 and 7). Five of these have chromatin modifying activity: circadian locomotor output cycles kaput (CLOCK), nuclear receptor co-activator 1 (NCOA1, also called SRC1), proviral insertion site in Moloney murine leukemia virus lymphomagenesis (PIM1) and the SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily members SMARCC2 and SMARCD3. This leaves open the possibility that extensive remodeling of chromatin is occurring between the brown adipose and the transition phase. In accordance with its down-regulation in the perirenal adipose transformation (Additional file 7), NCOA1 has been reported to promote BAT activity in mice [39]. Between the transition and the white adipose phase, 717 genes were differentially expressed, of which 74 were transcription factors (Additional files 4 and 7). The list of regulated transcription factors included a few transcription factors known to be differently expressed in mouse WAT and BAT, e.g. NRIP1 and cell death-inducing DFFA-like effector a (CIDEA).
Among the down-regulated transcription factors from the transition to the white phase were some related to brown adipocyte function: early B-cell factor 2 (EBF2), which has been described to induce expression of brown adipose-specific PPARG target genes [42]; leucine-rich PPR motif-containing protein (LRPPRC), a PPARGC1A co-activator playing an important role in BAT differentiation and function [15]; and Y box binding protein 1 (YBX1), an inducer of bone morphogenetic protein 7 (BMP7) transcription and brown adipocyte differentiation [38].

Regulated transcription factors with a consensus putative response element in an enriched set of differentially expressed genes

Expression of four transcription factors, NR1H3, MYC, AIRE and MLF1, was regulated at both phase shifts. The former two have a consensus putative response element in an enriched set of genes displaying altered expression at the two phase shifts. Expression of NR1H3 and MYC was transiently increased and decreased, respectively, during the transition phase (Figure 6). Consistent with the opposite regulation during the transition phase, NR1H3 has been reported to suppress MYC expression in colon cancer cells [52] (Additional file 10). MYC is known to inhibit adipogenesis [18, 24], which might explain why it is down-regulated in the transition phase, where the tissue expands. NR1H3 has been described to regulate gene expression linked to several important aspects of both brown and white adipocyte biology, including adipogenesis, energy expenditure, lipolysis and glucose transport [29]. NR1H3 was reported to be present at higher levels in mouse BAT than WAT [49] and to suppress PPARγ-induced UCP1 expression by binding to the UCP1 enhancer together with NRIP1 in mouse adipocytes [54]. Accordingly, NR1H3 can regulate the expression of brown adipocyte-selective genes. However, the increased NR1H3 expression in the transition phase did not correlate with decreased UCP1 expression in our study, which might be explained by the relatively low expression of NRIP1 in the transition phase (Figure 8).

Besides NR1H3 and MYC, four other transcription factors, RELA, KLF4, ESR1 and BCL6, were regulated between the transition and the white adipose phase and found to have consensus putative response elements in an enriched number of genes regulated between these two phases. The former three were up-regulated from the transition to the white adipose phase, whereas BCL6 was down-regulated (Figure 6). Both RELA and ESR1 have been described to stimulate MYC expression [27, 43], which would be consistent with the increased expression of MYC in the white adipose phase (Figure 6 and Additional file 10). Of interest, RELA and ESR1 have opposite effects on adipogenesis, as both knockdown of RELA and activation of ESR1 by estrogen supplementation attenuated adipogenesis [30, 50]. WAT from mice with an adipocyte-specific knockout of RELA has decreased lipid droplet size, increased glucose uptake and reduced expression of adipogenic marker genes such as LEP, PPARG and adiponectin (ADIPOQ). Furthermore, the mice are obesity resistant and less sensitive to cold [50] (Additional file 10). ESR1 knockout mice have increased fat mass caused by adipocyte hyperplasia [25]. In addition, ESR1 can inhibit the transcriptional activity of RELA [19] (Additional file 10). Based on this, we speculate that RELA and ESR1 might contribute to the regulation of TAG accretion during the adipose transformation.

RELA might also contribute to the mitochondrial depletion observed, as RELA negatively impacts mitochondrial content in C2C12 myocytes [8]. RELA has putative response elements in the promoter of 20 genes encoding mitochondrial proteins regulated between the transition and the white adipose phase; of these, 17 were down-regulated. This is in accordance with RELA functioning both as a transcriptional activator and repressor [10]. RELA has been described to repress BCL6 expression through interferon regulatory factor 4 [45]. This is in accordance with the increased expression of RELA and the decreased expression of BCL6 in the white adipose phase (Figure 6). BCL6 is a transcriptional repressor with the ability to reduce the expression of e.g. MYC [37] (Additional file 10). The expression of BCL6 is strongly down-regulated by growth hormone in 3T3-F442A adipocytes [14]. Apart from this, BCL6 has not been linked to adipocyte or adipose tissue function. KLF4 and RELA have been reported to be functionally intertwined, as they directly interact to induce expression of selected genes [16], but also compete for interaction with a co-activator, thereby inhibiting each other's activity [4] (Additional file 10). KLF4 is important for induction of adipogenesis in vitro and its expression was reported to be induced in pre-adipocytes by cAMP. KLF4 stimulates CCAAT/enhancer-binding protein β (C/EBPB) expression, and C/EBPB in turn down-regulates KLF4 expression, thereby forming a negative feedback loop [9].

In summary, six transcription factors with differential expression during the adipose transformation have consensus putative response elements in an enriched set of the regulated genes, suggesting that they are involved in the control of the overall gene expression changes and thus potentially have an impact on remodeling of the tissue. Moreover, the six factors are mutually functionally linked, leaving open the possibility that they are part of a transcriptional network (Additional file 10).

Brown and brite/beige markers

A number of marker genes selectively expressed in white, brite/beige and brown adipose tissue and adipocytes have been reported [44, 46, 53, 56, 57]. It is being discussed whether human BAT is composed of brown or brite/beige adipocytes or a mixture of these. Moreover, it is not fully established to what extent expression of white, brite/beige and brown adipose marker genes changes in a particular adipose depot during development or remodeling. To elucidate this in sheep, we analyzed the expression of selected marker genes in the brown adipose, transition and white adipose phase of the perirenal adipose depot. Although we only measured marker gene expression in the perirenal depot, and thus have not compared expression levels to those in other adipose depots, we could detect both brown (e.g. LHX8) and brite/beige (e.g. TNFRSF9) adipocyte marker genes in the newborn sheep (Figure 8). Co-expression of marker genes selective for brown and brite/beige adipose tissues has also been observed in a recent study of the human supraclavicular brown adipose tissue [26]. In this human study, expression of UCP1 in supraclavicular biopsies was positively associated with expression of both BAT markers (e.g. ZIC1 and LHX8) and brite/beige adipose markers (e.g. TBX1 and TMEM26). In contrast, expression of the two WAT and brite/beige adipose markers HOXC8 and HOXC9 correlated with low UCP1 expression [26]. In our time course study, we did not observe a correlation between high UCP1 expression and high expression of ZIC1 (which was not detected), LHX8, TBX1 or TMEM26, but we did detect higher expression of HOXC8 and HOXC9 in the white adipose state, where UCP1 expression was low (Figure 8 and Additional file 9). A similar HOXC9 profile in sheep perirenal adipose tissue has been reported by others [40].

BAT, brite/beige adipose and WAT marker gene expression has not previously been studied in detail during adipose tissue remodeling in large mammals. Based on the selective expression profile in mice, our observation that HOXC8 and HOXC9 are induced in the white adipose phase suggests that the perirenal adipose depot converts from brown (not brite/beige) to white (Figure 8).
The brown adipose origin of the perirenal depot might be supported by the down-regulation of the BAT marker gene SLC29A1 during whitening. In contrast, the 5-fold increase in expression of LHX8 from the brown to the white adipose phase and the absence of ZIC1 expression were not consistent with this model. In addition to HOXC8 and HOXC9, a number of other brite/beige marker genes were expressed shortly after birth (Figure 8). Of note, the expression of these was either unchanged or up-regulated during whitening. In summary, markers of both brite/beige and brown adipose tissue are expressed in the sheep perirenal adipose tissue at birth and the expression of a number of these changes substantially over time. The latter might be important to consider when analyzing adipose tissue type-selective gene expression.

Model of the transformation process

The postnatal transformation from brown (or brite/beige) to white in the perirenal adipose tissue can occur through at least three different mechanisms: 1) through transdifferentiation of brown (or brite/beige) to white adipocytes; 2) through proliferation and differentiation of white adipogenic precursor cells and death of mature brown (or brite/beige) adipocytes; 3) through a combination of the two. In mice, brite/beige adipocytes can transdifferentiate into white adipocytes [44], but whether this observation extends to large mammals is unclear. If the white adipocytes arise exclusively from proliferation and differentiation of precursor cells, it would require an enormous cell turnover during the transition phase of the transformation, including extensive death of mature brown (or brite/beige) adipocytes. We would expect this to result in induction of cell cycle genes, but we did not observe this in the GO term analysis. Moreover, the expression profile of key cyclins (CCNA, CCNB, CCND and CCNE) was not consistent with massive cell cycling during the transition phase (Additional file 2).
In the transition to white adipose phase shift, we did observe an up-regulation of genes related to cell death, but nearly half of the up-regulated cell death-associated genes are negative regulators of cell death (Table 2). Thus, it is not obvious from the gene expression data whether cell death is increased or not. Of note, we did not observe evidence for significant cell death in any of the tissue sections analyzed. Consistently, Lomax et al. [33] failed to detect apoptotic cells during the transformation of the perirenal adipose tissue in sheep. Based on this, we find it plausible that transdifferentiation of brown (or brite/beige) to white adipocytes is a significant component of the postnatal transformation of the perirenal adipose depot in sheep. Clearly, additional studies are required to validate this, including a time course with more time points and a dedicated search for evidence of cell proliferation and death.

Conclusions

By global gene expression profiling, we provide novel information on the postnatal BAT to WAT transformation in sheep. This transformation process is poorly understood in molecular terms, but is of significant interest, as a similar transformation occurs in human infants after birth. An improved understanding of this tissue remodeling increases insights into adipose plasticity and might allow identification of targets suitable for interfering with the balance between energy-storing and energy-dissipating adipose tissue. Our results reveal novel transcription factors linked to the adipose transformation process in large mammals. Clearly, validation of their actual relevance in adipose function will require dedicated functional studies. Finally, we show that expression of adipose tissue-type selective marker genes changes substantially over time, which might be an underappreciated variable in such analyses.
Material and methods

Animals and tissues

Experimental procedures were in compliance with guidelines laid down by the Danish Inspectorate of Animal Experimentation. Lambs from cross-bred ewes (Texel x Gotland) in their second or third parturition, sired by a purebred Texel ram and born and raised at a commercial farm in Denmark, were used. During gestation ewes were fed hay ad libitum, 200 g barley and 200 g commercial concentrate per day. Ewes were housed in groups of 40 until lambing. After lambing they were housed individually for 2 days and subsequently housed in groups of 20 until they were transferred to pasture approx. one week after lambing. The ewe-reared lambs were kept on pasture consisting of a mixture of 70 % ryegrass and 30 % white clover. Lambs were killed by bolt pistol and bled by licensed staff. Perirenal adipose tissue was carefully dissected and frozen in liquid nitrogen for biochemical or molecular analyses or fixed for histology as described below. Lambs at the following ages (day relative to the time of birth) were used: -2 (n = 4), 0 (n = 5), 0.5 (n = 5), 1 (n = 5), 2 (n = 5), 4 (n = 5), 14 (n = 5), 30 (n = 5) and 60 (n = 5). Live weights of the lambs were kept similar within groups.

Hematoxylin-eosin (HE) staining

Samples were fixed in 4 % neutral buffered formaldehyde (pH 7.4) at room temperature for 24 h and subsequently at 4 °C until preparation. The tissue was processed to paraffin and sectioned in 4 µm sections. HE staining was performed according to standard procedures.

Transmission electron microscopy (TEM)

Samples were fixed in Karnovsky's fixative (2 % paraformaldehyde and 2.5 % glutaraldehyde in 0.08 M cacodylate buffer, pH 7.4) for 3-5 days at room temperature and subsequently stored in 0.08 M cacodylate buffer at 4 °C until further processing. The samples were rinsed three times in 0.15 M Sorensen's phosphate buffer (pH 7.4) and subsequently postfixed in 1 % OsO4 in 0.12 M sodium cacodylate buffer (pH 7.4) for 2 h. The specimens were dehydrated in a graded series of ethanol, transferred to propylene oxide and embedded in Epon according to standard procedures. Ultrathin sections were cut with a Reichert-Jung Ultracut E microtome and collected on single slot copper grids with Formvar supporting membranes. Sections were stained with uranyl acetate and lead citrate and examined with a Philips CM-100 transmission electron microscope, operated at an accelerating voltage of 80 kV. Digital images were recorded with a SIS MegaView2 camera and the analySIS software package.

Reverse transcription-quantitative polymerase chain reaction (RT-qPCR)

Tissues were homogenized in TRIzol (Life Technologies) using a Dispomix (Xiril) and total RNA was purified. Reverse transcriptions were performed in 25 µl reactions containing 1st Strand Buffer (Life Technologies), 2 µg random hexamers (Bioline), 0.9 mM of each dNTP (Sigma-Aldrich), 20 units of RNaseOUT (Life Technologies), 1 µg of total RQ1 DNase (Promega)-treated RNA and 200 units of Moloney murine leukemia virus reverse transcriptase (Life Technologies). Reactions were left for 10 min at room temperature, followed by incubation at 37 °C for 1 h. After cDNA synthesis, reactions were diluted with 50 µl of water and frozen at -80 °C. The cDNA was analyzed by RT-qPCR using the Stratagene Mx3000P QPCR System. Each PCR mixture contained, in a final volume of 20 µl, 1.5 µl of 1st strand cDNA, 10 µl of SensiFAST SYBR Lo-ROX Kit (Bioline) and 2 pmol of each primer (Additional file 11). All reactions were run using the following cycling conditions: 95 °C for 10 min, then 40 cycles of 95 °C for 15 s, 55 °C for 30 s and 72 °C for 15 s. PCR was carried out in 96-well plates and each sample was run in duplicate. Target gene mRNA expression was normalized to expression of β-actin (ACTB) mRNA.

Protein extracts and immunoblotting

Tissues were homogenized in a GG-buffer (pH 7.5) containing 25 mM glycylglycine, 150 mM KCl, 5 mM MgSO4 and 5 mM ethylenediaminetetraacetic acid (EDTA) as well as freshly added dithiothreitol (1 mM), bovine serum albumin (0.02 %) and Triton X-100 (0.1 %). Homogenization was performed with a TissueLyser (QIAGEN) using 5 mm stainless steel beads, and homogenates were subsequently frozen in liquid nitrogen. Protein concentrations were determined by the Lowry method [34] and equal amounts of protein from each animal were pooled according to age and diluted in a buffer containing 2.5 % SDS and 10 % glycerol. Proteins were separated on 4-12 % Bis-Tris gradient gels (NuPAGE, Life Technologies), blotted onto Immobilon PVDF membranes (Millipore) and stained with Amido Black 10B (Sigma-Aldrich). Membranes were blocked in Tris-buffered saline (pH 7.4) or phosphate-buffered saline (pH 9.0) with 5 % nonfat dry milk and 0.1 % Tween 20 (Sigma-Aldrich) and then probed with antibodies. Primary antibodies used were against transcription factor IIB (TFIIB) (sc-225) (Santa Cruz Biotechnology), ATP synthase β (ATP5B) (ab14730) (Abcam) and UCP1 (ab10983) (Abcam). Secondary antibodies were horseradish peroxidase-conjugated (Dako). Enhanced chemiluminescence (Biological Industries) was used for detection.

Quantification of relative mitochondrial DNA (mtDNA) copy numbers

Relative mtDNA amount (copy number) was measured as the ratio between mtDNA and nuclear DNA (nDNA). Tissues were homogenized using a TissueLyser (QIAGEN) in lysis buffer containing 100 mM Tris-base (pH 8.0), 5 mM EDTA (pH 8.0), 0.2 % sodium dodecyl sulphate, 200 mM NaCl and 100 mg/ml proteinase K and incubated overnight at 55 °C with rotation. DNA was precipitated with two volumes of 99 % ethanol and isolated with inoculation loops, washed in 70 % ethanol and dissolved in Tris-EDTA buffer containing 10 mg/ml RNase A at 55 °C overnight. DNA concentrations were determined on the Eppendorf BioPhotometer at 260 nm and 50 ng DNA was used for qPCR.
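As an illustration of how a relative mtDNA/nDNA ratio can be derived from qPCR threshold cycles, the following minimal Python sketch may be helpful. The manuscript does not state the formula used; the sketch assumes the common comparative-Ct approach with equal amplification efficiency for the mtDNA and nDNA amplicons, and the function name and default efficiency are hypothetical.

```python
def relative_mtdna_copy_number(ct_mtdna: float, ct_ndna: float,
                               efficiency: float = 2.0) -> float:
    """Return the mtDNA/nDNA ratio from two threshold cycles (Ct).

    Assumption (not stated in the manuscript): both amplicons amplify with
    the same per-cycle efficiency, so the ratio is efficiency ** (Ct_nDNA - Ct_mtDNA).
    An efficiency of 2.0 corresponds to perfect doubling in every cycle.
    """
    if efficiency <= 1.0:
        raise ValueError("efficiency must be greater than 1")
    return efficiency ** (ct_ndna - ct_mtdna)


def mean_ct(duplicates: list[float]) -> float:
    """Average the Ct values of technical duplicates for one sample."""
    if not duplicates:
        raise ValueError("at least one Ct value is required")
    return sum(duplicates) / len(duplicates)


if __name__ == "__main__":
    # Hypothetical Ct values for one sample, each run in duplicate.
    ct_mt = mean_ct([15.1, 14.9])   # mitochondrial amplicon
    ct_n = mean_ct([25.0, 25.0])    # nuclear amplicon
    print(relative_mtdna_copy_number(ct_mt, ct_n))
```

A lower Ct for the mitochondrial amplicon than for the nuclear amplicon gives a ratio above 1, which is the expected direction in a mitochondria-rich tissue.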
PCR reactions and cycling conditions were as described above, and primers were against cytochrome c oxidase I (MT-CO1) (mtDNA) and suppression of tumorigenicity 7 (ST7) (nDNA) (Additional file 11).

Citrate synthase (CS) activity

Tissue homogenates (10 %) were generated in GG-buffer (pH 7.5) as described above. Homogenates were thawed on ice and centrifuged at 4 °C at 20,000 g for 2 min. Supernatants were used for activity measurements. CS activity was measured spectrophotometrically at 25 °C and 412 nm, as described [47], in CS buffer containing 100 mM Tris-base (pH 8.0), 10 mM 5,5′-dithiobis(2-nitrobenzoic acid), 5 mM acetyl-CoA and 50 mM oxaloacetic acid. Each sample was measured in duplicate and the mean was used for subsequent calculations. Activities were normalized to the amount of total protein determined by the Lowry method [34].

Statistical analyses of qPCR data

The time course study was analyzed for statistical significance using one-way ANOVA, with Student's t-test with Bonferroni correction for multiple testing as post hoc test. A p-value < 0.05 was considered statistically significant.

Targeted RNA-sequencing and data analysis

Isolation of mRNA and synthesis of first strand cDNA: Equal amounts of total RNA from perirenal adipose tissue from lambs at the same age (days -2, 0, 0.5, 1, 2, 4, 14) were pooled. mRNA was isolated from 4 µg of total RNA by magnetic oligo(dT) beads and used to synthesize bead-bound cDNA, according to the instructions of the manufacturer (Illumina).

Tag library construction: The library for digital gene expression analysis was constructed according to the instructions of the manufacturer (Illumina). Bead-bound cDNA was digested with NlaIII, followed by ligation of the GEX adapter 1 to the bead-bound NlaIII-digested cDNA. This was then digested with MmeI, releasing the GEX adapter 1 linked to 17 bp cDNA from the beads. The released fragment was ligated to GEX adapter 2.
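The tag-generation steps above can be mimicked in silico to see which 17 bp tag a given transcript would yield. The sketch below is illustrative only: it assumes that the cDNA fragment retained on the oligo(dT) beads starts at the 3'-most NlaIII site (recognition sequence CATG) and that the tag is the 17 bases immediately downstream of that site. The function name is hypothetical and not part of any protocol or library.

```python
from typing import Optional

NLAIII_SITE = "CATG"   # NlaIII recognition sequence
TAG_LENGTH = 17        # length of the cDNA tag released by MmeI


def expected_tag(transcript: str) -> Optional[str]:
    """Return the 17 bp tag downstream of the 3'-most CATG, or None.

    Assumption: after NlaIII digestion only the fragment attached to the
    bead via its poly(A) tail is retained, so the last CATG defines the tag.
    """
    seq = transcript.upper()
    site = seq.rfind(NLAIII_SITE)          # position of the 3'-most site
    if site == -1:
        return None                        # transcript has no NlaIII site
    start = site + len(NLAIII_SITE)
    tag = seq[start:start + TAG_LENGTH]
    # A site too close to the 3' end cannot give a full-length tag.
    return tag if len(tag) == TAG_LENGTH else None


if __name__ == "__main__":
    # Hypothetical transcript with two NlaIII sites.
    example = "AAACATGCCCC" + "CATG" + "ACGTACGTACGTACGTA" + "GGG"
    print(expected_tag(example))
```

Transcripts without a CATG site, or with the last site fewer than 17 bases from the 3' end, return `None` in this sketch, which reflects that such transcripts cannot be represented by a full-length tag under the stated assumption.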
The 17 bp tags of cDNA were PCR amplified using two primers that anneal to the two adapters. The resultant tag library was used for Illumina sequencing.

Data analysis for RNA-seq data: Quality control, trimming and adapter removal were done using FastQC [2] and fastx_clipper from the FASTX-Toolkit [1]. The four bases CATG were added to the 5′ end of the reads to increase the specificity of mapping. BWA [31] was employed for the alignment and mapping of the reads to the sheep genome v3.1. Mapped reads were sorted and indexed with SAMtools [32]. HTSeq [6] was used for counting mapped reads per annotated gene. DESeq [7] and R [51] were used for the post-processing and statistical analysis of the count data from the mapped reads.

Principal component analysis (PCA) and hierarchical clustering of time points: A two-dimensional PCA plot was employed to visualize the overall effect of experimental covariates. Hierarchical clustering of the total gene expression was performed using a distance matrix to assess the relationship between the samples and identify clusters amongst the time points.

Grouping of time points: Based on the PCA and hierarchical clustering of the total gene expression, days -2 and 0 were used as replicates of the "brown adipose state", days 0.5, 1, 2 and 4 as replicates of the "transition state" and day 14 as the "white adipose state". Using the DESeq package of Bioconductor [22], differentially expressed genes were found between the brown adipose and the transition state as well as between the transition and the white adipose state. Heatmaps representing clustering of the differentially expressed genes were created using the ggplot2 [55] package in R. The sheep proteins were queried against the human non-redundant protein database using BLAST [5] to find human homologous genes for further functional analysis.
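The tag pre-processing step mentioned above (restoring the NlaIII recognition site CATG at the 5′ end of each read before mapping) can be sketched as follows. This is a minimal illustration and not the pipeline actually used. It assumes reads are held as (header, sequence, quality) FASTQ records, and it pads the quality string with a placeholder high-quality character (`I`), which is an assumption because the text does not say how base qualities were handled. The function name is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def prepend_catg(records: Iterable[FastqRecord], site: str = "CATG",
                 pad_quality: str = "I") -> Iterator[FastqRecord]:
    """Yield FASTQ records with `site` added to the 5' end of each read.

    The quality string is padded with `pad_quality` so that sequence and
    quality stay the same length (placeholder choice, assumed here).
    """
    for header, seq, qual in records:
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header, site + seq, pad_quality * len(site) + qual


# Hypothetical 17-nt tag; the restored read is 21 nt long.
reads = [("@tag1", "ACGTACGTACGTACGTA", "F" * 17)]
for header, seq, qual in prepend_catg(reads):
    print(header, seq, len(seq))
```

Keeping the function as a generator means a large FASTQ file can be streamed record by record rather than loaded into memory.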
Reciprocal BLAST, a computational method used to cross-check the BLAST results, was employed to filter the mapping between the sheep and human proteins. The BioMart tool in Ensembl version 72 [17] was used for gene identifier conversion, including obtaining Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) approved gene names for human proteins. UniProt [3] was used to annotate the proteins for function, transcriptional activity and subcellular localization. Enrichr [13] was used to find transcription factors and enrichment of targets for the differentially expressed transcription factors from TRANSFAC [35] and JASPAR [41] in the two sets of differentially expressed genes.

ABBREVIATIONS
ACACA, acetyl-CoA carboxylase 1; AGPAT9, 1-acylglycerol-3-phosphate O-acyltransferase 9; AIRE, autoimmune regulator; ALDH8A1, aldehyde dehydrogenase 8 family member A1; ALDOA, glycolytic enzyme aldolase A; ANGPTL2, angiopoietin-related protein 2; ATP5B, ATP synthase β; BAT, brown adipose tissue; BCL6, B-cell lymphoma 6; BHLHE40, basic helix-loop-helix family member E40; BMP7, bone morphogenetic protein 7; C/EBPB, CCAAT/enhancer-binding protein β; CAT, catalase; CCN, cyclin; CD36, cluster of differentiation 36; CIDEA, cell death-inducing DFFA-like effector a; CLOCK, circadian locomotor output cycles kaput; CPT1B, carnitine palmitoyltransferase 1B; CS, citrate synthase; CYC1, cytochrome c1; DGAT, diacylglycerol O-acyltransferase; DIO2, type II iodothyronine deiodinase; EBF2, early B-cell factor 2; EDTA, ethylenediaminetetraacetic acid; EPAS1, endothelial PAS domain-containing protein 1; ERRA, estrogen-related receptor α; ESR1, estrogen receptor 1; FABP4, fatty acid-binding protein 4; FASN, fatty acid synthase; GPAM, mitochondrial glycerol-3-phosphate acyltransferase; HADH, hydroxyacyl-CoA dehydrogenase complex; HADHA, HADH catalytic subunit α; HGNC, HUGO Gene Nomenclature Committee; HE, hematoxylin-eosin; HOXC8, homeobox C8;
HUGO, Human Genome Organization; IDH3A, isocitrate dehydrogenase 3α; KLF4, krüppel-like factor 4; LEP, leptin; LHX8, LIM homeobox 8; LRP1, low density lipoprotein receptor-related protein 1; LRPPRC, leucine-rich PPR motif-containing protein; MLF1, myeloid leukemia factor 1; MPZL2, myelin protein zero-like 2; MSR1, macrophage scavenger receptor 1; MT-CO1, mitochondrially encoded cytochrome c oxidase I; mtDNA, mitochondrial DNA; MYC, v-myc avian myelocytomatosis viral oncogene homolog; NCOA1, nuclear receptor co-activator 1; nDNA, nuclear DNA; NRF1, nuclear respiratory factor 1; NRIP1, nuclear receptor-interacting protein 1; PCA, principal component analysis; PIM1, proviral insertion site in Moloney murine leukemia virus lymphomagenesis; PPARG, peroxisome proliferator-activated receptor γ; PPARGC1A, PPARG co-activator 1α; PRDM16, PR domain containing 16; RB1, retinoblastoma 1; RELA, v-rel reticuloendotheliosis viral oncogene homolog A; RGCC, regulator of cell cycle; SERPINF1, serpin peptidase inhibitor F1; SLC29A1, solute carrier family 29 member 1; SMARCC2, SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily member C2; ST7, suppression of tumorigenicity 7; TAG, triacylglycerol; TALDO1, transaldolase; TBX1, T-box protein 1; TCA, tricarboxylic acid; TEM, transmission electron microscopy; TFIIB, transcription factor IIB; TMEM26, transmembrane protein 26; TNFRSF9, tumor necrosis factor receptor superfamily member 9; UCP1, uncoupling protein 1; VEGFB, vascular endothelial growth factor B; WAT, white adipose tissue; YBX1, Y box binding protein 1; ZIC1, zinc finger protein 1.

References
[1] “FASTX-Toolkit”, http://hannonlab.cshl.edu/fastx_toolkit/.
[2] “FastQC”, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
[3] “Update on activities at the Universal Protein Resource (UniProt) in 2013”, Nucleic Acids Res, Vol. 41, No. Database issue, pp. D43–7, 2013.
[4] K. L. Allen, A. Hamik, M. K. Jain, and K. R.
McCrae, “Endothelial cell activation by antiphospholipid antibodies is modulated by Kruppel-like transcription factors”, Blood, Vol. 117, No. 23, pp. 6383–91, 2011.
[5] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool”, J Mol Biol, Vol. 215, No. 3, pp. 403–10, 1990.
[6] S. Anders, “HTSeq: Analysing high-throughput sequencing data with Python”.
[7] S. Anders and W. Huber, “Differential expression analysis for sequence count data”, Genome Biol, Vol. 11, No. 10, p. R106, 2010.
[8] N. Bakkar, J. Wang, K. J. Ladner, H. Wang, J. M. Dahlman, M. Carathers, S. Acharyya, M. A. Rudnicki, A. D. Hollenbach, and D. C. Guttridge, “IKK/NF-kappaB regulates skeletal myogenesis via a signaling switch to inhibit differentiation and promote mitochondrial biogenesis”, J Cell Biol, Vol. 180, No. 4, pp. 787–802, 2008.
[9] K. Birsoy, Z. Chen, and J. Friedman, “Transcriptional regulation of adipogenesis by KLF4”, Cell Metab, Vol. 7, No. 4, pp. 339–47, 2008.
[10] K. J. Campbell, S. Rocha, and N. D. Perkins, “Active repression of antiapoptotic gene expression by RelA(p65) NF-kappa B”, Molecular Cell, Vol. 13, No. 6, pp. 853–65, 2004.
[11] B. Cannon and J. Nedergaard, “Brown adipose tissue: function and physiological significance”, Physiol Rev, Vol. 84, No. 1, pp. 277–359, 2004.
[12] L. Casteilla, O. Champigny, F. Bouillaud, J. Robelin, and D. Ricquier, “Sequential changes in the expression of mitochondrial protein mRNA during the development of brown adipose tissue in bovine and ovine species. Sudden occurrence of uncoupling protein mRNA during embryogenesis and its disappearance after birth”, Biochemical Journal, Vol. 257, No. 3, pp. 665–71, 1989.
[13] E. Y. Chen, C. M. Tan, Y. Kou, Q. Duan, Z. Wang, G. V. Meirelles, N. R. Clark, and A. Ma’ayan, “Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool”, BMC Bioinformatics, Vol. 14, p. 128, 2013.
[14] Y. Chen, G. Lin, J. S. Huo, D. Barney, Z. Wang, T. Livshiz, D. J. States, Z. S. Qin, and J.
Schwartz, “Computational and functional analysis of growth hormone (GH)-regulated genes identifies the transcriptional repressor B-cell lymphoma 6 (Bcl6) as a participant in GH-regulated transcription”, Endocrinology, Vol. 150, No. 8, pp. 3645–54, 2009.
[15] M. P. Cooper, M. Uldry, S. Kajimura, Z. Arany, and B. M. Spiegelman, “Modulation of PGC-1 coactivator pathways in brown fat differentiation through LRP130”, Journal of Biological Chemistry, Vol. 283, No. 46, pp. 31960–7, 2008.
[16] M. W. Feinberg, Z. Cao, A. K. Wara, M. A. Lebedeva, S. Senbanerjee, and M. K. Jain, “Kruppel-like factor 4 is a mediator of proinflammatory signaling in macrophages”, Journal of Biological Chemistry, Vol. 280, No. 46, pp. 38247–58, 2005.
[17] P. Flicek, I. Ahmed, M. R. Amode, D. Barrell, K. Beal, S. Brent, D. Carvalho-Silva, P. Clapham, G. Coates, S. Fairley, S. Fitzgerald, L. Gil, C. Garcia-Giron, L. Gordon, T. Hourlier, S. Hunt, T. Juettemann, A. K. Kahari, S. Keenan, M. Komorowska, E. Kulesha, I. Longden, T. Maurel, W. M. McLaren, M. Muffato, R. Nag, B. Overduin, M. Pignatelli, B. Pritchard, E. Pritchard, H. S. Riat, G. R. Ritchie, M. Ruffier, M. Schuster, D. Sheppard, D. Sobral, K. Taylor, A. Thormann, S. Trevanion, S. White, S. P. Wilder, B. L. Aken, E. Birney, F. Cunningham, I. Dunham, J. Harrow, J. Herrero, T. J. Hubbard, N. Johnson, R. Kinsella, A. Parker, G. Spudich, A. Yates, A. Zadissa, and S. M. Searle, “Ensembl 2013”, Nucleic Acids Res, Vol. 41, No. Database issue, pp. D48–55, 2013.
[18] S. O. Freytag and T. J. Geddes, “Reciprocal regulation of adipogenesis by Myc and C/EBP alpha”, Science, Vol. 256, No. 5055, pp. 379–82, 1992.
[19] R. Galien and T. Garcia, “Estrogen receptor impairs interleukin-6 expression by preventing protein binding on the NF-kappaB site”, Nucleic Acids Res, Vol. 25, No. 12, pp. 2424–9, 1997.
[20] R. T. Gemmell and G.
Alexander, “Ultrastructural development of adipose tissue in foetal sheep”, Aust J Biol Sci, Vol. 31, No. 5, pp. 505–15, 1978.
[21] R. T. Gemmell, A. W. Bell, and G. Alexander, “Morphology of adipose cells in lambs at birth and during subsequent transition of brown to white adipose tissue in cold and in warm conditions”, Am J Anat, Vol. 133, No. 2, pp. 143–64, 1972.
[22] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. Yang, and J. Zhang, “Bioconductor: open software development for computational biology and bioinformatics”, Genome Biol, Vol. 5, No. 10, p. R80, 2004.
[23] M. Harms and P. Seale, “Brown and beige fat: development, function and therapeutic potential”, Nat Med, Vol. 19, No. 10, pp. 1252–63, 2013.
[24] V. J. Heath, D. A. Gillespie, and D. H. Crouch, “Inhibition of the terminal stages of adipocyte differentiation by c-Myc”, Exp Cell Res, Vol. 254, No. 1, pp. 91–8, 2000.
[25] P. A. Heine, J. A. Taylor, G. A. Iwamoto, D. B. Lubahn, and P. S. Cooke, “Increased adipose tissue in male and female estrogen receptor-alpha knockout mice”, Proc Natl Acad Sci U S A, Vol. 97, No. 23, pp. 12729–34, 2000.
[26] N. Z. Jespersen, T. J. Larsen, L. Peijs, S. Daugaard, P. Homoe, A. Loft, J. de Jong, N. Mathur, B. Cannon, J. Nedergaard, B. K. Pedersen, K. Moller, and C. Scheele, “A classical brown adipose tissue mRNA signature partly overlaps with brite in the supraclavicular region of adult humans”, Cell Metabolism, Vol. 17, No. 5, pp. 798–805, 2013.
[27] C. Jiang, M. Ito, V. Piening, K. Bruck, R. G. Roeder, and H. Xiao, “TIP30 interacts with an estrogen receptor alpha-interacting coactivator CIA and regulates c-myc transcription”, Journal of Biological Chemistry, Vol. 279, No. 26, pp. 27781–9, 2004.
[28] S. Kajimura and M.
Saito, “A New Era in Brown Adipose Tissue Biology: Molecular Control of Brown Fat Development and Energy Homeostasis”, Annu Rev Physiol, 2013.
[29] J. Laurencikiene and M. Ryden, “Liver X receptors and fat cell metabolism”, Int J Obes (Lond), Vol. 36, No. 12, pp. 1494–502, 2012.
[30] Y. R. Lea-Currie, D. Monroe, and M. K. McIntosh, “Dehydroepiandrosterone and related steroids alter 3T3-L1 preadipocyte proliferation and differentiation”, Comp Biochem Physiol C Pharmacol Toxicol Endocrinol, Vol. 123, No. 1, pp. 17–25, 1999.
[31] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform”, Bioinformatics, Vol. 25, No. 14, pp. 1754–60, 2009.
[32] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/Map format and SAMtools”, Bioinformatics, Vol. 25, No. 16, pp. 2078–9, 2009.
[33] M. A. Lomax, F. Sadiq, G. Karamanlidis, A. Karamitri, P. Trayhurn, and D. G. Hazlerigg, “Ontogenic loss of brown adipose tissue sensitivity to beta-adrenergic stimulation in the ovine”, Endocrinology, Vol. 148, No. 1, pp. 461–8, 2007.
[34] O. H. Lowry, N. J. Rosebrough, A. L. Farr, and R. J. Randall, “Protein measurement with the Folin phenol reagent”, Journal of Biological Chemistry, Vol. 193, No. 1, pp. 265–75, 1951.
[35] V. Matys, O. V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A. E. Kel, and E. Wingender, “TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes”, Nucleic Acids Res, Vol. 34, No. Database issue, pp. D108–10, 2006.
[36] A. Mostyn, S. Pearce, T. Stephenson, and M. E. Symonds, “Hormonal and nutritional regulation of adipose tissue mitochondrial development and function in the newborn”, Exp Clin Endocrinol Diabetes, Vol. 112, No. 1, pp. 2–9, 2004.
[37] R. Nahar, P. Ramezani-Rad, M. Mossner, C. Duy, L. Cerchietti, H. Geng, S.
Dovat, H. Jumaa, B. H. Ye, A. Melnick, and M. Muschen, “Pre-B cell receptor-mediated activation of BCL6 induces pre-B cell quiescence through transcriptional repression of MYC”, Blood, Vol. 118, No. 15, pp. 4174–8, 2011.
[38] J. H. Park, H. J. Kang, S. I. Kang, J. E. Lee, J. Hur, K. Ge, E. Mueller, H. Li, B. C. Lee, and S. B. Lee, “A multifunctional protein, EWS, is essential for early brown fat lineage determination”, Dev Cell, Vol. 26, No. 4, pp. 393–404, 2013.
[39] F. Picard, M. Gehin, J. Annicotte, S. Rocchi, M. F. Champy, B. W. O’Malley, P. Chambon, and J. Auwerx, “SRC-1 and TIF2 control energy balance between white and brown adipose tissues”, Cell, Vol. 111, No. 7, pp. 931–41, 2002.
[40] M. Pope, H. Budge, and M. E. Symonds, “The developmental transition of ovine adipose tissue through early life”, Acta Physiol (Oxf), Vol. 210, No. 1, pp. 20–30, 2014.
[41] E. Portales-Casamar, S. Thongjuea, A. T. Kwon, D. Arenillas, X. Zhao, E. Valen, D. Yusuf, B. Lenhard, W. W. Wasserman, and A. Sandelin, “JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles”, Nucleic Acids Res, Vol. 38, No. Database issue, pp. D105–10, 2010.
[42] S. Rajakumari, J. Wu, J. Ishibashi, H. W. Lim, A. H. Giang, K. J. Won, R. R. Reed, and P. Seale, “EBF2 determines and maintains brown adipocyte identity”, Cell Metabolism, Vol. 17, No. 4, pp. 562–74, 2013.
[43] J. A. Romashkova and S. S. Makarov, “NF-kappaB is a target of AKT in anti-apoptotic PDGF signalling”, Nature, Vol. 401, No. 6748, pp. 86–90, 1999.
[44] M. Rosenwald, A. Perdikari, T. Rulicke, and C. Wolfrum, “Bi-directional interconversion of brite and white adipocytes”, Nat Cell Biol, Vol. 15, No. 6, pp. 659–67, 2013.
[45] M. Saito, J. Gao, K. Basso, Y. Kitagawa, P. M. Smith, G. Bhagat, A. Pernis, L. Pasqualucci, and R.
Dalla-Favera, “A signaling pathway mediating downregulation of BCL6 in germinal center B cells is blocked by BCL6 gene alterations in B cell lymphoma”, Cancer Cell, Vol. 12, No. 3, pp. 280–92, 2007.
[46] L. Z. Sharp, K. Shinoda, H. Ohno, D. W. Scheel, E. Tomoda, L. Ruiz, H. Hu, L. Wang, Z. Pavlova, V. Gilsanz, and S. Kajimura, “Human BAT possesses molecular signatures that resemble beige/brite cells”, PLoS One, Vol. 7, No. 11, p. e49452, 2012.
[47] D. Shepherd and P. B. Garland, “The kinetic properties of citrate synthase from rat liver mitochondria”, Biochemical Journal, Vol. 114, No. 3, pp. 597–610, 1969.
[48] A. Smorlesi, A. Frontini, A. Giordano, and S. Cinti, “The adipose organ: white-brown adipocyte plasticity and metabolic inflammation”, Obes Rev, Vol. 13 Suppl 2, pp. 83–96, 2012.
[49] K. R. Steffensen, M. Nilsson, G. U. Schuster, T. M. Stulnig, K. Dahlman-Wright, and J. A. Gustafsson, “Gene expression profiling in adipose tissue indicates different transcriptional mechanisms of liver X receptors alpha and beta, respectively”, Biochem Biophys Res Commun, Vol. 310, No. 2, pp. 589–93, 2003.
[50] T. Tang, J. Zhang, J. Yin, J. Staszkiewicz, B. Gawronska-Kozak, D. Y. Jung, H. J. Ko, H. Ong, J. K. Kim, R. Mynatt, R. J. Martin, M. Keenan, Z. Gao, and J. Ye, “Uncoupling of inflammation and insulin resistance by NF-kappaB in transgenic mice through elevated energy expenditure”, Journal of Biological Chemistry, Vol. 285, No. 7, pp. 4637–44, 2010.
[51] R Core Team, “R: A Language and Environment for Statistical Computing”, 2013.
[52] S. Uno, K. Endo, Y. Jeong, K. Kawana, H. Miyachi, Y. Hashimoto, and M. Makishima, “Suppression of beta-catenin signaling by liver X receptor ligands”, Biochem Pharmacol, Vol. 77, No. 2, pp. 186–95, 2009.
[53] T. B. Walden, I. R. Hansen, J. A. Timmons, B. Cannon, and J. Nedergaard, “Recruited vs. nonrecruited molecular signatures of brown, "brite," and white adipose tissues”, Am J Physiol Endocrinol Metab, Vol. 302, No. 1, pp. E19–31, 2012.
[54] H. Wang, Y. Zhang, E. Yehuda-Shnaidman, A. V. Medvedev, N. Kumar, K. W. Daniel, J. Robidoux, M. P. Czech, D. J. Mangelsdorf, and S. Collins, “Liver X receptor alpha is a transcriptional repressor of the uncoupling protein 1 gene and the brown fat phenotype”, Molecular and Cellular Biology, Vol. 28, No. 7, pp. 2187–200, 2008.
[55] H. Wickham, “ggplot2: elegant graphics for data analysis”, 2009.
[56] J. Wu, P. Bostrom, L. M. Sparks, L. Ye, J. H. Choi, A. H. Giang, M. Khandekar, K. A. Virtanen, P. Nuutila, G. Schaart, K. Huang, H. Tu, W. D. van Marken Lichtenbelt, J. Hoeks, S. Enerback, P. Schrauwen, and B. M. Spiegelman, “Beige adipocytes are a distinct type of thermogenic fat cell in mouse and human”, Cell, Vol. 150, No. 2, pp. 366–76, 2012.
[57] Y. Yamamoto, S. Gesta, K. Y. Lee, T. T. Tran, P. Saadatirad, and C. R. Kahn, “Adipose depots possess unique developmental gene signatures”, Obesity (Silver Spring), Vol. 18, No. 5, pp. 872–8, 2010.

Chapter 6
Paper IV - Epigenetic changes in obesity

Prelude
DNA methylation is a marker for metabolic memory, and it is therefore important to examine its status in obesity, a known metabolic disorder. The following chapter presents the efforts to capture epigenetic changes associated with altered gene expression in obese mice as compared to lean mice. The hypothesis under test in this chapter is: “does the DNA methylation of adipose tissue differ between lean and obese mice and between adipose depots?” On one side, ob/ob obese mice were compared against wild type lean mice, while on the other, diet-induced obese mice were compared against mice fed a regular diet. MeDIP-seq was performed on mature adipocytes from lean and obese mice. This led to the discovery of differentially methylated regions (DMRs) between the lean and obese mice in the respective tissues and models. Genetic obesity is different from diet-induced obesity, as the latter is more lifestyle dependent and guided by external stimuli.
The DMRs obtained from diet-induced obesity and genetic obesity were compared. The different depots of adipose tissue in the body differ in their metabolic activity, gene expression and response to obesity. Thus, the comparison of the effects of obesity on DNA methylation between different adipose depots is also of interest. The DNA methylation changes were combined with gene expression changes to find the mechanisms of obesity in inguinal and epididymal tissue.

Manuscript

Adipose-depot specific gene regulation by DNA methylation in obesity

Rachita Yadav1, Si Brask Sonne2,3, Yin Guangliang4, Lise Madsen7, Ramneek Gupta1, Jun Wang4,5,6, Karsten Kristiansen3 and Shingo Kajimura∗2

1 Center for Biological Sequence Analysis, The Technical University of Denmark, Copenhagen, Denmark
2 UCSF Diabetes Center and Department of Cell and Tissue Biology, University of California San Francisco, San Francisco, California, USA
3 Department of Biology, University of Copenhagen, DK-2100 Copenhagen, Denmark
4 BGI-Shenzhen, Shenzhen, China
5 Princess Al Jawhara Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Jeddah, Saudi Arabia
6 Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau, China
7 National Institute of Nutrition and Seafood Research, Bergen, Norway

ABSTRACT
This study examines global DNA methylation, a marker of metabolic memory, in adipocytes in obesity and its effects on gene expression in adipose tissue. Additionally, the study aimed to find differences between genetic and diet-induced obesity and between visceral and subcutaneous depots of adipose tissue in the obese state. Genome-wide DNA hypomethylation was the more common change in both adipose depots under study as well as in both types of obesity. We observed large differences between the two models and between epididymal and inguinal tissue. Common to all was the hypomethylation that followed obesity.
This report presents the first study of global methylation in mature adipocytes from genetically obese and diet-induced obese male mice that combines global methylation and gene expression data.

KEY WORDS – Obesity, mature adipocytes, DNA methylation, microarray, epididymal adipose tissue, inguinal adipose tissue

∗ Corresponding author: Shingo Kajimura, e-mail: SKajimura@diabetes.ucsf.edu

Introduction
Obesity is a growing problem worldwide [7] due to easier access to food and a more sedentary lifestyle. When energy intake exceeds energy expenditure, excess energy is stored in adipose tissues. The amount of adipose tissue in the body defines the physiological state of the organism. An increase in adipose tissue is associated with disorders like insulin resistance, hyperglycemia, dyslipidemia, hypertension and inflammation [27]. Adipose tissue is a complex tissue with high metabolic and endocrine activity. It is made up of multiple cell types, including adipocytes, vascular endothelial cells, fibroblasts and macrophages. Adipose tissues are classified as either subcutaneous or visceral, depending on their localization in the body. Generally, visceral fat, which surrounds the organs in the body cavity, is more strongly associated with metabolic disorders than the subcutaneous fat depot, which is found right below the skin and can be especially abundant on hips and thighs [41]. A high waist-to-hip ratio rather than BMI is predictive of insulin resistance and cardiovascular complications [32]. Increasing amounts of visceral fat are associated with increased inflammation, release of inflammatory cytokines such as TNF-alpha and IL-6, and increasing blood levels of FFA [34]. The preadipocytes from the two depots have specific gene expression signatures that continue in mature adipocytes.
Adipose tissues in different parts of the body are not the same; they differ in metabolic properties [70], gene expression [69], protein secretion [23] and depot-specific angiogenesis [3]. One of the main features of adipose tissue is its ability to change size. Adipose tissue grows by increasing the size of adipocytes (hypertrophy) as well as by increasing the number of cells (hyperplasia). In obesity, hypertrophy precedes hyperplasia in order to meet the requirement for excess energy storage [48]. Overnutrition causes an energy imbalance in the body, which leads adipose tissue first to increase in cell size and then to recruit more adipocytes, initiating obesity. Along with nutrition, the genetic makeup of the organism is also found to be responsible for obesity. Studies in model organisms found that the mechanisms of excess fat mass accumulation differ substantially between genetic obesity and diet-induced obesity [74]. Multiple genes and loci have been found to be associated with weight gain and obesity-related phenotypes. Along with genes, epigenetic factors have been found to be associated with metabolic disorders including diabetes and obesity [49]. Multiple epigenetic factors make individuals more predisposed towards certain phenotypes. Among the epigenetic features, DNA methylation has been associated with adipogenesis [45], appetite regulation [66] and body weight homeostasis [64]. Diabetic patients who undergo therapy to normalize their blood glucose levels still have cardiovascular problems [11]. This phenomenon is called metabolic memory and has been suggested to be associated with epigenetic changes [50]. We speculate that such a metabolic memory may also be present in fat cells and may contribute to the difficulty in sustaining weight loss. It is well known that exposure to certain external factors causes cell type-dependent epigenetic changes.
Different chemical and environmental toxins induce changes to DNA methylation patterns, leading to epimutations that are associated with phenotypes [24]. DNA methylation depends on the activity of enzymes called DNA methyltransferases (Dnmt1, Dnmt2 and Dnmt3), which transfer the available methyl groups to DNA. Availability or scarcity of methyl groups also plays a vital role in DNA methylation. In humans, the major dietary sources of methyl groups are methionine, one-carbon metabolism via methylfolate, and choline [2]. An experiment in mice showed that an increase in folic acid intake leads to increased DNA methylation of the agouti locus, causing a change in phenotype [73]. With this study, we aimed to find the methylation differences between the visceral and subcutaneous fat depots, along with the implications of these differences in obesity. To infer the effect of genetics and diet on methylation differences in obesity, we tested the variance in methylation in a genetic as well as a diet model. As a genetic model of obesity, we chose the leptin-deficient ob/ob mice, which gain weight due to excessive food intake. To find the effect of diet on the methylation of obese tissue, we compared mice fed a high fat diet with mice fed a regular diet for 15 weeks. Fifteen weeks of high fat diet leads to a dramatically increased body weight, accompanied by a reduction of central leptin sensitivity [46] and the onset of insulin resistance/glucose intolerance [14].

Methods and material

Experiment design (Animal model and tissue collection)
In the genetic obese model, nine-week-old male wild type (wt) and ob/ob mice were obtained from the Jackson Laboratory. DNA was isolated from mature adipocytes in epididymal and inguinal adipose tissue (n=4, obese and lean). These mice were fed a chow diet corresponding to the regular diet used as a control in the diet-induced model.
For the diet-induced obese model, 4-week-old male C57BL/6J mice were obtained from the Jackson Laboratory. Mice were fed a regular diet (RD: 10% kcal fat, D12450B, Research Diets Inc.) or a high fat diet (HFD: 60% kcal fat, D12492, Research Diets Inc.) for fifteen weeks. DNA was isolated from mature adipocytes in epididymal and inguinal adipose tissue (n=3, obese and lean). For both models, mature fat cells were isolated from the epididymal and inguinal adipose tissues by digestion with collagenase D and dispase II. DNA was isolated from the mature fat cells using a commercially available kit (DNeasy, Qiagen).

Library preparation
A library was prepared from 5 µg original DNA as previously described [43]. Briefly, DNA was fragmented. End repair, <A> base addition and adaptor ligation steps were performed using Illumina's Paired-End DNA Sample Prep kit following the manufacturer's instructions. Adaptor-ligated DNA was immunoprecipitated by anti-5mC, and MeDIP products were validated by qPCR using SYBR green mastermix (Applied Biosystems) and primers for positive and negative control regions supplied in the MeDIP kit (Diagenode). The qPCR cycling conditions were 95 °C for 5 min, followed by 40 cycles of 95 °C for 15 s and 60 °C for 1 min. MeDIP DNA was purified with a ZYMO DNA Clean & Concentrator-5 column following the manufacturer's instructions and amplified by adaptor-mediated PCR in a final reaction volume of 50 µl. After excising amplified DNA between 220 and 320 bp on a 2% agarose gel, amplification quality and quantity were evaluated using an Agilent 2100 Bioanalyzer and DNA 1000 chips. The paired-end sequencing was performed using an Illumina platform.

MeDIP-Seq data analysis
Raw paired-end reads from the MeDIP sequencing were checked for quality using FastQC [1]. Clean reads of 49 nucleotides were mapped to the mouse reference genome build mm9 using Bowtie2 [40] for each sample independently.
Mapped reads were filtered at a mapping quality of 30 and sorted using Picard (http://picard.sourceforge.net) and samtools [42], whereas duplicates were removed using Picard MarkDuplicates (http://picard.sourceforge.net). After alignment, reads were filtered for poor alignment quality, PCR duplicates and a missing mate in the alignment. Using the mapped reads, the correlation between replicates was checked using the Spearman correlation coefficient. Mapped data in BAM format were further analysed for differentially methylated regions (DMRs) between ob/ob vs wt and HFD vs RD in epididymal and inguinal fat. In the MEDIPS package of R, reads (49 nucleotides each) mapped to the genome were extended to 300 nucleotides to capture all CpGs in a region. The genome was divided into non-overlapping bins of 250 nucleotides during this analysis and the reads mapped per bin were counted. Relative methylation scores (rms) were calculated for each bin of the genome by counting the number of mapped reads. Normalisation was applied to this count data to convert it to reads per million (rpm). MEDIPS internally uses the edgeR package to identify differentially methylated regions within the genome. edgeR uses the negative binomial distribution (especially useful for discrete data) to find the DMRs between the two states under comparison, and thus calculates mean methylation values (rpm, rms, ams), log fold changes, variances and p-values comparing two sample sets. The DMRs are mapped to a gene if they lie within the gene body or within 10 kb upstream or downstream of the gene body.

Microarray analysis

For the microarray analysis, 20 four-week old male C57BL/6J mice were obtained from the Jackson Laboratory and fed either RD (n=10) or a HFD as described above. Total RNA was extracted from mature adipocytes using Trizol LS (Invitrogen), DNAse treated (Qiagen) and LiCl precipitated. The quality and quantity of RNA were determined using a Bioanalyzer nano kit (Agilent Technologies, Santa Clara, California, US) and Qubit RNA BR Assay (Life Technologies, Waltham, Massachusetts, US), respectively. Of these, the 5 mice with the highest RIN values were chosen for subsequent analysis. Gene expression profiles were determined using the Mouse Agilent 4X44 v2 gene expression arrays.

Data analysis of microarray

The single colour data from the 20 samples was analysed using the limma package [61]. Background correction and normalisation of the data were done based on the negative controls. After normalisation, the differentially expressed genes between RD and HFD were identified in the inguinal and epididymal data independently using the Bayesian method. Pathway enrichment for Reactome database pathways was carried out on the up- and down-regulated genes in the two tissues using the GSEA [62] pre-ranked module with the Molecular Signatures Database (MSigDB) [44].

Integrative methods

The regions with differential methylation (DMRs) between RD and HFD were mapped to genes with differential gene expression (DGE). If a DMR falls within the gene boundary or within 10 kb upstream or downstream of it, the DMR is taken to affect the expression of that gene. Genes with DMRs and DGE are only considered for further analysis if the log2 fold change between the two conditions is >= 1.

Functional analysis

Genes found to be differentially expressed and to have a methylation effect were analysed for their functions in relation to adipogenesis and their impact on obesity. For the differentially methylated regions, a motif search was carried out to find transcription factor binding sites using the MEME suite with the UniPROBE database. To utilise previous knowledge, transcription factors controlling the methylation-controlled differentially expressed genes were queried from the public ChIP-X data analysed and stored in the ChEA database [39] using the Enrichr [12] tool. Upstream analysis for the two sets of genes using the signalling molecules was carried out using the key node functionality of Explain [36] (www.biobase-international.com/explain) from BIOBASE Corporation. The maximum distance to search was allowed to be six and FDR < 0.05 was used as the cut-off.

Results

We wanted to compare mature adipocytes from diet-induced obesity and obesity caused by leptin deficiency (genetic obesity). For the genetic obesity, ob/ob mice were significantly heavier than the wild type mice when the mice were sacrificed at 9 weeks of age (46.1 +/- 2.9 g and 24.8 +/- 2.1 g, respectively; p-value < 0.0001). In the diet-induced model, after 15 weeks on RD vs HFD, average body weight was significantly higher in mice fed the HFD (28.7 +/- 1.5 g and 47.2 +/- 1.3 g, respectively; p-value < 0.0001).

Figure 1: Volcano plots showing mean methylation differences between the tissue from obese mice and lean mice in inguinal and epididymal fat of the two models. Panels: inguinal diet model, inguinal genetic model, epididymal diet model, epididymal genetic model; x-axis: log2 fold change; y-axis: -log10 adjusted p-value.

MeDIP-seq reveals extensive hypomethylation in obesity

Sequencing was carried out for 28 samples from two tissues in four different conditions: wt, ob/ob, RD and HFD. This resulted in 4,898,872,318 paired-end reads of 49 nucleotides. Cleaned data showed no adapter or sequencing primer contamination, and the base quality was good and consistent throughout the read length. MeDIP-sequencing read counts and uniquely mapped reads are presented in Supplementary Table ST1. On average, 170 million paired-end reads per sample were obtained from sequencing. After removing PCR duplicates, reads with a missing mate and poorly mapped reads, approx. 35% of the reads from each sample were used in the later analysis.
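The read extension, 250-nucleotide binning and reads-per-million scaling described in the methods can be illustrated with a short Python sketch. This is not the MEDIPS implementation: the function names, the strand-aware extension rule and the choice to count a read in every bin that its extended interval overlaps are assumptions made for the example.

```python
READ_LENGTH = 49   # sequenced read length reported in the text
EXTEND_TO = 300    # reads are extended to this many nucleotides
BIN_SIZE = 250     # width of the non-overlapping genome bins


def extended_interval(start, strand, chrom_length):
    """Return the half-open interval [lo, hi) covered by an extended read.

    `start` is the 0-based leftmost aligned position. Forward reads are
    extended downstream; reverse reads are extended upstream from their
    aligned end (an assumed convention for this sketch).
    """
    if strand == "+":
        lo, hi = start, start + EXTEND_TO
    else:
        hi = start + READ_LENGTH
        lo = hi - EXTEND_TO
    return max(lo, 0), min(hi, chrom_length)


def bin_counts(reads, chrom_length):
    """Count extended reads per bin; a read counts in every bin it overlaps."""
    n_bins = -(-chrom_length // BIN_SIZE)  # ceiling division
    counts = [0] * n_bins
    for start, strand in reads:
        lo, hi = extended_interval(start, strand, chrom_length)
        if lo >= hi:
            continue  # read lies outside the chromosome
        for b in range(lo // BIN_SIZE, (hi - 1) // BIN_SIZE + 1):
            counts[b] += 1
    return counts


def reads_per_million(counts, total_mapped_reads):
    """Scale raw bin counts to reads per million mapped reads."""
    scale = 1_000_000 / total_mapped_reads
    return [c * scale for c in counts]


# Toy example: two reads on a hypothetical 1000-nt chromosome.
toy_reads = [(0, "+"), (400, "-")]
toy_counts = bin_counts(toy_reads, chrom_length=1000)
print(toy_counts)                                      # [2, 2, 0, 0]
print(reads_per_million(toy_counts, len(toy_reads)))
```

Scaling by the total number of mapped reads makes bin counts comparable between samples sequenced to different depths, which matters here because only about a third of the reads per sample survive filtering.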
The MeDIP-seq data consist of short reads, which mapped to the reference genome at a low rate because the CpG islands lie in repeat regions of the genome that are hard to map using short reads. A high correlation (R2 > 0.90) was found between the replicate samples collected for each tissue under the respective experimental conditions. Each 250-base region that is differentially methylated in obese mice compared to lean mice is represented as one point in the volcano plots, with p-value < 0.001 marked in red (Figure 1). The genome was divided into promoters, exons, introns and intergenic regions to find the distribution of differentially methylated regions. This distribution varies between the inguinal and epididymal tissue in diet-induced and genetic obesity (Figure 2). Both tissues in the diet model have more DMRs in exons than in the other genomic regions. Only the inguinal tissue in the genetic model has a higher number of DMRs in promoters than in other regions of the genome. The counts of the differentially methylated regions in the two models and two tissues are summarised in Table 1. There are more hypomethylated than hypermethylated regions in the obese mice compared to the corresponding lean mice.

Figure 2: DMR partition distribution with the genomic region based on functional properties. Panels: inguinal tissue in diet model, inguinal tissue in genetic model, epididymal tissue in diet model, epididymal tissue in genetic model; categories: promoter, exon, intron, intergenic; y-axis: fold change relative to whole genome.
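The "fold change relative to whole genome" plotted in Figure 2 is not defined in the text. One plausible reading, sketched below in Python, is the share of DMRs falling in a partition divided by the share of all genome bins in that partition. The formula, the function name and the counts in the example are assumptions for illustration and are not values from this study.

```python
def partition_fold_change(dmr_counts, genome_bin_counts):
    """Fold enrichment of DMRs per genomic partition.

    Both arguments map a partition name (e.g. "promoter") to a count:
    the number of DMRs in that partition and the number of genome bins
    annotated to it. A value above 1 means the partition holds more
    DMRs than expected from its share of the genome.
    """
    total_dmrs = sum(dmr_counts.values())
    total_bins = sum(genome_bin_counts.values())
    return {
        partition: (dmr_counts.get(partition, 0) * total_bins)
        / (genome_bin_counts[partition] * total_dmrs)
        for partition in genome_bin_counts
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example_dmrs = {"promoter": 10, "exon": 40, "intron": 30, "intergenic": 20}
example_genome = {"promoter": 100, "exon": 100, "intron": 400, "intergenic": 400}

for partition, fold in partition_fold_change(example_dmrs, example_genome).items():
    print(f"{partition}: {fold:.2f}")
```

Normalising by the genome share is what allows small partitions such as exons to show up as enriched even when most DMRs in absolute terms fall in introns or intergenic regions.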
Experiment name                        Number of DMRs   Hypermethylated DMRs   Hypomethylated DMRs   Number of genes
Inguinal high fat vs regular diet      10174            1                      10173                 2603
Epididymal high fat vs regular diet    12056            749                    11262                 3013
Inguinal obese vs wild type            291              28                     263                   57
Epididymal obese vs wild type          1532             0                      1532                  318

Table 1: Differentially methylated regions of 250 bases in epididymal and inguinal tissues in two different models at p-value <= 0.001. The regions in each comparison are divided into hypomethylated and hypermethylated regions based on their state in obese mice compared to the corresponding lean mice. The fifth column shows the number of mouse genes mapped by the differentially methylated regions.

Epididymal and inguinal tissues are more similar in the diet model

The DMRs are mapped to genes if they are located within the gene body or within 10 kb upstream or downstream of the gene body. The numbers of genes mapped by DMRs in the two models in inguinal and epididymal tissue are documented in the last column of Table 1. Some genes affected by methylation are common between different tissues and models. The Venn diagram (Figure 3) shows an overlap of six genes between inguinal and epididymal tissues in the genetic model. These six genes are Sntg1, Galntl6, 2210408I21Rik, Park2, Rn45s and Mid1. In contrast, the diet model has 2004 genes that are common between inguinal and epididymal tissues. Out of these 2004 genes, 1969 are hypomethylated in the HFD mice in both inguinal and epididymal tissues. The remaining 35 genes are hypermethylated in the epididymal tissue but hypomethylated in the inguinal tissue. Thus, the diet model is more consistent between the two tissues than the genetic model.

More genes are affected by diet-induced obesity in epididymal fat tissue

The HFD feeding was repeated to obtain samples for gene expression analysis. After 15 weeks of feeding RD or HFD, there was a significant difference in the weights of the mice used for microarray samples.
The weights of RD and HFD mice at the 19th week were 31.76 +/- 1.8 g and 46.03 +/- 4.8 g, respectively. High correlation was observed between the biological replicates used for the gene expression analysis.

Figure 3: Overlap of genes with differentially methylated regions. Genes harbouring differentially methylated regions in the inguinal and epididymal tissue in the genetic and diet models. The highest overlap of 2004 genes is found between the inguinal diet model and the epididymal diet model.

Figure 4: Heatmaps for differentially expressed genes in the diet models for (A) inguinal and (B) epididymal adipocytes. Panels: gene expression changes in inguinal tissue in diet model, gene expression changes in epididymal tissue in diet model; rows: genes; columns: RD and HFD samples; each panel has a colour key and histogram.

The expression levels of the different probes on the arrays were transformed to gene-level expression using the median of the expression values of the multiple probes mapped to each gene. At a log2 fold change of >= 1 between regular diet and high fat diet, 411 genes were differentially expressed in inguinal tissue and 1135 genes were differentially expressed in epididymal tissue. Thus, more than twice as many genes change expression in epididymal tissue as in inguinal tissue in obese mice. The heatmaps show that approximately 75% of the genes in inguinal tissue have higher expression in HFD, in line with the hypomethylation of most DMRs in inguinal HFD. In epididymal tissue, there is a 2:1 ratio between up-regulated and down-regulated genes (Figure 4).
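The probe-to-gene summarisation and the fold-change cut-off described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: values are taken to be log2 intensities, the cut-off is applied to the absolute log2 fold change, all identifiers are hypothetical, and the moderated statistics that limma provides are deliberately left out.

```python
from collections import defaultdict
from statistics import mean, median


def gene_level_expression(probe_values, probe_to_gene):
    """Collapse probes to genes using the per-sample median.

    probe_values: {probe_id: [log2 intensity per sample]}
    probe_to_gene: {probe_id: gene symbol}
    """
    rows_per_gene = defaultdict(list)
    for probe, values in probe_values.items():
        rows_per_gene[probe_to_gene[probe]].append(values)
    return {
        gene: [median(sample_values) for sample_values in zip(*rows)]
        for gene, rows in rows_per_gene.items()
    }


def log2_fold_change(gene_expr, hfd_idx, rd_idx):
    """Mean log2 expression in HFD samples minus mean in RD samples."""
    return {
        gene: mean(v[i] for i in hfd_idx) - mean(v[i] for i in rd_idx)
        for gene, v in gene_expr.items()
    }


def differentially_expressed(lfc, threshold=1.0):
    """Keep genes with |log2 fold change| >= threshold and label direction."""
    return {
        gene: ("up" if value > 0 else "down")
        for gene, value in lfc.items()
        if abs(value) >= threshold
    }


# Hypothetical data: samples 0-1 are RD, samples 2-3 are HFD.
probes = {
    "probe_1": [8.0, 8.0, 10.0, 10.0],
    "probe_2": [9.0, 9.0, 11.0, 11.0],
    "probe_3": [7.0, 7.0, 7.5, 7.5],
}
mapping = {"probe_1": "GeneA", "probe_2": "GeneA", "probe_3": "GeneB"}

expr = gene_level_expression(probes, mapping)
lfc = log2_fold_change(expr, hfd_idx=[2, 3], rd_idx=[0, 1])
print(differentially_expressed(lfc))  # {'GeneA': 'up'}
```

Using the median rather than the mean keeps a single outlying probe from dominating the gene-level value when three or more probes map to the same gene.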
The epididymal diet model had the largest number of hypermethylated regions in HFD amongst the four tested models.

Gene ontology enrichment for differentially expressed genes

Out of the 411 genes differentially expressed in inguinal tissue between HFD and RD, 304 genes were up-regulated in HFD whereas 107 genes were down-regulated. At FDR < 0.01, the up-regulated genes are enriched in biological functions like angiogenesis, vascular system development, immune system, cell migration and motility, and response to stimuli and stress. In contrast, the down-regulated genes from the inguinal tissue are involved in lipid metabolic process, response to chemical stimulus, fat cell differentiation, mitochondrion, response to hormone stimulus and response to organic substance. In the epididymal tissue, among the 1135 differentially expressed genes, 757 were up-regulated while 378 were down-regulated. The prominent results of the gene ontology enrichment show that the up-regulated genes are responsible for immune system process, response to other organism, leukocyte activation, cytokine production, hemopoiesis, response to stress and stimuli, angiogenesis, phagocytosis and membrane organization, including motility, adhesion, differentiation and death of cells. On the other hand, the genes showing a decrease in expression in HFD are metabolism-related genes involved in fatty acid oxidation and lipid metabolism, along with fat cell differentiation and response to nutrient levels, which are all adipose tissue related. Other classes are response to chemical stimulus and response to peptide hormone stimulus. A large portion of these genes (53) are localised to the mitochondrion.

Pathway enrichment for differentially expressed genes

The enrichment for Reactome database pathways is observed for both up-regulated genes (positive enrichment score) and down-regulated genes (negative enrichment score) in the inguinal and epididymal tissue (Figure 7).
In the inguinal adipocytes from diet-induced obese mice, the innate immunity genes and the fatty acid and other metabolism-related genes are down-regulated, whereas adaptive immune response, cell cycle and organ development classes show up-regulation in HFD. The pathways reflect that the cells are storing fatty acids and growing in size while at the same time being invaded by macrophages. In the epididymal tissue, as in inguinal tissue, the fatty acid and other metabolism pathways are down-regulated; in addition, the down-regulated genes in epididymal tissue include insulin signalling genes. The up-regulated genes are active in transcription, nervous tissue development and adaptive response. These classes indicate that epididymal fat reacts more to insulin activity and that more neurovascular tissue development occurs in epididymal tissue under obesity.

Methylation driven changes in gene expression differ between inguinal and epididymal tissue

In mature adipocytes from inguinal tissue, the expression of 39 genes is affected by DMRs, while in epididymal tissue 103 genes show a change in expression under methylation control. Of the 39 inguinal genes, 33 are up-regulated with lower methylation levels in HFD. Six genes were down-regulated and hypermethylated in HFD (figure ). In the epididymal tissue, 57 genes were up-regulated with lower methylation levels in HFD, and 30 genes were hypomethylated in HFD but down-regulated at the gene expression level. Ten genes were hypermethylated in HFD and up-regulated in this state, whereas 6 genes were hypermethylated in HFD and down-regulated in this state (Figure 5).
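The four-way split reported above (hypo- or hypermethylated crossed with up- or down-regulated) amounts to a simple classification of each gene by the sign of its two fold changes. The Python sketch below illustrates that bookkeeping. The gene names and values are hypothetical, and the sign conventions (negative methylation fold change means hypomethylated in HFD) are assumptions made for the example.

```python
from collections import Counter


def classify_gene(methylation_lfc, expression_lfc):
    """Label a gene by the direction of its methylation and expression change."""
    methylation = "hypomethylated" if methylation_lfc < 0 else "hypermethylated"
    expression = "up-regulated" if expression_lfc > 0 else "down-regulated"
    return methylation, expression


def summarise(genes, min_abs_expression_lfc=1.0):
    """Count genes per (methylation, expression) class.

    genes: {gene: (methylation log fold change, expression log2 fold change)}
    Genes below the expression cut-off are ignored, mirroring the
    |log2 fold change| >= 1 filter described in the integrative methods.
    """
    return Counter(
        classify_gene(meth, expr)
        for meth, expr in genes.values()
        if abs(expr) >= min_abs_expression_lfc
    )


# Hypothetical genes carrying a DMR, with made-up fold changes.
example = {
    "gene_a": (-2.0, 1.5),   # hypomethylated, up-regulated
    "gene_b": (-1.2, -1.1),  # hypomethylated, down-regulated
    "gene_c": (1.8, 2.0),    # hypermethylated, up-regulated
    "gene_d": (-3.0, 0.4),   # fails the expression cut-off
}

for (methylation, expression), n in summarise(example).items():
    print(f"{methylation} and {expression}: {n}")
```

Applied to the epididymal gene set, the four class counts would be expected to add up to the 103 genes reported (57 + 30 + 10 + 6).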
Twenty-four genes (Nrp2, Notch3, S100a8, Ptpn18, Lgals3, Icam1, Cd300lg, Ptprc, Tspan2, Ncf4, Tnfaip8l1, Myh9, Fbxl7, Dusp7, Mpzl1, Mmp11, Prcp, Rcsd1, Gmfg, Lipe, 4930519F09Rik, C4a, Aacs, Mup5) are common between inguinal and epididymal fat in diet-induced obese mice (Figure 6), and all but one (Mup5) show the same direction of methylation and gene expression change in both tissues. Nineteen of these genes are hypomethylated and 7 up-regulated while 4 genes are hypermethylated but increase in expression in HFD. In the two tissues, some common mechanisms are affected by the HFD through these common genes. The up-regulation in HFD, under hypomethylation control, of genes like S100 calcium binding protein A8 (S100a8) and Lectin, Galactoside-Binding, Soluble, 3 (Lgals3/Gal-3) shows that these cells have high immunological activity. Lgals3 is another proinflammatory mediator that is up-regulated in both adipose tissues of the diet model. The down-regulation of the lipid metabolism genes Acetoacetyl-CoA synthetase (Aacs) and Hormone-sensitive lipase (Lipe/Hsl) shows that metabolic activities in these cells are highly reduced. Prolylcarboxypeptidase (Prcp) is a serine protease expressed in multiple peripheral organs, white blood cells, fibroblasts and endothelial cells, where it is localised to the membrane. Neuropilin-2 (Nrp2), a lymphatic vessel development gene, is found to be up-regulated in both inguinal and epididymal adipocytes.

There are also genes that differ in methylation and expression between the two tissues. We see functional differences between the genes under regulation of HFD in these tissues and thus, there are physiological differences in the obesity phenotype of these two tissues [8]. In the inguinal tissue, Carnitine palmitoyltransferase I (Cpt1) and Protein kinase (cAMP-dependent, catalytic) inhibitor gamma (Pkig), which have functions related to obesity, are up-regulated in HFD.
In the epididymal tissue, there are tissue-specific genes that are differentially expressed and have DMRs within the effective boundaries. Some of them with functions in obesity are High mobility group A1 (Hmga1), Complement component 3a receptor 1 (C3ar1), BTB and CNC homology 1 (Bach1) and Minichromosome maintenance complex component 10 (Mcm10). All of these are hypomethylated and up-regulated. Zinc-finger nuclear protein (Zfp521) is a key regulator of adipose commitment and differentiation and acts as a repressor of adipogenesis [35]. We find it up-regulated in the epididymal dataset, which is not a property expected of adipose tissue.

The published ChIP-seq data from the ChEA database, queried using Enrichr, revealed thirteen transcription factors binding (FDR < 0.01) to the 34 genes with differential gene expression and differential methylation in inguinal tissue (Table 2). In the epididymal tissue, thirteen transcription factors (FDR < 0.01) have a binding site in 93 genes with differential gene expression and differential methylation (Table 3). Five transcription factors are common between the two tissues: Friend leukemia integration 1 (Fli1), Krueppel-like factor 4 (Klf4), SWI/SNF Related, Matrix Associated, Actin Dependent Regulator Of Chromatin, Subfamily A, Member 4 (transcription activator BRG1/Smarca4), Wilms tumor protein (Wt1) and T-cell acute lymphocytic leukemia 1 (Tal-1/Scl).
Figure 5: Boxplot showing the mean methylation fold changes in the three classes of gene expression (up-regulated, down-regulated and genes not changing in expression) in the two tissues. Panels: inguinal diet model, epididymal diet model; x-axis: gene expression class; y-axis: methylation level.

Figure 6: Venn diagram representing the overlap between genes with differential gene expression along with differential methylation between the inguinal and epididymal diet models (epididymal diet model only: 79; shared: 24; inguinal diet model only: 15).

Discussion

Obesity is known to be polygenic, and environment, especially diet, plays an important role in the regulation of gene functions in obesity. In this study, we initially included two separate obesity models, diet-induced obesity and genetic obesity (ob/ob), to identify common methylation patterns in these two settings. Interestingly, the methylation data consistently revealed more hypomethylated regions in obese compared to lean mice. This was the case in both inguinal and epididymal fat from diet-induced obese mice and ob/ob mice. Hypomethylation due to decreased methionine levels [75] or mutations in the MTHFR gene [67] is associated with an increased risk of Type 2 diabetes, and feeding pregnant mice a high fat diet has been shown to lead to hypomethylation in the offspring [9, 21]. Also, multiple micronutrient deficiencies are observed in obesity, including zinc, selenium, vitamin B1/B12 and folate [19]. Folate is important for multiple biological processes as a 1-carbon source for the methylation of different molecules including DNA, RNA and proteins. Anaemia and obesity were suggested causes of folate deficiency in females [10].
Also, serum folate concentrations were lower in obese patients with non-alcoholic fatty liver [30] as well as in individuals with high BMI [38]. The lower folate concentration in serum and low folate availability to the cells is associated with an increased urinary 8-hydroxy-2'-deoxyguanosine and may further promote DNA strand breaks and global DNA hypomethylation [17, 71].

During obesity, the adipose tissue is a fast-growing mass of cells with endocrine activity, with adipocytes undergoing hypertrophy and hyperplasia and neovascularisation occurring within the tissue. Obese adipose tissue shares these properties with cancerous tissue [51]. The DNA damage caused by folate deficiency can lead to abnormal DNA repair and methylation, which can cause the cells to undergo a neoplasia-like situation [33]. Thus, we suggest that obesity might cause hypomethylation in the adipocytes as an effect of folate deficiency, which disrupts DNA repair mechanisms and leads to selective hypomethylation of the adipocytes in obese tissue. Methyl donor supplementation in our HFD mice could therefore potentially change the methylation pattern we observe.

The overlap between the two models turned out to be very limited, which is perhaps not surprising given that ob/ob mice gain weight due to increased intake of a chow diet, whereas the diet-induced obese mice are challenged with a high fat diet. Furthermore, the degree of obesity/metabolic syndrome may have progressed to different stages. Finally, leptin deficiency not only decreases satiety, it also has systemic effects on metabolism, i.e. decreasing energy expenditure by 30% [28]. Therefore we decided to focus on the diet-induced model, as it is a more relevant model in a clinical setting.

Only a small proportion of the DMRs identified were localised to promoter regions, while exons in particular were frequent targets. Methylation of CpG islands (CGIs) in promoter regions usually decreases gene expression through steric interference with transcription factor binding, but in a few cases hypomethylation has been correlated with decreased transcription [37]. Intragenic CGIs may affect gene expression indirectly, through regulatory non-coding RNAs, alternative splicing regulated by methylation status, or regulation of transcriptional elongation [20]. The effect of methylation on gene expression is not a fully understood mechanism, and DNA methylation can lead to both positive and negative regulation of gene expression.

Epididymal and inguinal adipose tissue vary in their cellular composition, origin and gene expression patterns [5]. Accordingly, the effect of obesity shows disparities in these tissues. Epididymal adipose tissue is associated with insulin resistance, diabetes, hypertension, atherosclerosis and hepatic steatosis [26]. On the other hand, inguinal adipose tissue secretes more adiponectin and fewer inflammatory cytokines and responds better to insulin [26]. In this study we discovered methylation differences between these tissues. DMRs also differ amongst the models, revealing that these tissues respond to genetic and diet-induced obesity differently. The two tissues also react differently to high fat diet stimulation. The epididymal tissue showed more differences in obese mice than the inguinal tissue, reflecting that the epididymal tissue is more variable than the inguinal tissue. Epididymal fat is known to respond to short-term high fat feeding, and the percentage weight gain is also greater in epididymal fat than in inguinal tissue, which requires longer exposure to a high fat diet for weight gain [46]. Differential gene expression showed more genes changing expression in epididymal tissue than in inguinal tissue. In the inguinal adipocytes, the pathways reflect that the cells are storing fatty acids and growing in size while at the same time being invaded by macrophages, whereas the epididymal tissue reacts more to insulin activity and shows more neurovascular tissue development under obesity. In the inguinal tissue, the immune system overshadows other functions of the cells in this tissue, as shown earlier [68].

Table 2: Transcriptional control of the differentially expressed genes with methylation region in the inguinal diet model with the effect range with the adjusted p-value < 0.01 for the group and the target genes.
Transcription factors: Fli1, Tcf4, Scl, Suz12, Klf4, Tcfap2c, Sox2, Egr1, Smarca4, Wt1, Hnf4a, Nr0b1, Runx1
Up-regulated targets: Acoxl, Gmfg, Icam1, Plekhg2, Cpt1a, Ptpn18, Dusp7, Pkig, Myo1b, Myo1g, Fbxl7, Nrp2, Crim1, Ncf4, Cotl1, Lgals3, Myh9, Cdh13, Tnfaip8l1, Slco3a1 Mpzl1, Tspan2, Myo1b, Nrp2, Crim1, Rcsd1 Dusp7, Pkig, Nrp2, Ncf4, Acoxl, Cotl1, Gmfg, Myh9, Tnfaip8l1, Slco3a1 Rcsd1, Ptpn18, Notch3, Dusp7, Myo1b, Lrrc8c, Fbxl7, Nrp2, Crim1, Lgals3, Cdh13, Slco3a1, Tspan2, Mpp6, Cdh13, Dusp7, Nrp2, Crim1, Slco3a1, Ptpn18, Rcsd1 Mmp11, Mpzl1, Myo1b, Lgals3, Gmfg, Icam1, Plekhg2, Cpt1a Mmp11, Dusp7, Myo1g, Nrp2, Lgals3, Gmfg, Prcp, Tnfaip8l1, Cpt1a, Notch3 Mpzl1, Acoxl, Prcp, Cpt1a, Ptpn18, Mmp11, Myo1g, Myh9, Cdh13, Pde7a, Cd300lg, Mpp6 Acoxl, Rcsd1, Prcp, Plekhg2, Cpt1a, Mmp11, Pkig, Myo1g, Fbxl7, Cotl1, Tnfaip8l1, Gmfg, Icam1, Notch3, Crim1, Cdh13, Mpp6 Prcp, Lrrc8c, Icam1, Lgals3, Notch3, Cd300lg Dusp7, Myo1b, Crim1, Rcsd1, Prcp, Plekhg2, Slco3a1, Notch3 Mpzl1, Adamts9, Acoxl, Icam1, Plekhg2, Cpt1a, Mmp11, Pkig, Myo1b, Nrp2, Crim1, Lgals3, Myh9, Tnfaip8l1, Pde7a Mmp11, Mpzl1, Gmfg, Myh9, Icam1, Plekhg2, Cpt1a, Notch3 Gmfg, Prcp, Icam1, Cpt1a, Ptpn18, Notch3, Mmp11, Pkig, Lrrc8c, Ncf4, Cotl1, Lgals3, Myh9
Down-regulated targets: Lipe, Aacs Slc25a23 Aacs Lipe Lipe Aacs Slc25a23
Common methylation and expression changes in epididymal and inguinal tissue

A higher number of DMRs and differentially expressed genes is observed in the epididymal tissue than in the inguinal tissue. S100a8 mRNA was highly expressed in white adipose tissue of mice and in macrophage-expressing cells, with increased expression in mature adipose tissue from obese mice [31]. High circulating levels of S100a8 are also observed in obese male individuals [57]. S100a8, along with S100a9, is an endogenous ligand of TLR4. TLR4 is known to play an important role in systemic glucose and lipid metabolism as well as in obesity-induced adipose tissue inflammation. Lgals3 up-regulation indicates that both epididymal and inguinal tissues in HFD are inflamed [54]. Lipe is one of the major enzymes for fat cell lipolysis, in which triglycerides are converted to free fatty acids (FFA). During obesity, higher concentrations of FFA are present in the blood, the cells do not need lipolysis to obtain more FFAs, and thus lipolysis is down-regulated. Aacs is a ketone-utilising enzyme that provides acetyl substrate for lipogenesis, and knock-out of the gene in mice leads to suppression of adipocyte markers like PPARγ and C/EBPα, which play an important role in adipocyte differentiation [29]. As HFD provides a surplus of fatty acids to the cells, there is no need for the cells to convert other compounds to fatty acids, and lipogenesis is reduced in the adipocytes under obesity. Prcp inactivates the α-MSH hormone and acts as an appetite stimulant [58]. Prcp up-regulation is found to be associated with obesity, diabetes mellitus and cardiovascular abnormalities. Obesity does not affect just the adipocytes but also the surrounding cells, like stromal vascular tissue and nerves. The growth of adipose tissue is highly linked to angiogenesis [16], and Nrp2 is selectively required for the development of small lymphatic vessels and capillaries [76].
It has been suggested that neovascularization might play a critical role in adipose tissue growth [55]. Myosin, heavy chain 9 (Myh9), a non-muscle myosin gene known to be associated with diabetic nephropathy [15], is also up-regulated.

Inguinal specific methylation and expression changes in diet-induced obesity

Overexpression of Cpt1 significantly reduces the content of intracellular non-esterified fatty acids (NEFAs) when adipocytes are challenged with fatty acids. These changes were caused by an increase in fatty acid uptake and a decrease in fatty acid release [25]. Methylation differences of Cpt1 are associated with triglyceride levels, BMI and WHR [18]. On the other hand, Pkig deletion or knockdown simultaneously increases osteogenesis and decreases adipogenesis [13]. Thus, fatty acid accumulation is happening in inguinal tissue, but the inguinal-specific genes point towards down-regulation of adipogenesis.

Epididymal specific methylation and expression changes in diet-induced obesity

The genes specifically regulated in the epididymal adipocytes from HFD mice are related to adipocyte differentiation and adipose tissue development. Hmga1 forms a complex with retinoblastoma protein (Rb), which positively regulates adipocyte differentiation, and is also a downstream nuclear target of insulin signalling [22]. C3ar1, which is up-regulated in obese co-twins as well as in the epididymal HFD samples of our experiments, is among the three new genes with causal relationships to obesity identified in a study in rodents integrating gene expression and DNA variations [52]. Also, C3aR, a Gi-coupled G protein-coupled receptor, is found to play a significant role in macrophages and adipose tissue and to control energy homeostasis and insulin resistance under HFD exposure [47].
Bach1, a leucine zipper transcription factor, is hypomethylated and up-regulated in the epididymal adipocytes. The downstream targets of Bach1 are involved in oxidative stress response and cell cycle, which resembles the state of hypoxia and cell number increase that adipose tissue undergoes during obesity [72]. Aldh1a3, also known as RALDH3, was earlier reported as not expressed in subcutaneous or visceral adipose tissue [59], but we find it up-regulated in HFD in epididymal tissue. Retinaldehyde dehydrogenase 1 (Raldh1/Aldh1a1) is another member of the Rald-catabolizing enzyme family, like Aldh1a3. Raldh1 knockouts are resistant to diet-induced obesity and insulin resistance and also showed increased energy dissipation. It has also been suggested that Ralds transcriptionally regulate the metabolic responses to a high-fat diet [77]. Mcm10 belongs to a class involved in the initiation of eukaryotic genome replication and plays a role in preventing DNA damage during replication. Mcm10 up-regulation in HFD indicates that the underlying DNA repair mechanisms need to be active in obese tissues, which might be either the cause or the effect of hypomethylation.

Table 3: Transcriptional control of the differentially expressed genes with methylation region in the epididymal diet model with the effect range with the adjusted p-value < 0.01 for the group and the target genes.
Transcription factors: Pparg, Suz12, Tcf3, Fli1, Tal1, Smarca4, Gata2, Rcor3, Wt1, Spi1, Foxp3, Klf4, Yap1
Up-regulated targets: S100a8, Gng2, Tpst1, Bach1, Fgd6, Aldh1a3, Notch3, Man1c1, Tnfrsf1b, Myh9, Rab31, Timp3 Plxdc2, Tm6sf1, Zfp521, Rcsd1, Hmha1, Aldh1a3, Ptpn18, Notch3, Elk3, H2-Ab1, Dusp7, Rnf128, Cd44, Fbxl7, Nrp2, Zswim6, Man1c1, Lgals3, Tnfrsf1b, Lyn, Rab31, Itga9, Tspan2, Timp3 Mmp11, Elmo1, Zfp710, Fn1, Gng2, Tnfrsf1b, Hmha1, Myh9, Bach1, Prcp, Cysltr1, Itga9, Rps26, Cyb5r4, Tpm3, Bach1, Gng2, Tnfrsf1b Plxdc2, Gng2, Tm6sf1, Gmfg, Hmha1, Fgd6, Icam1, Cyfip1, Rps26, Cyb5r4, Ptpn18, Elk3, Elmo1, Dusp7, Fbxl7, Nrp2, Ncf4, Zswim6, Ncoa7, Man1c1, Lgals3, Nckap1l, Myh9, Lyn, Rab31, Tnfaip8l1, Tpm3 Elmo1, Cd44, Nrp2, Zfp710, Gng2, Zswim6, Rcsd1, Hmha1, Bach1, Abi3, Prcp, Fgd6, Notch3 Bach1, Fgf7, Clic1, Prcp, Icam1, Rab31, Cyb5r4, Lgals3, Notch3, Cd300lg Elmo1, Emr1, Gpr65, Lgals3, Gmfg, Fgf7, Clic1, Prcp, C3ar1, Pla1a, Tnfaip8l1, Tspan2, H2-Aa Elmo1, Plxdc2, Fbxl7, Zfp710, Gng2, Ncoa7, Hmga1, Man1c1, Tnfrsf1b, Bach1, Itga9, Timp3, Cyb5r4, Mcm10, Elk3, 4930503l19rik Plxdc2, Dusp7, Ncoa7, Hmga1, Rcsd1, Clic1, Prcp, Notch3, Elk3, Tpm3 Plxdc2, Cd44, Ncf4, Mrc1, Ncoa7, Nckap1l, Gmfg, Hmha1, Lyn, Prcp, Pla1a, Notch3 Cd44, Elmo1, Elk3, Ptprc Mmp11, Elmo1, Mpzl1, Fn1, Lgals3, Gmfg, Hmha1, Clic1, Abi3, Icam1, Rps26 Plxdc2, Fbxl7, Fn1, Gng2, Ncoa7, Man1c1, Myh9, Fgd6, Aldh1a3
Down-regulated targets: Atp6v0e2, A530016l24rik, Fgf10, Hspb8, Slc1a5, Pcx, Mgst1, Lipe, Fbxo21, Tns1, Sh3pxd2a, Sod3, Nrip1, Cyp21a1, Isoc2a, Pde1a, Aacs, Galnt2 Gnai1, Tns1, Cib2, Sod3, Sgce, Otud3, Nrip1, Fgf10, Ntrk2, Galnt2, Plcd1 Nrip1, Gnai1, Fgf10, Cib2, Slc1a5, Adipor2, Pde1a, Hspb8, Slc1a5, Adipor2, Sgce Fbxo21, Gnai1, Sh3pxd2a, Zdhhc5, Atp6v0e2, Nrip1, Lipe, Chchd6, Adipor2, Aacs, Galnt2 Grina, Nrip1, Pcx, Isoc2a, Aacs, Galnt2 Lrrc58, D830050j10rik, Adipor2, Lipe Tns1, Hspb8, Galnt2, Plcd1 C4a, Gnai1, Slc1a5, Ntrk2, Sod3, Aacs Grina, Gnai1, Sh3pxd2a, Sod3, Hspb8, Galnt2 Otud3, Grina, Chchd6, Zdhhc5 Sgce, Nrip1 D830050j10rik, Aacs, Hspb8, Plcd1 Lrrc58, Atp6v0e2, Nrip1, Gnai1, Fgf10, Sh3pxd2a, Slc1a5, Ntrk2, Aacs

Regulatory mechanisms

The transcriptional controls in the two tissues show five common TFs which share target genes. Klf4 is known to regulate adipogenesis: together with Early Growth Response 2 (Krox20), it cooperatively trans-activates CCAAT/enhancer binding protein beta (C/EBPβ), which in turn activates C/EBPα and PPARγ [6]. Sumoylation of Klf4 also stimulates adipocyte differentiation [63], making it an important regulator of adipose tissue growth during obesity.
The two TFs Fli1 and Scl are regulators of hematopoiesis. In epididymal adipocytes, we found two other regulators of hematopoiesis, Gata1 and Runx1, having target sites in differentially expressed and methylated genes [53]. A substantial increase is seen in lymphopoietic and hematopoietic processes in HFD mice, which indicates immune system dysregulation by obesity. The catalytic subunit Brg1/Smarca4 interacts with PPARγ and is required for the induction of adipogenic transcription programs [56]. Brg1-containing BAF chromatin remodelling complexes have been shown to be essential for embryonic development and for the reprogramming of somatic cells [65]. Brg1 is found to regulate the pluripotency factors [60] and to alter, by demethylation, the expression of genes influencing cell proliferation and metastasis in cancer cells [4].

Conclusion

In conclusion, we show that mature adipocytes from inguinal and epididymal fat acquire tissue-specific changes in DNA methylation in diet-induced obesity. Obesity, caused either genetically or by challenging the mice with HFD, triggers hypomethylation in both inguinal and epididymal tissues. The changes in methylation as well as in gene expression are more pronounced in epididymal tissue than in inguinal tissue. The hypomethylation could be the result of an obesity-induced micronutrient deficiency in methyl donors like folate or methionine. These deficiencies lead to DNA damage during replication with faulty DNA repair. We also found numerous inflammation-related genes being up-regulated in the two tissues, which is due to the invasion of these tissues by macrophages. The inguinal tissue is undergoing hypertrophy by accumulation of triglycerides inside the adipocytes, whereas the epididymal tissue shows extensive regulation of adipogenesis along with impaired insulin signalling.
This study suggests that epididymal tissue reacts more vigorously when challenged with high fat diet and there are mechanistic differences in obesity of two tissues. References [1] “http://www.bioinformatics.babraham.ac.uk/projects/fastqc/”. [2] In Dietary Reference Intakes for Thiamin, Riboflavin, Niacin, Vitamin B6, Folate, Vitamin B12, Pantothenic Acid, Biotin, and Choline, The National Academies Collection: Reports funded by National Institutes of Health, Washington (DC), 1998, Institute of Medicine (US) Standing Committee on the Scientific Evaluation of Dietary Reference Intakes and its Panel on Folate, Other B Vitamins, and Choline Book. [3] A. H. Bakker, F. M. Van Dielen, J. W. Greve, J. A. Adam, and W. A. Buurman, “Preadipocyte number in omental and subcutaneous adipose tissue of obese individuals”, Obes Res, Vol. 12, No. 3, pp. 488–98, 2004. [4] F. Banine, C. Bartlett, R. Gunawardena, C. Muchardt, M. Yaniv, E. S. Knudsen, B. E. Weissman, and L. S. Sherman, “SWI/SNF chromatin-remodeling factors induce changes in DNA methylation to promote transcriptional activation”, Cancer Res, Vol. 65, No. 9, pp. 3542–7, 2005. [5] N. Billon and C. Dani, “Developmental origins of the adipocyte lineage: new insights from genetics and genomics studies”, Stem Cell Rev, Vol. 8, No. 1, pp. 55–66, 2012. [6] K. Birsoy, Z. Chen, and J. Friedman, “Transcriptional regulation of adipogenesis by KLF4”, Cell Metab, Vol. 7, No. 4, pp. 339–47, 2008. 12 [7] B. Caballero, “The global epidemic of obesity: an overview”, Epidemiol Rev, Vol. 29, pp. 1–5, 2007. [8] R. Caesar, M. Manieri, T. Kelder, M. Boekschoten, C. Evelo, M. Muller, T. Kooistra, S. Cinti, R. Kleemann, and C. A. Drevon, “A combined transcriptomics and lipidomics analysis of subcutaneous, epididymal and mesenteric adipose tissue reveals marked functional differences”, PLoS One, Vol. 5, No. 7, p. e11525, 2010. [9] J. Carlin, R. George, and T. M. 
Reyes, “Methyl donor supplementation blocks the adverse effects of maternal high fat diet on offspring physiology”, PLoS One, Vol. 8, No. 5, p. e63549, 2013.
[10] E. Casanueva, A. Drijanski, A. C. Fernandez-Gaxiola, C. Meza, and F. Pfeffer, “Folate deficiency is associated with obesity and anemia in Mexican urban women”, Nutrition Research, Vol. 20, No. 10, pp. 1389–1394, 2000.
[11] A. Ceriello, M. A. Ihnat, and J. E. Thorpe, “Clinical review 2: The "metabolic memory": is more than just tight glucose control necessary to prevent diabetic complications?”, J Clin Endocrinol Metab, Vol. 94, No. 2, pp. 410–5, 2009.
[12] E. Y. Chen, C. M. Tan, Y. Kou, Q. Duan, Z. Wang, G. V. Meirelles, N. R. Clark, and A. Ma’ayan, “Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool”, BMC Bioinformatics, Vol. 14, p. 128, 2013.
[13] X. Chen, B. S. Hausman, G. Luo, G. Zhou, S. Murakami, J. Rubin, and E. M. Greenfield, “Protein Kinase Inhibitor gamma Reciprocally Regulates Osteoblast and Adipocyte Differentiation by Downregulating Leukemia Inhibitory Factor”, Stem Cells, Vol. 31, No. 12, pp. 2789–99, 2013.
[14] S. C. Collins, M. B. Hoppa, J. N. Walker, S. Amisten, F. Abdulkader, M. Bengtsson, J. Fearnside, R. Ramracheya, A. A. Toye, Q. Zhang, A. Clark, D. Gauguier, and P. Rorsman, “Progression of diet-induced diabetes in C57BL6J mice involves functional dissociation of Ca2(+) channels from secretory vesicles”, Diabetes, Vol. 59, No. 5, pp. 1192–201, 2010.
[15] J. N. Cooke, M. A. Bostrom, P. J. Hicks, M. C. Ng, J. N. Hellwege, M. E. Comeau, J. Divers, C. D. Langefeld, B. I. Freedman, and D. W. Bowden, “Polymorphisms in MYH9 are associated with diabetic nephropathy in European Americans”, Nephrol Dial Transplant, Vol. 27, No. 4, pp. 1505–11, 2012.
[16] D. L. Crandall, G. J. Hausman, and J. G. Kral, “A review of the microcirculation of adipose tissue: anatomic, metabolic, and angiogenic perspectives”, Microcirculation, Vol. 4, No. 2, pp. 211–32, 1997.
[17] K. S. Crider, T. P. Yang, R. J. Berry, and L. B. Bailey, “Folate and DNA methylation: a review of molecular mechanisms and the evidence for folate’s role”, Adv Nutr, Vol. 3, No. 1, pp. 21–38, 2012.
[18] S. A. J. S. L. L. W. D. Z. K. S. T. J. O. D. K. A. D. M. Absher, M. R. Irvin, “DNA Methylation at CPT1A is Associated with Triglyceride Levels, BMI and WHR”, 2012.
[19] A. Damms-Machado, G. Weser, and S. C. Bischoff, “Micronutrient deficiency in obese subjects undergoing low calorie diet”, Nutr J, Vol. 11, p. 34, 2012.
[20] A. M. Deaton and A. Bird, “CpG islands and the regulation of transcription”, Genes Dev, Vol. 25, No. 10, pp. 1010–22, 2011.
[21] Y. Ding, J. Li, S. Liu, L. Zhang, H. Xiao, J. Li, H. Chen, R. B. Petersen, K. Huang, and L. Zheng, “DNA hypomethylation of inflammation-associated genes in adipose tissue of female mice after multigenerational high fat diet feeding”, Int J Obes (Lond), Vol. 38, No. 2, pp. 198–204, 2014.
[22] F. Esposito, G. M. Pierantoni, S. Battista, R. M. Melillo, S. Scala, P. Chieffi, M. Fedele, and A. Fusco, “Interaction between HMGA1 and retinoblastoma protein is required for adipocyte differentiation”, J Biol Chem, Vol. 284, No. 38, pp. 25993–6004, 2009.
[23] J. N. Fain, A. K. Madan, M. L. Hiler, P. Cheema, and S. W. Bahouth, “Comparison of the release of adipokines by adipose tissue, adipose tissue matrix, and adipocytes from visceral and subcutaneous abdominal adipose tissues of obese humans”, Endocrinology, Vol. 145, No. 5, pp. 2273–82, 2004.
[24] R. Feil, “Environmental and nutritional effects on the epigenetic regulation of genes”, Mutat Res, Vol. 600, No. 1-2, pp. 46–57, 2006.
[25] X. Gao, K. Li, X. Hui, X. Kong, G. Sweeney, Y. Wang, A. Xu, M. Teng, P. Liu, and D. Wu, “Carnitine palmitoyltransferase 1A prevents fatty acid-induced adipocyte dysfunction through suppression of c-Jun N-terminal kinase”, Biochem J, Vol. 435, No. 3, pp. 723–32, 2011.
[26] A. Gil, J. Olza, M. Gil-Campos, C. Gomez-Llorente, and C. M.
Aguilera, “Is adipose tissue metabolically different at different sites?”, Int J Pediatr Obes, Vol. 6 Suppl 1, pp. 13–20, 2011.
[27] G. R. Hajer, T. W. van Haeften, and F. L. Visseren, “Adipose tissue dysfunction in obesity, diabetes, and vascular diseases”, Eur Heart J, Vol. 29, No. 24, pp. 2959–71, 2008.
[28] J. L. Halaas, K. S. Gajiwala, M. Maffei, S. L. Cohen, B. T. Chait, D. Rabinowitz, R. L. Lallone, S. K. Burley, and J. M. Friedman, “Weight-reducing effects of the plasma protein encoded by the obese gene”, Science, Vol. 269, No. 5223, pp. 543–6, 1995.
[29] S. Hasegawa, Y. Ikeda, M. Yamasaki, and T. Fukui, “The role of acetoacetyl-CoA synthetase, a ketone body-utilizing enzyme, in 3T3-L1 adipocyte differentiation”, Biol Pharm Bull, Vol. 35, No. 11, pp. 1980–5, 2012.
[30] S. Hirsch, J. Poniachick, M. Avendano, A. Csendes, P. Burdiles, G. Smok, J. C. Diaz, and M. P. de la Maza, “Serum folate and homocysteine levels in obese females with non-alcoholic fatty liver”, Nutrition, Vol. 21, No. 2, pp. 137–41, 2005.
[31] A. Hiuge-Shimizu, N. Maeda, A. Hirata, H. Nakatsuji, K. Nakamura, A. Okuno, S. Kihara, T. Funahashi, and I. Shimomura, “Dynamic changes of adiponectin and S100A8 levels by the selective peroxisome proliferator-activated receptor-gamma agonist rivoglitazone”, Arterioscler Thromb Vasc Biol, Vol. 31, No. 4, pp. 792–9, 2011.
[32] R. Huxley, S. Mendis, E. Zheleznyakov, S. Reddy, and J. Chan, “Body mass index, waist circumference and waist:hip ratio as predictors of cardiovascular risk–a review of the literature”, Eur J Clin Nutr, Vol. 64, No. 1, pp. 16–22, 2010.
[33] S. J. James, I. P. Pogribny, M. Pogribna, B. J. Miller, S. Jernigan, and S. Melnyk, “Mechanisms of DNA damage, DNA hypomethylation, and tumor progression in the folate/methyl-deficient rat model of hepatocarcinogenesis”, J Nutr, Vol. 133, No. 11 Suppl 1, pp. 3740S–3747S, 2003.
[34] K. Jiao, H. Liu, J. Chen, D. Tian, J. Hou, and A. D. Kaye, “Roles of plasma interleukin-6 and tumor necrosis factor-alpha and FFA and TG in the development of insulin resistance induced by high-fat diet”, Cytokine, Vol. 42, No. 2, pp. 161–9, 2008.
[35] S. Kang, P. Akerblad, R. Kiviranta, R. K. Gupta, S. Kajimura, M. J. Griffin, J. Min, R. Baron, and E. D. Rosen, “Regulation of early adipose commitment by Zfp521”, PLoS Biol, Vol. 10, No. 11, p. e1001433, 2012.
[36] A. Kel, N. Voss, T. Valeev, P. Stegmaier, O. Kel-Margoulis, and E. Wingender, “ExPlain: finding upstream drug targets in disease gene regulatory networks”, SAR QSAR Environ Res, Vol. 19, No. 5-6, pp. 481–94, 2008.
[37] S. J. Kim, H. S. Kang, H. L. Chang, Y. C. Jung, H. B. Sim, K. S. Lee, J. Ro, and E. S. Lee, “Promoter hypomethylation of the N-acetyltransferase 1 gene in breast cancer”, Oncol Rep, Vol. 19, No. 3, pp. 663–8, 2008.
[38] J. E. Kimmons, H. M. Blanck, B. C. Tohill, J. Zhang, and L. K. Khan, “Associations between body mass index and the prevalence of low micronutrient levels among US adults”, MedGenMed, Vol. 8, No. 4, p. 59, 2006.
[39] A. Lachmann, H. Xu, J. Krishnan, S. I. Berger, A. R. Mazloom, and A. Ma’ayan, “ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments”, Bioinformatics, Vol. 26, No. 19, pp. 2438–44, 2010.
[40] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2”, Nat Methods, Vol. 9, No. 4, pp. 357–9, 2012.
[41] M. J. Lee, Y. Wu, and S. K. Fried, “Adipose tissue heterogeneity: implication of depot differences in adipose tissue for obesity complications”, Mol Aspects Med, Vol. 34, No. 1, pp. 1–11, 2013.
[42] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/Map format and SAMtools”, Bioinformatics, Vol. 25, No. 16, pp. 2078–9, 2009.
[43] N. Li, M. Ye, Y. Li, Z. Yan, L. M. Butcher, J. Sun, X. Han, Q. Chen, X. Zhang, and J. Wang, “Whole genome DNA methylation analysis based on high throughput sequencing technology”, Methods, Vol. 52, No. 3, pp. 203–12, 2010.
[44] A. Liberzon, A. Subramanian, R. Pinchback, H. Thorvaldsdottir, P. Tamayo, and J. P. Mesirov, “Molecular signatures database (MSigDB) 3.0”, Bioinformatics, Vol. 27, No. 12, pp. 1739–40, 2011.
[45] K. A. Lillycrop, E. S. Phillips, C. Torrens, M. A. Hanson, A. A. Jackson, and G. C. Burdge, “Feeding pregnant rats a protein-restricted diet persistently alters the methylation of specific cytosines in the hepatic PPAR alpha promoter of the offspring”, Br J Nutr, Vol. 100, No. 2, pp. 278–82, 2008.
[46] S. Lin, T. C. Thomas, L. H. Storlien, and X. F. Huang, “Development of high fat diet-induced obesity and leptin resistance in C57Bl/6J mice”, Int J Obes Relat Metab Disord, Vol. 24, No. 5, pp. 639–46, 2000.
[47] Y. Mamane, C. Chung Chan, G. Lavallee, N. Morin, L. J. Xu, J. Huang, R. Gordon, W. Thomas, J. Lamb, E. E. Schadt, B. P. Kennedy, and J. A. Mancini, “The C3a anaphylatoxin receptor is a key mediator of insulin resistance and functions by modulating adipose tissue macrophage infiltration and activation”, Diabetes, Vol. 58, No. 9, pp. 2006–17, 2009.
[48] B. G. Marques, D. B. Hausman, and R. J. Martin, “Association of fat cell size and paracrine growth factors in development of hyperplastic obesity”, Am J Physiol, Vol. 275, No. 6 Pt 2, pp. R1898–908, 1998.
[49] J. A. Martinez, F. I. Milagro, K. J. Claycombe, and K. L. Schalinske, “Epigenetics in adipose tissue, obesity, weight loss, and diabetes”, Adv Nutr, Vol. 5, No. 1, pp. 71–81, 2014.
[50] J. Okabe, C. Orlowski, A. Balcerczyk, C. Tikellis, M. C. Thomas, M. E. Cooper, and A. El-Osta, “Distinguishing hyperglycemic changes by Set7 in vascular endothelial cells”, Circ Res, Vol. 110, No. 8, pp. 1067–76, 2012.
[51] D. Onmer and E. Alyamac, “Obesity: an endocrine tumor?”, Medical hypotheses, Vol. 63, No. 5, pp. 790–792, 2004.
[52] K. H. Pietilainen, J. Naukkarinen, A. Rissanen, J. Saharinen, P. Ellonen, H. Keranen, A. Suomalainen, A. Gotz, T. Suortti, H. Yki-Jarvinen, M. Oresic, J. Kaprio, and L. Peltonen, “Global transcript profiles of fat in monozygotic twins discordant for BMI: pathways behind acquired obesity”, PLoS Med, Vol. 5, No. 3, p. e51, 2008.
[53] J. E. Pimanda, K. Ottersbach, K. Knezevic, S. Kinston, W. Y. Chan, N. K. Wilson, J. R. Landry, A. D. Wood, A. Kolb-Kokocinski, A. R. Green, D. Tannahill, G. Lacaud, V. Kouskoff, and B. Gottgens, “Gata2, Fli1, and Scl form a recursively wired gene-regulatory circuit during early hematopoietic development”, Proc Natl Acad Sci U S A, Vol. 104, No. 45, pp. 17692–7, 2007.
[54] D. H. Rhodes, M. Pini, K. J. Castellanos, T. Montero-Melendez, D. Cooper, M. Perretti, and G. Fantuzzi, “Adipose tissue-specific modulation of galectin expression in lean and obese mice: evidence for regulatory function”, Obesity (Silver Spring), Vol. 21, No. 2, pp. 310–9, 2013.
[55] M. A. Rupnick, D. Panigrahy, C. Y. Zhang, S. M. Dallabrida, B. B. Lowell, R. Langer, and M. J. Folkman, “Adipose tissue mass can be regulated through the vasculature”, Proc Natl Acad Sci U S A, Vol. 99, No. 16, pp. 10730–5, 2002.
[56] N. Salma, H. Xiao, E. Mueller, and A. N. Imbalzano, “Temporal recruitment of transcription factors and SWI/SNF chromatin-remodeling enzymes during adipogenic induction of the peroxisome proliferator-activated receptor gamma nuclear hormone receptor”, Mol Cell Biol, Vol. 24, No. 11, pp. 4651–63, 2004.
[57] R. Sekimoto, K. Kishida, H. Nakatsuji, T. Nakagawa, T. Funahashi, and I. Shimomura, “High circulating levels of S100A8/A9 complex (calprotectin) in male Japanese with abdominal adiposity and dysregulated expression of S100A8 and S100A9 in adipose tissues of obese mice”, Biochem Biophys Res Commun, Vol. 419, No. 4, pp. 782–9, 2012.
[58] B. Shariat-Madar, D. Kolte, A. Verlangieri, and Z. Shariat-Madar, “Prolylcarboxypeptidase (PRCP) as a new target for obesity treatment”, Diabetes Metab Syndr Obes, Vol. 3, pp. 67–78, 2010.
[59] A. Sima, D. C. Manolescu, and P. Bhat, “Retinoids and retinoid-metabolic gene expression in mouse adipose tissues”, Biochem Cell Biol, Vol. 89, No. 6, pp. 578–84, 2011.
[60] N. Singhal, J. Graumann, G. Wu, M. J. Arauzo-Bravo, D. W. Han, B. Greber, L. Gentile, M. Mann, and H. R. Scholer, “Chromatin-Remodeling Components of the BAF Complex Facilitate Reprogramming”, Cell, Vol. 141, No. 6, pp. 943–55, 2010.
[61] G. K. Smyth, “Linear models and empirical bayes methods for assessing differential expression in microarray experiments”, Stat Appl Genet Mol Biol, Vol. 3, p. Article3, 2004.
[62] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov, “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles”, Proc Natl Acad Sci U S A, Vol. 102, No. 43, pp. 15545–50, 2005.
[63] S. Tahmasebi, M. Ghorbani, P. Savage, K. Yan, G. Gocevski, L. Xiao, L. You, and X. J. Yang, “Sumoylation of Kruppel-like factor 4 inhibits pluripotency induction but promotes adipocyte differentiation”, J Biol Chem, Vol. 288, No. 18, pp. 12791–804, 2013.
[64] G. Toperoff, D. Aran, J. D. Kark, M. Rosenberg, T. Dubnikov, B. Nissan, J. Wainstein, Y. Friedlander, E. Levy-Lahad, B. Glaser, and A. Hellman, “Genome-wide survey reveals predisposing diabetes type 2-related DNA methylation variations in human peripheral blood”, Hum Mol Genet, Vol. 21, No. 2, pp. 371–83, 2012.
[65] M. D. Trottier, A. Naaz, Y. Li, and P. J. Fraker, “Enhancement of hematopoiesis and lymphopoiesis in diet-induced obese mice”, Proc Natl Acad Sci U S A, Vol. 109, No. 20, pp. 7622–9, 2012.
[66] G. Uriarte, L. Paternain, F. I. Milagro, J. A. Martinez, and J. Campion, “Shifting to a control diet after a high-fat, high-sucrose diet intake induces epigenetic changes in retroperitoneal adipocytes of Wistar rats”, J Physiol Biochem, Vol. 69, No. 3, pp. 601–11, 2013.
[67] E. A. Varga, A. C. Sturm, C. P. Misita, and S.
Moll, “Cardiology patient pages. Homocysteine and MTHFR mutations: relation to thrombosis and coronary artery disease”, Circulation, Vol. 111, No. 19, pp. e289–93, 2005.
[68] J. A. Villena, B. Cousin, L. Penicaud, and L. Casteilla, “Adipose tissues display differential phagocytic and microbicidal activities depending on their localization”, Int J Obes Relat Metab Disord, Vol. 25, No. 9, pp. 1275–80, 2001.
[69] M. C. Vohl, R. Sladek, J. Robitaille, S. Gurd, P. Marceau, D. Richard, T. J. Hudson, and A. Tchernof, “A survey of genes differentially expressed in subcutaneous and visceral adipose tissue in men”, Obes Res, Vol. 12, No. 8, pp. 1217–22, 2004.
[70] B. L. Wajchenberg, “Subcutaneous and visceral adipose tissue: their relation to the metabolic syndrome”, Endocr Rev, Vol. 21, No. 6, pp. 697–738, 2000.
[71] T. C. Wang, Y. S. Song, H. Wang, J. Zhang, S. F. Yu, Y. E. Gu, T. Chen, Y. Wang, H. Q. Shen, and G. Jia, “Oxidative DNA damage and global DNA hypomethylation are related to folate deficiency in chromate manufacturing workers”, J Hazard Mater, Vol. 213-214, pp. 440–6, 2012.
[72] H. J. Warnatz, D. Schmidt, T. Manke, I. Piccini, M. Sultan, T. Borodina, D. Balzereit, W. Wruck, A. Soldatov, M. Vingron, H. Lehrach, and M. L. Yaspo, “The BTB and CNC homology 1 (BACH1) target genes are involved in the oxidative stress response and in control of the cell cycle”, J Biol Chem, Vol. 286, No. 26, pp. 23521–32, 2011.
[73] R. A. Waterland and R. L. Jirtle, “Transposable elements: targets for early nutritional effects on epigenetic gene regulation”, Mol Cell Biol, Vol. 23, No. 15, pp. 5293–300, 2003.
[74] D. B. West and B. York, “Dietary fat, genetic predisposition, and obesity: lessons from animal models”, Am J Clin Nutr, Vol. 67, No. 3 Suppl, pp. 505S–512S, 1998.
[75] J. D. Wren and H. R. Garner, “Data-mining analysis suggests an epigenetic pathogenesis for type 2 diabetes”, J Biomed Biotechnol, Vol. 2005, No. 2, pp. 104–12, 2005.
[76] L. Yuan, D. Moyon, L. Pardanaud, C. Breant, M. J. Karkkainen, K. Alitalo, and A. Eichmann, “Abnormal lymphatic vessel development in neuropilin 2 mutant mice”, Development, Vol. 129, No. 20, pp. 4797–806, 2002.
[77] O. Ziouzenkova, G. Orasanu, M. Sharlach, T. E. Akiyama, J. P. Berger, J. Viereck, J. A. Hamilton, G. Tang, G. G. Dolnikowski, S. Vogel, G. Duester, and J. Plutzky, “Retinaldehyde represses adipogenesis and diet-induced obesity”, Nat Med, Vol. 13, No. 6, pp. 695–702, 2007.

Figure (plot not reproduced): pathway enrichment scores for down-regulated and up-regulated genes. Panels: Inguinal tissue of diet model; Epididymal tissue of diet model. Axis: Enrichment Score.

Figure 7: Boxplot showing the mean methylation fold changes in the three classes of gene expression, up-regulated, down-regulated and genes not changing in expression in two tissues

Part IV

Genotype to Phenotype

Chapter 7

Discovering phenotypes

Introduction

The differences between the individuals lie in the variations of their genomes which govern their phenotypic
differences. Therefore, to understand phenotypes better, it is necessary to know the genotype of an individual. The completion of the Human Genome Project has provided vast information about human genetic diversity. Among the most prominent international initiatives to catalogue human variation are the International HapMap Project [226] and the 1000 Genomes Project [227], which have advanced the understanding of population-specific differences in humans by sampling the major populations around the world. Common variations are stored in databases such as dbSNP [104], which holds information about SNPs and indels found in multiple organisms along with their allele frequencies and population-specific information. Knowledge of medically important human variation is equally critical and is therefore collected in resources such as the Human Gene Mutation Database (HGMD) [106] and ClinVar [107]. Some databases are disease specific, e.g. the Catalogue Of Somatic Mutations In Cancer (COSMIC) [228] and the Obesity Gene Atlas in Mammals [220]. The Online Mendelian Inheritance in Man (OMIM) [229] database catalogues diseases with all known genetic information and can be queried by either gene or phenotype. Another resource documenting phenotypic or disease information for genes is MalaCards [230], derived from GeneCards [231]. GWAS efforts have resulted in the discovery of a large number of SNPs associated with a variety of phenotypes. According to the GWAS catalogue [232] in 2011, 1617 SNPs were published as GWAS ‘hits’ with p-value ≤ 5×10⁻⁸ for 249 traits. SNPedia [233] is a wiki-based resource that documents the effects of variations on phenotypes, gathered from publications. The two sections of this chapter describe the efforts in two projects to utilise this knowledge of variation-to-trait associations. In the first project, variation-phenotype associations are used to compare phenotypes between populations.
The second project, in personal genomics, suggests probable phenotypes for a sequenced ancient genome. Known variations are annotated based on prior knowledge, whereas novel mutations are assessed for their likely impact by determining their effect on the resulting protein.

7.1 Danish Pan-genome

Genetic variations in populations arise due to natural selection pressure on genes and alleles. These variations persist over generations as they contribute to survival. Thus, analytical tools in genetics need to take into consideration population history, population substructure and admixture of populations. If ignored, these factors may confound the genetic results and lead to suboptimal findings. In the de novo assembly of the Asian and African genomes from NCBI, 5 Mb of novel sequences were identified that are not present in the human reference genome [234]. Most of these novel sequences are individual and population specific, and comparative analysis with other species indicated that they may be functional. Thus, even healthy individuals differ from the reference genome. Accordingly, when applying NGS methods in population-specific studies, it is necessary to use population-specific background data. This need guided the development of population-specific genomes, called pan-genomes. Current catalogues of genetic variation such as 1000 Genomes are based on populations divided into 4 continents, which are then subdivided into regions. To analyse population-specific data, a population-specific reference genome would be an advantage. Owing to the benefits of pan-genomes, a project was designed to assemble a reference genome for the Danish population. The study discussed here is part of the Danish pan-genome pilot project, the manuscript of which is under preparation.
Data and Analysis

The pilot study for the Danish pan-genome is based on sequencing data from ten randomly selected trios (mother-father-child) from the Copenhagen Family Bank. They were sequenced on the Illumina HiSeq 2000 at an average sequencing depth of 40X. Sequencing data were mapped to the reference genome with BWA-MEM, version 0.7.5a [235]. SAMtools [24] and Picard were used to prune the alignment files and to mark duplicate reads. GATK was used to call variants from the mapped data [26]. The SNVs and indels that occur in any of the parents were annotated for their effect on proteins using the Variant Effect Predictor tool from Ensembl [105]. The analysis focused on loss-of-function (LOF) variations. SNPs causing termination of the protein (stop gain), a change of amino acid in the protein (missense) or affecting a splice site were considered LOF variations. Indels were considered LOF if they were frameshift, splice acceptor or splice donor variants. A variation mapping to multiple transcripts may have different consequences on the proteins, such as truncation, substitution or ablation. Therefore, the most severe consequence observed across the set of transcripts was assigned to the variation. SNPs affecting transcripts without an established consensus coding DNA sequence (CCDS) were filtered from further analysis. The frequencies of the variations were calculated from the 16 parents with Danish ancestry going back two generations. To calculate the allele frequency (AF) in the Danish population, the alleles were modelled as binary variables, being either reference or variant, and the AF was estimated from a binomial distribution. To account for the uncertainty due to the small sample size, we calculated a 95% confidence interval for the AF [236]. The AF boundaries from the Danish cohort were compared with the AF for the European (EUR) population from 1000 Genomes.
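The allele-frequency step described above can be illustrated with a small sketch. It treats each allele as a Bernoulli draw, estimates the AF as the observed proportion, and attaches a 95% interval. The Wilson score interval used here is an assumption chosen for illustration, since the text only cites [236] for the interval; the function names and the example counts are hypothetical, and two alleles per parent (autosomal sites) are assumed.

```python
from math import sqrt


def allele_frequency_ci(variant_alleles, total_alleles, z=1.96):
    """Return (AF, lower, upper) for a binomial proportion.

    Uses the Wilson score interval (an assumption; the interval
    method of the original study is not specified here).
    z = 1.96 corresponds to 95% confidence.
    """
    if total_alleles <= 0 or not 0 <= variant_alleles <= total_alleles:
        raise ValueError("counts must satisfy 0 <= variant <= total, total > 0")
    n = total_alleles
    p = variant_alleles / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


def differs_from_reference(reference_af, lower, upper):
    """True if a reference-population AF falls outside the cohort interval."""
    return reference_af < lower or reference_af > upper


if __name__ == "__main__":
    # Hypothetical counts: 16 diploid parents give 32 alleles at a site.
    af, lo, hi = allele_frequency_ci(variant_alleles=9, total_alleles=32)
    print(f"AF = {af:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
    # Hypothetical EUR frequency used only to show the comparison step.
    print("differs from EUR:", differs_from_reference(0.90, lo, hi))
```

With only 32 alleles the interval is wide, which is why comparing the interval boundaries, rather than the point estimate, against the EUR frequency is the more cautious choice.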
Results and Discussion

In the pilot pan-genome dataset, 8.5 million SNVs and 1.8 million indels were detected. A large part of the genomic variation present in Danes seems to be shared with the EUR population. Even though LOF changes are undoubtedly deleterious at the protein level, some could confer evolutionary advantages, either through the absence of function or through the ability to develop novel ones. The well-known stop-gained SNP from the European population, rs497116 in the Caspase 12 (CASP12) gene, was also observed with 100% frequency in the pan-genome dataset. TT is the common genotype in Northern Europeans from Utah (CEU), and all Danish participants were homozygous for it. The derived “T” allele encodes an inactive CASP12, which leads to increased resistance to various infections and consequently underwent positive selection. Eurasians are practically fixed for the inactive variant, whereas in Sub-Saharan Africa the active variant is still common (24%). Studies have found it to be a pre-Neolithic event [237]. Another such event is a pair of nonsense and missense mutations in coiled-coil alpha-helical rod protein 1 (CCHCR1). Together, these variations determine the risk of psoriasis. Psoriasis has been found to have a prevalence of 0.37% in the Danish population [238]. Variants with functional impact and an AF that differs significantly from the parent population, i.e. EUR, could be termed Danish-specific variations. These are undoubtedly of special interest for understanding some of the genomic differences that set the Danish population apart from other related populations, and could help in determining important phenotypic traits such as disease risk or drug metabolism. Among the variations with a difference in frequency, we found a novel stop-gain mutation in the gene ubiquitin specific peptidase 17-like family member 11 (USP17L11). This mutation truncates the protein, leaving only 25%.
This mutation is present in 9 individuals from the study. It removes the proton acceptor and a predicted nuclear localisation signal. The function of the protein has not been defined experimentally; by homology, it is inferred to regulate cellular processes. A set of known SNPs with different AFs in the Danish cohort and the EUR population was also detected in these data. A stop-gained mutation in H2B histone family member M (H2BFM), which is more frequent in the Danish dataset than in EUR, truncates 72% of the protein. Due to a missense mutation in Cyclin-Dependent Kinase 11A (CDK11A), an arginine is replaced by tryptophan at position 93 of the protein, disrupting a highly conserved residue and likely affecting the structure and nearby phosphorylation sites. There was a high number of SNVs in non-coding regions with frequency differences between the Danish cohort and the EUR population. Some of them were annotated to lie within TF binding sites or miRNA coding regions. The HaploReg tool [239] was used to annotate the non-coding SNPs. They were only considered functional if there was experimental evidence of TF binding and they were also overlapped by a DNase hypersensitive site. Indels annotated for LOF were filtered for unknown CCDS, resulting in 1.1M indels. A filter for >95% gene knock-out yielded 335 LOF indels: 304 frameshift, 11 splice acceptor, 19 splice donor and 1 stop-gain. The class of genes most frequently disrupted by indels is olfactory receptors, followed by zinc finger proteins and the HLA region. Olfactory receptors are significantly enriched for extremely large proteins, whereas HLA undergoes numerous rearrangements. For these reasons, these classes tend to accumulate variations. The number of variations in different functional classes was compared to the LOF data for CEU individuals from 1000 Genomes [240].
An increase in the percentage of variations in all functional classes compared to the 1000 Genomes data was observed. These annotation data need more thorough investigation and validation. This pilot project will eventually lead to a bigger study with 50 trios sequenced at 80X to establish a Danish population-specific reference genome. This will provide a reference genome to be used in low-coverage resequencing studies of big cohorts in both evolutionary and medical studies. The data from the pilot study will be verified in the large study using higher depth and multiple pipelines to reduce false positives.

Personal genomics

Sequencing prices have declined over the last few years and data analysis tools are keeping pace with sequencing data generation, leading to the sequencing of numerous genomes at higher depth and coverage. In 2007, the genomes of James Watson and Craig Venter were published, adding a new research milestone to genetics called personal genomics [241, 242]. There is tremendous excitement about these studies, and many companies such as 23andMe (www.23andme.com), deCODEme (www.decodeme.com) and Navigenics (www.navigenics.com) started to offer personal genome information services. The Danish science writer Lone Frank describes her experiences with personal genome information in her book “My Beautiful Genome: Exposing Our Genetic Future, One Quirk at a Time”. She guides readers through various aspects of personal genomics information while exploring her own genes, ancestry and behaviour. Through her own self-discoveries, she explains the benefits of this genomic information for the future of medicine as predicted by the variations, but also points to the shortcomings of these data and the uncertainty surrounding the interpretation of evidence from the genome. Concerns have been raised regarding the clinical utility of these tests.
The results provide only a relative risk of having or not having a trait compared with the population. Also, other factors such as environment and lifestyle are not incorporated in these tests, adding high uncertainty to the results.

7.2 Ancient Genome

Individual genomes are used not just for finding disease risk, but beyond that. There have been successful studies that use personal genomes from ancient remains to track the migration of human populations around the world in the past [243]. Also, by comparing the variations in these ancient genomes to known trait-associated variations, the physical and anthropological characteristics of ancient individuals can be predicted. This was applied for phenotypic characterization of the Saqqaq genome, an individual from the extinct Palaeo-Eskimo Saqqaq culture sequenced from a lock of hair preserved in permafrost [243], and of the genome of an Aboriginal Australian sequenced from a lock of hair found in a museum [244]. A more recent study concerns the phenotypic characterization of a 7000-year-old Mesolithic man found in Spain [245]. In this chapter, I describe a study in which the ancient DNA of a male infant (Anzick-1), recovered from the Anzick burial site in western Montana, was sequenced and analysed for phenotypic traits.

Data and Analysis

For the phenotypic analysis of the Anzick-1 genome, we annotated the variations obtained from whole genome sequencing for their functional effects on the resultant proteins using the Ensembl database [105]. Genes harbouring LOF variations were annotated for associations with diseases using GeneCards [230], and we also included traits observed in Native Americans as documented in OMIM [229].
In a high-throughput approach, SNP-phenotype associations were selected from the National Human Genome Research Institute (NHGRI) GWAS catalog [246] (p-value < 1×10^−7), 23andMe (www.23andme.com) and SNPedia [233] for phenotypes that can be classified broadly into appearance and anthropometric traits, cognitive function, nutritional preferences, metabolism, personality, biochemical traits and diseases. Type 2 diabetes (T2D) associated SNPs were extracted from a recent review [247]. Genetic risk scores (GRS) were calculated for multi-SNP phenotypes as the count of risk alleles normalized by the highest possible risk allele count. To minimize the effect of DNA damage on the genotype calls, all heterozygous (C>T) and (G>A) variants were filtered out (Figure 7.1). For phenotypes with a single known associated SNP, the risk was estimated by comparing the Anzick-1 genotype to the risk allele (Figure 7.2). The details of sequencing data processing and variant calling can be found in the article by Rasmussen et al. [248]. The Anzick-1 genotypes for different phenotypes were compared to four 1000 Genomes super-populations, namely Admixed American (AMR), East Asian (ASN), European (EUR) and African (AFR) (Figure 7.1). The variants associated with interesting phenotypes were mapped to the diploid Anzick-1 genome, which was divided into Native American, Asian, European and African ancestry and visualized using the Idiographica web tool [249] (Figure 7.3).

Results and Discussion

As expected, physical traits such as dark hair and eyes, medium dark skin colour and average height were found to be similar to those of modern-day Native Americans, the descendants of the population this ancient genome represents (Figure 7.1). The GRS suggest increased risk for certain modern lifestyle diseases including T2D, coronary heart disease, stroke, celiac disease and obesity as indicated by body mass index.
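The GRS defined in Data and Analysis can be sketched in a few lines of Python. This is only a minimal illustration, assuming genotypes are given as two-letter strings and that each SNP has a single known risk allele; the SNP identifiers, genotypes and function names are hypothetical and are not taken from the study. Heterozygous C/T and G/A calls are dropped before scoring, mirroring the damage filter, and the score is the observed risk allele count divided by the highest possible count (two per retained SNP).

```python
# Heterozygous allele pairs that may reflect ancient DNA damage (C>T and G>A).
DAMAGE_PRONE = {frozenset("CT"), frozenset("GA")}


def is_damage_prone(genotype):
    """Return True for heterozygous C/T or G/A genotype calls."""
    return frozenset(genotype) in DAMAGE_PRONE


def genetic_risk_score(genotypes, risk_alleles):
    """Risk allele count normalized by the highest possible risk allele count.

    SNPs that are missing or damage-prone are skipped, so they contribute
    to neither the numerator nor the denominator.
    """
    observed = 0
    possible = 0
    for snp, risk_allele in risk_alleles.items():
        genotype = genotypes.get(snp)
        if genotype is None or is_damage_prone(genotype):
            continue
        observed += genotype.count(risk_allele)
        possible += 2  # a diploid genotype carries at most two risk alleles
    return observed / possible if possible else None


# Hypothetical example data (not from the Anzick-1 genome).
example_risk_alleles = {"snp_a": "A", "snp_b": "G", "snp_c": "T", "snp_d": "G", "snp_e": "A"}
example_genotypes = {"snp_a": "AA", "snp_b": "AG", "snp_c": "CT", "snp_d": "CG", "snp_e": "TT"}

score = genetic_risk_score(example_genotypes, example_risk_alleles)
print(score)  # 0.5: three risk alleles out of six possible on the three retained SNPs
```

A score near 0 then corresponds to few risk alleles and a score near 1 to many, which is the scale on which the individual is compared with the population distributions.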
Some of these diseases are present at high prevalence in contemporary Native American populations [250, 251, 252], which is also reflected in the ancestry painting (Figure 7.3). The Anzick-1 genome has a higher number of T2D risk alleles than three of the modern populations (ASN, AMR, EUR); however, it was similar to modern-day Africans. The decreased burden of risk alleles for T2D in non-African populations has been suggested to represent an adaptation to agriculture [253]. This hypothesis may explain the African-like burden of risk alleles in the Anzick-1 genome, as the individual was likely a hunter-gatherer. The thrifty phenotype hypothesis proposes that a high risk of chronic conditions such as coronary heart disease, T2D and stroke is associated with limited nutrition during pregnancy and infant growth [254]. This “thrifty” phenotype was likely an advantage in populations where food supplies were scarce and sporadic. The APOE e3/e4 genotype, along with a missense mutation in the Apolipoprotein E (APOE) gene, indicates a high risk of Alzheimer’s disease for Anzick-1. The e4 APOE allele additionally supports the “thrifty” gene hypothesis in the Anzick-1 genome [255]. The genetic risk of celiac disease for the Anzick-1 individual was noticeably high compared to all modern-day populations. This indicates that the individual might not have tolerated gluten as well as modern populations do. This is plausible because modern populations have adapted to a gluten-rich diet since the advent of agriculture.

Figure 7.1. Density plots showing the distribution of GRS for ten interesting phenotypes across the 1000 Genomes Project super-populations ASN (orange), AFR (red), AMR (green) and EUR (blue), with the Anzick-1 genome score denoted as a dashed vertical black line. The numbers in parentheses are the number of SNP sites used for calculating the GRS (LR = low risk, HR = high risk). Panels: hair colour (5), height (115), skin colour (9), eye colour (6), body mass index (21), type 2 diabetes (36), coronary heart disease (18), stroke (4), celiac disease (14) and C-reactive protein levels (17).

The genotype of the Anzick-1 individual for the variant associated with cleft lip suggested an increased risk of cleft lip, similar to Native Americans, who have the highest worldwide frequency of this condition [256]. The absence of the genotype responsible for a working copy of Actinin Alpha 3 (ACTN3) suggests that this individual was more likely to have endurance-type rather than sprint-type muscles. This variation in ACTN3 is suggested to be a positively selected variant in recent populations [257]. Additionally, the Anzick-1 genome had a variant in the oxytocin receptor (OXTR) that has been associated with optimism, social behaviour and empathy. The genotype of the variant in the vitamin D receptor was associated with increased activity of the protein, and two independent variants suggested increased pain sensitivity. A missense mutation in BRCA1 Interacting Protein C-Terminal Helicase 1 (BRIP1) is an indicator of increased risk of anaemia and breast cancer. The Anzick-1 individual has a missense mutation in the gene coding for 4-Aminobutyrate Aminotransferase (ABAT), leading to increased risk of GABA-transaminase deficiency, which causes mental abnormalities (Figure 7.2). Also, the Anzick-1 GRS suggested lower baseline levels of the inflammation marker C-reactive protein (CRP) in the blood, which may indicate decreased risk of the inflammation associated with different metabolic diseases.

Figure 7.2. Heatmap comparing the Anzick-1 genotypes (column labelled Clovis) for single phenotype-associated SNPs, on a scale from 0 (absence of the effect allele) to 1 (homozygous for the effect allele), to the average frequency of the effect allele in the 1000 Genomes super-populations ASN, AFR, AMR and EUR. Rows: muscle performance, earwax type, shoveled teeth, cleft lip, ApoE E4 and Alzheimer's disease (two rows), vitamin D receptor activity, empathetic behavior, intracranial volume, anemia and breast cancer, GABA-transaminase deficiency and pain sensitivity (two rows).

It should be noted that the GRS as well as the single-SNP comparisons were derived from variants that have largely been identified in population-based cohort studies of individuals of European ancestry, which may lead to ascertainment bias.

During the peer review process, this section of the article received criticism from one of the three reviewers, while the other two did not comment. Following the suggestions from the third reviewer, the section was removed from the final article. The reviewer’s comments, and the clarifications and rationale offered in response, are discussed below.

The third reviewer commented that the appropriate p-value cut-off for trait-associated SNPs from GWAS catalog studies is 5×10^−8 and not 1×10^−7 as used in the study. Lowering the cut-off would reduce the number of SNPs for some phenotypes and eliminate other phenotypes entirely. Moreover, because there are indications of associations in the tail of SNPs ranked by p-value, these less significant SNPs were included in the analysis. Regarding the high risk of metabolic diseases, the reviewer questioned the relevance of SNPs discovered in modern, sedentary European populations and their effect in an ancient hunter-gatherer individual.
It was argued, as an explanation for these phenotypes, that these SNPs have been associated with the thrifty phenotype, which was an advantage for ancient populations facing scarcity of nutrition during prenatal life that extended into adult life. Also, to the best of our knowledge there is no public GWAS on Native Americans, and, as in this study, GRS based on a separate population have been constructed before [258].

Figure 7.3. Ancestry painting showing the regions of the Anzick-1 genome (chromosomes 1 to 22) that are predicted to be of Asian (yellow), European (blue), African (red) or Native American (green) ancestry. The locations of SNPs used for the phenotype analysis are marked by triangles (multi-locus traits) and circles (single-locus traits) and coloured by their phenotype associations. Population legend: Undetermined, NAT, CEU, JPT + CHB, YRI. Phenotype legend: eye colour, hair colour, skin colour, height, BMI, T2D, coronary heart disease, stroke, celiac disease, CRP level, shovel-shaped teeth, wet earwax.

The reviewer also commented that the selection of phenotypes is biased. The phenotypes assessed in the analysis were broadly based on appearance and anthropometric traits, cognitive function, nutritional preferences and metabolism, behaviour and personality. The diseases included are known to occur at high prevalence in the modern Native American population. The rationale behind using these phenotypes was to trace the heritability of these traits from the ancient to the modern population. The reviewer questioned the comparison with four major populations instead of the North and South Native Americans.
We agree with the reviewer that it would be better to compare the individual against allele frequencies of North and South Native American populations rather than admixed Americans, but since genomic data for these populations are not publicly available, this comparison was not possible. The comparison was made against the major world populations because the main aim of the study was to identify the migration wave to ancient America from other parts of the world. Phenotypic similarity would show the common gene pool shared by the ancient individuals with the modern sub-populations, and thus their ancestors.

The average frequency of the effect allele was used to find the suggestive phenotypes for the Anzick-1 individual. As the reviewer suggested, using the penetrance of the allele as well as of the phenotype would be important for finding the effective phenotypes for the genome in question [259]. The limitation is that penetrance information is not available for a large portion of the phenotypes. It would be an advantage to develop a method that can use penetrance along with allele frequencies to deduce the phenotype from genetic information. The use of cross-ethnic SNPs would also help to overcome the issue of population bias [253, 193].

Although the section was removed from the article, its use of genomic variation and its association with phenotypes makes it an important part of this thesis. This section outlines the development of a methodology for genotype-to-phenotype association, and the reviewer’s input will help in designing a robust method.

Conclusion

The methods for associating variations with phenotypes are still under development and progressing steadily. The difficulties in genotype-to-phenotype association studies arise at multiple levels, such as inadequate description of phenotypes, too little data on genotypes, and the underlying complexity of the networks that regulate cellular functions.
Population bias arises when the background data is based on studies from specific populations. The effect of genetics on disease susceptibility varies across populations. It has been found that the genetic risk for T2D and pancreatic cancer decreased as humans migrated out of Africa towards East Asia [260]. The effects of common SNPs contributing to complex traits are modest but consistent across ancestry groups, and these SNPs would only be discovered in large trans-ethnic cohorts [261]. Systems biology-based genotype-to-phenotype methods would be advantageous because they do not consider only a single variation or gene. Instead, they account for the inheritance of natural variation and for biological networks [262]. Combining personal genomics with other high-throughput data such as gene expression and proteomics, along with clinical and pathological test results, would help in revealing unexpected molecular complexity.

Part V Epilogue

Summary and perspectives

This thesis presents and discusses state-of-the-art methods for analysing and interpreting high-throughput data as well as integrating various data sources to uncover the underlying biological mechanisms. The phenotypes of interest discussed in this thesis are multifactorial; it is therefore essential to utilise multiple data sets, data types and resources in order to investigate these complex phenotypes. Chapter 2 (Paper I) discusses a GWAS conducted with regional imputation and multiple cross-ethnicity cohort replications, which re-established five known childhood asthma genes and discovered a new susceptibility gene, CDHR3, for the exacerbation phenotype in asthma. A knowledge-based functional analysis of normal versus mutated protein showed altered expression on the cell surface of airway epithelium, suggesting a role in the infections during asthma exacerbations.
Chapter 3 describes an ongoing project in which we have established a candidate gene panel for the study of childhood asthma. We are currently awaiting the sequencing data from the pilot study. Chapter 4 explains a prediction tool that combines selected genetic features with clinical risk features to predict asthma outcome when children are 7 years old. The method first reduces the search space of genetic features by grouping SNPs into biological pathways and subsequently selecting the top phenotype-associated pathways, using a machine learning-based method. The selected pathways are then used to identify the most predictive combinations of SNPs and clinical risk features for childhood asthma. These studies are aimed at uncovering the biological mechanism behind the pathophysiology of childhood asthma, which might help in better prognosis, management and treatment of the disease.

Adipose tissue plays a central role in lipid and glucose metabolism, as it acts as an endocrine organ by secreting multiple hormones and cytokines. However, imbalances in adipose tissue metabolism lead to obesity and related traits such as T2D and cardiovascular diseases. Chapter 5 of this thesis examines the underlying mechanism behind the conversion of BAT into WAT via an intermediate transition state. It was found that two TFs govern the conversion of BAT to the transition state, while five TFs control the conversion from the transition state to WAT. An understanding of the various mechanisms involved in these adipose tissue conversions can lead to the development of therapeutic measures aimed at controlling obesity. Just as there are different adipose tissue types, there are also different kinds of adipose depots in the body. The study in Chapter 6 found that adipose tissues react differently to genetic and diet-induced obesity. Furthermore, it is noteworthy that obesity induces hypomethylation in adipose tissues.
Also, genetic and diet-induced obesity cause different effects on adipose tissues, which vary amongst adipose depots. Chapter 7.1 focuses on the Danish pan-genome study and addresses the variations observed between the Danish cohort and the European population. In Chapter 7.2, GRS calculated from a twelve-thousand-year-old ancient genome are compared against current populations in order to provide an assessment of its phenotypes as well as its ancestry. This thesis consists of six different projects with diverse goals, which were accomplished by employing the fundamental principles of data integration, annotation and enrichment analysis.

I would like to conclude this thesis by discussing some aspects of the future of systems biology-based analysis of variations. As discussed in this thesis, variations dictate the observed phenotypic differences among individuals, but they can also lead to many disorders. SNPs in particular are among the most discussed and explored variations when it comes to functional annotation. However, indels and large structural variations also need to be annotated with the same specificity and uniformity, which will lead to a complete set of annotated genetic variations. When identifying causal variations, the coding region of the genome is usually targeted. A subset of coding SNPs does not lead to a change of amino acid in the protein sequence. This is due to the degeneracy of the genetic code: multiple codons code for the same amino acid, and thus a change in nucleotide is not always reflected as an amino acid change in the protein. These variations are called synonymous SNPs. They are usually considered non-functional, as they do not change the final protein sequence. However, they have been demonstrated to alter translation kinetics and affect protein folding, which in turn affects protein structure and function.
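The distinction between synonymous and non-synonymous coding SNPs can be illustrated with a small sketch. It assumes the standard genetic code and a single-base substitution within one codon on the coding strand; the function name and the example codons are hypothetical and chosen only for illustration, and real annotation tools handle many more cases (strand, splicing, stop loss and so on).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, alt_base):
    """Classify a single-base substitution at a 0-based position within a codon."""
    alt_codon = codon[:position] + alt_base + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"  # degenerate codons: same amino acid
    if alt_aa == "*":
        return "stop gained"  # premature stop codon
    return "non-synonymous"  # amino acid change


# Hypothetical examples.
print(classify_coding_snp("GGT", 2, "C"))  # GGT (Gly) -> GGC (Gly)
print(classify_coding_snp("GAA", 1, "T"))  # GAA (Glu) -> GTA (Val)
print(classify_coding_snp("TGG", 2, "A"))  # TGG (Trp) -> TGA (stop)
```

Because a classification of this kind looks only at the encoded amino acid, it cannot capture the effects of synonymous changes on translation kinetics or regulatory sites.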
A recent study by Stergachis et al. found that more than 14% of codons in human exons simultaneously specify both amino acids and regulatory information in the form of transcription factor recognition sites; such codons are called duons. These duons are highly conserved, and at least 17% of human coding variants (including synonymous, non-synonymous and disease-associated variants) lie within duons [263]. ENCODE has also shown multiple regulatory effects in the non-coding regions of the genome, making variations there as important as coding variations. However, using the different available datasets for non-coding variations is not straightforward. Recently developed tools such as genome-wide annotation of variants (GWAVA) [264] integrate various genomic and epigenomic annotations for non-coding variations, thereby predicting their functional impact.

In general, variation is considered to be genetic, but as discussed in the different projects of this thesis, variation can extend beyond the genome of an organism. Other factors, such as cell type, cell state and the cell's surroundings, contribute to the effect of these variations. Annotation strategies should therefore be designed to take these features and their interplay into account. In a disease state, disruption of biological processes is caused by multiple variants, each making a modest contribution. Pathway- and network-based methods agglomerate these variants into clusters and find the cumulative effect of these low-risk factors, which reflects the molecular landscape underlying the observed phenotype. Eventually, as high-throughput technologies continue to improve, integrative interactions would be used to characterise and classify individuals. As the field evolves, advancements in visualisation tools and techniques are also necessary alongside the interpretation algorithms.
Several tools with non-overlapping functionalities have been developed for visualisation of “omics” data, among them Gitools [265], Cytoscape [266], Circos [267] and NAViGaTOR [268]. Existing and future tools need to be robust enough to handle enormous amounts of data.

Barring a few variations that lead to a single-point abnormality in the proteome and metabolome, variations found in complex disorders do not have a one-to-one relation with proteins or metabolites. These variations can also fall in different genes, which are parts of isolated or interacting pathways. Because many genes act cooperatively, a variation in one of them may lead to a network imbalance. This network balance can be modelled in disease studies only if knowledge of these coordinated effects and pathways is sufficiently complete. Active research in the field is required to contribute to a better understanding of different biological pathways.

In the new era of translational science, there is an explosion of high-throughput data, which presents difficulties in data interpretation and requires new paradigms for data analysis and knowledge extraction. Certain challenges need to be addressed in translational science, such as the lack of marker-disease association information and of detailed phenotypic descriptions. Intelligent data mining of clinical databases is required for finding molecular markers for diseases and for reclassifying diseases according to these markers. Clinical use of high-throughput genomic measurements for improved diagnosis, prognosis, disease profiling and target identification is also required in practice. In the future, there will be immense data resources derived from multiple “omics” analyses. Generating valuable information from currently unexploited data resources would be beneficial for understanding common and rare diseases.
Since drug responses can be genotype-dependent, precise medication matched to the underlying altered genetics can be seen as a future perspective of translational medicine. This would make clinical trials more cost-effective by reducing the number of patients and the time required.

Part VI Appendix

Chapter 8 Paper V - Role of TIMP-1 in chemotherapy resistant breast cancer

Prelude

The study aims at elucidating the role of TIMP-1 in a chemotherapy-resistant breast cancer cell line by using the principles of proteomics discussed in Chapter 1.6. Resistance to chemotherapy is a major cause of death in cancers, yet the mechanisms behind it remain largely unknown. The TIMP family is known to inhibit the proteolytic activity of matrix metalloproteinases (MMPs), and earlier studies suggest that TIMP-1 confers resistance to chemotherapy. The resistance caused by TIMP-1 is towards multiple drugs, including topoisomerase 1 (TOP1) and 2 (TOP2) inhibitors and taxanes. In this project, the global proteome and phosphoproteome of MCF-7 breast cancer cell lines expressing high and low levels of TIMP-1 were compared to find the molecular mechanism behind the resistance in the presence of high TIMP-1 levels.

My contribution to the project

My contribution to this study was the annotation of the up-regulated and hyper-phosphorylated proteins, the generation of the PPI network, and the functional analysis, including the interpretation of the enrichment data and the resultant network. An interaction network of the up-regulated and hyper-phosphorylated proteins was generated from the STRING database using a confidence cutoff of 0.7. This moderately high confidence score was chosen to produce a compact network with fewer false positives and less purely predicted data. These proteins were analysed for pathway and functional enrichment using IPA and ExPlain.
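The effect of the confidence cutoff on the network can be illustrated with a short sketch. It assumes that interactions are available as (protein, protein, score) tuples with scores between 0 and 1; the edge list, the scores and the function name are hypothetical and are not STRING output or the code used in the study.

```python
from collections import defaultdict


def build_network(edges, min_confidence=0.7):
    """Keep interactions at or above the cutoff and return an adjacency mapping."""
    adjacency = defaultdict(set)
    for protein_a, protein_b, confidence in edges:
        if confidence >= min_confidence:
            adjacency[protein_a].add(protein_b)
            adjacency[protein_b].add(protein_a)
    return dict(adjacency)


# Hypothetical scored interactions, for illustration only.
example_edges = [
    ("TOP2A", "CDK1", 0.95),
    ("TOP2A", "PLK1", 0.82),
    ("CDK1", "PLK1", 0.99),
    ("TOP1", "ATM", 0.71),
    ("TOP1", "TIMP1", 0.40),  # below the cutoff, so this edge is dropped
]

network = build_network(example_edges)

# Rank proteins by number of retained interaction partners (ties broken by name).
hubs = sorted(network, key=lambda protein: (-len(network[protein]), protein))
print(hubs)
```

Raising `min_confidence` yields a smaller network with fewer low-evidence edges, while lowering it adds more predicted interactions at the cost of more false positives.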
As discussed in the introduction (chapter 1.8), the datasets behind these tools are incomplete and non-overlapping, so the use of multiple tools ensured better coverage of annotations. Since the study was focused on chemotherapy resistance in breast cancer, the known and predicted targets of the cancer drugs added another layer of information to this functional analysis. The targets of the chemotherapeutic drugs used in the experiments (epirubicin, irinotecan, etoposide and cisplatin) were queried from the ChemProt database and DrugBank. All the information collected about the functional classes was layered on the highly connected PPI network from STRING using Cytoscape. Since genes can be part of multiple functional classes, this multi-functional data had to be visualised on the network. The colour-coding of these proteins for all their functional classes was done using the MultiColoredNodes plugin for Cytoscape. This analysis helped in clustering the related genes into classes such as cancer, cell cycle, DNA binding, drug target and drug transporters. The phosphoproteome data helped in finding which transcription factors are controlled by phosphorylation, the enrichment data identified their functions, and the interaction network showed their targets. This clustering of functional classes and PPI network generation helped in hypothesis generation and interpretation of results.

The paper is included in the appendix because breast cancer is not the major theme of this thesis. The functional and network analysis conducted as part of the study is a comprehensive illustration of integrative analysis. This analysis is based on multiple data types and uses the underlying biological relations between them, with support from the knowledge available in the field. Because of its methodological importance, this study is relevant for inclusion in the thesis.
Article
pubs.acs.org/jpr

TIMP‑1 Increases Expression and Phosphorylation of Proteins Associated with Drug Resistance in Breast Cancer Cells

Omid Hekmat,§,# Stephanie Munk,§,# Louise Fogh,†,‡,# Rachita Yadav,‡,∥ Chiara Francavilla,§ Heiko Horn,§ Sidse Ørnbjerg Würtz,†,‡ Anne-Sofie Schrohl,†,‡ Britt Damsgaard,†,‡ Maria Unni Rømer,†,‡ Kirstine C. Belling,†,‡ Niels Frank Jensen,†,‡ Irina Gromova,⊥ Dorte B. Bekker-Jensen,§ José M. Moreira,†,‡ Lars J. Jensen,§ Ramneek Gupta,∥,‡ Ulrik Lademann,†,‡ Nils Brünner,†,‡,# Jesper V. Olsen,*,§,# and Jan Stenvang*,†,‡,#

† Institute of Veterinary Disease Biology, Faculty of Health and Medical Sciences and ‡ Sino-Danish Breast Cancer Research Centre, University of Copenhagen, Dyrlægevej 88, 1., 1870 Frederiksberg C, Denmark
§ Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3b, Bldg. 6.1, 2200, Copenhagen, Denmark
∥ Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Building 208, 2800, Kongens Lyngby, Denmark
⊥ Cancer Proteomics, Genome Integrity Unit, Danish Cancer Society Research Center, DK-2100 Copenhagen, Denmark

Supporting Information

ABSTRACT: Tissue inhibitor of metalloproteinase 1 (TIMP-1) is a protein with a potential biological role in drug resistance. To elucidate the unknown molecular mechanisms underlying the association between high TIMP-1 levels and increased chemotherapy resistance, we employed SILAC-based quantitative mass spectrometry to analyze global proteome and phosphoproteome differences of MCF-7 breast cancer cells expressing high or low levels of TIMP-1. In TIMP-1 high expressing cells, 312 proteins and 452 phosphorylation sites were up-regulated. Among these were the cancer drug targets topoisomerase 1, 2A, and 2B, which may explain the resistance phenotype to topoisomerase inhibitors that was observed in cells with high TIMP-1 levels.
Pathway analysis showed an enrichment of proteins from functional categories such as apoptosis, cell cycle, DNA repair, transcription factors, drug targets and proteins associated with drug resistance or sensitivity, and drug transportation. The NetworKIN algorithm predicted the protein kinases CK2a, CDK1, PLK1, and ATM as likely candidates involved in the hyperphosphorylation of the topoisomerases. Upregulation of protein and/or phosphorylation levels of topoisomerases in TIMP-1 high expressing cells may be part of the mechanisms by which TIMP-1 confers resistance to treatment with the widely used topoisomerase inhibitors in breast and colorectal cancer.

KEYWORDS: tissue inhibitor of metalloproteinase 1, SILAC, quantitative mass spectrometry, phosphoproteomics, topoisomerase, breast cancer, resistance to chemotherapy, two-dimensional PAGE

Received: May 15, 2013
Published: August 5, 2013
© 2013 American Chemical Society. J. Proteome Res. 2013, 12, 4136−4151. dx.doi.org/10.1021/pr400457u

■ INTRODUCTION

Resistance to systemic chemotherapy is considered the main cause for the annual death of thousands of breast cancer patients worldwide.1,2 Although many different mechanisms for drug resistance have been suggested, it is still neither clinically possible to predict nor to reverse drug resistance. Tissue inhibitors of metalloproteinases (TIMPs) are a family with four members known to regulate the proteolytic activity of matrix metalloproteinases (MMPs).3,4 However, these protease inhibitors have other and non-MMP dependent biological functions, including regulation of cell proliferation, angiogenesis, and apoptosis.4,5 A number of studies suggest that the regulation of apoptosis by some of the TIMPs may have an impact on cellular sensitivity/resistance to apoptotic stimuli, including some chemotherapeutic drugs being used in cancer treatment.6−11 For example, lack of TIMP-1 protein either alone12 or in combination with topoisomerase 2A (TOP2A) gene aberrations13 was associated with an increased benefit from adjuvant treatment with a TOP2 inhibitor (epirubicin containing combination chemotherapy). Of specific interest was that this association was not observed in patients treated with a combination chemotherapy regimen not including a TOP2 inhibitor.13 Similarly, low versus high TIMP-1 plasma levels showed an association with an increased objective response rate, increased progression-free survival and increased survival of metastatic colorectal cancer patients treated with combination chemotherapy including the topoisomerase 1 (TOP1) inhibitor irinotecan.14 A similar association was not seen in metastatic colorectal cancer patients treated with combination chemotherapy without a TOP1 inhibitor.15 In addition, many publications support the link between TIMP-1 and tumor cell survival demonstrating a highly statistically significant association between high tumor or plasma levels of TIMP-1 and poor cancer patient outcome.16−18

Recent preclinical studies have supported the above-mentioned findings exemplified by the fact that human breast cancer cells, which are genetically modified to overexpress TIMP-1, showed a massive increase in expression of genes involved in signal transduction, apoptosis, adhesion and proliferation19 and the TIMP-1 overexpressing cells had decreased sensitivity to the TOP2 inhibitor epirubicin and the taxane paclitaxel.10,11 The TIMP-1-mediated decrease in sensitivity to epirubicin and paclitaxel was associated with enhanced degradation of cyclin B1 (ref 10) and activation of the PI3K/Akt/NF-κB pathway.11 Other possible mechanisms of action of TIMP-1-mediated drug resistance came from studies demonstrating an antiapoptotic activity of TIMP-1
being mediated by activation of the Akt cell survival pathways, focal adhesion kinase (FAK) and the extracellular signal-regulated kinase (ERK) pathway.6−8 In addition, TIMP-1 can bind to the tetraspanin cell surface protein CD63 (refs 20, 21), and in a human breast epithelial cell line this interaction induced antiapoptotic effects by activation of the Akt survival pathway.9 Collectively, these studies suggest that TIMP-1 confers resistance to chemotherapy, including treatment with TOP1 and 2 inhibitors and taxanes, supporting the idea of measuring the level of TIMP-1 as a predictive biomarker for topoisomerase inhibitor response in patients.12−14,22,23 Moreover, if the exact biological functions of TIMP-1 in relation to chemotherapy resistance are identified, it might be possible to interfere with the mechanisms leading to chemotherapy resistance and possibly reverse the resistance mechanisms.

In order to elucidate the mechanisms underlying the association between high TIMP-1 levels and increased chemotherapy resistance, a quantitative global investigation of high TIMP-1 expressing breast cancer cells is required. Recent breakthroughs in the proteomics technology of high-resolution mass spectrometry (MS) instrumentation allow identification of thousands of proteins in various proteomes, quantification of thousands of post-translational modifications (PTMs) such as phosphorylations and determination of protein−protein interactions.24 In particular, quantitative proteomics, which combines stable isotope labeling by amino acids in cell culture (SILAC) with enrichment strategies of modified peptides and high-performance MS, represents a powerful approach to monitor intracellular events in a global fashion. Our laboratories have generated single cell clones from the human breast cancer cell line MCF-7 expressing high or low levels of TIMP-1. We selected two clones with low TIMP-1 protein expression and two clones with high TIMP-1 protein expression.
These cell clones were employed in a SILAC-based (ref 25) quantitative MS approach to investigate the proteome and phospho-proteome changes between cells expressing high or low levels of TIMP-1 in two biological replicates.

EXPERIMENTAL PROCEDURES

Cell Cultures and SILAC Labeling

The parental MCF-7S1 breast cancer cell line (kindly provided by Professor Marja Jäättelä, The Danish Cancer Society, Copenhagen, Denmark)26 was stably transfected with pcDNA(hyg)-TIMP-1 using FuGENE transfection reagent (Roche, Denmark) and subsequently single cell cloned by limited dilution. Eleven single cell clones were screened for TIMP-1 expression levels, and two high and two low expressing TIMP-1 single cell clones were chosen for further analyses. The cells were propagated in complete media: RPMI 1640 (Gibco, Invitrogen, Denmark) with 10% FCS (Gibco, Invitrogen, Denmark) and 100 μg/mL hygromycin (Calbiochem, VWR, Denmark). For quantitative MS, cells were labeled in SILAC RPMI 1640 (PAA Laboratories GmbH, Germany)27 supplemented with 10% dialyzed FCS (Sigma, Denmark) and 200 μM glutamine (Gibco, Invitrogen, Denmark) for 12 days to ensure complete incorporation of amino acids (Figure 1). After the 12-day incorporation of amino acids, cells from each condition were seeded at the same cell density in T300 flasks, and the medium was changed two days before cell harvest. The two TIMP-1 low single cell clones were labeled with natural variants (light label) of the amino acids, one of the TIMP-1 high single cell clones with medium variants of the amino acids (L-[13C6]Arg (+6) and L-[2H4]Lys (+4)), and the second TIMP-1 high expressing single cell clone with heavy variants of the amino acids (L-[13C6,15N4]Arg (+10) and L-[13C6,15N2]Lys (+8)) (Cambridge Isotope Laboratories, Andover, MA). Cells were propagated using 0.1% trypsin/EDTA (Gibco, Invitrogen, Denmark).
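The label masses above determine the spacing expected between the light, medium, and heavy peaks of a SILAC triplet. The following is a minimal Python sketch of that arithmetic, not part of the published workflow. The isotope mass differences are standard monoisotopic values rather than figures quoted in the text, and the function name and the example peptide are hypothetical.

```python
# Monoisotopic mass differences (Da) between heavy and light isotopes.
# Standard physical constants; they are NOT quoted in the article.
DELTA_13C = 1.0033548   # 13C minus 12C
DELTA_15N = 0.9970349   # 15N minus 14N
DELTA_2H = 1.0062767    # 2H minus 1H

# Mass added per labeled residue for each SILAC channel described above:
# medium = Arg(+6) / Lys(+4), heavy = Arg(+10) / Lys(+8).
LABEL_SHIFTS = {
    "light": {"R": 0.0, "K": 0.0},
    "medium": {"R": 6 * DELTA_13C, "K": 4 * DELTA_2H},
    "heavy": {
        "R": 6 * DELTA_13C + 4 * DELTA_15N,
        "K": 6 * DELTA_13C + 2 * DELTA_15N,
    },
}


def silac_mz_offsets(peptide: str, charge: int) -> dict:
    """Return the m/z offset of each channel relative to the light peptide.

    Hypothetical helper: counts Arg (R) and Lys (K) residues and divides the
    summed label mass by the charge state.
    """
    n_arg = peptide.count("R")
    n_lys = peptide.count("K")
    return {
        label: (n_arg * shift["R"] + n_lys * shift["K"]) / charge
        for label, shift in LABEL_SHIFTS.items()
    }


# Illustrative tryptic peptide with one lysine, observed as a 2+ ion.
offsets = silac_mz_offsets("ELVISK", charge=2)
```

For a doubly charged peptide with a single lysine, the medium and heavy peaks are therefore expected roughly 2 and 4 m/z units above the light peak.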
TIMP-1 wild type (TWT-III) and TIMP-1 knockout (TKO-III) murine fibrosarcoma cell lines were previously established in our laboratory as described in ref 28. These cells were grown in M199 media (Gibco), supplemented with 10% FCS. All cells were grown at 37 °C in humidified air containing 5% CO2.

Cell Lysis and In-Solution Digestion

Cells from light/medium/heavy SILAC conditions were lysed separately at 4 °C in ice cold modified RIPA buffer [50 mM Tris, pH 7.5, 150 mM NaCl, 1% NP-40, 0.1% sodium deoxycholate, 1 mM EDTA, 5 mM β-glycerolphosphate, 5 mM NaF, 1 mM sodium orthovanadate, 1 complete inhibitor cocktail tablet per 50 mL (Roche, Basel, Switzerland)]. Proteins were precipitated overnight at −20 °C in 4-fold excess of ice cold acetone. The acetone-precipitated proteins were resolubilized in denaturation buffer (10 mM HEPES, pH 8.0, 6 M urea, 2 M thiourea) and the lysates from light/medium/heavy SILAC conditions were mixed 1:1:1 based on protein concentrations (Figure 1). The soluble proteins were reduced for 60 min at room temperature (RT) with 1 mM DTT and alkylated for 60 min at RT with 5.5 mM chloroacetamide (CAA). Endoproteinase Lys-C (Wako, Osaka, Japan) was added (1:100 m/m) and the samples were incubated for 3 h at RT. The samples were then diluted 4-fold with deionized water, and digested with trypsin (modified sequencing grade, Promega, Madison, WI) (1:100 m/m) overnight at RT. Trypsin and Lys-C activities were quenched by acidification of the samples (2% v/v of TFA, pH ∼ 2). For each of the samples, the peptide mixture was desalted and concentrated on a C18-SepPak cartridge (Waters, Milford, MA) and eluted with 1 × 2 mL of 40% acetonitrile (ACN) in 0.1% TFA followed by 1 × 2 mL 60% ACN in 0.1% TFA. A sample of each of the eluates (total tryptic proteome) was desalted and concentrated on a C18 STAGE-tip31 and eluted with 2 × 10 μL 40% ACN in 0.5% acetic acid before LC-MS/MS.

SCX Fractionation, Phospho-Peptide Enrichment, and Proteome Preparation

Peptide fractionation by SCX chromatography29,30 was performed in a 1 mL Resource S column (GE Healthcare, Sweden) on an ÄKTA FPLC system (GE Healthcare, Sweden). The peptide mixture, eluted off C18-SepPak, was loaded directly onto a 10 mL injection loop and separated by a linear gradient from 100% SCX buffer A (5 mM KH2PO4, pH 2.7, 30% ACN) to 30% SCX buffer B (5 mM KH2PO4, pH 2.7, 350 mM KCl, 30% ACN) for 30 min, followed by isocratic (100%) buffer B for 6 min at a flow rate of 1.0 mL/min. Fractions of 2 mL were collected of which some were pooled. A sample of each fraction or fraction pool was desalted and concentrated on a C18 STAGE-tip and eluted with 2 × 10 μL 40% ACN in 0.5% acetic acid before LC-MS/MS (for proteome analysis).

Phospho-peptides were enriched using Titansphere chromatography as described.29 Briefly, titanium dioxide beads (10 μm Titansphere, GL Sciences, Japan) were precoated with 2,5-dihydroxybenzoic acid (2,5-DHB) by incubating the beads in a solution of 20 mg/mL 2,5-DHB in 80% ACN, 1% TFA for 20 min at RT. Approximately 1 mg of coated beads was added to each SCX fraction or fraction pool and incubated under rotation for 30 min at RT. Early SCX fractions, mostly enriched in phospho-peptides, were incubated with coated TiO2 beads twice consecutively for better coverage. The beads were washed once with 100 μL SCX buffer B and once with 100 μL 40% ACN in 0.5% TFA and transferred in 50 μL 80% ACN in 0.5% acetic acid on top of a C8 STAGE-tip. The bound phosphopeptides were eluted directly into a 96-well plate by 2 × 10 μL 5% NH4OH followed by 2 × 10 μL 10% NH4OH/25% ACN, pH > 11. The eluate was immediately concentrated in a speedvac at 60 °C to a final volume of about 5−10 μL and acidified using 20 μL 5% ACN in 1% TFA. Each sample was then desalted and concentrated on a C18 STAGE-tip and eluted with 2 × 10 μL 40% ACN in 0.5% acetic acid before LC-MS/MS.

LC-MS/MS of Peptides

All LC-MS/MS experiments were performed on an EASY-nLC system (Proxeon Biosystems, Odense, Denmark) interfaced with a hybrid LTQ-Orbitrap Velos (Thermo Electron, Bremen, Germany)31 through a nanoelectrospray ion source. All peptides were autosampled and separated on a 15 cm column (75 μm internal diameter) packed in-house with 3 μm C18 beads (Reprosil-AQ Pur, Dr. Maisch, Germany), where the tip of the column formed the electrospray (in-house pulled by a Sutter P-2000). For liquid chromatography, a linear gradient of ACN in 0.5% acetic acid (either: 8−24% ACN for 90 min, then 24−48% ACN for 15 min, then 60% ACN for 1 min; or: 8−24% ACN in 150 min, then 24−48% ACN in 30 min, then 60% ACN for 1 min) was used at a constant flow rate of 250 nL/min. The effluent from the HPLC was directly electrosprayed into the mass spectrometer using 2.1 kV spray voltage through a liquid junction connection and a heated capillary temperature of 275 °C. A lock-mass ion (m/z 445.120024) was used for internal calibration in all experiments as described earlier.32 MS was performed in a data dependent acquisition mode where up to the 10 most intense peaks were chosen for fragmentation after acquiring each full scan using Higher energy Collisional Dissociation (HCD)33 for all MS/MS events. Dynamic exclusion was used to avoid picking peaks more than once. The settings were a mass window of 10 ppm, a max list size of 500, and a time window of 90 s. Full scans were acquired in the m/z range of 300−2000 with an R = 30,000 at m/z 400 and a target value of 1e6 ions with a maximum injection time of 500 ms. For fragment scans the settings were an isolation window of 4 Da, a minimum signal intensity of 5000, R = 7500 at m/z 400, and a target value of 5e4 ions with a maximum injection time of 250 ms.

Figure 1. Experimental workflow of the SILAC-based quantitative proteomics and phospho-proteomics for the analyses of the biological role of TIMP-1 in breast cancer. Two TIMP-1 low expressing and two TIMP-1 high expressing populations derived from MCF-7 human breast cancer cells were labeled by triple SILAC. Lysates were mixed 0.5:0.5:1:1 as shown. Proteins were digested by endoproteinase Lys-C and trypsin and tryptic peptides were fractionated by SCX. Phosphopeptides were enriched using TiO2 beads precoated with DHB. Samples were analyzed by high resolution nanoLC-MS/MS. The proteome data were determined directly from the SCX fractions.

Analysis of Total Peptide and Enriched Phospho-Peptide Data Sets by MASCOT and MaxQuant

All raw Orbitrap full-scan MS and MS/MS data were analyzed together using the software MaxQuant34 version 1.0.14.7. Proteins were identified by searching the HCD-MS/MS peak lists against a total of 174 122 protein entries encompassing a concatenated forward and reversed version of the International Protein Index (IPI) database for humans (v. 3.68) supplemented with commonly observed contaminants such as porcine trypsin and bovine serum proteins using the MASCOT search engine version 2.3.02. Tandem mass spectra were initially matched with a mass tolerance of 7 ppm on precursor masses and 0.02 Da for fragment ions, set to recognize tryptic cleavage sites and allowed for up to three missed cleavage sites. Cysteine carbamidomethylation (Cys +57.021464 Da) was searched as a fixed modification. N-Acetylation of protein (N-term +42.010565 Da), N-pyro-glutamine (Gln −17.026549 Da), oxidized methionine (+15.994915 Da), and for phosphopeptides: phosphorylation of serine, threonine, and tyrosine (Ser/Thr/Tyr +79.966331 Da) were searched as variable modifications.
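The 7 ppm precursor tolerance and the modification masses listed above can be illustrated with a short calculation. This is a minimal Python sketch under stated assumptions, not the matching logic of MASCOT or MaxQuant: the function names and the example peptide mass are hypothetical, and only the tolerance and modification masses are taken from the text.

```python
# Search parameters quoted in the text above.
PRECURSOR_TOL_PPM = 7.0
CARBAMIDOMETHYL_CYS = 57.021464   # Da, fixed modification
PHOSPHO_STY = 79.966331           # Da, variable modification


def ppm_error(observed_mass: float, theoretical_mass: float) -> float:
    """Relative mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def matches_precursor(observed_mass: float, theoretical_mass: float,
                      tol_ppm: float = PRECURSOR_TOL_PPM) -> bool:
    """True if the observed precursor lies within the ppm tolerance."""
    return abs(ppm_error(observed_mass, theoretical_mass)) <= tol_ppm


# Hypothetical unmodified peptide mass (Da), for illustration only.
unmodified_mass = 1500.0
# Candidate theoretical mass if the peptide carries one phosphate group.
phospho_mass = unmodified_mass + PHOSPHO_STY

# An observation 5 ppm away is accepted; one 10 ppm away is rejected.
accepted = matches_precursor(1000.005, 1000.0)
rejected = matches_precursor(1000.010, 1000.0)
```

At a precursor mass of 1000 Da, 7 ppm corresponds to 0.007 Da, which is far smaller than any of the modification masses and is why a phosphorylated peptide cannot be confused with its unmodified form at the precursor level.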
Labeled lysine and arginine were specified as fixed or variable modification, depending on prior knowledge about the parent ion (MaxQuant SILAC triplet identification). Peptide identifications were filtered based on their Mascot score, SILAC state, number of arginine and lysine residues and peptide length (minimum peptide length was specified to be six amino acids) to achieve a maximum false discovery rate of one percent. Protein groups were assembled and quantified based on the Occam’s razor principle. Finally, to pinpoint the actual phosphorylated amino acid residue(s) within all identified phospho-peptide sequences, MaxQuant calculated the localization probabilities of all putative serine, threonine, and tyrosine phosphorylation sites using the PTM score algorithm as described.35 Statistical Determination of SILAC Ratio Cutoffs for Expression and Phosphorylation Medians of the log2-transformed normalized SILAC ratios were calculated using the SILAC ratio sets from the biological replicates thus reflecting the high TIMP-1/low TIMP-1 ratios of expression and phosphorylation for all identified proteins and phospho-sites, respectively. The statistical P-values were calculated for detection of significant outlier ratios (Significance A values). Three levels of significance were chosen, P-value <0.01, 0.01 ≤ P-value < 0.05, P-value ≥ 0.05, and the median log2-transformed normalized SILAC ratios were plotted as a function of the log10-transformed summed peptide intensities for proteins and as a function of the log10‑transformed phosphopeptide intensities for phospho-sites. The median ratio cutoffs were then chosen so as to exclude the median ratios with Pvalues >0.05 as shown in Figure 2. 
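The outlier test described above can be sketched as follows. This is a minimal Python illustration that assumes the commonly described percentile-based definition of significance A (robust spread taken from the 15.87th, 50th, and 84.13th percentiles of the log ratios, with a one-sided normal tail probability). It is not the MaxQuant implementation, and the function names and input layout are hypothetical.

```python
import math
from statistics import median


def median_log2_ratios(replicate_ratios):
    """Median of log2-transformed normalized ratios per protein or site.

    replicate_ratios: one tuple of normalized SILAC ratios per feature,
    with one entry per biological replicate (assumed layout).
    """
    return [median(math.log2(r) for r in ratios) for ratios in replicate_ratios]


def percentile(sorted_values, q):
    """Percentile (q in 0..100) with linear interpolation between ranks."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo == hi:
        return sorted_values[lo]
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def significance_a(log2_ratios):
    """One-sided outlier P-value for every log2 ratio.

    Assumes the distribution has non-zero spread on both sides of the median.
    """
    ordered = sorted(log2_ratios)
    r_low, r_mid, r_high = (percentile(ordered, q) for q in (15.87, 50.0, 84.13))
    p_values = []
    for r in log2_ratios:
        # Use the spread on the side of the median where the ratio lies.
        width = (r_high - r_mid) if r >= r_mid else (r_mid - r_low)
        z = abs(r - r_mid) / width
        p_values.append(0.5 * math.erfc(z / math.sqrt(2)))
    return p_values
```

Ratios whose P-value falls below 0.05 would then be treated as significantly regulated, and the smallest such ratio gives the kind of cutoff reported in Figure 2.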
Figure 2. Statistical determination of the median normalized SILAC ratio cutoffs for proteins up-regulated at expression and/or phosphorylation. (A) Median log2-transformed normalized SILAC ratios for proteins plotted as a function of the log10-transformed summed peptide intensities and categorized based on significance A values for the regulation. (B) Median log2-transformed normalized SILAC ratios for class I phospho-sites plotted as a function of the log10-transformed phospho-peptide intensities and categorized based on significance A values for the regulation.

etoposide cytotoxicity in the MCF-7 cells and cytotoxicity of epirubicin and SN-38 in the murine TWT-III and TKO-III cells. All MTT and LDH assays were performed at least three times and each time in triplicates.

STRING Network and Ingenuity Pathway Analysis

Proteins with median normalized SILAC ratios ≥ 2.3 at the expression level and/or median normalized SILAC ratios ≥ 3.0 at the phosphorylation level (460 entries) were used to build a protein−protein interaction network from the STRING database system (http://string-db.org/; ref 36) at a reliability score of at least 0.7 (ref 37). The same set of proteins was analyzed for enrichment of pathways and functional classes using the tools Ingenuity Pathway Analysis (IPA, www.ingenuity.com) and Explain (Biobase, http://www.biobase-international.com/) as well as in-house phenotypically related gene collections. The 460 UniProt entries, mapping to 453 encoding genes (six protein entries were obsolete in UniProt, two were not mapped to genes, and one demerged into two genes) were further analyzed for different functional class enrichments and drug interactions using gene ontology.
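The selection rule and the network threshold described above amount to two simple filters. The following is a minimal Python sketch, assuming ratios are held in plain dictionaries and interactions are (protein, protein, score) tuples. The function names, the edge scores, and the HSP74 phospho-site ratio are illustrative assumptions; the cutoffs and the CD44, CLU, and TOP1 ratios follow values reported in this article.

```python
PROTEIN_CUTOFF = 2.3     # median normalized ratio, expression level
PHOSPHO_CUTOFF = 3.0     # median normalized ratio, phosphorylation level
STRING_MIN_SCORE = 0.7   # minimum interaction confidence


def select_upregulated(protein_ratios, phospho_ratios):
    """Proteins passing the expression cutoff and/or having at least one
    phospho-site passing the phosphorylation cutoff."""
    selected = {p for p, r in protein_ratios.items() if r >= PROTEIN_CUTOFF}
    selected |= {
        p for p, sites in phospho_ratios.items()
        if any(r >= PHOSPHO_CUTOFF for r in sites)
    }
    return selected


def filter_edges(edges, selected, min_score=STRING_MIN_SCORE):
    """Keep interactions between selected proteins at or above min_score."""
    return [
        (a, b, score) for a, b, score in edges
        if score >= min_score and a in selected and b in selected
    ]


# Illustrative input; edge scores are invented for the example.
protein_ratios = {"CD44": 2.7, "CLU": 4.0, "TOP1": 1.8, "HSP74": 1.0}
phospho_ratios = {"TOP1": [3.0, 2.0, 2.8], "HSP74": [1.1]}
edges = [("CD44", "CLU", 0.9), ("CD44", "TOP1", 0.5), ("CLU", "HSP74", 0.95)]

selected = select_upregulated(protein_ratios, phospho_ratios)
network = filter_edges(edges, selected)
```

In this toy input TOP1 enters the list through its phosphorylation site rather than its expression ratio, which is the "and/or" case in the selection rule.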
Targets of the used chemotherapies epirubicin, irinotecan, etoposide, and cisplatin were queried from ChemProt database38 and DrugBank.39 This gene annotation data was layered on the protein−protein interaction network from the STRING database using Cytoscape40 and its MultiColoredNodes plugin.41 Western Blotting and TIMP-1 ELISA Lysates from each of the four selected cell clones were harvested individually by scraping off the cells in ice cold PBS, spun down at 300g at 4 °C and lysed by incubation in ProteoJET Mammalian Cell Lysis Reagent (Fermentas, Germany) containing protease inhibitors (Aprotinin, Leupeptin, Pepstatin A and Pefa Block, 1 μg/mL) (Calbiochem, VWR, Denmark) and phosphatase inhibitors (sodiumfluoride and sodiumorthovanadate, 1 mM) (Calbiochem, VWR, Denmark) for 10 min at RT. The cells were then spun at 18 000g for 15 min at 4 °C, and each of the four supernatants was transferred to a new tube. The total amount of protein was determined by the BCA Protein Assay kit (Pierce, VWR, Denmark) according to manufacturer’s instructions. The NuPAGE system (Invitrogen A/S, Denmark) was used for SDS-PAGE gel separation of proteins according to manufacturer’s instructions. In brief, lysates were mixed with NuPAGELDS loading buffer and NuPAGE sample reducing agent. Samples were then incubated at 70 °C for 10 min. Samples were loaded onto NuPAGE Novex 4−12% Bis-Tris gels with 50 μg/lane and were run in NuPAGE MOPS buffer with NuPAGE antioxidant according to the manufacturer’s instructions. Gels were blotted on polyvinylidene difluoride membranes with 2× NuPAGEtransfer buffer with 20% ethanol. 
Blots were blocked in washing buffer (PBS + 0.1% Tween 20) containing either 5% nonfat dry milk (for TOP1, TOP2B) or 2% ECL prime blocking reagent (for TIMP-1, β-actin, TOP2A) for 1 h and incubated overnight with the appropriate primary antibody diluted in blocking reagent: In-house mouse monoclonal anti-TIMP-1 antibody VT-7, 0.1 μg/mL,45 rabbit monoclonal anti-TOP1 1:10 000 (Epitomics, Abcam, Burlingame, CA), rabbit monoclonal anti-TOP2A 1:1000 (Cell signaling, VWR, Denmark), sheep polyclonal anti-TOP2B 1:500 (R&D systems, Trichem, Denmark), and mouse monoclonal anti-β-actin 1:1 500 000 (Sigma-Aldrich, Denmark). Blots were washed four times for a period of 30 min in washing buffer and incubated with secondary horseradish peroxidase-conjugated antibody (Dako A/S, Denmark) diluted in blocking reagent. The blots were washed four times for a period of 30 min and developed using the Amersham ECL plus Western Blotting Detection Kit (for TOP2B) or Amersham ECL Advance Western Blotting Detection Kit (for TIMP-1, βactin, TOP2A) (GE Health/Amersham Bioscience, VWR, Denmark) according to the manufacturer’s instructions. Blots were visualized with a CCD camera (BioSpectrum Imaging System, UVP BioImaging, Upland, CA). During experiments, the differences in cellular expression of TIMP-1 among the selected clones were routinely assayed with an in-house sandwich ELISA assay employing a sheep polyclonal anti-TIMP-1 antibody in the catching step and the MAC15 anti-TIMP-1 monoclonal antibody in the detection step, as described in ref 46. The levels of murine TIMP-1 in the wild-type and knockout cells were measured by a commercial quantikine mouse TIMP-1 ELISA kit (R&D Systems) according to the manufactures recommendations. 
Phosphorylation Sites Sequence Bias Analysis Sequence bias around the up-regulated phosphorylation sites was visualized using the IceLogo software42 which compared class I up-regulated phosphorylation sites (median normalized ratios ≥ 3.0) with reference class I phosphorylation sites (0.8 ≤ median normalized ratios ≤ 1.2), all from the same data set. The outcome of the IceLogo analysis was compared to known kinase substrate motifs (www.phosida.com35) in order to obtain over-represented, unbiased, and under-represented known kinase substrate motifs for the up-regulated phospho-sites. NetPhorest and NetworKIN Kinase Prediction Analysis The NetworKIN algorithm43 combines kinase consensus motifs, extracted from the NetPhorest atlas,44 with contextual information of the kinases and their substrates in protein association networks extracted from the STRING database. It was applied on all phosphorylation sites obtained by MS analysis. Since NetworKIN incorporates data from NetPhorest, the results include not only the specific kinase but also the name of the NetPhorest group. Growth Assay and Sensitivity to Chemotherapy For the growth assay, 40 000 cells/well of each cell line were plated in six 6-well plates. Each day, one plate was harvested: media were removed from all wells and cells were washed with PBS before the addition of 1 mL trypsin. After incubation for 60 s, 1 mL of media was added and cells were resuspended. Three individual samples from each cell suspension were counted using a hemocytometer, and the doubling time for each cell line was calculated based on three independent experiments. Growth medium was renewed on the fourth day after plating the cells. 
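The doubling time calculation mentioned for the growth assay can be sketched as below. This is a minimal Python illustration that assumes exponential growth between two counts; the article does not state which formula was used, and the function name and the cell counts are hypothetical.

```python
import math
from statistics import mean


def doubling_time_hours(n_start: float, n_end: float, elapsed_hours: float) -> float:
    """Doubling time under the assumption of exponential growth."""
    if n_end <= n_start:
        raise ValueError("cell number must increase to compute a doubling time")
    return elapsed_hours * math.log(2) / math.log(n_end / n_start)


# Hypothetical counts from three independent experiments:
# (cells plated, cells counted, hours between the two counts).
experiments = [
    (40_000, 160_000, 48.0),
    (40_000, 150_000, 48.0),
    (40_000, 170_000, 48.0),
]

per_experiment = [doubling_time_hours(*e) for e in experiments]
mean_doubling_time = mean(per_experiment)
```

A fourfold increase over 48 h corresponds to two doublings, that is, a doubling time of 24 h.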
Viability of the four included TIMP-1 cell clones was tested upon treatment with the TOP2 inhibitor epirubicin (Meda AS, Denmark), the TOP1 inhibitor SN-38 (the active metabolite of irinotecan) (Sigma-Aldrich, Denmark), the TOP2B inhibitor 2-(4-((7-chloro-2-quinoxalinyl)oxy)phenoxy)propionic acid (XK 469) (Sigma-Aldrich, Denmark), the TOP2 inhibitor etoposide (Meda AS, Denmark), and cis-diamminedichloroplatinum (Cisplatin) (Hospira, Denmark). Cells were seeded in 96-well plates with 8000 cells/well and allowed to plate overnight. Cells were then treated with the appropriate drug for 48 h, and cell viability was determined by addition of MTT (Sigma-Aldrich, Denmark) dissolved in PBS. MTT was added to the cells in complete media at a final concentration of 0.5 mg/mL. Cells were incubated at 37 °C for 3 h, and the generated formazan crystals were dissolved with 20% SDS in 0.02 M HCl overnight and measured at 570 and 690 nm. Based on the dose response curves generated for the low and high TIMP-1 cell clones in response to each of the chemotherapeutics, the inhibitory concentration resulting in 50% viability (IC50) for each of the chemotherapeutics was estimated. As previously described,28 a lactate dehydrogenase (LDH) release assay (Cytotoxicity Detection Kit; Roche A/S, Denmark) was applied to evaluate etoposide cytotoxicity in the MCF-7 cells and cytotoxicity of epirubicin and SN-38 in the murine TWT-III and TKO-III cells. All MTT and LDH assays were performed at least three times and each time in triplicates.

Two-Dimensional Gel Electrophoresis and Immunoblotting

Cellular lysates were subjected to IEF (pI 4−7) two-dimensional PAGE (2D PAGE) as previously described.47 Between 20 and 30 μL of sample was applied to the first dimension, and IEF gels were run for each sample. Proteins were visualized using a silver staining procedure. Western blots of lysates for immunoblotting were prepared as previously described. Briefly, proteins were resolved by 2D-gel electrophoresis, blotted onto Hybond-C nitrocellulose membranes (Amersham Biosciences), and reacted with a TOP1 specific rabbit antibody (1:2000 TOP1 antibody, Epitomics) followed by detection of immune complexes with a horseradish peroxidase-labeled polymer (1:200) (Envision+ detection kit; DAKO). Blocking of antibody cross-reactivity was done using a protein-free blocking buffer (Thermo Fisher Scientific, Waltham, MA). Membranes were reversibly stained with Ponceau S solution (Sigma-Aldrich) to match the location of proteins in the membrane with the Western blot signal and to ensure proper focusing of protein spots. To identify the phosphorylation state of TOP1, one aliquot of each of low TIMP-1 A or high TIMP-1 B cell lysates was treated for 30 min at 37 °C with lambda protein phosphatase (Lambda PP), a Mn2+-dependent protein phosphatase with activity toward phosphorylated serine, threonine, and tyrosine residues, according to the manufacturer’s instructions. One aliquot of each cell clone was mock-treated prior to resolving by 2D gel electrophoresis.

Statistical Analysis of Cell Viability Data and Cell Growth

All calculations were performed using SAS software (version 9.2, SAS Software, Inc., Cary, NC). For statistical analyses, the relationship between TIMP-1 concentration and cell survival was analyzed with a mixed model solution, a generalization of the general linear model solution that contains both fixed and random effects. Within the model, drug doses and TIMP-1 levels were set as fixed effects, whereas cell line, plate placement, and experiment number were set as random effects. Mean values of the doubling times for the low TIMP-1A/B and the high TIMP-1A/B cells were analyzed by Student’s t test. The level of significance was set at P < 0.05.

RESULTS

Proteomic and Phosphoproteomic Analysis of TIMP-1 Expressing Cells

TIMP-1 transfected single cell clones obtained from the human breast cancer cell line MCF-7S1 were used as the cellular model, and the clones were SILAC labeled (Figure 1). Based on TIMP-1 protein expression levels as determined by ELISA, we selected two high and two low expressing clones from our panel of 11 single cell clones. From both replicates combined, we found 41 417 unique peptides originating from 6709 protein groups and 5421 unique class I phospho-sites mapped to 1640 protein groups (Supporting Information Tables 1−4). The overlap of the identified protein groups between the two biological replicates was 68% (Supporting Information Figure 1A). Serine (Ser), threonine (Thr), and tyrosine (Tyr) phosphorylation sites comprised 92.2%, 7.4%, and 0.4% of the total phosphorylation sites, respectively (Supporting Information Table 5), with similar percentages for the up-regulated Ser/Thr/Tyr sites. Moreover, one or two phosphorylation sites were detected in most phosphorylated peptides (Supporting Information Figures 1B and C). Comparative analysis of the proteomic data from the TIMP-1 clones revealed that the TIMP-1 high expressing cells overexpressed 312 proteins (median normalized ratio ≥ 2.3) and that 452 class I phosphorylation sites were up-regulated (median normalized ratio ≥ 3.0), whereas 542 proteins (median normalized ratio ≤ 0.4) and 443 phosphorylation sites (median normalized ratio ≤ 0.3) were down-regulated (Table 1). Spearman’s correlation coefficients of 0.9 between normalized SILAC ratios for proteins identified in both experiments and coefficients of 0.6−0.8 between normalized SILAC ratios for phosphorylation sites in both experiments (Supporting Information Figures 2A and B) were in line with previous phosphoproteomics experiments.48 Similar to what has been observed in most SILAC experiments,49,50 the majority of proteins (>75%) and phosphoproteins (>60%) were found to have SILAC ratios between 0.5 and 2.0. To exemplify the general validity of the data set, housekeeping proteins such as Heat shock 70 kDa protein 4 (hsp74) (Figure 3B, left) and β-tubulin (data not shown) were found to be expressed in equal amounts in both TIMP-1 low and high expressing clones. Differential expression of TIMP-1 among the cells expressing low and high levels was verified in the proteome data set (Figure 3B, right), in concordance with Western blot and ELISA analyses (Figure 3A).

Table 1. Summary of Quantitative Proteomics and Phospho-Proteomics Data
identified protein groups: total 6709a; ratio ≥ 2.3b: 312; ratio ≤ 0.4b: 542
identified phospho-sitesc: total 5421a; ratio ≥ 3.0b: 452; ratio ≤ 0.3b: 443
a Nonredundant total number identified in both experimental replicates 1 and 2. b Ratios are medians of the normalized high TIMP-1/low TIMP-1 SILAC ratios from both experimental replicates 1 and 2; ratio cutoffs were determined from the statistical analyses based on significance A values (Figure 2). c MASCOT score ≥ 10; PTM score ≥ 25; localization probability ≥ 0.8.

Pathway and Functional Category Enrichment in TIMP-1 High Expressing Cells

Proteins found to be up-regulated (median normalized ratio ≥ 2.3) and/or hyper-phosphorylated (median normalized ratio ≥ 3.0) in the TIMP-1 high expressing cells were selected for further analysis. The cutoff values were selected based on the significance of regulation at both expression and phosphorylation levels (Figure 2). Using these cutoff values, a combined list of 460 highly up-regulated proteins was generated. An interaction network of 146 nodes was obtained for these proteins at a high confidence level (0.7) in the STRING database (Figure 4). In Table 2A, we list the up-regulated and/or hyperphosphorylated proteins with known biological relation to TIMP-1.
Interestingly, TIMP-1 was directly connected to the CD44 antigen (up-regulated 2.7 fold, Table 2A), which has been shown to bind TIMP-1,51 and to clusterin (CLU) (up-regulated 4 fold, Table 2A), which has been associated with drug resistance to both TOP1 and TOP2 inhibitors.52,53 These 460 proteins were used for pathway and functional enrichment analysis (Supporting Information Table 7). IPA mapped the 460 proteins to 453 entries in the Ingenuity database. The JAK/STAT signaling pathway and the cell cycle G2/M DNA damage checkpoint regulation pathway were among those significantly perturbed in the data set. The JAK/STAT pathway is one of the main signaling pathways in eukaryotic cells and is involved in the control of cell proliferation, differentiation, survival, and apoptosis.54 The G2/M damage checkpoint is often deficient in cancer, resulting in survival of cells with DNA damage and mutations leading to resistance and sustained proliferation.55

The same gene set was further analyzed for biological function, molecular processes enrichment and drug interactions. Proteins overexpressed in TIMP-1 high expressing cells participate in several functional categories including: apoptosis (e.g., CLU, FosL2, mTOR, TIMP-1, CD44, TOP1, TOP2B, ABCC1), cell cycle (e.g., mTOR, TOP1, TOP2B), DNA repair (e.g., TOP1, CLU), drug resistance or sensitivity (e.g., NDRG1, TOP2A, CD59, CLU), drug targets (e.g., TOP1, TOP2A, TOP2B), and drug transport (e.g., ABCC1, -3, -6). For a complete list of functional groups and the proteins discovered in each group, see Supporting Information Table 7. The most relevant functional classes were layered on the protein−protein interaction network from STRING with color-coding representing different functional classes (Figure 4). This analysis aimed to search for novel links between TIMP-1 and cancer related pathways, thereby identifying potential new functional roles of TIMP-1 in cancer. In addition to TIMP-1 and its direct interactors CD44 and CLU in the functional network (Figure 4), noteworthy among other proteins in the network are TOP1, TOP2A, and TOP2B, all of which are involved in maintaining DNA topology during DNA replication, transcription, or repair of DNA double strand breaks. Interestingly, we also identified the mammalian target of rapamycin (mTOR), which has been implicated in the resistance to TOP2 inhibitors.56 Activator protein-1 (AP-1) transcription factor complex members JunB (up-regulated 3.0 fold in the TIMP-1 high expressing cell lines) and Fos-related antigen 2 (FRA2 or FosL2) (up-regulated 2.6 fold in the TIMP-1 high expressing cell lines) are also in the network (Figure 4). Noteworthy among other proteins of particular interest in relation to TIMP-1 that did not show up in the protein interaction network but were nevertheless found to be up-regulated in TIMP-1 high expressing cells (Table 2A) was the membrane protein CD63, previously shown to bind to TIMP-1.9,21,51 CD63 was up-regulated approximately 2-fold (statistical cutoff 2.3).

Figure 3. Validation of the model system. (A) Western blot analysis of TIMP-1 with β-actin as a loading control. The quantitative ELISA measurements of TIMP-1 are shown below the blot. (B) Representative peptide from HSP74 showing 1:1:1 ratios independent of TIMP-1 (left) and representative unmodified peptide from TIMP-1 (right). All peptides are in SILAC triplets. Different colors correspond to the colors in Figure 1, representing the SILAC L/M/H labels.

Figure 4. Functional class distribution of protein−protein interaction network of the identified proteins up-regulated at expression and/or phosphorylation levels in TIMP-1 high expressing cells. The nodes in the STRING network are sectored by different colors for functional annotations. Circular nodes originate from the proteome data set (median normalized ratio ≥ 2.3), triangular nodes originate from the phosphoproteome data set (median normalized ratio ≥ 3.0), whereas the octagon represents the proteins detected in both data sets. TOP1, TOP2A, TOP2B, TIMP-1, CD44, and CLU are highlighted.

High TIMP-1 Protein Level Is Associated with Increased Levels and Phosphorylation of Topoisomerases

DNA topoisomerases were found in the enriched functional classes (Figure 4). More specifically, TOP2A displayed increased expression (8-fold) in TIMP-1 high expressing cells (Table 2A) and the proteomics data was validated by Western blotting (Figure 5B). Proteomics data also showed that TOP1 was 1.8 fold higher expressed (Table 2A) in the TIMP-1 high expressing cells; however, this slight up-regulation is lower than the statistically determined cutoff value of 2.3, and higher expression of TOP1 is not detectable in the Western blot (Figure 5B). There was no differential expression of TOP2B between TIMP-1 low and high expressing cells (Table 2A), also validated by the Western blotting (Figure 5B).

Many phosphorylation sites on the topoisomerases were found to be up-regulated in the TIMP-1 high expressing cells (Table 2B). TOP1 had three phosphorylation sites (Ser 2, 10 and 112) where phosphorylation was up-regulated in TIMP-1 high expressing cells (Figure 5A, left and Table 2B). The phosphorylation at Ser 2 was only detected in the first replicate with a fold-change of about 3. Phosphorylation at Ser 10 and 112 was detected in both replicates, with up-regulation of about 2- and 3-fold, respectively, in the TIMP-1 high expressing cells. TOP2A had one identified phosphorylation site at Ser 1328, which was about 13-fold more phosphorylated in the TIMP-1 high expressing cells, although this was only detected in replicate one. The fact that SILAC ratios for the phosphorylations of TOP1 and TOP2A are generally higher than the SILAC ratios for their expressions (Table 2), between TIMP-1 low and high expressing cells, indicates some upregulation of phosphorylation at a post-translational level.

To confirm these differences in the phosphorylation states of TOP1 between low and high TIMP-1 expressing cells, we exploited the fact that a phosphorylated protein will almost always have a more acidic pI than its corresponding unphosphorylated form. IEF followed by immunoblotting allows detection of more acidic forms of a protein and thus of its PTM state. Indeed, 2D gel-based comparative analysis of TOP1 in low TIMP-1 and high TIMP-1 cells showed that TOP1 exists in a state of at least four modified forms in low TIMP-1 expressing cells, and that TIMP-1 overexpression is accompanied by TOP1 gaining additional modifications, with a clear shift toward multiple modification states (Figure 5C, TOP1 immunoblots). Since PTMs other than phosphorylation can cause changes in the pI, we treated cell lysates with lambda phosphatase prior to gel analysis to show that the multiple forms identified were mainly due to phosphorylation events. The TOP1 patterns obtained in this manner (Figure 5C, TOP1 immunoblots + phosphatase) showed substantially fewer modified forms as compared with the untreated samples, indicating that the majority of the more acidic forms are due to phosphorylations, which supports our MS-based analysis.

The most heavily phosphorylated topoisomerase enzyme was TOP2B, in which several Ser sites were phosphorylated in the TIMP-1 high expressing cells: Ser 1336, 1340, 1342, 1344, 1400, 1413, 1461, 1466, 1522, 1524, and 1526 (Figure 5A, right and Table 2B). The phosphorylations on Ser 1461 and 1466 were only detected in the first experiment, but were found to have an 11-fold increase. All other phosphorylation sites were found to be 2−5-fold up-regulated in TIMP-1 high expressing cells. Since TOP2B is similarly expressed between TIMP-1 low and high expressing cells (Figure 5B and Table 2A), the SILAC ratios for the phosphorylations indicate true up-regulation of several phosphorylations at a post-translational level.

Table 2B (columns recovered from the extracted text)
high TIMP-1/low TIMP-1e | NetPhorest group | kinase | NetworKIN score
3f | CK2_group | CK2alpha | 0.5
2 ± 2 | ATM_ATR_group | ATM | 0.5
2.8 ± 0.4 | CDK2_CDK3_CDK1_CDK5_group | CDK1 | 0.3
13f | CDK2_CDK3_CDK1_CDK5_group | CDK1 | 0.6
2 ± 1 | CK2_group | CK2alpha | 0.2
2 ± 1 | PLK_group (all three sites) | PLK1 (all three sites) | 0.2
4 ± 2 | CDK2_CDK3_CDK1_CDK5_group | CDK1 | 0.3
4 ± 2 | CDK2_CDK3_CDK1_CDK5_group | CDK1 | 0.3
4 ± 2 | CK2_group | CK2alpha | 0.1
11f | PKC_group | PKCdelta | 0.3
11f | PLK_group | PLK1 | 0.2
5 ± 2 | CK2_group | CK2alpha | 0.2
5 ± 2 | PLK_group | PLK1 | 0.2
(A) Median protein SILAC ratios from quantitative proteomics. (B) Marker phospho-peptides from topoisomerases identified by quantitative phospho-proteomics. Median phospho-peptide SILAC ratios are reported. Potential protein kinases responsible for the up-regulated phosphorylation sites (NetworKIN) in topoisomerases are reported along with their respective scores. b All protein ratios (total peptide counts ≥ 2) are medians of the normalized SILAC ratios from experimental replicates 1 and 2. c Only identified in experimental replicate 1. d A site localization probability cutoff of 0.80 was used. e All ratios are medians of the normalized SILAC ratios from experimental replicates 1 and 2. f Only identified in experimental replicate 1.

Kinase Motif Analysis of Upregulated Phosphorylation Sites in TIMP-1 High Expressing Cells

In order to visualize the kinase motifs over-represented in the up-regulated phosphorylation sites compared to the unregulated phosphorylation sites, the sequence windows aligned around all class I up-regulated phosphorylation sites with median normalized ratios equal to or higher than the statistical cutoff of 3.0 were compared to those of unregulated class I sites. The comparison demonstrated a bias against arginine in several minus and plus subsites (especially −1 to −4) and against proline in the +1 subsite (Figure 6A). This indicates an under-representation of the
a _VVEAVNS(ph)DS(ph)DS(ph)EFGIPKK_ _SEDDS(ph)AKFDS(ph)NEEDSASVFSPSFGLK_f _VKAS(ph)PITNDGEDEFVPSDGLDKDEYTFSPGK_ _VKAS(ph)PITNDGEDEFVPS(ph)DGLDK_ TOP 1 TOP 2A TOP 2B P-sited 2 10 112 1328 1336, 1340, 1342, 1344 1400 1400, 1413 1461, 1466 1522, 1524, 1526 sequence _(ac)S(ph)GDHLHNDSQIEADFR_f _(ac)SGDHLHNDS(ph)QIEADFR_ _ENGFSS(ph)PPQIKDEPEDDGYFVPPK_ _IKNENTEGS(ph)PQEDGVELEGLK_f _RNPWS(ph)DDES(ph)KS(ph)ES(ph)DLEETEPVVIPR_ protein name high TIMP-1/low TIMP-1b TIMP1_HUMAN 10 ± 1 TOP1_HUMAN 1.8 ± 0.2 TOP2A_HUMAN 8c TOP2B_HUMAN 1±2 CD44_HUMAN 2.7 ± 0.6 CD63_HUMAN 2.1 ± 0.2 CLUS_HUMAN 4±1 JUNB_HUMAN 3.0 ± 0.5 FOSL2_HUMAN 2.6 ± 0.8 topoisomerases: Marker phospho-peptides identified by quantitative phospho-proteomics UniProt name (A) Effect of TIMP-1 expression levels on those of topoisomerases and others: Median protein ratios from quantitative proteomics metalloproteinase inhibitor 1 DNA topoisomerase 1 DNA topoisomerase 2-alpha DNA topoisomerase 2-beta CD44 antigen CD63 antigen clusterin transcription factor jun-B Fos-related antigen 2 (B) Effect of TIMP-1 expression levels on the phosphorylation levels of protein name Table 2. Regulated Proteins and Phospho-Proteins with Known Biological Relation to TIMP-1a Journal of Proteome Research Article dx.doi.org/10.1021/pr400457u | J. Proteome Res. 2013, 12, 4136−4151 Journal of Proteome Research Article Figure 5. Effect of TIMP-1 overexpression on the expression and phosphorylation levels of topoisomerases. (A) A representative phosphorylated peptide from pTOP1 (left) and a phosphorylated peptide from pTOP2B (right). All peptides are in SILAC triplets. Different colors correspond to the colors in Figure 1, representing the SILAC L/M/H labels. (B) Western blot analysis of TOP1, TOP2A, and TOP2B with β-actin as a loading control. The expression for TIMP-1 is also shown. (C) Two-dimensional immunoblot analysis of TOP1 expression and PTMs patterns in low TIMP-1 A (upper panel) and high TIMP-1 B (lower panel) cell line clones. 
The IEF gels were either silver stained (left-hand panels) or immunoblotted for TOP1 (right-hand panels). Arrowheads indicate multiple forms of TOP1. Treatment of lysates with lambda protein phosphatase prior to gel analysis is shown (right-hand panel, TOP1 immunoblot + phosphatase). 4145 dx.doi.org/10.1021/pr400457u | J. Proteome Res. 2013, 12, 4136−4151 Journal of Proteome Research Article Figure 6. Effect of TIMP-1 overexpression on the global phosphorylation patterns and cell growth. (A) Phospho-peptides were aligned around the class I phosphorylation sites, thereby comparing up-regulated phospho-peptides (median normalized ratios ≥ 3.0) with reference phospho-peptides (0.8 ≤ median normalized ratios ≤ 1.2). Observed sequence bias was visualized with the IceLogo software tool. Over-represented, unbiased, and under-represented known kinase substrate motifs are also shown. (B) Growth curves for the cell lines low TIMP-1 A, low TIMP-1 B, high TIMP-1 A, and high TIMP-1 B. Cells were counted in triplicates with 24 h intervals and the best-fitted exponential lines were layered on top of the data. Three independent experiments were performed and a representative experiment is shown. Error bars represent SE. Doubling times were calculated from the curves and the differences between the low TIMP-1A/B and the high TIMP-1A/B cells were statistically significant (P = 0.0003). TIMP-1 High Expressing Cells Are More Resistant toward Topoisomerase Inhibitors but Not toward Cisplatin baseophilic kinases such as PKA and PKC, as well as the proline-directed cyclin-dependent kinases and MAP kinases,57 which are of particular interest since TIMP-1 high expressing cells showed a significantly longer doubling time (25 h) compared to the TIMP-1 low expressing cells (22 h) (P = 0.0003) (Figure 6B). 
A preference was seen for glutamic acids in the minus subsites (−4 to −1) and for serine in the distal minus subsites (−6 to −3) (Figure 6A) in TIMP-1 high cells, which indicates an over-representation of kinases such as PLK, PLK1, and CK1. There is no bias for or against kinases such as CK2 and ATM/ATR (Figure 6A). In order to combine the sequence bias information with the protein association network information, a NetworKIN analysis was used to identify candidate kinases involved in the hyperphosphorylation of topoisomerases (Table 2B). Five highscoring kinases were identified: ATM, CDK1, CK2 alpha, PKC delta and PLK1. PLK1 was the only kinase, which was expressed at a slightly higher level in the TIMP-1 high expressing cells (1.4-fold induction, Supporting Information Table 1). To test whether the increased protein levels and/or phosphorylation status of the topoisomerase enzymes in TIMP-1 high expressing cells were associated with a changed sensitivity to targeted inhibition of the topoisomerases, we performed cell viability assays of the high and low TIMP-1 expressing clones treated with different topoisomerase inhibitors. Each cell line was exposed to increasing concentrations of specific topoisomerase inhibitors to analyze the cell viability response (Figure 7A−C). The relationship between cellular TIMP-1 protein levels and sensitivity to the TOP1 inhibitor SN-38 was statistically highly significant as determined by mixed model analysis (P < 0.0001), with TIMP-1 high expressing cells being significantly less sensitive to SN-38 compared to TIMP-1 low expressing cells (Figure 7A). TIMP-1 high expressing cells were also significantly (P = 0.035) less sensitive to epirubicin (general TOP2 inhibitor) induced reduction of cell viability as compared to TIMP-1 low expressing cells (Figure 7B). This was confirmed by exposure of the cells to etoposide, another TOP2 inhibitor. 
In full agreement with the epirubicin data, TIMP-1 high expressing cells were less sensitive (P < 0.0001) to etoposide induced cell death as compared to TIMP-1 low expressing cells (Supporting Information Figure 3A). As an important extension, we applied murine fibrosarcoma cell lines to generalize our findings. These data demonstrated that SN-38 and epirubicin caused significantly more cell death in mouse fibroblast cells established from a TIMP-1 genetically knock out mouse compared to wild type mouse fibroblasts28 (Supporting Information Figure 3B). To test if high TIMP-1 also influenced sensitivity to a specific TOP2B inhibitor, we exposed the cells to the TOP2B inhibitor XK 469. We showed that the TIMP-1 high expressing cells were significantly less sensitive to this TOP2B inhibitor (P = 0.023) as compared to TIMP-1 low expressing cells (Figure 7C).

The observed differences in sensitivity to chemotherapeutic drugs could be associated with TIMP-1 mediated differences in cellular growth. Therefore, we compared the growth of the four MCF-7 sublines and found that the TIMP-1 overexpressing cells had a small but significantly longer doubling time (25 h) compared to the TIMP-1 low expressing cells (22 h) (P = 0.0003) (Figure 6B). To exclude the possibility that overexpression of TIMP-1 in MCF-7S1 cells led to a more general chemoresistant phenotype, perhaps related to the observed differences in growth rate, we tested the cell viability response to the chemotherapeutic drug cisplatin, which has a different mode of action. This drug does not target any of the topoisomerases, but cross-links DNA, thereby preventing normal cell cycle regulation, which eventually triggers apoptosis.58 No significant differences in cell viability (P = 0.13) between TIMP-1 low and high expressing cells were observed following cisplatin treatment (Figure 7D). The IC50 values are shown in Table 3.

Figure 7. Cell viability of cells treated with chemotherapy for 48 h. (A) Cells treated with the TOP1 inhibitor SN-38 (0, 0.256, 1.28, 6.4, 32, 160, 800 nM). (B) Cells treated with the TOP2 inhibitor Epirubicin (0, 0.061, 2.4, 9.8, 39, 156, 625 nM). (C) Cells treated with the TOP2B inhibitor XK 469 (0, 25, 50, 75, 100, 200, 300, 500 μM). (D) Cells treated with cisplatin (0, 3.13, 6.25, 12.50, 25, 50, 100 μM). Data is presented as percent of untreated cells, and the concentrations of drugs are indicated in the figures.

Table 3. IC50 Values of TOP Inhibitors and Cisplatin for the Four Low and High TIMP-1 Cell Clones (a)

drug | low TIMP-1 A | low TIMP-1 B | high TIMP-1 A | high TIMP-1 B
SN-38 | 70 nM | 70 nM | 230 nM | 150 nM
Epirubicin | 30 nM | 50 nM | 150 nM | 130 nM
XK 469 | 65 μM | 90 μM | 460 μM | 450 μM
Cisplatin | 57 μM | 50 μM | 52 μM | 44 μM

(a) The half maximal inhibitory concentration (IC50) is read from the dose−response curves for each cell line exposed to either SN-38, epirubicin, XK 469, or cisplatin.

■ DISCUSSION

In this study, we employed SILAC based quantitative MS to analyze global proteome and phosphoproteome differences of MCF-7 breast cancer cells genetically manipulated to express high or low levels of TIMP-1. We prioritized the investigation of proteins potentially biologically associated with our preclinical findings and our clinical observations that high levels of TIMP-1 in cancer cells significantly associate with resistance to treatment with topoisomerase inhibitors.12,22 We confirmed the previous findings that high cellular expression of TIMP-1 is associated with increased resistance to topoisomerase inhibitors, and we also observed that murine TIMP-1 wild-type fibrosarcoma cells are more resistant to topoisomerase inhibitors than their gene deficient counterparts.
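As a rough way to put numbers on the resistance pattern summarized above, the following Python sketch computes the fold change in IC50 between high and low TIMP-1 clones from the Table 3 values and sets it beside the ratio of the reported doubling times (25 h vs 22 h). Averaging the two clones per group and the names `IC50`, `DOUBLING_H`, and `fold_change` are assumptions made for illustration; this is not the mixed model analysis used in the study.

```python
from statistics import mean

# IC50 values transcribed from Table 3.
# SN-38 and epirubicin are in nM; XK 469 and cisplatin are in µM.
# Units cancel in the high/low ratio, so they are not converted here.
IC50 = {
    "SN-38":      {"low": [70, 70], "high": [230, 150]},
    "Epirubicin": {"low": [30, 50], "high": [150, 130]},
    "XK 469":     {"low": [65, 90], "high": [460, 450]},
    "Cisplatin":  {"low": [57, 50], "high": [52, 44]},
}

# Doubling times reported in the text (hours).
DOUBLING_H = {"low": 22.0, "high": 25.0}


def fold_change(values):
    """Mean IC50 of the high TIMP-1 clones divided by that of the low clones.

    Averaging clones A and B is an illustrative simplification.
    """
    return mean(values["high"]) / mean(values["low"])


fold_resistance = {drug: fold_change(v) for drug, v in IC50.items()}
doubling_ratio = DOUBLING_H["high"] / DOUBLING_H["low"]

for drug, fold in fold_resistance.items():
    print(f"{drug:<10} IC50 high/low = {fold:.1f}x")
print(f"doubling time high/low = {doubling_ratio:.2f}x")
```

Under these assumptions the topoisomerase inhibitors give fold changes of roughly 2.7 (SN-38), 3.5 (epirubicin), and 5.9 (XK 469), whereas cisplatin stays near 0.9 and the doubling-time ratio is about 1.14.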
Moreover, our data showed that cells expressing high levels of TIMP-1 have increased expression and/or phosphorylation of the topoisomerases, which may explain the resistance phenotype observed in cells with high TIMP-1 levels. The proteomic and phosphoproteomic data revealed regulation of hundreds of proteins and hundreds of phosphorylation sites in cells with high TIMP-1 levels compared to those with low levels. We mapped the most upregulated proteins and phospho-proteins to functional classes using IPA and found enrichment for processes that TIMP-1 is believed to be involved in, for example, apoptosis, cell cycle, transcription factors, DNA repair, drug transport, and drug resistance/sensitivity.7−11,28,59 It is particularly interesting that all the identified topoisomerases were either hyper-phosphorylated or overexpressed since this may explain why previous studies found high TIMP-1 levels in tumor or plasma to be associated with decreased benefit from topoisomerase inhibitor containing chemotherapy,12−15 as both topoisomerase 1 and 2 activity is positively dependent on phosphorylation.60−62 It is therefore intriguing to speculate that increased expression levels and phosphorylation status of topoisomerases may cause the chemotherapy resistance phenotype. To investigate the functional relevance of the increased protein expression and/or phosphorylation of topoisomerases found in the two TIMP-1 high-expressing cell lines, we exposed all cell lines to TOP1 and TOP2 inhibitors and found significantly decreased sensitivity to both inhibitors in TIMP-1 high expressing cells. Although there is abundant evidence that high TIMP-1 levels are associated with topoisomerase inhibitor resistance, the underlying mode of action is to date not clear. TIMP-1 may bind to the cell surface and be transported into the nucleus as shown in a previous study in MCF-7 human breast cancer cells.63 As such, TIMP-1 has also been shown to bind to the cell surface proteins CD63 and CD44. 
The binding of TIMP-1 to these proteins initiates intracellular signal transduction that leads to an antiapoptotic phenotype.9,12−15,20,21,51 We found both CD63 and CD44 to be up-regulated at the expression level, which suggests a positive feedback mechanism. This opens new possibilities for developing anticancer therapeutic interventions, as disruption of the TIMP-1 complex with plasma membrane proteins could potentially reduce the antiapoptotic signaling from the complex.

TIMP-1 has been suggested to initiate many different intracellular signaling pathways, which could explain the chemotherapy resistance phenotype seen in TIMP-1 high expressing cells and tumors. To determine which kinases, and hence pathways, may be hyperactivated in TIMP-1 high expressing cells, we compared the kinase motifs of all the up-regulated phosphorylation sites against unchanged phosphorylation sites (reference) from the same data set and found a bias against proline-directed kinases. This is interesting for two reasons. First, the proline-directed kinases, Akt and ERK, play a role in TIMP-1 overexpressing cells and have been related to resistance to breast cancer treatment.7−9,11,19,59,64,65 Second, although contradictory to earlier studies, we found TIMP-1 high expressing cells to grow slightly but significantly slower than TIMP-1 low expressing cells. This could be explained by the under-representation of the proline-directed kinases that promote proliferation.66−68 Consistent with this, we have recently reported an inverse relationship between TIMP-1 protein levels and the proliferation marker Ki67 in clinical breast cancer samples.69 The motif analysis showed that the recognition motif for polo-like kinases was over-represented. Moreover, polo-like kinase 1 (PLK1) phosphorylates TOP2A,70 and NetworKIN predicted PLK1 also to be responsible for the phosphorylation of six up-regulated phosphorylation sites in TOP2B.
PLK1 also phosphorylates numerous other cell-cycle proteins, including PKMYT1 and CCNB1, both of which we found to be up-regulated in cells with high TIMP-1 levels. These phosphorylations lead to inhibition of PKMYT171 and promote nuclear import of CCNB,72 thereby promoting progression through M-phase. This is consistent with the slower growth of TIMP-1 high expressing cells and the observed increase in expression and phosphorylation of the topoisomerases.

Our data set revealed increased expression of hundreds of proteins in the TIMP-1 high expressing cells compared to the TIMP-1 low expressing cells, and it is possible that the mere regulation of protein expression plays a role. As such, we observed an up-regulation of several transcription factors in TIMP-1 high expressing cells, which may explain the large number of proteins being up-regulated in these cells. Two transcription factors belonging to the activator protein-1 (AP-1) complex family, namely, JunB and FosL2, were found to be up-regulated in TIMP-1 high expressing cells and are present in the STRING TIMP-1 interaction network. A previous study in a 293 AP-1 reporter cell line showed that exposure to recombinant TIMP-1 resulted in elevated levels of AP-1 activity, suggesting that TIMP-1 can activate this transcription factor complex either directly or indirectly.19 While the PI3K/Akt/NF-κB signaling pathway has also been proposed as a candidate in another TIMP-1 high related TOP2 inhibitor resistant model,11 we did not observe differential expression of NF-κB in TIMP-1 low and high expressing cells. This does not exclude that the protein could possess different activity in different cell lines.
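The motif comparison discussed above (up-regulated sites with median normalized ratio ≥ 3.0 against reference sites with 0.8 ≤ ratio ≤ 1.2) can be illustrated with a small Python sketch that builds sequence windows around each phosphosite and compares per-position residue frequencies. The toy sequences, the ±6 window, and all function names are hypothetical, and the simple frequency difference only stands in for the statistical testing performed by iceLogo; it is not the authors' pipeline.

```python
from collections import Counter

UP_CUTOFF = 3.0          # up-regulated sites: median normalized ratio >= 3.0
REF_RANGE = (0.8, 1.2)   # reference sites: 0.8 <= ratio <= 1.2
FLANK = 6                # residues kept on each side (assumed window size)


def window(seq, site, flank=FLANK):
    """Return residues site-flank..site+flank (site is 1-based), padded with '-'."""
    idx = site - 1
    left = seq[max(0, idx - flank):idx].rjust(flank, "-")
    right = seq[idx + 1:idx + 1 + flank].ljust(flank, "-")
    return left + seq[idx] + right


def split_sites(sites):
    """Partition sites into up-regulated and reference sets by SILAC ratio."""
    up = [s for s in sites if s["ratio"] >= UP_CUTOFF]
    ref = [s for s in sites if REF_RANGE[0] <= s["ratio"] <= REF_RANGE[1]]
    return up, ref


def residue_fraction(windows, pos, residue, flank=FLANK):
    """Fraction of windows carrying `residue` at relative position `pos`."""
    column = [w[pos + flank] for w in windows if w[pos + flank] != "-"]
    return Counter(column)[residue] / len(column) if column else 0.0


def bias(up_windows, ref_windows, pos, residue):
    """Positive values mean over-representation in the up-regulated set."""
    return (residue_fraction(up_windows, pos, residue)
            - residue_fraction(ref_windows, pos, residue))


# Hypothetical toy data: each entry has a sequence, a 1-based site, and a ratio.
sites = [
    {"seq": "AAEEDDSDEEAAA", "site": 7, "ratio": 4.2},
    {"seq": "GGEEEESAEDGGG", "site": 7, "ratio": 3.0},
    {"seq": "KKRRAASPKKAAA", "site": 7, "ratio": 1.0},
    {"seq": "AARKLRSPGGAAA", "site": 7, "ratio": 0.9},
    {"seq": "AAAAAASAAAAAA", "site": 7, "ratio": 2.0},  # falls in neither set
]

up, ref = split_sites(sites)
up_windows = [window(s["seq"], s["site"]) for s in up]
ref_windows = [window(s["seq"], s["site"]) for s in ref]

print("Pro at +1:", bias(up_windows, ref_windows, +1, "P"))
print("Glu at -3:", bias(up_windows, ref_windows, -3, "E"))
```

In this toy set, proline at +1 is depleted and glutamate at −3 is enriched among the up-regulated windows, which mirrors the direction of the biases described in the text without reproducing the real data.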
■ CONCLUSIONS

This study is the first global, unbiased, and quantitative proteomic investigation of low and high TIMP-1 expressing breast cancer cells, and it shows for the first time that overexpression of TIMP-1 results in up-regulation and hyperphosphorylation of a number of proteins that are either directly or indirectly associated with drug resistance mechanisms. Of particular interest is the observed association between high TIMP-1 protein expression and resistance to topoisomerase inhibitors, which is likely due to the observed up-regulation and/or hyper-phosphorylation of the three major DNA topoisomerases, TOP1, TOP2A, and TOP2B. However, the exact relationship between topoisomerase phosphorylation and sensitivity to topoisomerase inhibitors remains to be elucidated. In particular, it should be tested whether phosphorylated topoisomerases are likely candidates as biomarkers for topoisomerase inhibitor resistance in TIMP-1 high expressing tumors. Importantly, our data from the experimental model system recapitulate fundamental aspects of increased resistance to topoisomerase inhibitors observed in vivo for TIMP-1 overexpressing cells.

■ ASSOCIATED CONTENT

Supporting Information
Additional experimental details as described in the text. This material is available free of charge via the Internet at http://pubs.acs.org.

Accession Codes
All the MS raw data associated with this manuscript can be found at http://cpr1.sund.ku.dk/datasets/proteomics. The name of the zipfile containing all the raw files is “TIMP1_in_relation_to_drug_resistance_in_breast_cancer_cells”. The password is pTOP_2b.

■ AUTHOR INFORMATION

Corresponding Author
*(J.V.O.) E-mail: Jesper.Olsen@cpr.ku.dk. Telephone: +45 35 32 50 22. Fax: +45 35 32 50 01. (J.S.) E-mail: Stenvang@sund.ku.dk. Telephone: +45 35 33 37 53. Fax: +45 35 33 27 55.

Author Contributions
# O.H., S.M., L.F., N.B., J.V.O., and J.S.: Shared authorship.

Notes
The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS

The authors would like to thank Mr. Anatoliy Dmytriyev for uploading the raw data, and Dr. Christian D. Kelstrup and Dr. Sebastian A. Wagner for helpful discussions. Ms. Vibeke Jensen is acknowledged for technical assistance on cell culture and TIMP-1 analysis. We thank the Danish Natural Research Foundation, The Danish Strategic Research Council (TIPCAT), The Medical Research Council, The Danish Cancer Society, The Danish Center for Translational Breast Cancer Research, and A Race Against Breast Cancer for financial support. Work at the Center for Protein Research is supported by a generous donation from the Novo Nordisk Foundation. Part of this work has been funded by PRIME-XS, a seventh Framework Programme of the European Union (Contract No. 262067-PRIME-XS). C.F. is supported by Marie Curie and EMBO postdoctoral fellowships.

■ ABBREVIATIONS

TIMP, tissue inhibitor of metalloproteinase; SILAC, stable isotope labeling by amino acids in cell culture; TOP, topoisomerase; FAK, focal adhesion kinase; ERK, extracellular signal-regulated kinase; PTM, post-translational modifications; TWT-III, TIMP-1 wild type murine fibrosarcoma cell lines; TKO-III, TIMP-1 knock out murine fibrosarcoma cell lines; CAA, chloroacetamide; 2,5-DHB, 2,5-dihydroxybenzoic acid; HCD, higher energy collisional dissociation; IPI, International Protein Index; LDH, lactate dehydrogenase; 2D PAGE, two-dimensional PAGE; lambda PP, lambda protein phosphatase; hsp74, Heat shock 70 kDa protein 4; CLU, clusterin; mTOR, mammalian target of rapamycin; AP-1, activator protein-1; PLK1, polo-like kinase 1

■ REFERENCES

(1) Early Breast Cancer Trialists’ Collaborative Group (EBCTCG). Effects of chemotherapy and hormonal therapy for early breast cancer on recurrence and 15-year survival: an overview of the randomised trials. Lancet 2005, 365 (9472), 1687−1717. (2) Peto, R.; Davies, C.; Godwin, J.; Gray, R.; Pan, H. C.; Clarke, M.; Cutter, D.; Darby, S.; McGale, P.; Taylor, C.; Wang, Y. C.; Bergh, J.; Di, L. A.; Albain, K.; Swain, S.; Piccart, M.; Pritchard, K. Comparisons between different polychemotherapy regimens for early breast cancer: meta-analyses of long-term outcome among 100,000 women in 123 randomised trials. Lancet 2012, 379 (9814), 432−444. (3) Egeblad, M.; Werb, Z. New functions for the matrix metalloproteinases in cancer progression. Nat. Rev. Cancer 2002, 2 (3), 161−174. (4) Stetler-Stevenson, W. G. Tissue inhibitors of metalloproteinases in cell signaling: metalloproteinase-independent biological activities. Sci. Signaling 2008, 1 (27), re6. (5) Wurtz, S. O.; Schrohl, A. S.; Sorensen, N. M.; Lademann, U.; Christensen, I. J.; Mouridsen, H.; Brunner, N. Tissue inhibitor of metalloproteinases-1 in breast cancer. Endocr.-Relat. Cancer 2005, 12 (2), 215−227. (6) Airola, K.; Karonen, T.; Vaalamo, M.; Lehti, K.; Lohi, J.; Kariniemi, A. L.; Keski-Oja, J.; Saarialho-Kere, U. K. Expression of collagenases-1 and −3 and their inhibitors TIMP-1 and −3 correlates with the level of invasion in malignant melanomas. Br. J. Cancer 1999, 80 (5−6), 733−743. (7) Liu, X. W.; Bernardo, M. M.; Fridman, R.; Kim, H. R. Tissue inhibitor of metalloproteinase-1 protects human breast epithelial cells against intrinsic apoptotic cell death via the focal adhesion kinase/phosphatidylinositol 3-kinase and MAPK signaling pathway. J. Biol. Chem. 2003, 278 (41), 40364−40372. (8) Liu, X. W.; Taube, M. E.; Jung, K. K.; Dong, Z.; Lee, Y. J.; Roshy, S.; Sloane, B. F.; Fridman, R.; Kim, H. R. Tissue inhibitor of metalloproteinase-1 protects human breast epithelial cells from extrinsic cell death: a potential oncogenic activity of tissue inhibitor of metalloproteinase-1. Cancer Res. 2005, 65 (3), 898−906. (9) Jung, K. K.; Liu, X. W.; Chirco, R.; Fridman, R.; Kim, H. R. Identification of CD63 as a tissue inhibitor of metalloproteinase-1 interacting cell surface protein. EMBO J. 2006, 25 (17), 3934−3942. (10) Wang, T.; Lv, J. H.; Zhang, X. F.; Li, C. J.; Han, X.; Sun, Y. J. Tissue inhibitor of metalloproteinase-1 protects MCF-7 breast cancer cells from paclitaxel-induced apoptosis by decreasing the stability of cyclin B1. Int. J. Cancer 2010, 126 (2), 362−370. (11) Fu, Z. Y.; Lv, J. H.; Ma, C. Y.; Yang, D. P.; Wang, T. Tissue inhibitor of metalloproteinase-1 decreased chemosensitivity of MDA-435 breast cancer cells to chemotherapeutic drugs through the PI3K/AKT/NF-κB pathway. Biomed. Pharmacother. 2011, 65 (3), 163−167. (12) Willemoe, G. L.; Hertel, P. B.; Bartels, A.; Jensen, M. B.; Balslev, E.; Rasmussen, B. B.; Mouridsen, H.; Ejlertsen, B.; Brunner, N. Lack of TIMP-1 tumour cell immunoreactivity predicts effect of adjuvant anthracycline-based chemotherapy in patients (n = 647) with primary breast cancer. A Danish Breast Cancer Cooperative Group Study. Eur. J. Cancer 2009, 45 (14), 2528−2536. (13) Ejlertsen, B.; Jensen, M. B.; Nielsen, K. V.; Balslev, E.; Rasmussen, B. B.; Willemoe, G. L.; Hertel, P. B.; Knoop, A. S.; Mouridsen, H. T.; Brunner, N. HER2, TOP2A, and TIMP-1 and responsiveness to adjuvant anthracycline-containing chemotherapy in high-risk breast cancer patients. J. Clin. Oncol. 2010, 28 (6), 984−990. (14) Sorensen, N. M.; Bystrom, P.; Christensen, I. J.; Berglund, A.; Nielsen, H. J.; Brunner, N.; Glimelius, B. TIMP-1 is significantly associated with objective response and survival in metastatic colorectal cancer patients receiving combination of irinotecan, 5-fluorouracil, and folinic acid. Clin. Cancer Res. 2007, 13 (14), 4117−4122. (15) Frederiksen, C.; Qvortrup, C.; Christensen, I. J.; Glimelius, B.; Berglund, A.; Jensen, B. V.; Nielsen, S. E.; Keldsen, N.; Nielsen, H. J.; Brunner, N.; Pfeiffer, P. Plasma TIMP-1 levels and treatment outcome in patients treated with XELOX for metastatic colorectal cancer. Ann. Oncol. 2011, 22 (2), 369−375. (16) Schrohl, A. S.; Christensen, I. J.; Pedersen, A. N.; Jensen, V.; Mouridsen, H.; Murphy, G.; Foekens, J. A.; Brunner, N.; Holten-Andersen, M. N. Tumor tissue concentrations of the proteinase inhibitors tissue inhibitor of metalloproteinases-1 (TIMP-1) and plasminogen activator inhibitor type 1 (PAI-1) are complementary in determining prognosis in primary breast cancer. Mol. Cell. Proteomics 2003, 2 (3), 164−172. (17) Wurtz, S. O.; Christensen, I. J.; Schrohl, A. S.; Mouridsen, H.; Lademann, U.; Jensen, V.; Brunner, N. Measurement of the uncomplexed fraction of tissue inhibitor of metalloproteinases-1 in the prognostic evaluation of primary breast cancer patients. Mol. Cell. Proteomics 2005, 4 (4), 483−491. (18) Birgisson, H.; Nielsen, H. J.; Christensen, I. J.; Glimelius, B.; Brunner, N. Preoperative plasma TIMP-1 is an independent prognostic indicator in patients with primary colorectal cancer: a prospective validation study. Eur. J. Cancer 2010, 46 (18), 3323−3331. (19) Bigelow, R. L.; Williams, B. J.; Carroll, J. L.; Daves, L. K.; Cardelli, J. A.
TIMP-1 overexpression promotes tumorigenesis of MDA-MB-231 breast cancer cells and alters expression of a subset of cancer promoting genes in vivo distinct from those observed in vitro. Breast Cancer Res. Treat. 2009, 117 (1), 31−44. (20) Stilley, J. A.; Sharpe-Timms, K. L. TIMP1 contributes to ovarian anomalies in both an MMP-dependent and -independent manner in a rat model. Biol. Reprod. 2012, 86 (2), 47. (21) Egea, V.; Zahler, S.; Rieth, N.; Neth, P.; Popp, T.; Kehe, K.; Jochum, M.; Ries, C. Tissue inhibitor of metalloproteinase-1 (TIMP1) regulates mesenchymal stem cells through let-7f microRNA and Wnt/beta-catenin signaling. Proc. Natl. Acad. Sci. U.S.A. 2012, 109 (6), E309−E316. (22) Schrohl, A. S.; Meijer-van Gelder, M. E.; Holten-Andersen, M. N.; Christensen, I. J.; Look, M. P.; Mouridsen, H. T.; Brunner, N.; Foekens, J. A. Primary tumor levels of tissue inhibitor of metalloproteinases-1 are predictive of resistance to chemotherapy in patients with metastatic breast cancer. Clin. Cancer Res. 2006, 12 (23), 7054− 7058. (23) Sorensen, N. M.; Schrohl, A. S.; Jensen, V.; Christensen, I. J.; Nielsen, H. J.; Brunner, N. Comparative studies of tissue inhibitor of metalloproteinases-1 in plasma, serum and tumour tissue extracts from patients with primary colorectal cancer. Scand. J. Gastroenterol. 2008, 43 (2), 186−191. (24) Cox, J.; Mann, M. Quantitative, high-resolution proteomics for data-driven systems biology. Annu. Rev. Biochem. 2011, 80, 273−299. (25) Ong, S. E.; Blagoev, B.; Kratchmarova, I.; Kristensen, D. B.; Steen, H.; Pandey, A.; Mann, M. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 2002, 1 (5), 376−386. (26) Jaattela, M.; Benedict, M.; Tewari, M.; Shayman, J. A.; Dixit, V. M. Bcl-x and Bcl-2 inhibit TNF and Fas-induced apoptosis and activation of phospholipase A2 in breast carcinoma cells. Oncogene 1995, 10 (12), 2297−2305. 
(27) Cox, J.; Matic, I.; Hilger, M.; Nagaraj, N.; Selbach, M.; Olsen, J. V.; Mann, M. A practical guide to the MaxQuant computational platform for SILAC-based quantitative proteomics. Nat. Protoc. 2009, 4 (5), 698−705. (28) Davidsen, M. L.; Wurtz, S. O.; Romer, M. U.; Sorensen, N. M.; Johansen, S. K.; Christensen, I. J.; Larsen, J. K.; Offenberg, H.; Brunner, N.; Lademann, U. TIMP-1 gene deficiency increases tumour cell sensitivity to chemotherapy-induced apoptosis. Br. J. Cancer 2006, 95 (8), 1114−1120. (29) Macek, B.; Mann, M.; Olsen, J. V. Global and site-specific quantitative phosphoproteomics: principles and applications. Annu. Rev. Pharmacol. Toxicol. 2009, 49, 199−221. (30) Olsen, J. V.; Macek, B. High accuracy mass spectrometry in large-scale analysis of protein phosphorylation. Methods Mol. Biol. 2009, 492, 131−142. (31) Olsen, J. V.; Schwartz, J. C.; Griep-Raming, J.; Nielsen, M. L.; Damoc, E.; Denisov, E.; Lange, O.; Remes, P.; Taylor, D.; Splendore, M.; Wouters, E. R.; Senko, M.; Makarov, A.; Mann, M.; Horning, S. A dual pressure linear ion trap Orbitrap instrument with very high sequencing speed. Mol. Cell. Proteomics 2009, 8 (12), 2759−2769. (32) Olsen, J. V.; de Godoy, L. M.; Li, G.; Macek, B.; Mortensen, P.; Pesch, R.; Makarov, A.; Lange, O.; Horning, S.; Mann, M. Parts per million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol. Cell. Proteomics 2005, 4 (12), 2010−2021. (33) Olsen, J. V.; Macek, B.; Lange, O.; Makarov, A.; Horning, S.; Mann, M. Higher-energy C-trap dissociation for peptide modification analysis. Nat. Methods 2007, 4 (9), 709−712. (34) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26 (12), 1367−1372. (35) Olsen, J. V.; Blagoev, B.; Gnad, F.; Macek, B.; Kumar, C.; Mortensen, P.; Mann, M. 
Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 2006, 127 (3), 635−648. (36) Szklarczyk, D.; Franceschini, A.; Kuhn, M.; Simonovic, M.; Roth, A.; Minguez, P.; Doerks, T.; Stark, M.; Muller, J.; Bork, P.; Jensen, L. J.; von, M. C. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011, 39 (Database issue), D561−D568. (37) von, M. C.; Jensen, L. J.; Snel, B.; Hooper, S. D.; Krupp, M.; Foglierini, M.; Jouffre, N.; Huynen, M. A.; Bork, P. STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005, 33 (Database issue), D433− D437. (38) Taboureau, O.; Nielsen, S. K.; Audouze, K.; Weinhold, N.; Edsgard, D.; Roque, F. S.; Kouskoumvekaki, I.; Bora, A.; Curpan, R.; Jensen, T. S.; Brunak, S.; Oprea, T. I. ChemProt: a disease chemical biology database. Nucleic Acids Res. 2011, 39 (Database issue), D367− D372. (39) Knox, C.; Law, V.; Jewison, T.; Liu, P.; Ly, S.; Frolkis, A.; Pon, A.; Banco, K.; Mak, C.; Neveu, V.; Djoumbou, Y.; Eisner, R.; Guo, A. C.; Wishart, D. S. DrugBank 3.0: a comprehensive resource for ’omics’ research on drugs. Nucleic Acids Res. 2011, 39 (Database issue), D1035−D1041. (40) Smoot, M. E.; Ono, K.; Ruscheinski, J.; Wang, P. L.; Ideker, T. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2011, 27 (3), 431−432. (41) Warsow, G.; Greber, B.; Falk, S. S.; Harder, C.; Siatkowski, M.; Schordan, S.; Som, A.; Endlich, N.; Scholer, H.; Repsilber, D.; Endlich, K.; Fuellen, G. ExprEssence–revealing the essence of differential experimental data in the context of an interaction/regulation net-work. BMC Syst. Biol. 2010, 4, 164. (42) Colaert, N.; Helsens, K.; Martens, L.; Vandekerckhove, J.; Gevaert, K. Improved visualization of protein consensus sequences by iceLogo. Nat. Methods 2009, 6 (11), 786−787. (43) Linding, R.; Jensen, L. J.; Ostheimer, G. 
J.; van Vugt, M. A.; Jorgensen, C.; Miron, I. M.; Diella, F.; Colwill, K.; Taylor, L.; Elder, K.; Metalnikov, P.; Nguyen, V.; Pasculescu, A.; Jin, J.; Park, J. G.; Samson, L. D.; Woodgett, J. R.; Russell, R. B.; Bork, P.; Yaffe, M. B.; Pawson, T. Systematic discovery of in vivo phosphorylation networks. Cell 2007, 129 (7), 1415−1426.
(44) Miller, M. L.; Jensen, L. J.; Diella, F.; Jorgensen, C.; Tinti, M.; Li, L.; Hsiung, M.; Parker, S. A.; Bordeaux, J.; Sicheritz-Ponten, T.; Olhovsky, M.; Pasculescu, A.; Alexander, J.; Knapp, S.; Blom, N.; Bork, P.; Li, S.; Cesareni, G.; Pawson, T.; Turk, B. E.; Yaffe, M. B.; Brunak, S.; Linding, R. Linear motif atlas for phosphorylation-dependent signaling. Sci. Signaling 2008, 1 (35), ra2.
(45) Moller, S. N.; Dowell, B. L.; Stewart, K. D.; Jensen, V.; Larsen, L.; Lademann, U.; Murphy, G.; Nielsen, H. J.; Brunner, N.; Davis, G. J. Establishment and characterization of 7 new monoclonal antibodies to tissue inhibitor of metalloproteinases-1. Tumour Biol. 2005, 26 (2), 71−80.
(46) Holten-Andersen, M. N.; Murphy, G.; Nielsen, H. J.; Pedersen, A. N.; Christensen, I. J.; Hoyer-Hansen, G.; Brunner, N.; Stephens, R. W. Quantitation of TIMP-1 in plasma of healthy blood donors and patients with advanced cancer. Br. J. Cancer 1999, 80 (3−4), 495−503.
(47) Cabezon, T.; Gromova, I.; Gromov, P.; Serizawa, R.; Timmermans, W., V; Kroman, N.; Celis, J. E.; Moreira, J. M. Proteomic Profiling of Triple-negative Breast Carcinomas in Combination With a Three-tier Orthogonal Technology Approach Identifies Mage-A4 as Potential Therapeutic Target in Estrogen Receptor Negative Breast Cancer. Mol. Cell. Proteomics 2013, 12 (2), 381−394.
dx.doi.org/10.1021/pr400457u | J. Proteome Res. 2013, 12, 4136−4151
(48) Pines, A.; Kelstrup, C. D.; Vrouwe, M. G.; Puigvert, J. C.; Typas, D.; Misovic, B.; de Groot, A.; von Stechow, L.; van de Water, B.; Danen, E. H.; Vrieling, H.; Mullenders, L. H.; Olsen, J. V.
Global phosphoproteome profiling reveals unanticipated networks responsive to cisplatin treatment of embryonic stem cells. Mol. Cell. Biol. 2011, 31 (24), 4964−4977.
(49) de Godoy, L. M.; Olsen, J. V.; Cox, J.; Nielsen, M. L.; Hubner, N. C.; Frohlich, F.; Walther, T. C.; Mann, M. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 2008, 455 (7217), 1251−1254.
(50) Selbach, M.; Schwanhausser, B.; Thierfelder, N.; Fang, Z.; Khanin, R.; Rajewsky, N. Widespread changes in protein synthesis induced by microRNAs. Nature 2008, 455 (7209), 58−63.
(51) Lambert, E.; Bridoux, L.; Devy, J.; Dasse, E.; Sowa, M. L.; Duca, L.; Hornebeck, W.; Martiny, L.; Petitfrere-Charpentier, E. TIMP-1 binding to proMMP-9/CD44 complex localized at the cell surface promotes erythroid cell survival. Int. J. Biochem. Cell Biol. 2009, 41 (5), 1102−1115.
(52) Lourda, M.; Trougakos, I. P.; Gonos, E. S. Development of resistance to chemotherapeutic drugs in human osteosarcoma cell lines largely depends on up-regulation of Clusterin/Apolipoprotein J. Int. J. Cancer 2007, 120 (3), 611−622.
(53) Mizutani, K.; Matsumoto, K.; Hasegawa, N.; Deguchi, T.; Nozawa, Y. Expression of clusterin, XIAP and survivin, and their changes by camptothecin (CPT) treatment in CPT-resistant PC-3 and CPT-sensitive LNCaP cells. Exp. Oncol. 2006, 28 (3), 209−215.
(54) Wang, Y. H.; Huang, M. L. Organogenesis and tumorigenesis: insight from the JAK/STAT pathway in the Drosophila eye. Dev. Dyn. 2010, 239 (10), 2522−2533.
(55) Bucher, N.; Britten, C. D. G2 checkpoint abrogation and checkpoint kinase-1 targeting in the treatment of cancer. Br. J. Cancer 2008, 98 (3), 523−528.
(56) Gaur, S.; Chen, L.; Yang, L.; Wu, X.; Un, F.; Yen, Y. Inhibitors of mTOR overcome drug resistance from topoisomerase II inhibitors in solid tumors. Cancer Lett. 2011, 311 (1), 20−28.
(57) Ubersax, J. A.; Ferrell, J. E., Jr. Mechanisms of specificity in protein phosphorylation. Nat. Rev. Mol. Cell Biol.
2007, 8 (7), 530−541.
(58) Alborzinia, H.; Can, S.; Holenya, P.; Scholl, C.; Lederer, E.; Kitanovic, I.; Wolfl, S. Real-time monitoring of cisplatin-induced cell death. PLoS One 2011, 6 (5), e19714.
(59) Li, G.; Fridman, R.; Kim, H. R. Tissue inhibitor of metalloproteinase-1 inhibits apoptosis of human breast epithelial cells. Cancer Res. 1999, 59 (24), 6267−6275.
(60) Ackerman, P.; Glover, C. V.; Osheroff, N. Phosphorylation of DNA topoisomerase II by casein kinase II: modulation of eukaryotic topoisomerase II activity in vitro. Proc. Natl. Acad. Sci. U.S.A. 1985, 82 (10), 3164−3168.
(61) Bandyopadhyay, K.; Gjerset, R. A. Protein kinase CK2 is a central regulator of topoisomerase I hyperphosphorylation and camptothecin sensitivity in cancer cell lines. Biochemistry 2011, 50 (5), 704−714.
(62) Hackbarth, J. S.; Galvez-Peralta, M.; Dai, N. T.; Loegering, D. A.; Peterson, K. L.; Meng, X. W.; Karnitz, L. M.; Kaufmann, S. H. Mitotic phosphorylation stimulates DNA relaxation activity of human topoisomerase I. J. Biol. Chem. 2008, 283 (24), 16711−16722.
(63) Ritter, L. M.; Garfield, S. H.; Thorgeirsson, U. P. Tissue inhibitor of metalloproteinases-1 (TIMP-1) binds to the cell surface and translocates to the nucleus of human MCF-7 breast carcinoma cells. Biochem. Biophys. Res. Commun. 1999, 257 (2), 494−499.
(64) Baselga, J. Targeting the phosphoinositide-3 (PI3) kinase pathway in breast cancer. Oncologist 2011, 16 (Suppl 1), 12−19.
(65) McCubrey, J. A.; Steelman, L. S.; Chappell, W. H.; Abrams, S. L.; Wong, E. W.; Chang, F.; Lehmann, B.; Terrian, D. M.; Milella, M.; Tafuri, A.; Stivala, F.; Libra, M.; Basecke, J.; Evangelisti, C.; Martelli, A. M.; Franklin, R. A. Roles of the Raf/MEK/ERK pathway in cell growth, malignant transformation and drug resistance. Biochim. Biophys. Acta 2007, 1773 (8), 1263−1284.
(66) Hayakawa, T.; Yamashita, K.; Tanzawa, K.; Uchijima, E.; Iwata, K.
Growth-promoting activity of tissue inhibitor of metalloproteinases-1 (TIMP-1) for a wide range of cells. A possible new growth factor in serum. FEBS Lett. 1992, 298 (1), 29−32.
(67) Luparello, C.; Avanzato, G.; Carella, C.; Pucci-Minafra, I. Tissue inhibitor of metalloprotease (TIMP)-1 and proliferative behaviour of clonal breast cancer cells. Breast Cancer Res. Treat. 1999, 54 (3), 235−244.
(68) Peng, L.; Yanjiao, M.; Ai-guo, W.; Pengtao, G.; Jianhua, L.; Ju, Y.; Hongsheng, O.; Xichen, Z. A fine balance between CCNL1 and TIMP1 contributes to the development of breast cancer cells. Biochem. Biophys. Res. Commun. 2011, 409 (2), 344−349.
(69) Bjerre, C.; Knoop, A.; Bjerre, K.; Larsen, M. S.; Henriksen, K. L.; Lyng, M. B.; Ditzel, H. J.; Rasmussen, B. B.; Brunner, N.; Ejlertsen, B.; Laenkholm, A. V. Association of tissue inhibitor of metalloproteinases-1 and Ki67 in estrogen receptor positive breast cancer. Acta Oncol. 2013, 52 (1), 82−90.
(70) Li, H.; Wang, Y.; Liu, X. Plk1-dependent phosphorylation regulates functions of DNA topoisomerase IIalpha in cell cycle progression. J. Biol. Chem. 2008, 283 (10), 6209−6221.
(71) Nakajima, H.; Toyoshima-Morimoto, F.; Taniguchi, E.; Nishida, E. Identification of a consensus motif for Plk (Polo-like kinase) phosphorylation reveals Myt1 as a Plk1 substrate. J. Biol. Chem. 2003, 278 (28), 25277−25280.
(72) Toyoshima-Morimoto, F.; Taniguchi, E.; Shinya, N.; Iwamatsu, A.; Nishida, E. Polo-like kinase 1 phosphorylates cyclin B1 and targets it to the nucleus during prophase. Nature 2001, 410 (6825), 215−220.

Bibliography

[1] Crick, F. H. C. The biological replication of macromolecules. Symp. Soc. Exp. Biol XII, 138 (1958).
[2] Crick, F. Central dogma of molecular biology. Nature 227, 561–3 (1970). URL http://www.ncbi.nlm.nih.gov/pubmed/4913914.
[3] Kuska, B. Beer, bethesda, and biology: how “genomics” came into being. J Natl Cancer Inst 90, 93 (1998).
URL http://www.ncbi.nlm.nih.gov/pubmed/9450566.
[4] Watson, J. D. & Crick, F. H. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171, 737–8 (1953). URL http://www.ncbi.nlm.nih.gov/pubmed/13054692.
[5] Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary dna microarray. Science 270, 467–70 (1995). URL http://www.ncbi.nlm.nih.gov/pubmed/7569999.
[6] Consortium, C. e. S. Genome sequence of the nematode c. elegans: a platform for investigating biology. Science 282, 2012–8 (1998). URL http://www.ncbi.nlm.nih.gov/pubmed/9851916.
[7] Arabidopsis Genome, I. Analysis of the genome sequence of the flowering plant arabidopsis thaliana. Nature 408, 796–815 (2000). URL http://www.ncbi.nlm.nih.gov/pubmed/11130711.
[8] Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001). URL http://www.ncbi.nlm.nih.gov/pubmed/11237011.
[9] Sanger, F. & Coulson, A. R. A rapid method for determining sequences in dna by primed synthesis with dna polymerase. J Mol Biol 94, 441–8 (1975). URL http://www.ncbi.nlm.nih.gov/pubmed/1100841.
[10] Sanger, F., Nicklen, S. & Coulson, A. R. Dna sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74, 5463–7 (1977). URL http://www.ncbi.nlm.nih.gov/pubmed/271968.
[11] Ware, J. S., Roberts, A. M. & Cook, S. A. Next generation sequencing for clinical diagnostics and personalised medicine: implications for the next generation cardiologist. Heart 98, 276–81 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22128206.
[12] Korlach, J. et al. Long, processive enzymatic dna synthesis using 100% dye-labeled terminal phosphate-linked nucleotides. Nucleosides Nucleotides Nucleic Acids 27, 1072–83 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/18711669.
[13] Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors.
Nature 437, 376–80 (2005). URL http://www.ncbi.nlm.nih.gov/pubmed/16056220.
[14] Valouev, A. et al. A high-resolution, nucleosome position map of c. elegans reveals a lack of universal sequence-dictated positioning. Genome Res 18, 1051–63 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/18477713.
[15] Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome defined by rna sequencing. Science 320, 1344–9 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/18451266.
[16] Down, T. A. et al. A bayesian deconvolution strategy for immunoprecipitation-based dna methylome analysis. Nat Biotechnol 26, 779–85 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/18612301.
[17] Smith, Z. D., Gu, H., Bock, C., Gnirke, A. & Meissner, A. High-throughput bisulfite sequencing in mammalian genomes. Methods 48, 226–32 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19442738.
[18] Ren, B. et al. Genome-wide location and function of dna binding proteins. Science 290, 2306–9 (2000). URL http://www.ncbi.nlm.nih.gov/pubmed/11125145.
[19] Song, L. & Crawford, G. E. Dnase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb Protoc 2010, pdb prot5384 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20150147.
[20] Waki, H. et al. Global mapping of cell type-specific open chromatin by faire-seq reveals the regulatory role of the nfi family in adipocyte differentiation. PLoS Genet 7, e1002311 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/22028663.
[21] Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using phred. i. accuracy assessment. Genome Res 8, 175–85 (1998). URL http://www.ncbi.nlm.nih.gov/pubmed/9521921.
[22] Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol 10, R25 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19261174.
[23] Li, H. & Durbin, R. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics 25, 1754–60 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19451168.
[24] Li, H. et al. The sequence alignment/map format and samtools. Bioinformatics 25, 2078–9 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19505943.
[25] Nielsen, R., Paul, J. S., Albrechtsen, A. & Song, Y. S. Genotype and snp calling from next-generation sequencing data. Nat Rev Genet 12, 443–51 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21587300.
[26] McKenna, A. et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation dna sequencing data. Genome Res 20, 1297–303 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20644199.
[27] Danecek, P. et al. The variant call format and vcftools. Bioinformatics 27, 2156–8 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21653522.
[28] van den Oord, E. J. Controlling false discoveries in genetic studies. Am J Med Genet B Neuropsychiatr Genet 147B, 637–44 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/18092307.
[29] Purcell, S. et al. Plink: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81, 559–75 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17701901.
[30] van der Sluis, S., Posthuma, D. & Dolan, C. V. Tates: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet 9, e1003235 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23359524.
[31] Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39, 906–13 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17572673.
[32] Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78, 629–44 (2006).
URL http://www.ncbi.nlm.nih.gov/pubmed/16532393.
[33] Stephens, M. & Donnelly, P. A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73, 1162–9 (2003). URL http://www.ncbi.nlm.nih.gov/pubmed/14574645.
[34] Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5, e1000529 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19543373.
[35] Browning, B. L. & Browning, S. R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 84, 210–23 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19200528.
[36] Uh, H. W. et al. How to deal with the early gwas data when imputing and combining different arrays is necessary. Eur J Hum Genet 20, 572–6 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22189269.
[37] Mullis, K. B. & Faloona, F. A. Specific synthesis of dna in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol 155, 335–50 (1987). URL http://www.ncbi.nlm.nih.gov/pubmed/3431465.
[38] Okou, D. T. et al. Microarray-based genomic selection for high-throughput resequencing. Nat Methods 4, 907–9 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17934469.
[39] Albert, T. J. et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods 4, 903–5 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17934467.
[40] Hodges, E. et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet 39, 1522–7 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17982454.
[41] Bau, S. et al. Targeted next-generation sequencing by specific capture of multiple genomic loci using low-volume microfluidic dna arrays. Anal Bioanal Chem 393, 171–5 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/18958448.
[42] Meyer, M., Stenzel, U., Myles, S., Prufer, K. & Hofreiter, M.
Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res 35, e97 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17670798.
[43] Jordan, B. Historical background and anticipated developments. Ann N Y Acad Sci 975, 24–32 (2002). URL http://www.ncbi.nlm.nih.gov/pubmed/12538151.
[44] Stoughton, R. B. Applications of dna microarrays in biology. Annu Rev Biochem 74, 53–82 (2005). URL http://www.ncbi.nlm.nih.gov/pubmed/15952881.
[45] Dufva, M. Fabrication of dna microarray. Methods Mol Biol 529, 63–79 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19381969.
[46] Hardiman, G. Microarray platforms–comparisons and contrasts. Pharmacogenomics 5, 487–502 (2004). URL http://www.ncbi.nlm.nih.gov/pubmed/15212585.
[47] Tanaka, A. et al. All-in-one tube method for quantitative gene expression analysis in oligo-dt(30) immobilized pcr tube coated with mpc polymer. Anal Sci 25, 109–14 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19139583.
[48] Smyth, G. K. Limma: linear models for microarray data, 397–420 (Springer, 2005).
[49] Boguski, M. S., Tolstoshev, C. M. & Bassett, J., D. E. Gene discovery in dbest. Science 265, 1993–4 (1994). URL http://www.ncbi.nlm.nih.gov/pubmed/8091218.
[50] Velculescu, V. E., Zhang, L., Vogelstein, B. & Kinzler, K. W. Serial analysis of gene expression. Science 270, 484–7 (1995). URL http://www.ncbi.nlm.nih.gov/pubmed/7570003.
[51] Kodzius, R. et al. Cage: cap analysis of gene expression. Nat Methods 3, 211–22 (2006). URL http://www.ncbi.nlm.nih.gov/pubmed/16489339.
[52] Brenner, S. et al. Gene expression analysis by massively parallel signature sequencing (mpss) on microbead arrays. Nat Biotechnol 18, 630–4 (2000). URL http://www.ncbi.nlm.nih.gov/pubmed/10835600.
[53] Siddiqui, A. S. et al. A mouse atlas of gene expression: large-scale digital gene-expression profiles from precisely defined developing c57bl/6j mouse tissues and cells.
Proc Natl Acad Sci U S A 102, 18485–90 (2005). URL http://www.ncbi.nlm.nih.gov/pubmed/16352711.
[54] Hegedus, Z. et al. Deep sequencing of the zebrafish transcriptome response to mycobacterium infection. Mol Immunol 46, 2918–30 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19631987.
[55] t Hoen, P. A. et al. Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Res 36, e141 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/18927111.
[56] Morrissy, A. S. et al. Next-generation tag sequencing for cancer gene expression profiling. Genome Res 19, 1825–35 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19541910.
[57] Kircher, M., Heyn, P. & Kelso, J. Addressing challenges in the production and analysis of illumina sequencing data. BMC Genomics 12, 382 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21801405.
[58] Morrissy, S. et al. Digital gene expression by tag sequencing on the illumina genome analyzer. Curr Protoc Hum Genet Chapter 11, Unit 11 11 1–36 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20373513.
[59] Anders, S. Htseq: Analysing high-throughput sequencing data with python.
[60] Roberts, A., Pimentel, H., Trapnell, C. & Pachter, L. Identification of novel transcripts in annotated genomes using rna-seq. Bioinformatics 27, 2325–9 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21697122.
[61] Team, R. C. A language and environment for statistical computing (2013). URL http://www.R-project.org.
[62] Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol 11, R106 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20979621.
[63] Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–40 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/19910308.
[64] Malone, J. H. & Oliver, B. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol 9, 34 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21627854.
[65] Matouk, C. C. & Marsden, P. A. Epigenetic regulation of vascular endothelial gene expression. Circ Res 102, 873–87 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/18436802.
[66] Friso, S. et al. Global dna hypomethylation in peripheral blood mononuclear cells as a biomarker of cancer risk. Cancer Epidemiol Biomarkers Prev 22, 348–55 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23300023.
[67] Shi, H. et al. Expressed cpg island sequence tag microarray for dual screening of dna hypermethylation and gene silencing in cancer cells. Cancer Res 62, 3214–20 (2002). URL http://www.ncbi.nlm.nih.gov/pubmed/12036936.
[68] Wolff, G. L., Kodell, R. L., Moore, S. R. & Cooney, C. A. Maternal epigenetics and methyl supplements affect agouti gene expression in avy/a mice. Faseb Journal 12, 949–57 (1998). URL http://www.ncbi.nlm.nih.gov/pubmed/9707167.
[69] Egger, G., Liang, G., Aparicio, A. & Jones, P. A. Epigenetics in human disease and prospects for epigenetic therapy. Nature 429, 457–63 (2004). URL http://www.ncbi.nlm.nih.gov/pubmed/15164071.
[70] Razin, A. & Riggs, A. D. Dna methylation and gene function. Science 210, 604–10 (1980). URL http://www.ncbi.nlm.nih.gov/pubmed/6254144.
[71] Jaenisch, R. Dna methylation and imprinting: why bother? Trends Genet 13, 323–9 (1997). URL http://www.ncbi.nlm.nih.gov/pubmed/9260519.
[72] Bestor, T. H. The dna methyltransferases of mammals. Hum Mol Genet 9, 2395–402 (2000). URL http://www.ncbi.nlm.nih.gov/pubmed/11005794.
[73] Bibikova, M. et al. Genome-wide dna methylation profiling using infinium(r) assay. Epigenomics 1, 177–200 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/22122642.
[74] Brinkman, A. B. et al. Whole-genome dna methylation profiling using methylcap-seq. Methods 52, 232–6 (2010).
URL http://www.ncbi.nlm.nih.gov/pubmed/20542119.
[75] Stevens, M. et al. Estimating absolute methylation levels at single-cpg resolution from methylation enrichment and restriction enzyme sequencing methods. Genome Res 23, 1541–53 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23804401.
[76] Gu, H. et al. Genome-scale dna methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 7, 133–6 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20062050.
[77] Bock, C. et al. Quantitative comparison of genome-wide dna methylation mapping technologies. Nat Biotechnol 28, 1106–14 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20852634.
[78] Li, C. C. et al. A sustained dietary change increases epigenetic variation in isogenic mice. PLoS Genet 7, e1001380 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21541011.
[79] Milagro, F. I. et al. A dual epigenomic approach for the search of obesity biomarkers: Dna methylation in relation to diet-induced weight loss. Faseb Journal 25, 1378–89 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21209057.
[80] Dabelea, D. & Crume, T. Maternal environment and the transgenerational cycle of obesity and diabetes. Diabetes 60, 1849–55 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21709280.
[81] Boks, M. P. et al. Current status and future prospects for epigenetic psychopharmacology. Epigenetics 7, 20–8 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22207355.
[82] Ong, S. E., Foster, L. J. & Mann, M. Mass spectrometric-based approaches in quantitative proteomics. Methods 29, 124–30 (2003). URL http://www.ncbi.nlm.nih.gov/pubmed/12606218.
[83] Anderson, L. & Seilhamer, J. A comparison of selected mrna and protein abundances in human liver. Electrophoresis 18, 533–7 (1997). URL http://www.ncbi.nlm.nih.gov/pubmed/9150937.
[84] Cohen, P. The role of protein phosphorylation in human health and disease. the sir hans krebs medal lecture. Eur J Biochem 268, 5001–10 (2001).
URL http://www.ncbi.nlm.nih.gov/pubmed/11589691.
[85] Manning, G., Whyte, D. B., Martinez, R., Hunter, T. & Sudarsanam, S. The protein kinase complement of the human genome. Science 298, 1912–34 (2002). URL http://www.ncbi.nlm.nih.gov/pubmed/12471243.
[86] Cohen, P. T. Protein phosphatase 1–targeted in many directions. J Cell Sci 115, 241–56 (2002). URL http://www.ncbi.nlm.nih.gov/pubmed/11839776.
[87] Ong, S. E. et al. Stable isotope labeling by amino acids in cell culture, silac, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 1, 376–86 (2002). URL http://www.ncbi.nlm.nih.gov/pubmed/12118079.
[88] Gruhler, A. et al. Quantitative phosphoproteomics applied to the yeast pheromone signaling pathway. Mol Cell Proteomics 4, 310–27 (2005). URL http://www.ncbi.nlm.nih.gov/pubmed/15665377 http://www.mcponline.org/content/4/3/310.full.pdf.
[89] Yang, J. & Honavar, V. Feature subset selection using a genetic algorithm, 117–136 (Springer, 1998).
[90] McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 115–133 (1943).
[91] Murphy, K. P. Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series) (2012).
[92] Sun, Z., Rao, X., Peng, L. & Xu, D. Prediction of protein supersecondary structures based on the artificial neural network method. Protein Eng 10, 763–9 (1997). URL http://www.ncbi.nlm.nih.gov/pubmed/9342142.
[93] Blom, N., Sicheritz-Ponten, T., Gupta, R., Gammeltoft, S. & Brunak, S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 4, 1633–49 (2004). URL http://www.ncbi.nlm.nih.gov/pubmed/15174133.
[94] Saha, S. & Raghava, G. P. Prediction of continuous b-cell epitopes in an antigen using recurrent neural network. Proteins 65, 40–8 (2006). URL http://www.ncbi.nlm.nih.gov/pubmed/16894596.
[95] Eftekhar, B., Mohammad, K., Ardebili, H. E., Ghodsi, M. & Ketabchi, E. Comparison of artificial neural network and logistic regression models for prediction of mortality in head trauma based on initial clinical data. BMC Med Inform Decis Mak 5, 3 (2005). URL http://www.ncbi.nlm.nih.gov/pubmed/15713231 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC551612/pdf/1472-6947-5-3.pdf.
[96] Meisler, M. H. Evolutionarily conserved noncoding dna in the human genome: how much and what for? Genome Res 11, 1617–8 (2001). URL http://www.ncbi.nlm.nih.gov/pubmed/11591637.
[97] McLaren, W. et al. Deriving the consequences of genomic variants with the ensembl api and snp effect predictor. Bioinformatics 26, 2069–70 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20562413.
[98] Kumar, P., Henikoff, S. & Ng, P. C. Predicting the effects of coding non-synonymous variants on protein function using the sift algorithm. Nat Protoc 4, 1073–81 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19561590.
[99] Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting functional effect of human missense mutations using polyphen-2. Curr Protoc Hum Genet Chapter 7, Unit7 20 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23315928.
[100] Liu, X., Jian, X. & Boerwinkle, E. dbnsfp: a lightweight database of human nonsynonymous snps and their functional predictions. Hum Mutat 32, 894–9 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21520341.
[101] Gonzalez-Perez, A. & Lopez-Bigas, N. Improving the assessment of the outcome of nonsynonymous snvs with a consensus deleteriousness score, condel. Am J Hum Genet 88, 440–9 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21457909.
[102] Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, snpeff: Snps in the genome of drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80–92 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22728672.
[103] Pabinger, S. et al.
A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 15, 256–78 (2014). URL http://www.ncbi.nlm.nih.gov/pubmed/23341494.
[104] Sherry, S. T. et al. dbsnp: the ncbi database of genetic variation. Nucleic Acids Res 29, 308–11 (2001). URL http://www.ncbi.nlm.nih.gov/pubmed/11125122.
[105] Flicek, P. et al. Ensembl 2013. Nucleic Acids Res 41, D48–55 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23203987.
[106] Stenson, P. D. et al. The human gene mutation database (hgmd) and its exploitation in the fields of personalized genomics and molecular evolution. Curr Protoc Bioinformatics Chapter 1, Unit1 13 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22948725.
[107] Landrum, M. J. et al. Clinvar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 42, D980–5 (2014). URL http://www.ncbi.nlm.nih.gov/pubmed/24234437.
[108] Khatri, P., Sirota, M. & Butte, A. J. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 8, e1002375 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22383865.
[109] Freeling, M. & Subramaniam, S. Conserved noncoding sequences (cnss) in higher plants. Curr Opin Plant Biol 12, 126–32 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19249238.
[110] Bernstein, B. E. et al. An integrated encyclopedia of dna elements in the human genome. Nature 489, 57–74 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22955616.
[111] Cirulli, E. T. & Goldstein, D. B. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet 11, 415–25 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20479773.
[112] Ashburner, M. et al. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat Genet 25, 25–9 (2000). URL http://www.ncbi.nlm.nih.gov/pubmed/10802651.
[113] Carbon, S. et al.
Amigo: online access to ontology and annotation data. Bioinformatics 25, 288–9 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19033274.
[114] Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z. Gorilla: a tool for discovery and visualization of enriched go terms in ranked gene lists. BMC Bioinformatics 10, 48 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19192299.
[115] Zhou, X. & Su, Z. Easygo: Gene ontology-based annotation and functional enrichment analysis tool for agronomical species. BMC Genomics 8, 246 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17645808.
[116] Doniger, S. W. et al. Mappfinder: using gene ontology and genmapp to create a global gene-expression profile from microarray data. Genome Biol 4, R7 (2003). URL http://www.ncbi.nlm.nih.gov/pubmed/12540299.
[117] Huang da, W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using david bioinformatics resources. Nat Protoc 4, 44–57 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19131956.
[118] Salwinski, L. et al. The database of interacting proteins: 2004 update. Nucleic Acids Res 32, D449–51 (2004). URL http://www.ncbi.nlm.nih.gov/pubmed/14681454.
[119] Licata, L. et al. Mint, the molecular interaction database: 2012 update. Nucleic Acids Res 40, D857–61 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22096227.
[120] Kerrien, S. et al. The intact molecular interaction database in 2012. Nucleic Acids Res 40, D841–6 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22121220.
[121] Bader, G. D., Betel, D. & Hogue, C. W. Bind: the biomolecular interaction network database. Nucleic Acids Res 31, 248–50 (2003). URL http://www.ncbi.nlm.nih.gov/pubmed/12519993.
[122] Breitkreutz, B. J., Stark, C. & Tyers, M. The grid: the general repository for interaction datasets. Genome Biol 4, R23 (2003). URL http://www.ncbi.nlm.nih.gov/pubmed/12620108.
[123] Keshava Prasad, T. S. et al.
Human protein reference database–2009 update. Nucleic Acids Res 37, D767–72 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/18988627.
[124] Lage, K. et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol 25, 309–16 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17344885 http://www.nature.com/nbt/journal/v25/n3/pdf/nbt1295.pdf.
[125] Portales-Casamar, E. et al. Jaspar 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res 38, D105–10 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/19906716.
[126] Matys, V. et al. Transfac and its module transcompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34, D108–10 (2006). URL http://www.ncbi.nlm.nih.gov/pubmed/16381825.
[127] Lachmann, A. et al. Chea: transcription factor regulation inferred from integrating genome-wide chip-x experiments. Bioinformatics 26, 2438–44 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20709693.
[128] Qin, B. et al. Cistromemap: a knowledgebase and web server for chip-seq and dnase-seq studies in mouse and human. Bioinformatics 28, 1411–2 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22495751.
[129] Ziebarth, J. D., Bhattacharya, A. & Cui, Y. Ctcfbsdb 2.0: a database for ctcf-binding sites and genome organization. Nucleic Acids Res 41, D188–94 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23193294.
[130] Yang, J. H., Li, J. H., Jiang, S., Zhou, H. & Qu, L. H. Chipbase: a database for decoding the transcriptional regulation of long non-coding rna and microrna genes from chip-seq data. Nucleic Acids Res 41, D177–87 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23161675.
[131] Heinemeyer, T. et al. Databases on transcriptional regulation: Transfac, trrd and compel. Nucleic Acids Res 26, 362–7 (1998). URL http://www.ncbi.nlm.nih.gov/pubmed/9399875.
[132] Messeguer, X. et al.
Promo: detection of known transcription regulatory elements using species-tailored searches. Bioinformatics 18, 333–4 (2002). URL http://www. ncbi.nlm.nih.gov/pubmed/11847087. 32 [133] Bailey, T. L. et al. Meme suite: tools for motif discovery and searching. Nucleic Acids Res 37, W202–8 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19458158. 32 [134] Chekmenev, D. S., Haid, C. & Kel, A. E. P-match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Res 33, W432–7 (2005). URL http://www.ncbi.nlm.nih.gov/pubmed/15980505. 32 [135] Fazius, E., Shelest, V. & Shelest, E. Sitar: a novel tool for transcription factor binding site prediction. Bioinformatics 27, 2806–11 (2011). URL http://www.ncbi. nlm.nih.gov/pubmed/21893518. 32 [136] Glazko, G. V. & Emmert-Streib, F. Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics 25, 2348–54 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19574285. 32 [137] Kanehisa, M. & Goto, S. Kegg: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28, 27–30 (2000). URL http://www.ncbi.nlm.nih.gov/pubmed/10592173. 32 162 BIBLIOGRAPHY [138] D’Eustachio, P. Reactome knowledgebase of human biological pathways and processes. Methods Mol Biol 694, 49–61 (2011). URL http://www.ncbi.nlm.nih.gov/ pubmed/21082427. 32 [139] MacRae, C. A. Action and the actionability in exome variation. Circ Cardiovasc Genet 5, 597–8 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/23250897. 32 [140] Kel, A. et al. Explain: finding upstream drug targets in disease gene regulatory networks. SAR QSAR Environ Res 19, 481–94 (2008). URL http://www.ncbi.nlm. nih.gov/pubmed/18853298. 32 [141] Warde-Farley, D. et al. The genemania prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 38, W214–20 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20576703. 32 [142] Franceschini, A. et al. 
String v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 41, D808–15 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23203871. 32 [143] Chen, E. Y. et al. Enrichr: interactive and collaborative html5 gene list enrichment analysis tool. BMC Bioinformatics 14, 128 (2013). URL http://www.ncbi.nlm.nih. gov/pubmed/23586463. 32 [144] Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498–504 (2003). URL http://www.ncbi.nlm.nih.gov/pubmed/14597658http://genome.cshlp.org/ content/13/11/2498.full.pdf. 32 [145] Reich, M. et al. Genepattern 2.0. Nat Genet 38, 500–1 (2006). URL ncbi.nlm.nih.gov/pubmed/16642009. 33 http://www. [146] Zhao, J., Gupta, S., Seielstad, M., Liu, J. & Thalamuthu, A. Pathway-based analysis using reduced gene subsets in genome-wide association studies. BMC Bioinformatics 12, 17 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21226955. 33 [147] Dold, S., Wjst, M., von Mutius, E., Reitmeir, P. & Stiepel, E. Genetic risk for asthma, allergic rhinitis, and atopic dermatitis. Arch Dis Child 67, 1018–22 (1992). URL http://www.ncbi.nlm.nih.gov/pubmed/1520004. 33 [148] O’Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med 5, 28 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23537139. 34 [149] Nabholz, C. E. & von Overbeck, J. Gene-environment interactions and the complexity of human genetic diseases. J Insur Med 36, 47–53 (2004). URL http: //www.ncbi.nlm.nih.gov/pubmed/15104029. 35 [150] John, B. & Lewis, K. R. Chromosome variability and geographic distribution in insects. Science 152, 711–21 (1966). URL http://www.ncbi.nlm.nih.gov/pubmed/ 17797432. 35 [151] Mannino, D. M. et al. Surveillance for asthma–united states, 1980-1999. MMWR Surveill Summ 51, 1–13 (2002). URL http://www.ncbi.nlm.nih.gov/pubmed/12420904. 39 [152] Martinez, F. D. et al. 
Asthma and wheezing in the first six years of life. the group health medical associates. N Engl J Med 332, 133–8 (1995). URL http://www.ncbi. nlm.nih.gov/pubmed/7800004. 39 BIBLIOGRAPHY 163 [153] Akhabir, L. & Sandford, A. J. Genome-wide association studies for discovery of genes involved in asthma. Respirology 16, 396–406 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21276132http://onlinelibrary.wiley.com/ store/10.1111/j.1440-1843.2011.01939.x/asset/j.1440-1843.2011.01939.x.pdf?v= 1&t=hqbejp4b&s=845b02f00ece0c6374fc6f02650cb5b045b544f2. 40 [154] Bisgaard, H., Bonnelykke, K. & Stokholm, J. Immune-mediated diseases and microbial exposure in early life. Clin Exp Allergy 44, 475–81 (2014). URL http: //www.ncbi.nlm.nih.gov/pubmed/24533884. 39 [155] Cookson, W. O. & Moffatt, M. F. Asthma: an epidemic in the absence of infection? Science 275, 41–2 (1997). URL http://www.ncbi.nlm.nih.gov/pubmed/8999535. 39, 53 [156] Mantzouranis, E., Papadopouli, E. & Michailidi, E. Childhood asthma: recent developments and update. Curr Opin Pulm Med 20, 8–16 (2014). URL http: //www.ncbi.nlm.nih.gov/pubmed/24240439. 39 [157] Gilliland, F. D. et al. Effects of glutathione s-transferase m1, maternal smoking during pregnancy, and environmental tobacco smoke on asthma and wheezing in children. Am J Respir Crit Care Med 166, 457–63 (2002). URL http://www.ncbi. nlm.nih.gov/pubmed/12186820. 39 [158] Young, S. et al. The influence of a family history of asthma and parental smoking on airway responsiveness in early infancy. N Engl J Med 324, 1168–73 (1991). URL http://www.ncbi.nlm.nih.gov/pubmed/2011160. 39 [159] Illi, S. et al. Perennial allergen sensitisation early in life and chronic asthma in children: a birth cohort study. Lancet 368, 763–70 (2006). URL http://www.ncbi. nlm.nih.gov/pubmed/16935687. 39 [160] Weitzman, M., Gortmaker, S. & Sobol, A. Racial, social, and environmental risks for childhood asthma. Am J Dis Child 144, 1189–94 (1990). URL http://www.ncbi. nlm.nih.gov/pubmed/2239856. 
40 [161] Von Ehrenstein, O. S. et al. Reduced risk of hay fever and asthma among children of farmers. Clin Exp Allergy 30, 187–93 (2000). URL http://www.ncbi.nlm.nih.gov/ pubmed/10651770. 40 [162] Moffatt, M. F. et al. A large-scale, consortium-based genomewide association study of asthma. N Engl J Med 363, 1211–21 (2010). URL http://www.ncbi.nlm.nih.gov/ pubmed/20860503. 40, 55 [163] Potaczek, D. P. et al. Different fcer1a polymorphisms influence ige levels in asthmatics and non-asthmatics. Pediatr Allergy Immunol 24, 441–9 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23725541. 40 [164] Sleiman, P. M. et al. Variants of dennd1b associated with asthma in children. N Engl J Med 362, 36–44 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20032318. 40, 55 [165] Murphy, S. K. & Hollingsworth, J. W. Stress: a possible link between genetics, epigenetics, and childhood asthma. Am J Respir Crit Care Med 187, 563–4 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23504358. 40 [166] Hancock, D. B. et al. Meta-analyses of genome-wide association studies identify multiple loci associated with pulmonary function. Nat Genet 42, 45–52 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20010835. 41 [167] Repapi, E. et al. Genome-wide association study identifies five loci associated with lung function. Nat Genet 42, 36–44 (2010). URL http://www.ncbi.nlm.nih.gov/ pubmed/20010834. 41 164 BIBLIOGRAPHY [168] Martinez, F. D. Managing childhood asthma: challenge of preventing exacerbations. Pediatrics 123 Suppl 3, S146–50 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/ 19221157. 43 [169] Schatz, M. et al. Relationships among quality of life, severity, and control measures in asthma: An evaluation using factor analysis. Journal of Allergy and Clinical Immunology 115, 1049–1055 (2005). URL <GotoISI>://WOS:000229055100023. 43 [170] Takeichi, M. Cadherins: a molecular family important in selective cell-cell adhesion. Annu Rev Biochem 59, 237–52 (1990). 
URL http://www.ncbi.nlm.nih.gov/pubmed/ 2197976. 44 [171] Wheelock, M. J. & Johnson, K. R. Cadherins as modulators of cellular phenotype. Annu Rev Cell Dev Biol 19, 207–35 (2003). URL http://www.ncbi.nlm.nih. gov/pubmed/14570569. 44 [172] Nawijn, M. C., Hackett, T. L., Postma, D. S., van Oosterhout, A. J. & Heijink, I. H. E-cadherin: gatekeeper of airway mucosa and allergic sensitization. Trends Immunol 32, 248–55 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21493142. 44 [173] Koppelman, G. H. et al. Identification of pcdh1 as a novel susceptibility gene for bronchial hyperresponsiveness. Am J Respir Crit Care Med 180, 929–35 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19729670. 44 [174] Yanai, I. et al. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics 21, 650–9 (2005). URL http://www.ncbi.nlm.nih.gov/pubmed/15388519. 44 [175] Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human cns. Neurogenetics 7, 67–80 (2006). URL http://www.ncbi.nlm. nih.gov/pubmed/16572319. 44 [176] Watkins, N. A. et al. A haematlas: characterizing gene expression in differentiated human blood cells. Blood 113, e1–9 (2009). URL http://www.ncbi.nlm.nih.gov/ pubmed/19228925. 44 [177] Ross, A. J., Dailey, L. A., Brighton, L. E. & Devlin, R. B. Transcriptional profiling of mucociliary differentiation in human airway epithelial cells. Am J Respir Cell Mol Biol 37, 169–85 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17413031. 44 [178] Grad, R. & Morgan, W. J. Long-term outcomes of early-onset wheeze and asthma. J Allergy Clin Immunol 130, 299–307 (2012). URL http://www.ncbi.nlm.nih.gov/ pubmed/22738675. 53 [179] Nievas, I. F. & Anand, K. J. Severe acute asthma exacerbation in children: a stepwise approach for escalating therapy in a pediatric intensive care unit. J Pediatr Pharmacol Ther 18, 88–104 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23798903. 
53 [180] Bisgaard, H. et al. Chromosome 17q21 gene variants are associated with asthma and exacerbations but not atopy in early childhood. Am J Respir Crit Care Med 179, 179–85 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19029000. 54, 55 [181] Granell, R. et al. Examination of the relationship between variation at 17q21 and childhood wheeze phenotypes. J Allergy Clin Immunol 131, 685–94 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23154084. 54 [182] Mamanova, L. et al. Target-enrichment strategies for next-generation sequencing. Nat Methods 7, 111–8 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20111037. 54 BIBLIOGRAPHY 165 [183] Moffatt, M. F. et al. Genetic linkage of t-cell receptor alpha/delta complex to specific ige responses. Lancet 343, 1597–600 (1994). URL http://www.ncbi.nlm.nih.gov/ pubmed/7911920. 54 [184] Moffatt, M. F., Traherne, J. A., Abecasis, G. R. & Cookson, W. O. Single nucleotide polymorphism and linkage disequilibrium within the tcr alpha/delta locus. Hum Mol Genet 9, 1011–9 (2000). URL http://www.ncbi.nlm.nih.gov/pubmed/10767325. 54 [185] Palmer, C. N. et al. Common loss-of-function variants of the epidermal barrier protein filaggrin are a major predisposing factor for atopic dermatitis. Nat Genet 38, 441–6 (2006). URL http://www.ncbi.nlm.nih.gov/pubmed/16550169. 54 [186] Ferreira, M. A. et al. Identification of il6r and chromosome 11q13.5 as risk loci for asthma. Lancet 378, 1006–14 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/ 21907864. 55 [187] Hirota, T. et al. Genome-wide association study identifies three new susceptibility loci for adult asthma in the japanese population. Nat Genet 43, 893–6 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21804548. 56 [188] Torgerson, D. G. et al. Meta-analysis of genome-wide association studies of asthma in ethnically diverse north american populations. Nat Genet 43, 887–92 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21804549. 56 [189] Li, X. et al. 
The c11orf30-lrrc32 region is associated with total serum ige levels in asthmatic patients. J Allergy Clin Immunol 129, 575–8, 578 e1–9 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22070912. 56 [190] Marenholz, I. et al. The eczema risk variant on chromosome 11q13 (rs7927894) in the population-based alspac cohort: a novel susceptibility factor for asthma and hay fever. Hum Mol Genet 20, 2443–9 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/ 21429916. 56 [191] Bonnelykke, K. et al. Meta-analysis of genome-wide association studies identifies ten loci influencing allergic sensitization. Nat Genet 45, 902–6 (2013). URL http: //www.ncbi.nlm.nih.gov/pubmed/23817571. 56 [192] Li, X. et al. Genome-wide association study of asthma identifies rad50-il13 and hla-dr/dq regions. J Allergy Clin Immunol 125, 328–335 e11 (2010). URL http: //www.ncbi.nlm.nih.gov/pubmed/20159242. 56 [193] Bonnelykke, K. et al. A genome-wide association study identifies cdhr3 as a susceptibility locus for early childhood asthma with severe exacerbations. Nat Genet 46, 51–5 (2014). URL http://www.ncbi.nlm.nih.gov/pubmed/24241537. 56, 126 [194] Norgaard-Pedersen, B. & Hougaard, D. M. Storage policies and use of the danish newborn screening biobank. Journal of inherited metabolic disease 30, 530–6 (2007). URL <GotoISI>://MEDLINE:17632694. 56 [195] Hollegaard, M. V. et al. Genome-wide scans using archived neonatal dried blood spot samples. BMC Genomics 10, 297 (2009). URL http://www.ncbi.nlm.nih.gov/ pubmed/19575812. 56 [196] Hollegaard, M. V. et al. Robustness of genome-wide scanning using archived dried blood spot samples as a dna source. BMC Genet 12, 58 (2011). URL http://www. ncbi.nlm.nih.gov/pubmed/21726430. 56 [197] Jorgensen, T. J. et al. Hypothesis-driven candidate gene association studies: practical design and analytical considerations. Am J Epidemiol 170, 986–93 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19762372. 58 166 BIBLIOGRAPHY [198] Wjst, M. et al. 
Asthma families show transmission disequilibrium of gene variants in the vitamin d metabolism and signalling pathway. Respir Res 7, 60 (2006). URL http://www.ncbi.nlm.nih.gov/pubmed/16600026. 58 [199] Hwang, S. et al. A protein interaction network associated with asthma. J Theor Biol 252, 722–31 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/18395227. 58 [200] Liu, Y. & Liu, S. Protein-protein interaction network analysis of children atopic asthma. Eur Rev Med Pharmacol Sci 16, 867–72 (2012). URL http://www.ncbi.nlm. nih.gov/pubmed/22953633. 58 [201] Indap, A. R., Cole, R., Runge, C. L., Marth, G. T. & Olivier, M. Variant discovery in targeted resequencing using whole genome amplified dna. BMC Genomics 14, 468 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23837845. 59 [202] Longmate, J. A., Larson, G. P., Krontiris, T. G. & Sommer, S. S. Three ways of combining genotyping and resequencing in case-control association studies. PLoS One 5, e14318 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/21187953. 60 [203] Rivas, M. A. et al. Deep resequencing of gwas loci identifies independent rare variants associated with inflammatory bowel disease. Nat Genet 43, 1066–73 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21983784. 60 [204] Tabor, H. K., Risch, N. J. & Myers, R. M. Candidate-gene approaches for studying complex genetic traits: practical considerations. Nat Rev Genet 3, 391–7 (2002). URL http://www.ncbi.nlm.nih.gov/pubmed/11988764. 60 [205] Adeyemo, A. & Rotimi, C. Genetic variants associated with complex human diseases show wide variation across multiple populations. Public Health Genomics 13, 72–9 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/19439916. 60 [206] Marigorta, U. M. & Navarro, A. High trans-ethnic replicability of gwas results implies common causal variants. PLoS Genet 9, e1003566 (2013). URL http://www.ncbi.nlm. nih.gov/pubmed/23785302. 60 [207] Yang, X. 
Use of functional genomics to identify candidate genes underlying human genetic association studies of vascular diseases. Arterioscler Thromb Vasc Biol 32, 216–22 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22258904. 60 [208] Rosenwasser, L. J. & Borish, L. Genetics of atopy and asthma: the rationale behind promoter-based candidate gene studies (il-4 and il-10). Am J Respir Crit Care Med 156, S152–5 (1997). URL http://www.ncbi.nlm.nih.gov/pubmed/9351597. 60 [209] Kaimal, V. et al. Integrative systems biology approaches to identify and prioritize disease and drug candidate genes. Methods Mol Biol 700, 241–59 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21204038. 60 [210] Middleton, F. A. et al. Integrating genetic, functional genomic, and bioinformatics data in a systems biology approach to complex diseases: application to schizophrenia. Methods Mol Biol 401, 337–64 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/ 18368374. 60 [211] Flier, J. S. Obesity wars: molecular progress confronts an expanding epidemic. Cell 116, 337–50 (2004). URL http://www.ncbi.nlm.nih.gov/pubmed/14744442. 77 [212] Spiegelman, B. M. & Flier, J. S. Obesity and the regulation of energy balance. Cell 104, 531–43 (2001). URL http://www.ncbi.nlm.nih.gov/pubmed/11239410. 77 [213] Friedman, J. M. A war on obesity, not the obese. Science 299, 856–8 (2003). URL http://www.ncbi.nlm.nih.gov/pubmed/12574619. 77 BIBLIOGRAPHY 167 [214] O’Rahilly, S. Human genetics illuminates the paths to metabolic disease. Nature 462, 307–14 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19924209. 77 [215] Rosen, E. & Spiegelman, B. What we talk about when we talk about fat. Cell 156, 20–44 (2014). URL http://linkinghub.elsevier.com/retrieve/pii/S0092867413015468. 78, 79 [216] Nedergaard, J., Bengtsson, T. & Cannon, B. Three years with adult human brown adipose tissue. Ann N Y Acad Sci 1212, E20–36 (2010). URL http://www.ncbi.nlm. nih.gov/pubmed/21375707. 78 [217] Symonds, M. E. 
Brown adipose tissue growth and development. Scientifica (Cairo) 2013, 305763 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/24278771. 79 [218] Rhodes, P. et al. Adult-onset obesity reveals prenatal programming of glucose-insulin sensitivity in male sheep nutrient restricted during late gestation. PLoS One 4, e7393 (2009). URL http://www.ncbi.nlm.nih.gov/pubmed/19826474. 80 [219] Rankinen, T. et al. The human obesity gene map: the 2005 update. Obesity (Silver Spring) 14, 529–644 (2006). URL http://www.ncbi.nlm.nih.gov/pubmed/16741264. 80 [220] Kunej, T. et al. Obesity gene atlas in mammals. J Genomics 1, 45–55 (2012). 80, 117 [221] Bell, C. G. et al. Integrated genetic and epigenetic analysis identifies haplotypespecific methylation in the fto type 2 diabetes and obesity susceptibility locus. PLoS One 5, e14040 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/21124985. 80 [222] Cordero, P. et al. Leptin and tnf-alpha promoter methylation levels measured by msp could predict the response to a low-calorie diet. J Physiol Biochem 67, 463–70 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21465273. 80 [223] Digel, W. & Lubbert, M. Dna methylation disturbances as novel therapeutic target in lung cancer: preclinical and clinical results. Crit Rev Oncol Hematol 55, 1–11 (2005). URL http://www.ncbi.nlm.nih.gov/pubmed/15886007. 80 [224] Martinez, J. A., Milagro, F. I., Claycombe, K. J. & Schalinske, K. L. Epigenetics in adipose tissue, obesity, weight loss, and diabetes. Adv Nutr 5, 71–81 (2014). URL http://www.ncbi.nlm.nih.gov/pubmed/24425725. 80 [225] Jiang, Y. H., Bressler, J. & Beaudet, A. L. Epigenetics and human disease. Annu Rev Genomics Hum Genet 5, 479–510 (2004). URL http://www.ncbi.nlm.nih.gov/ pubmed/15485357. 80 [226] The international hapmap project. Nature 426, 789–96 (2003). URL ncbi.nlm.nih.gov/pubmed/14685227. 117 http://www. [227] Abecasis, G. R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012). 
URL http://www.ncbi.nlm.nih.gov/pubmed/ 23128226. 117 [228] Forbes, S. A. et al. Cosmic: mining complete cancer genomes in the catalogue of somatic mutations in cancer. Nucleic Acids Res 39, D945–50 (2011). URL http: //www.ncbi.nlm.nih.gov/pubmed/20952405. 117 [229] Amladi, S. Online mendelian inheritance in man ’omim’. Indian J Dermatol Venereol Leprol 69, 423–4 (2003). URL http://www.ncbi.nlm.nih.gov/pubmed/17642958. 117, 122 [230] Rappaport, N. et al. Malacards: an integrated compendium for diseases and their annotation. Database (Oxford) 2013, bat018 (2013). URL http://www.ncbi.nlm.nih. gov/pubmed/23584832. 117, 122 168 BIBLIOGRAPHY [231] Safran, M. et al. Genecards version 3: the human gene integrator. Database (Oxford) 2010, baq020 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20689021. 117 [232] Hindorff LA, M. J. E. B. I. J. H. H. P. K. A., MacArthur J (European Bioinformatics Institute) & TA, M. A catalog of published genome-wide association studies. Available at: www.genome.gov/gwastudies (2013). 117 [233] Cariaso, M. & Lennon, G. Snpedia: a wiki supporting personal genome annotation, interpretation and analysis. Nucleic Acids Res 40, D1308–12 (2012). URL http: //www.ncbi.nlm.nih.gov/pubmed/22140107. 117, 122 [234] Li, R. et al. Building the sequence map of the human pan-genome. Nat Biotechnol 28, 57–63 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/19997067. 118 [235] Li, H. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv 1303.3997 (2013). 118 [236] Agresti, A. Approximate is better than “exact” for interval estimation of binomial proportions. The American statistician 52, 119 (1998). 119 [237] Hervella, M. et al. The loss of functional caspase-12 in europe is a pre-neolithic event. PLoS One 7, e37022 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22615879. 119 [238] Blair, D. R. et al. A nondegenerate code of deleterious variants in mendelian loci contributes to complex disease risk. 
Cell 155, 70–80 (2013). URL http://www.ncbi. nlm.nih.gov/pubmed/24074861. 119 [239] Ward, L. D. & Kellis, M. Haploreg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res 40, D930–4 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/ 22064851. 120 [240] MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–8 (2012). URL http://www.ncbi.nlm.nih.gov/ pubmed/22344438. 120 [241] Wheeler, D. A. et al. The complete genome of an individual by massively parallel dna sequencing. Nature 452, 872–6 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/ 18421352. 121 [242] Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5, e254 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17803354. 121 [243] Rasmussen, M. et al. Ancient human genome sequence of an extinct palaeo-eskimo. Nature 463, 757–62 (2010). URL http://www.ncbi.nlm.nih.gov/pubmed/20148029. 121 [244] Rasmussen, M. et al. An aboriginal australian genome reveals separate human dispersals into asia. Science 334, 94–8 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/ 21940856. 121 [245] Olalde, I. et al. Derived immune and ancestral pigmentation alleles in a 7,000-yearold mesolithic european. Nature 507, 225–8 (2014). URL http://www.ncbi.nlm.nih. gov/pubmed/24463515. 121 [246] Welter, D. et al. The nhgri gwas catalog, a curated resource of snp-trait associations. Nucleic Acids Res 42, D1001–6 (2014). URL http://www.ncbi.nlm.nih.gov/pubmed/ 24316577. 122 [247] Sanghera, D. K. & Blackett, P. R. Type 2 diabetes genetics: Beyond gwas. J Diabetes Metab 3 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/23243555. 122 BIBLIOGRAPHY 169 [248] Rasmussen, M. et al. The genome of a late pleistocene human from a clovis burial site in western montana. Nature 506, 225–9 (2014). URL http://www.ncbi.nlm.nih. gov/pubmed/24522598. 122 [249] Kin, T. & Ono, Y. 
Idiographica: a general-purpose web application to build idiograms on-demand for human, mouse and rat. Bioinformatics 23, 2945–6 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17893084. 122 [250] Howard, B. V. et al. Rising tide of cardiovascular disease in american indians. the strong heart study. Circulation 99, 2389–95 (1999). URL http://www.ncbi.nlm.nih. gov/pubmed/10318659. 122 [251] Lee, E. T. et al. Diabetes and impaired glucose tolerance in three american indian populations aged 45-74 years. the strong heart study. Diabetes Care 18, 599–610 (1995). URL http://www.ncbi.nlm.nih.gov/pubmed/8585996. 122 [252] Sinclair, K. A., Bogart, A., Buchwald, D. & Henderson, J. A. The prevalence of metabolic syndrome and associated risk factors in northern plains and southwest american indians. Diabetes Care 34, 118–20 (2011). URL http://www.ncbi.nlm.nih. gov/pubmed/20864516. 122 [253] Chen, R. et al. Type 2 diabetes risk alleles demonstrate extreme directional differentiation among human populations, compared to other diseases. PLoS Genet 8, e1002621 (2012). URL http://www.ncbi.nlm.nih.gov/pubmed/22511877. 122, 126 [254] Chakravarthy, M. V. & Booth, F. W. Eating, exercise, and ”thrifty” genotypes: connecting the dots toward an evolutionary understanding of modern chronic diseases. J Appl Physiol (1985) 96, 3–10 (2004). URL http://www.ncbi.nlm.nih.gov/pubmed/ 14660491. 122 [255] Corbo, R. M. & Scacchi, R. Apolipoprotein e (apoe) allele distribution in the world. is apoe*4 a ’thrifty’ allele? Ann Hum Genet 63, 301–10 (1999). URL http://www. ncbi.nlm.nih.gov/pubmed/10738542. 123 [256] Tinanoff, N. Cleft lip and palate. Nelson Textbook of Pediatrics, 18th edition. Philadelphia: Saunders Elsevier 1532–1533 (2007). 123 [257] MacArthur, D. G. et al. Loss of actn3 gene function alters mouse muscle metabolism and shows evidence of positive selection in humans. Nat Genet 39, 1261–5 (2007). URL http://www.ncbi.nlm.nih.gov/pubmed/17828264. 123 [258] Belsky, D. W. et al. 
Development and evaluation of a genetic risk score for obesity. Biodemography Soc Biol 59, 85–100 (2013). URL http://www.ncbi.nlm.nih.gov/ pubmed/23701538. 125 [259] Steckel, R. H. & Rose, J. C. The backbone of history: health and nutrition in the Western Hemisphere, vol. 2 %@ 0521801672 (Cambridge University Press, 2002). 126 [260] Corona, E. et al. Analysis of the genetic basis of disease in the context of worldwide human relationships and migration. PLoS Genet 9, e1003447 (2013). URL http: //www.ncbi.nlm.nih.gov/pubmed/23717210. 126 [261] Replication, D. I. G. & Meta-analysis, C. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat Genet 46, 234–44 (2014). URL http://www.ncbi.nlm.nih.gov/pubmed/24509480. 126 [262] Benfey, P. N. & Mitchell-Olds, T. From genotype to phenotype: systems biology meets natural variation. Science 320, 495–7 (2008). URL http://www.ncbi.nlm.nih. gov/pubmed/18436781. 126 170 BIBLIOGRAPHY [263] Stergachis, A. B. et al. Exonic transcription factor binding directs codon choice and affects protein evolution. Science 342, 1367–72 (2013). URL http://www.ncbi.nlm. nih.gov/pubmed/24337295. 131 [264] Ritchie, G. R., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat Methods 11, 294–6 (2014). URL http://www.ncbi.nlm. nih.gov/pubmed/24487584. 131 [265] Perez-Llamas, C. & Lopez-Bigas, N. Gitools: analysis and visualisation of genomic data using interactive heat-maps. PLoS One 6, e19541 (2011). URL http://www. ncbi.nlm.nih.gov/pubmed/21602921. 131 [266] Smoot, M. E., Ono, K., Ruscheinski, J., Wang, P. L. & Ideker, T. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27, 431–2 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/21149340. 131 [267] Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res 19, 1639–45 (2009). 
URL http://www.ncbi.nlm.nih.gov/pubmed/19541911. 131 [268] Brown, K. R. et al. Navigator: Network analysis, visualization and graphing toronto. Bioinformatics 25, 3327–9 (2009). URL http://www.ncbi.nlm.nih.gov/ pubmed/19837718. 131