Cardiac Structural and Sarcomere Genes Associated with Cardiomyopathy Exhibit Marked Intolerance of Genetic Variation

Running title: Pan et al.; Cardiomyopathy Gene Variant Intolerance

DOI: 10.1161/CIRCGENETICS.112.963421

Downloaded from http://circgenetics.ahajournals.org/ by guest on November 19, 2016

Stephen Pan, MD, MS1,2; Colleen A. Caleshu, ScM, CGC1; Kyla E. Dunn, MS, CGC1; Marcia J. Foti1; Maura K. Moran, BA1; Oretunlewa Soyinka1; Euan A. Ashley, MRCP, DPhil1

1Stanford Center for Inherited Cardiovascular Disease, Stanford Hospital & Clinics; 2Biomedical Informatics Training Program, Stanford University School of Medicine, Stanford, CA

Correspondence:
Euan A. Ashley MRCP DPhil
Stanford Center for Inherited Cardiovascular Disease
Falk Cardiovascular Research Center
300 Pasteur Drive
Stanford, CA 94305
Tel: (650) 498-4900
Fax: (650) 725-1599
E-mail: euan@stanford.edu

Journal Subject Codes: [16] Myocardial cardiomyopathy disease, [89] Genetics of cardiovascular disease, [109] Clinical genetics, [146] Genomics

Abstract:

Background - The clinical significance of variants in genes associated with inherited cardiomyopathies can be difficult to determine due to uncertainty regarding population genetic variation and a surprising amount of tolerance of the genome even to loss of function variants. We hypothesized that genes associated with cardiomyopathy might be particularly resistant to the accumulation of genetic variation.
Methods and Results - We analyzed the rates of single nucleotide genetic variation in all known genes from the exomes of >5,000 individuals from the National Heart, Lung, and Blood Institute’s Exome Sequencing Project (ESP), as well as the rates of structural variation from the Database of Genomic Variants. Most variants were rare, with over half unique to one individual. Cardiomyopathy associated genes exhibited a rate of nonsense variants 96.1% lower than other Mendelian disease genes. We tested the ability of in-silico algorithms to distinguish between a set of variants in MYBPC3, MYH7, and TNNT2 with strong evidence for pathogenicity and variants from the ESP data. Algorithms based on conservation at the nucleotide level (GERP, PhastCons) did not perform as well as amino acid level prediction algorithms (Polyphen-2, SIFT). Variants with strong evidence for disease causality were found in the ESP data at prevalence higher than expected.

Conclusions - Genes associated with cardiomyopathy carry very low rates of population variation. The existence in population data of variants with strong evidence for pathogenicity suggests that even for Mendelian disease genetics, a probabilistic weighting of multiple variants may be preferred over the ‘single gene’ causality model.
Key words: cardiomyopathy; genetic heart disease; genetic variation; genomics; genetic testing

Background

New DNA sequencing technologies are poised to transform the genetic evaluation of patients. Soon the availability of genetic information will no longer be a barrier to our understanding of the genetic basis of disease. Rather it will be our ability to understand and interpret the data that will be paramount. The interpretation of clinical genetic testing is a complex process that requires an appreciation of factors establishing causality as well as a detailed understanding of the ‘tolerated’ genetic variation present in human genomes of different ethnicities. Until recently, much of the genetic variation in human populations was unknown. With large scale population sequencing projects such as the 1000 Genomes Project1, the true extent of this variation is now becoming clear. Indeed, recent analyses indicate a surprising prevalence of tolerated genetic variation.2-4

Clinical genetic testing is increasingly available for conditions such as hypertrophic cardiomyopathy, where it is used for predictive family testing, and long QT syndrome, where it may alter management as well as impact family screening.5-7 The yield from genetic testing, however, can be variable.
Evidence for or against a variant’s role is assembled from previous reports in the literature, co-segregation, the likelihood that the variant disrupts the reading frame (weighted more towards nonsense variants, small insertion-deletion variants, or splice site variants) and algorithmic predictions based on conservation, constraint, or protein motif disruption. Despite such resources, a large number of variants found through clinical genetic testing remain of unclear significance. Greatly lacking is knowledge of the population genetic variation in these and other genes, which is needed for the interpretation of variants not just in Mendelian diseases, but also for common disease risk assessment8,9 and pharmacogenomics.10-12

One recent project to catalog population scale single nucleotide variant (SNV) data has been the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP)13,14. This large scale effort is aimed at sequencing the exome, consisting of the protein coding regions (exons) of the human genome, from members of several different cohorts followed throughout the country for the purpose of defining the genetic components of complex diseases. In contrast with the 1000 Genomes study, which has low coverage of hundreds of genomes, the NHLBI exomes study has high coverage (average >100x), high quality sequencing data for >5000 individuals of Caucasian and African American ethnicity. Thus, it represents a valuable comparison dataset for variants thought to cause monogenic Mendelian disease. One limitation of both of these datasets, however, is the absence of structural variants. These may be particularly important because of their tendency to disrupt the reading frame.
The Database of Genomic Variants (DGV)15 is a curated repository of structural variation (consisting of insertions, deletions, and copy-number variants) which serves a similar purpose as the above for structural variation.

Using these sources of population genetic variation, we sought to characterize the tolerance of the human genome to variation in genes associated with Mendelian diseases with a specific focus on those that have been associated with inherited cardiomyopathy.

Methods

NHLBI Exome Sequencing Project Data

Data from the NHLBI ESP5400 dataset was accessed on December 12th, 2011 and downloaded for analysis. This data is the accumulation of called variants from the exomes of 5,379 individuals from multiple cohorts of the ESP, including the Women’s Health Initiative, Framingham Heart Study, Jackson Heart Study, Multi-Ethnic Study of Atherosclerosis, Atherosclerosis Risk in Communities, Coronary Artery Risk Development in Young Adults, Cardiovascular Health Study, Genomic Research on Asthma in the African Diaspora, Lung Health Study, Pulmonary Arterial Hypertension population, Acute Lung Injury cohort, and the Cystic Fibrosis cohort. The primary purpose of the ESP is to sequence the exomes of a large number of individuals selected for the extremes of primarily complex traits from these cohorts. While these exomes may not represent a true sample of the general population, they do represent a phenotyped cohort that is unlikely to be enriched for Mendelian disease, with the possible exception of cystic fibrosis.
Resulting SNV calls were filtered for depth and base call thresholds and were annotated for quality using a support vector machine algorithm by the NHLBI ESP data analysis group. Only calls that passed all quality filters were used for downstream analysis. Further information regarding alignment, variant calls, and filtering, as well as the entirety of this data is available at http://evs.gs.washington.edu/EVS/.

1000 Genomes Data

Data from the 1000 Genomes Phase 1 March 2012 release1 was retrieved and the subset of small insertion/deletion (1-50 bp) calls were used for analysis (http://www.1000genomes.org/).

Database of Genomic Variants

For evaluation of structural variation, the November 2010 data release from the Database of Genomic Variants (http://projects.tcag.ca/variation), aligned to the hg19 version of the human genome, was accessed. This includes data from 42 separate studies evaluating for structural variation involving segments of DNA greater than 1kb, as well as smaller insertions/deletions (indels). The data is collected from small individual genome and population level studies without known enrichment for disease.

Gene Annotation

Gene annotation data was accessed from the Online Mendelian Inheritance in Man database16
All genes as annotated in the NCBI Reference Sequence Database (RefSeq)17 via the University of California, Santa Cruz (UCSC) Genome Browser18 (including alternate isoforms) were divided into subgroups by OMIM annotation and/or literature review according to their known association with: i) cardiomyopathy, specifically HCM or DCM ii) any other Mendelian disease, or iii) neither of the above. After accounting for alternate isoforms, there were 120 isoforms of 46 separate genes associated with inherited cardiomyopathy, 5,764 isoforms of 2,831 separate genes with other Mendelian disease Downloaded from http://circgenetics.ahajournals.org/ by guest on November 19, 2016 association, and 25,437 isoforms of 16,102 separate genes without known own Mendelian disease diise sea association for which variant data from the ESP5400 dataset was available. lable. Analysis off Population Po opu pula lati la tion ti o Variation Var a iation Data Variants from om the ESP5400 om 00 dataset dat attasett were weree grouped gro rouped ed byy gene gene ne into into the th he three thrreee categories cattego ca ori ries e previously prev vio ous mino minor norr allele no alleele l frequencies fre requ quen e ciiess ffor or eac each achh vvariant ac ari rian iantt were weere extracted. extra raccted ra d. Variant V ri Va rian a t subtypes an suubt btyp ypes es we described and mi were y pr ppredicted edicted func functional tiional effect eff ffect (s ff ((synonymous, y onym yn y ous,, missense,, nonsense,, splice) spl plice)) and the ssum of analyzed by minor allelee frequencies ffreq encies cii across known kkno n n isoforms isoff was as used sedd to come up p with ith ithh a raw ra co count ntt off expected number of variants per type per transcript. For synonymous, missense, and nonsense variants, this number was then normalized for transcript length based on data from RefSeq. For splice site variants, this number was normalized by number of known exons per transcript. 
In order to evaluate the distribution of small indels (1-50 bp), which were notably absent from the public release of the ESP5400 dataset, the subset of called indels from the 1000 Genomes Phase 1 March 2012 release was retrieved and annotated using ANNOVAR19 software against the NCBI RefSeq database17 to determine the subset in coding regions of genes with any disease association as above.

Curation of Known Variants

We manually curated a set of variants in MYH7, MYBPC3, and TNNT2 with strong evidence for causing cardiomyopathy. This set comprised missense variants seen in patients at the Stanford Center for Inherited Cardiovascular Disease from September 2010 to December 2011 and considered likely or very likely disease causing. To supplement this list, we selected variants from a publicly available repository of sarcomeric variants20 with the highest number of independent citations. These variants were then manually curated and any variants we considered likely or very likely disease causing were included in our high confidence set. Curation relied on published data, cases from our clinical cohort, and case or control data from commercial genetic testing laboratories. Classification was based on segregation data, presence in multiple unrelated cases, absence in controls, and availability of compelling animal model or in vitro data.
Variants were considered very likely disease causing only if strong segregation data and/or animal model data was available.

Algorithmic Prediction of Variant Pathogenicity

All missense variants from the NHLBI ESP5400 dataset as well as variants from our curated list of known pathogenic variants in HCM were scored using GERP21, a measure of evolutionary constraint at a nucleotide base level utilizing a rejected substitution score, and PhastCons22, another measure of evolutionary conservation at the nucleotide base level utilizing multiple sequence alignment, using the SeattleSeq SNP Annotation server (http://snp.gs.washington.edu/SeattleSeqAnnotation). Polyphen223 (http://genetics.bwh.harvard.edu/pph2/) and SIFT24 (http://sift.jcvi.org/) scores, both predictions of pathogenicity of missense variants based on the effects of the predicted resulting amino acid substitution, were obtained from their respective servers.

Structural Variation Analysis

Structural variants from DGV were grouped on the basis of Mendelian disease association. The average number of structural variants per gene was computed. Due to the varying size of both structural variants and the transcripts they affect, we normalized by evaluating only structural variants affecting protein coding regions of genes and calculating the percent of each gene’s coding region based on transcript length affected by a deletion in DGV.

Statistical Analysis

All data analysis was carried out using the R Statistical Programming Language. Tests for statistical significance between groups were non-parametric tests without assumption of the underlying distribution.
These included the Wilcoxon rank-sum test for direct comparison between two groups, the Kruskal-Wallis test for analysis of variance, and Spearman’s rank order for correlation. Given that most genes are not in linkage with each other, linkage between genes does not affect the results of the Kruskal-Wallis test significantly. For the analysis of the exonic distribution of pathogenic and ESP variants, Fisher’s exact test was used. While Fisher’s test does assume independence of events which may not necessarily be true for the distribution of variants in a gene due to linkage disequilibrium, given the overall rarity of most variants analyzed (almost all less than 1% minor allele frequency and the majority being unique) it is unlikely that a rare variant in one exon significantly affects the probability of a variant in another exon.

Results

Most Genetic Variation is Rare

Most variants in the population data were not shared between many individuals. Private variants, those that were found only in one person, were abundant. Out of the 9,974 total variants called in the NHLBI exomes distributed amongst 46 separate cardiomyopathy associated genes, 9,103 (91%) had minor allele frequencies less than 1%. Of these rare variants, 5,448 (60%) were private. This predominance of rare variants was almost identical in other genes, whether or not they were associated with Mendelian disease.
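As a rough illustration of how the frequency classes above can be tallied, the following Python sketch bins variants into private, rare, and common using the thresholds quoted in the text (private: found in one person; rare: minor allele frequency below 1%; common: above 5%). The function names, the input layout of (minor allele frequency, number of carriers) pairs, and the choice to count every private variant as rare are assumptions made for illustration and are not taken from the original analysis.

```python
from collections import Counter

RARE_MAF = 0.01    # "rare": minor allele frequency below 1%
COMMON_MAF = 0.05  # "common": minor allele frequency above 5%


def frequency_class(maf, carriers):
    """Assign one variant to a frequency class."""
    if carriers == 1:
        # Found in a single individual. In a cohort of thousands this
        # implies a very low frequency, so it is counted as rare below.
        return "private"
    if maf < RARE_MAF:
        return "rare_shared"
    if maf > COMMON_MAF:
        return "common"
    return "intermediate"


def summarize(variants):
    """Tally (maf, carriers) pairs into total, rare, private, and common."""
    counts = Counter(frequency_class(maf, carriers) for maf, carriers in variants)
    return {
        "total": sum(counts.values()),
        "rare": counts["private"] + counts["rare_shared"],
        "private": counts["private"],
        "common": counts["common"],
    }


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Reproducing the proportions reported for the cardiomyopathy genes:
rare_share = percent(9103, 9974)     # rare variants among all variants
private_share = percent(5448, 9103)  # private variants among rare variants
```

With the counts given in the text, `rare_share` evaluates to 91 and `private_share` to 60, matching the reported percentages.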
Common variants (minor allele frequency > 5%) comprised only 5% of all genetic variation in the coding regions of human genes.

We found many genes for which a large amount of genetic variation was not only expected, but likely serves a critical purpose. Among Mendelian disease associated genes, the five with the highest rates of missense variation were all HLA loci (Supplementary Table 2), where high rates of polymorphism are thought to be selectively maintained.25 Another well-recognized gene locus with very high missense variation was the ABO blood group locus. Among non-Mendelian disease associated genes, those with the most variation included many of the olfactory receptor genes, consistent with the survival advantage of a sophisticated sensing system for environmental odorant molecules.26

Missense variant and nonsense variant rates did not appear correlated when looking across all genes (Spearman’s rho=0.36). This remained true when looking at the subset of genes with Mendelian disease association or the subset without Mendelian disease association.

Mendelian Disease Genes Exhibit Lower Rates of Genetic Variation

We found significantly lower levels of variation in genes associated with Mendelian disease as compared to genes without a known association (Table 1).
In general, this reduction was much stronger for types of genetic variation that would be predicted to have more impact on the resulting protein product, such as splice site or nonsense variants. Mendelian disease genes were noted to have a 67.3% lower rate of nonsense variants as compared to genes without known disease association (p=9.6x10^-6). These were even more rare in cardiomyopathy associated genes (Figure 1), which exhibited a 98.7% lower nonsense variant rate as compared to non-disease associated genes and a 96.1% lower rate as compared to the remaining Mendelian disease associated genes (p=5.7x10^-7). Similarly lower variant rates were seen for both missense and splice site variants as well. Interestingly, this was reversed with respect to synonymous variation, with cardiomyopathy specific genes having slightly higher rates of variation (116.4 variants per megabase of coding region per chromosome in cardiomyopathy genes vs. 90.8 and 95.1 variants per megabase of coding region per chromosome for non-OMIM and OMIM genes, respectively, p=2.7x10^-3).

Nonsense Variants are Extremely Rare in Cardiac Structural and Sarcomere Genes

Single nucleotide variants thought to have the most effect on protein function are ones that result in a premature stop codon, i.e. nonsense variants. We looked at cardiomyopathy-associated genes in the NHLBI exome data to evaluate for the overall prevalence of this type of variation in a population without known inherited cardiomyopathy.
Overall, we found that nonsense variants were extremely rare in these genes. In fact, in the subset of genes that are routinely sequenced for clinical purposes in HCM, we found only one nonsense variant each in MYH7 and MYBPC3. Nonsense variants were completely absent in the sarcomeric genes ACTC1, TNNT2, TNNI3, MYL2, MYL3, and TPM1. While the nonsense variant in MYH7 has not been reported previously, the nonsense variant found in MYBPC3 (p.Trp1214Ter) has been associated with hypertrophic cardiomyopathy in one published report in an Asian Indian population27.

Among cardiomyopathy-associated genes, the gene with the greatest number of nonsense variants in the ESP5400 exomes data was the very large gene titin (TTN), which has been implicated in familial DCM. This may be largely due to its immense size, as the coding region of titin consists of upwards of 100 kilobases. In total we noted 23 predicted nonsense variants in titin in the NHLBI exome data. The majority of these nonsense variants seemed to be distributed evenly throughout the length of the gene, although there were two notable clusters of nonsense variants near the 5’ end of the gene (Figure 2). This is in direct contrast to a recent report of a high burden of variants in the A band of the titin protein (corresponding to a group of exons near the 3’ end of the transcript) associated with dilated cardiomyopathy (DCM)28. Both clusters of nonsense variants in our analysis were in exons that are specific to the novex alternate splice isoforms of titin, the first in the terminal exon (exon 46) of the novex-3 isoform (NM_133379) and the other in exon 44 of the novex-2 isoform (NM_133437).
Neither of these is a major cardiac isoform of titin, which may explain why nonsense variants in these regions may be more tolerated.

On the other hand, DMD, which has been implicated in Duchenne and Becker muscular dystrophy29,30 as well as X-linked familial cardiomyopathy,29 was noted to manifest an extremely low rate of nonsense variants despite its enormous size. Of all human genes, DMD spans the largest region of the genome: encompassing 2.4 million bases, with a coding region consisting of about 14 kb spread over greater than 70 exons. The NHLBI dataset, however, contained only one predicted nonsense variant within this gene.

Prediction of Pathogenicity of Missense Variants Remains Challenging

We collected 46 variants, 40 of which were missense, with particularly strong evidence of causality from three genes most often found to be causal in HCM (MYBPC3, MYH7, and TNNT2) (Supplemental Table 3). Given a large amount of ambiguity over the effects of missense variants in the genome, we compared the missense variants from this pathogenic list to missense variants from the NHLBI exome data within the same genes. These 40 pathogenic missense variants were generally located in regions within these three genes that were notable for very low variant frequencies in the population data, suggesting that these are regions with vital functions that do not tolerate high rates of variation (Figure 3).
Furthermore, 10/26 pathogenic missense variants in MYH7 and 6/10 of the pathogenic missense variants in TNNT2 were found in exons that were notable for a complete absence of non-synonymous likely benign variation (Supplemental Table 4). These exons in MYH7 (exons 6, 7, 9, 13, and 19 of NM_000257) and in TNNT2 (exon 10 of NM_000364) thus likely encode critical functional domains in the resultant peptide. In support of this, the above noted exons in MYH7 all encode for portions of the functional head and neck domains31. In addition, the above mentioned exon in TNNT2 encodes a portion of a tropomyosin binding site, with induced variants in this exon previously shown to strongly reduce binding efficacy32,33. In general, exonic distribution was strikingly different between the pathogenic variants and ESP5400 variants in MYH7 (p=.0059) and TNNT2 (p=.013). This difference was not statistically significant in MYBPC3, which may be due to the low number of pathogenic missense variants in this gene in our collection, consistent with reports that the majority of disease-causing variants in this gene tend to be frameshift, splice, or nonsense variants rather than missense34,35.

Of note, 4 of the 46 variants with good evidence of pathogenicity were present in the NHLBI exome data.
The individual incidences of these variants were very low, with almost all found in only 1 individual each, except for one variant in TNNT2, p.Arg278Cys that was found in 6 individuals in the NHLBI exome cohort. No phenotype information was available to us for these individuals. These variants were removed from the NHLBI ESP variant list for any further analysis.

We used widely accepted variant classification algorithms to predict pathogenicity of missense variants. We found the evolutionary constraint based algorithms GERP and PhastCons to be poorly predictive of variant pathogenicity in this data. Notably, GERP scores appeared on the whole to be higher in the NHLBI ESP variant set (Figure 4), the opposite of what would be expected. While PhastCons predicted scores of > 0.95 (max score of 1) for all the variants in our curated causative variant list, the majority of presumably tolerated missense variants (67%) from the NHLBI exome data set were also noted to have a similarly high PhastCons score, resulting in a c-statistic for classification of 0.52, akin to no discriminatory power (Figure 5).

The use of algorithms based on amino acid substitution gave much better results. SIFT had modest discriminatory power with a c-statistic of 0.70. Polyphen-2, which also utilizes information about peptide structure and interaction, performed the best with a c-statistic of 0.77. It should be noted however that Polyphen-2 is based on a machine-learning algorithm that was trained on variants that may have included some of those from our curated list.
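For readers less familiar with the c-statistic quoted above, the sketch below shows one standard way to compute it: as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen population variant, with ties counted as one half. This is an illustrative Python implementation under that definition, not the authors' R code; the function name `c_statistic` and the example scores are hypothetical. It assumes higher scores indicate greater predicted pathogenicity, so a score that runs in the opposite direction would need to be negated before use.

```python
def c_statistic(pathogenic_scores, population_scores):
    """Concordance (area under the ROC curve) from two lists of scores.

    Compares every pathogenic score with every population score:
    a higher pathogenic score counts 1, a tie counts 0.5, otherwise 0.
    """
    if not pathogenic_scores or not population_scores:
        raise ValueError("both score lists must be non-empty")
    concordant = 0.0
    for p in pathogenic_scores:
        for q in population_scores:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pathogenic_scores) * len(population_scores))


# Illustrative scores only; these are not values from the study.
separated = c_statistic([0.9, 0.8], [0.1, 0.2])   # perfect separation
saturated = c_statistic([1.0, 1.0], [1.0, 1.0])   # all scores tied
partial = c_statistic([0.9, 0.4], [0.5, 0.1])     # 3 of 4 pairs concordant
```

The tied case shows why a score that saturates near its maximum for both groups, as described for PhastCons, yields a value close to 0.5 even though every pathogenic variant scores highly.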
Cardiomyopathy Genes Exhibit Less Structural Variation

We attempted to recapitulate these findings in other types of genetic variation by evaluating the distribution of small indels in data from the 1000 Genomes Project. There were notably only 5,969 indels from this dataset in coding regions, of which 868 were in Mendelian disease associated genes and 26 were in cardiomyopathy associated genes. This gave total rates of 17 indels per 1,000 exons in non-Mendelian disease genes, 10 indels per 1,000 exons in Mendelian disease genes, and 9 indels per 1,000 exons in cardiomyopathy genes. However the overall low number of these types of variants in this data limited any further statistical analysis.

We then used data from DGV to query on a per gene basis the number of all structural variants that have been reported as well as the overall extent of the coding region of genes that are covered by known structural variants. We found that the total number, per gene, of all structural variants and only structural variants affecting coding regions did not differ between genes associated with Mendelian disease and those that are not (Table 2). However we did note a 53% reduction of coding region covered by reported deletion type structural variants in genes that are specifically associated with cardiomyopathy as compared to genes without Mendelian disease association (p-value=0.02).

Discussion

Recent studies have suggested a surprising rate of tolerance to genetic variation within the human genome.
Here, we show that this tolerance does not extend to genes associated with cardiomyopathy, especially structural and sarcomere genes. This observation fits with a systems model of organism function where some genes are disproportionately intolerant of variation because their function has less redundancy. In addition, in describing population variation data for these genes, we note the presence of a surprising number of disease-associated variants in a population without enrichment for cardiomyopathy.

In contrast to the high rate of genetic variation found in genes dependent on diversity for effective function such as the olfactory receptor loci, we found that population genetic variation, especially variation expected to affect protein function, was rare in Mendelian disease associated genes. We hypothesized that genes essential for cardiac function might be among the genes most intolerant of variation. Not only was this the case, but the strength of these associations was also found to be dependent on the severity of the predicted alteration of protein function, exemplified by the extreme rarity of nonsense variants in cardiomyopathy-specific genes.
These findings extended to structural variants as well, specifically with regard to the percent of the coding transcript that is involved in deletion-type structural variants in individuals without disease.

One strength of our study is in the practical application to clinical genetic testing, which relies on data from unaffected individuals to judge the likely pathogenicity of novel variants. As our understanding of human genetic variation has improved, it has become clear that even rare genetic variation can be normal and well tolerated, representing a challenge in linking genotypes to phenotypes. One recent study has estimated, using 1000 Genomes data, that the average person has as many as 100 loss of function variants per genome [2]. This population level of variation has implications for the interpretation of results of clinical genetic testing. However, our results indicate that this variation is not evenly distributed, and genes for which associations with Mendelian disease have been established have much lower levels of such variation, likely representing the effects of purifying selection.

Why genes associated with cardiomyopathy show even lower rates of genetic variation than other Mendelian disease associated genes is not self-evident, but there are many possibilities as to why this may be the case. One study has suggested that Mendelian disease genes may not necessarily be the hubs of gene networks [36] (because to manifest disease a variant cannot be fatal).
However, genes associated with cardiomyopathy may be an exception given their essential functions within the sarcomere and the heart’s unique position in serving all other organs. Variants in these highly structured peptides with molecular motor functions that operate constantly throughout life would be expected to be heavily selected against in the general population.

The finding of a slight increase in synonymous variants in cardiomyopathy-associated genes is unexpected. It is possible that this represents a decrease in codon use bias in these genes relative to others, which may in turn reflect a decreased need for efficient translation of these structural proteins, but why this may be the case is not evident.

One intriguing finding in cardiomyopathy genetics is the contrast between disease-causing variants found in MYBPC3 and those in MYH7, the two genes with the highest number of HCM-causing variants. Indeed, the high rate of nonsense pathogenic variants found in MYBPC3 [37] is in contrast with the almost universal missense nature of those found in MYH7. The extreme rarity of nonsense variants in cardiomyopathy genes in the data presented here suggests that assigning a high probability of pathogenicity to such variants found in MYBPC3 in patients would be appropriate. The absence of disease-causing nonsense variants in MYH7 is curious. It may be that MYH7 haploinsufficiency is not tolerated at all.
We do note that predisposition of genes towards one type of variation versus another is not uncommon given the poor correlation between rates of different types of variation noted in our data, which may be driven by the resulting effects of such variants (dominant-negative effects in missense variants versus haploinsufficiency states in nonsense variants).

Missense variants remain among the most difficult to interpret in a clinical context. Without a large number of affected and unaffected family members to show co-segregation of variant with disease, it is often difficult to determine if a missense variant truly is pathogenic. Much has been made of the use of measures of evolutionary conservation to prioritize missense variants. Our analysis shows that while these measures can help exclude variants at positions in the genome that do not show conservation, they are unable to efficiently discriminate between likely causative and non-causative variants. While evolutionary conservation at the nucleotide base level appears to be a necessary characteristic of a pathogenic variant, it is not sufficient in and of itself to classify a variant as causative. Algorithms using the predicted effects of the resulting amino acid substitution showed much better classification potential, although this may in part reflect the use of cardiomyopathy causative variants as training data for these classifiers.
Our analysis also confirms recent evidence that the overwhelming majority of variation in the human genome is rare (i.e. affecting < 1% of the population). Interestingly, more than half of variants analyzed were private (found in only one person). In fact, taking all 8 commonly sequenced genes for HCM together (ACTC1, TNNT2, TNNI3, MYL2, MYL3, TPM1, MYH7, MYBPC3), we found 159 private missense variants, 3 private splice site variants, and 2 private nonsense variants for a total of 164 private variants that would have the potential to affect the resulting protein. Assuming that none of these variants was found in the same person, this would imply that 3% of a general population sample who were to be sequenced today would have candidate variants not seen previously on a small HCM disease genetic testing panel. This highlights the continued importance of co-segregation and other supporting data in deciding whether or not a novel variant is causative of disease.

It was also surprising to find 4 of 46 “gold standard” pathogenic variants present in this population sample, with a total pathogenic allele count of 9 among 5,379 individuals. These data would imply a background prevalence of variants believed causative of HCM of approximately 0.2% (based on 46 variants in 3 genes, and thus likely a substantial underestimate).
However, this is much higher than expected in a general population where the prevalence of HCM is estimated to be 0.2% in multiple populations [38-40], when considering that the yield of genetic testing is far from 100%. This is consistent with other recently published studies finding higher than expected prevalence of genetic variants associated with other Mendelian cardiovascular diseases such as familial DCM [14] and long QT syndrome [41], though the burden of evidence of pathogenicity for variants in these studies was variable.

While it remains possible that some individuals within these cohorts may harbor undiagnosed HCM given that phenotype data for these individuals are not publicly available, the genetic prevalence rate would still be expected to be much lower than that observed in these data. Based on these genetic variant prevalence data, the incidence of HCM would have to be underestimated by a factor of at least 2 for our current models of HCM disease inheritance to be true. Given that these estimates of HCM disease prevalence were based on multi-modality screening in diverse populations, it seems likely that some proportion of the variants thought to be causal of HCM under a single gene model cannot be. Alternatively, we posit that the idea of a single gene disorder with variable penetrance is likely an artifact of a limited genomic window, and that what has commonly been perceived as a single gene disorder may in fact be the result of a combination of multiple genetic variants each contributing a portion of the variance, with variants contributing differently in different individuals.
Just as some have suggested that a number of rare variants with strong effect size may be the driver of the inherited component in many common diseases [8,42,43], so too might this be the case for what have historically been perceived as monogenic disorders.

Limitations

Our study has limitations. No individual phenotype data for the cohorts in NHLBI-ESP, 1000 Genomes, or DGV are publicly available, so it is not possible for us at this time to determine if those individuals with variants from our curated set may have features of an undiagnosed cardiomyopathy. While the accumulated set of variants from these 5,379 individuals is available, individual exomes cannot be reconstructed, so it is not possible to determine which variants may be shared on the same chromosome. The family structure of the individuals within the NHLBI ESP data was also unknown. It is thus possible that a rare variant could be overrepresented if many members of the same family were sequenced.

Conclusion

In conclusion, using publicly available exome-wide sequencing data from thousands of individuals, we found that genes associated with Mendelian diseases show much lower rates of protein-altering genetic variation, including missense, nonsense, and splice-site variation, with an extreme intolerance of variation noted specifically in cardiomyopathy-associated genes. Cardiomyopathy-associated genes specifically showed intolerance to structural variation as well.
Nonsense variants in genes that have been recurrently linked to hypertrophic cardiomyopathy were extremely rare, and our results suggest that such variants in these genes found on clinical testing have a very high likelihood of being pathogenic. In contrast, novel missense variants were present in at least 3% of individuals, and thus the careful interpretation of missense variants found on clinical genetic testing is critical. Current in silico classification schemes for predicting the pathogenicity of missense variants unfortunately have low power in classifying cardiomyopathy variants. Finally, we note a much higher than expected prevalence of variants with strong evidence for pathogenicity. This suggests that, using the power of genome sequencing, a new framework for heterogeneous Mendelian disorders such as inherited cardiomyopathies needs to be developed where variants found in patients and family members are viewed probabilistically on a spectrum from unlikely to likely contributors of variable individual magnitude. While this model challenges the classic ‘single variant in a single gene disorder’ view, it may also begin to explain some of the significant variability in disease expression found in family members with the same ‘causal’ variant.
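The population fractions quoted in the Discussion and Conclusion follow from simple ratios, and the short sketch below reproduces them from the counts stated in the text and in Table 2. The constant and function names are illustrative assumptions, and the carrier calculation relies on the same simplification the text makes, namely that no individual carries more than one of the counted variants.

```python
# Counts quoted in the article text and Table 2.
COHORT_SIZE = 5379          # individuals in the NHLBI ESP data set
PRIVATE_VARIANTS = 164      # private protein-altering variants in 8 HCM genes
PATHOGENIC_ALLELES = 9      # allele count of curated pathogenic variants
PCT_CODING_NON_OMIM = 32.0  # % of coding region covered by SVs, non-OMIM genes
PCT_CODING_CARDIOMYOPATHY = 15.0  # same measure, cardiomyopathy genes


def carrier_fraction(variant_count, cohort_size):
    """Fraction of individuals carrying a variant, assuming (as the text
    does) that no individual carries more than one counted variant."""
    return variant_count / cohort_size


def relative_reduction(reference, observed):
    """Proportional decrease of `observed` relative to `reference`."""
    return (reference - observed) / reference


novel = carrier_fraction(PRIVATE_VARIANTS, COHORT_SIZE)
pathogenic = carrier_fraction(PATHOGENIC_ALLELES, COHORT_SIZE)
sv_reduction = relative_reduction(PCT_CODING_NON_OMIM, PCT_CODING_CARDIOMYOPATHY)

print(f"private candidate variants: {novel:.1%}")        # 3.0%
print(f"curated pathogenic variants: {pathogenic:.2%}")  # 0.17%
print(f"reduction in coding region covered by SVs: {sv_reduction:.0%}")  # 53%
```

The second figure, roughly 0.17%, is what the text rounds to approximately 0.2%.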
Acknowledgments: The authors would like to thank the NHLBI GO Exome Sequencing Project and its ongoing studies, which produced and provided exome variant calls for comparison: the Lung GO Sequencing Project (HL-102923), the WHI Sequencing Project (HL-102924), the Broad GO Sequencing Project (HL-102925), the Seattle GO Sequencing Project (HL-102926) and the Heart GO Sequencing Project (HL-103010).

Funding Sources: Stephen Pan is supported by NIH grant 5T15LM007033. This work was also supported in part by NIH grants DP2OD004613, R01HL105993, UL1RR029890 (Euan Ashley).

Conflict of Interest Disclosures: Euan Ashley reports equity and consulting in relation to Personalis Inc.

References:

1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073.
2. MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335:823–828.
3. Li Y, Vinckenbosch N, Tian G, Huerta-Sanchez E, Jiang T, Jiang H, et al. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat Genet. 2010;42:969–972.
4. Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318:420–426.
5. Gersh BJ, Maron BJ, Bonow RO, Dearani JA, Fifer MA, Link MS, et al. 2011 ACCF/AHA guideline for the diagnosis and treatment of hypertrophic cardiomyopathy: executive summary: a report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice Guidelines. Circulation. 2011;124:2761–2796.
6. Ackerman MJ, Priori SG, Willems S, Berul C, Brugada R, Calkins H, et al. HRS/EHRA expert consensus statement on the state of genetic testing for the channelopathies and cardiomyopathies: this document was developed as a partnership between the Heart Rhythm Society (HRS) and the European Heart Rhythm Association (EHRA). Heart Rhythm. 2011;8:1308–1339.
7. Wheeler M, Pavlovic A, DeGoma E, Salisbury H, Brown C, Ashley EA. A New Era in Clinical Genetic Testing for Hypertrophic Cardiomyopathy. J Cardiovasc Transl Res. 2009;2:381–391.
8. Cirulli ET, Goldstein DB. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet. 2010;11:415–425.
9. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753.
10. Ashley EA, Butte AJ, Wheeler MT, Chen R, Klein TE, Dewey FE, et al. Clinical assessment incorporating a personal genome. Lancet. 2010;375:1525–1535.
11. Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, et al. Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence. PLoS Genet. 2011;7:e1002280.
12. Pan S, Dewey FE, Perez MV, Knowles JW, Chen R, Butte AJ, et al. Personalized Medicine and Cardiovascular Disease: From Genome to Bedside. Curr Cardiovasc Risk Rep. 2011;5:542–551.
13. Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, et al. Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes. Science. 2012;337:64–69.
14. Norton N, Robertson PD, Rieder MJ, Züchner S, Rampersaud E, Martin E, et al. Evaluating Pathogenicity of Rare Variants From Dilated Cardiomyopathy in the Exome Era. Circ Cardiovasc Genet. 2012;5:167–174.
15. Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet Genome Res. 2006;115:205–214.
16. Online Mendelian Inheritance in Man, OMIM®. Online Mendelian Inheritance in Man. Available at: http://omim.org. Accessed December 11, 2011.
17. Pruitt KD, Tatusova T, Klimke W, Maglott DR. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009;37:D32–D36.
18. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, et al. The UCSC genome browser database: update 2011. Nucleic Acids Res. 2011;39:D876–D882.
19. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164–e164.
20. Genomics of Cardiovascular Development, Adaptation, and Remodeling. NHLBI Program for Genomic Applications, Harvard Medical School. Available at: http://www.cardiogenomics.org. Accessed January 20, 2012.
21. Cooper GM. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15:901–913.
22. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050.
23. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7:248–249.
24. Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc. 2009;4:1073–1081.
25. Hughes AL, Yeager M. Natural selection at major histocompatibility complex loci of vertebrates. Annu Rev Genet. 1998;32:415–435.
26. Menashe I, Man O, Lancet D, Gilad Y. Different noses for different people. Nat Genet. 2003;34:143–144.
27. Bashyam MD, Purushotham G, Chaudhary AK, Rao KM, Acharya V, Mohammad TA, et al. A low prevalence of MYH7/MYBPC3 mutations among familial hypertrophic cardiomyopathy patients in India. Mol Cell Biochem. 2012;360:373–382.
28. Herman DS, Lam L, Taylor MRG, Wang L, Teekakirikul P, Christodoulou D, et al. Truncations of titin causing dilated cardiomyopathy. N Engl J Med. 2012;366:619–628.
29. Politano L, Nigro V, Nigro G, Petretta VR, Passamano L, Papparella S, et al. Development of cardiomyopathy in female carriers of Duchenne and Becker muscular dystrophies. JAMA. 1996;275:1335–1338.
30. Sylvius N, Tesson F, Gayet C, Charron P, Bénaïche A, Peuchmaurd M, et al. A new locus for autosomal dominant dilated cardiomyopathy identified on chromosome 6q12-q16. Am J Hum Genet. 2001;68:241–246.
31. Van Driest SL, Jaeger MA, Ommen SR, Will ML, Gersh BJ, Tajik AJ, et al. Comprehensive Analysis of the Beta-Myosin Heavy Chain Gene in 389 Unrelated Patients With Hypertrophic Cardiomyopathy. J Am Coll Cardiol. 2004;44:602–610.
32. Jin J-P, Chong SM. Localization of the two tropomyosin-binding sites of troponin T. Arch Biochem Biophys. 2010;500:144–150.
33. Palm T, Graboski S, Hitchcock-DeGregori SE, Greenfield NJ. Disease-causing mutations in cardiac troponin T: identification of a critical tropomyosin-binding region. Biophys J. 2001;81:2827–2837.
34. Andersen PS, Havndrup O, Hougs L, Sørensen KM, Jensen M, Larsen LA, et al. Diagnostic yield, interpretation, and clinical utility of mutation screening of sarcomere encoding genes in Danish hypertrophic cardiomyopathy patients and relatives. Hum Mutat. 2009;30:363–370.
35. Richard P, Charron P, Carrier L, Ledeuil C, Cheav T, Pichereau C, et al. Hypertrophic cardiomyopathy: distribution of disease genes, spectrum of mutations, and implications for a molecular diagnosis strategy. Circulation. 2003;107:2227–2232.
36. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL. The human disease network. Proc Natl Acad Sci U S A. 2007;104:8685–8690.
37. Erdmann J, Daehmlow S, Wischke S, Senyuva M, Werner U, Raible J, et al. Mutation spectrum in a large cohort of unrelated consecutive patients with hypertrophic cardiomyopathy. Clin Genet. 2003;64:339–349.
38. Maron BJ, Gardin JM, Flack JM, Gidding SS, Kurosaki TT, Bild DE. Prevalence of hypertrophic cardiomyopathy in a general population of young adults: echocardiographic analysis of 4111 subjects in the CARDIA study. Circulation. 1995;92:785–789.
39. Zou Y, Song L, Wang Z, Ma A, Liu T, Gu H, et al. Prevalence of idiopathic hypertrophic cardiomyopathy in China: a population-based echocardiographic analysis of 8080 adults. Am J Med. 2004;116:14–18.
40. Maron BJ. Hypertrophic cardiomyopathy: a systematic review. JAMA. 2002;287:1308–1320.
41. Refsgaard L, Holst AG, Sadjadieh G, Haunsø S, Nielsen JB, Olesen MS. High prevalence of genetic variants previously associated with LQT syndrome in new exome data. Eur J Hum Genet. 2012;20:905–908.
42. Holm H, Gudbjartsson DF, Sulem P, Masson G, Helgadottir HT, Zanon C, et al. A rare variant in MYH6 is associated with high risk of sick sinus syndrome. Nat Genet. 2011;43:316–320.
43. Lupski JR, Belmont JW, Boerwinkle E, Gibbs RA. Clan genomics and the complex architecture of human disease. Cell. 2011;147:32–43.

Table 1. Average rates of variation by subtype across genes without Mendelian disease association (non-OMIM), genes with annotated Mendelian disease association (OMIM), and genes associated with inherited cardiomyopathies. For synonymous, missense, and nonsense variant rates, units are counts per 1x10^6 base pairs of coding region per chromosome. For rates of splice site variants, units are counts per exon per chromosome. 1st Qu. = 1st quartile, 3rd Qu. = 3rd quartile. P-values were computed using the non-parametric Kruskal-Wallis test for analysis of variance.

Variant type   Statistic   Non-OMIM Genes   OMIM Genes   Cardiomyopathy Genes   p-value
Synonymous     1st Qu.     4.2              5.2          4.8                    2.7x10^-3
               Median      30.1             35.2         44.8
               Mean        90.8             95.1         116.4
               3rd Qu.     122.5            121.1        151.6
Missense       1st Qu.     2.2              2.5          1.5                    1.8x10^-2
               Median      13.5             16.2         11.4
               Mean        85.5             76.6         27.6
               3rd Qu.     89.9             86.2         46.8
Nonsense       1st Qu.     0.000            0.000        0.000                  5.7x10^-7
               Median      0.000            0.000        0.000
               Mean        0.794            0.266        0.011
               3rd Qu.     0.048            0.037        0.012
Splice         1st Qu.     0.000            0.000        0.000                  8.8x10^-8
               Median      0.000            0.000        0.000
               Mean        0.094            0.072        0.004
               3rd Qu.     0.000            0.003        0.000

Table 2. Average counts of structural variants (SVs) and percent of transcript affected by known SVs in the Database of Genomic Variants. Non-OMIM = genes without Mendelian disease association. OMIM = genes with known Mendelian disease association. Numbers are averages of per gene counts or percents for all genes within that classification. *Denotes statistically significant difference between cardiomyopathy associated genes and genes without Mendelian disease association (p=0.02 by Wilcoxon rank-sum test).

Measure                                                   Non-OMIM   OMIM   Cardiomyopathy
Average Number of SVs Affecting Gene                      2.8        3.1    3.2
Average Number of SVs Affecting Gene in Coding Regions    1.5        1.6    1.2
Average % of Coding Regions Affected by Known SVs         32%        31%    15%*

Figure Legends:

Figure 1. Plot of missense and nonsense variant rates for all known human gene transcripts calculated from the exomes of 5,379 persons in the NHLBI Exome Sequencing Project. Non-OMIM = genes without known association with a Mendelian disease. OMIM = genes with a known association with a Mendelian disease in the Online Mendelian Inheritance in Man (OMIM) database. Cardiomyopathy = genes with known association with a familial cardiomyopathy. Variant rates are in units of counts per 1,000 base pairs of coding region per transcript per chromosome.

Figure 2. Location of nonsense variants found in the large sarcomeric gene titin (TTN). The structure of 5 known isoforms is displayed at the top of the figure oriented by location on chromosome 2, with the 5’ end of the transcript on the right and the 3’ end on the left. Red arrows depict exons in which clusters of nonsense variants were noted.

Figure 3. Plot of minor allele frequency of non-synonymous coding variants from the NHLBI ESP data set over the distribution of the known exons of A) MYH7 (chromosome 14), B) MYBPC3 (chromosome 11), and C) TNNT2 (chromosome 1). Red arrows denote locations of pathogenic variants from a curated list from clinical experience at our institution and literature reports. X-axis is genomic coordinates in megabases. 5’ and 3’ refer to start and end of transcript, respectively.

Figure 4. Relative distribution of missense variants from NHLBI ESP and a curated pathogenic variant list in the genes MYH7, MYBPC3, and TNNT2, as scored by A) GERP, B) PhastCons, C) SIFT, and D) Polyphen-2. Grey bars denote variants from NHLBI ESP data, black bars denote variants from the pathogenic list. For panel C, 1 - SIFT score was used to preserve consistency between panels, with far right predicted to be more pathogenic and far left predicted to be less pathogenic.

Figure 5. Receiver operator curves for A) GERP, B) PhastCons, C) SIFT, and D) Polyphen-2 for the classification of collected missense variants from the NHLBI ESP data set and a curated pathogenic missense variant list in the genes MYH7, MYBPC3, and TNNT2. AUC = area under the curve.
Cardiac Structural and Sarcomere Genes Associated with Cardiomyopathy Exhibit Marked Intolerance of Genetic Variation
Stephen Pan, Colleen A. Caleshu, Kyla E. Dunn, Marcia J. Foti, Maura K. Moran, Oretunlewa Soyinka and Euan A. Ashley

Circ Cardiovasc Genet. published online October 16, 2012.
Circulation: Cardiovascular Genetics is published by the American Heart Association, 7272 Greenville Avenue, Dallas, TX 75231. Copyright © 2012 American Heart Association, Inc. All rights reserved. Print ISSN: 1942-325X. Online ISSN: 1942-3268.

The online version of this article, along with updated information and services, is located on the World Wide Web at: http://circgenetics.ahajournals.org/content/early/2012/10/16/CIRCGENETICS.112.963421
Data Supplement (unedited) at: http://circgenetics.ahajournals.org/content/suppl/2012/10/16/CIRCGENETICS.112.963421.DC1.html
Supplemental Material

Cardiac Structural and Sarcomere Genes Associated with Cardiomyopathy Exhibit Marked Intolerance of Genetic Variation

Stephen Pan, MD, MS1,2; Colleen A. Caleshu, ScM, CGC1; Kyla E. Dunn, MS, CGC1; Marcia J. Foti1; Maura K. Moran, BA1; Oretunlewa Soyinka1; Euan A. Ashley, MRCP, DPhil1

CIRCCVG/2012/963421

ABCC9    DSG2     MYH7     PSEN2     TNNI3
ACTC1    DSP      MYL2     RBM20     TNNT2
ACTN2    EYA4     MYL3     SCN5A     TPM1
BAG3     FKTN     MYLK2    SDHA      TTN
CALR3    JPH2     MYO6     SGCD      TTR
CAV3     LAMP2    MYOZ2    SLC25A4   VCL
COX15    LDB3     NEXN     TAZ
CSRP3    LMNA     PLN      TCAP
DES      MYBPC3   PRKAG2   TMPO
DMD      MYH6     PSEN1    TNNC1

Supplemental Table I: List of genes determined to have association with cardiomyopathy. Associations were noted from the Online Mendelian Inheritance in Man (OMIM) database or through literature review.
Rank  Missense OMIM  Missense Non-OMIM  Nonsense OMIM  Nonsense Non-OMIM  Splice OMIM  Splice Non-OMIM
1     HLA-DQB1       DEFB108B           FUT2           OR10X1             OAS1         PATE4
2     HLA-A          C6orf10            NPSR1          KRTAP13-2          MUC7         ZNF419
3     HLA-B          OR51B6             DYX1C1         OR5AR1             TMEM216      KLK12
4     HLA-C          OR52E6             C17orf107      MS4A12             APOL4        GUCA1C
5     HLA-DQA1       KRTAP12-2          FCGR2A         PLA2G2C            CYP2D6       RNASE9
6     PRR4           OR2W3              AMPD1          CENPM              AGL          C14orf105
7     KIR3DL1        OR11H6             CC2D2A         OR4X1              NPHP4        HTR3D
8     TAS2R38        APOBEC3H           POMT1          OR51Q1             UCP3         AVPI1
9     ABO            OR5B3              PRODH          FAM187B            DTNBP1       CEACAM21
10    BTNL2          OR13C5             DNAH11         OR6C74             XPNPEP3      UGT2B10
11    CYBA           PTX4               CLEC7A         UBE2NL             CFHR1        CFLAR
12    HLA-DRB1       TAS2R42            LPL            OVCH2              DAOA         NIPAL2
13    SPINK5         OR5H6              CYP2A6         OR2L8              TRPV4        C13orf26
14    NAT2           OR51Q1             CD36           MAGEB16            NPC2         GREB1
15    GYPB           OR12D2             CDH15          TAS2R46            BTNL2        GSDMB
16    APOL4          RAET1E             TRPM1          OR1B1              ANKK1        C15orf57
17    GP6            OR5R1              COL9A2         OR4C16             LRTOMT       XRCC4
18    SLC39A4        OR52B6             TLR5           SEC22B             PGM1         GSTT2
19    HRG            OR9G1              COQ2           CSAG1              LPA          TCTEX1D1
20    CYP2D6         OR8D4              KNG1           OR52N4             CES1         SLC22A24

Supplemental Table 2: Top 20 genes in each category with highest variant rate by subtype.
Gene    Variant                               Classification               Cases (a)  Segregation (b)  Controls (c)  Experimental data  Frequency in NHLBI ESP
MYBPC3  IVS8+1G>A                             likely disease causing       2          moderate         200           moderate
MYBPC3  IVS11-2A>G (c.927-2A>G)               very likely disease causing  7          strong           300           moderate
MYBPC3  IVS27+1G>A (c.2905+1G>A)              likely disease causing       4          weak             250           moderate
MYBPC3  IVS30+2T>G (c.3330+2T>G)              likely disease causing       ≥10        weak             250           moderate
MYBPC3  p.Val219Leu (c.655G>C)                likely disease causing       6          weak             1200          n/a
MYBPC3  p.Arg502Gln (c.1505G>A)               very likely disease causing  9          moderate         418           n/a
MYBPC3  p.Arg502Trp (c.1504C>T)               very likely disease causing  37         strong           395           n/a                1
MYBPC3  p.Glu542Gln (c.1624G>C)               likely disease causing       11         weak             650           moderate           1
MYBPC3  p.Trp792Arg (c.2374T>C)               likely disease causing       9          weak             400           n/a
MYBPC3  p.Trp792ValfsX41 (c.2373dupG)         very likely disease causing  ≥14        strong           700           weak
MYBPC3  p.Pro955ArgfsX95 (c.2864_2865delCT)   very likely disease causing  4          strong           300           n/a
MYH7    p.Arg169Gly (c.505A>G)                likely disease causing       1          strong           n/a           n/a
MYH7    p.Ala199Val (c.596C>T)                likely disease causing       1          strong           400           n/a
MYH7    p.Arg204His (c.611G>A)                likely disease causing       4          n/a              300           n/a
MYH7    p.Arg249Gln (c.746A>G)                very likely disease causing  11         strong           211           moderate
MYH7    p.Ile263Thr (c.788T>C)                likely disease causing       5          weak             200           n/a
MYH7    p.Arg403Gln (c.1208G>A)               very likely disease causing  12         strong           100           strong
MYH7    p.Arg403Leu (c.1208G>T)               likely disease causing       3          strong           150           n/a
MYH7    p.Arg403Trp (c.1207C>T)               very likely disease causing  11         strong           300           n/a
MYH7    p.Arg453Cys (c.1357C>T)               very likely disease causing  14         strong           502           n/a
MYH7    p.Arg453His (c.1358G>A)               likely disease causing       3          n/a              n/a           n/a
MYH7    p.Val606Met (c.1816G>A)               very likely disease causing  ≥17        strong           470           strong
MYH7    p.Arg663His (c.1988G>A)               very likely disease causing  19         strong           420           n/a
MYH7    p.Arg719Gln (c.2156G>A)               likely disease causing       11         moderate         1132          n/a
MYH7    p.Gly716Arg (c.2146G>A)               very likely disease causing  9          strong           400           n/a
MYH7    p.Arg723Cys (c.2167C>T)               very likely disease causing  5          strong           440           n/a
MYH7    p.Ile736Thr (c.2207T>C)               likely disease causing       8          weak             496           weak
MYH7    p.Gly741Arg (c.2221G>A)               likely disease causing       8          weak             220           n/a
MYH7    p.Gly741Trp (c.2221G>T)               likely disease causing       3          moderate         96            weak
MYH7    p.Arg870His (c.2609G>A)               very likely disease causing  11         strong           370           moderate
MYH7    p.Leu908Val (c.2722C>G)               very likely disease causing  16         strong           841           moderate
MYH7    p.Glu924Lys (c.2770G>A)               likely disease causing       6          weak             890           moderate
MYH7    p.Glu1356Lys (c.4066G>A)              likely disease causing       5          weak             1096          moderate
MYH7    p.Arg1712Gln (c.5135G>A)              likely disease causing       4          weak             200           n/a                1
MYH7    p.Ser1776Gly (c.5326A>G)              likely disease causing       6          weak             200           n/a
MYH7    p.Lys1459Asn (c.4377G>T)              likely disease causing       4          weak             990           n/a
TNNT2   p.Ile79Asn (c.236T>A)                 very likely disease causing  4          strong           390           strong
TNNT2   p.Arg92Gln (c.275G>A)                 very likely disease causing  6          strong           530           strong
TNNT2   p.Arg92Leu (c.275G>T)                 very likely disease causing  3          moderate         240           strong
TNNT2   p.Arg92Trp (c.274C>T)                 very likely disease causing  16         strong           690           strong
TNNT2   p.Arg94Leu (c.281G>T)                 likely disease causing       3          weak             890           weak
TNNT2   p.Phe110Ile (c.328T>A)                very likely disease causing  14         strong           460           strong
TNNT2   p.Phe110Leu (c.328T>C)                likely disease causing       2          weak             250           n/a
TNNT2   p.Arg130Cys (c.388C>T)                likely disease causing       5          moderate         370           weak
TNNT2   p.Arg173Trp (c.517C>T)                likely disease causing       3          moderate         335           n/a
TNNT2   p.Arg278Cys (c.832C>T)                likely disease causing       13         weak             600           moderate           6

Supplemental Table 3. Manually curated high confidence pathogenic variants.
a: total number of unrelated individuals with the variant from published data, our clinical cohort, and clinical laboratory data provided in genetic test report.
b: strength of segregation data based on largest number of affected individuals with the variant within a single kindred: >5, strong; 4-5, moderate; 2-3, weak.
c: total number of controls the variant was not observed in, from published data and clinical laboratory data.

MYBPC3 (NM_000256)
Exon   Pathogenic  ESP
1      0           1
2      0           7
4      0           3
5      0           8
6      1           4
7      0           1
8      0           3
11     0           1
12     0           2
13     0           1
14     0           5
15     0           4
16     3           7
17     0           2
18     0           5
20     0           2
21     0           1
22     0           3
23     1           1
24     0           4
25     0           5
26     0           5
27     0           1
28     0           9
29     0           2
30     0           9
31     0           1
32     0           4
33     0           1
Total  5           102
p-value 0.3833

MYH7 (NM_000257)
Exon   Pathogenic  ESP
3      0           2
6      1           0
7      2           0
9      2           0
11     0           2
13     3           0
14     2           2
16     1           3
17     0           1
18     1           1
19     2           0
20     4           2
21     0           4
22     1           3
23     2           3
24     0           2
26     0           2
30     1           5
31     0           4
32     1           2
34     0           6
35     1           2
36     0           3
37     1           7
38     0           3
39     0           1
40     0           1
Total  25          61
p-value 0.001794

TNNT2 (NM_000364)
Exon   Pathogenic  ESP
2      0           1
6      0           2
8      0           1
9      1           2
10     6           0
11     1           2
12     1           1
13     0           3
14     0           4
15     0           1
16     1           2
Total  10          19
p-value 0.01289

Supplemental Table 4: Distribution of missense variants amongst the exons of MYBPC3, MYH7, and TNNT2. The canonical isoform was used in the case of multiple isoforms. P-values represent results of a Fisher’s Exact Test for independence of distributions between pathogenic variants and the variants from the Exome Sequencing Project (ESP) dataset.
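The Fisher's Exact Test named in the caption of Supplemental Table 4 can be illustrated for an exon-by-source (r x 2) table with the minimal Python sketch below. It enumerates every table sharing the observed row and column totals and sums the probabilities of those no more likely than the observed table. This two-sided convention is an assumption: the software and settings behind the reported p-values are not stated in this section, so the sketch is not guaranteed to reproduce them. The function name `fisher_exact_rx2` is hypothetical. The TNNT2 counts are taken from the table.

```python
from math import comb


def fisher_exact_rx2(rows):
    """Two-sided exact p-value for an r x 2 table given as [(a_i, b_i), ...]."""
    row_totals = [a + b for a, b in rows]
    col_a = sum(a for a, _ in rows)           # first-column total
    denom = comb(sum(row_totals), col_a)      # number of equally likely allocations

    # Integer weight of the observed table: prod C(row_total_i, a_i).
    observed = 1
    for (a, _), r in zip(rows, row_totals):
        observed *= comb(r, a)

    # suffix[i] = total count available in rows i.. (used to prune the search).
    suffix = [0] * (len(rows) + 1)
    for i in range(len(rows) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + row_totals[i]

    total = 0

    def walk(i, remaining, weight):
        nonlocal total
        if i == len(rows):
            if weight <= observed:            # table no more likely than observed
                total += weight
            return
        lo = max(0, remaining - suffix[i + 1])
        hi = min(row_totals[i], remaining)
        for a in range(lo, hi + 1):
            walk(i + 1, remaining - a, weight * comb(row_totals[i], a))

    walk(0, col_a, 1)
    return total / denom


# (pathogenic, ESP) missense variant counts per TNNT2 exon, from Supplemental Table 4.
tnnt2 = [(0, 1), (0, 2), (0, 1), (1, 2), (6, 0), (1, 2),
         (1, 1), (0, 3), (0, 4), (0, 1), (1, 2)]
p_tnnt2 = fisher_exact_rx2(tnnt2)
print(f"TNNT2 exact p-value: {p_tnnt2:.5f}")
```

Working with integer weights rather than floating-point probabilities avoids tolerance issues when comparing each enumerated table against the observed one. Full enumeration is practical for tables of this size but grows quickly with larger counts.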