Bio-session package - Social Science Genetic Association Consortium

SSGAC JUNE BIOLOGICAL ANNOTATION
Introduction
Identifying and Resolving Targets
• Limitations of current GWAS
Visscher, Peter M., et al. "Five years of GWAS discovery." The American Journal of Human Genetics.
• Functional annotation - Tõnu
HaploReg: www.broadinstitute.org/mammals/haploreg/haploreg.php
SNPinfo: http://snpinfo.niehs.nih.gov
SNPNexus: http://www.snp-nexus.org
RegulomeDB: http://regulome.stanford.edu
• eQTL - Tõnu
GTEX Consortium: https://commonfund.nih.gov/GTEx
Fu, Jingyuan, et al. "Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene expression." PLoS Genetics.
RegulomeDB: http://regulome.stanford.edu
• 1000 Genomes and Sequencing - Tõnu
http://1000genomes.org and 1000 Genomes Consortium papers
• Gene-based tests - Jaime
http://gump.qimr.edu.au/VEGAS
Liu, Jimmy Z., et al. "A versatile gene-based test for genome-wide association studies." American Journal of Human Genetics.
Describing Targets
• Identifying previous associations - Jaime
human: http://www.genome.gov/gwastudies
mouse: http://www.informatics.jax.org
zebrafish: http://zfin.org
• Pathway analysis - Jaime
http://atgu.mgh.harvard.edu/inrich
Lee, Phil H., et al. "INRICH: interval-based enrichment analysis for genome-wide association studies." Bioinformatics.
• Gene function prediction - Lude
• Gene prioritization - Lude
• Analysis of chromatin marks - Gosia
https://www.broadinstitute.org/mpg/epigwas
Trynka, Gosia, et al. "Chromatin marks identify critical cell types for fine mapping complex trait variants." Nature Genetics.
ENCODE links: http://www.nature.com/encode/#/threads; http://genome.ucsc.edu/ENCODE; http://www.roadmapepigenomics.org
Summary / Discussion
Tõnu Esko: tesko@broadinstitute.org
Lude Franke: lude@ludesign.nl
Jaime Derringer: jaimelane@gmail.com
Gosia Trynka: gosia@broadinstitute.org
REVIEW
Five Years of GWAS Discovery
Peter M. Visscher,1,2,* Matthew A. Brown,1 Mark I. McCarthy,3,4 and Jian Yang5
The past five years have seen many scientific and biological discoveries made through the experimental design of genome-wide association studies (GWASs). These studies were aimed at detecting
variants at genomic loci that are associated with complex traits
in the population and, in particular, at detecting associations
between common single-nucleotide polymorphisms (SNPs) and
common diseases such as heart disease, diabetes, auto-immune
diseases, and psychiatric disorders. We start by giving a number
of quotes from scientists and journalists about perceived problems
with GWASs. We will then briefly give the history of GWASs and
focus on the discoveries made through this experimental design,
what those discoveries tell us and do not tell us about the genetics
and biology of complex traits, and what immediate utility has
come out of these studies. Rather than giving an exhaustive review
of all reported findings for all diseases and other complex traits, we
focus on the results for auto-immune diseases and metabolic
diseases. We return to the perceived failure or disappointment
about GWASs in the concluding section.
Introduction: Have GWASs Been a Failure?
In the past five years, genome-wide association studies
(GWASs) have led to many scientific discoveries, and yet
at the same time, many people have pointed to various
problems and perceived failures of this experimental
design. Let us begin by considering a number of criticisms
that have been made against GWASs. We do not list these
quotes to discredit any of the scientists or journalists
involved, nor to deliberately cite them out of context.
Rather, they serve to confirm that the points we discuss
in this review are related to beliefs held by a significant
number of scientific commentators and therefore warrant
consideration.
From an interview with Sir Alec Jeffreys, ESHG Award
Lecturer 2010:
‘‘One of the great hopes for GWAS was that, in the
same way that huge numbers of Mendelian disorders
were pinned down at the DNA level and the gene
and mutations involved identified, it would be
possible to simply extrapolate from single gene disorders to complex multigenic disorders. That really
hasn’t happened. Proponents will argue that it has
worked and that all sorts of fascinating genes that
predispose to or protect against diabetes or breast
cancer, for example, have been identified, but the
fact remains that the bulk of the heritability in these
conditions cannot be ascribed to loci that have
emerged from GWAS, which clearly isn’t going to
be the answer to everything.’’
From McClellan and King, Cell 2010 (ref. 1):
‘‘To date, genome-wide association studies (GWAS)
have published hundreds of common variants
whose allele frequencies are statistically correlated
with various illnesses and traits. However, the vast
majority of such variants have no established biological relevance to disease or clinical utility for prognosis or treatment.’’
‘‘An odds ratio of 3.0, or even of 2.0 depending on
population allele frequencies, would be robust to
such population stratification. However, odds ratios
of the magnitude generally detected by GWAS
(<1.5) can frequently be explained by cryptic population stratification, regardless of the p value associated with them.’’
‘‘More generally, it is now clear that common risk
variants fail to explain the vast majority of genetic
heritability for any human disease, either individually or collectively (Manolio et al., 2009).’’
‘‘The general failure to confirm common risk variants is not due to a failure to carry out GWAS
properly. The problem is underlying biology, not
the operationalization of study design. The common
disease–common variant model has been the
primary focus of human genomics over the last
decade. Numerous international collaborative efforts
representing hundreds of important human diseases
and traits have been carried out with large well-characterized cohorts of cases and controls. If common
alleles influenced common diseases, many would
have been found by now. The issue is not how to
develop still larger studies, or how to parse the data
still further, but rather whether the common
disease–common variant hypothesis has now been
tested and found not to apply to most complex
human diseases.’’
From Nicholas Wade in the New York Times, March 20
2011:
‘‘More common diseases, like cancer, are thought to
be caused by mutations in several genes, and finding
the causes was the principal goal of the $3 billion
1 University of Queensland Diamantina Institute, Princess Alexandra Hospital, Brisbane, Queensland 4102, Australia
2 The Queensland Brain Institute, The University of Queensland, Brisbane, Queensland 4072, Australia
3 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
4 Oxford Centre for Diabetes, Endocrinology and Metabolism, Churchill Hospital Old Road, Headington Oxford OX3 7LJ, UK
5 Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia
*Correspondence: peter.visscher@uq.edu.au
DOI 10.1016/j.ajhg.2011.11.029. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 7–24, January 13, 2012
human genome project. To that end, medical geneticists have invested heavily over the last eight years
in an alluring shortcut. But the shortcut was based
on a premise that is turning out to be incorrect. Scientists thought the mutations that caused common
diseases would themselves be common. So they first
identified the common mutations in the human
population in a $100 million project called the
HapMap. Then they compared patients’ genomes
with those of healthy genomes. The comparisons
relied on ingenious devices called SNP chips, which
scan just a tiny portion of the genome. (SNP,
pronounced ‘‘snip,’’ stands for single nucleotide
polymorphism.) These projects, called genome-wide
association studies, each cost around $10 million or
more. The results of this costly international exercise
have been disappointing. About 2,000 sites on the
human genome have been statistically linked with
various diseases, but in many cases the sites are
not inside working genes, suggesting there may be
some conceptual flaw in the statistics. And in most
diseases the culprit DNA was linked to only a small
portion of all the cases of the disease. It seemed that
natural selection has weeded out any disease-causing
mutation before it becomes common.’’
From Tim Crow, Molecular Psychiatry 2011 (ref. 2):
‘‘There comes a point at which the genetic skeptic
can be pardoned the suggestion that if the genes
are so small and so multiple, what they are hardly
matters, the dividing line between polygenes and
no genes is of little practical consequence. Have we
reached this point’’?
From a commentary article by Jonathan Latham, on
guardian.co.uk, 17 April 2011:
‘‘Among all the genetic findings for common
illnesses, such as heart disease, cancer and mental
illnesses, only a handful are of genuine significance
for human health. Faulty genes rarely cause, or even
mildly predispose us, to disease, and as a consequence
the science of human genetics is in deep crisis.
Since the Collins paper [Manolio et al. 2009 (ref. 3)] was
published nothing has happened to change that
conclusion. It now seems that the original twin-study critics were more right than they imagined.
The most likely explanation for why genes for
common diseases have not been found is that, with
few exceptions, they do not exist.’’
These quotes raise a number of different issues about
the methodology, research outcomes, and utility of the
research findings. The pertinent points made in these
quotes are:
(1) GWASs are founded on a flawed assumption that
genetics plays an important role in the risk to
common diseases;
(2) GWASs have been disappointing in not explaining
more genetic variation in the population;
(3) GWASs have not delivered meaningful, biologically
relevant knowledge or results of clinical or any
other utility; and
(4) GWAS results are spurious.
In this review we will briefly give the history of GWASs
and then focus on the discoveries made through this
experimental design, what those discoveries tell us and
do not tell us about the genetics and biology of complex
traits, and what immediate utility has come out of these
studies. We will focus on the results for auto-immune
diseases and metabolic diseases, although there have
been important findings for other diseases and complex
traits. In the concluding section, we will again consider
the perceived failure or disappointment of GWASs.
What Are GWASs, and How Did We Get There?
Attempts to use linkage analysis to map genomic loci that
have an effect on disease or other complex traits have
been ubiquitous in the last two decades. Gene mapping
by linkage relies on the cosegregation of causal variants
with marker alleles within pedigrees. We define and
discuss what we mean by ‘‘causal’’ in Box 1. Because the
number of recombination events per meiosis is relatively
small, tagging a causal variant requires only a few genetic
markers per chromosome. The downside of the small
number of recombination events is that the mapping
resolution, i.e., how close to the causal variant one can
get through linked markers, is typically low. Linkage
mapping has been extremely successful in mapping genes
and gene variants affecting Mendelian traits (e.g., single-gene disorders).4 Mapping loci underlying common
diseases and, in particular, identifying causative mutations have had much less success. There are many reasons
for the failure of linkage analyses to reliably identify
complex-trait loci in human pedigrees. One reason is
that the effect sizes (‘‘penetrance’’) of individual causal
variants are too small to allow detection via cosegregation
within pedigrees.
GWASs are based upon the principle of linkage disequilibrium (LD) at the population level. LD is the nonrandom
association between alleles at different loci. It is created by
evolutionary forces such as mutation, drift, and selection
and is broken down by recombination.5 Generally, loci
that are physically close together exhibit stronger LD
than loci that are farther apart on a chromosome. The
larger the (effective) population size, the weaker the LD
for a given distance.6 (Linkage analysis exploits the large
LD within pedigrees.) The genomic distance at which LD
decays determines how many genetic markers are needed
to ‘‘tag’’ a haplotype, and the number of such tagging
markers is much smaller than the total number of
segregating variants in the population. For example,
a selection of approximately 500,000 common SNPs in
the human genome is sufficient to tag common variation
in non-African populations, even though the total number
of common SNPs exceeds 10 million.7
Geneticists realized some time ago that they could
exploit population-based LD to map genes. For example,
Bodmer suggested in 1986 that fine-mapping using population association could lead to closer linkage between
a causative mutation and a linked marker.82 However,
fine-mapping still relied on having an initial genomic location that is obtained from linkage analysis in family
studies. What if we do not have any prior information
on genomic loci or, alternatively, we deliberately want an
unbiased scan of the genome? In a landmark paper, Risch
and Merikangas83 showed that performing an association
scan involving one million variants in the genome and
a sample of unrelated individuals could be more powerful
than performing a linkage analysis with a few hundred
markers. It took only 10 years before this theoretical design
became reality. What was needed was the discovery (accelerated by the sequencing of the human genome) of
hundreds of thousands of single-nucleotide variants, the
quantification of the correlation (LD) structure of those
markers in the human genome, and the ability to accurately genotype hundreds of thousands of markers in an
automated and affordable manner. The LD structure was
investigated in the HapMap project,7 and the outcome
was a list of tag SNPs that captured most of the common
genomic variation in a number of human populations.
Concurrently, commercial companies produced dense
SNP arrays that could genotype many markers in a single
assay. The technological advances together with biobanks
of either population cohorts or case-control samples facilitated the ability to conduct GWASs.
Although GWASs are unbiased with respect to prior biological knowledge (or prior beliefs) and with respect to
genome location, they are not unbiased in terms of what
is detectable. GWASs rely on LD between genotyped
SNPs and ungenotyped causal variants. The strength of
statistical association between alleles at two loci in the
genome strongly depends on their allele frequencies,
such that a rare variant (say, one with a frequency <0.01)
will be in low LD (as measured by r²) with a nearby
common variant, even if they map to the same recombination interval.84 But the SNPs that are on the SNP chips
have been selected to be common (most have a minor
allele frequency >0.05). Therefore, GWASs are by design
powered to detect association with causal variants that
are relatively common in the population. Is it realistic to
assume common causal variants for disease segregate in
the population? This is discussed in Box 2.
Box 1. What Is a Causal Variant?
New mutations that contribute to an increase or
decrease in risk to disease arise in populations all
the time. Some of these mutations can reach an
appreciable frequency in the population, for
example by random drift or by natural selection.
As discussed in the main text, these mutations will
be associated with other variants in the genome
through LD. Such associations will include those
with SNPs that are genotyped on ‘‘SNP chips.’’
Because there are many more segregating variants
in the population than those genotyped in GWASs,
it is unlikely, but not impossible, that a mutation is
genotyped itself, and so its effect usually will be detected through an association with a genotyped
variant. This genotyped variant can be robustly associated with disease in multiple samples from the
same population, or even across populations, but it
is not the mutation that causes variation in risk.
The results from GWASs have shown that variants
at many genetic loci in the genome are associated
with disease, and these also reflect many ancestral
mutations with an effect on susceptibility to disease.
Therefore, the effect size (in terms of increasing or
decreasing the absolute probability of disease) is,
on average, small, and individual variants are
neither necessary nor sufficient to cause disease.
Herein lies the problem of defining ‘‘causal’’: How
do we prove that a particular mutation causes the
observed effect on variation in the population?
Engineering the same mutation in a cell or animal
model might give a relevant phenotype, but that is
not a proof. The mutation can have a direct effect
on gene expression in human tissues or be functional in another way, but that doesn’t prove it has
a causal effect on disease risk. Operationally, in this
review what we mean by ‘‘causal variant’’ is an
(unknown) variant that has a direct or indirect functional effect on disease risk, rather than a variant
that is associated with disease risk through LD,
even if we don’t have the tools available at present
to prove causality beyond reasonable doubt. Hence,
it is the variant that causes the observed association
signal.
(Nearly) Five Years of Discovery
Although the first results from a GWAS were reported in
2005 (ref. 8) and 2006 (ref. 9), we take the 2007 Wellcome Trust Case
Control Consortium (WTCCC) paper in Nature (ref. 10) as a starting point. The reason for this is that the WTCCC study was
the first large, well-designed GWAS for complex diseases to
employ a SNP chip that had good coverage of the genome.
There are many ways to summarize the discoveries based
on GWASs in the last five years. We have tried to separate
the discoveries quantitatively and to focus on the biology.
There are now well over 2000 loci that are significantly and
robustly associated with one or more complex traits (see
GWAS catalog in Web Resources), as shown in Figure 1.
The vast majority of the loci identified are new, i.e., before
2007 their association with disease or other complex traits
was not known. Essentially, these are 2000 new biological
leads. The number of loci identified per complex trait
varies substantially, from a handful for psychiatric diseases
to a hundred or more for inflammatory bowel disease
(IBD1 [MIM 266600], including Crohn disease [CD]11
and ulcerative colitis [UC]12) and stature.13 Importantly,
the number of discovered variants is strongly correlated
with experimental sample size (Figure 2), which predicts
that an ever-increasing discovery sample size will increase
the number of discovered variants: very roughly, after
a minimum sample-size threshold below which no variants are detected is reached, a doubling in sample size leads
to a doubling of the number of associated variants discovered. The proportion of genetic variation explained by
significantly associated SNPs is usually low (typically less
than 10%) for many complex traits, but for diseases such
as CD and multiple sclerosis (MS [MIM 126200]), and for
quantitative traits such as height and lipid traits, between
10% and 20% of genetic variance has been accounted for
(Table 1). In comparison to the pre-GWAS era, the proportion of genetic variation accounted for by newly discovered variants that are segregating in the population is large.
Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web
Resources). Only the top SNPs representing loci with association
p values < 5 × 10⁻⁸ are included, and so that multiple counting
is avoided, SNPs identified for the same traits with LD r² > 0.8 estimated from the entire HapMap samples are excluded.
Box 2. The CDCV Hypothesis
Currently, the allele frequency of variants that
contribute to cause common disease is a subject of
some debate.85,86 The common disease-common
variant (CDCV) hypothesis is sometimes said to be
one side of this debate; the other side holds that
disease-causing alleles are typically rare. But what
is the precise ‘‘hypothesis’’ in the CDCV hypothesis?
We tried to find the origin of the CDCV hypothesis.
Many researchers cite either Lander87 or Risch and
Merikangas.83 We will add Chakravarti88 and Reich
and Lander89 as key studies. Lander87 noted from
the then-available data that there is a limited diversity in coding regions at genes, in that most variants
are very rare, and therefore the effective number of
alleles is small. In addition, he provided ‘‘tantalizing
examples’’ of common alleles with large effects (for
example, such alleles include APOE [MIM 107741],
MTHFR [MIM 607093], and ACE [MIM 106180]).
Reich and Lander89 presented a theoretical population-genetics model that predicted a relatively
simple spectrum of the frequency of disease risk
alleles at a particular disease locus. They (re)phrased
the CDCV hypothesis as the prediction that the expected allelic identity is high for those disease loci
that are responsible for most of the population risk
for disease. These studies did not appear to make
any prediction about the number of disease loci or,
therefore, about the effect size. What the authors
stated was that if a disease was common, there was
likely to be one disease-causing allele that was
much more common than all the other disease-causing alleles at the same locus.87,89
Risch and Merikangas83 quantified two important
points regarding the detection of disease loci: first,
that detection by association is more powerful
than linkage when the genotype-relative risk is
modest or small and the risk-allele frequency is large
(say, >10%); and second, that the multiple-testing
burden of a genome scan by association does not
prevent the detection of genome-wide-significant
findings. This paper was essentially about experimental design and statistical power (and hence feasibility), not about the CDCV hypothesis as such.
Finally, Chakravarti88 pointed out that if individuals
with disease needed to be homozygous for risk variants at multiple loci, then the risk alleles at those
loci must be more common than they would be in
a model in which homozygosity at any risk locus is
sufficient to cause disease. We note that without
the assumption of strong epistasis on the scale of
liability, there is no need for risk variants to be
common. For example, Risch’s multilocus multiplicative model,90 which implies an additive model
on the log (risk) scale (it is one of the ‘‘exchangeable’’
models91), does not rely on a particular allelic spectrum of risk-allele frequencies.
What all these landmark papers have in common
is a remarkable foresight in predicting the GWAS era
well before the publication of the full draft of the
human genome sequence, the HapMap project, or
the availability of commercial genotyping. But
what can we conclude about the origin and specifics
of the CDCV hypothesis? As implicitly or explicitly
stated in these key papers, there is no strong prediction about the exact allele-frequency spectrum of
risk variants in the genome, nor a prediction about
the effect size at any disease loci and hence about
the total number of risk alleles in the genome.
The current debate is about the frequency spectrum of disease-causing alleles. Phrasing the debate
as an either/or question is not very helpful because
examples of both common and rare alleles are
already known, but there is still an open question
as to whether most genetic variation contributing
to complex traits in the population is caused by
rare variants or common variants. A more general
question regards the spectrum of allele frequencies
of disease-causing alleles and the joint distribution
between risk-allele frequency and effect size. In the
special case of an evolutionarily neutral model and
a constant effective population size, most causal
variants that are segregating in the population will
be rare, but most heritability will be due to common
variants.79,92 The reason for this apparent paradox is
that the number of segregating variants is proportional to 1/[p(1 − p)], where p is the allele frequency
of a risk-increasing allele (so the smaller p, the
more variants of that frequency), whereas the heritability contributed at that frequency is proportional
to p(1 − p). The net effect is that the heritability is
distributed equally over all frequencies, and cumulatively most heritability is contributed by common
variants.
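The neutral-model argument at the end of Box 2 can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: variant density is taken as exactly 1/[p(1 − p)], heritability per variant as exactly p(1 − p), and frequencies are restricted to a minor-allele range from a hypothetical lower cutoff `lo` (needed because the density is not integrable at p = 0) up to 0.5. The function names are illustrative and do not come from the review.

```python
import math

def variant_density(p):
    """Relative number of segregating variants at allele frequency p (assumed 1/[p(1-p)])."""
    return 1.0 / (p * (1.0 - p))

def heritability_per_variant(p):
    """Relative heritability contributed by one variant at frequency p (assumed p(1-p))."""
    return p * (1.0 - p)

def heritability_density(p):
    """Heritability contributed by all variants at frequency p: the product of the two terms."""
    return variant_density(p) * heritability_per_variant(p)

def logit(p):
    """Antiderivative of 1/[p(1-p)]."""
    return math.log(p / (1.0 - p))

def share_of_variants_below(threshold, lo=0.001, hi=0.5):
    """Fraction of variants with frequency in [lo, threshold], out of [lo, hi]."""
    return (logit(threshold) - logit(lo)) / (logit(hi) - logit(lo))

def share_of_heritability_below(threshold, lo=0.001, hi=0.5):
    """Fraction of heritability from [lo, threshold]; the density is flat, so this is a length ratio."""
    return (threshold - lo) / (hi - lo)

if __name__ == "__main__":
    # lo = 0.001 is a hypothetical cutoff chosen only for illustration.
    print("share of variants with p < 0.05:    ", round(share_of_variants_below(0.05), 3))
    print("share of heritability with p < 0.05:", round(share_of_heritability_below(0.05), 3))
```

Under these assumptions the heritability density is flat across frequencies, so most variants fall in the rare range while most heritability comes from the much wider common range. The exact shares depend on the arbitrary lower cutoff.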
It is clear that for most complex traits that have been
investigated by GWAS, multiple identified loci have
genome-wide statistical significance, and thus it is likely
that there are (many) other loci that have not been identified because of a lack of statistical significance (false negatives). Recently, researchers have developed and applied
methods to quantify the proportion of phenotypic variation that is tagged when one considers all SNPs simultaneously.12–14 These methods focus on estimation rather
than hypothesis testing and do not suffer from false
negatives caused by small effect sizes.15 Whole-genome
approaches to estimating genetic variation have shown
that approximately one-third to one-half of additive
genetic variation in the population is being tagged when
all GWAS SNPs are considered simultaneously.12–14 This
is a surprisingly large proportion given that evolutionary
theory predicts that most variants affecting disease risk
ought to be found at a low frequency in the population
if they affect fitness,16,17 and such risk variants would
not be in sufficient LD with the common SNPs to be
detected in GWASs.
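The statement that low-frequency risk variants cannot be in sufficient LD with common SNPs follows from how r² depends on allele frequencies. A minimal sketch, assuming two biallelic loci described by minor-allele frequencies (both at most 0.5) and a single haplotype frequency; the function names and the example frequencies are illustrative, not taken from the review.

```python
def ld_r2(p_ab, p_a, p_b):
    """r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # the LD coefficient D
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))

def max_r2(p_a, p_b):
    """Largest possible r^2 given two minor-allele frequencies (each <= 0.5).

    The maximum is reached when the rarer allele only occurs on haplotypes
    that also carry the more common allele.
    """
    rare, common = sorted((p_a, p_b))
    return rare * (1.0 - common) / (common * (1.0 - rare))

if __name__ == "__main__":
    # Hypothetical example: a causal variant at frequency 0.01 next to a
    # genotyped SNP at frequency 0.30.
    print(max_r2(0.01, 0.30))
    # Two variants of equal frequency can reach r^2 = 1.
    print(max_r2(0.30, 0.30))
```

Even in the most favourable haplotype configuration, the rare variant in this example is tagged only weakly by the common SNP, so an association test on the SNP recovers just a small fraction of the signal.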
Autoimmune Diseases
We concentrate on seven auto-immune diseases: ankylosing spondylitis (AS [MIM 106300]), rheumatoid arthritis
(RA [MIM 180300]), systemic lupus erythematosus (SLE
[MIM 152700]), type 1 diabetes (T1D [MIM 222100]),
MS, CD, and UC. Table 2 summarizes the number of genes
that have been identified for these diseases. Across these
diseases, 19 loci (mainly related to human leukocyte
antigen) were known prior to 2007, and 277 have been
discovered from 2007 onward. The total of 277 includes
multiple counts of loci that have been implicated across a
number of diseases; such loci include BLK (MIM 191305),
TNFAIP3 (MIM 191163) and CD40 (MIM 109535).
Figure 2. Increase in Number of Loci Identified as a Function of Experimental Sample Size
(A) Selected quantitative traits.
(B) Selected diseases.
The coordinates are on the log scale. The complex traits were
selected with the criteria that there were at least three GWAS
papers published on each in journals with a 2010–2011 journal
impact factor >9 (e.g., Nature, Nature Genetics, the American Journal
of Human Genetics, and PLoS Genetics) and that at least one paper
contained more than ten genome-wide significant loci. These
traits are a representative selection among all complex traits that
fulfilled these criteria.
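Figure 2 plots the number of loci against sample size on log scales, and the text describes the trend as a rough rule: no loci below a minimum sample size, then a doubling of loci for each doubling of sample size. That rule can be written down as a very small model. Everything numeric below (the reference sample size, the reference locus count, and the threshold) is a hypothetical placeholder rather than a value read off the figure.

```python
import math

def expected_loci(n, n_ref, loci_ref, n_min):
    """Rule-of-thumb number of loci at sample size n.

    Below the threshold n_min nothing is detected; above it the count is
    taken to be proportional to n, anchored at a reference study with
    n_ref samples and loci_ref loci (all hypothetical inputs).
    """
    if n < n_min:
        return 0.0
    return loci_ref * n / n_ref

def log_log_slope(n1, n2, n_ref, loci_ref, n_min):
    """Slope between two sample sizes on log-log axes (both must exceed n_min)."""
    y1 = expected_loci(n1, n_ref, loci_ref, n_min)
    y2 = expected_loci(n2, n_ref, loci_ref, n_min)
    return (math.log(y2) - math.log(y1)) / (math.log(n2) - math.log(n1))

if __name__ == "__main__":
    # Hypothetical anchor: 10 loci at 10,000 samples, nothing below 5,000.
    for n in (4_000, 10_000, 20_000, 40_000):
        print(n, expected_loci(n, 10_000, 10, 5_000))
```

On log-log axes this rule is a straight line with slope 1 above the threshold, which is the pattern the "very roughly" wording in the text describes.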
Inflammatory bowel disease (IBD, not to be confused
here with identity by descent) is thought to arise from
dysregulation of intestinal homeostasis.18 GWASs of IBD
(CD and UC) have been highly successful in terms of
the number of loci identified (99 nonoverlapping loci in
total; ref. 18), and a substantial proportion of familial risk, about
20%, has been accounted for.11,12,18 Twenty-eight risk loci
are shared between CD and UC, despite the fact that these
diseases display distinct clinical features, and it has been
suggested that the two diseases share pathways and are
part of a mechanistic continuum.18 There are also strong
overlaps between genes involved in CD and UC, AS,19
and psoriasis (MIM 177900), again suggesting shared aetiopathogenic mechanisms in these conditions. Pleiotropic
genetic effects are becoming increasingly widely identified,
including in classical autoimmune diseases.20 For example,
a coding variant in the gene PTPN22 (MIM 600716)
confers strong risk for T1D and RA as well as protection
against CD.18
Table 1. Population Variation Explained by GWAS for a Selected Number of Complex Traits
Trait or Disease | h² Pedigree Studies | h² GWAS Hits (a) | h² All GWAS SNPs (b)
Type 1 diabetes | 0.9 (ref. 98) | 0.6 (ref. 99) | 0.3 (ref. 12)
Type 2 diabetes | 0.3–0.6 (ref. 100) | 0.05–0.10 (ref. 34) |
Obesity (BMI) | 0.4–0.6 (refs. 101, 102) | 0.01–0.02 (ref. 36) | 0.2 (ref. 14)
Crohn’s disease | 0.6–0.8 (ref. 103) | 0.1 (ref. 11) | 0.4 (ref. 12)
Ulcerative colitis | 0.5 (ref. 103) | 0.05 (ref. 12) |
Multiple sclerosis | 0.3–0.8 (ref. 104) | 0.1 (ref. 45) |
Ankylosing spondylitis | >0.90 (ref. 105) | 0.2 (ref. 106) (c) |
Rheumatoid arthritis | 0.6 (ref. 107) | |
Schizophrenia | 0.7–0.8 (ref. 108) | 0.01 (ref. 79) | 0.3 (ref. 109)
Bipolar disorder | 0.6–0.7 (ref. 108) | 0.02 (ref. 79) | 0.4 (ref. 12)
Breast cancer | 0.3 (ref. 110) | 0.08 (ref. 111) |
Von Willebrand factor | 0.66–0.75 (refs. 112, 113) | 0.13 (ref. 114) | 0.25 (ref. 14)
Height | 0.8 (refs. 115, 116) | 0.1 (ref. 13) | 0.5 (refs. 13, 14)
Bone mineral density | 0.6–0.8 (ref. 117) | 0.05 (ref. 118) |
QT interval | 0.37–0.60 (refs. 119, 120) | 0.07 (ref. 121) | 0.2 (ref. 14)
HDL cholesterol | 0.5 (ref. 122) | 0.1 (ref. 57) |
Platelet count | 0.8 (ref. 123) | 0.05–0.1 (ref. 58) |
(a) Proportion of phenotypic variance or variance in liability explained by
genome-wide-significant and validated SNPs. For a number of diseases, other
parameters were reported, and these were converted and approximated to the
scale of total variation explained. Blank cells indicate that these parameters
have not been reported in the literature.
(b) Proportion of phenotypic variance or variance in liability explained when all
GWAS SNPs are considered simultaneously. Blank cells indicate that these
parameters have not been reported in the literature.
(c) Includes pre-GWAS loci with large effects.
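The three columns of Table 1 can be turned into the two ratios the text keeps returning to: the share of pedigree heritability explained by genome-wide-significant hits, and the share tagged by all GWAS SNPs together. The sketch below uses three rows as read from Table 1. Replacing a reported range by its midpoint is an assumption of this illustration, not something the review does.

```python
# Rows as read from Table 1: (pedigree h2 range, h2 from GWAS hits, h2 from all GWAS SNPs).
TABLE1_ROWS = {
    "Crohn's disease": ((0.6, 0.8), 0.1, 0.4),
    "Schizophrenia": ((0.7, 0.8), 0.01, 0.3),
    "Height": ((0.8, 0.8), 0.1, 0.5),
}

def midpoint(value_range):
    """Midpoint of a (low, high) range; a simplifying assumption for ranges."""
    low, high = value_range
    return (low + high) / 2.0

def share_explained(trait):
    """Return (share explained by GWAS hits, share tagged by all SNPs) for a trait."""
    pedigree_range, hits, all_snps = TABLE1_ROWS[trait]
    h2 = midpoint(pedigree_range)
    return hits / h2, all_snps / h2

if __name__ == "__main__":
    for trait in TABLE1_ROWS:
        by_hits, by_all = share_explained(trait)
        print(f"{trait}: hits {by_hits:.0%}, all SNPs {by_all:.0%}")
```

For these rows the significant hits account for a small share of pedigree heritability, while all SNPs together tag a much larger share, which is the contrast the review draws between hypothesis testing and whole-genome estimation.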
Metabolic Diseases
In terms of metabolic diseases, we focus here specifically
on type 2 diabetes (T2D [MIM 125853]); fasting glucose
and insulin levels; body-mass index (BMI) and obesity;
and fat distribution. A recent review21 already covered
these complex traits, but we have updated that review
wherever necessary. Table 3 gives an overview of the
number of loci identified.
More than 20 major GWASs for T2D have been published to date21–24, and there has been a cumulative tally
of around 50 genome-wide-significant hits,21,23,24 only
three of which were known before the GWAS era. Most
of these studies have involved individuals of European
descent; the latest published effort is from the DIAGRAM
(Diabetes Genetics Replication and Meta-analysis)
Consortium and includes more than 47,000 GWAS individuals and 94,000 samples for replication. More recently,
equivalent studies have emerged from samples of East
Asians,23,25–27 South Asians,22 and Hispanics,28,29 and
large studies involving African Americans and other major
ethnic groups are underway. Notwithstanding differences
in allele frequency and LD patterns, most of the signals
found in one ethnic group show some evidence of association in others, indicating that the common-variant
signals identified by GWASs are likely to be the result of
widely distributed causal alleles that are of relatively high
frequency. This is an important observation because it
indicates that most of the GWAS-identified associations
for T2D reflect high LD with a causal variant that has
a small effect size rather than low LD with a causal variant
that has a large effect size. The largest common-variant
signal identified for T2D remains TCF7L2 (MIM 602228)
(detected just prior to the GWAS era30), which has a
per-allele odds ratio (OR) of around 1.35. The remaining
signals detected by GWAS have allelic ORs in the range
between 1.05 and 1.25. Collectively, the most-strongly
associated variants at these loci are estimated to explain
around 10% of familial aggregation of T2D in European
populations.
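To see why per-allele odds ratios of 1.05 to 1.35 add up to only a modest share of familial aggregation, it helps to compute the sibling recurrence risk ratio that a single locus contributes. The sketch below assumes a single locus acting multiplicatively on risk, treats the odds ratio as a genotype relative risk (an approximation that is loose for a disease as common as T2D), and combines loci multiplicatively. The allele frequency and the overall sibling risk ratio used in the example are hypothetical placeholders, not values from the review.

```python
import math

def sibling_risk_ratio(risk_allele_freq, relative_risk):
    """Sibling recurrence risk ratio contributed by one multiplicative locus.

    With relative risks 1, g, g^2 for 0, 1, 2 risk alleles and allele
    frequency p, w = p(1-p)(g-1)^2 / (p*g + 1 - p)^2 is half the additive
    variance over the squared mean risk, and the sibling ratio is (1 + w/2)^2.
    """
    p = risk_allele_freq
    q = 1.0 - p
    w = p * q * (relative_risk - 1.0) ** 2 / (p * relative_risk + q) ** 2
    return (1.0 + w / 2.0) ** 2

def share_of_familial_aggregation(locus_ratios, overall_sibling_ratio):
    """Share of the overall sibling risk ratio explained, on the log scale.

    Assumes loci combine multiplicatively, so their log ratios add.
    """
    return sum(math.log(r) for r in locus_ratios) / math.log(overall_sibling_ratio)

if __name__ == "__main__":
    # Hypothetical inputs: risk-allele frequency 0.3 for a locus with OR 1.35,
    # and an assumed overall sibling risk ratio of 3.
    strongest = sibling_risk_ratio(0.3, 1.35)
    print(strongest)
    print(share_of_familial_aggregation([strongest], 3.0))
```

Under these assumptions even the strongest common-variant signal contributes a sibling risk ratio only slightly above 1, so many such loci are needed to reach a figure like the 10% quoted in the text.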
The MAGIC (Meta-Analysis of Glucose- and Insulin-Related Traits Consortium) investigators have been
carrying out equivalent analyses focused on the identification of variants influencing variation in glucose and
insulin levels in healthy nondiabetic individuals.31–33 Prior
to the GWAS era, the only compelling association signal
for fasting glucose levels was known at GCK (MIM
138079) (glucokinase),34 but GWAS in European samples
(46,000 GWAS and 76,000 replication samples) have
expanded that number to 16 (ref. 32). These variants explain
around 10% of the inherited variation in fasting glucose
levels. Only two signals (near GCKR [MIM 600842] and
IGF1 [MIM 147440]) were shown to influence fasting
insulin levels in the same analysis. Equivalent analyses
for 2-hour glucose33 (15,000 GWAS samples and up to 30,000
replication samples) identified further signals, including
variants near the GIP (MIM 137240) receptor (GIPR [MIM
137241]).
Before the GWAS era, the only robust association
between DNA sequence variation and either BMI or
weight involved low-frequency variants in MC4R (MIM
155541).35 Now, there are more than 30. In the most
recent study from the GIANT consortium,36 these analyses
extended to almost 250,000 samples, half of them in the
stage 1 GWAS, the remainder for replication. The largest
signal remains that at FTO (MIM 610966),37 where the
12 The American Journal of Human Genetics 90, 7–24, January 13, 2012
Table 2. Summary of GWAS Findings for Seven Autoimmune Diseases (a)
Disease | Prior to 2007: Number of Loci | Prior to 2007: Loci | 2007 Onward: Number of Loci | 2007 Onward: Some or All of the Loci
Ankylosing spondylitis | 1 | HLA-B27 | 13 | IL23R, ERAP1, 2p15, 21q22, CARD9 (MIM 607212), IL12B (MIM 161561), PTGER4 (MIM 601586), IL1R2 (MIM 147811), TNFR1, TBKBP1 (MIM 608476), ANTXR2 (MIM 608041), RUNX3 (MIM 600210), KIF21B (MIM 608322)
Rheumatoid arthritis | 3 | HLA-DRB1, PADI4, CTLA4 | 30 | AFF3 (MIM 601464), BLK, CCL21 (MIM 602737), CD2/CD58 (MIM 186990/153420), CD28, CD40, FCGR2A (MIM 146790), HLA-DRB1, IL2/IL21 (MIM 147680/605384), IL2RA, IL2RB (MIM 146710), KIF5A/PIP4K2C, PRDM1 (MIM 603423), PRKCQ (MIM 600448), PTPRC (MIM 151460), REL (MIM 164910), STAT4 (MIM 600558), TAGAP, TNFAIP3, TNFRSF14, TRAF1/C5 (MIM 120900/601711), TRAF6 (MIM 602355), IL6ST (MIM 600694), SPRED2 (MIM 609292), RBPJ (MIM 147183), CCR6 (MIM 601835), IRF5 (MIM 607218), PXK (MIM 611450)
Systemic lupus erythematosus | 3 | HLA, PTPN22, IRF5 (MIM 607218) | 31 | BANK1 (MIM 610292), BLK (MIM 191305), C1q, C2 (MIM 613927), C4A/B (MIM 120820/120810), CRP (MIM 123260), ETS1 (MIM 164720), FcGR2A–FcGR3A (MIM 146790/146740), FcGR3B (MIM 610665), HIC2-UBE2L3 (MIM 607712/603721), IKZF1 (MIM 603023), IL10 (MIM 124092), IRAK1 (MIM 300283), ITGAM–ITGAX (MIM 120980/151510), JAZF1, KIAA1542/PHRF1, LRRC18-WDFY4, LYN (MIM 165120), NMNAT2 (MIM 608701), PRDM1 (MIM 603423), PTTG1 (MIM 604147), PXK (MIM 611450), RASGRP3 (MIM 609531), SLC15A4, STAT1 (MIM 600555), TNFAIP3, TNFSF4 (MIM 603594), TNIP1 (MIM 607714), TREX1 (MIM 606609), UHRF1BP1, XKR6
Type 1 diabetes | 4 | HLA, INS (MIM 176730), PTPN22, CTLA4 | 40 | RGS1, IL18RAP (MIM 604509), IFIH1 (MIM 606951), CCR5 (MIM 601373), IL2 (MIM 147680), IL7R, MHC, BACH2 (MIM 605394), TNFAIP3, TAGAP, IL2RA, PRKCQ (MIM 600448), INS (MIM 176730), ERBB3 (MIM 190151), 12q13.3, SH2B3 (MIM 605093), CTSH (MIM 116820), CLEC16A (MIM 611303), PTPN2 (MIM 176887), CD226 (MIM 605397), UBASH3A (MIM 605736), C1QTNF6, IL10 (MIM 124092), 4p15.2, C6orf173, 7p15.2, COBL (MIM 610317), GLIS3 (MIM 610192), C10orf59, CD69 (MIM 107273), 14q24.1, 14q32.2, IL27 (MIM 608273), 16q23.1, ORMDL3 (MIM 610075), 17q21.2, 19q13.32, 20p13, 22q12.2, Xq28
Multiple sclerosis | 1 | HLA | 52 | BACH2 (MIM 605394), BATF (MIM 612476), CBLB, CD40, CD58, CD6 (MIM 186720), CD86, CLEC16A (MIM 611303), CLECL1, CYP24A1, CYP27B1, DKKL1 (MIM 605418), EOMES (MIM 604615), EVI5 (MIM 602942), GALC (MIM 606890), HHEX (MIM 604420), IL12A, IL12B, IL22RA2, IL2RA, IL7, IL7R, IRF8, KIF21B (MIM 608322), MALT1, MAPK1 (MIM 176948), MERTK (MIM 604705), MMEL1, MPHOSPH9 (MIM 605501), MPV17L2, MYB (MIM 189990), MYC (MIM 190080), OLIG3 (MIM 609323), PLEK (MIM 173570), PTGER4 (MIM 601586), PVT1 (MIM 165140), RGS1, SCO2 (MIM 604272), SP140 (MIM 608602), STAT3, TAGAP, THEMIS (MIM 613607), TMEM39A, TNFRSF1A, TNFSF14 (MIM 604520), TYK2, VCAM1, ZFP36L1 (MIM 601064), ZMIZ1 (MIM 607159), ZNF767
Crohn’s disease | 4 | NOD2 (MIM 605956), IBD5 (MIM 606348), DRB1*0103, IL23R | 67 | SMAD3 (MIM 603109), ERAP2 (MIM 609497), IL10 (MIM 124092), IL2RA, TYK2, FUT2 (MIM 182100), DNMT3A (MIM 602769), DENND1B (MIM 613292), BACH2 (MIM 605394), ATG16L1 (MIM 610767)
Ulcerative colitis | 3 | DRB1*1502, DRB1*0103, IL23R | 44 | IL1R2 (MIM 147811), IL8RA-IL8RB, IL7R, IL12B, DAP (MIM 600954), PRDM1 (MIM 603423), JAK2 (MIM 147796), IRF5 (MIM 607218), GNA12 (MIM 604394), LSP1 (MIM 153432), ATG16L1 (MIM 610767)
Total | 19 | | 277 |
(a) The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.
average between-homozygotes difference in weight is
around 2.5 kg. The effects at other loci are smaller, and
in combination, these variants explain no more than
1%–2% of overall variation in adult BMI (although this
percentage rises to almost 20% if the analysis is extended
to all GWA variants, not just those that reach genome-
wide significance14). As well as these studies of BMI and
obesity in population samples, there have been several
studies focused on extreme obesity phenotypes.38,39 The
genome-wide-significant loci thrown up by these efforts
only partially overlap with those emerging from population-based studies, raising the possibility that some of
Table 3. Summary of GWAS Findings for Metabolic Traits (a)
Disease | Prior to 2007: Number of Loci | Prior to 2007: Loci | 2007 Onward: Number of Loci | 2007 Onward: Some or All of the Loci
Type 2 diabetes | 3 | PPARG, KCNJ11 (MIM 600937), TCF7L2 | 50 | NOTCH2 (MIM 600275), PROX1 (MIM 601546), GCKR, THADA (MIM 611800), BCL11A (MIM 606557), RBMS1 (MIM 602310), IRS1, ADAMTS9, ADCY5 (MIM 600293), IGF2BP2 (MIM 608289), WFS1, ZBED3, CDKAL1, DGKB (MIM 604070), JAZF1, GCK, KLF14, TP53INP1 (MIM 606185), SLC30A8 (MIM 611145), PTPRD (MIM 601598), CDKN2A, CHCHD9, CDC123, HHEX (MIM 604420), DUSP8 (MIM 602038), KCNQ1, CENTD2, MTNR1B, HMGA2 (MIM 600698), TSPAN8 (MIM 600769), HNF1A, ZFAND6 (MIM 610183), PRC1 (MIM 603484), FTO, SRR (MIM 606477), HNF1B (MIM 189907), DUSP9 (MIM 300134), CDCD4A, UBE2E2 (MIM 602163), GRB14 (MIM 601524), ST6GAL1 (MIM 109675), VPS26A (MIM 605506), HMG20A (MIM 605534), AP3S2 (MIM 602416), HNF4A (MIM 600281), SPRY2 (MIM 602466)
Body-mass index | 1 | MC4R | 30 | NEGR1 (MIM 613173), TNNI3K (MIM 613932), PTBP2 (MIM 608449), TMEM18 (MIM 613220), POMC, FANCL (MIM 608111), LRP1B (MIM 608766), CADM2 (MIM 609938), ETV5 (MIM 601600), GNPDA2 (MIM 613222), SLC39A8 (MIM 608732), HMGCR (MIM 142910), PCSK1, ZNF608, NCR3 (MIM 611550), HMGA1 (MIM 600701), LRRN6C, TUB (MIM 601197), BDNF, MTCH2 (MIM 613221), FAIM3 (MIM 606015), MTIF3, PRKD1 (MIM 605435), MAP2K5 (MIM 602520), FTO, SH2B1, GPRC5B (MIM 605948), KCTD15, GIPR, TMEM160
Glucose or insulin | 1 | GCK | 15 | GCKR, G6PC2, IGF1, ADCY5 (MIM 600293), MADD (MIM 603584), ADRA2A, CRY2 (MIM 603732), FADS1 (MIM 606148), GLIS3 (MIM 610192), SLC2A2, PROX1 (MIM 601546), C2CD4B (MIM 610344), DGKB (MIM 604070), GIPR, VPS13C (MIM 608879)
Fat distribution | 0 | | 20 | TBX15 (MIM 604127), LYPLAL1, IRS1, SPRY2 (MIM 602466), GRB14 (MIM 601524), STAB1 (MIM 608560), ADAMTS9, CPEB4 (MIM 610607), VEGFA (MIM 192240), TFAP2B (MIM 601601), LY86 (MIM 605241), RSPO3 (MIM 610574), NFE2L3 (MIM 604135), MSRA (MIM 601250), ITPR2 (MIM 600144), HOXC13 (MIM 142976), NRXN3 (MIM 600567), ZNRF3 (MIM 612062), PIGC (MIM 601730)
Total | 5 | | 107 |
(a) The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.
the most extreme cases of obesity are driven by highly
penetrant, low-frequency variants. Variation at copy-number variants (CNVs) has some impact on BMI. This is
true of common CNVs (the NEGR1 association seems likely
to be driven by a common CNV40) and also rarer CNVs for
which evidence is starting to accumulate (e.g., 16p CNV
and effect on morbid obesity and developmental delay41).
The adverse metabolic effects of obesity depend not
only on the overall level of adiposity but also on the distribution of fat around the body; visceral (abdominal) fat has
particularly adverse consequences for overall health. GWASs
of fat-distribution phenotypes (including waist circumference, waist:hip ratio, and body-fat percentage studied in close
to 200,000 individuals) have revealed almost 20 loci with
genome-wide significance40,42–44 and relatively little overlap
with those loci influencing overall adiposity. As with BMI, the
proportion of variance explained by these loci is small
(around 1% after adjustment for BMI, age, and sex).
New Biology Arising from GWAS Discoveries
Autoimmune Diseases
Thus far nearly all genes associated with MS have been
involved in autoimmune pathways rather than in
neurologic degenerative diseases.45 Indeed, of the two
MS-associated genes involved in neurodegeneration, one
(KIF21B) is also associated with AS and CD, suggesting
that it is actually an autoimmunity gene. The genes
involved in MS include genes coding for components of
the cytokine pathway (CXCR5 [MIM 601613], IL2RA
[MIM 147730], IL7R [MIM 146661], IL7 [MIM 146660],
IL12RB1 [MIM 601604], IL22RA2 [MIM 606648], IL12A
[MIM 161560], IL12B [MIM 161561], IRF8 [MIM 601565],
TNFRSF1A [MIM 191190], TNFRSF14 [MIM 602746], and
TNFSF14 [MIM 604520]), costimulatory molecules
(CD37 [MIM 151523], CD40, CD58 [MIM 153420],
CD80 [MIM 112203], CD86 [MIM 601020], and CLECL1
[MIM 607467]), and signal-transduction molecules of
immunological relevance (CBLB [MIM 604491], GPR65
[MIM 604620], MALT1 [MIM 604860], RGS1 [MIM
600323], STAT3 [MIM 102582], TAGAP [MIM 609667],
and TYK2 [MIM 176941]). Interestingly, these genes mainly
implicate T-helper cells in MS pathogenesis.
Genetic findings have had a major impact on AS research
and therapeutics. The associations of the genes IL23R (MIM
607562)46 and IL12B19 have pointed to the involvement of
the IL-23R pathway, and hence IL-17-producing
proinflammatory cell populations, in the aetiopathogenesis of AS. The involvement of this pathway in AS was not
considered until the genetic discoveries were reported.
The recent demonstration that ERAP1 (MIM 606832) polymorphisms are associated with HLA-B27-positive but not
HLA-B27-negative AS has shed important light on research
into the mechanism by which HLA-B27 induces AS; this
mechanism has remained an enigma since the discovery
of the association of HLA-B27 with AS in the early 1970s.
ERAP1 is involved in peptide processing before HLA class
I molecule presentation; the restriction of the association
of ERAP1 variants to HLA-B27-positive disease indicates
that HLA-B27 operates to cause AS by a mechanism
that involves peptide presentation. Protective variants of
ERAP1 have been shown to have lower peptide-processing
capacity and thus to reduce the amount of peptide available to HLA-B27.47 Thus HLA-B27 is more likely to cause
AS when it is processing more peptides.
The finding that PADI4 (MIM 605347) is associated with
RA focused research interest on the role of anti-citrullinated peptide antibodies (ACPAs) in disease.48 PADI4 is
involved in the citrullination of peptides against which
ACPAs develop. The association of PADI4 variants with
RA therefore indicated that ACPAs are directly involved
in RA pathogenesis, not an indirect manifestation of
immune dysregulation in the disease. Subsequently, it
was discovered that the association of HLA-DRB1 (MIM
142857) with RA was restricted to ACPA-positive disease
and that there was a strong gene-environment interaction,
such that cigarette smoking increases the risk of ACPA-positive but not ACPA-negative RA.49 Because ACPA-positive disease is more severe than ACPA-negative disease
and has a greater propensity toward joint-damaging
erosion, this provided further evidence supporting public-health measures against cigarette smoking.
The genetic loci identified for IBD through GWASs have
highlighted a number of pathways, including antibacterial
autophagy and signaling pathways (e.g., IL-10 signaling,
T-cell-negative regulators, and pathways involving B cells
and innate sensors).18 Some of these pathways were previously not suspected to be important for these diseases.
Recognition of the roles of a number of pathways, for example the IL-23R
pathway, the autophagy pathway, and innate immunity,
has come from hypothesis-generating genetics research,
not from immunology or hypothesis-driven research.
Similar advances could be described for many other
autoimmune diseases but are beyond the scope of this
review.
Metabolic Traits
Most loci affecting T2D and fasting glucose levels map to
regulatory sequences, and in many cases, the ‘‘causal’’ transcript, i.e., the transcript responsible for mediating the
effect of the associated variants, is not yet known. At other
loci, a combination of coding variants, strong biological
candidates, and/or cis expression QTL data has defined
the transcript through which the effect is mediated
(HNF1A [MIM 142410], GCK, IRS1 [MIM 147545], WFS1
[MIM 606201], PPARG [MIM 601487], CAMK1D [MIM
607957], JAZF1 [MIM 606246], KLF14 [MIM 609393] and
others) as a first step to inferring biology.50 Some of these
stories are now starting to be fleshed out into biological
mechanisms (e.g., KLF1451).
There is incomplete overlap with the loci influencing
physiological variation in glucose and insulin. Some loci
(e.g., MTNR1B [MIM 600804]) have a relatively large effect
on both, whereas others (e.g., G6PC2 [MIM 608058])
influence fasting glucose levels but have a minimal effect
on T2D risk. Still others (e.g., CDKN2A and CDKN2B
[MIM 600160 and 600431]) impact T2D and have surprisingly modest effects on fasting glucose levels in healthy,
nondiabetic individuals.32,33,50 Most of these loci appear
to have their primary effect on the function of beta cells
rather than on insulin resistance, highlighting the importance of the former with respect to normal and abnormal
glucose homeostasis.50 Of the subset of loci (including
PPARG, KLF14, and ADAMTS9 [MIM 605421]) shown to
influence T2D risk through a primary effect on insulin
resistance, only FTO seems to act primarily through an
effect on obesity.50 Several of the T2D loci overlap genes
that are known to harbor rare variants responsible for
penetrant, monogenic forms of diabetes (such genes
include KCNQ1 [MIM 607542], PPARG, HNF1A, GCK,
and WFS1), indicating that multiple causal variants at
the same locus segregate in the population at different
frequencies. There is overlap between signals influencing
T2D risk and those influencing body weight (CDKAL1
[MIM 611259] and ADCY5 [MIM 600293]) indicating
that some of the observed epidemiological associations
between these traits are attributable to shared susceptibility variants.52
Whereas many of the fasting-glucose and fasting-insulin
signals map near strong biological candidates for relevant
traits (such candidate genes include IRS1, IGF1, ADRA2A
[MIM 104210], SLC2A2 [MIM 138160], GCK and GCKR)
and fit within established models of our understanding
of islet biology, this is far from the case with the loci identified for T2D. Efforts to demonstrate that the genes
mapping close to T2D risk loci are enriched for particular
pathways or processes have met with only limited success;
the most robust finding yet has been in relation to
cell-cycle regulation (and was consistent with a model in
which the regulation of islet mass is a key component of
risk50). Either T2D is especially heterogeneous or else key
aspects of its pathophysiology are as yet poorly codified
in existing databases.
As for T2D and fasting glucose, most of the signals for
obesity and fat distribution map to regulatory sequences, and the
causal transcript is known at only a minority of the loci.
Signals influencing BMI appear to be enriched for genes
implicated in neuronal processes, whereas those influencing fat distribution seem to be more closely related to
adipose development.36,43 Overlap with signals and genes
implicated in more severe forms of disease (morbid obesity,
lipodystrophy) is seen at some loci (PCSK1 [MIM 162150],
POMC [MIM 176830], BDNF [MIM 113505], MC4R, and
SH2B1 [MIM 608937]) but is far from complete (some
loci implicated in extreme obesity case-control studies
show no association with BMI at the population level36).
The strongest signal for overall adiposity is the one mapping to FTO.37 FTO is thought to be a DNA demethylase,53
but its function is poorly understood. Murine models
demonstrate that modulation of Fto expression is associated with changes in body weight,54–56 but no direct
evidence linking coding variants in FTO in humans to
body-weight variation has been demonstrated. For the
time being, FTO remains the strongest candidate, but
the role of other genes (e.g., RPGRIP1L [MIM 610937]) in
the region cannot be discounted. This example demonstrates the difficulties that remain in relating GWAS signals
to downstream biology. Fat distribution is a strongly
gender-dimorphic phenotype, and many of the signals
associated with fat distribution seem to have a selective
effect on this phenotype in women.43
Quantitative Traits
In addition to having been performed on the quantitative
traits discussed previously (e.g., BMI and fasting-glucose
and -insulin levels), GWASs have been done on a number
of quantitative risk factors for disease and for traits that
are models for the genetic architecture of complex traits.
For bone mineral density (BMD), a risk factor for osteoporotic fracture, a total of 34 loci, together explaining ~5% of
narrow-sense heritability, have been identified (Estrada
et al., abstract presented at the American Society for Bone
and Mineral Research 2010 Annual Meeting, published
in J. Bone Miner. Res. 25 [Suppl S1], p. 1243). Among these
genes, there is a major over-representation of genes in the
Wnt-signaling pathway, which was first implicated in osteoporosis (MIM 166710) from studies in families with high
or low BMD phenotypes. Many other examples exist in
osteoporosis and other human diseases in which GWASs
have demonstrated that more-prevalent but less-severe
genetic variants in genes initially identified from studies
of severe familial diseases are important in
the risk of disease in the general population. For human
height, a combined discovery and validation cohort of
~180,000 samples identified 180 robustly associated loci,
many in meaningful biological pathways and with evidence for multiple segregating variants at the same loci.13
Together these loci explain approximately 12%–14% of
additive genetic variation (~10% of phenotypic variation).
A meta-analysis of more than 100,000 individuals of
European ancestry detected a total of 95 loci significantly
associated with plasma concentrations of cholesterol
and triglycerides, known risk factors for coronary artery
disease,57 and it provided evidence that the GWAS loci
were of biological and clinical relevance. A meta-analysis
from the HaemGen consortium on platelet count and
platelet volume, which are endophenotypes for myocardial infarction (MIM 608446), discovered 68 loci.58
When the genes of a number of these loci were silenced
in Drosophila, 11 showed a clear platelet phenotype. These
genes are previously unknown regulators of blood cell
formation. The identification of so many loci has uncovered new gene functions in megakaryopoiesis and platelet
formation. That is, new biology has resulted directly from
the identification of SNPs that are associated with variation
in platelet phenotypes.
Across these quantitative traits, a number of loci discovered through GWASs were known to be a mutational target
for those traits because Mendelian forms with extreme
phenotypes existed. Taken together, the inferences from
quantitative traits in terms of the (large) number of loci
involved, the allelic frequency spectrum of associated variants, and the nature of the candidate genes suggest that
models arising from quantitative traits appropriately
reflect the genetic architecture of disease and reinforce
the emerging evidence that it is the cumulative effect of
many loci that underlies susceptibility to disease.
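The two figures quoted earlier in this section for height (about 10% of phenotypic variation, 12%–14% of additive genetic variation) are linked through the narrow-sense heritability, because the additive genetic share equals the phenotypic share divided by heritability. The sketch below shows that conversion. The heritability values of 0.7 and 0.8 are assumptions chosen for illustration, not values stated in the text, and the function name is hypothetical.

```python
def share_of_additive_variance(phenotypic_share: float, heritability: float) -> float:
    """Convert a fraction of phenotypic variance into a fraction of additive genetic variance."""
    if not 0.0 < heritability <= 1.0:
        raise ValueError("narrow-sense heritability must lie in (0, 1]")
    return phenotypic_share / heritability


if __name__ == "__main__":
    PHENOTYPIC_SHARE = 0.10  # ~10% of phenotypic variation in height, from the text
    for h2 in (0.7, 0.8):  # assumed heritability values, not taken from the text
        share = share_of_additive_variance(PHENOTYPIC_SHARE, h2)
        print(f"h2={h2:.1f}: {share:.1%} of additive genetic variation")
```

With these assumed values, the result falls in the 12%–14% range quoted for the 180 height loci.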
From GWAS to Translation: Clinical Relevance
Autoimmune Diseases
Many of the MS-associated genes discovered by GWASs
represent excellent potential therapeutic targets. Of particular note is the identification of two genes involved in
vitamin D metabolism (CYP27B1 [MIM 609506] and
CYP24A1 [MIM 126065]). This identification might help
to explain the latitudinal variation in MS incidence—i.e.,
higher MS prevalence at more extreme latitudes is most
likely due to higher rates of vitamin D deficiency. Two
other identified genes are already targets of MS therapies,
highlighting the relevance of the findings to the disease
pathogenesis (natalizumab targets VCAM1 [MIM
192225], and daclizumab targets IL2RA). The findings for
AS have stimulated the trial of therapies against identified
pathways. Anti-IL-17 treatment has been shown in a phase
2 trial to have efficacy equivalent to that of the current gold-standard treatment, TNF inhibition, in the treatment of AS.
The relevance of the RA-related genetic findings to therapeutic development is highlighted by the fact that some
existing therapies already target genes or gene pathways
highlighted by the genetic associations with RA; such therapies include those involving TNF inhibitors (e.g., infliximab) and co-stimulation inhibitors (e.g., abatacept).
Abatacept is a fusion protein of CTLA-4 and immunoglobulin. It acts by preventing costimulation of T-helper cells
by the binding of the T cell’s CD28 protein to the B7
protein on the antigen-presenting cell. CTLA4 (MIM
123890) and CD28 (MIM 186760) polymorphisms are
associated with RA. The RA-associated genes include
many involved in the NF-κB signaling pathway and
place this pathway at the center of RA pathogenesis. As
in MS, mouse research prior to the genetic discoveries
had implicated the IL-23-dependent Th17-lymphocyte
pathway in RA pathogenesis. To date there has been very
little genetic support for this with regard to human
diseases, in contrast to the situation in seronegative
diseases such as AS, psoriasis and IBD, where strong genetic
associations exist and treatments targeting the pathway
are in clinical use.
Metabolic Diseases
The main relevance of GWASs lies in the insights into
disease biology (see above) and the potential for clinical
translation through novel approaches to the diagnosis,
prevention, treatment, and monitoring of disease. This
will take some time, in particular given that most GWAS
discoveries were made in the last few years. The predictive
power of disease risk ascertained from genetic data remains
poor because for most diseases only a small proportion of
additive genetic variation has been accounted for.
Although for T2D it is possible to identify individuals
who are at the extremes of the genotype risk score distribution and who differ appreciably in T2D risk (those in the upper and lower
1%–2% have twice or half the average risk, respectively), many of these would already be
identifiable on the basis of classical risk factors. In fact,
when using receiver operating characteristic (ROC) analyses, BMI and age do a far better job of discrimination
than the genetic variants so far discovered.59 This may
change as low-frequency and rare causal alleles are found.
Although individual prediction is not yet practical with
the variants at hand, it should be possible to identify
groups of individuals who are at a substantially greater-than-average risk for diabetes, and this might be of value,
for example, with respect to clinical-trial enrichment.
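The genotype risk score and ROC analysis mentioned above can be sketched as follows. This is a minimal illustration, not the method of the cited studies: it assumes the score is the sum of risk-allele counts weighted by the natural log of each per-allele OR, and it computes the area under the ROC curve as the probability that a randomly chosen case scores higher than a randomly chosen control. The odds ratios and genotypes are made-up toy values, and the function names are hypothetical.

```python
import math
from typing import Sequence


def genotype_risk_score(allele_counts: Sequence[int], odds_ratios: Sequence[float]) -> float:
    """Weighted score: risk-allele count (0, 1, or 2) times ln(OR), summed over loci."""
    if len(allele_counts) != len(odds_ratios):
        raise ValueError("one odds ratio is needed per locus")
    return sum(count * math.log(oratio) for count, oratio in zip(allele_counts, odds_ratios))


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Toy per-allele ORs within the range quoted in the text; not real estimates.
    TOY_ORS = [1.35, 1.15, 1.10]
    # Toy genotypes: risk-allele counts at the three loci for each individual.
    toy_cases = [(2, 1, 1), (1, 1, 2), (1, 0, 1)]
    toy_controls = [(1, 1, 0), (0, 1, 1), (0, 0, 1)]
    case_scores = [genotype_risk_score(g, TOY_ORS) for g in toy_cases]
    control_scores = [genotype_risk_score(g, TOY_ORS) for g in toy_controls]
    print(f"AUC on toy data: {auc(case_scores, control_scores):.3f}")
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination, so the same function could be applied to BMI or age to compare classical risk factors with a genetic score.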
One obvious route to early translation involves the identification of diagnostic biomarkers on the basis of the
processes that have been uncovered. These may have
predictive impact well beyond the genetic variants that
led to their discovery. This was recently demonstrated by
a GWAS of C-reactive protein (CRP) levels; that study
found that common variants near the HNF1A gene were
associated with variation in CRP.60 The authors asked
whether rare HNF1A mutations that are causal for the
Mendelian MODY (MIM 606391) subtype of diabetes are
also associated with differences in CRP levels and whether
it would be possible to use CRP levels as a diagnostic
marker to help identify individuals who have early-onset
diabetes and who are likely to have HNF1A-MODY (and
to direct those individuals to sequence-based diagnostics).
They were able to show marked differences in CRP levels
between HNF1A-MODY and other types of diabetes and
demonstrated that diagnosis based on CRP levels has
a discriminative accuracy of more than 80% for this diagnostic classification.61,62 Otherwise, GWAS findings have
as yet had no impact on therapeutic optimization. Recent
studies have identified variants that influence therapeutic
response to metformin63 and might herald better understanding of how these drugs work.
New Science Facilitated by GWASs
Although the GWAS approach was designed for the detection of associations between DNA markers and disease, as
a by-product such studies have generated new scientific
discoveries. A detailed description and discussion is outside
the scope of this review, and we highlight only a few of
these advances: the discovery of genes affecting genetic
recombination and their correlation with natural selection64–66 and new insights into human population structure
and evolution.67–73
Interpretation of GWAS Results
GWASs conducted in the last five years were designed and
powered to detect associations through LD between genotyped (or imputed) common SNP markers and unknown
causal variants. What do the results imply in terms of variance explained in the population, common versus rare
variants underlying complex traits, and the nature of
complex-trait variation and evolution? It is too early to
be able to quantify the joint distribution of risk-allele
frequencies and their effect sizes because there are very
few causal variants identified by GWAS and because
systematic study of rare variants (through exome or
whole-genome sequencing) is in an early stage. To understand the allelic spectrum of risk variants and thereby
inform optimal design of experiments aiming to detect
causal variants, one must differentiate between two explanations for observed associations between genotyped
common SNPs and disease: the association can be caused
by one or more causal variants that have large effect sizes
and are in low LD with the genotyped SNPs, or it can be
caused by causal variants that have small effects and are
in high LD with the genotyped SNPs. Low LD occurs
when the allele frequencies of the unknown causal variants and those at the genotyped SNPs are very different
from each other, for example when the allele frequency
of causal variants is much lower than that of the SNPs.
For a single robustly associated SNP in a homogeneous
population, we cannot distinguish between the hypotheses that the association signal is caused by a rare variant
of large effect or a common variant with small effect.
However, variants at multiple loci and GWASs in other
ethnic populations help to narrow the boundaries of the
genetic architecture of diseases. At this point in time, we
can conclude that
(1) Many loci contribute to complex-trait variation
(e.g., Figure 2).
(2) At a number of identified risk loci, there are multiple
alleles associated with disease at a wide range of
frequencies.
(3) There is evidence for pleiotropy, i.e., that the same
variants are associated with multiple traits.66,74,75
(4) A number of variants associated with disease or
complex traits in one ethnic population are also
associated with the same disease or traits in other populations (see above for T2D examples).
(5) The hypothesis76 that causal variant(s) that lead to
the association between common SNPs and disease
are mostly rare (say, have an allele frequency of 1%
Box 3. Synthetic Associations
Dickson and colleagues suggested that the observed
association between a common SNP and a complex
trait might result when one or more rare variants at
the locus are in LD with that SNP.76,93 Because the properties of
LD prevent common SNP alleles and rare causal variants from
being highly correlated,84 the hypothesis of ‘‘synthetic’’ associations
implies that the effect sizes of the causal variants
are much larger than the effect size observed at the
common SNP and suggests that (re)sequencing
studies might detect such variants. The hypothesis
is not about whether GWASs work as an experimental design but what the likely interpretation of
GWAS hits is in terms of the allele spectrum of causal
risk alleles. Are empirical data consistent with this
hypothesis? Several lines of evidence suggest that
associations observed with common SNPs are rarely due to synthetic associations with
rare variants. First, because the LD correlation
between common and rare variants is so low (typically 0.01–0.02), synthetic associations imply that
variation explained by the causal variants at the
locus is 50–100 times larger than the variance explained at the genotyped SNP.78 So, if the SNP
explains 0.1% of phenotypic variation in the population, the causal variant would explain 5%–10%.
But as shown in this review, for many complex traits
and diseases tens to hundreds of common variants
are identified, and so their combined effects would
explain too much variation if synthetic associations
were the norm. Second, empirical data from
(re)sequencing studies and trans-ethnic mapping
suggest that both common and rare variants
contribute to disease risk.77 At most loci detected
by GWASs, there is no evidence (despite extensive
genotyping and/or re-sequencing) that the
common-variant signal is driven by low-frequency
or rarer variants. Where rare risk alleles are uncovered at the same loci, they seem much more likely
to be independent signals.94–96
Together these observations point to a highly
polygenic model of disease susceptibility with causal
variants across the entire range of the allele-frequency spectrum. By ‘‘polygenic,’’ we mean that
segregating variants at many genomic loci (tens,
hundreds, or even thousands) contribute to genetic
variation for susceptibility in the population. The
observations imply that, for most common complex
diseases, nearly everyone in the population carries
some risk alleles and that affected individuals are
likely to have a different portfolio of risk alleles.79
They also imply that any single risk allele is neither
necessary nor sufficient to cause disease. For the
etiology of disease, these observations provide
empirical evidence to support a threshold or burden
model involving multiple variants and environmental factors, and they appear to be inconsistent
with a single cause (e.g., a single mutation). A rare-variant-only model of disease, characterized by locus
heterogeneity and rare mutations of large effects and
proposed by, for example, McClellan and King,1 is
not consistent with empirical observations.77,79,97
or lower) is not consistent with theoretical and empirical results.77,78 In particular, there is no widespread
evidence for the existence of ‘‘synthetic associations’’
(see Box 3). Numerically, we expect that most causal
variants that segregate in the population are rare,
consistent with evolutionary theory, but the proportion of genetic variation that these variants cumulatively explain depends on their correlation with
fitness.79
(6) A surprisingly large proportion of additive genetic
variation is tagged when all SNPs are considered
simultaneously.12–14
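The LD argument behind the conclusions above, and behind Box 3, can be illustrated numerically. The sketch below assumes two biallelic loci and uses the standard upper bound on r² implied by their minor-allele frequencies; it then applies the relationship that the variance explained at a marker equals r² times the variance explained by the causal variant. The allele frequencies of 1% and 30% are illustrative choices, and the function names are hypothetical.

```python
def max_r2(freq_a: float, freq_b: float) -> float:
    """Largest possible r^2 between two biallelic loci given their minor-allele frequencies."""
    for freq in (freq_a, freq_b):
        if not 0.0 < freq <= 0.5:
            raise ValueError("minor-allele frequencies must lie in (0, 0.5]")
    lo, hi = sorted((freq_a, freq_b))
    # If the rarer allele always sits on haplotypes carrying the commoner allele,
    # D = lo * (1 - hi), and r^2 = D^2 / (lo * (1 - lo) * hi * (1 - hi)),
    # which simplifies to the ratio below.
    return (lo * (1.0 - hi)) / (hi * (1.0 - lo))


def implied_causal_variance(marker_variance: float, r2: float) -> float:
    """Variance the causal variant must explain if the marker explains marker_variance."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r^2 must lie in (0, 1]")
    return marker_variance / r2


if __name__ == "__main__":
    # Illustrative frequencies: a 1% causal variant tagged by a 30% SNP.
    print(f"max r^2 (1% vs 30%): {max_r2(0.01, 0.30):.4f}")
    # Box 3 example: the SNP explains 0.1% of phenotypic variation.
    for r2 in (0.01, 0.02):
        causal = implied_causal_variance(0.001, r2)
        print(f"r^2={r2}: causal variant explains {causal:.1%}")
```

Summing such implied contributions over tens to hundreds of loci would quickly exceed the total variation available, which is the core of the argument against synthetic associations being the norm.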
The Cost of GWASs
If we assume that the GWAS results from Figure 1 represent
a total of 500,000 SNP chips and that on average a chip
costs $500, then this is a total investment of $250 million.
If there are a total of ~2,000 loci detected across all traits,
then this implies an investment of $125,000 per discovered locus. Is that a good investment? We think so: The
total amount of money spent on candidate-gene studies
and linkage analyses in the 1990s and 2000s probably
exceeds $250M, and they in total have had little to show
for it. Also, it is worthwhile to put these amounts in
context. $250M is on the order of the cost of one or two
stealth fighter jets and much less than the cost of a single
navy submarine. It is a fraction of the ~$9 billion cost of
the Large Hadron Collider. It would also pay for about
100 R01 grants. Would those 100 non-funded R01 grants
have made breakthrough discoveries in biology and medicine? We simply can’t answer this question, but we can
conclude that a tremendous number of genuinely new
discoveries have been made in a period of only five years.
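The back-of-the-envelope figures in this section can be reproduced in a few lines. The sketch below uses only the numbers quoted in the text (500,000 chips, $500 per chip, ~2,000 loci, about 100 R01 grants, and a ~$9 billion collider). The function and key names are hypothetical, and the per-grant figure is simply the total divided by 100, not a funding statistic.

```python
def gwas_cost_summary(n_chips=500_000, cost_per_chip=500, n_loci=2_000,
                      n_r01_grants=100, lhc_cost=9_000_000_000):
    """Reproduce the section's arithmetic from its own assumed inputs."""
    total = n_chips * cost_per_chip
    return {
        "total_usd": total,                        # total chip investment
        "usd_per_locus": total / n_loci,           # cost per discovered locus
        "implied_usd_per_r01": total / n_r01_grants,
        "fraction_of_lhc": total / lhc_cost,
    }


summary = gwas_cost_summary()
print(f"Total investment: ${summary['total_usd']:,}")         # $250,000,000
print(f"Cost per locus:   ${summary['usd_per_locus']:,.0f}")   # $125,000
```

Changing any input (for example, a lower chip price) rescales the per-locus figure proportionally, which is a reminder that the $125,000 estimate is only as good as the assumed chip count and price.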
Concluding Comments
In this review we have attempted to summarize the
tremendous quality and quantity of discoveries that have
been made by GWASs in the last five years. Because of
space limitations, we have been able to discuss only
a subset of diseases and have not mentioned those made
in common cancers, pediatric diseases, and ophthalmological diseases, to name but a few. We now return to the
perceived failure of GWASs as summarized in the introductory section:
(1) Is the GWAS approach founded on a flawed assumption
that genetics plays an important role in the risk for
common diseases? Pedigree studies, including those
involving twins, suggest that a substantial proportion of variation in susceptibility for common
disease is due to genetic factors. The proportion of
total variation explained by genome-wide-significant variants has reached 10%–20% for a number
of diseases, and clearly there are additional variants
with such small effect sizes that they have not been
detected with stringent significance. As reviewed
here, many of the detected loci are in biologically
meaningful pathways for the diseases investigated.
Whole-genome analyses involving GWAS data
have estimated that 20%–50% of phenotypic variation is captured when all SNPs are considered simultaneously for a number of complex diseases and
traits. These estimates are based on population-wide studies and provide a lower limit of the total
proportion of phenotypic variation due to genetic
factors. Inference from GWASs is independent of
inference drawn from close relatives (pedigree/
family studies), and therefore these studies have
provided independent evidence for the role of
genetics in common diseases.
(2) Have GWASs been disappointing in not explaining more
genetic variation in the population? This criticism
implies that the aim of GWASs is to explain all
genetic variation. This is a misrepresentation of
the objective of GWASs. As was the aim of linkage
studies in pedigrees for complex diseases prior to
the GWAS era, the aim of GWAS is to detect loci
that are associated with complex traits. The detection of such loci has led to the discovery of new biological knowledge about disease—knowledge that
was absent only five years ago. But even ignoring
the aim of GWASs, for a number of complex traits
the proportion of genetic variation uncovered by
GWASs is actually substantial. For example, for
T2D, MS, and CD, approximately 10%, 20%, and
20%, respectively, of genetic variation in the population has been accounted for. Apart from diseases
with a known major locus (which is usually the
major histocompatibility locus), the baseline of
variation explained five years ago was essentially
zero.
(3) Have GWASs delivered meaningful biologically relevant
knowledge or results of clinical or any other utility? As
we have highlighted in this review, the answer to
this question is a definite “yes.” For example, the
discoveries of the importance of the autophagy
pathway in Crohn disease, the IL-23R pathway in
rheumatoid arthritis, and factor H in age-related
macular degeneration (MIM 610149)9 have given
important biological insight with direct clinical
relevance. Hunter and Kraft put it this way back in
2007: “There have been few, if any, similar bursts
of discovery in the history of medical research.”80
(4) Are GWAS results spurious? The combination of large
sample sizes and stringent significance testing has
led to a large number of robust and replicable associations between complex traits and genetic variants, many of which are in meaningful biological
pathways. A number of variants or different variants
at the same loci have been shown to be associated
with the same trait in different ethnic populations,
and some loci are even replicated across species.81
The combination of multiple variants with small
effect sizes has been shown to predict disease status
or phenotype in independent samples from the
same population. Clearly, these results are not
consistent with flawed inferences from GWASs.
In conclusion, in a period of less than five years, the
GWAS experimental design in human populations has
led to new discoveries about genes and pathways involved
in common diseases and other complex traits, has
provided a wealth of new biological insights, has led to
discoveries with direct clinical utility, and has facilitated
basic research in human genetics and genomics. For the
future, technological advances enabling the sequencing
of entire genomes in large samples at affordable prices are
likely to reveal additional genes, pathways, and biological insights, as well as to identify causal mutations.
Acknowledgments
We acknowledge funding from the Australian National Health and
Medical Research Council (NHMRC grants 389892, 496667,
613672, 613601, and 1011506) and the Australian Research
Council (ARC grant DP1093502). P.M.V. and M.A.B. are funded
by NHMRC Senior Principal Research Fellowships. We thank two
referees for many helpful comments.
Web Resources
The URLs for data presented herein are as follows:
Online Mendelian Inheritance in Man (OMIM), http://www.
omim.org
GWAS Catalog, http://www.genome.gov/26525384
References
1. McClellan, J., and King, M.C. (2010). Genetic heterogeneity
in human disease. Cell 141, 210–217.
2. Crow, T.J. (2011). ‘The missing genes: what happened to the
heritability of psychiatric disorders?’. Mol. Psychiatry 16,
362–364.
3. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B.,
Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M.,
Cardon, L.R., Chakravarti, A., et al. (2009). Finding the missing heritability of complex diseases. Nature 461, 747–753.
4. Botstein, D., and Risch, N. (2003). Discovering genotypes
underlying human phenotypes: Past successes for mendelian disease, future approaches for complex disease. Nat.
Genet. Suppl. 33, 228–237.
5. Hartl, D.L., and Clark, A.G. (1997). Principles of population
genetics (Sunderland: Sinauer Associates).
6. Hill, W.G., and Robertson, A. (1968). The effects of
inbreeding at loci with heterozygote advantage. Genetics
60, 615–628.
7. Altshuler, D., Brooks, L.D., Chakravarti, A., Collins, F.S.,
Daly, M.J., and Donnelly, P.; International HapMap Consortium. (2005). A haplotype map of the human genome.
Nature 437, 1299–1320.
8. Dewan, A., Liu, M., Hartman, S., Zhang, S.S., Liu, D.T., Zhao,
C., Tam, P.O., Chan, W.M., Lam, D.S., Snyder, M., et al.
(2006). HTRA1 promoter polymorphism in wet age-related
macular degeneration. Science 314, 989–992.
9. Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S.,
Haynes, C., Henning, A.K., SanGiovanni, J.P., Mane, S.M.,
Mayne, S.T., et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science 308,
385–389.
10. Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14,000 cases of seven common
diseases and 3,000 shared controls. Nature 447, 661–678.
11. Franke, A., McGovern, D.P., Barrett, J.C., Wang, K., Radford-Smith, G.L., Ahmad, T., Lees, C.W., Balschun, T., Lee, J.,
Roberts, R., et al. (2010). Genome-wide meta-analysis
increases to 71 the number of confirmed Crohn’s disease
susceptibility loci. Nat. Genet. 42, 1118–1125.
12. Anderson, C.A., Boucher, G., Lees, C.W., Franke, A.,
D’Amato, M., Taylor, K.D., Lee, J.C., Goyette, P., Imielinski,
M., Latiano, A., et al. (2011). Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of
confirmed associations to 47. Nat. Genet. 43, 246–252.
13. Lango Allen, H., Estrada, K., Lettre, G., Berndt, S.I., Weedon,
M.N., Rivadeneira, F., Willer, C.J., Jackson, A.U., Vedantam,
S., Raychaudhuri, S., et al. (2010). Hundreds of variants clustered in genomic loci and biological pathways affect human
height. Nature 467, 832–838.
14. Yang, J., Manolio, T.A., Pasquale, L.R., Boerwinkle, E., Caporaso, N., Cunningham, J.M., de Andrade, M., Feenstra, B.,
Feingold, E., Hayes, M.G., et al. (2011). Genome partitioning
of genetic variation for complex traits using common SNPs.
Nat. Genet. 43, 519–525.
15. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders,
A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G.,
Montgomery, G.W., et al. (2010). Common SNPs explain
a large proportion of the heritability for human height.
Nat. Genet. 42, 565–569.
16. Eyre-Walker, A. (2010). Evolution in health and medicine
Sackler colloquium: Genetic architecture of complex traits
and its implications for fitness and genome-wide association studies. Proc. Natl. Acad. Sci. USA 107 (Suppl 1),
1752–1756.
17. Pritchard, J.K. (2001). Are rare variants responsible for
susceptibility to complex diseases? Am. J. Hum. Genet. 69,
124–137.
18. Khor, B., Gardet, A., and Xavier, R.J. (2011). Genetics and
pathogenesis of inflammatory bowel disease. Nature 474,
307–317.
19. Danoy, P., Pryce, K., Hadler, J., Bradbury, L.A., Farrar, C., Pointon, J., Ward, M., Weisman, M., Reveille, J.D., Wordsworth,
B.P., et al; Australo-Anglo-American Spondyloarthritis
Consortium; Spondyloarthritis Research Consortium of
Canada. (2010). Association of variants at 1q32 and STAT3
with ankylosing spondylitis suggests genetic overlap with
Crohn’s disease. PLoS Genet. 6, e1001195.
20. Cotsapas, C., Voight, B.F., Rossin, E., Lage, K., Neale, B.M.,
Wallace, C., Abecasis, G.R., Barrett, J.C., Behrens, T., Cho,
J., et al; FOCiS Network of Consortia. (2011). Pervasive
sharing of genetic effects in autoimmune disease. PLoS
Genet. 7, e1002254.
21. McCarthy, M.I. (2010). Genomics, type 2 diabetes, and
obesity. N. Engl. J. Med. 363, 2339–2350.
22. Kooner, J.S., Saleheen, D., Sim, X., Sehmi, J., Zhang, W.,
Frossard, P., Been, L.F., Chia, K.S., Dimas, A.S., Hassanali,
N., et al; DIAGRAM; MuTHER. (2011). Genome-wide association study in individuals of South Asian ancestry identifies
six new type 2 diabetes susceptibility loci. Nat. Genet. 43,
984–989.
23. Yamauchi, T., Hara, K., Maeda, S., Yasuda, K., Takahashi, A.,
Horikoshi, M., Nakamura, M., Fujita, H., Grarup, N., Cauchi,
S., et al. (2010). A genome-wide association study in the
Japanese population identifies susceptibility loci for type 2
diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat. Genet. 42,
864–868.
24. Shu, X.O., Long, J., Cai, Q., Qi, L., Xiang, Y.B., Cho, Y.S., Tai,
E.S., Li, X., Lin, X., Chow, W.H., et al. (2010). Identification
of new genetic risk variants for type 2 diabetes. PLoS Genet.
6, e1001127.
25. Yasuda, K., Miyake, K., Horikawa, Y., Hara, K., Osawa, H.,
Furuta, H., Hirota, Y., Mori, H., Jonsson, A., Sato, Y., et al.
(2008). Variants in KCNQ1 are associated with susceptibility
to type 2 diabetes mellitus. Nat. Genet. 40, 1092–1097.
26. Unoki, H., Takahashi, A., Kawaguchi, T., Hara, K., Horikoshi,
M., Andersen, G., Ng, D.P., Holmkvist, J., Borch-Johnsen, K.,
Jørgensen, T., et al. (2008). SNPs in KCNQ1 are associated
with susceptibility to type 2 diabetes in East Asian and European populations. Nat. Genet. 40, 1098–1102.
27. Tsai, F.J., Yang, C.F., Chen, C.C., Chuang, L.M., Lu, C.H.,
Chang, C.T., Wang, T.Y., Chen, R.H., Shiu, C.F., Liu, Y.M.,
et al. (2010). A genome-wide association study identifies
susceptibility variants for type 2 diabetes in Han Chinese.
PLoS Genet. 6, e1000847.
28. Below, J.E., Gamazon, E.R., Morrison, J.V., Konkashbaev, A.,
Pluzhnikov, A., McKeigue, P.M., Parra, E.J., Elbein, S.C.,
Hallman, D.M., Nicolae, D.L., et al. (2011). Genome-wide
association and meta-analysis in populations from Starr
County, Texas, and Mexico City identify type 2 diabetes
susceptibility loci and enrichment for expression quantitative trait loci in top signals. Diabetologia 54, 2047–2055.
29. Parra, E.J., Below, J.E., Krithika, S., Valladares, A., Barta, J.L.,
Cox, N.J., Hanis, C.L., Wacher, N., Garcia-Mena, J., Hu, P.,
et al; Diabetes Genetics Replication and Meta-analysis
(DIAGRAM) Consortium. (2011). Genome-wide association
study of type 2 diabetes in a sample from Mexico City and
a meta-analysis of a Mexican-American sample from Starr
County, Texas. Diabetologia 54, 2038–2046.
30. Grant, S.F., Thorleifsson, G., Reynisdottir, I., Benediktsson,
R., Manolescu, A., Sainz, J., Helgason, A., Stefansson, H.,
Emilsson, V., Helgadottir, A., et al. (2006). Variant of
transcription factor 7-like 2 (TCF7L2) gene confers risk of
type 2 diabetes. Nat. Genet. 38, 320–323.
31. Prokopenko, I., Langenberg, C., Florez, J.C., Saxena, R.,
Soranzo, N., Thorleifsson, G., Loos, R.J., Manning, A.K.,
Jackson, A.U., Aulchenko, Y., et al. (2009). Variants in
MTNR1B influence fasting glucose levels. Nat. Genet. 41,
77–81.
32. Dupuis, J., Langenberg, C., Prokopenko, I., Saxena, R.,
Soranzo, N., Jackson, A.U., Wheeler, E., Glazer, N.L., Bouatia-Naji, N., Gloyn, A.L., et al; DIAGRAM Consortium;
GIANT Consortium; Global BPgen Consortium; Anders
Hamsten on behalf of Procardis Consortium; MAGIC investigators. (2010). New genetic loci implicated in fasting glucose
homeostasis and their impact on type 2 diabetes risk. Nat.
Genet. 42, 105–116.
33. Saxena, R., Hivert, M.F., Langenberg, C., Tanaka, T., Pankow,
J.S., Vollenweider, P., Lyssenko, V., Bouatia-Naji, N., Dupuis,
J., Jackson, A.U., et al; GIANT consortium; MAGIC investigators. (2010). Genetic variation in GIPR influences the glucose
and insulin responses to an oral glucose challenge. Nat.
Genet. 42, 142–148.
34. Weedon, M.N., Clark, V.J., Qian, Y., Ben-Shlomo, Y., Timpson, N., Ebrahim, S., Lawlor, D.A., Pembrey, M.E., Ring, S.,
Wilkin, T.J., et al. (2006). A common haplotype of the glucokinase gene alters fasting glucose and birth weight: Association in six studies and population-genetics analyses. Am. J.
Hum. Genet. 79, 991–1001.
35. Larsen, L.H., Echwald, S.M., Sørensen, T.I., Andersen, T.,
Wulff, B.S., and Pedersen, O. (2005). Prevalence of mutations
and functional analyses of melanocortin 4 receptor variants
identified among 750 men with juvenile-onset obesity. J.
Clin. Endocrinol. Metab. 90, 219–224.
36. Speliotes, E.K., Willer, C.J., Berndt, S.I., Monda, K.L., Thorleifsson, G., Jackson, A.U., Allen, H.L., Lindgren, C.M.,
Luan, J., Mägi, R., et al; MAGIC; Procardis Consortium.
(2010). Association analyses of 249,796 individuals reveal
18 new loci associated with body mass index. Nat. Genet.
42, 937–948.
37. Frayling, T.M., Timpson, N.J., Weedon, M.N., Zeggini, E.,
Freathy, R.M., Lindgren, C.M., Perry, J.R., Elliott, K.S., Lango,
H., Rayner, N.W., et al. (2007). A common variant in the FTO
gene is associated with body mass index and predisposes to
childhood and adult obesity. Science 316, 889–894.
38. Meyre, D., Delplanque, J., Chèvre, J.C., Lecoeur, C., Lobbens,
S., Gallina, S., Durand, E., Vatin, V., Degraeve, F., Proença, C.,
et al. (2009). Genome-wide association study for early-onset
and morbid adult obesity identifies three new risk loci in
European populations. Nat. Genet. 41, 157–159.
39. Scherag, A., Dina, C., Hinney, A., Vatin, V., Scherag, S., Vogel,
C.I., Müller, T.D., Grallert, H., Wichmann, H.E., Balkau, B.,
et al. (2010). Two new loci for body-weight regulation identified in a joint analysis of genome-wide association studies
for early-onset extreme obesity in French and German study
groups. PLoS Genet. 6, e1000916.
40. Willer, C.J., Speliotes, E.K., Loos, R.J., Li, S., Lindgren, C.M.,
Heid, I.M., Berndt, S.I., Elliott, A.L., Jackson, A.U., Lamina,
C., et al; Wellcome Trust Case Control Consortium; Genetic
Investigation of ANthropometric Traits Consortium.
(2009). Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat.
Genet. 41, 25–34.
41. Walters, R.G., Jacquemont, S., Valsesia, A., de Smith, A.J.,
Martinet, D., Andersson, J., Falchi, M., Chen, F., Andrieux,
J., Lobbens, S., et al. (2010). A new highly penetrant form
of obesity due to deletions on chromosome 16p11.2. Nature
463, 671–675.
42. Heard-Costa, N.L., Zillikens, M.C., Monda, K.L., Johansson,
A., Harris, T.B., Fu, M., Haritunians, T., Feitosa, M.F., Aspelund, T., Eiriksdottir, G., et al. (2009). NRXN3 is a novel locus
for waist circumference: A genome-wide association study
from the CHARGE Consortium. PLoS Genet. 5, e1000539.
43. Heid, I.M., Jackson, A.U., Randall, J.C., Winkler, T.W., Qi, L.,
Steinthorsdottir, V., Thorleifsson, G., Zillikens, M.C.,
Speliotes, E.K., Mägi, R., et al; MAGIC. (2010). Meta-analysis
identifies 13 new loci associated with waist-hip ratio and
reveals sexual dimorphism in the genetic basis of fat distribution. Nat. Genet. 42, 949–960.
44. Kilpelainen, T.O., Zillikens, M.C., Stancakova, A., Finucane,
F.M., Ried, J.S., Langenberg, C., Zhang, W., Beckmann, J.S.,
Luan, J., Vandenput, L., et al. (2011). Genetic variation
near IRS1 associates with reduced adiposity and an impaired
metabolic profile. Nat. Genet. 43, 753–760.
45. Sawcer, S., Hellenthal, G., Pirinen, M., Spencer, C.C., Patsopoulos, N.A., Moutsianas, L., Dilthey, A., Su, Z., Freeman,
C., Hunt, S.E., et al; International Multiple Sclerosis Genetics
Consortium; Wellcome Trust Case Control Consortium 2.
(2011). Genetic risk and a primary role for cell-mediated
immune mechanisms in multiple sclerosis. Nature 476,
214–219.
46. Burton, P.R., Clayton, D.G., Cardon, L.R., Craddock, N.,
Deloukas, P., Duncanson, A., Kwiatkowski, D.P., McCarthy,
M.I., Ouwehand, W.H., Samani, N.J., et al; Wellcome Trust
Case Control Consortium; Australo-Anglo-American Spondylitis Consortium (TASC); Biologics in RA Genetics and
Genomics Study Syndicate (BRAGGS) Steering Committee;
Breast Cancer Susceptibility Collaboration (UK). (2007).
Association scan of 14,500 nonsynonymous SNPs in four
diseases identifies autoimmunity variants. Nat. Genet. 39,
1329–1337.
47. Evans, D.M., Spencer, C.C., Pointon, J.J., Su, Z., Harvey, D.,
Kochan, G., Oppermann, U., Dilthey, A., Pirinen, M.,
Stone, M.A., et al; Spondyloarthritis Research Consortium
of Canada (SPARCC); Australo-Anglo-American Spondyloarthritis Consortium (TASC); Wellcome Trust Case Control
Consortium 2 (WTCCC2). (2011). Interaction between
ERAP1 and HLA-B27 in ankylosing spondylitis implicates
peptide handling in the mechanism for HLA-B27 in disease
susceptibility. Nat. Genet. 43, 761–767.
48. Suzuki, A., Yamada, R., Chang, X., Tokuhiro, S., Sawada, T.,
Suzuki, M., Nagasaki, M., Nakayama-Hamada, M., Kawaida,
R., Ono, M., et al. (2003). Functional haplotypes of PADI4,
encoding citrullinating enzyme peptidylarginine deiminase
4, are associated with rheumatoid arthritis. Nat. Genet. 34,
395–402.
49. Padyukov, L., Silva, C., Stolt, P., Alfredsson, L., and Klareskog,
L. (2004). A gene-environment interaction between smoking
and shared epitope genes in HLA-DR provides a high risk
of seropositive rheumatoid arthritis. Arthritis Rheum. 50,
3085–3092.
50. Voight, B.F., Scott, L.J., Steinthorsdottir, V., Morris, A.P., Dina,
C., Welch, R.P., Zeggini, E., Huth, C., Aulchenko, Y.S.,
Thorleifsson, G., et al; MAGIC investigators; GIANT
Consortium. (2010). Twelve type 2 diabetes susceptibility
loci identified through large-scale association analysis. Nat.
Genet. 42, 579–589.
51. Small, K.S., Hedman, A.K., Grundberg, E., Nica, A.C., Thorleifsson, G., Kong, A., Thorsteindottir, U., Shin, S.Y.,
Richards, H.B., Soranzo, N., et al; GIANT Consortium;
MAGIC Investigators; DIAGRAM Consortium; MuTHER
Consortium. (2011). Identification of an imprinted master
trans regulator at the KLF14 locus related to multiple metabolic phenotypes. Nat. Genet. 43, 561–564.
52. Freathy, R.M., Mook-Kanamori, D.O., Sovio, U., Prokopenko,
I., Timpson, N.J., Berry, D.J., Warrington, N.M., Widen, E.,
Hottenga, J.J., Kaakinen, M., et al; Genetic Investigation of
ANthropometric Traits (GIANT) Consortium; Meta-Analyses
of Glucose and Insulin-related traits Consortium; Wellcome
Trust Case Control Consortium; Early Growth Genetics
(EGG) Consortium. (2010). Variants in ADCY5 and near
CCNL1 are associated with fetal growth and birth weight.
Nat. Genet. 42, 430–435.
53. Gerken, T., Girard, C.A., Tung, Y.C., Webby, C.J., Saudek, V.,
Hewitson, K.S., Yeo, G.S., McDonough, M.A., Cunliffe, S.,
McNeill, L.A., et al. (2007). The obesity-associated FTO
gene encodes a 2-oxoglutarate-dependent nucleic acid demethylase. Science 318, 1469–1472.
54. Church, C., Lee, S., Bagg, E.A., McTaggart, J.S., Deacon, R.,
Gerken, T., Lee, A., Moir, L., Mecinović, J., Quwailid, M.M.,
et al. (2009). A mouse model for the metabolic effects of
the human fat mass and obesity associated FTO gene. PLoS
Genet. 5, e1000599.
55. Church, C., Moir, L., McMurray, F., Girard, C., Banks, G.T.,
Teboul, L., Wells, S., Brüning, J.C., Nolan, P.M., Ashcroft,
F.M., and Cox, R.D. (2010). Overexpression of Fto leads to
increased food intake and results in obesity. Nat. Genet. 42,
1086–1092.
56. Freathy, R.M., Timpson, N.J., Lawlor, D.A., Pouta, A., Ben-Shlomo, Y., Ruokonen, A., Ebrahim, S., Shields, B., Zeggini,
E., Weedon, M.N., et al. (2008). Common variation in the
FTO gene alters diabetes-related metabolic traits to the extent
expected given its effect on BMI. Diabetes 57, 1419–1426.
57. Teslovich, T.M., Musunuru, K., Smith, A.V., Edmondson,
A.C., Stylianou, I.M., Koseki, M., Pirruccello, J.P., Ripatti, S.,
Chasman, D.I., Willer, C.J., et al. (2010). Biological, clinical
and population relevance of 95 loci for blood lipids. Nature
466, 707–713.
58. Gieger, C., Radhakrishnan, A., Cvejic, A., Tang, W., Porcu, E.,
Pistis, G., Serbanovic-Canic, J., Elling, U., Goodall, A.H., Labrune, Y., et al. (2011). New gene functions in megakaryopoiesis and platelet formation. Nature 480, 201–208.
59. Mihaescu, R., Meigs, J., Sijbrands, E., and Janssens, A.C.
(2011). Genetic risk profiling for prediction of type 2 diabetes. PLoS Curr. 3, RRN1208.
60. Elliott, P., Chambers, J.C., Zhang, W., Clarke, R., Hopewell,
J.C., Peden, J.F., Erdmann, J., Braund, P., Engert, J.C., Bennett,
D., et al. (2009). Genetic loci associated with C-reactive
protein levels and risk of coronary heart disease. JAMA 302,
37–48.
61. Owen, K.R., Thanabalasingham, G., James, T.J., Karpe, F.,
Farmer, A.J., McCarthy, M.I., and Gloyn, A.L. (2010). Assessment of high-sensitivity C-reactive protein levels as diagnostic discriminator of maturity-onset diabetes of the young
due to HNF1A mutations. Diabetes Care 33, 1919–1924.
62. Thanabalasingham, G., Shah, N., Vaxillaire, M., Hansen, T.,
Tuomi, T., Gasperikova, D., Szopa, M., Tjora, E., James, T.J.,
Kokko, P., et al. (2011). A large multi-centre European study
validates high-sensitivity C-reactive protein (hsCRP) as a
clinical biomarker for the diagnosis of diabetes subtypes.
Diabetologia 54, 2801–2810.
63. Zhou, K., Bellenguez, C., Spencer, C.C., Bennett, A.J.,
Coleman, R.L., Tavendale, R., Hawley, S.A., Donnelly, L.A.,
Schofield, C., Groves, C.J., et al; GoDARTS and UKPDS
Diabetes Pharmacogenetics Study Group; Wellcome Trust
Case Control Consortium 2; MAGIC investigators. (2011).
Common variants near ATM are associated with glycemic
response to metformin in type 2 diabetes. Nat. Genet. 43,
117–120.
64. Stefansson, H., Helgason, A., Thorleifsson, G., Steinthorsdottir, V., Masson, G., Barnard, J., Baker, A., Jonasdottir, A., Ingason, A., Gudnadottir, V.G., et al. (2005). A common inversion
under selection in Europeans. Nat. Genet. 37, 129–137.
65. Kong, A., Barnard, J., Gudbjartsson, D.F., Thorleifsson, G.,
Jonsdottir, G., Sigurdardottir, S., Richardsson, B., Jonsdottir,
J., Thorgeirsson, T., Frigge, M.L., et al. (2004). Recombination
rate and reproductive success in humans. Nat. Genet. 36,
1203–1206.
66. Hinch, A.G., Tandon, A., Patterson, N., Song, Y., Rohland, N.,
Palmer, C.D., Chen, G.K., Wang, K., Buxbaum, S.G., Akylbekova, E.L., et al. (2011). The landscape of recombination in
African Americans. Nature 476, 170–175.
67. Seldin, M.F., Tian, C., Shigeta, R., Scherbarth, H.R., Silva, G.,
Belmont, J.W., Kittles, R., Gamron, S., Allevi, A., Palatnik,
S.A., et al. (2007). Argentine population genetic structure:
Large variance in Amerindian contribution. Am. J. Phys.
Anthropol. 132, 455–462.
68. Seldin, M.F., Shigeta, R., Villoslada, P., Selmi, C., Tuomilehto,
J., Silva, G., Belmont, J.W., Klareskog, L., and Gregersen, P.K.
(2006). European population substructure: Clustering of
northern and southern populations. PLoS Genet. 2, e143.
69. Tian, C., Hinds, D.A., Shigeta, R., Kittles, R., Ballinger, D.G.,
and Seldin, M.F. (2006). A genomewide single-nucleotide-polymorphism panel with high ancestry information for
African American admixture mapping. Am. J. Hum. Genet.
79, 640–649.
70. McEvoy, B.P., Montgomery, G.W., McRae, A.F., Ripatti, S.,
Perola, M., Spector, T.D., Cherkas, L., Ahmadi, K.R.,
Boomsma, D., Willemsen, G., et al. (2009). Geographical
structure and differential natural selection among North
European populations. Genome Res. 19, 804–814.
71. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V.,
Fabianova, E., Foretova, L., Georges, M., Janout, V., Kabesch,
M., et al. (2008). Investigation of the fine structure of
European populations with applications to disease association studies. Eur. J. Hum. Genet. 16, 1413–1429.
72. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R.,
Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson,
M.R., et al. (2008). Genes mirror geography within Europe.
Nature 456, 98–101.
73. Price, A.L., Butler, J., Patterson, N., Capelli, C., Pascali, V.L.,
Scarnicci, F., Ruiz-Linares, A., Groop, L., Saetta, A.A., Korkolopoulou, P., et al. (2008). Discerning the ancestry of European
Americans in genetic association studies. PLoS Genet. 4,
e236.
74. Manolio, T.A. (2010). Genomewide association studies
and assessment of the risk of disease. N. Engl. J. Med. 363,
166–176.
75. Sivakumaran, S., Agakov, F., Theodoratou, E., Prendergast,
J.G., Zgaga, L., Manolio, T., Rudan, I., McKeigue, P., Wilson,
J.F., and Campbell, H. (2011). Abundant pleiotropy in
human complex diseases and traits. Am. J. Hum. Genet. 89,
607–618.
76. Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H., and
Goldstein, D.B. (2010). Rare variants create synthetic
genome-wide associations. PLoS Biol. 8, e1000294.
77. Anderson, C.A., Soranzo, N., Zeggini, E., and Barrett, J.C.
(2011). Synthetic associations are unlikely to account for
many common disease genome-wide association signals.
PLoS Biol. 9, e1000580.
78. Wray, N.R., Purcell, S.M., and Visscher, P.M. (2011). Synthetic
associations created by rare variants do not explain most
GWAS results. PLoS Biol. 9, e1000579.
79. Visscher, P.M., Goddard, M.E., Derks, E.M., and Wray, N.R.
(2011). Evidence-based psychiatric genetics, AKA the false
dichotomy between common and rare variant hypotheses.
Molecular Psychiatry, in press. Published online 14 June
2011. 2010.1038/mp.2011.2065.
80. Hunter, D.J., and Kraft, P. (2007). Drinking from the fire
hose—Statistical issues in genomewide association studies.
N. Engl. J. Med. 357, 436–439.
81. Pryce, J.E., Hayes, B.J., Bolormaa, S., and Goddard, M.E.
(2011). Polymorphic regions affecting human height also
control stature in cattle. Genetics 187, 981–984.
82. Bodmer, W.F. (1986). Human genetics: The molecular challenge. Cold Spring Harb. Symp. Quant. Biol. 51, 1–13.
83. Risch, N., and Merikangas, K. (1996). The future of genetic
studies of complex human diseases. Science 273, 1516–
1517.
84. Wray, N.R. (2005). Allele frequencies and the r2 measure of
linkage disequilibrium: impact on design and interpretation
of association studies. Twin Res. Hum. Genet. 8, 87–94.
85. McClellan, J.M., Susser, E., and King, M.C. (2007). Schizophrenia: A common disease caused by multiple rare alleles.
Br. J. Psychiatry 190, 194–199.
86. Craddock, N., O’Donovan, M.C., and Owen, M.J. (2007).
Phenotypic and genetic complexity of psychosis. Invited
commentary on... Schizophrenia: a common disease caused
by multiple rare alleles. Br. J. Psychiatry 190, 200–203.
87. Lander, E.S. (1996). The new genomics: Global views of
biology. Science 274, 536–539.
88. Chakravarti, A. (1999). Population genetics—Making sense
out of sequence. Nat. Genet. 21 (1, Suppl), 56–60.
89. Reich, D.E., and Lander, E.S. (2001). On the allelic spectrum
of human disease. Trends Genet. 17, 502–510.
90. Risch, N. (1990). Linkage strategies for genetically complex
traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222–228.
91. Slatkin, M. (2008). Exchangeable models of complex inherited diseases. Genetics 179, 2253–2261.
92. Hill, W.G., Goddard, M.E., and Visscher, P.M. (2008). Data
and theory point to mainly additive genetic variance for
complex traits. PLoS Genet. 4, e1000008.
93. Wang, K., Dickson, S.P., Stolle, C.A., Krantz, I.D., Goldstein,
D.B., and Hakonarson, H. (2010). Interpretation of association signals and identification of causal variants from
genome-wide association studies. Am. J. Hum. Genet. 86,
730–742.
94. Nejentsev, S., Walker, N., Riches, D., Egholm, M., and Todd,
J.A. (2009). Rare variants of IFIH1, a gene implicated in antiviral
responses, protect against type 1 diabetes. Science 324,
387–389.
95. Momozawa, Y., Mni, M., Nakamura, K., Coppieters, W.,
Almer, S., Amininejad, L., Cleynen, I., Colombel, J.F.,
de Rijk, P., Dewit, O., et al. (2011). Resequencing of positional
candidates identifies low frequency IL23R coding variants
protecting against inflammatory bowel disease. Nat. Genet.
43, 43–47.
96. Rivas, M.A., Beaudoin, M., Gardet, A., Stevens, C., Sharma, Y.,
Zhang, C.K., Boucher, G., Ripke, S., Ellinghaus, D., Burtt, N.,
et al; National Institute of Diabetes and Digestive Kidney
Diseases Inflammatory Bowel Disease Genetics Consortium
(NIDDK IBDGC); United Kingdom Inflammatory Bowel
Disease Genetics Consortium; International Inflammatory
Bowel Disease Genetics Consortium. (2011). Deep resequencing of GWAS loci identifies independent rare variants
associated with inflammatory bowel disease. Nat. Genet.
43, 1066–1073.
97. Wang, K., Bucan, M., Grant, S.F., Schellenberg, G., and Hakonarson, H. (2010). Strategies for genetic studies of complex
diseases. Cell 142, 351–353, author reply 353–355.
98. Hyttinen, V., Kaprio, J., Kinnunen, L., Koskenvuo, M., and
Tuomilehto, J. (2003). Genetic liability of type 1 diabetes
and the onset age among 22,650 young Finnish twin pairs:
A nationwide follow-up study. Diabetes 52, 1052–1055.
99. Polychronakos, C., and Li, Q. (2011). Understanding type 1
diabetes through genetics: Advances and prospects. Nat.
Rev. Genet. 12, 781–792.
100. Poulsen, P., Kyvik, K.O., Vaag, A., and Beck-Nielsen, H.
(1999). Heritability of type II (non-insulin-dependent)
diabetes mellitus and abnormal glucose tolerance—A population-based twin study. Diabetologia 42, 139–145.
101. Magnusson, P.K., and Rasmussen, F. (2002). Familial resemblance of body mass index and familial risk of high and
low body mass index. A study of young men in Sweden.
Int. J. Obes. Relat. Metab. Disord. 26, 1225–1231.
102. Schousboe, K., Willemsen, G., Kyvik, K.O., Mortensen, J.,
Boomsma, D.I., Cornes, B.K., Davis, C.J., Fagnani, C., Hjelmborg, J., Kaprio, J., et al. (2003). Sex differences in heritability
of BMI: A comparative study of results from twin studies in
eight countries. Twin Res. 6, 409–421.
103. Tysk, C., Lindberg, E., Järnerot, G., and Flodérus-Myrhed, B.
(1988). Ulcerative colitis and Crohn’s disease in an unselected population of monozygotic and dizygotic twins. A
study of heritability and the influence of smoking. Gut 29,
990–996.
104. Hawkes, C.H., and Macgregor, A.J. (2009). Twin studies
and the heritability of MS: A conclusion. Mult. Scler. 15,
661–667.
105. Brown, M.A., Kennedy, L.G., MacGregor, A.J., Darke, C.,
Duncan, E., Shatford, J.L., Taylor, A., Calin, A., and Wordsworth, P. (1997). Susceptibility to ankylosing spondylitis in
twins: The role of genes, HLA, and the environment.
Arthritis Rheum. 40, 1823–1828.
106. Brown, M.A. (2011). Progress in the genetics of ankylosing
spondylitis. Brief. Funct. Genomics 10, 249–257.
107. MacGregor, A.J., Snieder, H., Rigby, A.S., Koskenvuo, M.,
Kaprio, J., Aho, K., and Silman, A.J. (2000). Characterizing
the quantitative genetic contribution to rheumatoid arthritis
using data from twins. Arthritis Rheum. 43, 30–37.
108. Lichtenstein, P., Yip, B.H., Björk, C., Pawitan, Y., Cannon,
T.D., Sullivan, P.F., and Hultman, C.M. (2009). Common
The American Journal of Human Genetics 90, 7–24, January 13, 2012 23
109.
110.
111.
112.
113.
114.
115.
genetic determinants of schizophrenia and bipolar disorder
in Swedish families: A population-based study. Lancet 373,
234–239.
Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M., O’Donovan, M.C., Sullivan, P.F., and Sklar, P.; International Schizophrenia Consortium. (2009). Common polygenic variation
contributes to risk of schizophrenia and bipolar disorder.
Nature 460, 748–752.
Lichtenstein, P., Holm, N.V., Verkasalo, P.K., Iliadou, A.,
Kaprio, J., Koskenvuo, M., Pukkala, E., Skytthe, A., and Hemminki, K. (2000). Environmental and heritable factors in the
causation of cancer—Analyses of cohorts of twins from
Sweden, Denmark, and Finland. N. Engl. J. Med. 343, 78–85.
Turnbull, C., Ahmed, S., Morrison, J., Pernet, D., Renwick, A.,
Maranian, M., Seal, S., Ghoussaini, M., Hines, S., Healey,
C.S., et al; Breast Cancer Susceptibility Collaboration (UK).
(2010). Genome-wide association study identifies five new
breast cancer susceptibility loci. Nat. Genet. 42, 504–507.
Orstavik, K.H., Magnus, P., Reisner, H., Berg, K., Graham, J.B.,
and Nance, W. (1985). Factor VIII and factor IX in a twin
population. Evidence for a major effect of ABO locus on
factor VIII level. Am. J. Hum. Genet. 37, 89–101.
de Lange, M., Snieder, H., Ariëns, R.A., Spector, T.D., and
Grant, P.J. (2001). The genetics of haemostasis: A twin study.
Lancet 357, 101–105.
Smith, N.L., Chen, M.H., Dehghan, A., Strachan, D.P., Basu,
S., Soranzo, N., Hayward, C., Rudan, I., Sabater-Lleal, M., Bis,
J.C., et al; Wellcome Trust Case Control Consortium. (2010).
Novel associations of multiple genetic loci with plasma levels
of factor VII, factor VIII, and von Willebrand factor: The
CHARGE (Cohorts for Heart and Aging Research in Genome
Epidemiology) Consortium. Circulation 121, 1382–1392.
Visscher, P.M., Medland, S.E., Ferreira, M.A., Morley, K.I.,
Zhu, G., Cornes, B.K., Montgomery, G.W., and Martin,
N.G. (2006). Assumption-free estimation of heritability
from genome-wide identity-by-descent sharing between
full siblings. PLoS Genet. 2, e41.
116. Silventoinen, K., Sammalisto, S., Perola, M., Boomsma, D.I.,
Cornes, B.K., Davis, C., Dunkel, L., De Lange, M., Harris,
J.R., Hjelmborg, J.V., et al. (2003). Heritability of adult body
height: A comparative study of twin cohorts in eight countries. Twin Res. 6, 399–408.
117. Peacock, M., Turner, C.H., Econs, M.J., and Foroud, T. (2002).
Genetics of osteoporosis. Endocr. Rev. 23, 303–326.
118. Duncan, E.L., Danoy, P., Kemp, J.P., Leo, P.J., McCloskey, E.,
Nicholson, G.C., Eastell, R., Prince, R.L., Eisman, J.A., Jones,
G., et al. (2011). Genome-wide association study using
extreme truncate selection identifies novel genes affecting
bone mineral density and fracture risk. PLoS Genet. 7,
e1001372.
119. Dalageorgou, C., Ge, D., Jamshidi, Y., Nolte, I.M., Riese, H.,
Savelieva, I., Carter, N.D., Spector, T.D., and Snieder, H.
(2008). Heritability of QT interval: how much is explained
by genes for resting heart rate? J. Cardiovasc. Electrophysiol.
19, 386–391.
120. Russell, M.W., Law, I., Sholinsky, P., and Fabsitz, R.R. (1998).
Heritability of ECG measurements in adult male twins. J.
Electrocardiol. Suppl. 30, 64–68.
121. Shah, S.H., and Pitt, G.S. (2009). Genetics of cardiac repolarization. Nat. Genet. 41, 388–389.
122. Hunt, S.C., Hasstedt, S.J., Kuida, H., Stults, B.M., Hopkins,
P.N., and Williams, R.R. (1989). Genetic heritability and
common environmental components of resting and stressed
blood pressures, lipids, and body mass index in Utah pedigrees and twins. Am. J. Epidemiol. 129, 625–638.
123. Evans, D.M., Frazer, I.H., and Martin, N.G. (1999). Genetic
and environmental causes of variation in basal levels of
blood cells. Twin Research: The Official Journal of the International Society for Twin Studies 2, 250–257.
24 The American Journal of Human Genetics 90, 7–24, January 13, 2012
Unraveling the Regulatory Mechanisms Underlying
Tissue-Dependent Genetic Variation of Gene Expression
Jingyuan Fu1,2*, Marcel G. M. Wolfs3, Patrick Deelen4, Harm-Jan Westra1, Rudolf S. N. Fehrmann1,5,
Gerard J. te Meerman1, Wim A. Buurman6, Sander S. M. Rensen6, Harry J. M. Groen7, Rinse K. Weersma8,
Leonard H. van den Berg9, Jan Veldink9, Roel A. Ophoff10, Harold Snieder2, David van Heel11, Ritsert C.
Jansen12, Marten H. Hofker3, Cisca Wijmenga1, Lude Franke1*
1 Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands, 2 Department of Epidemiology, University Medical
Center Groningen, University of Groningen, Groningen, The Netherlands, 3 Department of Pathology and Medical Biology, Molecular Genetics, University Medical Center
Groningen, University of Groningen, Groningen, The Netherlands, 4 Hanze University Groningen, Groningen, The Netherlands, 5 Department of Medical Oncology,
University Medical Center Groningen, University of Groningen, Groningen, The Netherlands, 6 Department of Surgery, University Hospital Maastricht and Nutrition and
Toxicology Research Institute (NUTRIM), Maastricht University, Maastricht, The Netherlands, 7 Department of Pulmonology, University Medical Centre Groningen,
University of Groningen, Groningen, The Netherlands, 8 Department of Gastroenterology and Hepatology, University Medical Center Groningen, University of Groningen,
Groningen, The Netherlands, 9 Department of Neurology, Rudolf Magnus Institute of Neuroscience, University Medical Center Utrecht, Utrecht, The Netherlands,
10 Department of Medical Genetics, University Medical Center Utrecht, Utrecht, The Netherlands, 11 Blizard Institute of Cell and Molecular Science, Barts and The London
School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom, 12 Groningen Bioinformatics Centre, Groningen Biomolecular Sciences
and Biotechnology Institute, University of Groningen, Haren, The Netherlands
Abstract
It is known that genetic variants can affect gene expression, but it is not yet completely clear through what mechanisms
genetic variation mediates this expression. We therefore compared the cis-effect of single nucleotide polymorphisms (SNPs)
on gene expression between blood samples from 1,240 human subjects and four primary non-blood tissues (liver,
subcutaneous adipose tissue, visceral adipose tissue, and skeletal muscle) from 85 subjects. We characterized four different
mechanisms for 2,072 probes that show tissue-dependent genetic regulation between blood and non-blood tissues: on
average, 33.2% only showed cis-regulation in non-blood tissues; 14.5% of the eQTL probes were regulated by different,
independent SNPs depending on the tissue of investigation; and 47.9% showed a different effect size although they were
regulated by the same SNPs. Surprisingly, we observed that 4.4% were regulated by the same SNP but with opposite allelic
direction. We show here that SNPs that are located in transcriptional regulatory elements are enriched for tissue-dependent
regulation, including SNPs at 3′ and 5′ untranslated regions (P = 1.84 × 10⁻⁵ and 4.7 × 10⁻⁴, respectively) and SNPs that are
synonymous-coding (P = 9.9 × 10⁻⁴). SNPs that are associated with complex traits more often exert a tissue-dependent effect
on gene expression (P = 2.6 × 10⁻¹⁰). Our study yields new insights into the genetic basis of tissue-dependent expression
and suggests that complex trait associated genetic variants have even more complex regulatory effects than previously
anticipated.
Citation: Fu J, Wolfs MGM, Deelen P, Westra H-J, Fehrmann RSN, et al. (2012) Unraveling the Regulatory Mechanisms Underlying Tissue-Dependent Genetic
Variation of Gene Expression. PLoS Genet 8(1): e1002431. doi:10.1371/journal.pgen.1002431
Editor: Greg Gibson, Georgia Institute of Technology, United States of America
Received July 19, 2011; Accepted November 8, 2011; Published January 19, 2012
Copyright: © 2012 Fu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by IOP Genomics grant IGE05012A, the Netherlands Organisation for Scientific Research (NWO) VICI grant 918.66.620 (CW), a
European Union FP7 COPACETIC grant 201379 (CW), the Dutch Diabetes Foundation (2006.00.007), the Wellcome Trust (084743 to DvH), the Medical Research
Council UK (G1001158 to DvH), Juvenile Diabetes Research Foundation (33-2008-402 to DvH), a NWO VENI grant 863.09.007 (JF), a NWO VENI grant 916.10.135, a
Horizon Breakthrough grant 92519031 from the Netherlands Genomics Initiative (LF), a NWO clinical fellowship grant 90.700.281 (RKW), the Netherlands ALS
foundation and the Adessium Foundation (LHvdB), the Thierry Latran Foundation (JV), and a Transnational University Limburg (TUL) grant (SSMR). The research
leading to these results has received funding from the European Community’s Health Seventh Framework Programme (FP7/2007–2013) under grant agreement
no. 259867. This study was financed in part by the SIA-raakPRO subsidy for project BioCOMP. The funders had no role in study design, data collection and analysis,
decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: fjingyuan@gmail.com (JF); lude@ludesign.nl (LF)
Introduction
It has become clear that human genetic variants, such as single
nucleotide polymorphisms (SNPs), can in cis affect the expression
of nearby genes [1], [2]. Many loci exist that contain genetic
variants that affect gene expression (expression quantitative trait loci,
eQTL, usually assessed by investigating SNPs and expression probes that are within 250 kb
up to 1 Mb apart). These cis-eQTL analyses have been performed
in many different human tissues and cell types, including
lymphoblastoid cell lines (LCL) [3], [4], liver [5]–[7], blood [8],
[9], brain [10], [11], adipose tissues [6], [8], skin [12], [13] and
primary fibroblasts [12]. However, considerable heterogeneity of
cis-eQTL effects is possible between different tissues: A recent
study reported that the proportion of the heritability of gene
expression attributable to cis-regulation differs between tissues
(37% in blood and 24% in adipose tissue) [14]. By comparing the
overlap of significant cis-eQTL at a predefined threshold, estimates
of the tissue-dependence of cis-eQTL were between 30% (liver,
adipose tissues) and 70–80% (LCLs, fibroblasts, T cells) [8], [9],
PLoS Genetics | www.plosgenetics.org
1
January 2012 | Volume 8 | Issue 1 | e1002431
Mechanisms Underlying Tissue-Dependent cis-eQTL
HumanHT12 v3 platform (see Materials and Methods). After
normalization, we further removed strong expression differences
between these tissues by removing the 50 principal components
from this dataset and using the residuals for further analysis
(described in [18] and Materials and Methods, Figure S1). We first
performed cis-eQTL analysis in each of these datasets separately,
by testing the correlation between SNPs and probes that were
mapping within 1 Mb distance. At a false-discovery rate (FDR) of
0.05, we identified a non-overlapping set of 195,078 probe-SNP pairs that were significant in at least one of the tissues under
study: 4,700 probe-SNP pairs were significantly associated in liver,
7,161 pairs in SAT, 5,323 pairs in VAT,
1,971 pairs in muscle, and 190,278 pairs in
blood (Figure S2). Owing to the much larger sample size, 182,569
probe-SNP pairs (93.6%) were solely detected in blood, while only
601 probe-SNP pairs (0.31%) were significant in each of the five
different tissues (Figure S3). Although a previous study showed that
the heritability of gene expression levels is higher in blood (37%)
than in adipose tissue (24%) [14], we believe that the large
difference in the detected probe-SNP pairs between blood and
non-blood tissues is due to statistical power issues that result from
substantial sample size differences. As we had initially run cis-eQTL analyses in each of the tissues separately, we subsequently
conducted a weighted Z-score meta-analysis across the four non-blood tissues and detected 23,878 probe-SNP pairs at an FDR of
0.05. Out of these, 23.2% (5,550 out of 23,878 probe-SNP pairs)
had not been identified in any of the single-tissue analyses (Figure
S4). In total, the single-tissue analyses and meta-analysis yielded a
non-overlapping set of 200,629 significant probe-SNP pairs,
corresponding to 103,968 unique expression altering SNPs
(eSNPs) and 11,618 probes (eProbes) that represent 8,561 unique
genes (eGenes) (Figure S2).
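A weighted Z-score meta-analysis of this kind is commonly computed by weighting each tissue's Z-score by the square root of its sample size. The sketch below assumes that weighting; the exact scheme belongs to Materials and Methods, which is not reproduced here. The function names and the per-tissue Z-scores are hypothetical illustration values, while the sample sizes are those reported for the four non-blood tissues in this study.

```python
import math


def weighted_z(z_scores, sample_sizes):
    """Combine per-tissue Z-scores, weighting each by sqrt(sample size)."""
    if not z_scores or len(z_scores) != len(sample_sizes):
        raise ValueError("need exactly one sample size per Z-score")
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


def two_sided_p(z):
    """Two-sided P value of a standard-normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Sample sizes as reported in the text; Z-scores are made up for illustration.
sizes = {"liver": 74, "SAT": 83, "VAT": 77, "muscle": 62}
z_by_tissue = {"liver": 2.1, "SAT": 2.6, "VAT": 1.9, "muscle": 1.4}

tissues = list(sizes)
z_meta = weighted_z([z_by_tissue[t] for t in tissues],
                    [sizes[t] for t in tissues])
print(f"meta-analysis Z = {z_meta:.2f}, P = {two_sided_p(z_meta):.2e}")
```

Under this weighting, modest but consistent effects in several small tissues can combine into a significant meta-analysis signal, which is in line with the observation that some probe-SNP pairs were detected only in the meta-analysis.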
Author Summary
Gene expression can be affected by genetic variation, e.g.
single nucleotide polymorphisms (SNPs). These are called
expression-affecting SNPs or eSNPs. Gene expression levels
are known to vary across different tissues in the same
individual, despite the fact that genetic variation is the
same in these tissues. We explored the different mechanisms by which genetic variants can mediate tissue-dependent gene expression. We observed that the genetic
variants that are associated with complex traits are more likely
to affect gene expression in a tissue-dependent manner.
Our results suggest that complex traits are even more
complex than we had anticipated, and they underline the
great importance of using expression data from tissues
relevant to the disease being studied in order to further
the understanding of the biology underlying the disease
association.
[15], [16]. However, due to statistical power issues, it is likely that
the tissue-dependency of cis-eQTL has been overestimated by
studies solely assessing the overlap of cis-eQTL between tissues
based on a certain threshold. Realizing this problem, Ding et al.
used a refined statistical method to estimate the percentage of
overlap by adding a power parameter to the model [12]. They
reported that only 30% of cis-eQTL in LCLs were not shared with
fibroblast cis-eQTL. Similarly, a recent study by Nica et al. [13]
examined the tissue-dependence of cis-eQTL in three human
tissues (LCL, skin and fat) in a continuous manner by quantifying
the proportion of overlap of cis-eQTL from the enrichment of low
P-values. They observed that 29% of cis-eQTL appear to be
exclusively tissue-dependent, and also observed that the effect sizes
of 10–20% of the cis-eQTL present in multiple tissues differ per
tissue type. These observations are in line with a large-scale
transcriptomic analysis of 46 human tissues, which found that
while only 6.0% of genes were ubiquitously expressed across all the
assessed tissues, 3.1% of genes were only expressed in a single tissue
[17].
To gain a better understanding of this subtle
tissue-dependent regulation and to address the question of how
genetic variants mediate tissue-dependent expression, we compared cis-regulation between whole peripheral blood from a large
cohort of 1,240 individuals and four smaller primary human
tissues (liver, subcutaneous adipose tissue (SAT), visceral adipose
tissue (VAT) and skeletal muscle) obtained from a set of 85
subjects. We first applied a robust sampling procedure to estimate
accurately how often genes showed different cis-eQTL effects
between tissues. We then investigated in what way genes are
differently associated with SNPs in different tissues. Finally, we
assessed various functional properties for the SNPs involved in
tissue-dependent cis-regulation and their association with complex
traits.
Cis-eQTL Effects Differ per Tissue Type
To assess the tissue-dependency of the cis-eQTL, we compared
the Spearman correlation of each probe-SNP pair between tissues.
However, due to the small sample sizes of the non-blood datasets
we had very limited statistical power to determine whether there
were cis-eQTL effect differences between non-blood tissues. We
therefore confined ourselves to comparisons between the large
blood dataset and each of the smaller non-blood tissues. To correct
for sample size differences, we employed a resampling procedure,
permitting us to derive an empirical distribution of association Z-scores (calculated based on the Spearman correlation) of each
probe-SNP pair in blood at the same sample size as in non-blood
tissues (see Materials and Methods; Figure S5). We observed that
18,456 pairs (9.2% of 200,629 probe-SNP pairs) showed a
significantly different Z-score between blood and at least one of
the non-blood tissues at P < 6.23 × 10⁻⁸ (corresponding to a
conservative Bonferroni-corrected P < 0.05), implying a discordant
association between blood and non-blood tissues. The remaining
182,173 probe-SNP pairs, which we called “concordant associations”, had similar association Z-scores between the tissues under
study (Figure S2). The “discordant associations” accounted for
15.4% of the eSNPs (15,974 out of 103,968 eSNPs), 28.7% of the
eProbes (3,330 out of 11,618 eProbes), and 34.1% of the unique
eGenes (2,919 out of 8,561 eGenes) (Table S2 and Figure S2). We
further assessed for each probe-SNP pair, whether the discordance
was detected between blood and multiple non-blood tissues, or
only between blood and one specific non-blood tissue. We
observed that 14,388 probe-SNP pairs (78.0% of the 18,456
discordant probe-SNP pairs) only showed a discordant effect
between blood and one specific non-blood tissue. Only 125 probe-SNP pairs (corresponding to 31 eProbes) showed a discordant
Results
Cis-eQTL Mapping in Five Primary Tissues
For this study, we collected data for four different tissues from a
set of 85 unrelated obese Dutch subjects. We successfully collected
data on 74 liver samples, 62 muscle samples, 83 subcutaneous
adipose tissue (SAT) samples and 77 visceral adipose tissue (VAT)
samples (for 48 individuals all four tissues were available). The fifth
tissue, blood, was collected from a different group of 1,240
unrelated Dutch individuals (Table S1). The gene expression levels
in all five tissues were profiled using the same Illumina
four non-blood tissues. In total, we ended up with 13,603 probe-SNP pairs (12,549 top eSNPs that were affecting 11,575 probes)
in these six analyses. Among them, 2,612 probe-SNP pairs
(19.2%) showed a discordant effect among tissues at the
P = 6.23 × 10⁻⁸ level (genome-wide test level), accounting for
2,466 (19.7%) unique eSNPs.
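The genome-wide test level of 6.23 × 10⁻⁸ is consistent with a Bonferroni correction of 0.05 over the 200,629 significant probe-SNP pairs, each compared between blood and four non-blood tissues. The text does not state this divisor explicitly, so the number of tests below is an assumption that happens to reproduce the reported value.

```python
n_pairs = 200_629      # significant probe-SNP pairs reported in the text
n_comparisons = 4      # assumed: blood vs. liver, SAT, VAT and muscle
alpha = 0.05

threshold = alpha / (n_pairs * n_comparisons)
print(f"Bonferroni threshold: {threshold:.3g}")
```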
We found that the top eSNPs with discordant effect had a
significantly higher minor allele frequency (MAF) than the
concordant top eSNPs (Wilcoxon test P = 8.27 × 10⁻²¹). The
eSNPs at a smaller distance from the eProbe (≤250 kb) were more
likely to show a discordant effect than the eSNPs at a larger
distance (250 kb–1 Mb distance, OR = 1.62, P = 3.6 × 10⁻²²,
Figure S10). Although we acknowledge that the top eSNPs do
not necessarily reflect the true causal variants, we annotated the
functional properties of the top eSNPs to understand the potential
roles of the eSNPs (irrespective of whether these reflect concordant
or discordant eProbes). We observed that most of the eSNPs
were located in intragenic regions (67.0%) and intronic regions
(14.9%), where their function often remains undetermined.
Interestingly, eSNPs with discordant effect were (compared to
concordant eSNPs) significantly enriched for synonymous-coding
SNPs (Fisher’s exact P value 9.9 × 10⁻⁴), and more often mapped
in the 3′ and 5′ untranslated regions (UTRs, Fisher’s exact P
values 1.84 × 10⁻⁵ and 4.7 × 10⁻⁴, respectively) (Figure 1).
As shown before, we observed that SNPs associated with
complex traits and diseases are more likely to be eSNPs [2], [6],
[8], [18], [19]. We subsequently analysed 1,954 trait-associated
SNPs (at P < 5 × 10⁻⁸, retrieved from the GWAS catalog as of 16
September 2011) [20] and observed that 907 trait-associated SNPs
(46.4%) were eSNPs. Of these, 261 trait-associated eSNPs (28.7%)
showed discordant effects on gene expression, which is significantly
higher than what we observed for all 103,968 trait- and non-trait-associated eSNPs (15.4% discordant, Fisher’s exact test
P = 1.10 × 10⁻³³) and also significantly higher than when we compare
this to only the 12,549 top eSNPs (19.7% discordant, Fisher’s
exact test P = 2.6 × 10⁻¹⁰).
association in all four comparisons, suggesting similar regulation in
the four non-blood tissues but markedly different regulation in
blood (Figure S6). As such these results reveal there are
considerable differences in the genetically determined regulation
of gene expression between liver, SAT, VAT and muscle tissues,
even though the RNA from these tissues had been derived from
the same individuals and was collected at exactly the same time.
To ensure that our sampling procedure was robust, we used the
same procedure to assess how often our method incorrectly
concluded that a probe-SNP Spearman correlation differed
between two independent eQTL datasets in the same peripheral
blood tissue. We used the 1,240 blood samples as the discovery set and
an independent set of 229 blood samples, whose
expression was profiled using Illumina H8-v2 chips [18], [19], as the validation set (see
Materials and Methods). In this analysis, our method incorrectly
deemed that 0.45% of the probe-SNP pairs showed a significant
difference at the previously used P < 6.23 × 10⁻⁸ level (Figure S7).
In our comparisons between blood and non-blood tissues we had
observed that 9.2% of the probe-SNP pairs showed a discordant
effect, which is substantially higher and indicates that the number
of discordant associations that we identified when comparing
different tissues is not expected by chance (Fisher’s exact test:
OR = 20.6 and P < 10⁻³⁰⁰). We also assessed whether imputation
accuracy differences between datasets might confound some of the
results, but did not find evidence that this was the case (see Materials
and Methods).
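The resampling idea evaluated above can be sketched as follows: the large blood dataset is repeatedly down-sampled to the size of a non-blood tissue, a Spearman-based Z-score is computed for each draw, and the non-blood Z-score is compared against the resulting empirical distribution. This is a minimal sketch under stated assumptions, not the procedure from Materials and Methods. The conversion of the Spearman correlation to a Z-score (Fisher transformation with a 1.06 variance factor), the standardized difference used for the comparison, all function names, and the simulated data are illustrative choices.

```python
import math
import random


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def spearman_z(genotypes, expression):
    """Spearman correlation converted to an approximate Z-score (assumed form)."""
    rho = pearson(ranks(genotypes), ranks(expression))
    rho = max(min(rho, 0.999999), -0.999999)  # keep atanh finite
    return math.atanh(rho) * math.sqrt((len(genotypes) - 3) / 1.06)


def resampled_z(genotypes, expression, subsample_size, n_draws, rng):
    """Z-scores of random subsamples drawn without replacement."""
    indices = range(len(genotypes))
    z_scores = []
    for _ in range(n_draws):
        chosen = rng.sample(indices, subsample_size)
        z_scores.append(spearman_z([genotypes[i] for i in chosen],
                                   [expression[i] for i in chosen]))
    return z_scores


# Simulated data: an eQTL effect in "blood" (n = 1,240), none in "liver" (n = 74).
rng = random.Random(0)
blood_g = [rng.choice([0, 1, 2]) for _ in range(1240)]
blood_e = [0.5 * g + rng.gauss(0, 1) for g in blood_g]
liver_g = [rng.choice([0, 1, 2]) for _ in range(74)]
liver_e = [rng.gauss(0, 1) for _ in liver_g]

blood_z = resampled_z(blood_g, blood_e, subsample_size=74, n_draws=500, rng=rng)
mean_z = sum(blood_z) / len(blood_z)
sd_z = math.sqrt(sum((z - mean_z) ** 2 for z in blood_z) / (len(blood_z) - 1))
liver_z = spearman_z(liver_g, liver_e)
difference_z = (liver_z - mean_z) / sd_z
print(f"blood Z at n=74: {mean_z:.2f} (sd {sd_z:.2f}); liver Z: {liver_z:.2f}; "
      f"standardized difference: {difference_z:.2f}")
```

Down-sampling puts both Z-scores on the same sample-size scale, so a difference between them reflects a difference in effect rather than a difference in statistical power.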
Properties of eSNPs
For the significant 200,629 probe-SNP pairs, we observed that
for 146,480 pairs (73.0%) the eSNPs were located within 250 kb
distance of the eProbe while 54,149 probe-SNP pairs (27.0%)
mapped between 250 kb and 1 Mb apart. Consistent with a
previous study [15], we observed that eSNPs at a larger distance
from the probes tend to have smaller effects (Figure S8). However,
we realize that due to extensive LD many different SNPs are
usually significantly correlated with one single cis-eQTL probe. To
address this, we performed step-wise conditional analyses in each
tissue type to ascertain whether there were multiple SNPs that
independently affected the expression levels of the same probe. We
observed this for 26.8% of the eProbes in the large blood dataset
(2,794 out of 10,443 eProbes had
multiple independent eSNPs; Table S3). The secondary,
tertiary and quaternary eSNPs usually map further away from the
probe (Wilcoxon test P = 2.25 × 10⁻²⁶⁶, Figure S9), potentially
reflecting some regulatory elements such as enhancers that usually
reside further away from genes. In the non-blood tissues, we lacked
statistical power to detect many secondary and tertiary effects
(Table S3).
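A step-wise conditional analysis of the kind described above can be sketched as a loop: find the strongest SNP for a probe, regress its effect out of the expression values, and test the remaining SNPs on the residuals until nothing passes the threshold. The sketch below is a simplified illustration, not the authors' implementation. It uses Pearson rather than Spearman correlation, approximates the Z-score as r·√(n − 1), and the threshold, function names and simulated genotypes are all assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def residualize(y, x):
    """Residuals of y after a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = 0.0 if sxx == 0 else sxy / sxx
    return [(b - mean_y) - slope * (a - mean_x) for a, b in zip(x, y)]


def stepwise_independent_snps(genotypes_by_snp, expression,
                              z_threshold=5.0, max_rounds=4):
    """Return SNPs selected one at a time, each conditional on the previous ones."""
    remaining = dict(genotypes_by_snp)
    residual = list(expression)
    n = len(expression)
    selected = []
    for _ in range(max_rounds):
        if not remaining:
            break
        z = {snp: pearson(g, residual) * math.sqrt(n - 1)
             for snp, g in remaining.items()}
        best = max(z, key=lambda snp: abs(z[snp]))
        if abs(z[best]) < z_threshold:
            break
        selected.append(best)
        residual = residualize(residual, remaining.pop(best))
    return selected


# Simulated probe with two independent eSNPs and one perfect proxy (complete LD).
rng = random.Random(1)
n_samples = 1000
snp_a = [rng.choice([0, 1, 2]) for _ in range(n_samples)]
snp_b = [rng.choice([0, 1, 2]) for _ in range(n_samples)]
snp_c = list(snp_a)  # carries no information beyond snp_a
expr = [0.8 * a + 0.6 * b + rng.gauss(0, 1) for a, b in zip(snp_a, snp_b)]

independent = stepwise_independent_snps(
    {"snp_a": snp_a, "snp_b": snp_b, "snp_c": snp_c}, expr)
print(independent)
```

The proxy SNP is significant on its own but drops out once the SNP it tags has been regressed out, which is the distinction between "many SNPs correlated with a probe because of LD" and "multiple SNPs that independently affect the probe".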
Interestingly, there was a very high overlap between the
discordant eProbes (detected in our comparison across tissues) and
the eProbes with multiple independent effects in blood (detected in
the aforementioned analysis that solely used blood samples). Out
of the 10,443 eProbes in blood, 2,528 eProbes had discordant
association and 7,915 eProbes had concordant association. We
observed that 47.5% of the discordant eProbes had multiple
independent eSNPs present in blood (1,202 out of 2,528), whereas
only 20.1% of the concordant eProbes had multiple independent
eSNPs (1,592 out of 7,915, Fisher’s exact test P = 3.85 × 10⁻²⁸¹).
This observation suggests that for eProbes: 1) different independent
eSNPs can exist and 2) these independent eSNPs can exert an
effect in one tissue while they do not exert an effect in another
tissue.
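The overlap reported above can be laid out as a 2 × 2 table and checked with a few lines of arithmetic. The counts are taken from the text, with the concordant total taken as 7,915 (10,443 blood eProbes minus 2,528 discordant ones). The odds ratio is derived here for illustration only; it is not a value reported in the paper.

```python
discordant_total, discordant_multi = 2528, 1202
concordant_total, concordant_multi = 7915, 1592

# Rows: discordant / concordant eProbes.
# Columns: multiple independent eSNPs in blood / a single eSNP.
table = [
    [discordant_multi, discordant_total - discordant_multi],
    [concordant_multi, concordant_total - concordant_multi],
]

pct_discordant = 100 * discordant_multi / discordant_total
pct_concordant = 100 * concordant_multi / concordant_total
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])

print(f"discordant with multiple eSNPs: {pct_discordant:.1f}%")
print(f"concordant with multiple eSNPs: {pct_concordant:.1f}%")
print(f"odds ratio (derived, not reported): {odds_ratio:.2f}")
```

A P value for this table could be obtained with a Fisher's exact test routine such as `scipy.stats.fisher_exact`, which is left out here to keep the sketch free of dependencies.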
We subsequently analyzed the most significant eSNP per eProbe
per tissue and the top eSNP per eProbe from the meta-analysis of
Four Categories of Tissue-Dependent Cis-Regulation
As we have shown above, discordant eProbes are more likely to
be influenced by multiple independent eSNPs. However, solely
assessing the discordance of a single SNP-probe pair does not
provide an extensive landscape of the tissue-dependent genetic
determinants of gene expression. To gain further insight into this,
we created ‘association profiles’ for the discordant eProbes and
compared these across tissues. An association profile refers to the
association Z-scores of all tested SNPs within 1 Mb distance of the
eProbe under study (see Materials and Methods), and takes into
account multiple SNPs and linkage disequilibrium. We created
such association profiles for 2,007 discordant eProbes (521
eProbes from liver, 708 eProbes from SAT, 526 eProbes from VAT,
and 252 eProbes from muscle, Figure S2).
Upon inspection of these association profiles for the discordant
eProbes, we identified four main categories of tissue-dependent genetic regulation of gene expression. If the association
profiles for one single eProbe did not correlate at all between two
tissues, we further checked whether the eProbe was significant in
both tissues: If the probe had a significant association in one tissue
but not in the other, we deemed this “specific cis-regulation”. If
instead the eProbe was significant in both tissues, but was associated
with different (unlinked) eSNPs in the different tissues, we deemed it
“alternative cis-regulation” between tissues. For those association
profiles where two tissues showed a correlation, we checked the
direction and the effect size of the allelic effect on gene expression. If
the allelic direction was the same and the effect size was different,
Figure 1. Functional properties of eSNPs with tissue-dependent effect and concordant effect. The bar plot shows the frequency of
eSNPs per functional property. The eSNPs were annotated using the web-based tool SNP Annotation and Proxy Search (SNAP; http://www.broadinstitute.org/mpg/snap/),
based on the HapMap CEU population panel (release 22) and genome build 36.3. The asterisks indicate the
significance of Fisher’s exact test comparing the eSNPs with concordant effect and those with discordant effect, as given in the legend.
doi:10.1371/journal.pgen.1002431.g001
we concluded the eProbe belonged to the category “different effect
size”. If the allelic direction was instead opposite, the probes had
tissue-dependent regulation with an “opposite allelic direction”
(see Materials and Methods). We discuss each of these four
categories in detail below and in Figure 2 and Figure 3.
Specific regulation. Specific cis-regulation refers to a gene
that is cis-regulated in only one specific tissue. We found that this type of
regulation is a common phenomenon, as it accounted for on average
33.2% of the discordant eProbes (Figure 2). One well-established
example is the SORT1 gene at the 1p13 cholesterol locus, to which
SNPs map that affect low-density lipoprotein cholesterol (LDL-C)
and the risk of myocardial infarction (MI) in humans [21], [22].
Recently, it was shown that the functional variant rs12740374 alters
the binding site for C/EBP transcription factors and consequently
alters the hepatic expression of the SORT1 gene [23]. Our data
replicated this specific cis-regulation in liver (Figure 3A). The
association Z-score for rs12740374 with SORT1 expression
variation in liver was 8.24 (N = 74, P = 1.41 × 10⁻¹⁵) but in blood
we observed no effect (Z-score = 0.07, N = 1,240, P = 0.8), nor did
we observe any associations in SAT, VAT or muscle, and the
association profiles for this gene show no correlation between
different tissues (all Spearman correlation P values > 0.39). Thus, in
our data, rs12740374 only exerts an effect on SORT1 gene
expression in liver, although we did observe that SORT1 was
expressed abundantly in all tissues.
Alternative regulation. Alternative regulation between
tissues refers to a gene that is cis-associated with a SNP in a
particular tissue and associated with a different, independent SNP in
another tissue. Such alternative cis-regulation is also a common
phenomenon, as we found it applied to on average 14.5% of the
Figure 2. cis-regulation of gene expression between tissues. The associated probe-SNP pairs were classified to be concordant or discordant
between tissues. The small pie plot shows the proportion of probes that have only concordant association (red part) or at least one discordant
association (blue part). The probes with discordant association were under tissue-dependent regulation and we characterized four different
mechanisms: specific regulation, alternative regulation, different effect size and opposite effect sizes. Their proportions are shown in the large blue
pie plot. The concordant cis-regulation and the four different mechanisms are illustrated by the correlation between SNP genotypes (AA, AG and GG)
and gene expression levels in two tissues: brown dots represent the expression of a gene in tissue 1 and purple dots the expression of a gene in
tissue 2.
doi:10.1371/journal.pgen.1002431.g002
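The decision rules that the text gives for assigning a discordant eProbe to one of the four mechanisms can be sketched as a small function. The field names and the `ProfileComparison` container are hypothetical, and how "correlated" and "significant" are determined is left to the caller, since those criteria are defined in Materials and Methods rather than here.

```python
from dataclasses import dataclass


@dataclass
class ProfileComparison:
    """Summary of one discordant eProbe compared between blood and another tissue."""
    profiles_correlated: bool   # do the two association profiles correlate?
    correlation_sign: int       # +1 or -1; only used when profiles_correlated
    significant_in_blood: bool  # any significant cis-eQTL for the probe in blood
    significant_in_other: bool  # any significant cis-eQTL in the other tissue


def classify(comparison):
    """Map a comparison onto the four categories described in the text."""
    if not comparison.profiles_correlated:
        if comparison.significant_in_blood and comparison.significant_in_other:
            # Significant in both tissues, but through different, unlinked eSNPs.
            return "alternative regulation"
        if comparison.significant_in_blood != comparison.significant_in_other:
            # Significant in exactly one of the two tissues.
            return "specific regulation"
        return "unclassified"
    if comparison.correlation_sign > 0:
        # Same SNPs, same allelic direction; discordance implies a different size.
        return "different effect size"
    return "opposite allelic direction"


examples = {
    "significant in one tissue only": ProfileComparison(False, 0, False, True),
    "different SNPs in each tissue": ProfileComparison(False, 0, True, True),
    "same SNP, weaker in blood": ProfileComparison(True, +1, True, True),
    "same SNP, flipped direction": ProfileComparison(True, -1, True, True),
}
for label, comparison in examples.items():
    print(f"{label}: {classify(comparison)}")
```

The sketch only covers probes already flagged as discordant; concordant probes never reach this classification step.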
probes with tissue-dependent regulation (Figure 2). One particular
example is the trans-membrane gene TMEM176A, also known as
hepatocellular carcinoma-associated antigen 112. The expression of
TMEM176A was associated with the intronic SNP rs714885 in liver
(N = 74, P = 5.7 × 10⁻⁶) but with the 19.5 kb upstream SNP
rs6464104 in blood (N = 1,240, P = 5.07 × 10⁻¹³²) (Figure 3B).
These two SNPs are unlinked variants (r² = 0.002 and D′ = 0.054
based on the HapMap phase II CEU panel). We observed the same
alternative association for different probes of TMEM176A in an
independent liver eQTL dataset (profiled using custom ink-jet
microarrays [7]) and in the aforementioned independent blood
eQTL dataset (profiled using Illumina HumanRef-8 v2
BeadChips) (Table S4) [18], [19]. This clearly shows that 1)
multiple, unrelated variants can sometimes affect exactly the same
gene, and 2) these independent variants sometimes only exert an
effect on the gene expression in a particular tissue.
Different effect size. This category refers to the
common phenomenon in which a gene is associated with the same SNP,
with alleles that have the same direction of effect but a
different magnitude in different tissues (Figure 2). For eProbes that
showed this, we observed a significantly positive correlation
between the association profiles of the tissues. This
applies to on average 47.9% of the probes that show tissue-dependent regulation (Figure S2), in line with a previous report
[13]. One example is the O-6-methylguanine-DNA methyltransferase (MGMT) gene, which plays an important role in
DNA repair and suppresses tumor development [24]. We
observed a cis-eQTL for MGMT across each of the five tissues.
However, the effect size in blood was substantially smaller than
that in SAT (Figure 3C).
Opposite allelic direction. Surprisingly, we observed that
some genes were associated with the same SNPs in different tissues
PLoS Genetics | www.plosgenetics.org
5
January 2012 | Volume 8 | Issue 1 | e1002431
Mechanisms Underlying Tissue-Dependent cis-eQTL
Figure 3. Case examples for tissue-dependent cis-regulation. (A) The liver-specific regulation of the SORT1 gene. (B) The alternative
regulation of the TMEM176A gene in blood and liver. (C) The cis-regulation for the MGMT gene had different effect sizes in blood and SAT. (D) The cisregulation for the DDT gene show opposite allelic direction between blood and liver. For each gene, the left panel shows the cis-eQTL association
profile in the corresponding tissue (liver or SAT, in blue) vs the association profile in blood (red). The x-axis is the genome position based on genome
build 36.3 (in Mb). The y-axis at the left is the association strength in terms of Z-score. The Z-score in blood has been weighted by the square root of
the sample size, corresponding to the compared tissue. The dashed green line indicates the significance level of association at FDR 0.05. We use the
absolute Z-scores to show the association in (A–C), but use the Z-scores in (D) for a better illustration of allelic direction. We assigned the association
Z-scores in blood a negative value. If the allelic direction in SAT is the same as that in blood, the Z-score in SAT is negative too; otherwise, the Z-score
in SAT is positive. The black line shows the recombination rate at this locus based on the HapMap II CEU panel and the scale is indicated on the righthand y-axis. The green line with arrow at the bottom shows the genome position of the gene and the arrow indicates the transcription direction. The
right panel shows the correlation of the Z-scores between two tissues. The r-value indicates the correlation coefficient of the Pearson correlation.
doi:10.1371/journal.pgen.1002431.g003
PLoS Genetics | www.plosgenetics.org
6
January 2012 | Volume 8 | Issue 1 | e1002431
Mechanisms Underlying Tissue-Dependent cis-eQTL
effect of the regulatory factors (e.g., stimulating or suppressing the
expression) and the size of their effects could lead to the
observations of different categories (Figure 4).
but with alleles having an opposite effect on the gene expression
between tissues. For a probe under this regulation, we then also
observed a strong negative correlation between its association
profiles across different tissues. This ‘‘opposite allelic direction’’
mechanism accounted for on average 4.4% of the probes under
tissue-dependent regulation (Figure 2), which is much less common
than the three previous mechanisms. However, this is still much
more often than would be expected by chance, as determined by a
comparison between two blood datasets in which we found the
allelic directions were nearly always identical (Figure S7). One
striking opposite allelic direction was observed to D-dopachrome
tautomerase (DDT), which showed completely opposite effects
between blood and liver (Figure 3D). Consistently, we found this
opposite effect in the independent liver [7] and blood
dataset,(H8v2), even when different probes were assessed. The
minor allele rs5751777-C was associated with higher expression in
liver (P = 9.95610222 in the discovery set and P = 2.866102211 in
the validation set), but with lower expression in blood
(P = 3.986102119 in the discovery set and P = 4.37610224 in the
validation set) (Table S5). Strikingly, this opposite allelic direction
was also observed when comparing liver with SAT, VAT and
muscle, tissues that were all obtained from exactly the same set of
individuals (Figure S11).
Another notable gene with an opposite allelic direction is
ORMDL3. Although its function remains unclear, genetic variants
near ORMDL3 are associated with various immune-related
diseases, including asthma, type 1 diabetes, Crohn’s diseases,
ulcerative colitis and primary biliary cirrhosis [25]–[29]. ORMDL3
had a genome-wide significant cis-eQTL in blood and its
association in SAT was showing near-genome-wide significance
(Figure S12). All disease-associated SNPs in this locus showed
association in cis with the expression level of ORMDL3 (Table S6),
including the functional variant rs12936231 that has been
implicated to play a causal role in chromatin remodeling [30].
The risk alleles for asthma and preventive alleles for other
autoimmune diseases showed consistent up-regulation in blood
(and were also reported in LCLs) [25], [30]. However, to our
surprise, the effect in SAT was completely reversed, leading to
down-regulation.
Although we have only provided a few examples here, these
observations indicate that conclusions drawn about mechanistic
up- or down-regulation from a single tissue cannot necessarily be
translated to other tissues, as they may sometimes lead to
completely different conclusions depending on the tissues studied.
In the supplementary material (Tables S7, S8, S9, S10 and Figures
S13, S14, S15, S16), we have summarized the observed tissuedependent regulation for 156 genes that have been reported to be
associated with complex traits at P = 561028 (based on the genes,
mentioned in the Catalog of Published Genome-wide Association
Studies, as of 16/09/2011). Some of these plots also show that the
genetic regulation of gene expression is sometimes even more
complicated than what we have described here: some genes can
have multiple cis-eQTL that were either shared or specific to the
tissues, e.g, the association of MTMR3 gene that was associated
with lung cancer [31], Nephrophaty [32], and inflammatory bowel
disease [33], [34] (Figure S17).
The four categories of tissue-dependent cis-regulation we have
observed can be explained by two molecular models: 1) the tissuedependent use of the same causal variant, i.e., the same eSNPs tag
the same causal variant that is activated differentially by tissuedependent factors; 2) the tissue-dependent causal variants, i.e., the
same or different eSNPs tag different causal variants upon the
tissues under study. The extent of the linkage disequilibrium (LD)
between the causal variants and tag eSNPs, and the direction of
PLoS Genetics | www.plosgenetics.org
Discussion
Gene expression levels are partly determined by genetic
variation, and eQTL mapping in different cell types and tissues
has identified many cis-eQTL. However, the effect of cis-eQTL is
strongly dependent upon the studied tissue. In this study, we
compared the genetic architecture of gene expression regulation in
blood and four non-blood primary tissues. We found that the
majority (71.3%) of the probes under genetic control
(eProbes) show a concordant association across tissues. However, the
remaining 28.7% of the eProbes show discordant, tissue-dependent
regulation. Strikingly, many of those discordantly associated eProbes
are affected by multiple, independent eSNPs. We followed up the
genes under tissue-dependent regulation and identified four
different mechanisms: specific regulation, alternative regulation,
different effect size, and opposite allelic direction. We are the first
to provide a comprehensive landscape of the different mechanisms
of tissue-dependent cis-regulation. Of the four mechanisms
identified, the opposite allelic direction mechanism, where alleles
can have opposing effects on gene expression between tissues, is of
particular interest. Although this mechanism is less common than
the other three, it has important implications for inferring the
transcriptional effects of alleles from other tissue data, especially
for the risk alleles of complex diseases. The use of
different tissues could lead to the completely opposite conclusion!
This finding highlights the great importance of investigating
disease-relevant tissues in order to correctly characterize the
functional effects of disease-associated variants.
We observed that SNPs at various transcriptional regulatory
regions exert tissue-dependent regulation more often than expected,
although most of the eSNPs were located in intergenic and
intronic regions where functions remain undefined. However, we
must emphasize that the causal variants remained
undefined. Furthermore, because of the LD structure, although
the same eSNPs can be associated with the expression of the same
gene in different tissues, this does not necessarily mean that the
same regulatory variants act in the different tissues. We have
proposed two molecular models and suggested that tissue-dependent
cis-regulation can be explained by the tissue-dependent use of the
same causal variants or by the use of different
tissue-dependent causal variants. Because of the limited resolution
of cis-eQTL mapping, further fine-mapping and
functional analyses are needed to identify the causal variants and
to understand how they are used in different tissues:
regulatory cis-elements generally span only a few base pairs (e.g., the
binding sites of transcription factors or microRNAs), whereas
linkage disequilibrium blocks generally span 10–
100 kb [35]. Furthermore, as the molecular models that we have
proposed are quite simple, we cannot exclude other molecular
mechanisms acting in these processes, e.g., the competition of
different regulatory factors and binding sites in different tissues, or
the role of tissue-specific methylation [36], [37] and chromatin
remodeling [38], etc.
It is well known that trait-associated SNPs are more likely to
have effects on gene expression but, to our surprise, we found that
they are also more likely to exert tissue-dependent effects. This
observation adds an extra layer of complexity to complex traits.
We acknowledge that our study has some limitations. First, we
compared cis-regulation between peripheral blood and four non-blood
tissues with rather small sample sizes. We lacked statistical power to compare
Figure 4. Molecular models of tissue-dependent cis-regulation. The observed tissue-dependent cis-regulations can be explained by two
molecular models: (A) the tissue-dependent use of the same causal variants, or (B) the use of tissue-dependent causal variants. The ovals indicate the
two regulatory factors (e.g., transcription factors) that play regulatory roles in different tissues (brown in tissue 1 and purple in tissue 2). These factors
can recognize the same or different cis-elements (the yellow region). The genetic variants are shown as SNPs with A/G alleles. The SNPs in red are
causal variants and the SNPs in blue are tag SNPs. The red line between them indicates the linkage disequilibrium. The arrows indicate the effect of
regulatory factors, here the up arrows represent expression stimulators and the down arrows expression suppressors. The size of the arrows indicates
the size of the differences between the expression of A and G alleles, i.e., the cis-eQTL effect size.
doi:10.1371/journal.pgen.1002431.g004
the cis-regulation between two non-blood tissues well. Secondly,
the identified discordant eQTLs are determined by the limited set of
tissues that we studied. Thirdly, although we corrected for
substantial expression differences across samples by employing
principal component analysis, it is still possible that some of the
observed tissue-dependent cis-regulation is due to tissue
heterogeneity (i.e., different proportions of cell types per tissue).
Likewise, it is also possible that some of the identified discordant cis-eQTL could be due to differences in the base-line expression
between tissues. However, we observed this to be the case for both
concordant and discordant cis-eQTL when investigating the
original (non-PCA corrected) expression data (see Table S11).
Nevertheless, our results indicate that natural genetic variation
can affect gene expression levels in complex ways. Further analyses
using different tissues and specific cell types and using larger
sample sizes are required to gain a deeper understanding of the
genetic variation of gene expression and to gain better insight into
the full complexity of disease.
Materials and Methods
Genotyping and Expression Profiling on Blood
Subjects. The genetical genomics samples for blood were
collected from unrelated Dutch individuals in four studies: 324
healthy individuals were collected at the University Medical
Centre Utrecht, 414 amyotrophic lateral sclerosis (ALS) patients
were collected at the University Medical Centre Utrecht, 49
ulcerative colitis (UC) patients came from a part of the inflammatory
bowel disease (IBD) cohort of the University Medical Centre
Groningen, and 453 patients with chronic obstructive pulmonary
disease (COPD) were collected with the NELSON study. All
samples were collected after informed consent, and the studies were
approved by local ethical review boards. Individual sample information is
provided in Table S1.
Genotyping and imputation. DNA from all samples was
hybridized to oligonucleotide arrays from Illumina. The 324 healthy
individuals and 414 ALS patients were genotyped using the
Hap370 platform. The 453 COPD patients and 49 UC patients
were genotyped on the 610-Quad platform. Because the subjects
with liver, muscle, and adipose tissues were genotyped using the more
intensive genotyping platform Illumina HumanOmni1-Quad
BeadChips, we further used the program IMPUTE v2 to impute the
genotypes of SNPs that were present on the Omni1-Quad chips but not
directly genotyped on the Hap370 and 610-Quad platforms [39]. The
reference panel for imputation was the CEU population from
HapMap release 22. The directly genotyped SNPs were coded as
0, 1 or 2, while the imputed SNP dosage values were called at a
0.95 confidence level, ranging between 0 and 2. In this way, we
obtained the genotype of the same set of 1,140,419 SNPs for all
five tissues under study.
RNA profiling. Anti-sense RNA was synthesized, amplified
and purified using the Ambion Illumina TotalPrep Amplification
Kit (Ambion, USA) following the manufacturer’s protocol.
Complementary RNA was hybridized to Illumina HumanHT-12
arrays and scanned on the Illumina BeadArray Reader. Raw
probe intensity data for these samples was extracted using
Illumina’s BeadStudio Gene expression module v3.2 (no
background correction was applied, nor did we remove probes
with low expression).
Genotyping and Expression Profiling on Liver, Muscle,
and Adipose Fat Tissues from the Same Population
Subjects. From April 2006 to January 2009, 85 morbidly
obese Dutch subjects (23 male and 62 female subjects) with a body
mass index (BMI) between 35 and 70 were included in the study.
They all underwent elective bariatric surgery at the Department of
General Surgery, Maastricht University Medical Centre. Patients
with acute or chronic inflammatory diseases (e.g., autoimmune
diseases), degenerative diseases, reported alcohol consumption
(>10 g/day), and/or using anti-inflammatory drugs were
excluded. The average age of the subjects was 43.9 years, with a range
of 17 to 67 years. This study was approved by the Medical
Ethical Board of Maastricht University Medical Centre, in line
with the guidelines of the 1975 Declaration of Helsinki. Informed
consent in writing was obtained from each subject personally. The
subject information is provided in Table S1.
Genotyping. Venous blood samples were obtained after
8 hours fasting on the morning of surgery. DNA was extracted
from this blood using the Chemagic Magnetic Separation Module
1 (Chemagen) integrated with a Multiprobe II Pipetting robot
(PerkinElmer). All samples were genotyped using Illumina
HumanOmni1-Quad BeadChips that contain 1,140,419 SNPs.
Genotyping was performed according to standard protocols from
Illumina.
RNA profiling in four tissues. Wedge biopsies of liver,
visceral adipose tissue (VAT, omentum majus), subcutaneous adipose
tissue (SAT, abdominal), and muscle (musculus rectus abdominis) were
taken during surgery. RNA was isolated using the Qiagen Lipid
Tissue Mini Kit (Qiagen, Crawley, West Sussex, UK, 74804).
Assessment of RNA quality and concentration was done with an
Agilent Bioanalyzer (Agilent Technologies, Santa Clara, USA).
Starting with 200 ng of RNA, the Ambion Illumina TotalPrep
Amplification Kit was used for anti-sense RNA synthesis,
amplification, and purification according to the protocol
provided by the manufacturer (Ambion, Austin, USA). 750 ng of
complementary RNA was hybridized to Illumina HumanHT-12
BeadChips and scanned on the Illumina BeadArray Reader. Raw
probe intensity data for these samples was extracted using
Illumina’s BeadStudio Gene expression module v3.2 (no
background correction was applied, nor did we remove probes
with low expression).
Genotyping and Expression Profiling in an Independent
Blood Dataset of 229 Samples
Subjects. To ascertain whether our method for identifying
tissue-dependent cis-eQTL was robust, we compared the large
peripheral blood dataset with an independent blood eQTL dataset that
comprised 229 samples. We have described this cohort in previous
studies [9], [18]. In brief, this study comprised 111 English celiac
disease patients, 59 Dutch amyotrophic lateral sclerosis patients
and 59 Dutch healthy controls. The peripheral blood (2.5 ml) was
collected with the PAXgene system (PreAnalytix GmbH, UK).
Genotyping and imputation. The samples were genotyped
using the Illumina (Illumina, San Diego, USA) HumanHap300
platform. We further used IMPUTE v2 to impute the genotypes of
all HapMap II SNPs. The reference panel for imputation was the
CEU population from HapMap release 22. The directly
genotyped SNPs were coded as 0, 1 or 2, while the imputed
SNP dosage values were called at a 0.95 confidence level, ranging
between 0 and 2.
RNA profiling. Anti-sense RNA was synthesized, amplified
and purified using the Ambion Illumina TotalPrep Amplification
Kit (Ambion, USA) following the manufacturer’s protocol.
Complementary RNA was hybridized to Illumina HumanRef-8
v2 arrays (further referred to as H8v2) and scanned on the
Illumina BeadArray Reader.
Normalization and PCA Correction
The raw expression intensities from five tissues were jointly
quantile normalized and log2 transformed. We further applied a
principal component analysis (PCA) to the expression correlation
matrix and observed that genes are differentially expressed among
different tissue types (Figure S1). We argue that the dominant
principal components (PCs) will primarily capture sample
differences in expression that reflect physiological or environmental variation (e.g., tissue type and phenotype difference) as well as
systematic experimental variation (e.g. batch and technical effect).
In order to target the difference in the genetic variation of
expression among tissues, we removed the global variation in
expression among tissues by using the residual expression for each
probe in each tissue after removing 50 PCs (identical to what we
have described before [18]). Our previous analysis on the same
dataset showed that the number of significantly detected cis-eQTL
probes increased two-fold when 50 PCs were removed from the
expression data (see Figure S7 in ref [18]). For the independent
blood dataset with 229 subjects, we applied the same quantile
normalization and PCA correction.
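The first two steps of this normalization can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline used in the study: the function names are hypothetical, ties are broken by sample order (an assumption), and the removal of the 50 principal components is not shown.

```python
import math


def quantile_normalize(columns):
    """Quantile-normalize equally long sample columns (one list per sample).

    Every sample ends up with the same empirical distribution: the mean of
    the k-th smallest values across samples. Ties are broken by position,
    which is a simplifying assumption of this sketch.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: average of the k-th smallest value per sample.
    reference = [sum(col[k] for col in sorted_cols) / len(columns)
                 for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices, ascending
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


def log2_transform(columns):
    """Apply a log2 transform to strictly positive intensities."""
    return [[math.log2(v) for v in col] for col in columns]
```

In this sketch each inner list is one sample (array) and each position is one probe, so normalizing jointly across tissues simply means passing the samples of all five tissues in a single call.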
secondary eSNPs were present, we repeated the entire procedure
to detect tertiary eSNPs by regressing out both the primary and
secondary effects (using appropriate multivariate regression analysis). This procedure was repeated until no significant associations
were detected any more.
Population Stratification and SNP Quality Control
We tested population stratification between the two cohorts
using the program PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml).
This program uses complete linkage
agglomerative clustering, based on pair-wise identity-by-state (IBS)
distances. The fact that all the individuals from both cohorts were
clustered together indicates there was no population stratification.
We also checked the allelic frequencies between the two cohorts by
treating the 85 individuals with four tissue samples as cases and the
1,240 individuals for blood samples as controls. For the imputed
SNPs, we used the genotype with highest probability as the
discrete genotype for QC purposes. We removed SNPs that
showed significant differences in allele frequency at P<0.01. Then
the SNPs were quality controlled for minor allele frequency >5%,
a call rate >95% and an exact Hardy-Weinberg equilibrium (HWE) P
value >0.001. To be certain of the direction of the allelic effect
on gene expression (up-regulating or down-regulating), we further
removed SNPs with two types of transversion alleles (A/T and G/
C) and confined our analysis to SNPs with transition alleles (A/G
or C/T) and other types of transversion alleles (A/C or G/T). This
quality control resulted in 710,035 SNPs for further analysis.
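The per-SNP filters described above can be expressed as a single predicate. The sketch below is illustrative only: the function name and argument layout are hypothetical, and the summary statistics (MAF, call rate, HWE P value, allele-frequency difference P value) are assumed to have been computed beforehand.

```python
# Allele pairs whose strand cannot be resolved from the alleles alone.
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def passes_qc(alleles, maf, call_rate, hwe_p, freq_diff_p):
    """Return True if a SNP passes the quality-control filters.

    alleles      -- pair of allele letters, e.g. ("A", "G")
    maf          -- minor allele frequency (must be > 5%)
    call_rate    -- genotype call rate (must be > 95%)
    hwe_p        -- exact Hardy-Weinberg P value (must be > 0.001)
    freq_diff_p  -- P value for the allele-frequency difference between
                    the two cohorts (SNPs with P < 0.01 are removed)
    """
    if frozenset(alleles) in STRAND_AMBIGUOUS:
        # A/T and G/C SNPs read the same on both strands, so the direction
        # of the allelic effect cannot be verified.
        return False
    return (maf > 0.05 and call_rate > 0.95
            and hwe_p > 0.001 and freq_diff_p >= 0.01)
```

Removing A/T and G/C SNPs trades some coverage for certainty about allelic direction, which matters here because one of the reported mechanisms depends entirely on the sign of the effect.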
Sampling Approach to Identify Tissue-Dependent eQTL
Comparing blood and non-blood tissues. For each of the
200,629 probe-SNP pairs that was significantly associated at FDR
0.05 level, we further assessed whether the detected Z-scores
differed per tissue. We used the Z-scores in blood as a reference
because the blood samples were independent from other tissue
samples and the sample size was much larger. To correct for the
sample size difference, we, out of the 1,240 blood samples,
randomly selected without replacement the same number of
samples for the comparison with liver (N = 74), SAT (N = 83),
VAT (N = 77) and muscle (N = 62). For a certain probe-SNP pair,
we re-calculated the association Z-score in blood for the selected
sample size. The sampling procedure was repeated 100 times. We
subsequently fitted a generalized extreme value distribution
(GEVD) to the Z-scores of the 100 sampling procedures in blood.
GEVD is a flexible model with three parameters: location (c), scale
(b) and shape (a). GEVD can resemble different distributions with
different parameter settings. For example, when a = 0, it
resembles the Gumbel type of distribution (Type I); when a > 0, it
resembles the Fréchet type of distribution (Type II); when a < 0, it
resembles the Weibull type of distribution (Type III). Therefore,
fitting the GEVD allows us to estimate a realistic distribution of
the Z-scores of a given probe-SNP pair in blood (Figure S3). We
then assessed the deviation of the Z-score of the same probe-SNP
pair in the other four tissues from the estimated GEVD in blood and
computed a P value for the difference of Z-scores between tissues. We
did this analysis in R (version 2.10.1) using the package evd:
Functions for extreme value distributions (version 2.2–4). This
analysis was done for each of the 200,629 probe-SNP pairs and
between blood and each of the four non-blood tissues. Considering the
possible dependence of the eQTL effect among tissues, the
significance was controlled at the conservative Bonferroni-corrected
0.05 level, corresponding to a P value of 6.23×10⁻⁸ (0.05/200,629
probe-SNP pairs/4 tissue comparisons). The probe-SNP pairs with
P ≤ 6.23×10⁻⁸ were called ‘‘discordant associations’’, while probe-SNP pairs with P > 6.23×10⁻⁸ were called ‘‘concordant
associations’’. The expression profiling in all five tissues used the
same platform. Therefore, the discordant associations cannot be
explained by differences in hybridization efficiency. Because all of the tested
SNPs were directly genotyped in non-blood tissues but most of them
were imputed in blood, we further checked whether the discordance
was caused by the imputation. We found no indication that imputation
accuracy confounded our results: 69.3% of the discordant
eSNPs were imputed in blood whereas 68.0% of the concordant
eSNPs were imputed in blood (Fisher’s exact test P value = 0.60).
We also assessed whether there was heterogeneity in effect
when comparing the different subgroups of phenotypes. We found
no evidence that this was the case (see Table S6 in ref [18]).
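The pieces of this sampling approach can be sketched in a few lines. The example below is a hedged illustration, not the authors' R code: it assumes the GEVD parameters have already been estimated (for example by a maximum-likelihood fit in the evd package), and the two-sided tail probability in `two_sided_p` is an assumption, since the text does not spell out how the P value was derived. All function names are hypothetical.

```python
import math
import random


def subsample(sample_ids, n, rng):
    """Draw n blood samples without replacement (e.g. n = 74 for liver)."""
    return rng.sample(sample_ids, n)


def gev_cdf(x, loc, scale, shape):
    """CDF of the generalized extreme value distribution.

    shape == 0 gives the Gumbel type, shape > 0 the Frechet type and
    shape < 0 the Weibull type, matching the convention in the text.
    """
    s = (x - loc) / scale
    if shape == 0:
        return math.exp(-math.exp(-s))
    t = 1.0 + shape * s
    if t <= 0:
        # x lies outside the support: below the lower bound (shape > 0)
        # or above the upper bound (shape < 0).
        return 0.0 if shape > 0 else 1.0
    return math.exp(-t ** (-1.0 / shape))


def two_sided_p(z_other, loc, scale, shape):
    """ASSUMED two-sided tail probability of a non-blood Z-score under
    the GEVD fitted to the subsampled blood Z-scores."""
    c = gev_cdf(z_other, loc, scale, shape)
    return 2.0 * min(c, 1.0 - c)


# Bonferroni threshold quoted in the text: 200,629 pairs x 4 comparisons.
BONFERRONI_THRESHOLD = 0.05 / (200_629 * 4)
```

Evaluating `BONFERRONI_THRESHOLD` reproduces the 6.23×10⁻⁸ cutoff used to separate discordant from concordant associations.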
Comparing two independent blood datasets. To further
validate the tissue-dependent effect we had detected, we compared
the cis-eQTL effects between the blood datasets HT12 and H8v2,
using the same sampling procedure as described above. Because of
the difference in expression platform, we could only make
comparisons for those probes that were present in both datasets.
We only investigated SNPs that showed similar allele frequencies
between the two blood datasets (SNPs with an allele frequency difference at
P<0.01 were excluded from analysis, and as the H8v2 dataset
contained 111 celiac disease patients that were nearly all HLA-
eQTL Discovery
In order to detect cis-eQTLs, analysis was confined to those
probe-SNP combinations for which the distance from the probe
transcript midpoint to SNP genomic location was ≤1 Mb. For
each probe-SNP pair, we used Spearman correlation to detect
association between SNPs and the variations of the gene
expression in liver, SAT, VAT, muscle and blood, respectively.
We calculated the Spearman correlation coefficient and the
corresponding P value and subsequently transformed these into a
Z-score. To maximize the power of eQTL discovery in non-blood
tissues, we further performed a meta-analysis that combines the
association signals across the four non-blood tissues under study.
An overall, joint P value was calculated
using a weighted Z-method (weights equal to the square root of the
dataset sample number). Please see ref [40] for a comprehensive overview of
this method.
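The weighted Z-method can be sketched as below, using only the standard library. The combination formula is the usual weighted Stouffer form with weights equal to the square root of each sample size; the conversion from a two-sided P value to a signed Z-score via the inverse normal distribution is an assumption about how the transformation was done, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def p_to_z(p_two_sided, sign):
    """Convert a two-sided P value and an effect direction (+1 or -1)
    into a signed Z-score (assumed transformation)."""
    return sign * _STD_NORMAL.inv_cdf(1.0 - p_two_sided / 2.0)


def weighted_z(z_scores, sample_sizes):
    """Combine per-tissue Z-scores, weighting by sqrt(sample size)."""
    weights = [sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    return numerator / denominator


def z_to_p(z):
    """Two-sided P value for a Z-score. For very large |Z| this loses
    precision; a dedicated survival function would be needed there."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))
```

For the four non-blood tissues one would call `weighted_z` with the liver, SAT, VAT and muscle Z-scores and the sample sizes 74, 83, 77 and 62.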
To correct for multiple testing, we controlled the false-discovery
rate (FDR) at 0.05: the distribution of observed P values was used
to calculate the FDR, by comparison with the distribution
obtained from permuting expression phenotypes relative to
genotypes 100 times. At the FDR = 0.05 level, the significance P
value threshold was 1.37×10⁻⁵ for significantly associated probe-SNP
pairs in liver, 2.07×10⁻⁵ for significant association in SAT,
1.54×10⁻⁵ for significant association in VAT, 5.64×10⁻⁶ for
significant association in muscle, 4.8×10⁻⁴ for significant
association in blood and 1.10×10⁻⁴ for significant association in
the meta-analysis of four non-blood tissues. For these significant
probe-SNP pairs, we termed the corresponding SNP, probe and
gene the expression SNP (eSNP), regulated probe (eProbe) and
regulated gene (eGene), respectively.
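One common way to turn permuted P values into a significance threshold is sketched below. It is an assumption about the exact estimator, not a description of the authors' software: the FDR at a cutoff is taken as the average number of permuted P values at or below the cutoff per permutation, divided by the number of observed P values at or below it. The function name is hypothetical.

```python
from bisect import bisect_right


def fdr_threshold(observed_p, permuted_p, n_perm, fdr=0.05):
    """Largest observed P-value cutoff whose estimated FDR is <= fdr.

    permuted_p pools the P values from all n_perm permutations.
    Returns None if no cutoff satisfies the bound.
    """
    obs = sorted(observed_p)
    perm = sorted(permuted_p)
    best = None
    for cutoff in obs:
        n_observed = bisect_right(obs, cutoff)       # real hits <= cutoff
        n_null = bisect_right(perm, cutoff) / n_perm  # expected false hits
        if n_null / n_observed <= fdr:
            best = cutoff
    return best
```

Because the estimate is computed separately per dataset, the resulting P value thresholds differ between tissues, as in the values reported above.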
Conditional Regression Analysis to Detect Independent
eSNPs
Due to the linkage disequilibrium among the tested SNPs, we
usually found numerous eSNPs for each eProbe. In order to detect
independent eSNPs, we performed conditional regression analysis
for the eProbes per tissue type. For each eProbe, we first regressed out
the main effect of the top eSNP. We then subjected the residuals to
eQTL mapping to detect potential secondary, independent eSNPs.
We again controlled the false discovery rate at 0.05 by running 100 permutations, as
described before in the Methods section ‘‘eQTL Discovery’’. If
repeated this permutation 100 times and determined the empirical
threshold $r_{\mathrm{thres}} = 0.21$ at the FDR 0.05 level using the model
$\mathrm{FDR} = n_0\{r_0 \ge r_{\mathrm{thres}}\} / n\{r \ge r_{\mathrm{thres}}\}$, where $r$ and $r_0$ refer to the
Pearson correlation coefficient of the real data and the permuted data,
respectively; $n$ refers to the number of probes where $r \ge r_{\mathrm{thres}}$ and $n_0$
refers to the average number of probes where $r_0 \ge r_{\mathrm{thres}}$ from the 100
permutations.
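The threshold search implied by this FDR model can be sketched as a scan over candidate values of the correlation cutoff. The grid step and the function name are assumptions made for illustration; the counts n and n0 follow the definitions given in the text.

```python
def correlation_threshold(real_r, permuted_r, n_perm, fdr=0.05, step=0.01):
    """Smallest correlation cutoff with FDR = n0 / n <= fdr.

    real_r      -- one Pearson r per probe from the real data
    permuted_r  -- r0 values pooled over all n_perm permutations
    Returns None if no cutoff on the grid satisfies the bound.
    """
    steps = int(round(1.0 / step))
    for k in range(steps + 1):
        thres = round(k * step, 10)
        n = sum(1 for r in real_r if r >= thres)
        if n == 0:
            return None  # nothing left to call significant
        # Average number of permuted probes passing, per permutation.
        n0 = sum(1 for r0 in permuted_r if r0 >= thres) / n_perm
        if n0 / n <= fdr:
            return thres
    return None
```

Scanning upward from zero returns the most permissive cutoff that still meets the FDR bound, which is the role the value 0.21 plays in the classification that follows.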
Based on the correlation of association profiles between tissues,
we identified four different categories of tissue-dependent genetic
regulation of gene expression. If the association profiles for one
single probe did not correlate between two tissues (r < 0.21),
we further checked whether the eProbe was significant in both
tissues: if the probe had a significant association in one tissue but
not in the other, we deemed this ‘‘specific cis-regulation’’; if instead
the eProbe was significant in both tissues, but was associated with
different (unlinked) eSNPs in the different tissues, we deemed it
‘‘alternative cis-regulation’’. For those association profiles where
two tissues showed a correlation (r ≥ 0.21), we checked the
direction and the effect size of allelic effect on gene expression:
if the allelic direction was the same and the effect size was
different, we concluded the eProbe belonged to the category
‘‘different effect size’’; if the allelic direction was instead opposite,
the probes had tissue-dependent regulation with an ‘‘opposite
allelic direction’’.
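The decision rules above can be written as a small classifier. This is a sketch under stated assumptions: the inputs (per-tissue significance flags and a flag for whether the allelic direction agrees) are taken as given, the function names are hypothetical, and constant profiles (zero variance) are not handled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def classify(profile_1, profile_2, sig_1, sig_2, same_direction,
             r_thres=0.21):
    """Assign a discordant eProbe to one of the four categories.

    profile_1, profile_2 -- Z-scores of the same SNPs in two tissues
    sig_1, sig_2         -- whether the eProbe is significant per tissue
    same_direction       -- whether the allelic direction agrees
    """
    # Association profiles are defined on absolute Z-scores.
    r = pearson([abs(z) for z in profile_1], [abs(z) for z in profile_2])
    if r < r_thres:
        if sig_1 and sig_2:
            return "alternative cis-regulation"
        if sig_1 != sig_2:
            return "specific cis-regulation"
        return "unclassified"  # not significant in either tissue
    # The eProbe is already known to be discordant, so a shared direction
    # implies that the effect sizes differ.
    if same_direction:
        return "different effect size"
    return "opposite allelic direction"
```

Note that the sketch does not check that the eSNPs in the two tissues are unlinked, which the text requires for the alternative category.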
DQ2.2 or HLA-DQ2.5 positive we also excluded the HLA from
this analysis). After filtering we could compare 93,656 probe-SNP
pairs.
Enrichment for SNP Properties
The minor allele frequency (MAF) and functional properties of
eSNPs were annotated by the web-based tool SNP Annotation and
Proxy Search (SNAP) (www.broadinstitute.org/mpg/snap) [41],
using the CEU population panel from HapMap release 22. We
performed Fisher’s exact test to compare the enrichment between
eSNPs with a tissue-dependent effect on expression across tissues
and eSNPs with a static effect.
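For reference, a two-sided Fisher's exact test on a 2×2 table can be computed directly from the hypergeometric distribution. The sketch below uses only the standard library; the function name is hypothetical, and the two-sided definition (summing all tables at most as probable as the observed one) is the common convention, assumed here to match the test used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]].

    For enrichment, rows could be tissue-dependent vs static eSNPs and
    columns the presence or absence of a given SNP property.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-7))
```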
Cis-eQTL Analysis of Trait-Associated SNPs
To directly assess the effect of trait-associated SNPs on gene
expression, we confined our cis-eQTL analysis to 1,954 SNPs (with
alleles A/G) that were associated with complex traits at
P < 5.0×10⁻⁸ in the ‘Catalog of Published Genome-wide Association
Studies’ (per 16 September 2011) [20] and assessed the
tissue-dependency of the eQTL effect across the tissues, following the same
analysis and permutation procedures. The cis-eQTL significance
threshold P values were set at P = 4.6×10⁻³ in blood, 2.6×10⁻⁴ in
liver, 2.5×10⁻⁴ in muscle, 1.8×10⁻⁴ in VAT, 3.2×10⁻⁵ in
SAT, and 1.1×10⁻³ for the meta-analysis of the four non-blood tissues.
At these levels, a total of 2,990 probe-SNP pairs were significant in
at least one eQTL analysis.
Differential Expression
For the probes with tissue-dependent cis-regulation, we assessed
whether they were also differentially expressed between the tissues
where they showed different cis-regulation. To do so, we relied
upon the quantile-normalized expression intensity before any
removal of the first 50 principal components. For each discordant
eProbe, we used a Wilcoxon Mann-Whitney U test to assess the
differential expression between the tissues. We performed the
same analysis for a random set of concordant eProbes, equal in size
to the set of discordant eProbes. The significance of differential
expression was controlled at a Bonferroni-corrected P value 0.05
level.
Characterizing the Tissue-Dependent Mechanisms of Cis-Regulation
To characterize the tissue-dependent mechanisms of cis-regulation,
we reasoned that comparing the association at a single
probe-SNP level cannot provide a complete picture of the
tissue-dependent genetic determinants of gene expression. To gain
further insight into the tissue-dependent cis-regulation, we
extended the analysis for the eProbes with discordant cis-eQTL that
were determined by single probe-SNP comparison and compared
their whole association profiles across tissues. The association
profile refers to the set of the absolute Z-scores of all N
tested SNPs within 1 Mb of the midpoint of the
probe under study, i.e., {|Z1|, |Z2|, |Z3|, …, |ZN|}. Such a
profile can represent the combined association signals of the
multiple independent eSNPs and their linkage disequilibrium.
Most of the eProbes only showed significant association in blood
and were not significantly associated in the smaller non-blood
tissues. For those eProbes, we had limited statistical power to
determine whether the association in non-blood tissues is truly
absent or is not detected due to power issues. Therefore, we
confined our comparison of association profiles to the eProbes that
were significantly associated in non-blood tissues and compared
them to those in blood. To assess the similarity of association
profiles across tissues, we computed Pearson correlations coefficient (r) of the association profiles between two tissues. Because the
SNPs were likely in strong linkage equilibrium, there is strong
dependency among the Z-scores within the association profile. To
determine the empirical threshold for the significance of the
correlation between the association profiles and considering the
dependency of the SNPs, we performed permutation analysis by
randomly assigning genomes to the individuals per tissue type. We
thus obtained the association profiles per probe per tissue for the
permuted genotypes. These permuted association profiles retained
the same correlation structure among SNPs and the Pearson
correlation coefficient between the permuted association profiles
(r0) would mainly explain the correlation among SNPs. We
PLoS Genetics | www.plosgenetics.org
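The profile comparison described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the authors' pipeline: the function names (`pearson`, `profile_correlation`, `permute_genomes`, `empirical_threshold`) are hypothetical, the Z-scores are assumed to be computed elsewhere, and taking an upper quantile of the permuted correlations r0 as the threshold is an assumed reading of "empirical threshold".

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def profile_correlation(z_tissue_a, z_tissue_b):
    """Correlate the absolute Z-score profiles {|Z1|, ..., |Zn|} of two tissues."""
    return pearson([abs(z) for z in z_tissue_a], [abs(z) for z in z_tissue_b])


def permute_genomes(genotype_rows, rng):
    """Reassign whole genotype rows (one row per individual) at random.

    Shuffling complete rows rather than single SNPs keeps the LD structure
    among SNPs intact while breaking the genotype-expression link.
    """
    shuffled = list(genotype_rows)
    rng.shuffle(shuffled)
    return shuffled


def empirical_threshold(null_correlations, alpha=0.05):
    """Upper-tail cutoff: the (1 - alpha) quantile of permuted correlations r0."""
    ordered = sorted(null_correlations)
    index = min(len(ordered) - 1, int((1 - alpha) * len(ordered)))
    return ordered[index]
```

An observed correlation r for an eProbe would then be compared against the threshold obtained from the r0 values of the permuted profiles.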
Accession Numbers
Expression data for both the blood dataset and the four non-blood
datasets have been deposited in GEO with accession numbers
GSE20142 (1,240 peripheral blood samples, hybridized to HT12
arrays) and GSE22070 (subcutaneous adipose, visceral adipose,
muscle and liver samples). The expression data of the validation
blood eQTL dataset (229 samples) have been deposited in GEO
with accession number GSE203332.
Supporting Information
Figure S1 The effect of removing principal components from
expression data.
(PDF)
Figure S2 Flowchart for the analysis of the tissue-dependent
cis-eQTL across the five human tissues.
(PDF)
Figure S3 Overlap of the associated probe-SNP pairs across the
tissues.
(PDF)
Figure S4 Overlap of the associated probe-SNP pairs across the
single-tissue analysis and meta-analysis.
(PDF)
Figure S5 Sampling procedure. We assessed the difference in
association strength between blood and four other tissues (liver,
SAT, VAT and muscle). As an example, for liver, we randomly
sampled 74 subjects out of the 1,240 blood subjects (matching the
sample size of the liver tissue dataset) and re-measured
the association strength for each significantly associated probe-SNP
pair, in terms of Z-scores. This sampling procedure was
repeated 100 times. The histogram shows the Z-score distribution
of a certain cis-eQTL in 74 blood subjects. We then assessed the
deviation of the Z-score detected in liver (the red arrow) from the
distribution of Z-scores in blood, by fitting the extreme value
distribution (EVD) (the red line). The same analysis was performed
for comparing blood with SAT, VAT and muscle, by randomly
sampling N blood subjects (N = 83 for the SAT sample
size, 77 for the VAT sample size, and 62 for the muscle sample
size).
(PDF)
Figure S6 The overlap of discordantly associated probe-SNP
pairs.
(PDF)
Figure S7 The comparison of Z-scores between two independent
blood datasets. The comparison of cis-eQTL effect was
confined to the set of 93,656 probe-SNP pairs that have been
tested in two independent blood datasets, i.e., a discovery set of
1,240 subjects profiled on the Illumina HT12 expression platform
(HT12) and a validation set of 229 subjects profiled on the
Illumina H8v2 expression platform (H8v2). The Z-scores of
cis-eQTL in the discovery set were the mean of the Z-scores from
100 random samples of 229 out of the 1,240 blood subjects. The gray
dots indicate the concordantly associated probe-SNP pairs
between the two blood samples. The red dots indicate the
discordantly associated probe-SNP pairs (the false-positive
tissue-dependent association). The black line is the diagonal line.
(PDF)
Figure S8 The probe-SNP distance for associated probe-SNP
pairs. The distance was calculated as the base pair (bp) position of
the SNP minus the bp position of the midpoint of the probe.
(PNG)
Figure S9 Probe-SNP distance for 2,794 eProbes in blood with
multiple independent eSNPs.
(PDF)
Figure S10 The discordant probe-SNP pairs vs. the probe-SNP
distance. The histogram shows the number of probe-SNP pairs
at different distances. The numbers on each bar show the total
number of probe-SNP pairs and the percentage of pairs with
discordant association. The 2×2 table for Fisher’s exact test is
shown.
(PDF)
Figure S11 The direction of allelic effect of rs5751777 on DDT
expression. The correlation between the genotype of rs5751777
and the expression intensity of the DDT gene (residual variance after
50 PCs removed) in five tissues. Each dot represents one subject,
red for females and blue for males. The x-axis represents the
genotypes and the y-axis represents the expression rank of the
probes.
(PDF)
Figure S12 The opposite association of the ORMDL3 gene between
blood and SAT. The x-axis is the genome position based on
genome build 36.3 (in Mb). The y-axis at the left is the association
profile in terms of Z-scores. The Z-scores in blood, represented as
the red dots, have been weighted by the square root of the sample
size, corresponding to the compared tissue. The blue dots
represent the Z-scores in SAT. The dashed green line indicates
the significance level at FDR 0.05. For a better illustration of allelic
direction, we assigned the association Z-scores in blood a positive
value. If the allelic direction in SAT is the same as that in blood,
the Z-scores in SAT are positive too; otherwise, the Z-scores in SAT
are negative.
(PDF)
Figure S13 The association profiles of the selected trait-associated
genes that show discordant association between blood
and liver. The x-axis is the genome position based on genome
build 36.3. The y-axis at the left is the association profile in terms
of the Z-score. The Z-scores in blood are represented as red or
orange dots. The red dots refer to the Z-scores that have been
weighted by the square root of the sample sizes, corresponding to
the compared tissue. For clarity of subtle effects in blood,
weak associations in blood are shown as orange dots if the Z-scores
have not been weighted by the sample size, i.e., the Z-scores
reported in 1,240 subjects. The blue dots represent the Z-scores in
liver. The dashed green line indicates the Z-score 3.49,
representing the significance level in blood at FDR 0.05. The
right panel shows the correlation of the absolute association
Z-scores between two tissues. The rho-value indicates the Pearson
correlation coefficient.
(PDF)
Figure S14 The association profiles of the selected trait-associated
genes that show discordant association between blood
and SAT. The x-axis is the genome position based on genome
build 36.3. The y-axis at the left is the association profile in terms
of the Z-score. The Z-scores in blood are represented as red or
orange dots. The red dots refer to the Z-scores that have been
weighted by the square root of the sample sizes, corresponding to
the compared tissue. For clarity of subtle effects in blood,
weak associations in blood are shown as orange dots if the Z-scores
have not been weighted by the sample size, i.e., the Z-scores
reported in 1,240 subjects. The blue dots represent the Z-scores in
SAT. The dashed green line indicates the Z-score 3.49,
representing the significance level in blood at FDR 0.05. The
right panel shows the correlation of the absolute association
Z-scores between two tissues. The rho-value indicates the Pearson
correlation coefficient.
(PDF)
Figure S15 The association profiles of the selected trait-associated
genes that show discordant association between blood
and VAT. The x-axis is the genome position based on genome
build 36.3. The y-axis at the left is the association profile in terms
of the Z-score. The Z-scores in blood are represented as red or
orange dots. The red dots refer to the Z-scores that have been
weighted by the square root of the sample sizes, corresponding to
the compared tissue. For clarity of subtle effects in blood,
weak associations in blood are shown as orange dots if the Z-scores
have not been weighted by the sample size, i.e., the Z-scores
reported in 1,240 subjects. The blue dots represent the Z-scores in
VAT. The dashed green line indicates the Z-score 3.49,
representing the significance level in blood at FDR 0.05. The
right panel shows the correlation of the absolute association
Z-scores between two tissues. The rho-value indicates the Pearson
correlation coefficient.
(PDF)
Figure S16 The association profiles of the selected trait-associated
genes that show discordant association between blood
and muscle. The x-axis is the genome position based on genome
build 36.3. The y-axis at the left is the association profile in terms
of the Z-score. The Z-scores in blood are represented as red or
orange dots. The red dots refer to the Z-scores that have been
weighted by the square root of the sample sizes, corresponding to
the compared tissue. For clarity of subtle effects in blood,
weak associations in blood are shown as orange dots if the Z-scores
have not been weighted by the sample size, i.e., the Z-scores
reported in 1,240 subjects. The blue dots represent the Z-scores in
muscle. The dashed green line indicates the Z-score 3.49,
representing the significance level in blood at FDR 0.05. The
right panel shows the correlation of the absolute association
Z-scores between two tissues. The rho-value indicates the Pearson
correlation coefficient.
(PDF)
Figure S17 Association profiles of MTMR3 in blood and liver.
The x-axis is the genome position based on genome build 36.3 (in
Mb). The y-axis at the left indicates the association Z-score. The
Z-scores in blood, represented as the red dots, have been weighted by
the square root of the sample size, corresponding to the compared
tissue. The blue dots represent the Z-scores in SAT. The dashed
green line indicates the Z-score 3.49, representing the significance
level in blood at FDR 0.05. The right panel shows the correlation
of the absolute association Z-scores between two tissues. The r-value
indicates the Pearson correlation coefficient.
(PDF)
Table S1 Characteristics of Samples.
(XLS)
Table S2 The number of discordant cis-eQTL between blood
and non-blood tissues.
(DOC)
Table S3 The number of independent eSNPs per probe.
(DOC)
Table S4 Replication of tissue-alternative cis-eQTL of
TMEM176A.
(DOC)
Table S5 Replication of cis-eQTL of DDT in blood and liver
that show opposite allelic direction.
(DOC)
Table S6 Allelic effect of disease-associated SNPs on the
expression of ORMDL3.
(DOC)
Table S7 The tissue-dependent regulation of 45 trait-associated
genes that show discordant association between blood and liver.
(XLS)
Table S8 The tissue-dependent regulation of 50 trait-associated
genes that show discordant association between blood and SAT.
(XLS)
Table S9 The tissue-dependent regulation of 46 trait-associated
genes that show discordant association between blood and VAT.
(XLS)
Table S10 The tissue-dependent regulation of 19 trait-associated
genes that show discordant association between blood and
muscle.
(XLS)
Table S11 The number of the differentially expressed eProbes.
(DOC)
Acknowledgments
We thank Robert Hartholt for helping with the DNA isolation and Pieter
van der Vlies, Elvira Oosterom, Marcel Bruinenberg, and Bahram Sanjabi
for the genotyping and gene expression profiling. We also thank Eric
Schadt for providing the allelic directions of eQTL detected in human liver
and Jackie Senior for editing the manuscript.
Author Contributions
Conceived and designed the experiments: JF LF CW MHH. Wrote the
paper: JF LF. Collected tissues: WAB SSMR HJMG RKW LHvdB JV
DvH. Conducted genotyping: CW RAO. Conducted expression profiling:
MGMW CW MHH. Bioinformatics and statistical analyses: JF PD H-JW
LF. Bioinformatics support: RCJ. PCA-based normalization: RSNF GJtM
LF. Helped to improve the manuscript: HS RCJ CW.
References
12. Ding J, Gudjonsson JE, Liang L, Stuart PE, Li Y, et al. (2010) Gene expression
in skin and lymphoblastoid cells: Refined statistical method reveals extensive
overlap in cis-eQTL signals. Am J Hum Genet 87: 779–789.
13. Nica AC, Parts L, Glass D, Nisbet J, Barrett A, et al. (2011) The architecture of
gene regulatory variation across multiple human tissues: the MuTHER study.
PLoS Genet 7: e1002003. doi:10.1371/journal.pgen.1002003.
14. Price AL, Helgason A, Thorleifsson G, McCarroll SA, Kong A, et al. (2011)
Single-Tissue and Cross-Tissue Heritability of Gene Expression Via Identity-by-Descent in Related or Unrelated Individuals. PLoS Genet 7: e1001317.
doi:10.1371/journal.pgen.1001317.
15. Dimas AS, Deutsch S, Stranger BE, Montgomery SB, Borel C, et al. (2009)
Common regulatory variation impacts gene expression in a cell type-dependent
manner. Science 325: 1246–1250.
16. Gerrits A, Li Y, Tesson BM, Bystrykh LV, Weersing E, et al. (2009) Expression
quantitative trait loci are highly sensitive to cellular differentiation state. PLoS
Genet 5: e1000692. doi:10.1371/journal.pgen.1000692.
17. Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, et al. (2002) Large-scale
analysis of the human and mouse transcriptomes. Proc Natl Acad Sci U S A 99:
4465–4470.
18. Fehrmann RSN, Jansen RC, Veldink JH, Westra H, Arends D, et al. (2011)
Trans-eQTLs reveal that independent genetic variants associated with a
complex phenotype converge on intermediate genes, with a major role for the
HLA. PLoS Genet 7: e1002197. doi:10.1371/journal.pgen.1002197.
19. Dubois PCA, Trynka G, Franke L, Hunt KA, Romanos J, et al. (2010) Multiple
common variants for celiac disease influencing immune gene expression. Nat
Genet 42: 295–302.
20. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009)
Potential etiologic and functional implications of genome-wide association loci
for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362–9367.
1. Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M (2009) Mapping
complex disease traits with global gene expression. Nat Rev Genet 10: 184–194.
2. Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, et al. (2010) Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery
from GWAS. PLoS Genet 6: e1000888. doi:10.1371/journal.pgen.1000888.
3. Monks SA, Leonardson A, Zhu H, Cundiff P, Pietrusiak P, et al. (2004) Genetic
inheritance of gene expression in human cell lines. Am J Hum Genet 75: 1094–1105.
4. Cheung VG, Conlin LK, Weber TM, Arcaro M, Jen K, et al. (2003) Natural
variation in human gene expression assessed in lymphoblastoid cells. Nat Genet
33: 422–425.
5. Bullaughey K, Chavarria CI, Coop G, Gilad Y (2009) Expression quantitative
trait loci detected in cell lines are often present in primary tissues. Hum Mol
Genet 18: 4296–4303.
6. Zhong H, Beaulaurier J, Lum PY, Molony C, Yang X, et al. (2010) Liver and
adipose expression associated SNPs are enriched for association to type 2
diabetes. PLoS Genet 6: e1000932. doi:10.1371/journal.pgen.1000932.
7. Schadt EE, Molony C, Chudin E, Hao K, Yang X, et al. (2008) Mapping the
genetic architecture of gene expression in human liver. PLoS Biol 6: e107.
doi:10.1371/journal.pbio.0060107.
8. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, et al. (2008)
Genetics of gene expression and its effect on disease. Nature 452: 423–428.
9. Heap GA, Trynka G, Jansen RC, Bruinenberg M, Swertz MA, et al. (2009)
Complex nature of SNP genotype effects on gene expression in primary human
leucocytes. BMC Med Genomics 2: 1.
10. Myers AJ, Gibbs JR, Webster JA, Rohrer K, Zhao A, et al. (2007) A survey of
genetic human cortical gene expression. Nat Genet 39: 1494–1499.
11. Richards AL, Jones L, Moskvina V, Kirov G, Gejman PV, et al. (2011)
Schizophrenia susceptibility alleles are enriched for alleles that affect gene
expression in adult human brain. Mol Psychiatry. doi:10.1038/mp.2011.1.
31. Hu Z, Wu C, Shi Y, Guo H, Zhao X, et al. (2011) A genome-wide association
study identifies two new lung cancer susceptibility loci at 13q12.12 and 22q12.2
in Han Chinese. Nat Genet 43: 792–796.
32. Gharavi AG, Kiryluk K, Choi M, Li Y, Hou P, et al. (2011) Genome-wide
association study identifies susceptibility loci for IgA nephropathy. Nat Genet 43:
321–327.
33. Franke A, McGovern DPB, Barrett JC, Wang K, Radford-Smith GL, et al.
(2010) Genome-wide meta-analysis increases to 71 the number of confirmed
Crohn’s disease susceptibility loci. Nat Genet 42: 1118–1125.
34. Imielinski M, Baldassano RN, Griffiths A, Russell RK, Annese V, et al. (2009)
Common variants at five new loci associated with early-onset inflammatory
bowel disease. Nat Genet 41: 1335–1340.
35. International HapMap Consortium (2005) A haplotype map of the human
genome. Nature 437: 1299–1320.
36. Liang P, Song F, Ghosh S, Morien E, Qin M, et al. (2011) Genome-wide survey
reveals dynamic widespread tissue-specific changes in DNA methylation during
development. BMC Genomics 12: 231.
37. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, et al. (2009) Human
DNA methylomes at base resolution show widespread epigenomic differences.
Nature 462: 315–322.
38. Eeckhoute J, Lupien M, Meyer CA, Verzi MP, Shivdasani RA, et al. (2009) Cell-type selective chromatin remodeling defines the active subset of FOXA1-bound
enhancers. Genome Res 19: 372–380.
39. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype
imputation method for the next generation of genome-wide association studies.
PLoS Genet 5: e1000529. doi:10.1371/journal.pgen.1000529.
40. Whitlock MC (2005) Combining probability from independent tests: the
weighted Z-method is superior to Fisher’s approach. J Evol Biol 18: 1368–1373.
41. Johnson AD, Handsaker RE, Pulit SL, Nizzari MM, O’Donnell CJ, et al. (2008)
SNAP: a web-based tool for identification and annotation of proxy SNPs using
HapMap. Bioinformatics 24: 2938–2939.
21. Schunkert H, König IR, Kathiresan S, Reilly MP, Assimes TL, et al. (2011)
Large-scale association analysis identifies 13 new susceptibility loci for coronary
artery disease. Nat Genet 43: 333–338.
22. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, et al.
(2010) Biological, clinical and population relevance of 95 loci for blood lipids.
Nature 466: 707–713.
23. Musunuru K, Strong A, Frank-Kamenetsky M, Lee NE, Ahfeldt T, et al. (2010)
From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus.
Nature 466: 714–719.
24. Esteller M, Garcia-Foncillas J, Andion E, Goodman SN, Hidalgo OF, et al.
(2000) Inactivation of the DNA-repair gene MGMT and the clinical response of
gliomas to alkylating agents. N Engl J Med 343: 1350–1354.
25. Moffatt MF, Kabesch M, Liang L, Dixon AL, Strachan D, et al. (2007) Genetic
variants regulating ORMDL3 expression contribute to the risk of childhood
asthma. Nature 448: 470–473.
26. Mells GF, Floyd JAB, Morley KI, Cordell HJ, Franklin CS, et al. (2011)
Genome-wide association study identifies 12 new susceptibility loci for primary
biliary cirrhosis. Nat Genet 43: 329–332.
27. Anderson CA, Boucher G, Lees CW, Franke A, D’Amato M, et al. (2011) Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number
of confirmed associations to 47. Nat Genet 43: 246–252.
28. Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, et al. (2009)
Genome-wide association study and meta-analysis find that over 40 loci affect
risk of type 1 diabetes. Nat Genet 41: 703–707.
29. Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, et al. (2008) Genome-wide association defines more than 30 distinct susceptibility loci for Crohn’s
disease. Nat Genet 40: 955–962.
30. Verlaan DJ, Berlivet S, Hunninghake GM, Madore A, Larivière M, et al. (2009)
Allele-specific chromatin remodeling in the ZPBP2/GSDMB/ORMDL3 locus
associated with the risk of asthma and autoimmune disease. Am J Hum Genet
85: 377–393.
PLoS Genetics | www.plosgenetics.org
January 2012 | Volume 8 | Issue 1 | e1002431
REPORT
A Versatile Gene-Based Test
for Genome-wide Association Studies
Jimmy Z. Liu,1,* Allan F. Mcrae,1 Dale R. Nyholt,1 Sarah E. Medland,1 Naomi R. Wray,1
Kevin M. Brown,2 AMFS Investigators,3 Nicholas K. Hayward,1 Grant W. Montgomery,1
Peter M. Visscher,1 Nicholas G. Martin,1 and Stuart Macgregor1,*
We have derived a versatile gene-based test for genome-wide association studies (GWAS). Our approach, called VEGAS (versatile gene-based association study), is applicable to all GWAS designs, including family-based GWAS, meta-analyses of GWAS on the basis of
summary data, and DNA-pooling-based GWAS, where existing approaches based on permutation are not possible, as well as singleton
data, where they are. The test incorporates information from a full set of markers (or a defined subset) within a gene and accounts for
linkage disequilibrium between markers by using simulations from the multivariate normal distribution. We show that for an association study using singletons, our approach produces results equivalent to those obtained via permutation in a fraction of the computation
time. We demonstrate proof-of-principle by using the gene-based test to replicate several genes known to be associated on the basis of
results from a family-based GWAS for height in 11,536 individuals and a DNA-pooling-based GWAS for melanoma in ~1300 cases and
controls. Our method has the potential to identify novel associated genes; provide a basis for selecting SNPs for replication; and be
directly used in network (pathway) approaches that require per-gene association test statistics. We have implemented the approach
in both an easy-to-use web interface, which only requires the uploading of markers with their association p-values, and a separate downloadable application.
Gene-based tests for association are increasingly being seen
as a useful complement to genome-wide association
studies (GWAS).1 A gene-based approach considers association between a trait and all markers (usually SNPs) within
a gene rather than each marker individually. Depending on
the underlying genetic architecture, gene-based approaches can be more powerful than traditional individual-SNP-based GWAS. For example, if a gene contains
more than one causative variant, then several SNPs within
that gene might show marginal levels of significance that
are often indistinguishable from random noise in the
initial GWAS results. By combining the effects of all SNPs
in a gene into a test-statistic and correcting for linkage
disequilibrium (LD), the gene-based test might be able to
detect these effects. Gene-based tests are also ideally suited
for network (or pathway) approaches to interpreting the
findings from GWAS.2–7 These approaches are necessarily
gene centric and require a measure of the relative importance of each gene to the phenotype of interest. The
gene-based approach also reduces the multiple-testing
problem of GWAS by only considering statistical tests for
~20,000 genes per genome as opposed to testing more
than half a million SNPs in a typical GWAS.
Ideally, a gene-based test statistic can be obtained with
permutations, where LD structure and other possible confounding factors, such as gene size, will be accounted for.
Computing a gene-based test for basic GWAS designs via
permutations is conceptually simple and is currently implemented as the ‘‘set-based test’’ in the PLINK software
package8; however, heavy computational requirements
have restricted this method from being adopted on
a genome-wide scale. Other gene-based tests, such as those
based on genetic distances9 or entropy,10 are often also
restricted to situations where individual genotype information is available or to specific GWAS designs (usually
case-control designs). There are several important situations in which permutations or existing methods cannot
be used; these include family-based GWAS, GWAS meta-analyses based on summary data, and DNA-pooling-based
GWAS. In contrast, our approach, called VEGAS (versatile
gene-based association study), only requires individual
marker p values in order to allow computation of a gene-based p value, and it can be applied to virtually any association study design. The method tests the evidence for association on a per-gene basis by summarizing either the full
set of markers (typically SNPs) in the gene or a subset of the
most significant markers (for example, the 10% most
significant SNPs). For some genes, an approach considering all the markers might be the most powerful; for
others, focusing on just the most associated markers might
be apt. The true underlying genetic architecture is seldom
known in advance. The default gene-based test in our
implementation and in the following examples uses the
full set of markers in the gene. Our approach takes account
of LD between markers in a gene by using simulation based
on the LD structure of a set of reference individuals
from a HapMap phase 2 population (CEU [Utah residents
with ancestry from northern and western Europe]; CHB
and JPT [Han Chinese in Beijing, China and Japanese in
Tokyo, Japan]; or YRI [Yoruba in Ibadan, Nigeria]), which
provides approximately 2.1 million autosomal SNPs,11 or
a custom set of individuals if genotype information is
available.
1Genetics and Population Health Division, Queensland Institute of Medical Research, Brisbane, Queensland 4006, Australia; 2Integrated Cancer Genomics
Division, The Translation Genomics Research Institute, Phoenix, Arizona 85028, USA; 3Australian Melanoma Family Study. List of participants and
affiliations appear in the Acknowledgements
*Correspondence: jimmy.liu@uqconnect.edu.au (J.Z.L.), stuart.macgregor@qimr.edu.au (S.M.)
DOI 10.1016/j.ajhg.2010.06.009. ©2010 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 87, 139–145, July 9, 2010 139
Our method assigns SNPs to each of 17,787 autosomal
genes according to positions on the UCSC Genome Browser
hg18 assembly. In order to capture regulatory regions
and SNPs in LD, we define gene boundaries in this case
as ±50 kb of the 5′ and 3′ UTRs. Then, for a given gene with
n SNPs, association p values are first converted to upper-tail chi-squared statistics with one degree of freedom (df).
The gene-based test statistic is then the sum of all (or
a pre-defined subset) of the chi-squared 1 df statistics within
that gene. If the SNPs are in perfect linkage equilibrium, the
test statistic will have a chi-squared distribution with n
degrees of freedom under the null hypothesis. Because
this is unlikely to be the case, however, the true null distribution given the LD structure (and hence p values that
correlate accordingly) will need to be taken into account.
Ideally, one would achieve this by performing a large
number of permutations; however, this is very computationally intensive, requires individual genotype information, and assumes that individuals are unrelated. Instead,
our Monte Carlo approach makes use of simulations from
the multivariate normal distribution and is both much
faster and agnostic regarding the GWAS design.
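The conversion and summation steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the VEGAS implementation; the function names `p_to_chisq_1df` and `gene_statistic` and the `top_fraction` parameter are hypothetical. It relies on the fact that a chi-squared variable with 1 df is a squared standard normal, so the statistic with upper-tail probability p is the square of the normal quantile at p/2.

```python
from statistics import NormalDist


def p_to_chisq_1df(p):
    """Chi-squared (1 df) statistic whose upper-tail probability is p.

    Requires 0 < p <= 1. Using the lower normal quantile at p / 2 (and
    squaring it) avoids the precision loss of computing 1 - p / 2 for
    very small p.
    """
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def gene_statistic(p_values, top_fraction=1.0):
    """Sum the chi-squared statistics of all SNPs in a gene.

    With top_fraction < 1, only that fraction of the most significant
    SNPs (at least one) contributes to the sum.
    """
    stats = sorted((p_to_chisq_1df(p) for p in p_values), reverse=True)
    keep = max(1, round(top_fraction * len(stats)))
    return sum(stats[:keep])
```

The observed gene-based statistic from `gene_statistic` is what the simulation step is then compared against.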
For a gene with n SNPs, we simulate an n-element multivariate normally distributed vector with mean 0 and variance $\Sigma$, the $n \times n$ matrix of pairwise LD (r) values. A vector
of n independent, standard, normally distributed random
variables is first generated and then multiplied by the Cholesky decomposition matrix of $\Sigma$, that is, the $n \times n$ lower
triangular matrix C such that $CC^{T} = \Sigma$. The new random
vector, $Z = (z_1, z_2, \ldots, z_n)$, will have a multivariate normal
distribution, $Z \sim N_n(0, \Sigma)$. Z is then transformed into
a vector of correlated chi-squared 1 df variables,
$Q = (q_1, q_2, \ldots, q_n)$, $q_i = z_i^2$. The simulated gene-based test
statistic is then the sum of all (or a predefined subset) of
the elements of Q and will have the same approximate
distribution as our observed gene-based test statistic under
the null hypothesis. A large number of multivariate
normal vectors are simulated, and the empirical gene-based p value is the proportion of simulated test statistics
that exceed the observed gene-based test statistic.
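The simulation above can be sketched in pure Python. This is a small illustrative version under stated assumptions, not the VEGAS code (which, per the text, uses R packages): the function names `cholesky` and `simulate_gene_p` are hypothetical, the LD matrix is passed as a list of lists, it must already be positive definite, and the full set of SNPs is summed.

```python
import random


def cholesky(sigma):
    """Lower-triangular C with C C^T = sigma (sigma must be positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = (sigma[i][i] - s) ** 0.5
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]
    return c


def simulate_gene_p(observed_stat, sigma, n_sims, rng):
    """Empirical p value: share of simulated statistics exceeding the observed one."""
    c = cholesky(sigma)
    n = len(sigma)
    exceed = 0
    for _ in range(n_sims):
        # Independent standard normal draws.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Z = C e has covariance sigma.
        z = [sum(c[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        # Simulated statistic is the sum of q_i = z_i^2.
        if sum(v * v for v in z) > observed_stat:
            exceed += 1
    return exceed / n_sims
```

For example, `simulate_gene_p(stat, ld_matrix, 1000, random.Random(0))` would give a first-stage empirical p value for a gene whose observed statistic is `stat`.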
We have implemented VEGAS as both an easy-to-use
web interface and a downloadable application for Linux
and Unix. The only user inputs required are a text file consisting of two columns: SNP rs-name and association p
value, along with specification of the reference population
(CEU, CHB and JPT, or YRI). The downloadable version
also allows the use of custom individual genotypes if available, as well as specification of gene boundaries. Pairwise
LD correlation matrices are calculated in PLINK. The R
corpcor package is used to correct for non-positive definite
correlation matrices,12 and multivariate normal random
vectors are simulated with the mvtnorm package.13
The number of simulations per gene is determined adaptively. In the first stage, 10³ simulations will be performed.
If the resulting empirical p value is less than 0.1, 10⁴ simulations will be performed. If the empirical p value from 10⁴
simulations is less than 0.001, the program will perform
10⁶ simulations. At each stage, the simulations are mutually exclusive. For computational reasons, if the empirical
p value is 0, then no more simulations will be performed.
An empirical p value of 0 from 10⁶ simulations can be interpreted as p < 10⁻⁶, which exceeds a Bonferroni-corrected threshold of p < 2.8 × 10⁻⁶ (≈ 0.05/17,787; this
threshold is likely to be conservative given the overlap
between genes). The user may select whether to perform
the gene-based test on the full set of SNPs within a gene,
a specified percentage of the most significant SNPs, or
just the single most significant SNP. Because the program
depends upon the output from other programs, it is important to take correct GWAS quality-control measures to
account for issues such as population stratification or pooling errors before using VEGAS.
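The adaptive staging described above can be sketched as a small driver loop. This is an assumption-laden illustration rather than the VEGAS source: `adaptive_empirical_p` and `run_simulations` are hypothetical names, `run_simulations(n)` is assumed to return the empirical p value from n fresh simulations, and the rule about stopping at an empirical p value of 0 is read here as applying after the final stage.

```python
# Each stage: (number of simulations, p value below which to escalate).
# The last stage has no escalation cutoff.
DEFAULT_STAGES = ((10**3, 0.1), (10**4, 0.001), (10**6, None))

# Bonferroni-corrected gene-level threshold for 17,787 autosomal genes.
BONFERRONI_THRESHOLD = 0.05 / 17787  # about 2.8e-6


def adaptive_empirical_p(run_simulations, stages=DEFAULT_STAGES):
    """Escalate the number of simulations only for promising genes.

    Returns the final empirical p value and the number of simulations
    used in the stage that produced it.
    """
    p, n_sims = None, 0
    for n_sims, escalate_below in stages:
        p = run_simulations(n_sims)
        if escalate_below is None or p >= escalate_below:
            break
    return p, n_sims
```

Most genes stop after the first 10³ simulations, which is where the bulk of the time saving comes from.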
Using a test with permutations as the ‘‘gold standard,’’
we compared the results from VEGAS to those from the
PLINK set-based test8 with permutations (with parameters
--set-p 1 --set-r2 1 --maf 0.01) on a GWAS for height in 3,611
unrelated Australian individuals drawn from community-based twin studies conducted from 1980 to 2004. Several
recent genetic studies of other traits,14–16 have used these
samples and have described genotype and phenotype
data cleaning. In brief, height was corrected for age and
sex before being converted to standard z scores. PLINK
was used for performing genome-wide association, from
which the results were used in our method. For a given set
of SNPs, the PLINK set-based test initially performs a standard association test and then uses the average association
test statistic across these SNPs as the ‘‘set-based’’ test
statistic (VEGAS uses the sum rather than average; the
two methods are equivalent in calculations of empirical p
values). Then, for the permutation procedure, the phenotypes are randomly shuffled among individuals, and the
process is repeated several thousand times, from which
an empirical p value is obtained. Because of computational
limitations, we only performed the PLINK set-based test on
413 genes on chromosome 22 with 10^4 permutations each.
To see how both tests deal with more significant genes, we
performed 10^6–10^7 permutations on seven additional
genes. These genes were chosen on the basis of having p
values < 10^-3 when VEGAS was applied across all chromosomes. The results from both tests
are shown in Figure 1, which compares the corresponding
−log10(p value)s from the PLINK set-based test and VEGAS
for 420 genes. For the majority of genes, both methods
produced very similar results. Correlation between the
p values was very high (Pearson r = 0.999), as was that
between the rankings (Spearman r = 0.998). Thus, in addition to being agnostic toward GWAS design, a major
advantage of our method over permutations is speed.
The PLINK set-based test on our computer took ~12 hr to
compute the 413 chromosome 22 genes plus 2 days for
the seven additional genes. In contrast, our approach
with 10^3 to 10^6 simulations per gene computed the same
set of genes in less than thirty minutes.
140 The American Journal of Human Genetics 87, 139–145, July 9, 2010
Figure 1. Comparison of the −log10(p value)s from the PLINK
Set-Based Test and VEGAS on a GWAS of Height in 3,611
Individuals
The PLINK set-based test was performed on 413 genes on chromosome 22 with 10^4 permutations (circles) and on seven genes on
other chromosomes; these were selected on the basis of having
the smallest p values from the VEGAS analysis, at 10^6 to 10^7
permutations (triangles). The p values from VEGAS were obtained
by running 10^3 to 10^7 multivariate normal simulations per gene.
The straight diagonal line indicates a 1:1 relationship.
We selected nine nonoverlapping genes of various sizes on
chromosome 22 to further investigate the type I error rate of
our method compared to those from permutations. The
previous height data were permuted 1000 times. VEGAS
and the PLINK set-based test were applied to the association
results of each permutation for each of the genes. The
comparison of the p values for each of the nine genes is
shown in Figure S1. Overall, there does not appear to be
any major bias involved with VEGAS. Nevertheless, it should
be noted that our method will produce spurious results if the
incorrect reference population, and hence LD structure, is
used. Biases toward smaller p values will occur if the reference population is older than the study population, and
larger p values will occur in the opposite situation. When
the same 420 genes and 3611 Australian individuals were
used, running VEGAS with the HapMap CEU population
as the reference produced results comparable to those from
permutation (Figure S2A), whereas using the HapMap YRI
population produced significant biases toward smaller
p values (Figure S2B). Slight biases might also potentially
occur for genes with a non-positive definite LD correlation
matrix. In our dataset, this was a property of ~80% of genes,
inhibiting the direct use of Cholesky decomposition. For
these genes, the nearest positive semidefinite matrix is estimated with the R corpcor package.12,17 Matrices that require
a large adjustment might explain some of the discrepancy
between VEGAS and permutations, although as seen in
Figure 1, this does not appear to have a major effect.
Figure 2. Comparison of the −log10(p value)s from Permutations and VEGAS When Only the Single Best SNP from Each
Gene Is Considered
Results are based on a GWAS of height in 3611 individuals. Permutations were performed on 413 genes on chromosome 22 with 10^3
permutations and on seven additional genes with 10^5–10^6 permutations. The p values from VEGAS were obtained from 10^3–10^6
multivariate normal simulations per gene. The straight diagonal
line indicates a 1:1 relationship.
Under some genetic architectures, a more powerful gene-based method may be to consider only the most significant
SNP in a gene rather than the full set of SNPs and then
correct this SNP’s association p value for gene size and other
possible confounders. Our approach can readily be applied
to this situation. For a gene with n SNPs, recall the simulated
vector of n correlated chi-squared 1 df variables,
$Q = (q_1, q_2, \ldots, q_n)$. For the ‘‘Top-SNP’’ method, we define
$Q_{\max}$ as the maximum element
of $Q$. Then, by simulating a large number of $Q_{\max}$ test statistics, the empirical gene-based p value is the proportion of
simulated $Q_{\max}$ test statistics that exceed the observed test
statistic of the most significant SNP in the gene.
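The Top-SNP procedure can be sketched in a few lines. This is an illustrative pure-Python version under stated assumptions, not the authors' R code: the names `cholesky` and `top_snp_p` and the example LD matrix are invented here, the LD matrix is assumed to be positive definite already (the nearest-matrix adjustment mentioned earlier is not reproduced), and correlated chi-squared variables are obtained by squaring correlated standard normal draws.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix; raises if not positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("LD matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def top_snp_p(observed_max, ld, n_sims, rng):
    """Proportion of simulated Q_max values reaching the observed top-SNP statistic."""
    lower = cholesky(ld)
    n = len(ld)
    exceed = 0
    for _ in range(n_sims):
        # Independent N(0, 1) draws, then y = L z has correlation matrix ld.
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Squaring each correlated normal gives a correlated chi-squared (1 df).
        q = [sum(lower[i][k] * z[k] for k in range(i + 1)) ** 2 for i in range(n)]
        if max(q) >= observed_max:
            exceed += 1
    return exceed / n_sims


if __name__ == "__main__":
    # Hypothetical 3-SNP gene with made-up pairwise LD correlations.
    ld = [[1.0, 0.8, 0.3],
          [0.8, 1.0, 0.2],
          [0.3, 0.2, 1.0]]
    print(top_snp_p(9.0, ld, 5_000, random.Random(1)))
```

Replacing `max(q)` with `sum(q)` gives the corresponding sketch for the full-set test, since the text states that VEGAS uses the sum of the per-SNP statistics.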
Using the same 420 genes as in our previous analysis
with the full set of SNPs, we compared the VEGAS Top-SNP method and permutations (Figure 2). Note that in
this case, we ran our own permutations by using R rather
than the PLINK set-based test because the two methods
are not equivalent. As with the test considering the full
set of SNPs, VEGAS produces results very similar to those
from permutations. Correlation between the p values was
very high (Pearson r = 0.996), as was that between the
rankings (Spearman r = 0.996).
Our method of using the full set of SNPs per gene was
applied to two situations where permutation tests are not
applicable: a family-based GWAS for height, where permutation cannot account for phenotypic correlation between
Table 1. VEGAS Results for the 15 Most Significant Genes from a Family-Based GWAS for Height in 11,536 Individuals
Chromosome  Gene        Number of SNPs  Start Position  Stop Position  Test Statistic  p Value       Best SNP    SNP p Value
4           HHIP^a      26              145786622       145879331      263.505         10^-6         rs1812175   1.06 × 10^-9
6           GPR126^a    23              142664748       142809096      169.912         5 × 10^-6     rs6570507   2.16 × 10^-7
8           CHCHD7^a    4               57286868        57293730       31.82           3.2 × 10^-5   rs7833986   2.20 × 10^-4
6           HMGA1^a     6               34312627        34321986       38.934          8.4 × 10^-5   rs1776897   6.71 × 10^-6
15          ADAMTSL3^a  85              82113841        82499597       344.52          1.34 × 10^-4  rs7183263   3.89 × 10^-7
4           LCORL^a     30              17453940        17632474       222.748         1.38 × 10^-4  rs6817306   7.63 × 10^-6
20          GDF5^a      10              33484562        33489441       81.199          1.78 × 10^-4  rs4911494   1.39 × 10^-4
12          HMGA2^a     34              64504506        64646338       147.824         3.00 × 10^-4  rs8756      4.26 × 10^-7
1           MFAP2       15              17173585        17180668       76.961          3.71 × 10^-4  rs11203280  6.03 × 10^-4
17          C17orf78    5               32807097        32823775       27.012          5.31 × 10^-4  rs8067120   1.80 × 10^-3
6           HIST1H3G^a  16              26379124        26379591       86.062          5.77 × 10^-4  rs10946808  2.48 × 10^-5
2           NMUR1       18              232096114       232103426      102.955         6.05 × 10^-4  rs1434519   3.29 × 10^-5
4           ADH5        26              100211152       100228954      142.218         8.01 × 10^-4  rs1042364   2.45 × 10^-4
8           SPATC1      8               145158594       145174003      58.172          8.30 × 10^-4  rs3936211   7.35 × 10^-4
2           EMX1        13              72998111        73015528       60.278          9.62 × 10^-4  rs10183113  3.71 × 10^-6
^a These genes have been implicated in previous GWAS of height.22 The signal in HIST1H3G is driven by a variant previously implicated in the neighboring
HIST1H1G.
family members, and a DNA-pooling GWAS for melanoma
(MIM 155600), where individual genotype information is
not available. For height, we included an extra 7,935 relatives of those in our original GWAS of 3,611 unrelated individuals. These consisted of parents, offspring, siblings,
twins, and other family members, all typed with the
same SNP chip as the unrelated individuals used in the first
calculation. The results of the family-based association
analysis were previously published in Liu, et al.18 Table 1
lists the 15 most significant height-associated genes
obtained from VEGAS. One gene, the previously implicated HHIP (MIM 606178; p = 1 × 10^-6),19–21 exceeded
a Bonferroni-corrected threshold of p < 2.8 × 10^-6. Overall, nine of the top 15 genes have been previously implicated in published GWAS of height at genome-wide significance.22 It remains to be seen whether any of the
remaining genes play a role in height. The gene NMUR1
(MIM 604153; p = 6.05 × 10^-4) is a G-protein-coupled
receptor and is also involved in neuropeptide signaling,
similar to the previously implicated GPR126 (MIM
612243; p = 5 × 10^-6). Height might also be mediated
by MFAP2 (MIM 156790; p = 3.71 × 10^-4) through its
role as a glycoprotein component of connective-tissue
microfibrils,23 for which normal connective-tissue development is essential for height growth. Mutations in other
microfibril components have been linked to Marfan
syndrome (MIM 154700), a genetic disorder characterized
by skeletal overgrowth.24 These results suggest that despite
having a relatively small sample size for a GWAS for height,
the gene-based test has the potential to identify novel
genes. In a two-stage GWAS, the most significant genes
may also be used as a basis for selecting SNPs for replication
samples.
For melanoma, the gene-based test was performed on
the results from a GWAS that used pooled DNA in 1354
melanoma cases and 1291 controls. The sample was originally part of a larger previously published GWAS for melanoma,25 and pooling and association methods are
described in that study. This study was performed with
the approval of the appropriate ethics committee and
with informed consent from all participants.
As for height, the results from the gene-based test are
consistent with our current understanding of the genetics
of melanoma (Table 2). Overall, all of the top 15 genes are
in regions known to harbor melanoma-susceptibility
genes. Seven genes identified are located on 20q11.22,
the region originally implicated by Brown et al.25 and containing the skin pigmentation gene ASIP (MIM 600201);
these include MAP1LC3A (MIM 601242; p < 10^-6), PIGU
(MIM 608528; p = 2 × 10^-6), DYNLRB1 (MIM 607167;
p = 7 × 10^-6), TP53INP2 (p = 4.7 × 10^-5), and NCOA6
(MIM 605299; p = 1.38 × 10^-4). ASIP itself, however,
was nonsignificant (p = 0.116). Given the size of this associated region, it could be the case that a distant enhancer
rather than nonsynonymous or proximal regulatory
elements is driving the association with ASIP. Similarly,
a large number of associated genes are also located on
16q24.3; the most significant of these genes was DEF8
(p = 4 × 10^-5). Given that DEF8 lies ~30 kb downstream
of the known melanoma-susceptibility gene, MC1R (MIM
155555), it is likely that this signal is driven by variants
in and around MC1R, which was only nominally
Table 2. VEGAS Results for the 15 Most Significant Genes from a DNA-Pooling GWAS for Melanoma in 1354 Cases and 1291 Controls
Chromosome  Gene        Number of SNPs  Start Position  Stop Position  Test Statistic  p Value       Best SNP    SNP p Value
20          MAP1LC3A    59              32598352        32611810       762.618         <10^-6        rs910873    1.00 × 10^-16
20          PIGU        93              32612006        32728750       964.294         2 × 10^-6     rs910873    1.00 × 10^-16
15          MYEF2       25              46218920        46257850       50.865          4 × 10^-6     rs2470102   4.18 × 10^-4
20          DYNLRB1     58              32567864        32592423       548.265         7 × 10^-6     rs910873    1.00 × 10^-16
20          SNTA1       39              31459423        31495359       242.906         9 × 10^-6     rs291695    6.60 × 10^-11
16          DEF8        73              88542651        88561968       318.251         4.0 × 10^-5   rs1805007   3.33 × 10^-16
20          TP53INP2    44              32755808        32764898       312.611         4.7 × 10^-5   rs4417778   5.35 × 10^-9
20          NCOA6       81              32766238        32877094       563.953         1.38 × 10^-4  rs4911442   2.71 × 10^-10
20          CDK5RAP1    55              31410305        31452998       260.851         1.53 × 10^-4  rs291695    6.60 × 10^-11
5           RXFP3       48              33972247        33974099       138.421         1.95 × 10^-4  rs35389     1.31 × 10^-8
16          C16orf55    49              88251710        88265176       244.276         3.12 × 10^-4  rs258322    1.34 × 10^-7
16          MGC16385    59              88563701        88566443       218.033         3.99 × 10^-4  rs8049897   9.74 × 10^-7
16          DPEP1       58              88207216        88232340       248.214         4.54 × 10^-4  rs12918773  4.47 × 10^-7
16          CHMP1A      52              88238344        88251630       248.105         4.60 × 10^-4  rs258322    1.34 × 10^-7
16          SPG7        73              88102305        88151675       370.214         4.66 × 10^-4  rs4785686   2.76 × 10^-7
significant (p = 1.30 × 10^-3), rather than DEF8 itself. Likewise, the gene RXFP3 (p = 1.95 × 10^-4) is adjacent to
SLC45A2 (MIM 606202; p = 8.91 × 10^-3), a known melanoma-susceptibility gene, and MYEF2 (p = 4 × 10^-6) is
adjacent to SLC24A5 (MIM 609802; p = 2.34 × 10^-3),
a gene associated with skin pigmentation.
Although VEGAS was able to produce results equivalent
to those obtained through permutations at a fraction of
the time taken, as well as replicate several known height- and melanoma-associated genes, there are several situations in which use of the gene-based test is limited. The
effectiveness of VEGAS, along with other gene-based
methods, is determined by the underlying genetic architecture of the gene and phenotype of interest. Although
gene-based methods are more powerful than single-marker
analysis for identifying significant genes with multiple
causal variants, the converse is also true. If a gene contains
only one causal variant, then the inclusion of a large
number of nonsignificant markers into the gene-based
test will dilute this gene’s significance. The correct genetic
model to use is seldom known in advance, although our
method can be performed on a specified subset of markers
or just the single most significant marker rather than all
markers in a gene. Similarly, the use of ±50 kb to define
gene boundaries is an arbitrary choice. Large boundaries
mean that some markers are included in multiple genes, resulting in a situation similar to our results for melanoma,
where it may be difficult to pinpoint the causal gene
when multiple adjacent genes are statistically significant.
Specifying stringent boundaries, however, may not fully
capture regulatory regions or those SNPs in high LD with
variants in the gene. Moreover, given that the majority
of SNPs so far identified in GWAS are found in nongenic
regions,26 these SNPs would not be included in any gene-centric analysis at all. For these reasons, gene-based
methods should not be seen as a replacement for traditional single-marker association studies but rather should
be seen as a complement to GWAS and an essential step
for network- and pathway-based approaches. We offer
our gene-based test not as a definitive solution to the
problem but rather as one tool in the complex-trait geneticist’s toolbox for post-GWAS analysis.
Supplemental Data
Supplemental Data include two figures and Supplemental
Acknowledgments and can be found with this article online at
http://www.cell.com/AJHG/.
Acknowledgments
Australian Melanoma Family Study Investigators: Graham J. Mann
and Richard F. Kefford (Westmead Institute of Cancer Research,
University of Sydney at Westmead Millennium Institute and Melanoma Institute Australia, PO Box 412, Westmead, NSW 2145,
Australia); John L. Hopper (Centre for Molecular, Environmental,
Genetic, and Analytic Epidemiology, School of Population Health,
Level 2, 723 Swanston Street, University of Melbourne, VIC 3052,
Australia); Joanne F. Aitken (Viertel Centre for Research in Cancer
Control, The Queensland Cancer Council Queensland, PO Box
201, Spring Hill, QLD 4004, Australia); Graham G. Giles (Cancer
Epidemiology Centre, The Cancer Council Victoria, Carlton, VIC
3053, Australia); and Bruce K. Armstrong (School of Public Health,
A27, University of Sydney, NSW 2006, Australia). J.Z.L. is supported by National Health and Medical Research Council
(NHMRC) project grant 496675. S.M., N.K.H., G.W.M., P.M.V.,
A.F.M., and S.E.M. are supported by the NHMRC Fellowships
scheme. N.R.W. and D.R.N. are supported by Australian Research
Council Fellowships. K.M.B. is a recipient of a Career Development Award from the Melanoma Research Foundation and is supported by the National Cancer Institute, National Institutes of
Health (CA109544, CA083115). We thank Joseph Powell for suggesting the name VEGAS. Additional acknowledgements are
provided in the Supplemental Data.
Received: April 29, 2010
Revised: June 7, 2010
Accepted: June 11, 2010
Published online: July 1, 2010
Web Resources
The URLs for data presented herein are as follows:
corpcor, http://strimmerlab.org/software/corpcor
mvtnorm, http://cran.r-project.org/package=mvtnorm
Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim
PLINK, http://pngu.mgh.harvard.edu/~purcell/plink
R, http://www.r-project.org
UCSC Genome Browser, http://genome.ucsc.edu
VEGAS, http://genepi.qimr.edu.au/general/softwaretools.cgi
References
1. Neale, B.M., and Sham, P.C. (2004). The future of association
studies: gene-based analysis and replication. Am. J. Hum.
Genet. 75, 353–362.
2. Wang, K., Li, M., and Bucan, M. (2007). Pathway-based
approaches for analysis of genomewide association studies.
Am. J. Hum. Genet. 81, 1278–1283.
3. Perry, J.R.B., McCarthy, M.I., Hattersley, A.T., Zeggini, E., Weedon, M.N., Frayling, T.M., and Wellcome Trust Case Control
Consortium. (2009). Interrogating type 2 diabetes genome-wide association data using
a biological pathway-based approach. Diabetes 58, 1463–1467.
4. Holmans, P., Green, E.K., Pahwa, J.S., Ferreira, M.A.R., Purcell,
S.M., Sklar, P., Owen, M.J., O’Donovan, M.C., and Craddock,
N.; Wellcome Trust Case-Control Consortium. (2009). Gene
ontology analysis of GWA study data sets provides insights
into the biology of bipolar disorder. Am. J. Hum. Genet. 85,
13–24.
5. Ruano, D., Abecasis, G.R., Glaser, B., Lips, E.S., Cornelisse,
L.N., de Jong, A.P., Evans, D.M., Davey Smith, G., Timpson,
N.J., Smit, A.B., et al. (2010). Functional gene group analysis
reveals a role of synaptic heterotrimeric G proteins in cognitive ability. Am. J. Hum. Genet. 86, 113–125.
6. Baranzini, S.E., Galwey, N.W., Wang, J., Khankhanian, P., Lindberg, R., Pelletier, D., Wu, W., Uitdehaag, B.M.J., Kappos, L.,
Polman, C.H., et al; GeneMSA Consortium. (2009). Pathway
and network-based analysis of genome-wide association
studies in multiple sclerosis. Hum. Mol. Genet. 18, 2078–2090.
7. Elbers, C.C., van Eijk, K.R., Franke, L., Mulder, F., van der
Schouw, Y.T., Wijmenga, C., and Onland-Moret, N.C. (2009).
Using genome-wide pathway analysis to unravel the etiology
of complex diseases. Genet. Epidemiol. 33, 419–431.
8. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira,
M.A.R., Bender, D., Maller, J., Sklar, P., de Bakker, P.I.W.,
Daly, M.J., and Sham, P.C. (2007). PLINK: a tool set for
whole-genome association and population-based linkage
analyses. Am. J. Hum. Genet. 81, 559–575.
9. Buil, A., Martinez-Perez, A., Perera-Lluna, A., Rib, L., Caminal,
P., and Soria, J.M. (2009). A new gene-based association test
for genome-wide association studies. BMC Proc 3 (Suppl 7),
S130.
10. Cui, Y., Kang, G., Sun, K., Qian, M., Romero, R., and Fu, W.
(2008). Gene-centric genomewide association study via
entropy. Genetics 179, 637–650.
11. Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve,
L.L., Gibbs, R.A., Belmont, J.W., Boudreau, A., Hardenbol, P.,
Leal, S.M., et al; International HapMap Consortium. (2007).
A second generation human haplotype map of over 3.1
million SNPs. Nature 449, 851–861.
12. Schaefer, J., Opgen-Rhein, R., and Strimmer, K. (2009). Efficient estimation of covariance and (partial) correlation.
http://strimmerlab.org/software/corpcor/.
13. Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., and
Hothorn, T. (2009). mvtnorm: Multivariate normal and t
distributions. http://CRAN.R-project.org/package=mvtnorm.
14. Medland, S.E., Nyholt, D.R., Painter, J.N., McEvoy, B.P.,
McRae, A.F., Zhu, G., Gordon, S.D., Ferreira, M.A., Wright,
M.J., Henders, A.K., et al. (2009). Common variants in the trichohyalin gene are associated with straight hair in Europeans.
Am. J. Hum. Genet. 85, 750–755.
15. Cornes, B.K., Medland, S.E., Ferreira, M.A., Morley, K.I., Duffy,
D.L., Heijmans, B.T., Montgomery, G.W., and Martin, N.G.
(2005). Sex-limited genome-wide linkage scan for body mass
index in an unselected sample of 933 Australian twin families.
Twin Res. Hum. Genet. 8, 616–632.
16. Benyamin, B., Perola, M., Cornes, B.K., Madden, P.A.F., Palotie, A., Nyholt, D.R., Montgomery, G.W., Peltonen, L., Martin,
N.G., and Visscher, P.M. (2008). Within-family outliers: segregating alleles or environmental effects? A linkage analysis of
height from 5815 sibling pairs. Eur. J. Hum. Genet. 16,
516–524.
17. Higham, N.J. (1988). Computing a nearest symmetric positive
semidefinite matrix. Linear Algebra Appl. 103, 103–118.
18. Liu, J.Z., Medland, S.E., Wright, M.J., Henders, A.K., Heath,
A.C., Madden, P.A., Duncan, A.D., Montgomery, G.W.,
Martin, N.G., and McRae, A.F. (2010). Genome-wide association study of height and body mass index in Australian twin
families. Twin Res. Hum. Genet. 13, 179–193.
19. Weedon, M.N., Lango, H., Lindgren, C.M., Wallace, C., Evans,
D.M., Mangino, M., Freathy, R.M., Perry, J.R.B., Stevens, S.,
Hall, A.S., et al; Diabetes Genetics Initiative, Wellcome Trust
Case Control Consortium, Cambridge GEM Consortium.
(2008). Genome-wide association analysis identifies 20 loci
that influence adult height. Nat. Genet. 40, 575–583.
20. Gudbjartsson, D.F., Walters, G.B., Thorleifsson, G., Stefansson, H., Halldorsson, B.V., Zusmanovich, P., Sulem, P., Thorlacius, S., Gylfason, A., Steinberg, S., et al. (2008). Many
sequence variants affecting diversity of adult human height.
Nat. Genet. 40, 609–615.
21. Lettre, G., Jackson, A.U., Gieger, C., Schumacher, F.R., Berndt,
S.I., Sanna, S., Eyheramendy, S., Voight, B.F., Butler, J.L., Guiducci, C., et al; Diabetes Genetics Initiative, FUSION, KORA,
Prostate, Lung Colorectal and Ovarian Cancer Screening Trial,
Nurses’ Health Study, SardiNIA. (2008). Identification of ten
loci associated with height highlights new biological pathways in human growth. Nat. Genet. 40, 584–591.
22. Hindorff, L., Junkins, H., Mehta, J., and Manolio, T. (2009).
A catalog of published genome-wide association studies.
http://www.genome.gov/gwastudies/ (Accessed: April 26
2010).
23. Faraco, J., Bashir, M., Rosenbloom, J., and Francke, U. (1995).
Characterization of the human gene for microfibril-associated
glycoprotein (MFAP2), assignment to chromosome 1p36.1-p35, and linkage to D1S170. Genomics 25, 630–637.
24. Judge, D.P., and Dietz, H.C. (2005). Marfan’s syndrome.
Lancet 366, 1965–1976.
25. Brown, K.M., Macgregor, S., Montgomery, G.W., Craig, D.W.,
Zhao, Z.Z., Iyadurai, K., Henders, A.K., Homer, N., Campbell,
M.J., Stark, M., et al. (2008). Common sequence variants on
20q11.22 confer melanoma susceptibility. Nat. Genet. 40,
838–840.
26. Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M.,
Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009). Potential
etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad.
Sci. USA 106, 9362–9367.
Bioinformatics Advance Access published April 17, 2012
INRICH: Interval-based Enrichment Analysis for Genome
Wide Association Studies
Phil H. Lee,1,2,3 Colm O’Dushlaine,3 Brett Thomas,1 Shaun M. Purcell1,2,3,4∗
1 Analytical Translational Genetics Unit, Center for Human Genetic Research, Massachusetts
General Hospital, MA 02114; 2 Department of Psychiatry, Harvard Medical School, Boston, MA
02115; 3 Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge,
MA 02142; 4 Mount Sinai School of Medicine, New York, NY 10029, USA
∗ To whom correspondence should be addressed
Associate Editor: Dr. Jeffrey Barrett
1 INTRODUCTION
Multilocus approaches (often known as pathway or gene-set enrichment analysis methods) can be used to ask whether sets of
single nucleotide polymorphisms (SNPs), often defined by groups
of functionally-related genes, are in aggregate more highly associated with a phenotype than expected by chance. Consideration of
the biological relationships amongst the “top hits” in a genome-wide association study (GWAS) can provide orthogonal evidence,
over and above the functionally-agnostic analysis of the number, statistical significance and/or variance explained of those hits.
For example, that a GWAS has three independent SNPs with p-values of around 1×10^-6 is in itself unremarkable. However, if
the associated regions independently map to three of a small set
of functionally-related genes, this will be very unlikely to occur
by chance: consequently, we would likely wish to put more weight on these associations. As well as providing additional statistical
evidence to sub-threshold association results, another use of gene-set analysis can be called in silico fine-mapping, or prioritizing
specific genes in loci that contain multiple genes with equivalent
association evidence. For example, of ten associated genes within
a block of strong linkage disequilibrium (LD), we may find that
only one shows above-chance relatedness to genes that appear
in other, statistically-independent association intervals. All other
things being equal, one would presumably consider that gene as
more likely to be causally-related compared to the other nine. Furthermore, the identity of the particular enriched gene-sets may offer
insights into disease mechanism and biology, although this will
be contingent on the gene-sets’ accuracy, comprehensiveness and
relevance to the phenotype’s underlying biology.
Over the past few years, several gene-set methods for GWAS have
been developed (Wang et al., 2007; Holmans et al., 2009). Still,
there clearly exist challenges and limitations to be addressed (Hong
et al., 2009). Desirable properties of a gene-set test include that it
is i) robust (able to calculate experiment-wide significance,
with adjustment for common biases due to gene size, LD within and
between genes, etc.), ii) flexible, with application to (summary) data
from different sources, such as GWAS, imputed data, copy
number variant (CNV) studies, targeted sequencing, tables
in manuscripts, etc., and iii) computationally manageable, allowing
genome-wide analysis in a reasonable time on a single machine.
Here we describe the gene-set enrichment analysis tool INRICH
(INterval enRICHment analysis) that aims to satisfy the above properties. INRICH takes a set of independent, nominally-associated
genomic intervals and then tests for the enrichment of predefined
gene-sets. An “interval” will typically correspond to a genomic
region of SNP association defined by LD from a genome-wide scan,
although intervals could also represent, e.g. deletion or duplication
events observed in cases, regions identified as homozygous-by-descent, etc.
2 METHODS
We describe the method implemented in INRICH, focussing on the case
of SNP association from GWAS data. Specifically, the analysis proceeds in
three steps:
© The Author (2012). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ABSTRACT
Summary: Here we present INRICH, a pathway-based genome-wide
association analysis tool that tests for enriched association signals of
predefined gene-sets across independent genomic intervals. INRICH
has wide applicability, fast running time and, most importantly, robustness to potential genomic biases and confounding factors. Such
factors, including varying gene size and SNP density, linkage disequilibrium within and between genes and overlapping genes with
similar annotations, are often not accounted for by existing gene-set
enrichment methods. By using a genomic permutation procedure, we
generate experiment-wide empirical significance values, corrected for
the total number of sets tested, implicitly taking overlap of sets into
account. By simulation we confirm a properly controlled type I error
rate and reasonable power of INRICH under diverse parameter settings. As a proof of principle, we describe the application of INRICH
on the NHGRI GWAS catalog.
Availability: A standalone C++ program, user manual, and datasets
can be freely downloaded from: http://atgu.mgh.harvard.edu/inrich/.
Contact: shaun@atgu.mgh.harvard.edu
Supplementary information: available from the journal web-site.
3 DATA ANALYSIS AND SUMMARY
INRICH takes disease-associated genomic intervals as input – for example, all GWAS SNPs (and the other, local SNPs in LD) that are associated
with a phenotype at p<1×10^-4. Either PLINK (Purcell et al., 2007) LD-clumping or tag SNP selection commands (or similar tools) can be used to
define such independent regions of association, which ensures that multiple, adjacent SNPs that potentially tag the same causal variant are analyzed
as one independent association unit. Due to space limitation, we provide a
detailed instruction manual on the data generation and testing procedure at
our website (http://atgu.mgh.harvard.edu/inrich/).
We first conducted a simulation study to assess the Type I error rates of
INRICH using two GWAS datasets: HapMap III (CEU+TSI; n=200), and
a schizophrenia case/control study (n=1,468) (Lieberman et al., 2005). Tested
parameter settings include different enrichment statistics (i.e., “interval” or
“target” mode), LD-clumping r^2 measures (r^2 = 0.2), as well as p value thresholds used to define associated regions (1×10^-3 and 5×10^-3).
Under each setting, we repeated the following procedures 200 times, and
calculated the average type I error rate: i) Generate random phenotype labels
for subjects; ii) Apply standard χ^2 association analysis on individual SNPs;
and iii) Run INRICH on the association results using the KEGG gene-sets
(Kanehisa et al., 2010). We also conducted the same simulation study using
two commonly used gene-set enrichment approaches: GenGen (Wang et al.,
2007) (i.e., GSEA tool specifically designed for GWAS) and the hypergeometric test. Compared to these methods, the average type I error rates of
INRICH did not exceed the nominal 5% level. In contrast, under some conditions, the hypergeometric test yielded a type I error rate as high as 100%.
We also considered the power under conditions where the hypergeometric
test is valid, and confirmed that INRICH gives a comparably good power to
the hypergeometric test (see S3.xlsx for details). Phenotype-permutation-based
gene-set enrichment methods (such as GenGen) provide statistically rigorous tests, but are computationally very demanding (particularly if based on
imputed datasets, or complex family-based association tests, etc.). In contrast, other gene-set enrichment methods based on summary data alone (such
as the hypergeometric test) are not computationally intensive, but can be very
anti-conservative, as our simulations show, due to unwarranted assumptions
of independence. We argue that INRICH is well-placed between these two
poles, providing an efficient yet robust middle-ground.
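For orientation, the hypergeometric test referred to above can be written in a few lines. This is a generic sketch, not the configuration used in the simulations: the function name and the gene-level parameterisation (a population of genes, some of which belong to the tested set, with the associated genes treated as independent draws) are assumptions made here. That independence assumption is exactly what clustering of related genes and LD between neighbouring genes can violate.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable X.

    population: total number of genes considered
    successes:  genes in the tested gene set
    draws:      genes called associated (treated as independent draws)
    k:          associated genes that fall in the gene set (k >= 0)
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out of the sum automatically.
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Made-up numbers: 20,000 genes, a 50-gene set, 100 associated genes,
    # 3 of which are in the set.
    print(hypergeom_upper_tail(3, 20_000, 50, 100))
```

If three adjacent set members are all tagged by a single association signal, this calculation counts them as three independent hits, which is one way the test becomes anti-conservative.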
As a proof of concept to demonstrate the performance of INRICH under
the alternative hypothesis, we applied INRICH to the summary association
data from the NHGRI (National Human Genome Research Institute) GWAS
catalog (Hindorff et al., 2009). First, we downloaded a list of 4,689 SNPs
that are associated with 411 complex diseases/traits at a p value <1×10^-5
(download date: 2011-Mar-04). This analysis focused on 236 diseases/traits
that have at least five associated SNPs. For each phenotype, LD-independent
intervals were generated around the associated SNPs using PLINK, and an enrichment test was conducted using 3,182 Gene Ontology (GO) terms (gene-set
size between 5 and 200 genes) (The Gene Ontology Consortium, 2000) and
10^6 replicates in the first round of permutation and 10^4 in the second. We
excluded all genes and intervals mapping to the broad MHC region (chr6:25–35 Mb): in practice, because this region contains so many genes, it is unlikely
to improve the power of gene-set enrichment analysis in most cases. After
multiple testing correction, 47 disorders were predicted with at least one
significantly enriched GO term at α=5%. Many of the associations were
consistent with known pathology of examined complex diseases/traits. For
example, Type II diabetes-associated intervals were most significantly enriched for genes involved in glucose homeostasis (corrected p=0.001) and
Crohn’s disease-associated intervals enriched for regulation of activated T
cell proliferation (corrected p=0.003).
In summary, we have implemented a new gene-set enrichment method
in the INRICH package, based on a constrained reshuffling of associated
intervals, to test whether more genes from particular sets are contained in
those intervals than expected by chance. Importantly, we preserve the properties of the original data whilst reshuffling, in terms of the number, SNP
density and gene-density. We have shown appropriate type I error rates, even
when correcting for hundreds of partially-overlapping gene-sets. Preliminary
application to the NHGRI GWAS catalogue indicates good power to detect
true signals. INRICH was recently applied to a large GWAS of bipolar disorder, implicating calcium ion channel genes as enriched (Psychiatric GWAS
Consortium Bipolar Disorder Working Group, 2011). Practically, INRICH is
fast, applicable without individual genotype data, and freely available either
as a command-line tool or with a GUI.
2.2 Overlapping Interval/Gene Merging
It is not uncommon for functionally-related genes to show physical clustering, which can yield inflated false positive rates for such gene sets
if dependent signals are assumed to be independent (Hong et al., 2009;
Holmans et al., 2009). To avoid this potential bias due to multi-counting
physically clustered genes belonging to the same set, we merge overlapping
genes belonging to the same gene-set. We also merge overlapping testing
intervals to ensure that testing units are statistically independent from each
other.
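The merging step can be illustrated with a short sketch. This is not the INRICH implementation; it assumes intervals are given as (chromosome, start, end) tuples with closed coordinates, and the function name is chosen for this example only. The same routine would apply to genes within one gene-set or to testing intervals.

```python
def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) intervals on the same chromosome.

    Assumption for this sketch: coordinates are closed, so two intervals
    that share a boundary position are treated as overlapping.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Extend the previous interval instead of counting a new unit.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]

example = [("chr1", 100, 200), ("chr1", 150, 300),
           ("chr1", 400, 500), ("chr2", 100, 200)]
merged_example = merge_intervals(example)
```

After merging, each remaining interval is counted once, so a cluster of physically adjacent genes from the same set contributes a single testing unit.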
2.3 Set-based Enrichment Tests
The primary enrichment statistic E for each gene-set is the number of intervals that overlap at least one “target” gene (i.e., gene in the tested set),
which we refer to as the interval mode. An alternative test instead counts
the number of target genes that overlap at least one interval, which is useful
for analyzing structural variation data (e.g., CNVs) that typically span large
genomic regions and therefore are likely to disrupt multiple, non-overlapping
genes. We call this setting the target mode. We use a permutation
approach, described below, to calculate empirical significance p values for
each gene-set.
Suppose that input data I includes k intervals, I = {i1 , ..., ik }, and
target gene-set T includes m genes, T = {t1 , ..., tm }.
1. A null interval set R is generated by randomly assigning intervals to
genomic locations, with the constraint that each null interval ri ∈ R
approximately matches the original interval Ii ∈ I (i = 1, ..., k)
in terms of the number of SNPs and overlapping genes; we also ensure
approximately similar SNP density per kilobase. Figure S1 illustrates
the three matching criteria.
2. Corresponding to the selected testing mode as described above, the null
enrichment statistic E is calculated as the number of overlapping intervals (or genes) between target gene-set T and randomly matched null
set R.
3. Steps 1 and 2 are repeated N times to generate a distribution of the
enrichment statistics for target gene-set T under the null hypothesis.
4. The empirical p value for T is the proportion of N replicates where the
enrichment statistic is at least as large as that of the original interval set I.
5. Multiple testing correction is achieved via a second, nested round of
permutation to assess the null distribution of the minimum empirical p
value across all tested gene-sets.
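Steps 1 to 4 can be sketched as follows for the interval mode. This is a deliberately simplified illustration, not the INRICH algorithm: the genome is a single linear coordinate system, and null intervals are matched to the originals only by length, whereas the method described above also matches on SNP count, gene count and SNP density. All function and parameter names are assumptions made for this example.

```python
import random

def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def interval_stat(intervals, target_genes):
    """Interval mode: number of intervals overlapping >= 1 target gene."""
    return sum(
        any(overlaps(s, e, gs, ge) for gs, ge in target_genes)
        for s, e in intervals
    )

def empirical_p(intervals, target_genes, genome_length, n_perm=1000, seed=0):
    """Proportion of null replicates whose statistic is at least as large
    as the observed one (steps 1-4, with length-only matching)."""
    rng = random.Random(seed)
    observed = interval_stat(intervals, target_genes)
    hits = 0
    for _ in range(n_perm):
        null = []
        for s, e in intervals:
            length = e - s
            new_start = rng.randint(0, genome_length - length)
            null.append((new_start, new_start + length))
        if interval_stat(null, target_genes) >= observed:
            hits += 1
    return observed, hits / n_perm

# Hypothetical toy data: three intervals, each covering one target gene.
toy_intervals = [(100, 200), (1000, 1100), (5000, 5100)]
toy_targets = [(150, 160), (1050, 1060), (5050, 5060)]
obs, p = empirical_p(toy_intervals, toy_targets, genome_length=1_000_000)
```

Step 5 would wrap this in a second round of permutation that records the minimum empirical p value across all gene-sets in each replicate; that part is omitted here.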
This permutation procedure therefore respects the relationship between gene
size and the probability of chance overlap, namely that large genes are
more likely to be hit by chance. As previously reported, large genes are not
representative of all genes in terms of function (Raychaudhuri et al., 2010).
INRICH also presents global enrichment statistics Gp that test for an
excess of enriched genes at nominal gene-set p=0.001, 0.01, 0.05. This test
is based on the number of unique genes within an association interval that
are in at least one nominally-enriched gene-set. The empirical significance
of Gp is evaluated within the same permutation procedure described above.
Downloaded from http://bioinformatics.oxfordjournals.org/ at Univ of Colorado Libraries on May 3, 2012
2.1 Interval Data Generation
ACKNOWLEDGEMENT
The authors thank Dr. Peter Holmans for insightful comments.
REFERENCES
Hindorff, L., Sethupathy, P., Junkins, H., Ramos, E., Mehta, J., Collins, F., and
Manolio, T. (2009). Potential etiologic and functional implications of genome-wide
association loci for human diseases and traits. Proc Natl Acad Sci USA, 106(23),
9362–9367.
Holmans, P., Green, E., Pahwa, J., Ferreira, M., Purcell, S., Sklar, P., Wellcome Trust
Case-Control Consortium, Owen, M., O’Donovan, M., and Craddock, N. (2009). Gene ontology analysis
of GWA study data sets provides insights into the biology of bipolar disorder. Am J
Hum Genet, 85(1), 13–24.
Hong, M. G., Pawitan, Y., Magnusson, P. K., and Prince, J. A. (2009). Strategies and
issues in the detection of pathway enrichment in genome-wide association studies.
Hum Genet, 126(2), 289–301.
Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., and Hirakawa, M. (2010). KEGG
for representation and analysis of molecular networks involving diseases and drugs.
Nucleic Acids Res, 38, D355–D360.
Lieberman, J., Stroup, T., McEvoy, J., Swartz, M., Rosenheck, R., Perkins, D., Keefe,
R., Davis, S., Davis, C., Lebowitz, B., Severe, J., Hsiao, J., and Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) Investigators (2005). Effectiveness
of antipsychotic drugs in patients with chronic schizophrenia. N Engl J Med, 353,
1209–1223.
Psychiatric GWAS Consortium Bipolar Disorder Working Group (2011). Large-scale
genome-wide association analysis of bipolar disorder identifies a new susceptibility
locus near ODZ4. Nature Genet, 43, 977–983.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M., Bender, D., Maller,
J., Sklar, P., de Bakker, P., Daly, M., and Sham, P. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81(3),
559–575.
Raychaudhuri, S., Korn, J., McCarroll, S., Altshuler, D., Sklar, P., Purcell, S., Daly,
M., and International Schizophrenia Consortium (2010). Accurately assessing the risk of schizophrenia
conferred by rare copy-number variation affecting genes with brain function. PLoS
Genet, 6, e1001097.
The Gene Ontology Consortium (2000). Gene ontology: tool for the unification of
biology. Nat. Genet, 25(1), 25–29.
Wang, K., Li, M., and Bucan, M. (2007). Pathway-based approaches for analysis of
genomewide association studies. Am J Hum Genet, 81, 1278–1283.
ANALYSIS
Chromatin marks identify critical cell types for fine
mapping complex trait variants
© 2012 Nature America, Inc. All rights reserved.
Gosia Trynka1–4,8, Cynthia Sandor1–4,8, Buhm Han1–4, Han Xu5, Barbara E Stranger1,4,7, X Shirley Liu5 &
Soumya Raychaudhuri1–4,6
If trait-associated variants alter regulatory regions, then they
should fall within chromatin marks in relevant cell types.
However, it is unclear which of the many marks are most
useful in defining cell types associated with disease and fine
mapping variants. We hypothesized that informative marks
are phenotypically cell type specific; that is, SNPs associated
with the same trait likely overlap marks in the same cell type.
We examined 15 chromatin marks and found that those
highlighting active gene regulation were phenotypically
cell type specific. Trimethylation of histone H3 at lysine 4
(H3K4me3) was the most phenotypically cell type specific
(P < 1 × 10−6), driven by colocalization of variants and
marks rather than gene proximity (P < 0.001). H3K4me3
peaks overlapped with 37 SNPs for plasma low-density
lipoprotein concentration in the liver (P < 7 × 10−5), 31 SNPs
for rheumatoid arthritis within CD4+ regulatory T cells
(P = 1 × 10−4), 67 SNPs for type 2 diabetes in pancreatic islet
cells (P = 0.003) and the liver (P = 0.003), and 14 SNPs for
neuropsychiatric disease in neuronal tissues (P = 0.007). We
show how cell type–specific H3K4me3 peaks can inform the
fine mapping of associated SNPs to identify causal variation.
1Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School,
Boston, Massachusetts, USA. 2Division of Rheumatology, Brigham and Women’s
Hospital, Harvard Medical School, Boston, Massachusetts, USA. 3Partners
Center for Personalized Genetic Medicine, Boston, Massachusetts, USA.
4Program in Medical and Population Genetics, Broad Institute of MIT and
Harvard, Cambridge, Massachusetts, USA. 5Department of Biostatistics and
Computational Biology, Dana-Farber Cancer Institute and Harvard School of
Public Health, Boston, Massachusetts, USA. 6Faculty of Medical and Human
Sciences, University of Manchester, Manchester, UK. 7Present addresses:
Section of Genetic Medicine, Department of Medicine, University of Chicago,
Chicago, Illinois, USA and Institute for Genomics and Systems Biology, University
of Chicago, Chicago, Illinois, USA. 8These authors contributed equally to this
work. Correspondence should be addressed to S.R. (soumya@broadinstitute.org).
Received 6 July; accepted 28 November; published online 23 December 2012;
doi:10.1038/ng.2504
NATURE GENETICS ADVANCE ONLINE PUBLICATION
Recent work showing that common phenotypically associated SNPs
are enriched for expression quantitative trait loci (eQTLs)1–6 suggests that they might act by altering gene regulatory regions. One
example is a common non-coding variant associated with plasma
low-density lipoprotein (LDL) concentration. This variant modifies
a CEBPB transcription factor–binding site in an enhancer and, in
doing so, alters the expression of SORT1, a gene that affects plasma
LDL concentration7. Another similar example is an intergenic risk
allele for systemic lupus erythematosus (SLE) that decreases TNFAIP3
transcription by modifying the nuclear factor (NF)-κB–binding site
within a promoter8. Whereas many eQTLs and regulatory variants
act universally, the ones most relevant to disease might have
tissue-specific activity6. The cell type specificity of regulatory elements is one
of the major limitations in pursuing functional studies to investigate
the regulatory potential of common alleles9–13.
One approach to identify regulatory elements influenced by common variants involves assaying epigenetic chromatin marks14–16.
For example, H3K4me3 and monomethylation at H3K4 (H3K4me1)
highlight active promoters and enhancers. But a practical challenge
of this approach is that dozens of chromatin marks might potentially
be assayed17, and it is prohibitive to conduct studies on all of them
in large numbers of different tissues or in samples collected from
many individuals. However, because chromatin marks colocalize18,
the status of a small subset of the most informative marks might be
characterized, allowing for more focused assays in tissue libraries and
populations to link variants to regulatory mechanisms. Additionally,
for a given phenotype, it is challenging to know in which cell type(s)
chromatin marks should be assayed in order to fine map risk
alleles. If the critical cell types were known, then it might be possible
to identify the biologically important cell type–specific eQTLs.
Here, we hypothesize that a proportion of alleles for a given phenotype influence gene regulation by altering regulatory elements that
control expression within the cell types most relevant to the phenotype. If this is the case, then variants associated with the same
phenotype should overlap marks preferentially occurring within the
same cell type. Therefore, to identify the most informative chromatin
marks, we quantify the degree to which their activity in specific cell
types near phenotypically associated variants tracks with phenotype.
We then show how those chromatin marks that are most phenotypically cell type specific can identify causal cell types, asserting that
cell type–specific marks might be used to fine map and identify the
plausible causal variant at a particular locus.
RESULTS
Summary of statistical methods
We first sought to define a score that corresponds to the possibility
that a phenotypically associated SNP or a variant in tight linkage disequilibrium (LD) with it can alter cell type–specific gene regulation,
as highlighted by a specific chromatin mark. We define chromatin
marks as precise positions in the genome where there is a significant
excess of reads from chromatin immunoprecipitation and sequencing
(ChIP-seq) data over control sequencing data. We assume that variants close to or directly under tall chromatin mark peaks in specific
cell types might be involved in cell type–specific gene regulation; on
the other hand, variants that are far from chromatin mark peaks are
much less likely to have a direct role in gene regulation. First, for each
phenotypically associated SNP, we identified each SNP or insertion
and/or deletion (indel) in tight LD (r² > 0.8 in 1000 Genomes Project
data19; Fig. 1a). Next, for each cell type, we assigned each variant in
LD a score proportional to the height of the nearest chromatin mark
peak (referred to as h; Online Methods) divided by the physical distance to the summit (h/d in Fig. 1b; referred to as s; Online Methods).
If the physical distance to the nearest peak is more than 2.5 kb, then
the score is set to 0 to obviate any confounding distal effects. Thus,
a variant in LD directly under a strong peak will receive a very high
score. For each cell type, we assigned the phenotypically associated
SNP the maximum score achieved by any of its variants in LD. To
quantify the specificity of signals across cell types (as opposed to
the absolute magnitude), we normalized the h/d scores so that the
Euclidean norm across cell types was one (normalized h/d scores, sn;
Online Methods). Thus, a SNP within a chromatin mark that is active
in only one cell type will have a high score of 1 in that cell type and 0
in others. In contrast, a SNP close to chromatin marks that are not cell
type specific will have similarly modest scores across cell types.
Figure 1 Overview of the statistical approach.
(a) For phenotypically associated variants, other variants in tight LD are found. For each
SNP associated with a phenotype from genetic studies (lead SNP, blue diamond; top), we
define a locus by identifying SNPs in tight LD (r² > 0.8, dashed red line; bottom) using data
from the 1000 Genomes Project (blue dots; bottom). (b) Each locus is scored on the height
and distance of the nearest peak to a variant in LD. For a selected chromatin mark, we define
peaks (red) in n cell types across the genome. For each SNP in the locus (blue diamond and
light-blue circles), we compute a score equal to the height of the closest peak (vertical purple
line) divided by the distance to the summit in each of the n cell types (horizontal purple line).
In each locus within each cell type, we note the value of the SNP with the highest score: this
measure reflects the overlap between a locus and a cell type–specific regulatory element.
(c) Across many phenotypes, we assess whether marks overlap alleles in specific cell types.
Here, the measure of cell type specificity of each risk locus is represented by the intensity
of red color. A phenotypically cell type–specific mark should consistently give signal in one or a
small number of cell types for a given phenotype (yellow outline). We quantify the phenotypic cell
type specificity of each mark. (d) Permutations are performed to assess the significance of phenotypic cell type specificity. To compute the significance
of the phenotypic cell type specificity for a chromatin mark, we permute SNPs from different loci across phenotypes; this preserves tissue-specific
signals without altering the correlation and prevalence of tissue-specific signals.
[Figure 1 panel labels: (a) observed association (–log10 P) and LD (r², 1000 Genomes Project) against genomic position (kb), with the P < 5 × 10–8 associated variant marked; (b) peaks in cell types a to n with Score = h/d per cell type; (c, d) loci grouped by phenotypes 1 to m against cell types a to n, colored by cell type specificity (h/d) from 0 to 1.]
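The scoring rule described in this section (h/d per variant, a 2.5-kb cutoff, the maximum over variants in LD, then normalization to unit Euclidean length across cell types) can be sketched as below. This is an illustrative reading of the text, not the authors' code: the data layout, the function names and the 1-bp floor on distance (used to avoid division by zero when a variant sits exactly on a summit) are assumptions made for this example.

```python
import math

MAX_DISTANCE_BP = 2500  # scores are set to 0 beyond 2.5 kb, as in the text

def variant_score(position, peaks):
    """h/d for one variant in one cell type.

    peaks: list of (summit_position, height) tuples for that cell type.
    """
    if not peaks:
        return 0.0
    summit, height = min(peaks, key=lambda peak: abs(peak[0] - position))
    distance = abs(summit - position)
    if distance > MAX_DISTANCE_BP:
        return 0.0
    # Assumption: floor the distance at 1 bp so a variant on the summit
    # gets a finite score.
    return height / max(distance, 1)

def locus_scores(ld_variant_positions, peaks_by_cell_type):
    """Per cell type, the maximum score over all variants in LD."""
    return {
        cell: max(variant_score(pos, peaks) for pos in ld_variant_positions)
        for cell, peaks in peaks_by_cell_type.items()
    }

def normalize(scores):
    """Scale scores so their Euclidean norm across cell types is one."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    if norm == 0:
        return dict(scores)
    return {cell: value / norm for cell, value in scores.items()}

# Hypothetical locus: two variants in LD, peaks in two cell types.
toy_peaks = {"liver": [(1000, 50.0)], "brain": [(10000, 80.0)]}
raw = locus_scores([900, 1010], toy_peaks)
normalized = normalize(raw)
```

In this toy locus only the liver peak lies within 2.5 kb of a variant, so the normalized liver score is 1 and the brain score is 0, matching the behavior the text describes for a mark active in a single cell type.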
Then, we wanted to quantify the phenotypic cell type specificity
of the overlap between SNPs and chromatin marks. To do this, we
identified sets of SNPs associated with different phenotypes and
then assessed the phenotypic cell type specificity of different marks
(Fig. 1c). For informative marks, one or few cell types should consistently score highly across many of the SNPs for a given phenotype. For an uninformative chromatin mark, the cell types with the
greatest scores vary from SNP to SNP within the same phenotype.
Therefore, for informative marks, there should be minimal deviation
of scores within a phenotype across multiple cell types. To quantify
the phenotypic cell type specificity of a chromatin mark, we defined
a metric representing the variation of signal seen within a cell type
within a specific phenotype (referred to as d; Online Methods). We
evaluated the statistical significance of this metric using permutations
in which we randomly reassigned SNPs to phenotypes (Fig. 1d).
This permutation strategy restricts analysis to only phenotypically
associated SNPs and, in doing so, avoids biases that might result
from known differences between phenotypically associated SNPs
and non–phenotypically associated SNPs in local LD structure, gene
density and epigenetic activity. We note that this approach accurately
estimates type I error (Supplementary Fig. 1a).
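The label-permutation idea can be sketched as follows. The paper's actual specificity metric is defined in its Online Methods and is not reproduced here; the statistic below (for each phenotype, the highest mean normalized score over cell types, summed across phenotypes) is a stand-in chosen only to make the permutation scheme concrete. Function names and the add-one p value correction are likewise assumptions of this sketch.

```python
import random

def specificity_stat(scores, labels):
    """Stand-in statistic: sum over phenotypes of the largest per-cell-type
    mean score. Larger values mean SNPs of a phenotype agree on a cell type."""
    by_phenotype = {}
    for vector, label in zip(scores, labels):
        by_phenotype.setdefault(label, []).append(vector)
    total = 0.0
    for vectors in by_phenotype.values():
        n_cells = len(vectors[0])
        means = [sum(v[c] for v in vectors) / len(vectors) for c in range(n_cells)]
        total += max(means)
    return total

def permutation_p(scores, labels, n_perm=2000, seed=0):
    """Randomly reassign SNPs to phenotypes, keeping only associated SNPs
    and the number of SNPs per phenotype fixed."""
    rng = random.Random(seed)
    observed = specificity_stat(scores, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if specificity_stat(scores, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical data: 10 SNPs per phenotype, two cell types.
toy_scores = [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 10
toy_labels = ["A"] * 10 + ["B"] * 10
stat, p_perm = permutation_p(toy_scores, toy_labels)
```

Because only phenotype labels are shuffled, every permuted data set contains the same SNPs and the same score vectors, which is what lets the procedure avoid comparing associated SNPs against non-associated ones.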
Active gene regulation is phenotypically cell type specific
To test the phenotypic cell type specificity of individual marks, we
identified a set of SNPs associated with any one of many complex
traits20. We selected only SNPs associated in European populations to facilitate LD calculations. To ensure adequate power, we
selected only those traits that had at least 15 reported associations
in European populations. Then, we pruned SNPs by LD so that they
were all independent (r² < 0.1 and >100 kb away from other associated
SNPs in the genome; Online Methods). This resulted in a set of 510
independent SNPs associated with 31 complex traits. After defining
the genomic locations and heights of peaks for 15 chromatin marks
assayed in 14 Encyclopedia of DNA Elements (ENCODE) cell types15
(Supplementary Table 1), we observed statistically significant phenotypic cell type specificity for 4 marks (P < 0.0033 = 0.05/15; Fig. 2).
The most strongly associated chromatin marks were H3K4me3 and
acetylation of histone H3 at lysine 9 (H3K9ac) (P < 1 × 10−6), which
are known to highlight active gene promoters16,21. In fact, all four
most significant modifications are known to occur at regions of the
genome involved in active gene transcription; DNase I hypersensitivity sites (DHSs; P < 1 × 10−3) and dimethylation of histone H3
at lysine 79 (H3K79me2; P < 1 × 10−5) identify active promoter,
enhancer or transcribed regions. Because some chromatin marks
colocalize (Supplementary Fig. 2), we performed conditional
analyses to assess whether chromatin marks contributed to phenotypic cell type specificity independently (Supplementary Fig. 3).
We observed that the highly significant associations of H3K4me3,
DHSs and H3K9ac were generally not independent. In contrast, we
found that chromatin marks that did not correspond to active gene
regulation were not phenotypically cell type specific. In particular, H3K9me1, H3K9me3, CTCF-binding sites and trimethylation
at histone H3 lysine 27 (H3K27me3), highlighting transcriptionally repressed heterochromatic insulator and polycomb-repressed
regions, respectively, showed no evidence of being phenotypically
cell type specific (P > 0.40).
To assess the reproducibility of these results, we conducted a similar analysis of data from the US National Institutes of Health (NIH)
Epigenomics Project, consisting of assays for 6 different chromatin
marks in 38 different cell types22 (Supplementary Table 2). We again
observed that the most informative mark was H3K4me3 (P < 1 × 10−6),
along with H3K4me1 (Fig. 2). H3K9ac was more nominally significant
(P = 0.03), perhaps owing to the fewer cell types assayed in this experiment. The concordance of the results from these two data sets was
reassuring when considering that the data from the ENCODE Project
were obtained on cell lines, whereas most of the NIH Epigenomics
Project data were obtained using primary cell types.
Our approach benefits from taking advantage of 1000 Genomes
Project data to identify variants in LD (Fig. 1a). Repeating our analysis
using only the reported lead SNPs and not examining SNPs in LD
resulted in considerably less significant results (Supplementary
Fig. 1b). We note that some of the variation in phenotypic cell type
specificity could be related to the variable number of assayed cell
types for different chromatin marks; power to detect phenotypic
cell type specificity correlates with the number of assayed cell types
(Supplementary Fig. 4).
Figure 2 Evaluating the significance of phenotypic cell type specificity
for different marks. We used two data sets of marks assayed in different
cell types: the ENCODE Project and NIH Epigenomics Project. For each
mark, we performed up to 1 million permutations of SNPs and phenotypes
to calculate the null distribution of phenotypic cell type specificity for
comparison to observed phenotypic cell type specificity. Below, we show
the observed phenotypic cell type specificity (green lines) against the null
distribution (black and gray density plots). Above, we plot the corresponding
P values. The red dashed line indicates the significance threshold after
correcting for the testing of multiple independent hypotheses.
[Figure 2 panel labels: ENCODE Project (14 cell types) and NIH Epigenomics Project (38 cell types). Upper axis: phenotypic cell type specificity (P value), from 1 to 10–6. Lower axis: score (observed versus random). Marks shown: H3K27me3, H3K9me3, CTCF-binding site, H3K9me1, H4K20me1, H3K4me2, H3K36me3, H3K4me1, H2A.Z, H3K27ac, DNase I HS, H3K79me2, H3K9ac, Pol2b-binding site, H3K4me3.]
Variants colocalize with cell type–specific H3K4me3 peaks
Because chromatin marks tend to concentrate in and around genes, we
considered the possibility that the observed overlap between H3K4me3
peaks and variants might be an artifact of proximity to gene transcript
sequences with phenotypically cell type specific expression. To assess
the role of the specific peak locations versus proximity to specifically
expressed genes, we repeated our analyses after randomly shifting the
specific location of peaks locally (±10 kb, s.d. of 2.5 kb) within phenotypically associated loci. While these small shifts would maintain the
proximity of peaks to genes, they would disrupt the specific colocalization of variants and H3K4me3 peaks. Indeed, in 1,000 such experiments, we found that shifting peak locations lowered the significance
of phenotypic cell type specificity (median P = 0.03), and we did not
observe any instance where the phenotypic cell type specificity was
more significant than it was in the actual data (Supplementary Fig. 5).
This result strongly suggests that the specific colocalization of variants
in LD with phenotypically associated SNPs and H3K4me3 peaks rather
than proximity to gene structures is driving the phenotypic cell type
specificity signal (P < 0.001 by permutation).
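The peak-shifting control can be illustrated with a small sketch. The text gives only the range and standard deviation of the shifts; drawing offsets from a zero-mean normal distribution and redrawing any offset beyond the range is an assumption of this example, as are the function and parameter names.

```python
import random

def shift_peaks(summits, sd_bp=2500, max_shift_bp=10000, seed=None):
    """Return peak summit positions shifted by small random offsets.

    Assumption: offsets are drawn from a normal distribution with the
    stated s.d. and redrawn if they fall outside the stated range.
    """
    rng = random.Random(seed)
    shifted = []
    for summit in summits:
        offset = rng.gauss(0, sd_bp)
        while abs(offset) > max_shift_bp:
            offset = rng.gauss(0, sd_bp)
        shifted.append(summit + int(round(offset)))
    return shifted

# Hypothetical summit positions within one locus.
original = [100000, 200000, 300000]
shifted = shift_peaks(original, seed=1)
```

Repeating the full specificity analysis on many such shifted peak sets, and counting how often the shifted data are at least as significant as the real data, gives the kind of empirical comparison reported above.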
Enhancers and promoters underlie phenotypic cell type specificity
To understand whether the phenotypic cell type specificity that
we observed was driven by the activity of promoters or enhancers, we
divided chromatin peaks into those falling within proximal promoter
regions (including the transcriptional start site (TSS) ± 2 kb) and
those falling outside of promoter regions and repeated our analysis.
Whereas phenotypic cell type specificity was seen both within and
outside of the immediate promoter regions, H3K4me3, H3K79me2
and DHSs were more significantly phenotypically cell type specific
outside of promoter regions than within (Supplementary Fig. 6).
We note that, although H3K4me3 marks are not generally thought
of as being enriched in enhancers, there is evidence that they
can be enriched in strong and disease-associated enhancers9,23,24.
Alternatively, H3K4me3 enrichment outside of promoter sites might
also represent unannotated sites.
We further assessed the phenotypic cell type specificity of previously
published functional annotations on the basis of hidden Markov model
states capturing information on nine separate chromatin marks 9.
We observed that hidden states 4 and 5, corresponding to active
proximal enhancers and active distal enhancers, respectively, were
most significantly phenotypically cell type specific (Supplementary
Fig. 7). State 4 is highly enriched for H3K4me3 peaks, the mark that
we observed to be the most phenotypically cell type specific.
Identification of key cell types for four phenotypes
We identified the cell types within which common variants likely influence gene regulation using published SNPs for 4 distinct phenotypes
(Fig. 3 and Supplementary Table 3) and H3K4me3 data from the
Epigenomics Project for a panel of 34 cell-types22. We selected these
phenotypes because there is a reasonable sense of what the critical
cell types might be and because a sufficient number of associated
SNPs had been identified. For each phenotype, we assigned a cell
type specificity score to each of its associated variants (Fig. 1a,b and
Online Methods) and compared these scores to those
from equal-sized sets of matched SNPs
sampled from 45,950 LD-pruned SNPs3.
Because phenotypically associated SNPs have more epigenetic activity
than other SNPs, we were careful to match sampled SNPs so that
they had similar total numbers of H3K4me3 peaks across all 34 cell
types as associated SNPs. Results were generally consistent in a more
stringent analysis when we sampled instead from only phenotypically associated SNPs from the National Human Genome Research
Institute (NHGRI) genome-wide association study (GWAS) catalog20
(Supplementary Fig. 8). In addition to these phenotypes, we present
separately the results for four additional phenotypes, B-cell–specific
cis eQTL associations, SLE, type 1 diabetes (T1D) and body mass
index (BMI) (Supplementary Fig. 9); in all of those instances, except
BMI, we were able to identify highly significant cell types.
Application to plasma LDL concentration implicates liver
As a positive control, we tested 37 SNPs associated with LDL concentration25 for overlap with H3K4me3 marks in different tissues.
These variants should implicate regulatory activity within the liver,
according to previous work7,26,27. In aggregate, we observed that the
37 SNPs implicated a total of 1,501 H3K4me3 peaks in 34 different cell
types. The most significant cell type was adult liver tissue (P = 7.2 ×
10−5; Fig. 3a). We observed overlap with liver-specific peaks using
other phenotypically cell type–specific marks, including H3K9ac
(P = 0.003) and H3K4me1 (P = 0.002). In contrast, we observed
little association with liver for the H3K27me3 or H3K9me3 marks
(Fig. 2 and Supplementary Table 4). Examining the relative proximity and specificity of the SNPs within the 10,000 matched
SNP sets used to calculate statistical significance, we identified the
95th-percentile threshold at a score of 0.58 (Fig. 4a). Of the 37 SNPs
associated with LDL concentration, 7 (19%) were near a highly
liver-specific chromatin mark at this threshold. These seven SNPs
are generally in tight LD with a variant that is very close to cell type–
specific H3K4me3 peaks (median of 132 bp away; see Supplementary
Table 3 for details on the specific SNPs).
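The thresholding step (a 95th-percentile cutoff taken from scores of matched SNP sets, then counting associated SNPs at or above it) can be sketched as follows. The nearest-rank percentile rule, the function names and the SNP identifiers are assumptions made for this illustration; the paper does not state which percentile estimator was used.

```python
def percentile_threshold(null_scores, percent=95):
    """Nearest-rank percentile of the null (matched-SNP) scores."""
    ordered = sorted(null_scores)
    index = min(len(ordered) - 1, (percent * len(ordered)) // 100)
    return ordered[index]

def snps_above(observed_scores, threshold):
    """Names of associated SNPs whose specificity score reaches the threshold."""
    return [name for name, score in observed_scores.items() if score >= threshold]

# Hypothetical null scores and hypothetical SNP identifiers.
null_scores = [i / 100 for i in range(100)]
threshold = percentile_threshold(null_scores)
observed = {"rs_a": 0.20, "rs_b": 0.96, "rs_c": 0.99}
highly_specific = snps_above(observed, threshold)
```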
Application to rheumatoid arthritis implicates CD4+ Treg cells
For rheumatoid arthritis and other autoimmune diseases, the critical
immune cell types are often not clearly defined in the literature and
can be controversial28–30. When we tested the 31 SNPs associated
with rheumatoid arthritis31, we observed that they implicated 1,328
H3K4me3 peaks in 34 tissues, with the most significant association
to CD4+ T cells and, in particular, CD4+ regulatory T (Treg) cells
(P = 1.3 × 10−4; Fig. 3b). The phenotypically similar CD4+ memory
T cells were also highly significantly associated (P = 7.0 × 10−4)32.
Of the 31 SNPs associated with rheumatoid arthritis, we found that
6 (19.3%) were close to chromatin marks that were highly specific to
CD4+ Treg cells, with relative specificity of 0.53 or greater (permuted
95th-percentile threshold; Fig. 4b). These 6 SNPs are generally in tight
LD with a variant that is very close to cell type–specific H3K4me3
peaks (median of 37 bp away; see Supplementary Table 3 for details
on the specific SNPs).
Figure 3 SNPs for four complex traits overlap
H3K4me3 marks in specific cell types. (a–d)
We considered four phenotypes: LDL cholesterol
plasma concentration (a), rheumatoid arthritis (b),
neuropsychiatric disorders (schizophrenia and
bipolar disease) (c) and T2D (d). For each
phenotype, we calculated the cell type–specific
overlap with H3K4me3 histone modification
peaks in 34 tissues (listed on the left). The
histograms on the right show the significance
of the overlap for each tissue with variants from
each of the phenotypes, estimated by sampling
sets of SNPs matched so that the total number
of peaks overlapping SNPs in LD was the same
as in the test set. Adjacent to each histogram,
we present correlation coefficients between
two tissues based on scores computed from
randomly sampled sets of independent loci.
Colored boxes in d show independent P values
for pancreatic islets and liver computed by
removing the SNPs scoring highly in one tissue
but not the other.
[Figure 3 panel labels: (a) LDL (37 loci), top tissue adult liver; (b) rheumatoid arthritis (31 loci), top tissue Treg primary cells; (c) neuropsychiatric disorders (14 loci), top tissue anterior caudate; (d) T2D (67 loci), top tissues pancreatic islets and adult liver. Axes: enrichment for H3K4me3 peaks (P value), from 1 to 10–5; correlation, from 0 to 1. Tissue groups: hematopoietic; brain; musculoskeletal, endocrine & others; gastrointestinal. Tissues in figure order: CD34+ primary cells; mobilized CD34+ primary cells; CD3+ primary cells; CD19+ primary cells; CD8+ memory primary cells; CD8+ naive primary cells; CD34+ cultured cells; CD4+ naive primary cells; CD4+ memory primary cells; Treg primary cells; mesenchymal stem cells (bone marrow); cingulate gyrus; anterior caudate; substantia nigra; inferior temporal lobe; mid-frontal lobe; hippocampus middle; pancreatic islets; chondrocytes (mesenchymal stem cells); adipose nuclei; adult kidney; mesenchymal stem cells (adipose); muscle satellite cultured cells; skeletal muscle; adipocyte (mesenchymal stem cells); adult liver; mucosa, colon; duodenum smooth muscle; stomach smooth muscle; mucosa, stomach; rectal smooth muscle; mucosa, rectum; mucosa, duodenum; smooth muscle, colon.]
In instances where dense genotyping has been applied to localize
the association signal, we speculated that cell type–specific overlap
might become more apparent. To test this, for the 31 loci associated with
rheumatoid arthritis, we examined recent results from a fine-mapping
study using the Immunochip dense genotyping platform33.
Indeed, when repeating the analysis with the newly defined index
SNPs from each locus using dense genotyping data, we found that
the significance of the enrichment for CD4+ Treg cells increased
(P = 5.1 × 10−5; Supplementary Fig. 10) and that the median specificity
score for each locus increased from 0.13 to 0.16.
Application to psychiatric disorders implicates neuronal tissues
The 14 independent SNPs from neuropsychiatric disorders34,35
mapped within 874 H3K4me3 peaks. Despite the limited power of this
analysis, we were encouraged to see that these SNP associations implicated multiple neuronal tissues, including the anterior caudate nucleus
(P = 0.0076) and the mid-frontal lobe of the brain (P = 0.044) (Fig. 3c);
we also observed a likely spurious association with colonic smooth
muscle (P = 0.026). The role of the frontal lobe in neuropsychiatric
disease in particular has long been appreciated36–38. Although none
of these results reached a conservative level of significance after correcting for multiple-hypothesis testing, we are hopeful that additional
SNP discoveries will help clarify this result further. Of the 14 SNPs
associated with neuropsychiatric disorders, 3 (21%) had a tissue-specific chromatin mark within the anterior caudate, with a relative
specificity of 0.28 or greater (permuted 95th percentile; Fig. 4c).
[Figure 4 plot residue omitted. Panels: (a) adult liver, LDL (37 loci), P = 7.2 × 10−5; (b) Treg cells, rheumatoid arthritis (31 loci), P = 1.3 × 10−4; (c) anterior caudate, neuropsychiatric disorders (14 loci), P = 7.6 × 10−3; (d) pancreatic islets (P = 6.1 × 10−3) against adult liver (P = 7.9 × 10−3), T2D (67 loci). Axis: H3K4me3 cell type specificity score (0 to 1.0). Each panel is compared with 10,000 sets of matched SNPs, with the 95% threshold marked.]
Figure 4 Cell type specificity for four sets of SNPs. (a–d) The distribution of cell type
specificity scores (h/d; Fig. 1b) is shown for SNPs associated with LDL cholesterol
concentration, rheumatoid arthritis, neuropsychiatric disorders and T2D within liver (a),
CD4+ Treg cells (b), anterior caudate nucleus (c) and jointly in pancreatic islets (x axis)
and liver (y axis) (d). Blue points represent cell type specificity scores. Red circles indicate overlapping points, representing SNPs with very similar
scores. We compare these scores to specificity scores in the same tissue of 10,000 sampled sets of matched SNPs from HapMap (yellow density plots).
We plot the median specificity for both the distribution of observed SNPs and the sampled sets of matched SNPs (solid lines). Also, we present the
95th-percentile threshold for the sampled sets of matched SNPs (dashed line), which we use as a specificity cutoff. For each phenotype, about one-fourth
of variants overlap cell type–specific H3K4me3 peaks.
Application to T2D implicates pancreatic islets and liver
In certain instances, it might be plausible that multiple tissues could
be implicated in a disease. When we examined 67 SNPs for type 2 diabetes (T2D)39–50, implicating a total of 2,776 H3K4me3 peaks within
34 different cell types, we observed the most significant enrichment
in pancreatic islets (P = 0.0061) and the liver (P = 0.0079) (Fig. 3d). In
particular, of the 67 SNPs associated with type 2 diabetes, 14 (20.1%)
were either highly specific for chromatin marks within the liver (at
a 0.57 permuted 95th-percentile threshold) or pancreatic islets (at a
0.65 permuted 95th-percentile threshold); these SNPs are in tight LD
with a marker that has a median distance of 46 bp from the summit of
a cell type–specific peak. When we tested the pancreatic islet and liver
tissues together, we found that the combination of liver and pancreatic
islets was even more significant than the tissues individually (P =
2.0 × 10−4; Online Methods) and was more significant than all other
possible tissue pairs. We found that the SNPs driving the overlap in
the two tissues were distinct (Fig. 4d). When we removed the SNPs
most specific for pancreatic islet marks (score > 0.3), we observed that
enrichment in liver was even more apparent (P = 0.0032); similarly,
when we removed the SNPs most specific for overlap with liver marks
(score > 0.3), we observed that the enrichment in pancreatic islets was
also more apparent (P = 0.0026). Both islet cells and the liver have
long been known to have a key role in mediating glucose synthesis,
insulin secretion and diabetes51.
Fine mapping with cell type–specific H3K4me3 peaks
One of the major challenges in understanding complex trait associations is to identify the causal variants and the mechanisms through
which they affect genes. Associated variants can be fine mapped to
variants in tight LD within cell type–specific chromatin marks in the
appropriate cell type. Here, we present examples where cell type–
specific H3K4me3 peaks can potentially be used to localize associated
variants to causal variants.
First, we considered the rs629301 SNP that is associated with
plasma LDL concentration in the region including the SORT1 gene
(Fig. 5a). A liver-specific H3K4me3 peak, not seen as prominently
in other cell types, overlapped with this SNP and three other variants
in tight LD with it. This H3K4me3 peak is located far from the TSS
region and corresponds to a hepatocyte enhancer region7. The closest SNP to the summit of the peak (87 bp away) is the rs12740374
functional variant. This variant is known to alter a CEBPB-binding
site within the enhancer region controlling SORT1.
Another example is provided by the locus for T2D represented by
the rs704184 reported SNP association. rs10814915, in tight LD
with the reported GWAS SNP (r2 = 0.93), scored highly for pancreatic islets but showed no tissue specificity for the liver (Fig. 5b). This
SNP is located only 84 bp away from the summit of a highly pancreatic islet–specific peak. rs10814915 is predicted to be present within
a sequence bound by the glucocorticoid receptor (GR)52, which is
known to have a role in pancreatic islets and glucose regulation. The
SNP resides within an intron of the GLIS3 gene, which is involved in
the development of pancreatic islets.
Finally, we examined the locus for rheumatoid arthritis defined
by a reported association with the rs13119723 SNP in the intron of
a gene with unknown function, KIAA1109. This SNP is in LD with
other variants spanning over 500 kb within this locus, rendering
fine-mapping efforts particularly challenging. We identified a SNP,
rs13140464, in tight LD with rs13119723 (r2 = 0.9) (Fig. 5c), which
maps only 116 bp from the summit of the H3K4me3 peak, which is
highly specific to CD4+ Treg cells with a score of 0.63. This SNP is
located between the IL2 and IL21 genes, 122 kb downstream of IL2
and 34 kb upstream of IL21, and is 280 kb away from the published
SNP. It is tempting to speculate that rs13140464 might act by altering
a highly cell type–specific regulatory sequence controlling IL2 expression, which has a key role in CD4+ Treg maturation53.
[Figure 5 plot residue omitted. Panels: (a) Chr. 1 (p13.3), 109.7–109.9 Mb, genes KIAA1324, SARS, CELSR2, PSRC1, MYBPHL, SORT1 and PSMA5; (b) Chr. 9 (p24.2), 3.9–4.3 Mb, gene GLIS3; (c) Chr. 4 (q27), 123.1–123.5 Mb, genes KIAA1109, ADAD1, IL2 and IL21. Tracks: r2 with the lead SNP (0.2 to 1) and H3K4me3 tag counts for adult liver, Treg primary cells, pancreatic islets and skeletal muscle.]
Figure 5 Selected phenotypically associated loci with high cell type specificity. We present three examples of loci with cell type–specific overlap with
H3K4me3 peaks. Top, genomic coordinates and genes near the associated SNP. Middle, lead SNP (blue diamond) and other nearby SNPs from the
1000 Genomes Project (red dots correspond to those with high r2, blue dots correspond to those with low r2). We also show the SNP that is closest to
the cell type–specific peak (red diamond). Bottom, H3K4me3 sequence tag counts for selected cell types. Colored horizontal lines in the tissue panels
correspond to peak calls. Dashed vertical lines mark the summits of phenotypically cell type–specific peaks. (a–c) Shown are the SORT1 locus for LDL (a),
the GLIS3 locus for T2D (b) and the IL2-IL21 locus for rheumatoid arthritis (c).
DISCUSSION
In this study, we demonstrated that chromatin marks highlighting
active regulatory regions, such as H3K4me3, H3K9ac and DHSs, overlap phenotypically associated variants; furthermore, this overlap is
phenotypically cell type specific. These results strongly support the
hypothesis that many complex disease and trait alleles might act by
influencing gene regulation in a cell type–specific manner. In addition, we quantified the degree to which different marks are cell type
specific in their overlap with phenotypically associated SNPs. These
cell type–specific marks might not only be used to connect phenotypes to specific cell types, but they might also be useful in mapping phenotype-associated SNPs to potential regulatory variants. In
particular, we consistently observed that H3K4me3 marks could be
used to effectively identify specific cell types that are enriched among
specific phenotypes. We note that this statistical approach can be
applied to assess the significance of other chromatin marks or other
cell type–specific gene annotations as they become available.
In the phenotypes that we examined, we found that about one-fourth of associated variants could be connected to a highly cell type–specific
mark within a critical cell type (Fig. 5). In instances where we
do not observe a SNP in tight LD within a highly cell type–specific
H3K4me3 peak, it is possible that a regulatory region that is not cell
type specific might be altered. Alternatively, in some instances the
reported SNP association will need to be further refined with dense
genotyping, or undiscovered variants in tight LD will need to be ascertained through sequencing, before the effect of a cell type–specific
peak can be identified. Finally, for many phenotypes, multiple cell
types could be involved, in which case this approach might have limited efficacy. We demonstrated one example of this type of scenario
in T2D, where we detected effects both in liver and pancreatic islet
cell types.
We acknowledge that our approach is potentially sensitive to the
diversity and number of cell types assayed. For instance, a limited
application to a set of hematopoietic cell types might not be
particularly informative if a set of purely neurological phenotypes is
assayed. We note that our approach depends critically on technical
factors: for instance, the quality of antibody reagents, experimental protocols or other factors that introduce noise
into specific chromatin mark assays could obscure true signals. Our
approach may perform better on the chromatin marks with higher
quality assays.
Once variants and cell types are identified, they will likely be
excellent candidates for cell type–specific functional investigations,
including allelic imbalance assays to define cis-eQTL activity54, cell
type–specific DHS quantitative trait locus (dsQTL) analyses55 and
identification of active transcription factor–binding sites. These cell
type–specific investigations in appropriately chosen cell types might
ultimately help to lead investigators from common disease variation
to causal variants and molecular mechanisms.
URLs. All software is available online at http://www.broadinstitute.org/mpg/epigwas/.
ENCODE, http://genome.ucsc.edu/ENCODE/downloads.html;
NIH Roadmap Epigenomics Mapping Consortium, http://www.roadmapepigenomics.org/;
NHGRI GWAS catalog, http://www.genome.gov/gwastudies/.
METHODS
Methods and any associated references are available in the online
version of the paper.
Note: Supplementary information is available in the online version of the paper.
ACKNOWLEDGMENTS
We thank M. Daly, M. Kellis, D. Diogo, X. Hu, Y. Okada, R. Plenge, S. Ripke,
G. Srivastava, E. Stahl and S. Sunyaev for critical feedback and discussion. G.T. is
supported by the Rubicon grant from The Netherlands Organization for Scientific
Research (NWO). B.E.S. and S.R. are supported by the Harvard University Milton
Fund, and Brigham and Women’s Hospital. S.R. is also supported by funds from
the US NIH (K08AR055688 and U01HG0070033) and the Arthritis Foundation.
X.S.L. is also supported by funds from the US NIH (R01 HG004069). We thank the
ENCODE Project, supported by the NHGRI, and the NIH Roadmap Epigenomics
Mapping Consortium for making data available.
AUTHOR CONTRIBUTIONS
S.R. led the study. G.T., C.S., S.R., B.H. and H.X. performed the analysis. G.T.,
C.S., S.R., B.E.S. and X.S.L. wrote the manuscript. All authors reviewed the final
manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Published online at http://www.nature.com/doifinder/10.1038/ng.2504.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Nicolae, D.L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation
to enhance discovery from GWAS. PLoS Genet. 6, e1000888 (2010).
2. Fraser, H.B. & Xie, X. Common polymorphic transcript variation in human disease.
Genome Res. 19, 567–575 (2009).
3. Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biological
pathways affect human height. Nature 467, 832–838 (2010).
4. Nica, A.C. et al. Candidate causal regulatory effects by integration of expression
QTLs with complex trait genetic associations. PLoS Genet. 6, e1000895 (2010).
5. Fehrmann, R.S. et al. eQTLs reveal that independent genetic variants associated
with a complex phenotype converge on intermediate genes, with a major role for
the HLA. PLoS Genet. 7, e1002197 (2011).
6. Fairfax, B.P. et al. Genetics of gene expression in primary immune cells identifies
cell type–specific master regulators and roles of HLA alleles. Nat. Genet. 44,
502–510 (2012).
7. Musunuru, K. et al. From noncoding variant to phenotype via SORT1 at the 1p13
cholesterol locus. Nature 466, 714–719 (2010).
8. Adrianto, I. et al. Association of a functional variant downstream of TNFAIP3 with
systemic lupus erythematosus. Nat. Genet. 43, 253–258 (2011).
9. Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human
cell types. Nature 473, 43–49 (2011).
10. Creyghton, M.P. et al. Histone H3K27ac separates active from poised enhancers
and predicts developmental state. Proc. Natl. Acad. Sci. USA 107, 21931–21936
(2010).
11. Waki, H. et al. Global mapping of cell type–specific open chromatin by FAIRE-seq
reveals the regulatory role of the NFI family in adipocyte differentiation. PLoS
Genet. 7, e1002311 (2011).
12. Atchison, M.L. Enhancers: mechanisms of action and cell specificity. Annu. Rev.
Cell Biol. 4, 127–153 (1988).
13. Song, L. et al. Open chromatin defined by DNaseI and FAIRE identifies regulatory
elements that shape cell-type identity. Genome Res. 21, 1757–1767 (2011).
14. Boyle, A.P. et al. High-resolution mapping and characterization of open chromatin
across the genome. Cell 132, 311–322 (2008).
15. Encode Project Consortium. A user’s guide to the encyclopedia of DNA elements
(ENCODE). PLoS Biol. 9, e1001046 (2011).
16. Barski, A. et al. High-resolution profiling of histone methylations in the human
genome. Cell 129, 823–837 (2007).
17. Kouzarides, T. Chromatin modifications and their function. Cell 128, 693–705
(2007).
18. Wang, Z. et al. Combinatorial patterns of histone acetylations and methylations in
the human genome. Nat. Genet. 40, 897–903 (2008).
19. Thousand Genomes Project. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
20. Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide
association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106,
9362–9367 (2009).
21. Bernstein, B.E. et al. Genomic maps and comparative analysis of histone
modifications in human and mouse. Cell 120, 169–181 (2005).
22. Bernstein, B.E. et al. The NIH Roadmap Epigenomics Mapping Consortium.
Nat. Biotechnol. 28, 1045–1048 (2010).
23. Pekowska, A. et al. H3K4 tri-methylation provides an epigenetic signature of active
enhancers. EMBO J. 30, 4198–4210 (2011).
24. Jia, L. et al. Functional enhancers at the gene-poor 8q24 cancer-linked locus. PLoS
Genet. 5, e1000597 (2009).
25. Teslovich, T.M. et al. Biological, clinical and population relevance of 95 loci for
blood lipids. Nature 466, 707–713 (2010).
26. Smith, L.C., Pownall, H.J. & Gotto, A.M. Jr. The plasma lipoproteins: structure and
metabolism. Annu. Rev. Biochem. 47, 751–757 (1978).
27. Hobbs, H.H., Brown, M.S. & Goldstein, J.L. Molecular genetics of the LDL receptor
gene in familial hypercholesterolemia. Hum. Mutat. 1, 445–466 (1992).
28. Firestein, G.S. Evolving concepts of rheumatoid arthritis. Nature 423, 356–361
(2003).
29. Lee, D.M. et al. Mast cells: a cellular link between autoantibodies and inflammatory
arthritis. Science 297, 1689–1692 (2002).
30. Boilard, E. et al. Platelets amplify inflammation in arthritis via collagen-dependent
microparticle production. Science 327, 580–583 (2010).
31. Stahl, E.A. et al. Genome-wide association study meta-analysis identifies seven new
rheumatoid arthritis risk loci. Nat. Genet. 42, 508–514 (2010).
32. Akbar, A.N., Vukmanovic-Stejic, M., Taams, L.S. & Macallan, D.C. The dynamic
co-evolution of memory and regulatory CD4+ T cells in the periphery. Nat. Rev.
Immunol. 7, 231–237 (2007).
33. Eyre, S. et al. High-density genetic mapping identifies new susceptibility loci for
rheumatoid arthritis. Nat. Genet. 44, 1336–1340 (2012).
34. Psychiatric GWAS Consortium Bipolar Disorder Working Group. Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus
near ODZ4. Nat. Genet. 43, 977–983 (2011).
35. Schizophrenia Genome-Wide Association Study (GWAS) Consortium. Genome-wide
association study identifies five new schizophrenia loci. Nat. Genet. 43, 969–976
(2011).
36. Goldman-Rakic, P.S. & Selemon, L.D. Functional and anatomical aspects of
prefrontal pathology in schizophrenia. Schizophr. Bull. 23, 437–458 (1997).
37. Goldstein, J.M. et al. Cortical abnormalities in schizophrenia identified by structural
magnetic resonance imaging. Arch. Gen. Psychiatry 56, 537–547 (1999).
38. Strakowski, S.M., Delbello, M.P. & Adler, C.M. The functional neuroanatomy of
bipolar disorder: a review of neuroimaging findings. Mol. Psychiatry 10, 105–116
(2005).
39. Morris, A.P. et al. Large-scale association analysis provides insights into the genetic
architecture and pathophysiology of type 2 diabetes. Nat. Genet. 44, 981–990
(2012).
40. Cho, Y.S. et al. Meta-analysis of genome-wide association studies identifies
eight new loci for type 2 diabetes in east Asians. Nat. Genet. 44, 67–72
(2012).
41. Dupuis, J. et al. New genetic loci implicated in fasting glucose homeostasis and
their impact on type 2 diabetes risk. Nat. Genet. 42, 105–116 (2010).
42. Kong, A. et al. Parental origin of sequence variants associated with complex
diseases. Nature 462, 868–874 (2009).
43. Kooner, J.S. et al. Genome-wide association study in individuals of South Asian
ancestry identifies six new type 2 diabetes susceptibility loci. Nat. Genet. 43,
984–989 (2011).
44. Perry, J.R. et al. Stratifying type 2 diabetes cases by BMI identifies genetic risk
variants in LAMA1 and enrichment for risk variants in lean compared to obese
cases. PLoS Genet. 8, e1002741 (2012).
45. Qi, L. et al. Genetic variants at 2q24 are associated with susceptibility to type 2
diabetes. Hum. Mol. Genet. 19, 2706–2715 (2010).
46. Saxena, R. et al. Large-scale gene-centric meta-analysis across 39 studies identifies
type 2 diabetes loci. Am. J. Hum. Genet. 90, 410–425 (2012).
47. Shu, X.O. et al. Identification of new genetic risk variants for type 2 diabetes. PLoS
Genet. 6, e1001127 (2010).
48. Tsai, F.J. et al. A genome-wide association study identifies susceptibility variants
for type 2 diabetes in Han Chinese. PLoS Genet. 6, e1000847 (2010).
49. Voight, B.F. et al. Twelve type 2 diabetes susceptibility loci identified through
large-scale association analysis. Nat. Genet. 42, 579–589 (2010).
50. Yamauchi, T. et al. A genome-wide association study in the Japanese population
identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B.
Nat. Genet. 42, 864–868 (2010).
51. Seino, S., Shibasaki, T. & Minami, K. Dynamics of insulin secretion and the
clinical implications for obesity and diabetes. J. Clin. Invest. 121, 2118–2125
(2011).
52. Ward, L.D. & Kellis, M. HaploReg: a resource for exploring chromatin states,
conservation, and regulatory motif alterations within sets of genetically linked
variants. Nucleic Acids Res. 40, D930–D934 (2012).
53. Setoguchi, R., Hori, S., Takahashi, T. & Sakaguchi, S. Homeostatic maintenance
of natural Foxp3+ CD25+ CD4+ regulatory T cells by interleukin (IL)-2 and induction
of autoimmune disease by IL-2 neutralization. J. Exp. Med. 201, 723–735
(2005).
54. McCarroll, S.A. et al. Deletion polymorphism upstream of IRGM associated with altered
IRGM expression and Crohn's disease. Nat. Genet. 40, 1107–1112 (2008).
55. Degner, J.F. et al. DNase I sensitivity QTLs are a major determinant of human
expression variation. Nature 482, 390–394 (2012).
ONLINE METHODS
Chromatin mark data. We obtained two publicly available data sets for chromatin mark assays on different sets of tissues. We use the term chromatin
mark broadly to include histone modifications and DHSs, as well as common
epigenetic features, such as CTCF-binding sites.
First, we used data from the ENCODE Project, which included sequence
reads from ChIP-seq assays and controls in up to 14 different cell types from
a diverse set of 15 chromatin marks: CTCF-binding sites, the variant H2A
histone (H2A.Z), H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2,
H3K4me3, H3K79me2, H3K9ac, H3K9me1, H3K9me3, H4K20me1,
Pol2b-binding sites and DHSs15 (Supplementary Tables 1 and 2). We separately obtained hidden chromatin state annotations for 8 of the 14 cell types
defined by clustering chromatin marks9.
Second, we used data from the NIH Roadmap Epigenomics Mapping
Consortium that assayed only six chromatin marks on a large number of
cell types22. This data set included sequence reads from ChIP-seq assays and
controls for 6 histone modifications—H3K27me3, H3K4me3, H3K36me3,
H3K9ac, H3K4me1 and H3K9me3—assayed in 38 adult and fetal tissues
(Supplementary Table 2).
For both of these data sets, we downloaded data comprising hg19-mapped
sequence reads. In instances where there were multiple replicates of a given
ChIP-seq assay for the same tissue, we aggregated sequence reads for the individual assays. We also obtained mapped reads from control data comprising
sequenced genomic DNA. We ran MACS (v1.4) software to identify significant peaks (P < 1 × 10−5), specific locations within the genome with enrichment of tag sequences, setting the bandwidth parameter to 300 bp56. For each
chromatin mark, we located its summit, which represents the position with
the highest pileup of sequence tags.
Processing chromatin mark data. Once we identified peaks, we used MACS
to determine the fold enrichment of tags compared to controls, using the
equation
$$f = \frac{\lambda_{\mathrm{peak}}}{\lambda_{\mathrm{local}}} \qquad (1)$$
where $\lambda_{\mathrm{peak}}$ and $\lambda_{\mathrm{local}}$ are parameters for a Poisson distribution determined
by fitting the local sequence tag distributions in the peak region from ChIP-seq data and control data, respectively. We considered f as the height of the peak
instead of the raw number of tags, as this approach leverages control data to
account for local biases in the genome (due to sequencing bias, mapping bias,
chromatin structure and genome copy-number variations) and improves the
robustness and specificity of the estimation.
We then corrected for global variation in multiple experiments for the same
chromatin mark in different cell types, using the equation
$$h_{i,j,\mathrm{norm}} = f_{i,j} \times \frac{\max_{i' \in \text{cell types}} \left\{ \sum_{j} f_{i',j} \right\}}{\sum_{j} f_{i,j}} \qquad (2)$$
where $f_{i,j}$ corresponds to fold enrichment for the peak j in the cell type i before
normalization, and $h_{i,j,\mathrm{norm}}$ is the fold enrichment after normalization (or the
height of the peak).
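As an illustration of Equation 2, the following minimal Python sketch rescales fold enrichments so that every cell type has the same total as the cell type with the largest total. It assumes that reading of the equation; the function name `normalize_peak_heights` and the toy values are hypothetical and are not taken from the released software.

```python
def normalize_peak_heights(fold):
    """Rescale fold enrichments per cell type (one reading of Equation 2).

    fold: dict mapping cell type -> list of fold enrichments f_{i,j},
          one entry per peak called in that cell type.
    Returns a dict with the same shape holding h_{i,j,norm}.
    """
    totals = {cell: sum(values) for cell, values in fold.items()}
    max_total = max(totals.values())  # largest summed enrichment over cell types
    return {
        cell: [f * max_total / totals[cell] for f in values]
        for cell, values in fold.items()
    }


# Toy input: the islet experiment has half the overall signal of the liver one.
toy_fold = {"liver": [2.0, 6.0], "islets": [1.0, 3.0]}
toy_heights = normalize_peak_heights(toy_fold)
```

After normalization both toy cell types sum to the same total, so a globally weaker experiment is not penalized when peak heights are compared across tissues.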
Phenotypes and associated SNPs. To estimate the phenotypic cell type specificity of each chromatin mark, we identified a comprehensive set of independent
SNPs associated with unique phenotypes. We used data from a catalog summarizing results from recent GWAS20 (downloaded January 2012). We selected
only the phenotype-associated SNPs with highly statistically significant associations (P < 5 × 10−8). To ensure the applicability of the 1000 Genomes Project
resource, we used only those SNPs associated in populations of European
descent. To limit the analysis to phenotypes that have an adequate number
of SNP associations, we selected only phenotypes with at least 15 such SNP
associations. To ensure the independence of the associated SNPs, we removed
SNPs with r2 > 0.1 and those that were <100 kb from a more strongly associated
variant in the genome. To preserve a priori specific phenotypes for independent testing, we removed SNPs associated with rheumatoid arthritis, BMI and
LDL plasma cholesterol concentration as well as height. For variants associated
with multiple phenotypes, we selected a single phenotype association and
discarded others; we selected the SNP associated with the phenotype with the
fewest SNPs. Our final data set consisted of 510 risk variants associated with
31 diseases or traits.
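The significance and independence filters described above can be sketched as a greedy pass over SNPs ordered by P value. This is only one plausible reading of the text: a SNP is dropped if it lies within 100 kb of, or has r2 > 0.1 with, a more strongly associated SNP that was already kept. The function name, the record layout and the `r2` lookup callable are hypothetical.

```python
def select_independent_snps(snps, r2, p_max=5e-8, r2_max=0.1, min_dist=100_000):
    """Keep genome-wide significant SNPs that are independent of stronger ones.

    snps: list of dicts with keys "id", "chrom", "pos" and "p".
    r2:   callable (id_a, id_b) -> LD r2 between two SNPs (hypothetical lookup).
    """
    # Strongest associations first, so each SNP is compared only to stronger ones.
    significant = sorted((s for s in snps if s["p"] < p_max), key=lambda s: s["p"])
    kept = []
    for snp in significant:
        clashes = any(
            other["chrom"] == snp["chrom"]
            and (
                abs(snp["pos"] - other["pos"]) < min_dist
                or r2(snp["id"], other["id"]) > r2_max
            )
            for other in kept
        )
        if not clashes:
            kept.append(snp)
    return kept


# Toy LD table: only rsA and rsC are in LD.
toy_ld = {frozenset(("rsA", "rsC")): 0.5}


def toy_r2(a, b):
    return toy_ld.get(frozenset((a, b)), 0.0)


toy_snps = [
    {"id": "rsA", "chrom": "1", "pos": 1_000, "p": 1e-10},
    {"id": "rsB", "chrom": "1", "pos": 50_000, "p": 1e-9},   # within 100 kb of rsA
    {"id": "rsC", "chrom": "1", "pos": 500_000, "p": 2e-9},  # r2 = 0.5 with rsA
    {"id": "rsD", "chrom": "2", "pos": 1_000, "p": 1e-8},
    {"id": "rsE", "chrom": "1", "pos": 900_000, "p": 1e-7},  # not significant
]
toy_selected = [s["id"] for s in select_independent_snps(toy_snps, toy_r2)]
```

The sketch does not cover the later steps in this paragraph (restriction to European-descent studies, the 15-SNP minimum per phenotype, or resolving SNPs shared between phenotypes).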
To test our approach, we also separately identified in the literature 37 SNPs
associated with LDL plasma concentration25, 31 SNPs associated with rheumatoid arthritis risk31, 67 SNPs associated with T2D risk39–50 and 14 risk loci
for neuropsychiatric disorders34,35.
Evaluating marks for their phenotypically cell type–specific overlap with
variants. Step 1. Identifying variants in LD with associated SNPs. We recognized that the observed phenotype association of a given variant might be the
consequence of other variants tightly linked to the associated variant (Fig. 1a).
We therefore comprehensively ascertained variants from the 1000 Genomes
Project to identify all variants (SNPs and indels) in LD19 (r2 > 0.8) on the basis
of haplotypes reconstructed with Beagle from the subset of 379 individuals
of European descent.
Step 2. Scoring regulatory activity near a risk SNP. Next, we examined
chromatin marks in the different cell types located near associated SNPs
(Fig. 1b). We assumed that the closer an associated SNP (or variant in LD)
was to a tall peak, the greater the chance that it might influence a regulatory
element highlighted by that peak. We scored each associated SNP k within
each cell type by identifying a SNP k′ (or indel) in tight LD that was closest
to a chromatin mark peak j in tissue i. We then assigned a score si,k equal to
the height of peak j in the tissue i, hi,j,norm (referred to as h in the main text),
divided by the distance d between the SNP k′ and the summit of the peak j.
If there was no peak within 2.5 kb of any SNP in LD with SNP k, then si,k was
set to zero.
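The scoring rule in Step 2 can be illustrated with a short Python sketch. It assumes that both the choice of the closest variant and the 2.5-kb cutoff are measured to the peak summit, and it clamps the distance to 1 bp so that a variant sitting exactly on a summit does not divide by zero; both choices are assumptions of the sketch, and the function name is hypothetical.

```python
def score_snp(proxy_positions, peaks, window=2500):
    """Return the score s_{i,k} (h/d) for one associated SNP in one cell type.

    proxy_positions: base-pair positions of the SNP and of its LD proxies,
                     all on the same chromosome.
    peaks: list of (summit_position, normalized_height) for that cell type.
    """
    best = None  # (distance, height) of the closest variant-summit pair so far
    for pos in proxy_positions:
        for summit, height in peaks:
            dist = abs(pos - summit)
            if dist <= window and (best is None or dist < best[0]):
                best = (dist, height)
    if best is None:
        return 0.0  # no peak within the window of any variant in LD
    dist, height = best
    return height / max(dist, 1)  # assumption: clamp to 1 bp to avoid dividing by zero


# Toy locus: the proxy at 5,000 bp lies 50 bp from a summit of height 10.
toy_score = score_snp([1000, 5000], [(1100, 20.0), (5050, 10.0), (9000, 99.0)])
```

In the toy locus the taller peak 100 bp from the first variant is ignored because a second variant is closer to a summit, which follows the "closest variant" wording of the text rather than maximizing h/d.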
Step 3. Normalization to obtain a cell type specificity score. For each associated SNP k and chromatin mark, we obtained a vector of scores for multiple
cell types i. To compare the cell type specificity score across risk variants and
phenotypes, we applied Euclidean normalization in the following equation:
$$sn_{i,k} = \frac{s_{i,k}}{\sqrt{\sum_{i} s_{i,k}^{2}}} \qquad (3)$$
This ensured that sni,k emphasized cell type specificity instead of the magnitude of the signal. For associated risk variants not near any peak, where si,k is
zero for all i, we replaced values with the average of values of other associated
SNPs with at least one nonzero si,k value over all cell types.
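A minimal Python sketch of Step 3 follows. It divides each SNP's score vector by its Euclidean length and then fills SNPs with all-zero scores using the per-cell-type average of the normalized vectors of the remaining SNPs. Whether the average is taken over normalized or raw scores, and over which set of SNPs, is not stated explicitly, so that choice is an assumption; the function name is hypothetical.

```python
import math


def normalize_scores(scores):
    """Euclidean-normalize per-SNP score vectors across cell types.

    scores: dict mapping SNP id -> list of s_{i,k}, one value per cell type
            (same cell type order for every SNP).
    """
    normalized = {}
    for snp, values in scores.items():
        length = math.sqrt(sum(v * v for v in values))
        if length > 0:
            normalized[snp] = [v / length for v in values]

    # Average normalized vector of SNPs that had at least one nonzero score.
    if normalized:
        vectors = list(normalized.values())
        fallback = [sum(column) / len(vectors) for column in zip(*vectors)]
    else:
        fallback = None

    for snp, values in scores.items():
        if snp not in normalized:
            if fallback is not None:
                normalized[snp] = list(fallback)
            else:
                normalized[snp] = [0.0] * len(values)  # nothing to average over
    return normalized


toy_normalized = normalize_scores(
    {"rs1": [3.0, 4.0], "rs2": [0.0, 2.0], "rs3": [0.0, 0.0]}
)
```

In the toy input, rs1 becomes (0.6, 0.8), rs2 becomes (0, 1), and rs3, which is near no peak, receives their average.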
Step 4. Estimating the phenotypic cell type specificity of a chromatin mark.
If a chromatin mark is informative for phenotypic cell type specificity, then
the deviance of chromatin mark overlap for associated SNPs (sni,k) should be
minimal for a given phenotype and tissue. If a chromatin mark is not informative, then the deviance of chromatin mark overlap for associated SNPs will be
high for a phenotype and tissue.
Therefore, we defined a deviance-based metric of phenotypic cell type specificity for a mark, which was the aggregate sum of the squared differences
between sni,k values and mean values for fixed phenotypes p and cell types i,
$$d = \sum_{i \in \text{cell types}} \; \sum_{p \in \text{phenotypes}} \left[ \sum_{k \in p} \left( \mathrm{mean}_{i,p}(sn) - sn_{i,k} \right)^{2} \right] \qquad (4)$$
where meani,p(sn) is the mean of the normalized cell specificity scores in the
cell type i for SNPs associated with phenotype p. If a mark is informative,
then sn scores are dependent on the phenotype and cell type, and this sum of
squares should be relatively small.
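The deviance in Equation 4 is a within-phenotype sum of squares, which the following Python sketch computes directly. The function name and the toy phenotype labels are hypothetical.

```python
def deviance(sn, phenotype_of):
    """Within-phenotype sum of squares of normalized scores (Equation 4).

    sn:           dict mapping SNP id -> list of sn_{i,k}, one per cell type.
    phenotype_of: dict mapping SNP id -> phenotype label.
    """
    groups = {}
    for snp, vector in sn.items():
        groups.setdefault(phenotype_of[snp], []).append(vector)

    total = 0.0
    for vectors in groups.values():
        # zip(*vectors) yields, per cell type, the scores of this phenotype's SNPs.
        for column in zip(*vectors):
            mean = sum(column) / len(column)
            total += sum((mean - value) ** 2 for value in column)
    return total


toy_sn = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0], "d": [0.0, 1.0]}
d_separated = deviance(toy_sn, {"a": "LDL", "b": "LDL", "c": "RA", "d": "RA"})
d_mixed = deviance(toy_sn, {"a": "LDL", "c": "LDL", "b": "RA", "d": "RA"})
```

When SNPs of the same phenotype share a cell type profile, the deviance is zero; mixing the profiles across phenotypes raises it, which is the behavior the metric relies on.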
Step 5. Evaluating the statistical significance of phenotypic cell type specificity.
To evaluate the statistical significance of phenotypic cell type specificity for
particular marks, we conducted up to 1 million permutations reassigning SNPs
to phenotypes randomly. This ensures that the properties of associated SNPs
in the analysis are maintained, only disrupting their phenotypic associations.
We recalculated d after each permutation. To compute P values, we calculated
the proportion of d scores from permutations (these correspond to the null
hypothesis) that were smaller than or equal to the observed d score, because an informative mark yields a small d (Step 4).
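The permutation scheme of Step 5 might be sketched as follows. The sketch shuffles phenotype labels among SNPs, recomputes the deviance, and counts permutations whose deviance is no larger than the observed one. That direction follows the statement in Step 4 that an informative mark gives a small sum of squares and should be treated as an assumption of the sketch. The function names, the default of 1,000 permutations and the fixed seed are illustrative only.

```python
import random


def within_group_ss(sn, labels):
    """Deviance d: squared distance of each SNP's scores from its phenotype mean."""
    groups = {}
    for snp, vector in sn.items():
        groups.setdefault(labels[snp], []).append(vector)
    total = 0.0
    for vectors in groups.values():
        for column in zip(*vectors):
            mean = sum(column) / len(column)
            total += sum((mean - value) ** 2 for value in column)
    return total


def permutation_p_value(sn, labels, n_perm=1000, seed=0):
    """Empirical P value from shuffling phenotype labels among SNPs."""
    rng = random.Random(seed)
    snps = sorted(sn)
    observed = within_group_ss(sn, labels)
    shuffled = [labels[snp] for snp in snps]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps the number of SNPs per phenotype fixed
        if within_group_ss(sn, dict(zip(snps, shuffled))) <= observed:
            hits += 1
    return hits / n_perm


# Toy data: four SNPs specific to one cell type, four to another.
toy_sn = {f"rs{n}": ([1.0, 0.0] if n < 4 else [0.0, 1.0]) for n in range(8)}
toy_labels = {f"rs{n}": ("LDL" if n < 4 else "RA") for n in range(8)}
toy_p = permutation_p_value(toy_sn, toy_labels, n_perm=2000, seed=1)
```

Because only the labels are shuffled, every permutation uses the same SNPs and the same scores, so properties such as local peak density are preserved under the null.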
Using overlap with chromatin marks to identify the critical cell type(s)
for a specific phenotype. After identifying SNPs associated with a selected
phenotype, we compute a cell type specificity score ci,p for a phenotype p by
summing the normalized sni,k scores for a cell type i and associated SNPs k in
the following equation:
$$c_{i,p} = \sum_{k \in p} sn_{i,k} \qquad (5)$$
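Equation 5 is a per-cell-type sum over the SNPs of one phenotype, as in this short Python sketch (the function name and toy scores are hypothetical):

```python
def cell_type_scores(sn, phenotype_snps):
    """Sum normalized scores over a phenotype's SNPs, per cell type (Equation 5).

    sn:             dict mapping SNP id -> list of sn_{i,k}, one per cell type.
    phenotype_snps: ids of the SNPs associated with the phenotype p.
    Returns the list of c_{i,p}, in the same cell type order as the input vectors.
    """
    vectors = [sn[snp] for snp in phenotype_snps]
    return [sum(column) for column in zip(*vectors)]


toy_sn = {"rs1": [0.6, 0.8], "rs2": [0.0, 1.0], "rs3": [1.0, 0.0]}
toy_c = cell_type_scores(toy_sn, ["rs1", "rs2"])
```

The cell type with the largest resulting value is the candidate critical cell type, subject to the significance test against matched SNP sets.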
Using overlap with marks to identify pairs of critical cell types for a specific
phenotype. To test possible pairs of n cell types for association, we constructed
(n – 1) × n/2 artificial ChIP-seq profiles, one for each tissue pair. Each artificial
profile consisted of all of the peaks defined in both tissues, where the peak
heights were reduced to half of their original heights. We then tested for association with cell type pairs in the same way as for single cell types, except that
we replaced individual cell type scores with scores for cell type pairs.
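The construction of paired profiles can be sketched in Python as below. The sketch reads "all of the peaks defined in both tissues" as the pooled peaks of the two tissues, which is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def pair_profiles(profiles):
    """Build one artificial profile per unordered pair of cell types.

    profiles: dict mapping cell type -> list of (summit_position, height).
    Each paired profile pools the peaks of both cell types at half height.
    """
    return {
        (a, b): [(summit, height / 2) for summit, height in profiles[a] + profiles[b]]
        for a, b in combinations(sorted(profiles), 2)
    }


toy_profiles = {
    "liver": [(100, 8.0)],
    "islets": [(250, 4.0)],
    "Treg": [(900, 2.0)],
}
toy_pairs = pair_profiles(toy_profiles)
```

For n cell types this yields (n – 1) × n/2 profiles, so three toy cell types give three pairs.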
56. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137
(2008).
To evaluate the statistical significance of cell type specificity scores ci,p, we
defined matched sets of SNPs not associated with phenotype p and used them
to calculate cell type specificity scores. Statistical significance was calculated
as the proportion of SNP sets with cell type specificity scores exceeding the
observed scores for actual phenotypic SNPs.
To define the matched SNP sets, we required that the sampled SNPs had
the same total number of chromatin mark peaks in the region in LD across all
cell types as associated SNPs. This ensures that randomly selected SNPs have
similar nearby regulatory activity. For the primary analysis, we drew random
SNPs from 45,950 independent HapMap SNPs that were clustered to ensure
minimal independence3. In a secondary analysis, we drew SNPs from phenotypically associated SNPs from the NIH GWAS catalog20.
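The matched-set significance test can be illustrated with the following Python sketch for a single cell type. It assumes matching is done per SNP on the total number of nearby peaks, that every associated SNP has at least one background SNP with the same peak count, and that the P value is the fraction of sampled sets whose summed score strictly exceeds the observed sum. The function name and data layout are hypothetical.

```python
import random
from collections import defaultdict


def matched_set_p_value(associated, background, n_sets=10_000, seed=0):
    """Empirical P value for one cell type against matched random SNP sets.

    associated: list of (peak_count, normalized_score) for the phenotype's SNPs.
    background: list of (peak_count, normalized_score) for candidate null SNPs.
    Returns (observed_score, p_value).
    """
    rng = random.Random(seed)
    by_count = defaultdict(list)
    for peak_count, score in background:
        by_count[peak_count].append(score)

    observed = sum(score for _, score in associated)
    exceed = 0
    for _ in range(n_sets):
        # Draw one background SNP with the same peak count for each associated SNP.
        sampled = sum(rng.choice(by_count[count]) for count, _ in associated)
        if sampled > observed:
            exceed += 1
    return observed, exceed / n_sets


toy_associated = [(2, 0.9), (5, 0.8)]
toy_background = [(2, 0.1), (2, 0.2), (5, 0.1), (5, 0.3)]
toy_observed, toy_p = matched_set_p_value(toy_associated, toy_background, n_sets=1000)
```

Matching on peak count keeps the sampled SNPs in regions with comparable regulatory activity, so a low P value reflects cell type specificity and not simply proximity to many peaks.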