Model-based inference Inference of population structure
Transcription
Model-based inference Inference of population structure
Inference of ancestry from multilocus genotype data olivier.francois@imag.fr March 2014 Outline I Statistical methods for estimating, visualizing and interpreting population genetic structure I PCA and STRUCTURE I Geographically explicit approaches to the inference of individual admixture proportions I How many ancestral populations to a sample? Population genetic structure I I Many organisms form genetically dierentiated subpopulations (herds, colonies, schools, prides, packs). Importance of geographic scales (regional, local) Population genetic structure I Natural populations are not random mating populations. I There is a geographic range within which individuals are more closely related to one another than to those far apart. I The inference of population structure is inuenced by the demographic history of a species, past events of population ssion and fusion, migrations, etc. I A clear understanding of population structure is useful for detecting genes under selection. Regional and local scales I At the regional scale, clusters and clines are the consequences of demographic processes such as colonization, admixture, fragmentation, reproductive isolation, or selective processes like local adaptation. I At the local scale, restricted dispersal creates local patches of identical genotypes, spatial autocorrelation of allele frequencies, and long-range isolation by distance patterns Genetic clusters and clines Genetic clusters and clines I Clines: Large-scale spatial trends in allele frequencies or genetic diversity Patterns of Isolation by distance Clusters and clines are not mutually exclusive patterns Hewitt 2000 Back to population genetics: Hardy-Weinberg Equilibrium I Allele and genotype frequencies in a population remain constant from generation to generation. I It assumes an innitely large population size, random mating, no mutation, no migration, and selective neutrality. I The genotype frequencies are deduced from the allele frequencies. 0 (q ) 1 (p ) 0 (q ) q2 pq 1 (p ) pq p2 Genotype frequencies at a bi-allelic locus (Single Nucleotide Polymorphism). Heterozygosity H = 2pq. Linkage Disequilibrium (LD) I I I Non random association of alleles at two or more loci Considering two bi-allelic loci, A and B , LD can be measured by D = pAB − pA pB = cov(A, B ) In the absence of evolutionary forces other than random mating, the linkage disequilibrium measure D converges to zero at a rate equal to the recombination rate between the two loci. Population structure creates LD at unlinked loci I I I I Suppose our sample contains two populations in equal proportions In population 1, we have pA1 = 1 and pB1 = 0 (D = 0) In population 2, we have pA2 = 0 and pB2 = 1 (D = 0) In the sample, pA = 1/2 and pB = 1/2 Thus, because pAB = 0, the linkage disequilibrium measure is non zero, and maximal in absolute value |D | = 1/4 . Population structure creates Hardy-Weinberg disequilibrium I I I Suppose our population contains two subpopulations in equal proportions, each in HW equilibrium Consider a bi-allelic (0/1) locus and let p1 and p2 denote the allele frequencies in subpopulations 1 and 2 (frequencies of 1). In the total population, we have p= Thus, H 6= 2pq . p1 + p2 2 and H = p1 q1 + p2 q2 Bayesian clustering algorithms I I I Assume K unknown subpopulations, n individuals genotyped at L loci. No-admixture (ssion) model: For each individual i and each cluster k , compute the probability that individual i originates in cluster k . Admixture (ssion and fusion) model: For each individual i and each cluster k , compute the fraction of genome of individual i that originates in cluster k . Genetic structure of human populations I Each individual is represented by a segment of length 1. I Each cluster is represented by a color. Building a likelihood I I I I Clusters minimize HW and LD disequilibria Data: (yi ` )i ≤n,`≤L is a matrix of 0 and 1, where each individual i is coded with 2 rows. Clusters: (zi )i ≤n , zi ∈ {1, . . . , K } Allele frequencies: Given zi = k , Pr(yi ` = 1 | zi = k , p) = p`,k and Pr(y | z , p) = n Y L Y 2 Y y i =1 `=1 j =1 p`,z (1 − p`,z )1−y j i i i j i Prior distributions I Independent allele frequencies are sampled at each locus ` and in each cluster k p`k ∼ beta(λ1 , λ2 ) (default value λi = 1). I Uniform distribution on individual cluster labels Pr(zi = k ) = K1 , k = 1, . . . , K . MCMC algorithm I Gibbs sampler: Start from an initial conguration of allele frequencies and cluster labels. Then repeat the following steps I Step 1 Estimate allele frequencies given the clusters p t ∼ Pr(p | y ; z t −1 ) I Step 2 Estimate clusters given the allele frequencies z t ∼ Pr(z | y ; p t ) I Special case of a Metropolis-Hastings algorithm (without rejection). Statistical mixture model I Known for long in statistics as the latent class model (Lazarsfeld-Henry 1968, Goodman 1974) I Bayesian implementation popularized by the software structure (Pritchard et al 2000) I Important innovation in structure: it includes admixture models (next slides). Admixture models I Genetic admixture is the process by which a hybrid population is formed from contributions by two or more parental (or ancestral) populations I In an admixed population, individual genomes are themselves (to a greater or a lesser extent) admixed. I Bayesian clustering methods are capable of calculating individual admixture proportions where the ancestral populations are not imposed by the sampling process. Divergence vs Admixture of populations Admixture of two ancestral populations What change to the model? I Introduction of additional parameters: Q -matrix (n × K dimensions) qi ,k = proportion of I individual i 0 s genome from population k One cluster for each allele copy zi ,` = population of and origin of allele copy yi ,` Pr(zi ,` = k | p, q ) = qi ,k where qi ,. ∼ D(α1 , . . . , αK ) Dirichlet distribution. How can we account for geography? I Gradients in gene frequencies are created by the contact of two or more populations. I Their shape is sigmoidal (Barton and Hewitt 1986) Modeling the cline Extension of the algorithm: including spatial information I Population genetic structure is spatially structured. Let xi be the spatial coordinates of individual i . We assume that log αi . = f (xi )T β. + i . where the prior distribution on β is non-informative and is a spatially autocorrelated Gaussian noise. I tess computer program (Durand et al 2009). Geographic admixture of 2 parental populations (simple model) Results of structure and tess I FST (of the pooled parental populations) is a measure that quanties the departure from HWE in the ancestral population Choosing K: The Deviance Information Criterion I Let θ̂ be a point estimate and dene the deviance as D (θ, y ) = −2 log p (y |θ) I The predicted (or expected) deviance is D̄ = Eθ|y [D (θ, y )] and the eective number of parameters pD = D̄ − D (θ̂, y ). I Dene DIC = D̄ + pD Choosing K DIC curves Information criterion Model 1 Model 2 K2 1 2 3 K1 4 Number of cluster 5 6 SWITCH TO POWER POINT A realistic scenario for a contact zone (Non equilibrium stepping-stone simulation) 65 Latitude 60 55 50 45 40 35 −10 0 10 20 Longitude 30 SWITCH TO POWER POINT Results Results of inference 65 Latitude 60 55 50 45 40 35 −10 0 10 20 Longitude 30 Application to Fundulus heteroclitus Application to Fundulus heteroclitus Application to Gypsophila repens Conclusion messages I Bayesian algorithms detect population structure and individuals admixture levels without a need to predene ancestral populations I Admixture models should be preferred to models without admixture unless we have evidence of pure subpopulation divergence I The choice of the number of cluster is a highly dicult problem but heuristic criteria seem to perform well in many cases Bibliography and resources I Hartl DL, Clark AG (2007) Principles of Population Genetics, Fourth Edition. Sinauer I Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945-959. I Durand E, Jay F, Gaggiotti OE, François O (2009) Spatial inference of admixture proportions and secondary contact zones. Molecular Biology and Evolution 26:1963-1973.