Information content
2013-07-31 Lecture 5: Information Content
Jurg Ott
http://lab.rockefeller.edu/ott/
http://www.jurgott.org/PekingU/

Curvature
• Curves of log(likelihood) with maxima at θ = 0.10
• At the maximum, the curve with n = 10 is flatter than that with n = 20

Measures of “Informativeness”
• Fisher information (Fisher RA (1925) Theory of statistical estimation. Proc Camb Phil Soc 22:700-725)
  o I(θ) measures the precision with which the parameter θ can be estimated.
  o Based on curvature (2nd derivative) of the loge[L(θ)] curve.
  o S(θ) = loge[L(θ)] is also called the support for a parameter or hypothesis.
  o 1/I(θ) = variance of estimate
  o Binomial situation: I(θ) = n/[θ(1 – θ)]
• “Information” = technical statistical term

“Informativeness” in Linkage Analysis
• Less interested in precision of θ estimate than in amount of information for detecting linkage
• Suitable quantity = expected lod score, ELOD
• Consider n = 3 offspring of phase-known mating; k = 0…3 recombinants
• Lod score: $Z(\theta) = \log\frac{\theta^{k}(1-\theta)^{n-k}}{(1/2)^{n}} = n\log(2) + k\log(\theta) + (n-k)\log(1-\theta)$
• Distinguish r = true value of recombination fraction (only known to statistician) and θ = formal parameter in expression for likelihood

Expected Lod Score, ELOD
• Each of k = 0…3 furnishes a lod score curve.
• At given θ, the weighted average over these curves leads to the expected (average) lod score curve, weights = prob. of occurrence (depends on r).
• Often only its max. is called the ELOD
• Max. occurs at true value of r (Rao CR [1973] Linear statistical inference and its applications.
New York: Wiley)

Computing ELOD
• Probability of k recombinants (usually k unknown): $P(k; r) = \binom{n}{k} r^{k}(1-r)^{n-k}$,
• so that for the ELOD curve: $E[Z(\theta)] = \sum_{k=0}^{n} P(k; r)\, Z_k(\theta)$

Comparing ELODs
• Compare data types or strategies in terms of ELOD
  o Thompson EA (1975) Ann Hum Genet 39, 173
  o Ott (1985) Analysis of Human Genetic Linkage, 1st edition
• ELOD (MELOD) approximately corresponds to a power of 50%

Phase-known Double Backcross
• Consider two loci: (1) with alleles A and B, (2) with alleles 1 and 2.
• Mating A1/B2 × A1/A1 allows counting recombinants and nonrecombinants.
• Curve E[Z(r; θ)]: r log(2θ) + (1 – r) log[2(1 – θ)].
• ELOD curve is maximum for θ = r; find E[Z(r)] = r log(2r) + (1 – r) log[2(1 – r)].
• For r → 0, ELOD = 0.30 per offspring.
• On average, to obtain lod score of 3, need to count 10 offspring.

Phase-unknown Double Backcross with 2 offspring
• Mating (A1/B2 or A2/B1) × A1/A1: cannot count recombinants and nonrecombinants.
• Assume equal phase probabilities
• Single offspring, A1/A1: Prob = 0.5r × 0.5 + 0.5(1 – r) × 0.5 = 0.25 → no information on linkage
• Two offspring, for example, A1/A1 each: Probability is f1 = [r² + (1 – r)²]/8; both are recombinants given one phase and nonrecombinants given the other phase.
• For given phase, if one is a recombinant and the other a nonrecombinant, then probability is f2 = r(1 – r)/4.

Phase-unknown Double Backcross with 2 offspring
• List of all possible offspring genotypes:
• Define an offspring as Type 1 if nonrecombinant given some phase, and Type 2 if recombinant given this phase.
• Combine classes with equal probability

List of Type 1 and 2 Offspring
• Offspring are not independent!
• Again, combine classes with equal probabilities

Two Classes of Offspring Pairs
• ELOD (θ = r): E[Z(r)] = 2r(1 – r) log[4r(1 – r)] + [r² + (1 – r)²] log[2r² + 2(1 – r)²].
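
The ELOD formulas above can be checked numerically. The following Python sketch is not part of the lecture: the function names are illustrative assumptions, log is taken as log10, and 0 < θ < 1 and 0 < r < 1 are assumed so that every logarithm is defined.

```python
from math import comb, log10


def lod(k, n, theta):
    """Lod score Z_k(theta) for k recombinants among n phase-known meioses."""
    return n * log10(2) + k * log10(theta) + (n - k) * log10(1 - theta)


def elod_curve(theta, r, n):
    """Expected lod score E[Z(theta)]: lod curves weighted by P(k; r)."""
    return sum(
        comb(n, k) * r**k * (1 - r) ** (n - k) * lod(k, n, theta)
        for k in range(n + 1)
    )


def elod_phase_known(r):
    """ELOD per offspring of a phase-known double backcross, at theta = r."""
    return r * log10(2 * r) + (1 - r) * log10(2 * (1 - r))


def elod_phase_unknown_pair(r):
    """ELOD of one offspring pair of a phase-unknown double backcross, at theta = r."""
    p_mixed = 2 * r * (1 - r)       # one recombinant and one nonrecombinant
    p_same = r**2 + (1 - r) ** 2    # both offspring of the same type
    # Under theta = 0.5 each class has probability 0.5, hence the factor 2.
    return p_mixed * log10(2 * p_mixed) + p_same * log10(2 * p_same)
```

Under these assumptions, elod_curve(0.1, 0.1, 3) agrees with 3 × elod_phase_known(0.1), the curve is lower at θ = 0.05 or θ = 0.15 than at θ = r = 0.1, and both elod_phase_known(r) and elod_phase_unknown_pair(r) approach log10(2) ≈ 0.30 as r → 0.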

Comparing ELODs
• DB = phase-known double backcross, 1 offspring
• 2 kids = phase-unknown double backcross, 1 offspring pair
• “Lose 1 offspring for not knowing phase”

Conditional ELODs
• In practice, want to find ELOD or maximum possible lod score in family with given disease phenotypes.
• Conditional calculations complicated. Computer programs: SIMLINK (Boehnke 1986), SLINK (Weeks 1990)
• Off the cuff considerations often done assuming full penetrance and no recombination:
• “In a genome screen, it will always be possible to find a marker very close to the disease locus. Therefore, we should take as our measure of informativeness the maximum lod score (at θ = 0) obtained under the assumption of no recombinants.”

Conditional ELODs: Simple Family
• Reasoning: Genotype of II-3 will reveal phase in mother (“one child sets phase”)
• Remaining 3 children can be scored as nonrecombinants, each providing a lod score of log(2) = 0.301
• Thus, max. lod score possible = 3 × 0.301 = 0.903.
• Correct for full penetrance, no phenocopies, mother always doubly heterozygous.

Conditional ELODs: Larger Family
• Dominant disease
• With full penetrance and no recombinants, expect up to 7 non-recombinants, lod score of 7 × 0.301 = 2.107.

Conditional ELODs: Incomplete Penetrance
• Assume true r = 0.01
• Compute lod scores with SLINK program (2000 replicates)
(n = number of equally frequent marker alleles)

Cost of Incomplete Penetrance: ELODs
• Measure difference in ELOD for full versus incomplete penetrance
• Consider phase-known mating, D1/d2 × d1/d1, D > d (D dominant), and phenotypes A (affected) and U (unaffected).
• Obtain ELOD as a weighted sum of log of last line, weights = P(x)

Cost of Incomplete Penetrance: Results
• With tight linkage, f = 0.90 reduces ELOD to about 75% of its value at f = 1
• May compensate for this loss by increasing sample size by factor 1/0.758 = 1.32, that is, need about 30% more data.
• Analogously, to compensate for f = 0.50 requires 3 times the sample size
[Figure: ratio of ELOD relative to ELOD with full penetrance]

Inconsistencies
Ott (1978) Cytogenet Cell Genet 22, 702-705
• Maximum likelihood estimates (MLEs) are generally consistent (asymptotically unbiased, variance → 0).
• Ascertainment bias may lead to inconsistency
• Consider two codominant loci and phase-known mating A1/B2 × A1/B2 (common in CEPH families)
• Offspring genotypes:

Inconsistencies
• Mating A1/B2 × A1/B2, offspring phenotypes and their probabilities of occurrence; i = 3 is ambiguous

Inconsistencies
• Mating A1/B2 × A1/B2, collecting offspring phenotypes with same probabilities into one class each leads to:

Ascertainment Strategies
Book chapter 11
• Use all data. ELOD: $E_0[Z(\theta)] = \sum_{l=1}^{4} p_l(r)\, Z_l(\theta)$
• Computing dE0/dθ (r is a constant!), setting it to 0, and solving this equation leads to θ = r.
• Analyze only unambiguous data (l = 1…3): ELOD curve has a maximum at $\tilde{\theta} = r + b$, b = r(1 – r)(1 – 2r)/[1 + 2r(1 – r)] ≈ r for small r. The recombination fraction estimate is then about twice its true value!
• For small r, most l = 4 individuals will be nonrecombinants: Assume that all of them are nonrecombinant → negative asymptotic bias.

Ascertainment Strategies: Summary

Observed Information
• Maximum lod score, $Z(\hat{\theta})$.
• Testing for linkage: $X^2 = 2\ln(10)\,Z(\hat{\theta})$ approximately follows a chi-square (1 df) distribution in the absence of linkage (α = 0.0001).
• Lander & Kruglyak (1995) Nat Genet 11, 244:
• Equivalent number k of recombinants and (n – k) of nonrecombinants: Assume lod score curve was obtained on all phase-known data (book ch. 4).

Equivalent Numbers of Observations
• Edwards’ method:
• Two-point method:
• This furnishes k = Z(0.0010) – Z(0.0001) and n = 3.322 [Z(0.0010) + 3k]
• These numbers have no statistical meaning; may be useful to judge information content of data.

Exact Tests for Linkage
• Binomial situation: Count k recombinants in n meioses.
• Result significant when $Z(\hat{\theta})$ > 3
• r = 0.5: significance level, α (r < 0.5: power)
[Figure: axes α and n]

Power and Significance Level
• Binomial situation: Count k recombinants in n meioses.
• Result significant when $Z(\hat{\theta})$ > 3
• n = number of families (1 offspring each when phase known)
• m = number of offspring in phase unknown families
[Figure: axis r]
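
The two-point method and the exact binomial test above can be tried out numerically. The Python sketch below is an illustration under stated assumptions, not the lecture's own code: all function names are hypothetical, log is log10, the estimate is taken as θ̂ = k/n restricted to the interval [0, 0.5], and only the phase-known binomial situation is covered.

```python
from math import comb, log10


def lod(k, n, theta):
    """Lod score for k recombinants in n phase-known meioses (0 < theta < 1)."""
    return n * log10(2) + k * log10(theta) + (n - k) * log10(1 - theta)


def equivalent_observations(z_001, z_0001):
    """Two-point method: equivalent k and n from Z(0.0010) and Z(0.0001)."""
    k = z_001 - z_0001
    n = 3.322 * (z_001 + 3 * k)
    return k, n


def max_lod(k, n):
    """Maximum lod score Z(theta-hat), with theta-hat = k/n capped at 0.5."""
    theta_hat = k / n
    if theta_hat >= 0.5:
        return 0.0              # Z(0.5) = 0: no evidence for linkage
    if k == 0:
        return n * log10(2)     # limit of Z(theta) as theta -> 0
    return lod(k, n, theta_hat)


def prob_significant(n, r, threshold=3.0):
    """P(Z(theta-hat) > threshold) when the true recombination fraction is r.

    With r = 0.5 this is the significance level; with r < 0.5 it is the power.
    """
    return sum(
        comb(n, k) * r**k * (1 - r) ** (n - k)
        for k in range(n + 1)
        if max_lod(k, n) > threshold
    )
```

Under these assumptions, with n = 10 meioses only k = 0 gives a lod score above 3, so prob_significant(10, 0.5) is 0.5^10 ≈ 0.001 and the power at r = 0.1 is 0.9^10 ≈ 0.35; with n = 9 no outcome reaches the threshold. Feeding lod(2, 20, 0.001) and lod(2, 20, 0.0001) into equivalent_observations returns values close to k = 2 and n = 20.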