Statistics 102 Unit 2 Probability Suggested Reading Open
Statistics 102
Unit 2: Probability
Suggested Reading: Open Intro, Sections 2.1, 2.2, 2.6
27 February 2015
1 / 68

Outline for Unit 2
• Introduction
• Basic concepts from probability
• Conditional probability, general multiplication formula for probability
• Positive Predictive Value of a Diagnostic Test and Bayes’ Rule
2 / 68

Progress This Unit
• Introduction
• Basic concepts from probability
• Conditional probability, general multiplication formula for probability
• Positive Predictive Value of a Diagnostic Test and Bayes’ Rule
3 / 68

Two ‘easy’ probability problems
Probability problems can be both surprisingly easy to phrase and maddeningly difficult to solve.
The next two slides contain problems of this sort; try to guess at solutions using your intuition.
We will return to these later to solve them, using the advantage we get from precise, ‘math-like’ formulations.
4 / 68

Mandatory drug testing
A false positive in a drug screening test occurs when the test incorrectly indicates that a screened person is a drug user.
Suppose a mandatory drug test has a false-positive rate of 1.2% (or 0.012), and suppose a company uses the test to screen employees for drug use.
• Given 150 employees who are drug free, what is the probability that at least one will (falsely) test positive?
  1. < 0.25
  2. Between 0.25 and 0.75
  3. > 0.75
5 / 68

Breast cancer and mammograms
The National Cancer Institute estimates that approximately 3.65% of women in their 60’s get breast cancer.
A mammogram typically identifies a breast cancer about 85% of the time, and is correct 95% of the time when a woman does not have breast cancer.
If a woman in her 60’s has a positive mammogram, what is the likelihood she has breast cancer?
  (1) less than 0.5
  (2) 0.5 or greater
6 / 68

Announcements: Friday February 20
• P-sets: P-set 3 now posted, due Friday, February 27 at the usual time.
• Solutions to P-set 2 now posted.
• Clicker questions today, channel 41.
• Quiz next Wednesday, February 25.
  Topics to be announced in weekend email.
• Begin today with slide 7 Unit 2.
• Some R tidbits. . .

Main Goal for the unit
Use the language of probability and mathematics to help draw conclusions about populations from datasets.
Material will be a mix of intuition and formal rules about probability and random phenomena.
Important notions:
• Rules for probability
• Conditional probabilities
Important material will be presented on the board in lecture.
7 / 68

Progress This Unit
• Introduction
• Basic concepts from probability
• Conditional probability, general multiplication formula for probability
• Positive Predictive Value of a Diagnostic Test and Bayes’ Rule
8 / 68

Main ideas
• Some definitions
• The axioms (rules) of probability
• Combining events in sample spaces
• Venn diagrams
• Independence
9 / 68

Mandatory drug testing
A false positive in a drug screening test occurs when the test incorrectly indicates that a screened person is a drug user.
Suppose a mandatory drug test has a false-positive rate of 1.2% (or 0.012), and suppose a company uses the test to screen employees for drug use.
• Given 150 employees who are drug free, what is the probability that at least one will (falsely) test positive?
  1. < 0.25
  2. Between 0.25 and 0.75
  3. > 0.75
Two solutions: one using independence on the board (and in clickers!); one in R.
10 / 68

Mandatory drug tests
• Example: A mandatory drug test has a false-positive rate of 1.2% (or 0.012)
• Given 150 employees who are drug free, what is the probability that at least one will (falsely) test positive?
• Pr(At least 1 "+") = Pr(1 or 2 or 3 ... or 150 "+") = 1 - Pr(None "+") = 1 - Pr(150 "-")
• Pr(150 "-") = Pr(1 "-")^150 = (0.988)^150 = 0.16
• Pr(At least 1 "+") = 1 - Pr(150 "-") = 0.84
11 / 68

Announcements: Monday February 23
• P-sets: P-set 3 due Friday, February 27 at the usual time.
• Clicker questions today, channel 41.
• Quiz Wednesday, February 25. Will cover the code used in the Golub analysis and elementary ideas in probability.
• Begin today with slide 12 Unit 2.

Solution in R
We use R to ‘model’ the problem, then run a simulation to estimate the answer.
Steps in the modeling:
• Conduct one set of 150 tests, in one set of 150 employees.
• Replicate the 150 employee tests 1,000 times.
• Calculate the proportion of outcomes with at least one positive test in the 1,000 replicates.
Next slide shows logic of the simulation; always best to start with this.
12 / 68

Modeling one set of drug tests in 150 employees
Main steps:
• Initialize parameters of the problem, using information given, including creating an initial vector of test results, all negative.
• Simulate test results by sampling from the two values 0 (neg. test) and 1 (pos. test), according to probabilities for each outcome.
• Record whether there was at least one positive test.
• Also record proportion of positive tests.
Next slide shows the R code. The comments use as much space as the actual commands. We will run the code in lecture.
13 / 68

The R code for one set of tests
## coding for the drug testing problem
## begin with one set of employees

## set parameters and initialize
prob.false.positive = 0.012
num.employees = 150
set.seed(6578)

## initialize population
## default for function vector() sets values to 0
## This call to vector() creates a numeric vector
## of length num.employees, with all values = 0
test.result = vector("numeric", num.employees)

## now sample from test results
## using function sample()
## Type help(sample) for a complete explanation of
## the function
## 0 = neg result, 1 = pos result
test.result = sample(c(0,1), size = num.employees,
    prob = c(1 - prob.false.positive, prob.false.positive), replace = TRUE)
14 / 68

R for drug tests . . .
## at least one positive test?
## The logical condition (num.pos.tests > 0) sets
## the variable at.least.one.pos = TRUE if there is
## at least one positive test
num.pos.tests = sum(test.result)
at.least.one.pos = (num.pos.tests > 0)
at.least.one.pos

## Also look at the number and proportion of positive tests
num.pos.tests = sum(test.result)
prop.pos.tests = num.pos.tests/num.employees
prop.pos.tests
15 / 68

Replicating the 150 tests in R:
Replicating in R uses a simple for() loop. The syntax is
for(ii in number of loops){
    code to be repeated
}
The use of ii in the loop counter is arbitrary.
The next slides look more complicated than they are.
16 / 68

The R code for replicating
## now replicate 1,000 times
## initialize again
prob.false.positive = 0.012
num.employees = 150
num.replicates = 1000
set.seed(6578)

## initialize for replicates
at.least.one.pos = vector("numeric", num.replicates)

## Nest earlier simulation in a ‘for’ loop which
## repeats the 150 tests num.replicates times
## Record in each for() loop whether or not at least one
## test was positive
17 / 68

Replicating . . .
for(ii in 1:num.replicates){
    test.result = vector("numeric", num.employees)
    test.result = sample(c(0,1), size = num.employees,
        prob = c(1 - prob.false.positive, prob.false.positive), replace = TRUE)
    num.pos.tests = sum(test.result)
    ## at least one positive test?
    at.least.one.pos[ii] = (num.pos.tests > 0)
}
## Now calculate the proportion of replicates that produced
## at least one positive test
sum(at.least.one.pos)/num.replicates
18 / 68

Recap
The coding is less important than what can be learned.
• Setting up the problem requires understanding the problem statement.
• Interpreting the result helps reinforce probability concepts.
• You can check your math solution.
• Once the program is written (and running!) it can be run many times with different parameters.
On a p-set, you will have a chance to run this code, and to modify it a bit to examine different situations.
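As one way to check the math solution against a simulation, the sketch below computes the exact answer from the independence argument, 1 - (0.988)^150, and compares it with a simulated estimate. It is not the code run in lecture: it swaps the for() loop for a single call to rbinom(), which draws the number of positive tests in each replicate directly, and the names exact.prob and sim.prob are introduced here only for illustration.

```r
## Sketch: check the simulation against the exact answer.
## Assumes, as in the lecture problem, that the 150 tests are independent.
prob.false.positive = 0.012
num.employees = 150
num.replicates = 1000
set.seed(6578)

## Exact answer: 1 - Pr(all 150 tests negative)
exact.prob = 1 - (1 - prob.false.positive)^num.employees

## Simulated answer: rbinom() returns, for each replicate, the number
## of positive tests among num.employees independent tests
num.pos.tests = rbinom(num.replicates, size = num.employees,
                       prob = prob.false.positive)
sim.prob = mean(num.pos.tests > 0)

exact.prob   ## about 0.84
sim.prob     ## should be close to exact.prob
```

Changing prob.false.positive or num.employees and rerunning shows how quickly the chance of at least one false positive grows with the number of people screened.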
19 / 68

Independence and the Hardy-Weinberg distribution
Genes can be defined in two ways:
• A gene is a determinant, or a co-determinant, of a character that is inherited according to Mendel’s rules.
• A gene is a functional unit of DNA.
Previous unit used both the first and the second definition; we look at the first here.
A locus (plural loci) is a unique chromosomal location defining the position of an individual gene or DNA sequence.
(Example: ABO blood group locus)
20 / 68

The Hardy-Weinberg distribution . . .
Consider two alleles, A1 and A2, at the A locus.
Assume ‘gene frequencies’ in the population are p and q, respectively, i.e., assume that p is the probability that a randomly chosen member of a population will have allele A1 at locus A.
In our language,
• Pr(randomly chosen allele = A1) = p
• Pr(randomly chosen allele = A2) = q = 1 - p
What happens in inheritance through reproduction?
21 / 68

Hardy-Weinberg . . .
• The chance that both alleles are A1 is p^2.
• The chance that both alleles are A2 is q^2.
• The chance that the first allele was A1 and the second A2 is pq. The chance that the first was A2 and the second A1 is qp.
• Overall, the chance of picking one A1 and one A2 allele is 2pq.
The above proportions are called the Hardy-Weinberg proportions.
What important assumptions are being made here?
22 / 68

Disease inheritance
An autosomal recessive condition affects 1 newborn in 10,000. What is the expected frequency of carriers?
If a parent of a child affected by this condition remarries, what is the risk of producing an affected child in the new marriage?
Assume that affected individuals do not live long enough to reproduce.
Solutions in class
23 / 68

Clicker Q: Disease inheritance
An autosomal recessive condition affects 1 newborn in 10,000, so the expected frequency of carriers is approximately 1/50.
If a parent of a child affected by this condition remarries, what is the risk of producing an affected child in the new marriage?
Assume that affected individuals do not live long enough to reproduce.
  1. 1/200
  2. 1/100
  3. 1/1000

Checking H-W
In any population or sample from a population, the genotype frequency will match H-W predicted frequencies only approximately.
Let’s check H-W for the distribution of the SNP actn3_r577x examined in Unit 1. Recall from Unit 1 . . .
24 / 68

Genotype and allele distribution
## assumes genetics package has been loaded,
## FAMuSS data set has been loaded, and fms attached.
> r577x.genotype = genotype(actn3_r577x, sep="")
> summary(r577x.genotype)

Number of samples typed: 735 (52.6%)

Allele Frequency: (2 alleles)
    Count Proportion
C     750       0.51
T     720       0.49
NA   1324         NA

Genotype Frequency:
    Count Proportion
C/C   216       0.29
C/T   318       0.43
T/T   201       0.27
NA    662         NA
25 / 68

Progress This Unit
• Introduction
• Basic concepts from probability
• Conditional probability, general multiplication formula for probability
• Positive Predictive Value of a Diagnostic Test and Bayes’ Rule
26 / 68

From xkcd . . .
27 / 68

A more serious example of conditional probabilities
Published in Patel, et al., NEJM (2015) Vol 372, pp 331-340, posted on web site.
28 / 68

Conditional Probability
The notions of conditional probability and conditional distributions are pervasive in statistics, for both simple and complex problems.
Needed because not all events are independent.
29 / 68

Announcements: Wednesday February 25
• P-sets: P-set 3 due Friday, February 27 at the usual time.
• P-sets: P-set 4 will be posted Friday, February 27. P-set 4 will be due Monday, March 9. Last p-set before the March 11 midterm.
• No Clicker questions today.
• Quiz today.
• Begin today with slide 30 Unit 2.

Main ideas
• Definition of conditional probability
• A general multiplication rule for probabilities
30 / 68

Conditional Probability: Concept
Conceptual Definition: The conditional probability of an event B, given a second event A, is the probability of B happening, knowing that the event A has happened.
Notation: conditional probability is denoted by Pr(B|A).
Coin tossing example:
• Toss a fair coin 3 times. B = (exactly two heads), A = (at least 2 heads).
• Pr(B|A) is the probability of having exactly two heads among the outcomes that have at least two heads.
• Conditioning on A means we know the outcome is in the set (HHH, HHT, HTH, THH)
• In this set of outcomes, B consists of the last three, so Pr(B|A) = 3/4
• Note that Pr(B) = 3/8
31 / 68

Conditional Probability in life tables
• B = (a randomly chosen person from a population lives at least 65 years)
• A = (a randomly chosen person lives at least 60 years)
• Then Pr(B|A) is the probability a person lives at least 65 years, given that they have been selected from the population of people 60 years of age or older.
• Solution using US life tables. . .
32 / 68

US Life Table, 2004 Population
[Image of a page from the 2004 US life tables report; recoverable text follows]
... given age is the average number of years remaining to be lived by those surviving to that age on the basis of a given set of age-specific rates of dying. It is derived by dividing the total person-years that would be lived above age x by the number of persons who survived to that age interval (Tx / lx). Thus, the average remaining lifetime for males who reach age 20 is 56.2 years (5,537,328 divided by 98,486) (Table 2).
Life expectancy in the United States
Tables 1–9 show complete life tables by race (white and black) and sex for 2004. Tables A and B summarize life expectancy and survival by age, race, and sex. Life expectancy at birth for 2004 represents the average number of years that a group of infants would live if the infants were to experience throughout life the age-specific ...

Table B. Number of survivors by age, out of 100,000 born alive, by race and sex: United States, 2004

             All races                  White                    Black
Age    Total    Male  Female    Total    Male  Female    Total    Male  Female
  0  100,000 100,000 100,000  100,000 100,000 100,000  100,000 100,000 100,000
  1   99,320  99,253  99,391   99,434  99,378  99,493   98,616  98,475  98,763
  5   99,202  99,124  99,283   99,327  99,261  99,397   98,441  98,285  98,603
 10   99,129  99,043  99,218   99,261  99,187  99,339   98,334  98,171  98,503
 15   99,036  98,936  99,142   99,175  99,085  99,268   98,210  98,030  98,396
 20   98,709  98,486  98,944   98,856  98,655  99,068   97,809  97,436  98,195
 25   98,246  97,809  98,710   98,420  98,020  98,849   97,131  96,415  97,865
 30   97,776  97,148  98,442   97,992  97,418  98,608   96,321  95,241  97,402
 35   97,250  96,455  98,088   97,512  96,784  98,292   95,404  94,011  96,774
 40   96,517  95,527  97,555   96,831  95,915  97,809   94,200  92,504  95,849
 45   95,406  94,154  96,709   95,797  94,617  97,047   92,396  90,366  94,347
 50   93,735  92,078  95,445   94,249  92,680  95,896   89,614  86,946  92,146
 55   91,357  89,089  93,676   92,044  89,894  94,282   85,599  81,898  89,063
 60   88,038  85,067  91,058   88,908  86,103  91,810   80,282  75,282  84,923
 65   83,114  79,213  87,043   84,145  80,450  87,930   73,268  66,782  79,231
 70   76,191  71,168  81,200   77,338  72,531  82,206   64,578  56,723  71,774
 75   66,605  60,336  72,748   67,756  61,683  73,794   53,914  44,994  62,028
 80   53,925  46,461  61,045   54,953  47,622  62,031   41,332  31,985  49,714
 85   38,329  30,619  45,438   39,024  31,324  46,175   28,260  20,021  35,600
 90   22,219  15,948  27,782   22,460  16,145  28,082   16,403  10,432  21,627
 95    9,419   5,808  12,448    9,330   5,720  12,362    7,554   4,180  10,374
100    2,510   1,261   3,460    2,381   1,175   3,299    2,534   1,178   3,559
33 / 68

US Life Table, ages 55 - 70
[Excerpt of Table B shown on the slide]
• 83,114 out of 100,000 live births expected to live to age 65.
• Probability that a randomly selected person lives to age 65 is approximately 0.83.
• Of the 88,038 who live to age 60, 83,114 live to age 65
• Conditional probability of living to at least age 65 among those who live to at least age 60 is (83,114/88,038) = 0.94
34 / 68

Conditional Probability: Mathematical Definition
As long as Pr(A) > 0,
    Pr(B|A) = Pr(A ∩ B) / Pr(A)
In life table example,
    Pr(A) = Pr(person lives at least 60 years)
          = 88,038/100,000
    Pr(B ∩ A) = Pr(lives at least 60 years and at least 65 years)
              = Pr(person lives at least 65 years)
              = 83,114/100,000
So
    Pr(B|A) = (83,114/100,000) / (88,038/100,000) = 0.94
35 / 68

Alcohol dependency
Suppose a study of the US population showed that 10% of the population have some mental disorder, 8% have an alcohol related disorder, and 6% have both.
If a person has been diagnosed with an alcohol related disorder, what is the probability that he/she has a mental disorder?
If a person has been diagnosed with a mental disorder, what is the probability that he/she has an alcohol related disorder?
36 / 68

Independence again. . .
A simple consequence of the definition of conditional probability:
• A and B are independent if Pr(B|A) = Pr(B)
Example
• B = (a randomly chosen person from a population lives at least 65 years); A = (a randomly chosen person lives at least 60 years). Are A and B independent?
37 / 68

Announcements: Friday February 27
• P-sets: P-set 3 due today at 5:00 pm.
• P-sets: P-set 4 will be posted later today. P-set 4 will be due Monday, March 9. Last p-set before the March 11 midterm.
• Collection of review problems with solutions coming early next week.
• Exam 1 last year covered slightly different material, so it will not be posted.
• Clicker questions today.
• Quiz 2 with solutions now posted on the web site. The quiz will be returned in section next week.
• Begin today with slide 38 Unit 2.

More conditional probabilities
Suppose a disease is caused by a single major gene with two alleles (a) and (A) with frequencies 0.90 and 0.10, respectively.
• If we assume independent mating (non-assortative mating), what are the probabilities of the genotypes (aa), (aA) and (AA)?
• Suppose the allele A causes a disease but that the gene is not fully penetrant, so that the probability of developing the disease is 0.8 for genotype (AA), 0.4 for genotype (aA), and 0.1 for genotype (aa). Disease in the presence of genotype (aa) is called sporadic disease.
• What is the overall probability of disease in the population? This overall probability is called the prevalence.
38 / 68

Conditional probabilities . . .
• Suppose an individual is known to have the disease. Now what are the probabilities of the genotypes (aa), (aA) and (AA)?
39 / 68

Main ideas
• Definition of conditional probability
• A general multiplication rule for probabilities
40 / 68

Multiplication Rule of Probability
    Pr(A ∩ B) = Pr(B|A) Pr(A)
We can also write
    Pr(A ∩ B) = Pr(A|B) Pr(B)
Example
• A bag contains 3 red and 3 white balls. Two balls are drawn from the bag, the second without replacing the first.
• A = (first ball is red), B = (second ball is white).
• Pr(A ∩ B) = Pr(B|A) Pr(A) = (3/5) × (1/2) = 3/10
41 / 68

Examples
First with independence:
• Toss a coin and let A = (observe heads in the toss)
• Pr(A) = 1/2
• What is the probability of getting 5 heads in a row when flipping a coin 5 times?
• 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = 1/32
Now without independence:
• Draw 5 balls (without replacement) from an urn with 10 balls, 5 red, 5 green.
• The probability of getting 5 green balls is 5/10 × 4/9 × 3/8 × 2/7 × 1/6 = 120/30240 = 1/252
42 / 68

Conditional distributions of heights
In the US population, approximately 20% of men and 3% of women are taller than 6 feet (72 in).
Let F = the event that someone is female and T = the event the person is taller than 6 feet.
  1. What is Pr(T|F)?
  2. What is Pr(T|F^c)?
  3. What is the probability that the next person walking through the door is a woman and taller than 6 feet?
  4.
     What is the probability that the next person walking through the door is taller than 6 feet?
43 / 68

Tree diagrams (IPS p320) can help to organize your thinking...
Person is…
  Female: Yes (0.5)
    Taller than 6’ (0.03): 0.015
    Not taller than 6’ (0.97): 0.485
  Female: No (0.5)
    Taller than 6’ (0.2): 0.10
    Not taller than 6’ (0.8): 0.40
44 / 68

Progress This Unit
• Introduction
• Basic concepts from probability
• Conditional probability, general multiplication formula for probability
• Positive Predictive Value of a Diagnostic Test and Bayes’ Rule
45 / 68

Main ideas this section
Diagnostic (or screening) tests and measures of test accuracy
Calculating the positive predictive value of a test using
• Tables
• Bayes’ rule
46 / 68

Pre-natal testing for trisomy 21, 13, and 18
Some congenital disorders are caused by an additional copy of a chromosome being attached (translocated) to another chromosome during reproduction.
• Trisomy 21: Down syndrome, approximately 1 in 800 births
• Trisomy 13: Patau syndrome, physical and mental disabilities, approximately 1 in 16,000 newborns
• Trisomy 18: Edwards syndrome, nearly always fatal, either in stillbirth or infant mortality. Occurs in about 1 in 6,000 births
Until recently, testing for these conditions consisted of screening the mother’s blood for serum markers, followed by amniocentesis in women who test positive.
47 / 68

Cell-free fetal DNA (cfDNA)
cfDNA consists of copies of embryo DNA present in maternal blood.
Recent advances in sequencing DNA provide the possibility of non-invasive testing for these disorders, using only a blood sample.
Initial testing of the technology was done using archived samples of genetic material from children whose trisomy status was known.
The results are variable, but generally very good.
48 / 68

• Of 1000 children with one of the disorders, approximately 980 have cfDNA that tests positive. The test has high sensitivity.
• Of 1000 children without the disorders, approximately 995 test negative. The test has high specificity.
49 / 68

What do the parents of the unborn child care about?
The designers of a test want a test to have high sensitivity and specificity. That makes it a good test.
But a family undergoing testing wants to know the likelihood of the condition being present, if the test is positive.
Suppose a child has tested positive for trisomy 21. What is the probability that the child in fact does have the trisomy 21 condition?
50 / 68

Trisomy 21
We will show 3 solutions (!) to this problem.
• Intuitive solution that requires only common sense (and a bit of clear thinking. . . ). Will also illustrate use of R as a calculator.
• Algebraic solution using Bayes’ rule.
• Simulation based solution, similar to that used in the drug testing problem.
Each solution provides a different way to think about the problem.
Intuitive solution based on a simple two-way table on the board.
51 / 68

Using R as a ‘programmable’ calculator
## calculations for trisomy 21 example, unit 2

## parameters of the problem
## (specificity is 0.995: about 995 of 1000 unaffected
## children test negative)
tri21.prevalence = 1/800
cfdna.sens = 0.980
cfdna.spec = 0.995
pop.size = 10000

## expected number of healthy children and children with
## the disorder
expected.cases = pop.size * tri21.prevalence
expected.noncases = pop.size - expected.cases

## Number of children testing positive will consist of
## both true and false positives
expected.true.pos.tests = expected.cases * cfdna.sens
expected.false.pos.tests = expected.noncases * (1 - cfdna.spec)
52 / 68

Trisomy 21. . .
## now calculate expected number of positive tests
## in population
expected.pos.tests = expected.true.pos.tests + expected.false.pos.tests

## Among all positive tests, calculate the fraction of
## positive tests correctly identifying trisomy 21
cfdna.ppv = expected.true.pos.tests/expected.pos.tests
cfdna.ppv
53 / 68

Diagnostic Tests
Events of interest, where () denotes an event:
• D = (person has disease)
• D^c = (person does not have disease)
• T+ = (positive screening result)
• T- = (negative screening result).
Could use T and T^c, but T+, T- are consistent with notation in medical and public health literature.
54 / 68

Measures of accuracy for diagnostic tests
• Sensitivity = Pr(T+|D) (want very high!)
• Specificity = Pr(T-|D^c) (want high!)
• False negative rate = Pr(T-|D) = 1 - sensitivity
• False positive rate = Pr(T+|D^c) = 1 - specificity
These measures are all characteristics of a diagnostic test.
55 / 68

Positive predictive value of a test
Suppose an individual tests positive for a disease D.
Positive Predictive Value (PPV): The PPV of a diagnostic test is the probability that a person has a disease D, given that he/she has tested positive.
• PPV = Pr(D|T+)
• The conditioning here is in the reverse order from the test characteristics
The characteristics of the test give us Pr(T+|D) (among other things) but not Pr(D|T+).
56 / 68

Bayes’ Theorem, aka Bayes’ Rule
Simple form first:
    Pr(A|B) = Pr(A) Pr(B|A) / Pr(B)
Follows directly by noting that
    Pr(A|B) = Pr(A ∩ B) / Pr(B)
            = Pr(A) Pr(B|A) / Pr(B)
Back to pre-natal screening (on the board, again).
57 / 68

The denominator Pr(B) in Bayes’ Theorem
Seldom stated as simply as in earlier slides, because in many problems, Pr(B) is not given directly, but is calculated using the general multiplication formula for probabilities.
Suppose A and B are events.
Then
    Pr(B) = Pr(B ∩ A) + Pr(B ∩ A^c)
          = Pr(A) Pr(B|A) + Pr(A^c) Pr(B|A^c)
[Venn diagram: events A and B, with the intersection A ∩ B marked]
58 / 68

More complicated form of Bayes’ Theorem
    Pr(A|B) = Pr(A ∩ B) / Pr(B)
            = Pr(A) Pr(B|A) / Pr(B)
            = Pr(A) Pr(B|A) / [Pr(A) Pr(B|A) + Pr(A^c) Pr(B|A^c)]
Next slide converts this to notation of diagnostic testing.
59 / 68

Bayes’ Rule for diagnostic tests
Take A = D, B = T+
Recall PPV is Positive Predictive Value
    PPV = P(D|T+)
        = P(D) P(T+|D) / [P(D) P(T+|D) + (1 - P(D)) P(T+|D^c)]
        = [prev × sensitivity] / ([prev × sensitivity] + [(1 - prev) × (1 - specificity)])
60 / 68

Venn diagram of Bayes’ Theorem
[Venn diagram: T+ overlapping D and D^c]
61 / 68

Breast cancer and mammograms
The National Cancer Institute estimates that approximately 3.65% of women in their 60’s get breast cancer.
A mammogram typically identifies a breast cancer about 85% of the time, and is correct 95% of the time when a woman does not have breast cancer.
If a woman in her 60’s has a positive mammogram, what is the likelihood she has breast cancer?
62 / 68

Two solutions to breast cancer example
• Algebraic solution in the p-set, using the Bayes’ rule formula
• Simulation of a single large population shown on next slide
The R code for the simulation is spread over 3 slides, but much of the code is just comment.
Examining and using the code provides an opportunity to match probabilistic concepts with algorithmic thinking.
63 / 68

Using R to construct and examine a large population
Code is a bit longer than drug testing problem, so split over several slides.
Instead of simulating 150 drug tests with many replicates, we use R to construct a large population of individuals.
Logic of the simulation:
• Initialize
• Loop through the population, simulating test outcome conditional on disease status by using an if() statement.
• Calculate PPV empirically.
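The three steps of the simulation logic can be sketched compactly as follows. This is an illustrative alternative, not the lecture code: it replaces the loop and if() statements with vectorized calls to rbinom(), takes the specificity of 0.95 from the problem statement ("correct 95% of the time" when there is no cancer), and introduces the name prob.pos only for this sketch.

```r
## Sketch of the simulation logic; vectorized instead of a loop.
population.size = 100000
disease.prevalence = 0.0365
test.sens = 0.85
test.spec = 0.95   ## assumption: taken from the problem statement
set.seed(6579)

## Step 1: initialize disease status (1 = disease, 0 = no disease)
disease.presence = rbinom(population.size, size = 1,
                          prob = disease.prevalence)

## Step 2: test outcome conditional on disease status.
## Pr(positive) is the sensitivity for members with the disease
## and 1 - specificity for members without it.
prob.pos = ifelse(disease.presence == 1, test.sens, 1 - test.spec)
diag.test.outcome = rbinom(population.size, size = 1, prob = prob.pos)

## Step 3: empirical PPV = proportion with disease among positive tests
diag.test.ppv = mean(disease.presence[diag.test.outcome == 1])
diag.test.ppv
```

The loop version makes the conditioning on disease status explicit, which is easier to read against the problem statement; the vectorized version is shorter and runs faster on a large population.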
64 / 68

The R code
## Simulation of setting for diagnostic testing
## Based on breast cancer example discussed in class

## step 1, initialize population with no disease, no test
## (specificity 0.95 matches the problem statement)
population.size = 100000
disease.prevalence = 0.0365
test.sens = 0.85
test.spec = 0.95
set.seed(6579)

## Initialize the population, create the disease marker
## This code starts by establishing a population whose
## members have the disease with probability equal to
## the prevalence of the disease
disease.presence = vector("numeric", population.size)
disease.presence = sample(c(0,1), size = population.size,
    prob = c(1 - disease.prevalence, disease.prevalence), replace = TRUE)

## Now initialize the diagnostic test outcome vector
diag.test.outcome = vector("numeric", population.size)
65 / 68

The R code . . .
## The code now loops through the population, sampling
## outcomes for the diagnostic test using sensitivity
## and specificity, conditional on the disease status
## of each member. The if() statements sample outcomes
## of tests conditional on disease status
for (ii in 1:population.size) {
    if(disease.presence[ii] == 1) {
        diag.test.outcome[ii] = sample(c(0,1), size=1,
            prob = c(1 - test.sens, test.sens))
    }
    if(disease.presence[ii] == 0) {
        diag.test.outcome[ii] = sample(c(0,1), size=1,
            prob = c(test.spec, 1 - test.spec))
    }
} ## end for loop

## Create a matrix where each row contains the disease
## status and test outcome for each population member
disease.pres.and.diag.test = cbind(disease.presence, diag.test.outcome)
66 / 68

R code for mammogram example
## As in the trisomy example, calculate ppv by finding
## the proportion of true positive tests among all
## positive tests. Since disease and test outcome are
## labeled 0 and 1, the sum of the vector
## disease.presence is the number of members of the
## population with the disease.
## The same reasoning is used to calculate the number of
## positive tests
num.disease = sum(disease.presence)
num.pos.test = sum(diag.test.outcome)

## Number of true positives is the number of positive
## tests among members with the disease
d = (disease.presence == 1)
num.true.pos = sum(diag.test.outcome[d])
diag.test.ppv = num.true.pos/num.pos.test
diag.test.ppv
67 / 68

Summary, and a look ahead
The notions of probability, independence and conditional probability provide ways to make probabilistic statements (assessments of uncertainty) about events in populations.
The problems that arise (especially word problems) are often easy to state, but difficult to solve.
Problems can be solved with algebraic calculations or using R.
• Algebraic calculations are more familiar (and seem easier), but getting to the right calculation can be difficult.
• Using R requires dealing with R syntax, but is often conceptually easier because simulating a population or replicates of an experiment follows the problem statement directly.
68 / 68
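
To illustrate the algebraic route mentioned in the summary, the sketch below wraps the Bayes’ rule formula for PPV in a small R function. The function name ppv.bayes and its argument names are hypothetical, chosen here for illustration; the inputs are the values stated in the pre-natal testing example (prevalence 1/800, sensitivity 0.98, specificity 0.995) and the mammogram example (prevalence 0.0365, sensitivity 0.85, specificity 0.95).

```r
## Sketch: PPV from Bayes' rule,
## PPV = prev * sens / (prev * sens + (1 - prev) * (1 - spec))
## ppv.bayes is a hypothetical helper, not part of the lecture code.
ppv.bayes = function(prevalence, sensitivity, specificity) {
    true.pos = prevalence * sensitivity
    false.pos = (1 - prevalence) * (1 - specificity)
    true.pos / (true.pos + false.pos)
}

## cfDNA example: prevalence 1/800, sensitivity 0.98, specificity 0.995
ppv.trisomy = ppv.bayes(1/800, 0.980, 0.995)

## mammogram example: prevalence 0.0365, sensitivity 0.85, specificity 0.95
ppv.mammogram = ppv.bayes(0.0365, 0.85, 0.95)

ppv.trisomy     ## about 0.20
ppv.mammogram   ## about 0.39
```

Both values are far below the sensitivity of the corresponding test, which is the distinction between Pr(T+|D) and Pr(D|T+) drawn earlier in the unit.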