How to Lie with Statistics
How to Lie with Statistics, Darrell Huff (R6)
Meredith Mincey, Readings 5050, Spring 2010

In Darrell Huff’s famous book, he teaches us how to lie with statistics so that we can protect ourselves from false information.

In Chapter 1, Huff tells us that a “surprisingly precise figure” is most likely false. A figure that is suspiciously round or suspiciously specific is unlikely to be scientifically accurate. Those who use such figures often haven’t taken an appropriate sample, and samples go bad in all kinds of ways. If the sample is large enough and selected properly, it will represent the whole well. If the sample is too small or the creator too biased, the conclusions will be false but appear scientific. Unfortunately, bad samples lie behind most of what you read. Sometimes respondents lie because they want to give a pleasing answer. But most of the time, results are only as good as the samples. Be skeptical.

Creators who are serious about taking accurate samples must eliminate any chance of bias. To do this, they can use a basic sample called “the random sample.” Creators choose random samples by selecting things by chance from the “universe.” The universe is the whole of which the sample is a part. For instance, perhaps the universe is UNT undergraduates, and you want to see how many undergraduates plan to enroll in graduate courses. All undergraduates currently at UNT would be the universe, but that’s a very large population to sample from. It would be expensive to take a random sample large enough to accurately predict how many undergraduates plan to go to graduate school.

A more economical substitute for the random sample is the “stratified random sample.” To take a stratified random sample, creators divide the universe (UNT undergraduates) into several groups and sample each group in proportion to its known prevalence. For instance, one group would be journalism majors who want to enroll in the Mayborn program.
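As a rough illustration of the proportional idea, here is a minimal Python sketch. The group names, group sizes, and the helper `stratified_sample` are all made up for this example; they do not come from Huff or from any real enrollment data.

```python
import random

def stratified_sample(strata, total_size, seed=0):
    """Draw a proportional stratified random sample.

    strata: dict mapping a group name to the list of its members.
    total_size: approximate number of members to draw overall.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    population = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Each group's share of the sample matches its share of the universe.
        quota = round(total_size * len(members) / population)
        sample[name] = rng.sample(members, quota)
    return sample

# Hypothetical universe of 1,000 undergraduates (invented counts).
universe = {
    "journalism": [f"J{i}" for i in range(100)],
    "business": [f"B{i}" for i in range(400)],
    "other": [f"O{i}" for i in range(500)],
}
picked = stratified_sample(universe, total_size=50)
print({name: len(members) for name, members in picked.items()})
```

With these invented sizes, a sample of 50 takes 5 journalism, 20 business, and 25 other students, so each group appears in the sample in the same proportion as in the universe.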
Because each group is much smaller than the whole population, you won’t need as many random samples to make your data accurate.

In Chapter 2, Huff explains the tricky nature of averages. The word “average” has a loose meaning. People use averages to trick and influence public opinion or sell products. Readers are fooled when they are given an average without being told what kind of average it is. Huff explains there are three kinds of averages: mean, median, and mode.

The mean is the sum of all the numbers in a data set divided by the number of items in the set.
Example: (1 + 2 + 3 + 4) / 4 = 10 / 4, so Mean = 2.5

The median is the middle value, found by arranging all the observations from lowest to highest and picking the middle one. When the number of observations is even, it is the midpoint of the two middle values.
Example: if a < b < c < d, then for {a, b, c, d}, Median = (b + c) / 2

The mode is the value that occurs most frequently in a data set.
Example: {2, 2, 3, 6, 2, 7} Mode = 2

Some averages fall so close together that it isn’t vital to distinguish among them, but the mode is the most revealing because it shows the most common occurrence in your data set.

In Chapter 3, Huff warns us about the data that is missing from the sample. People usually make inadequate samples. And instead of creating an honest headline, they omit the size of their sample. Unfortunately for advertisers, any difference found in a large sample is likely to be too small to make a good headline. And unfortunately for readers, that tempts advertisers toward small samples, which are less likely to be accurate. Sooner or later, a small test group is going to show, by chance, an improvement worth a headline, and that headline is unlikely to be true.

Only a substantial number of trials follows “the law of averages.” The law of averages states that probability will influence all occurrences in the long term. People often misread it, as in this example: “The roulette wheel has landed on red three consecutive times. The law of averages says it’s due to land on black!” Of course, the probabilities do not change according to past results.
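A quick simulation makes the point concrete. This is only a sketch under stated assumptions: it assumes a fair American wheel with 18 red, 18 black, and 2 green pockets, and the function name `black_after_red_streak` is invented for this example. It compares the chance of black on any spin with the observed share of black on spins that come right after three reds in a row.

```python
import random

# Assumed American wheel: 18 red, 18 black, and 2 green pockets.
POCKETS = ["red"] * 18 + ["black"] * 18 + ["green"] * 2

def black_after_red_streak(spins, streak=3, seed=1):
    """Estimate P(black) on the spin right after `streak` reds in a row."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = [rng.choice(POCKETS) for _ in range(spins)]
    # Collect the outcome of every spin that follows a full run of reds.
    followups = [
        results[i]
        for i in range(streak, spins)
        if all(color == "red" for color in results[i - streak:i])
    ]
    return followups.count("black") / len(followups)

baseline = 18 / 38  # chance of black on any single spin of the assumed wheel
estimate = black_after_red_streak(200_000)
print(f"black on any spin:      {baseline:.3f}")
print(f"black after three reds: {estimate:.3f}")
```

The two printed values should be close to each other. The wheel has no memory, so a streak of reds does not make black any more likely.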
Even if the wheel has landed on red 10 consecutive times, the probability that the next spin will land on black is unchanged: about 47.4% on a fair American wheel, which has 18 black pockets out of 38.

Still, Huff says the law of averages is useful for descriptions and predictions. How useful depends on how many samples you take. But how many samples do you need to predict something accurately? The size of your sample depends on how large the population is and how varied the population is. Sometimes the number of samples can be deceptive. To avoid being fooled, figure out the degree of significance. Don’t trust an average or graph when important figures are missing. If the creator doesn’t explain the numbers, give the range, or show any data that deviates from the average, they are fighting dirty.

In Chapter 4, Huff explains the sampling method. Any product of the sampling method will have statistical error. How closely your sample can be taken to represent the whole field can be measured and expressed in figures. There are two ways of doing so: the probable error and the standard error.

The probable error is the amount by which the mean of a sample is expected to vary because of chance alone.
Example: Suppose you measure the size of a field by pacing along the fence while counting your steps, one step to the yard. You count 100 steps along the fence. You do this a few times and notice that you came within three yards of the exact 100 in half your trials, and missed by more than three yards in the other trials. You would record the measurement with its probable error like so: 100 ± 3 yards.

Most statisticians use the standard error, which takes in about two-thirds of the cases. You can only calculate the standard error by knowing the sample’s size. Sometimes, though, people make a big ado about a difference that is demonstrable but tiny and unimportant.

In Chapter 5, Huff explains what he likes to call “gee-whiz graphs.” Line graphs are the easiest statistical picture to use, and they’re good for showing trends and explaining something everyone’s interested in.
Unfortunately, they’re also good for misleading the reader, intentionally or unintentionally. Suppose you want your graph to have more of a “wow” factor. You could cut off part of the graph and make a bigger impression while still presenting honest data. Your company can use misleading graphs to influence public opinion by changing the proportions of the graph, and no one can place blame on you. Isn’t that something?
Example: Which graph looks more impressive? Which one is more honest?
Source: http://www.evsc.virginia.edu/~jhp7e/EVSC503/slides/stats_lie02/sld014.htm

In Chapter 6, you also learn how to use pictorial graphs, or pictographs, to fool the reader. Readers like pictographs because they’re eye-appealing, but readers are less likely to understand the results correctly. When reading, watch out for bar graphs where bars change widths while representing a single factor. Is it sloppy craftsmanship or yellow journalism? Who knows?
Example: Just how many adult frogs are in the south pond? The reader might conclude that frogs are simply bigger in September than in May, even though the title says that the graph displays the number of frogs. The reader will notice the area of the image, not just the height.
Source: http://wikieducator.org/MathGloss/P/Pictograph

In Chapter 7, Huff tells us what a semiattached figure can do. What is a semiattached figure? If you can’t prove what you want, demonstrate something else and pretend it’s the same thing! Choose the figures that sound best and trust that few readers will recognize how imperfectly they reflect the situation. You can recognize a semiattached figure when information is missing or variables are not stated. Most advertisers want to fool you with numbers, but semiattached figures can also come from inconsistent reporting at the source. For instance, if the advertiser asked controversial questions, it might lead to false information because respondents want to give what they believe is an acceptable answer.
Example: 72% of all crow nests in a particular forest are in pine trees; therefore, crows prefer to nest in pine trees. (But 95% of all the trees in the forest are pine trees!)

In Chapter 8, Huff explains the common problem of the post hoc fallacy. The post hoc fallacy occurs when you believe: If B follows A, then A caused B. In other words, because one event occurred before another, the earlier event (A) directly caused the later event (B). However, just because A happens before B doesn’t mean they are related. B may well have been caused by a third factor.
Example:
Event A: The US has a high milk consumption rate.
Event B: The US has a higher cancer rate than countries with a low consumption of milk.
Post hoc fallacy: Because the US has a high milk consumption rate and a higher cancer rate than countries that consume low amounts of milk, milk causes cancer.

When there are many possible explanations, you shouldn’t pick one just because it suits your tastes. After all, a correlation can arise in several ways:
a. Chance.
b. A co-variation in which the relationship is real, but you don’t know which variable is the cause and which is the effect.
c. The cause and the effect sometimes change places.
d. Both variables are cause and effect at the same time.
e. Neither variable affects the other, but the correlation is real.
f. The cause and the effect can only be a matter of speculation.

So what have we learned? That people will create false information when they make completely unwarranted assumptions. People will also create a fallacy when they assume that a relationship continues beyond the data that demonstrate it. Ask yourself, how did they connect event A to event B?

In Chapter 9, Huff tells us how to “statisticulate.” Statisticulation is misinforming people by using statistical material, and it is caused by incompetence or chicanery. To be fair, statistics are usually manipulated by people who are not professional statisticians.
According to Huff, salesmen, PR experts, journalists, and copywriters twist data to influence the reader. They frequently exaggerate data and rarely minimize anything unless it’s negative. They like to paint a picture of giving rather than taking. Maps can conceal facts and distort relationships, and decimals can be deceiving because they have an air of exactness. Percentages can be used to confuse you, and any percentage based on a small number of cases will be misleading. A shifting base price will confuse you about discounts. Percentages can’t be added up freely, so when someone adds them, there’s a problem.
Example: An ad for Instant Maxwell House Coffee emphasizes that 45% of those tested in a recent survey preferred its taste. (But how many people are in the sample?)

So, how can readers protect themselves from learning false information? The first thing to do is to look for a bias or biased samples. Is the creator trying to prove a pet theory, earn a fee, or protect their reputation? Look for suppressed data and see if they published only favorable data. When reading graphs, check to see if the units of measure have shifted. Look for unqualified “averages.” Even if the creator is trying to be honest, their data can still be false.

If a claim cites an authority, who is it really? Huff tells us to watch out for “o.k. names,” names that have some sort of prestige. The unscrupulous will use o.k. names to influence you without having actually consulted anyone. Check to see if the source really supports their claim. And watch out for “firsters.” Anyone can claim to be the first at anything. Check their claim more carefully to find the truth. And finally, watch out for a switch between the raw figure and the conclusion.

Hopefully, by learning how to lie with statistics, you’ll know how to protect yourself in the future.

Bibliography
Huff, Darrell. (1954). How to Lie with Statistics. New York: W. W. Norton & Company Inc.
Porter, John H. (1998). How to Lie with Statistics. Retrieved on 2010/4. http://www.evsc.virginia.edu/~jhp7e/EVSC503/slides/stats_lie02/sld001.htm
Kirkman, T.W. (1996). Display of Statistical Data. Statistics to Use. Retrieved on 2010/4. http://wikieducator.org/MathGloss/P/Pictograph