Stacie Bowman
Math 1040
Skittles Term Project Report
4/24/2015

Introduction

Have you ever noticed when eating a bag of Skittles that there is often an abundance of one color and only a few, or none, of the others? It raises the question: are Skittles colors random, or is there actually a higher probability of getting some colors than others? This term project for Math 1040 aims to answer this question while showing how to make use of statistics with a real-life data set. The project's goal is to illustrate and practice the beginning-to-end process of statistics: collecting, organizing, and analyzing data, then drawing conclusions and presenting results.

Organizing and Displaying Categorical Data: Colors

My Skittles
Color     Count   Percent
Red          22     35.5%
Orange       13     21.0%
Yellow        3      4.8%
Green         8     12.9%
Purple       16     25.8%

[Figure: My Skittles Distribution - Pie Chart]
[Figure: My Skittles Distribution - Pareto Chart (y-axis: Number of Skittles; bars in order: Red 22, Purple 16, Orange 13, Green 8, Yellow 3)]

Class Total Skittles
Color     Count   Percent
Red         283     22.4%
Orange      271     21.4%
Yellow      254     20.1%
Green       206     16.3%
Purple      250     19.8%

[Figure: Class Total Skittles Distribution - Pie Chart]
[Figure: Class Total Skittles Distribution - Pareto Chart (y-axis: Number of Skittles; bars in order: Red 283, Orange 271, Yellow 254, Purple 250, Green 206)]

When opening my bag of Skittles I assumed there would not be a very even distribution of colors, because I have experienced this in the past (it is especially noticeable when your favorite color/flavor is the least frequent). Although this is often the case with a small bag (a small sample), I expected the larger class sample to be more evenly distributed (assuming the Skittles manufacturer produces roughly the same number of each color).
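The percentages in the two color tables follow directly from the counts. As a minimal sketch (Python is used here only for illustration, since the report does not say what tool produced the tables, and the function name color_percentages is hypothetical), the calculation and the Pareto ordering could look like this:

```python
def color_percentages(counts):
    """Return each color's share of the total, as a percent rounded to 0.1."""
    total = sum(counts.values())
    return {color: round(100 * n / total, 1) for color, n in counts.items()}

# Counts copied from the two tables in this report.
my_bag = {"Red": 22, "Orange": 13, "Yellow": 3, "Green": 8, "Purple": 16}
class_total = {"Red": 283, "Orange": 271, "Yellow": 254, "Green": 206, "Purple": 250}

for label, counts in (("My bag", my_bag), ("Class", class_total)):
    pct = color_percentages(counts)
    # A Pareto chart orders the bars from most to least frequent.
    ranked = sorted(counts, key=counts.get, reverse=True)
    print(label, "total:", sum(counts.values()))
    for color in ranked:
        print(f"  {color:<7}{counts[color]:>5}{pct[color]:>7.1f}%")
```

The totals it prints (62 candies in the single bag, 1264 for the class) make the difference in sample size explicit.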
Both graphs, for my own bag and for the class total, turned out to be about what I expected: my own bag has an uneven color distribution with an overabundance of red and only a few yellow, whereas the class distribution is much more even. I would expect an even larger sample to be even more evenly distributed, just as the proportion of heads gets closer to 50% the more times a coin is flipped. The overall data is very different from the small sample of my own bag, but interestingly red is the most frequent color for the class as a whole while also being the dominant color of my own bag. One can't help but wonder whether the Skittles factory actually does produce more red than other colors.

Organizing and Displaying Quantitative Data: the Number of Candies per Bag

The shape of the distribution is normal for the most part, with a majority (12 of 21) of the bags containing between 60 and 62 Skittles. The overall graph is skewed only slightly to the left, with outliers at 52 and 67. This is mostly what I expected to see, a normal and narrow distribution, but what surprised me was how far the outliers are from the average. It seems you can either get lucky and get an extra 7 or so Skittles, or get unlucky and get 8 or so fewer than the average. My own bag had 62 Skittles in it, which is the class mode (62) and is within one standard deviation (3.01) of the class mean (60.2), making my bag very typical when compared to the whole class.
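The summary statistics for the 21 class bag totals can be checked with a short script. This is a sketch under stated assumptions rather than the original worksheet: the function names are hypothetical, quartiles use the round-up locator method (with the usual textbook averaging rule when the locator is a whole number, a case that does not arise here), and the 1.5 x IQR rule for flagging outliers is an assumption, since the report does not say which rule was applied.

```python
import math
import statistics

# The 21 bag totals recorded by the class, already sorted.
bag_totals = [52, 56, 57, 58, 59, 59, 59, 60, 60, 60, 60, 60,
              62, 62, 62, 62, 62, 62, 62, 63, 67]

def percentile_round_up(sorted_values, fraction):
    """Locator L = fraction * n. If L is not whole, round it up and take
    the L-th value (1-based). If L is whole, average the L-th and
    (L+1)-th values (assumed textbook convention, not used for n = 21)."""
    locator = fraction * len(sorted_values)
    if locator != int(locator):
        return sorted_values[math.ceil(locator) - 1]
    k = int(locator)
    return (sorted_values[k - 1] + sorted_values[k]) / 2

def five_number_summary(values):
    s = sorted(values)
    return (s[0], percentile_round_up(s, 0.25), percentile_round_up(s, 0.50),
            percentile_round_up(s, 0.75), s[-1])

def iqr_outliers(values):
    """Values more than 1.5 * IQR outside the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]

mean = statistics.mean(bag_totals)
stdev = statistics.stdev(bag_totals)   # sample standard deviation (n - 1)
my_bag = 62

print("five-number summary:", five_number_summary(bag_totals))
print("mean:", round(mean, 1), "stdev:", round(stdev, 2),
      "mode:", statistics.mode(bag_totals))
print("outliers (1.5 x IQR):", iqr_outliers(bag_totals))
print("my bag within one stdev of the mean:", abs(my_bag - mean) <= stdev)
```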
Skittle Count - Five Number Summary
Min   Q1   Q2   Q3   Max
 52   59   60   62    67

Mean: 60.2   Stdev: 3.01   My Bag: 62   # of Bags: 21

Order #:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
Total:    52  56  57  58  59  59  59  60  60  60  60  60  62  62  62  62  62  62  62  63  67

Min: 52
Q1:  0.25*21 = 5.25, round up to 6, so Q1 is the 6th value: 59
Q2:  median (the 11th value): 60
Q3:  0.75*21 = 15.75, round up to 16, so Q3 is the 16th value: 62
Max: 67

Confidence Interval Estimates

A confidence interval is used to indicate how accurate an estimate of a population parameter is expected to be. It gives a range of values together with a confidence level: one is x% confident that the true population parameter falls within the interval. Calculations show that we are 95% confident that the true proportion of purple candies falls between 0.176 and 0.220. This is roughly a fifth of the candies, which seems reasonable as there are five colors. We are 99% confident that the true mean number of candies per bag falls between 58.3 and 62.1. This could be interpreted as expecting about 60 Skittles in a bag, plus or minus 2. I also calculated that we are 98% confident that the true standard deviation of candies per bag is between 2.20 and 4.68. All of these ranges seem reasonable for the given parameters.

Hypothesis Tests

A hypothesis test offers a way to test a claim made about a population. The claim could be made about the proportion, mean, or other property of the population. The claim is tested against a selected significance level and is either rejected or not rejected. If the claim includes equality (is specific), then the hypothesis test determines whether there is sufficient evidence to warrant rejection of the claim. If the claim does not include equality (e.g. greater than), then the test determines whether there is sufficient evidence to support the claim. Our hypothesis tests for the bags of Skittles tested the claims that 20% of all Skittles are green and that the mean number of candies per bag is 56.
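The three interval estimates and the two test statistics described above can be checked with a short calculation. The sketch below is one possible way to do it, not the original worksheet. The function names are hypothetical, the t and chi-square critical values are hard-coded from standard tables for 20 degrees of freedom, and the 0.05 significance level for the mean test is an assumption, since the report does not state the level used.

```python
import math
from statistics import NormalDist

# Critical values from standard tables, df = 21 - 1 = 20.
T_99 = 2.845                  # t with 0.005 in each tail (99% interval)
CHI2_RIGHT = 37.566           # chi-square with 0.01 in the right tail
CHI2_LEFT = 8.260             # chi-square with 0.99 to its right
T_CRIT_05 = 2.086             # two-tailed t at alpha = 0.05 (assumed level)

def proportion_ci(x, n, confidence):
    """Normal-approximation interval for a population proportion."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def mean_ci(mean, sd, n, t_crit):
    """t interval for a population mean."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin

def sd_ci(sd, n, chi2_right, chi2_left):
    """Chi-square interval for a population standard deviation."""
    ss = (n - 1) * sd ** 2
    return math.sqrt(ss / chi2_right), math.sqrt(ss / chi2_left)

def proportion_z_test(x, n, p0):
    """Two-tailed z test of the claim p = p0; returns (z, P-value)."""
    z = (x / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

def mean_t_statistic(mean, sd, n, mu0):
    """Test statistic for the claim mu = mu0."""
    return (mean - mu0) / (sd / math.sqrt(n))

# Inputs are the class figures reported in this document.
print("95% CI, purple proportion:", proportion_ci(250, 1264, 0.95))
print("99% CI, mean per bag:", mean_ci(60.2, 3.01, 21, T_99))
print("98% CI, stdev per bag:", sd_ci(3.01, 21, CHI2_RIGHT, CHI2_LEFT))

z, p_value = proportion_z_test(206, 1264, 0.20)
print("green = 20%: z =", round(z, 2), "P-value =", round(p_value, 4))

t = mean_t_statistic(60.2, 3.01, 21, 56)
print("mean = 56: t =", round(t, 2), "reject:", abs(t) > T_CRIT_05)
```

With these inputs the three intervals round to the values reported in the Confidence Interval Estimates section.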
My calculations determined that in both cases the claims are rejected. The claim that 20% are green is tested against our class proportion of 16.3% green Skittles; it is determined to be unlikely to be correct and is therefore rejected (the calculated P-value is less than the significance level). The claim that the mean number of Skittles per bag is 56 is tested against our class mean of 60.2; it is also determined to be unlikely to be correct and is also rejected (the calculated test statistic falls within the critical region).

Hypothesis Testing Reflection

The condition for a hypothesis test for a population proportion is that np >= 5 and nq >= 5. In our case np = (1264)(0.2) = 253 >= 5 and nq = (1264)(0.8) = 1011 >= 5, so our sample met these conditions. The test should therefore be valid, except that the sampling may not be considered simple random, which would disqualify the result. The condition for a hypothesis test for a population mean is that the population is normally distributed or n > 30. In our case the number of bags is 21, which is not greater than 30, so we must determine whether the distribution is normal. The graph below shows the frequency distribution, and it is not very normal; therefore our sample does not meet the requirements for hypothesis testing and our test result may be incorrect. Estimates of the standard deviation require a strictly normal distribution, which we do not have. Errors could have been made using this data because our sample mean (of Skittles per bag) may not be at all representative of the population mean. The sampling method could be improved by including many more samples (more bags) and by selecting those bags from randomized locations (as opposed to all of them from here in town).
[Figure: Frequency Distribution - Skittles per Bag (x-axis: # of skittles in bag, 52 to 67; y-axis: Frequency, 0 to 8)]

Term Project Conclusion: Reflective Writing

The Skittles term project has really helped me understand how statistics can be used to make sense of a real-world data set. This project helped me learn the whole process of statistics, from collecting data to drawing conclusions and presenting results. The results of this work also taught me that you have to be careful not to make assumptions about a population parameter when your sample size is very small (such as one bag of Skittles). My own bag showed only a few yellow, possibly leading one to assume that the factory produces far fewer yellow candies than the others. In the data collected from the whole class, however, yellow was almost exactly a proportionate fifth of the total candies, illustrating how a small sample can badly misrepresent the larger population.

The way I think about math has changed due to this project and class. Statistics is quite different from the math I am used to, and it felt more real to me, involving real-world information and data sets rather than simply solving abstract equations. The Skittles project in particular really helped convince me that math can be used to make sense of all sorts of things, be it clinical trials or candy. I definitely think this project has also taught me skills that will be useful for other classes and for life in general, such as organizing information, making sense of raw data, and presenting ideas and results.

It appears that the Skittles company produces about the same amount of each color (based on proportions of the class total), although a disproportionate amount can certainly end up in a single bag.
The evidence also suggests the company does a decent job of keeping close to the same number of candies in each bag (a small standard deviation relative to the average), although outliers (lucky and unlucky bags) with several more or fewer Skittles than average can occur. Before this project I would have had no idea how to make sense of data like this and reach these conclusions. Statistics has turned out to be more interesting and useful than I ever thought.