Statistics Term Project Final Draft
Chad Joel Harrison
Statistics 1040
Sanborn
29 April 2015

Skittles Term Project

For my Statistics 1040 class at Salt Lake Community College, we were assigned a statistical analysis of the number of candies of each color (red, orange, yellow, purple, and green) in a 2.71 bag of Skittles. In total, the class recorded data from 40 bags of Skittles, and the information is broken down into the statistical categories that follow.

[Pie chart: Skittles Data, share of each color (Red, Orange, Yellow, Green, Purple); the shares range from 19% to 22%]

[Pareto chart: Skittles Data, colors in descending order of class count (Orange, Purple, Yellow, Red, Green); vertical axis from 400 to 550]

The data I gathered seem to fall mostly in line with the totals from the class data. My orange count was a little low compared to the class total, where orange was the largest group. Coming into this project, never having attempted something like this before, I would have imagined that the colors would all be evenly distributed.

             Red   Orange   Yellow   Green   Purple
My Data       11       12       10      12       16
Class Data   457      541      487     452      500

Class mean candies per bag: 60.9
Standard deviation: 1.93

5 Number Summary
Min: 54
Q1: 60
Q2 (Median): 61
Q3: 62
Max: 64

Column   Std. dev.   Median   Min   Max   Q1   Q3   n    Mean     IQR
Total    1.92        61       54    64    60   62   40   60.925   2

[Histogram: Skittles Data; horizontal axis: Amount of Skittles (54, 55, 59, 60, 61, 62, 63, 64, More); vertical axis: Frequency (0 to 15)]

The shape of the data is skewed left, as can be seen in the histogram. The total number of Skittles in my bag (61) is a fair representation of the mean number of Skittles per bag in the class data. The color counts from my bag match the class less closely: purple was clearly my highest count, while for the class it came second to orange. Likewise, orange made up a smaller share of my bag than it did for the class, where it was the largest of the color groups.

The categorical data in this project are the colors of the candies. The quantitative data are the values, that is, the totals or amounts of each color.
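To make the comparison between my bag and the class totals concrete, the counts listed above can be turned into proportions. The snippet below is only a small sketch in Python (the project itself used Excel and Statdisk); the names `MY_BAG`, `CLASS_TOTALS`, and `shares` are my own and not part of any tool used in class.

```python
# Color counts copied from the data listed above.
COLORS = ["Red", "Orange", "Yellow", "Green", "Purple"]
MY_BAG = [11, 12, 10, 12, 16]
CLASS_TOTALS = [457, 541, 487, 452, 500]


def shares(counts):
    """Return each count as a fraction of the total of all counts."""
    total = sum(counts)
    return [count / total for count in counts]


if __name__ == "__main__":
    # Print my bag's share and the class share side by side for each color.
    for color, mine, cls in zip(COLORS, shares(MY_BAG), shares(CLASS_TOTALS)):
        print(f"{color:<7} my bag {mine:6.1%}   class {cls:6.1%}")
```

Putting the shares side by side shows where my bag departs from the class: purple is about 26% of my bag against roughly 20.5% for the class, and orange is a little under 20% of my bag against about 22% for the class.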
The 5 number summary would look pretty silly if I tried to explain it using just the colors. I need both the categorical and quantitative information together for most of the data to be interpreted correctly. The pie chart is a good example of categorical and quantitative data working together, whereas the histogram and 5 number summary represent the quantitative data only.

Confidence Intervals

A confidence interval is a range of values, computed from a sample, that we use to estimate the true value of a population parameter. We will use the class data, a sample of 40 bags of Skittles, to estimate ranges in which the corresponding population values are likely to fall. Using the software Statdisk, I will lay out the results of the different confidence intervals we will be looking at.

Purple Skittles: 95% Confidence Interval for a Proportion
Margin of error, E = 0.016033
95% confidence interval (using normal approx.): 0.1891373 < p < 0.2212033

With these data, we are 95% confident that the true proportion of purple Skittles falls between 18.91% and 22.12%.

Mean Candies Per Bag: 99% Confidence Interval
Margin of error, E = 0.1007831
99% confident the population mean is within the range: 60.79922 < mean < 61.00078

This shows that we are 99% confident that the population mean number of Skittles per bag is between 60.80 and 61.00.

Standard Deviation: 98% Confidence Interval
98% confidence interval for the St. Dev.: 1.867668 < SD < 1.996436

These data tell us that we can be 98% confident that the population standard deviation is between 1.87 and 2.00, after rounding.

Hypothesis Testing

Hypothesis testing is where we make a claim about some characteristic of the population we are analyzing and then use the data collected from our sample to test that claim. Using the Statdisk software, we will test the given hypotheses.

Hypothesis Test: 20% of All Skittles Candies Are Green
H0: p = 0.20 (20% of Skittles are green)
Ha: p ≠ 0.20
Sample proportion: 0.1854739
Test statistic, z: -1.7927
Critical z: ±2.5758
P-value: 0.0730
99% confidence interval: 0.1651932 < p < 0.2057546

We fail to reject the null hypothesis that 20%, or 1/5, of Skittles are green. There is not sufficient evidence to warrant rejection of the claim that 20% of Skittles are green. The confidence interval from this test puts the proportion of green Skittles between about 17% and 21%.

Hypothesis Test: The Mean Number of Candies in a Bag of Skittles Is 56
H0: µ = 56 Skittles per bag
Ha: µ ≠ 56 Skittles per bag
t Test
Test statistic, t: 125.3333
Critical t: ±1.9609
P-value: 0.0000
95% confidence interval: 60.82334 < µ < 60.97666

We reject the null hypothesis that the mean number of Skittles in a bag is 56. There is sufficient evidence to warrant rejection of this claim. The estimated mean number of Skittles in a bag is between 60.8 and 61.0.

All of the confidence intervals and hypothesis tests were performed in Statdisk using the data we collected in class. To my surprise, most of the data collected by the class fell right in line with the confidence intervals and hypotheses we tested. Some of the values from my single bag, however, did not fall within those ranges.
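As a rough cross-check, the purple interval, the green test, and the comparison with my own bag can be reproduced from the class counts with the usual normal-approximation formulas. This is only a sketch in Python using the standard library, not what Statdisk does internally. The function names are my own, and it assumes the 2437 candies in the class sample can be treated as one simple random sample.

```python
from math import sqrt
from statistics import NormalDist

N_CLASS = 2437                 # total candies in the class sample
PURPLE_CLASS = 500             # purple candies in the class sample
GREEN_CLASS = 452              # green candies in the class sample
PURPLE_MINE, N_MINE = 16, 61   # purple count and total for my own bag


def proportion_ci(successes, n, confidence):
    """Normal-approximation confidence interval for a proportion.

    Returns (lower, upper, margin_of_error).
    """
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


def proportion_z_test(successes, n, p0):
    """Two-sided z test of H0: p = p0. Returns (z, p_value)."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


if __name__ == "__main__":
    low, high, margin = proportion_ci(PURPLE_CLASS, N_CLASS, 0.95)
    print(f"Purple 95% CI: {low:.7f} < p < {high:.7f} (E = {margin:.6f})")

    z, p_value = proportion_z_test(GREEN_CLASS, N_CLASS, 0.20)
    print(f"Green test: z = {z:.4f}, P-value = {p_value:.4f}")
    print("Reject H0 at the 0.01 level:", p_value < 0.01)

    my_share = PURPLE_MINE / N_MINE
    print(f"My purple share {my_share:.1%} inside the interval:",
          low <= my_share <= high)
```

The sketch gives the same margin of error and interval that Statdisk reported for purple, and the same z and P-value for green, which is a useful check that the counts were entered correctly.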
My purple Skittles were 26% of my bag, but the 95% confidence interval for the proportion of purple Skittles ran from 18.9% to 22.1%. Also, the standard deviation of the five color counts in my bag is 2.28, which is higher than the range of 1.87 to 2.00 given by the standard deviation interval. However, the number of candies in my bag was right on par with the confidence interval for the mean.

When we look at the larger sample of all the data collected by the class, the results line up more evenly with the confidence intervals we computed. The standard deviation was 1.93, which is within the estimated range. Purple Skittles accounted for 20% of the class sample, which is within the estimated range. The mean number of candies per bag was 61, which is also right in line with the estimated range.

All of the counts were performed by students in class, which leaves plenty of room for human error, but any such error would only slightly throw off the counts of the individual colors. Also, technology is only as good as the person entering the data into the computer or calculator, which also falls under human error. The only improvement I can see to the sampling method would be to have someone else double-check the counts from each bag, the entry of the data into Excel, and the entry of the data used for the confidence interval and hypothesis test calculations.

During the course of this project I learned over and over again how helpful technology was as I dove deeper and deeper into the world of statistics. I feel like I only began to unlock the potential of what you can accomplish using tools like Excel and Statdisk.
As helpful as these tools are, they also add a different element to the process of completing the assignment: I had to make sure that all the completed calculations made sense and that I didn’t let human error taint any of the work I had done. Thankfully, the internet is full of tutorials on how to use Excel to make different graphs and formulas. The problem-solving skills I used throughout this project mostly revolved around the use of technology. I made sure to double and triple check everything I entered into Excel and Statdisk. After a while I didn’t even have to look at the original sheet where I collected my data and transferred the class data over. I will probably be able to remember for the next few months that the total number of candies in the class sample was 2437.

I learned that even though some parts of this assignment, this class, and the homework seem at times tedious and time consuming, there is a need for statistics in many different aspects of the world we live in. I recently started a new job working for Two Men and a Truck as a Quality Control Manager, and there are statistics all over that place. From the data that Two Men and a Truck has collected over the decades they have been in business, they can give you a very good estimate of how much a move will cost, how much of the truck you will use, how many guys it will take to complete the job, and how much they can compensate a customer who has something damaged during the move. Math is not one of my favorite subjects, which is why my last few semesters at Salt Lake Community College have been dedicated to finishing the math classes I had been putting off until the end of my associate degree, but I have gained a greater appreciation for math, including statistics, and the way it helps us in our everyday lives.