1 Bayesian versus Orthodox statistics: Which side are you on?
RUNNING HEAD: Bayesian versus orthodox statistics

Zoltan Dienes
School of Psychology
University of Sussex
Brighton BN1 9QH UK
dienes@sussex.ac.uk

Abstract

Some common situations are presented where Bayesian and orthodox approaches to statistics come to different conclusions; you can see where your intuitions initially lie. The approaches are placed in the context of different notions of rationality and I accuse myself and others of having been irrational in the way we have been using statistics, that is, in using orthodox statistics. One notion of rationality is having sufficient justification for one’s beliefs. Assuming one can assign numerical continuous degrees of justification to beliefs, some simple minimal desiderata lead to the “likelihood principle” of inference. Hypothesis testing violates the likelihood principle, indicating that some of the deepest held intuitions we train ourselves to have as orthodox users of statistics are irrational on a key intuitive notion of rationality. I consider practical considerations so people can make a start at being Bayesian, if they so wish: If we want to, we really can change!

Keywords: Statistical inference, Bayes, significance testing, evidence

Introduction

Psychology and other disciplines have benefited enormously from having a rigorous procedure for extracting inferences from data. The question this article raises is whether we could not be doing it better. Two main approaches are contrasted, orthodox statistics versus the Bayesian approach. Around the 1940s the heated debate between the two camps was momentarily won in terms of what users of statistics did: Users followed the approach systematised by Jerzy Neyman and Egon Pearson (at least this approach defined norms; in practice researchers followed the somewhat different advice of Ronald Fisher; see e.g. Gigerenzer, 2004).
But it wasn’t that the intellectual debate was decisively won. It was more a practical matter: A matter of which approach had the mathematics most fully worked out for detailed application at the time; and which approach was conceptually easier for the researcher to apply. But now the practical problems have been largely solved; there is little to stop researchers being Bayesian in almost all circumstances. Thus the intellectual debate can be opened up again, and indeed it has (e.g. Hoijtink, Klugkist, & Boelen, 2008; Howard, Maxwell, & Fleming, 2000; Rouder et al., 2007; Taper & Lele, 2004). It is time for researchers to consider foundational issues in inference. And it is time to consider whether the fact it takes less thought to calculate p values is really an advantage, or whether it has led us astray in interpreting data (e.g. Harlow, Mulaik, & Steiger, 1997; Meehl, 1967; Royall, 1997; Ziliak & McCloskey, 2008), despite the benefits it has also provided. Indeed, I argue we would be most rational, under one intuitively compelling notion of rationality, to be Bayesians. To see on which side your intuitions fall, at least initially, we next consider some common situations where the approaches come to different conclusions.

Bayesian or Orthodox: Where do your intuitions fall?

Consider the following scenarios and see what your intuitions tell you. You might reject all the answers or feel attracted to more than one. Real research questions do not have pat answers. See if, nonetheless, you have clear preferences for one or a couple of answers over another. Almost all answers are consistent either with some statistical approach or with what a large section of researchers do in practice, so do not worry about picking out the one ‘right’ answer (though, given certain assumptions, I will argue that there is one right answer!).

1) You have run the 20 subjects you planned and obtain a p value of .08.
Despite predicting a difference, you know this won’t be convincing to any editor and run 20 more subjects. SPSS now gives a p of .01. Would you:
a) Submit the study with all 40 participants and report an overall p of .01?
b) Regard the study as non-significant at the 5% level and stop pursuing the effect in question, as each individual 20-subject study had a p of .08?
c) Use a method of evaluating evidence that is not sensitive to your intentions concerning when you planned to stop collecting subjects, and base conclusions on all the data?

2) After collecting data in a three-way design you find an unexpected partial two-way interaction, specifically you obtain a two-way interaction (p = .03) for just the males and not the females. After talking to some colleagues and reading the literature you realise there is a neat way of accounting for these results: certain theories can be used to predict the interaction for the males but they say nothing about females. Would you:
a) Write up the introduction based on the theories leading to a planned contrast for the males, which is then significant?
b) Treat the partial two-way as non-significant, as the three-way interaction was not significant, and the partial interaction won’t survive any corrections for post hoc testing?
c) Determine how strong the evidence of the partial two-way interaction is for the theory you put together to explain it, with no regard to whether you happen to think of the theory before seeing the data or afterwards, as all sorts of arbitrary factors could influence when you thought of a theory?

3) You explore five possible ways of inducing subliminal perception as measured with priming. Each method interferes with vision in a different way. The test for each method has a power of 80% for a 5% significance level to detect the size of priming produced by conscious perception.
Of these methods, the results for four are non-significant and one, the Continuous Flash Suppression, is significant, p = .03, with a priming effect numerically similar in size to that found with conscious perception. Would you:
a) Report the test as p = .03 and conclude there is subliminal perception for this method?
b) Note that when a Bonferroni-corrected significance value of .05/5 is used, all tests are non-significant, and conclude subliminal perception does not exist by any of these methods?
c) Regard the strength of evidence provided by these data for subliminal perception produced by Continuous Flash Suppression to be the same regardless of whether or not four other rather different methods were tested?

4) A theory predicts a difference in reaction time between two conditions. A previous study finds a significant difference between the conditions of 20 seconds, with a Cohen’s dz of 0.5. You wish to replicate in your lab. In order to obtain a conventional power of 80% you run 35 subjects and find a t of 1.0 and a p of .32. Would you:
a) Conclude that under the conditions of your replication experiment there is no effect?
b) Conclude that null results are never informative and withhold judgment about whether there is an effect or not?
c) Realise that while 20 seconds is a likely value given the theory being tested, the difference could in fact be 15 seconds either side of this value and still be consistent with the theory. You treat the evidence as inconclusive; e.g. your certainty in the theory might go down modestly from being about 65% to a bit more than 50%, and so you decide to run more subjects until the evidence more strongly supports the null over the theory or the theory over the null?

5) You look up the evidence for a new expensive weight loss pill. Use of the pill resulted in significant weight loss after 3 months daily ingestion with a before-after Cohen’s dz of 1.0 with n=10 subjects giving a p of .01.
In addition, you accept that there are no adverse side effects. Would you:
a) Reject the null hypothesis of no change and buy a 3 months’ supply?
b) Decide 10 subjects does not provide enough evidence to base a decision on when it comes to taking a drug, withhold judgment for the time being, and help sponsor a further study?
c) Decide that in a 3-month period you would like to lose between 10 and 15kg. In fact, despite the high standardised effect size, the raw mean weight loss in the study was 2kg. The evidence that the pill uses a mechanism producing 0-10kg loss (which you are not interested in) rather than 10-15kg (which you are) is overwhelming. You have sufficient data to decide not to buy the pill?

We will discuss answers to this quiz below. But first we need to establish what the rational basis for orthodox and Bayesian statistics consists in and why they can produce different answers to the above questions.

Rationality

What is it to be rational? One answer is it is having sufficient justification for one’s beliefs; another is that it is a matter of having subjected one’s beliefs to critical scrutiny. Popper and others inspired by him took the second option under the name of critical rationalism (e.g. Popper, 1963; Miller, 1994). On this view, there is never a sufficient justification for a given belief because knowledge has no absolute foundation. Propositions can be provisionally accepted as having survived criticism, given other propositions those people in the debate are conventionally and provisionally willing to accept. All we can do is set up (provisional) conventions for accepting or rejecting propositions. An intuition behind this approach is that irrational beliefs are just those not subjected to sufficient criticism.
Critical rationalism bears some striking similarities to the orthodox approach to statistical inference, the Neyman Pearson approach (an approach almost universally used by users of statistics, but few would know it by that name, or any other). On this view, statistical inference cannot tell you how confident to be in different hypotheses; it only gives conventions for behavioural acceptance or rejection of different hypotheses, which, given a relevant statistical model (which can itself be subjected to testing), results in pre-set long term error rates being controlled. One cannot say how justified a particular decision is or how probable a hypothesis is; one cannot give a number to how much data supports a given hypothesis (i.e. how justified the hypothesis is, or how much its justification has changed); one can only say that the decision was made by a decision procedure that in the long run controls error probabilities (as objective probabilities in the sense of long run frequencies) (see Dienes, 2008, chapter 3, for a conceptual introduction to this approach; also Oakes, 1986; and Royall, 1997, for why p values do not provide such degrees of support).

Note probability is a long run relative frequency so it does not apply to the truth of hypotheses, nor even to particular experiments. It is the long run relative frequency of errors for a given decision procedure. It can be obtained from the tail area of test statistics (e.g. tail area of t-distributions), adjusted for factors that affect long run error rates, like how many other tests are being conducted. These error rates apply to decision procedures not to individual experiments. An individual experiment is a one-off event, so it does not determine a unique long-run set of events; but a decision procedure can in principle be considered to apply over a long run indefinite number of events (i.e. experiments).
Now consider the other approach to rationality, that it is a matter of having sufficient justification for one’s beliefs. If we want to assign numerical degrees of justification (i.e. of belief) to propositions, what are the rules for logical and consistent reasoning? Cox (1946; see Sivia & Skilling, 2006) took two minimal desiderata, namely that
1. If we specify degree of belief in P we have implicitly specified degree of belief in not-P
2. If we specify degree of belief in P and also specify degree of belief in (Q given P) then we have implicitly specified degree of belief in (P&Q)
Cox did not assume in advance what form this specification was nor what the relationships were; just that the relationships existed. Using deductive logic Cox showed that degrees of belief must follow the axioms of probability if we wish to accept the above minimal constraints. Thus, if we want to determine by how much we should revise continuous degrees of belief, we need to make sure our system of inference obeys the axioms of probability. In my experience, researchers think all the time in terms of the degree of support data provide for a hypothesis. If they want to think that way, they should make sure their inferences obey the axioms of probability.

One version of such continuous degrees of belief is subjective probability, i.e. personal conviction in an opinion (e.g. Howson & Urbach, 2006). One can home in on one’s initial personal probabilities by various gambling games (see Dienes, 2008, chapter four, for an introductory review of these ideas). This can be a useful idea for how one could have probabilities for different propositions when it is hard to specify clear and full reasons for why the probabilities must have certain values. It is natural that people regard the same theory as being more or less plausible, and that probabilities can be personal.
However, when probabilities of different propositions form part of the inferential procedure we use in deriving conclusions from data then we need to make sure that the procedure is fair. Thus, there has been an attempt to specify “objective probabilities” that follow from the informational specification of a problem (e.g. Jaynes, 2003). This will be a useful way of thinking about probabilities for evaluating how much data support different hypotheses. In this sense, probabilities can be normative convictions a person should have given the constraints and information made explicit in the statement of the problem. In this way, the probabilities become an objective part of the problem, whose values can be argued about given the explicit assumptions, and do not depend in any further way on personal idiosyncrasies. Note this sort of probability can be regarded as consistent with critical rationalism (despite Popper’s aversion to Bayes): The assumptions defining the problem are without absolute foundation, they are open to criticism, but can be debated until tentatively accepted. In any case, whatever probabilities one starts with (entirely subjective personal ones gathered by reaching deep in one’s soul, or objectively specified ones given stated constraints), Bayesian inference insists that one must revise these initial probabilities in the light of data in ways consistent with the axioms of probability.

In the Bayesian approach, probability applies to the truth of theories (the relative frequency notion of probability as used in Neyman Pearson statistics does not apply to theories). Thus we can answer questions about p(H), the probability of a hypothesis being true (our prior probability), and also p(H|D), the probability of a hypothesis given data (our posterior probability), neither of which we can do on the orthodox approach. The probability of obtaining the exact data we got given the hypothesis is the likelihood.
From the axioms of probability, it follows directly that:

Posterior is given by likelihood times prior

From this theorem (Bayes’ theorem) comes the likelihood principle: All information relevant to inference contained in data is provided by the likelihood (e.g. Birnbaum, 1962). When we are determining how given data relatively changes the probability of our different theories, it is only the likelihood that connects the prior to the posterior.

The likelihood is the probability of obtaining the exact data obtained given a hypothesis, P(D|H). This is different from a p-value, which is the probability of obtaining the same data or data more extreme given both a hypothesis and a decision rule. Thus, a p-value for a t-test is a tail area of the t-distribution (adjusted according to the decision rule); the corresponding likelihood is the height of the t-distribution at the point representing the data - not an area and certainly not an area adjusted for the decision rule. In orthodox statistics these adjustments must be made because they accurately reflect the factors that affect long term error rates of a decision procedure.

The likelihood principle may seem a truism; it seems to just follow from the axioms of probability. But in orthodox statistics, p-values are changed according to the decision rule: How one decided to stop collecting data; whether or not the test is post hoc; how many other tests one conducted. None of these factors influence the likelihood. Thus, orthodox statistics violates the likelihood principle. I will consider each of these cases because they have been used to argue Bayesian inference must be wrong, given that we have been trained as researchers to regard these violations of the likelihood principle to be a normative part of orthodox statistical inference. But these violations of the likelihood principle also lead to bizarre paradoxes.
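To make the distinction between a likelihood and a p-value concrete, here is a minimal sketch. It assumes a standard normal (z) statistic in place of the t-distribution mentioned in the text, purely so that only the Python standard library is needed, and it uses a Bonferroni adjustment as one example of a decision-rule correction. The observed value and the number of tests are hypothetical, and the curve height is the likelihood only up to a constant of proportionality.

```python
import math


def normal_density(z):
    """Height of the standard normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def two_tailed_p(z):
    """Area in both tails beyond |z| under the standard normal curve."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z_observed = 2.0   # hypothetical test statistic, for illustration only
n_tests = 5        # hypothetical number of tests in the family

# Likelihood of the null: the height of the curve at the observed point.
likelihood_null = normal_density(z_observed)

# p-value: a tail area, before and after adjusting for the decision rule.
p_single = two_tailed_p(z_observed)
p_bonferroni = min(1.0, n_tests * p_single)

print(f"likelihood under the null: {likelihood_null:.4f}")
print(f"p, one planned test:       {p_single:.4f}")
print(f"p, Bonferroni over 5:      {p_bonferroni:.4f}")
```

Changing `n_tests` changes the adjusted p-value but leaves `likelihood_null` untouched, which is the sense in which the decision rule does not enter the likelihood.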
I will argue that when the full context of a problem is taken into account, the arguments against Bayes based on these points fail. On the other hand, where subjective probabilities play a role in the construction of the likelihood, care does need to be taken in establishing inter-subjective consensus for Bayesian inferences to be generally acceptable.

The Bayes factor

One form of Bayesian analysis pits one theory against another, say theory1 against theory2. Theory1 could be your pet theory put under test in an experiment; theory2 could be the null hypothesis, or some other sort of default position. If your personal probability of theory1 being true before the experiment is P(theory1) and that for theory2 is P(theory2), then your prior odds in favour of theory1 over theory2 are P(theory1)/P(theory2). These prior probabilities and prior odds can be entirely personal or subjective; there is no reason why people should agree about these before data are collected. Once data are collected we can calculate the likelihood for each theory. These likelihoods are things we want people to agree on; thus, any probabilities that contribute to them should be plausibly or simply determined by the specification of the theories. The Bayes factor B is the ratio of the likelihoods. From the axioms of probability,

Posterior odds = B × prior odds

If B is greater than 1 then the data supported your experimental hypothesis over the null. If B is less than 1, then the data supported the null hypothesis over the experimental one. If B is about 1, then the experiment was not sensitive (Jeffreys, 1961, suggests Bayes factors above 3 - or below 1/3 - are “substantial”). Note that B automatically gives a notion of sensitivity; it directly distinguishes data supporting the null from data uninformative about whether the null or your theory was supported. Contrast this state of affairs with just relying on p values in significance testing.
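Before turning to that contrast, the updating rule can be sketched in a few lines. The function names are hypothetical, the thresholds of 3 and 1/3 are the Jeffreys conventions quoted in the text, and converting odds to a probability assumes theory1 and theory2 are the only two candidates under consideration.

```python
def posterior_odds(prior_odds, bayes_factor):
    """Posterior odds of theory1 over theory2 = B * prior odds."""
    return bayes_factor * prior_odds


def odds_to_probability(odds):
    """Probability of theory1, assuming theory1 and theory2 exhaust the options."""
    return odds / (1.0 + odds)


def describe(bayes_factor, threshold=3.0):
    """Label a Bayes factor using the 3 and 1/3 conventions from the text."""
    if bayes_factor > threshold:
        return "substantial evidence for theory1 over theory2"
    if bayes_factor < 1.0 / threshold:
        return "substantial evidence for theory2 over theory1"
    return "insensitive: the data do not discriminate the theories"


# Two hypothetical readers with different prior odds see the same data (B = 4).
for prior in (0.5, 2.0):
    post = posterior_odds(prior, 4.0)
    print(prior, post, round(odds_to_probability(post), 3), describe(4.0))
```

The two readers end with different posterior odds, but both multiply their prior odds by the same B, which is why B is the quantity people can be expected to agree on.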
The most common mistake people make is believing they can simply take a non-significant p-value, and from this alone decide that the null was supported over the theory. In fact, the Neyman Pearson approach itself proscribes this. It says one should calculate power, and only accept the null when power was high. If people followed the Neyman Pearson approach as it should be done, they would know when a null result meant one could accept the null and when it meant withholding decision. Unfortunately, one CAN simply calculate p-values without calculating power; this is easier, so that’s what people do. In Bayes one does not have the choice: You get the full answer whether you want it or not. This is a practical reason for preferring Bayes over Neyman Pearson. According to this argument, both approaches would work fine if done correctly; but Bayes forces one to do it correctly and Neyman Pearson obviously does not. Decades of statisticians admonishing psychologists have not made them do it properly (see e.g. Cohen, 1977, 1994; Harlow, Mulaik, & Steiger, 1997; Hunter & Schmidt, 2004).

Now I wish to consider reasons for preferring Bayes even if Neyman Pearson is done correctly. And those reasons are based on the ways in which Neyman Pearson departs from the likelihood principle.

Problems with Neyman Pearson

1) On the Neyman Pearson approach one must specify the “stopping rule” in advance, i.e. the conditions under which one would stop collecting data. Once those conditions are met, there is to be no more data collection. Typically this means you should plan in advance how many subjects you will run (with a power calculation). You are forbidden from running until you get a significant result – because if you try you will always succeed, given sufficient time, even if the null is true.
Further, one cannot plan to run 30 subjects, find a p of .06, and then run 10 more, and report the p-value of .04 SPSS now delivers for the full set of 40 subjects, and declare it significant at the 5% level. Five percent is an inaccurate reflection of the error rate of the decision procedure one used (that procedure being: “Run 30 subjects, test, if non-significant, run 10 more subjects and test again”). You cannot test your data once at the .05 level, then run some more subjects, and test again at the .05 level. The Type I error rate is no longer .05 because you gave yourself two chances at declaring significance. Remember on the Neyman Pearson approach probabilities are long run relative frequencies – i.e. the long run properties of your decision procedure, and thus do not apply to your individual experiment. You will make a Type I error 5% of the time at the first test; the second test can only increase the percentage of Type I errors. The significance level can thus never be less than .05, no matter how many subjects one subsequently runs and how strong the evidence seems against the null and no matter what SPSS tells you the p-value is. Each test must be conducted at a lower significance level for the overall error rate to be kept at .05.

This puts one in an impossible moral dilemma having tested once at the 5% level if an experiment after 30 subjects yields a p of .06. One cannot reject the null on that number of subjects; yet one cannot accept it either (no matter what the official rules are, would you accept the null for a p of .06?). One cannot publish the data; yet one cannot in good heart bin the data and waste public resources. Bayes avoids this impossible moral dilemma. When calculating a Bayes factor, it does not matter when you decide to stop running subjects. You can always run more subjects if you think it will help.

Figure 1. The importance of the stopping rule for significance testing

Consider the cartoon in Figure 1.
Two researchers come to the end of running 40 subjects. One calculates the p-value based on the full set of data without correction: he had planned to run that many from the start and only test once at the end of data collection. He gets a significant result and decides to reject the null. The other researcher tested after 20 subjects, found it non-significant, so tested again at the end of 40. But he wasn’t cheating: he made the appropriate corrections (see Armitage, Berry and Matthews, 2002, pp 615-623 for examples of legitimate corrections). Now on the final test, because of the correction, the result is non-significant.

Notice how the appropriate inference (to accept or reject the null) depends on more than the likelihood, that is, on more than how likely these precise data are given the null hypothesis or theory under test. It depends not just on what the data are but also on what might have happened, even if it did not. In particular, appropriate inference depends on whether the researchers would have stopped after 20 subjects if it had been significant – even though it wasn’t, and they didn’t. Notice in the cartoon that if whether they would have stopped collecting data after 20 subjects, even though they didn’t, depends on who has the better kung fu, then the mathematically correct result depends on whose kung fu is better! And if whether they would have stopped collecting data after 20 subjects depends on who has the strongest unconscious desire to please the other then the mathematically correct answer depends on whose unconscious wish to please the other is strongest!

This example might seem deliberately absurd but the point is very real and practical. How many people have looked at the results some way through testing – how many would have stopped collecting if the results had been clear enough then? Would you? (Are you sure?) How many people “top up” with just a few more subjects? I believe the practice to be very common.
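What uncorrected topping up does to the long run error rate can be checked by simulation. The sketch below is illustrative only: it assumes a one-sample z-test with known standard deviation of 1 (a simplification of the t-test a researcher would normally run), a true null, and the uncorrected procedure “test after 20 subjects; if non-significant, run 20 more and test again at the same .05 level”. All names and parameter values are hypothetical.

```python
import math
import random


def z_test_p(sample):
    """Two-tailed p for mean = 0, assuming a known standard deviation of 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))


def two_look_rejects(rng, n_first=20, n_extra=20, alpha=0.05):
    """One simulated experiment under the null with an uncorrected top-up."""
    data = [rng.gauss(0.0, 1.0) for _ in range(n_first)]
    if z_test_p(data) < alpha:
        return True                      # would have stopped and rejected here
    data += [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
    return z_test_p(data) < alpha        # second chance at significance


def type_i_error_rate(n_sims=10000, seed=1):
    """Long run proportion of false rejections for the two-look procedure."""
    rng = random.Random(seed)
    rejections = sum(two_look_rejects(rng) for _ in range(n_sims))
    return rejections / n_sims


rate = type_i_error_rate()
print(f"Type I error rate with one uncorrected top-up: {rate:.3f}")
```

The estimated rate should come out noticeably above the nominal .05, which is the sense in which SPSS’s final p-value misdescribes the procedure actually followed.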
As I said, there are ethical problems with not doing so in many cases. And maybe many people justify it because in their hearts they believe in the likelihood principle: Surely what subjective intentions are concealed in the mind are irrelevant to drawing inferences from data; what matters is just what the obtained data are. The problem is, one cannot believe the likelihood principle and follow Neyman Pearson techniques. Long term error rates using significance tests are affected by counterfactuals, even if likelihoods are not.

The Bayes factor behaves differently from p-values as more data are run (regardless of stopping rule). For a p-value, if the null is true, any value in the interval 0 to 1 is equally likely no matter how much data you collect. For this reason, sooner or later you are guaranteed to get a significant result if you run subjects for long enough. In contrast, as you run more subjects and the null is true, or closer to the truth than your alternative, the Bayes factor is driven towards zero. You could run an infinite number of subjects and never have B achieve a given value, e.g. 4. Hence you can run as many subjects as you like, stopping when you like. Isn’t this what researchers really want of their inferential statistics? And if this sounds more sensible, it is because it is literally more rational.

2) On Neyman Pearson, it matters whether you formulated your hypothesis before or after looking at the data (post hoc vs planned comparisons): Predictions made in advance of rather than after looking at the data are treated differently. In Bayesian inference, it does not matter what day of the week you thought of your theory. The evidence for your theory is just as strong regardless of its timing relative to the data. This is because the likelihood is unaffected by the time the data were collected. Note the likelihood principle contradicts not only Neyman Pearson on this point, but also the advice of e.g.
Popper (1963) and Lakatos (1978), who valued the novelty of predictions (though Lakatos later gave up the importance of temporal novelty, Lakatos & Feyerabend, 1999, pp 109-112). Kerr (1998) also criticised the practice of HARKing: Hypothesising After the Results are Known. Indeed, novel predictions are often impressive as support for a theory. But this may be because in making novel predictions the choice of auxiliary hypotheses (i.e. those hypotheses implicitly or explicitly used in connecting theory to specific predictions) was fair and simple. Post hoc fitting can involve preference (for no good reason) for one auxiliary over many others of at least equal plausibility. Thus, a careful consideration of the reason for postulating different auxiliaries should render novelty irrelevant as a factor determining the evidential value of data. That is, the issue is not the timing of the data per se, but the prior probability of the hypotheses involved. And prior probability is something Bayes is uniquely well equipped to deal with!

Consider an example that has been used as a counterargument to the likelihood principle. I have a pack of cards face down. I lift up the top card. It is a six of hearts. Call the hypothesis that the pack is a standard pack of playing cards “Hs”. Call the hypothesis that all cards in the pack are sixes of hearts “H6h”. The likelihood of drawing a six of hearts given the pack is a standard pack is 1/52. So L(Hs) = 1/52. The likelihood of drawing a six of hearts given the hypothesis that the pack consists only of 52 sixes of hearts is 1. So L(H6h) = 1. So the Bayes factor in favour of the pack consisting only of sixes of hearts versus being a standard pack is 52. If you saw a pack of cards face down your initial prediction would be that they are a standard pack of playing cards. The drawing of a single six of hearts would not change your mind.
Surely the hypothesis that they are all sixes of hearts is purely post hoc, a mindless fitting of the data! If Bayesian statistics support the post hoc theory over the theory that it is a standard pack of cards, surely something is wrong with Bayes! If someone could predict in advance that the pack was all sixes of hearts that would be one thing; but that is precisely the point, they wouldn’t. By missing out on the importance of what can be predicted in advance, does not the likelihood principle fail scientists?

Remember the likelihood principle follows from the axioms of probability. The axioms of probability are by their nature almost self-evident assumptions. They will not lead to wrong conclusions. Indeed, in this case, if we put the problem in its full context, we see the Bayesian answers are very sensible (see Royall, 1997, p 13, for the following argument). Before we pick up the card there are 52 hypotheses that the pack is all of one sort of card – the hypothesis that it is all aces of hearts, all twos of hearts, and so on. Call the probability that one or other of these hypotheses is true π. So the probability of any one of them being true is π/52, assuming we hold them to all have equal probability. Once we have observed the six of hearts, all these hypotheses go to zero, except for H6h. The probability of that hypothesis goes to π. The probability that the whole pack is all of one sort of card remains the same – it is still π. The probability that it is a standard pack of cards remains the same. The axioms of probability and their consequence, Bayes’ theorem, give us just the right answer. There is no need to introduce an extra concern with ability to predict in advance; that concern is already implicitly covered in the Bayesian approach. It is not the ability to predict in advance per se that is important; that ability is just an (imperfect) indicator of the prior probability of relevant hypotheses.
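The card argument can be checked with exact arithmetic. This is a minimal sketch under the assumptions stated in the text: the only hypotheses entertained are a standard pack and the 52 packs made of a single sort of card, with the latter sharing prior probability π equally. The value chosen for π, the index standing for the six of hearts, and all names are hypothetical.

```python
from fractions import Fraction

CARDS = 52
SIX_OF_HEARTS = 0  # arbitrary index standing in for the observed card


def card_posteriors(pi):
    """Posterior over 'standard' and the 52 one-sort packs after one draw."""
    prior = {"standard": 1 - pi}
    likelihood = {"standard": Fraction(1, CARDS)}
    for card in range(CARDS):
        prior[card] = pi / CARDS
        # A pack of only this card yields the observed card with probability 1 or 0.
        likelihood[card] = Fraction(1 if card == SIX_OF_HEARTS else 0)
    weights = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# Bayes factor for H6h over Hs: ratio of the two likelihoods.
bayes_factor = Fraction(1) / Fraction(1, CARDS)

pi = Fraction(1, 1000)  # hypothetical prior that the pack is all one sort of card
post = card_posteriors(pi)
print("B =", bayes_factor)
print("P(H6h | six of hearts) =", post[SIX_OF_HEARTS])
print("P(Hs  | six of hearts) =", post["standard"])
```

With these assumptions the posterior for the standard pack equals its prior, 1 − π, even though B = 52 for H6h against Hs: the factor of 52 is exactly absorbed by the 1/52 share of π that H6h started with.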
The data provide stronger or weaker evidence for different hypotheses regardless of the day of the week the data were collected. When performing Bayesian inference there is no need to adjust for the timing of predictions per se. Indeed, it would be paradoxical to do so: Adjusting conclusions according to when the hypothesis was thought of would introduce irrelevancies into inference, leading to one conclusion on Tuesday and another on Wednesday for the same data and hypotheses.

3) On Neyman Pearson you must correct for how many tests you conduct in total. For example, if you ran 100 correlations and 4 were just significant at the 5% level, researchers would not try to interpret those significant results. On Bayes, it does not matter how many other statistical hypotheses you investigated. All that matters is the data relevant to each hypothesis under investigation.

Consider an example from Dienes (2008) to first pump your intuitions along Neyman Pearson lines; and then, as above, we will see how the axioms of probability do indeed give us the sensible answer, and hopefully your intuitions come to side with Bayes. The example is about searching for the reincarnation of a recently departed lama by a search committee set up by the Tibetan Government-in-exile. The lama’s walking stick is put together with a collection of 20 others. Piloting at a local school shows each stick is picked equally often by children in general. We have now set up a test with a known and acceptable test-wise Type I error probability, i.e. controlled to be less than 5% for each individual test. If a given candidate picks the stick, p = 1/21 < .05. Various omens narrow the search down to 21 candidate children. They are all tested and one of these passes the test. Can the monks conclude that he is the reincarnation?

The Neyman Pearson aficionado says, “No! With 21 tests family-wise error rate = 1 – (20/21)^21 = 0.64. This is unacceptably high.
Of course if you test enough children, sooner or later one of them will pass the test. That proves nothing. As Bayes does not correct for multiple testing, surely the Bayesian approach must be wrong!” The Bayesian argues like this. “Assume the reincarnation will definitely choose the stick. If the 10th child chose the stick, the Bayes factor B for the 10th child = 21. Whatever your prior odds that the 10th child was the reincarnation they should be increased by a factor of 21.” The Neyman Pearson aficionado cannot contain himself. “Ha! You have manufactured evidence out of thin air! By ignoring the issue of multiple testing you have found strong evidence in favour of a child being the reincarnation just because you tested many children!” (see e.g. Mayo, 1996, 2004, for this argument against Bayes). The Bayesian patiently continues with an argument similar to the one in the previous section, “The likelihood of any child who did not choose the stick is 0. Call the prior probability that one or other of these children was the reincarnation π. If the prior probabilities for each individual child are equal, the prior probability that any one is the reincarnation = π/21. After data, twenty of these go to zero. One goes to 21 * π/21 = π. They still sum to π. If you were convinced before collecting the data that the null was false you can pick the reincarnation with confidence; conversely if you were highly confident in the null beforehand you should be every bit as confident afterwards. And this is just as it should be!” The Bayesian answer does not need to correct for multiple testing because if an answer is already right it does not need to be corrected. Once one takes into account the full context, the axioms of probability lead to sensible answers, just as one would expect.
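Both sides of the lama dialogue can be reproduced numerically. The sketch below assumes, as the Bayesian does above, that the reincarnation picks the stick with certainty and that any other child picks it with probability 1/21; the value of π (1/10) is illustrative only, and the function names are hypothetical.

```python
from fractions import Fraction


def familywise_error(per_test_alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - per_test_alpha) ** n_tests


def lama_posteriors(pi, chooser, n_children=21, n_sticks=21):
    """Posterior over 'child i is the reincarnation' and 'none of them is'.

    Data: exactly one child (`chooser`) picked the lama's stick.
    """
    p_pick = Fraction(1, n_sticks)               # chance pick by an ordinary child
    others_miss = (1 - p_pick) ** (n_children - 1)

    children = range(1, n_children + 1)
    prior = {child: pi / n_children for child in children}
    prior["none"] = 1 - pi

    # If child i were the reincarnation and did not pick the stick, the data
    # would be impossible, so those likelihoods are zero.
    likelihood = {child: Fraction(0) for child in children}
    likelihood[chooser] = others_miss            # picks with certainty
    likelihood["none"] = p_pick * others_miss    # picked by chance

    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    posterior = {h: v / total for h, v in unnormalised.items()}
    bayes_factor = likelihood[chooser] / likelihood["none"]
    return posterior, bayes_factor


fwe = familywise_error(Fraction(1, 21), 21)
posterior, bf = lama_posteriors(Fraction(1, 10), chooser=10)
print("family-wise error rate:", round(float(fwe), 2))
print("Bayes factor for child 10:", bf)
print("P(child 10 is the reincarnation | data):", posterior[10])
print("P(no child is the reincarnation | data):", posterior["none"])
```

With these assumptions the Bayes factor for the successful child is 21, yet the probability that some child is the reincarnation stays at π and the probability of the null stays at 1 − π, which is why no further correction for multiple testing is needed.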
As I point out in Dienes (2008), a family of 20 tests in which one is significant at the .05 level typically leads one by Bayesian reasoning to have MORE confidence in the family-wise null hypothesis that ‘all nulls are true’ while decreasing one’s confidence in the one null that was significant. And this fits one’s intuitions that if evidence went against the null in 4 out of 100 correlations, one would be more likely to think the complete null is true, but still find oneself more likely to reject the null for the four specific cases. If all 100 correlations bore on a theory that predicted non-zero correlations in all cases, then one’s confidence in that theory would typically decrease by a Bayesian analysis. The moral is that in assessing the evidence for or against a theory, one should take into account ALL the evidence relevant to the theory, and not cherry pick the cases that seem to support it. Cherry picking is wrong on all statistical approaches. A large number of results showing evidence for the null against a theory still count as against the theory, even if a few of the effects the theory predicted are supported. And Bayes gives one the apparatus for combining such evidence to come to an overall conclusion, an apparatus missing in Neyman Pearson. Thus, it is Bayes, rather than Neyman Pearson, that is most likely to demand that researchers draw appropriate conclusions from a body of relevant data involving multiple testing. Bayes factors close to zero count as evidence against the theory; in orthodox practice, non-significant values are either left to count or not depending on whim.

4) Finally, because the likelihood tells one by how much to change confidence, and because significance testing violates the likelihood principle, the Neyman Pearson approach does not tell one how to change confidence. A significant difference (i.e.
accepting the experimental hypothesis on Neyman Pearson) can mean one should reduce one’s confidence in a theory that predicted the difference... and a null result can mean confidence in theory should increase (see Dienes 2008, chapter four, and below for examples). Neyman Pearson does not tell you how justification for a theory has changed. No matter what anyone tells you, Neyman-Pearson analyses do not directly license assigning any degree of confidence to one conclusion rather than another. If you are interested in the extent to which you should change your confidence in a theory, you must do a Bayesian analysis. Consider for example if a particular theory has often predicted effects which turn out to be about 100 ms in size. In a new prediction in a domain where one would expect the same size effect, a significant effect of 5 ms is obtained. This may well count strongly against the theory. To summarise the arguments of this section, key differences between the approaches that follow from the likelihood principle are shown in Table 1.

Table 1
Contrasts between Bayesian and orthodox statistics following from whether or not the likelihood principle is obeyed. Because orthodox statistics are sensitive to the factors listed, contrary to the likelihood principle, in each case different people with the same data and hypothesis may come to opposite conclusions.

Orthodox column: Could this factor affect whether or not a null hypothesis is rejected?
Bayes column: Does this factor ever affect the support of data for a hypothesis?

Factor: When you initially intended to stop running participants
  Orthodox: Yes
  Bayes: No. You can always run more participants to acquire clearer evidence if you wish

Factor: Whether or not you predicted a result in advance of obtaining it
  Orthodox: Yes
  Bayes: No. No one need ever try to second guess which really came first.

Factor: The number of tests that can be grouped in a family
  Orthodox: Yes
  Bayes: No. Please test as many different hypotheses as is worth your time – but you must take into account all evidence relevant to a theory.

Your answers to the quiz

Now consider the situations we started with. What are your intuitions now? In all cases, answer c) is the Bayesian answer. For question 1, I suspect a majority of researchers have at some time taken a) as their answer in similar cases. They have Bayesian intuitions, but use them with the wrong tools, the only tools apparently available, and tools inappropriate for actually cashing out the intuition. Choice a) is also the answer one might pick by thinking with a meta-analytic mind set, but use of meta-analysis here is complicated by the fact the stopping rule was conditional upon obtaining a significant finding. Thus, the correct orthodox answer is to regard the data as non-significant, as in b). Power may be low, but in effect one committed to that level of Type II error in planning the study. Answer c) spells out the intuition, and Bayes provides the tools for implementing it. For question 2, again I suspect many people have decided a) in similar circumstances, because of the Bayesian intuitions in c), and so used the wrong tools for the right reasons. One suspects in many papers the introduction was written entirely in the light of the results. We implicitly accept this as good practice, indeed train our students to do likewise for the sake of the poor reader of our paper. But b) is the correct answer based on the Neyman Pearson approach, and maybe your conscience told you so. But should you be worrying about what might be murky – which really came first, data or hypothesis? – or, rather, about what really matters, whether the predictions really follow from a substantial theory in a clear simple way? For question 3, practice may vary depending on whether the author is a believer or sceptic in subliminal perception.
After all, there is no strict standard about what counts as a “family” for the sake of multiple testing. There is a pull between accepting the intuition in c) that surely there is evidence for this method, and the realisation that more tests means more opportunities for inferential mistakes. But one should not confuse strength of evidence with the probability of obtaining it (Royall, 1997). Evidence is evidence even if, as one increases the circle of what tests are in the “family”, the probability that some of the evidence will be misleading increases. For question 4, many researchers may conclude a), and indeed feel that by taking power into account they are morally ahead of the pack of typical researchers who ignore power. Those schooled in Fisherian intuitions may choose b). A Bayesian analysis forces one to consider the range of effect sizes consistent with a theory. As soon as one considers this question, whether as a Bayesian or not, it will become apparent that the effect obtained in a previous study does not define the lowest effect size one is interested in. Thus, even in studies that do take power into account, likely they do not consider a minimally interesting effect, and likely they use rather low power (80%) – rather low that is, for obtaining evidence that could substantially favour a null hypothesis over the theory of interest. Because orthodox methods are not based on how persuasive you should find evidence, they often allow conclusions that in fact should not be persuasive. And without doing a Bayesian analysis, you have no idea how persuasive they should be. Question 5 again illustrates a case where orthodox statistics could produce the same answer as Bayes: One could calculate a confidence interval on raw weight loss and see that it excludes the values one is interested in. In this case, the advantage of the Bayesian approach is that it forces you to consider what range of effects you are really interested in.
It forces you to take into account that which is inferentially relevant. Orthodox statistics do not (cf e.g. Kirsch, 2009). Now, dear reader and fellow journey person: Are you a closet, or indeed an out, Bayesian? What inferential methods seem rational to you? If you want some pointers in bringing out your inner Bayesian, read on!

Bayes factors in practice

The last example illustrates the importance of considering the size of an effect in using it to evaluate a theory. Effect size is very important in the Neyman Pearson approach: One must specify the sort of effect one predicts in order to calculate power. On the other hand, Fisherian significance testing leads people to ignore effect sizes. People have followed Fisher’s way, while paying lip service to effect sizes. By contrast, to calculate a Bayes factor one must specify what sort of effect sizes the theory predicts. Bayes forces people to think about effect size. Nowadays many journals require one mention standardised effect sizes for each inferential test. But has this led people to either calculate power or use confidence intervals? When confidence intervals are given, do authors argue what size effect would be predicted by the theory and indicate whether the confidence interval includes or excludes effects interesting on the theory? That is what people should be doing (particularly when interpreting null results), but they do not. People may mention effect sizes but they don’t seem to do anything with this information other than state what they are. And when power is calculated, it is often calculated for “a Cohen’s d of 0.5” because, the authors say, they are interested in “medium sized” effects. This is a step above not calculating power at all. But it is still an unthinking mechanical response when we could be doing so much more. What size effect does the literature suggest is interesting for this particular domain? Rather than plucking “0.5” out of thin air we should get to know the data of our field.
Often the already existing published data in our field allow us to say in raw units just what sort of effect size a theory deals with. Sometimes one really does not know; then using wild speculations – like a Cohen’s d of 0.5 because that is the sort of effect psychologists in general often deal with – may be the best one can do. But often it will be not far off the least one can do. (For arguments for the frequent relevance of raw rather than standardised effect sizes, see Baguley, 2009; Ziliak & McCloskey, 2008.) It is easy to see why psychologists, and other users of statistics, became Fisherian, i.e. just reported p values and did not consider effect sizes in an inferential way. Reporting p-values requires minimal thought. While that response might sound cynical, it is indeed an advantage of p-values: The result requires minimal assumptions. For this reason, I suspect p-values may be interesting as a side line when reporting many tests. It is the result one gets with minimal assumptions. Nonetheless, this approach has done enough mischief we really have to change. Unless one incorporates effect sizes into one’s inferences, dealing with null results is impossible. How predictably are null results handled without thought, leaving the reader genuinely uninformed about the status of some theory or treatment or danger, whatever the confident assertions of the author. The data have been collected to inform the reader, but the reader is not getting informed. Untold damage has been done to many fields because of this lapse. Despite some attempts to encourage researchers to use confidence intervals their use has not taken off (Fidler et al, 2004). Confidence intervals of some sort would deal with many problems (either confidence, credibility or likelihood intervals; see Dienes 2008 for definitions and comparison). But an approach that has something of a flavour of a t or other inferential test might be taken up more easily.
Further, confidence intervals themselves have all the problems enumerated above for Neyman Pearson inference in general (unlike credibility or likelihood intervals). So here I urge the use of the Bayes factor. To calculate a Bayes factor in support of a theory (relative to say the null hypothesis), one has to specify what the probability of different effect sizes are, given the theory. In a sense this is not new: We should have been specifying predicted effect sizes anyway. And if we are going to do it, Bayes gives us the apparatus to flexibly deal with different degrees of uncertainty regarding the predicted effect size. For example consider a theory that predicts a difference will be in one direction. A minimally informative distribution, containing only the information that the difference is positive, is to say all positive differences are equally likely between zero and the maximum difference allowed by the scale used. Such a vague prediction works against finding evidence in favour of the theory. Generally we can do better than that. For example, it seems rather unlikely that the difference will be as large as the maximum allowed by the scale: That requires all subjects in one condition were at one extreme of the scale, and all subjects in the other condition were at the other extreme. In general, smaller effects are more likely than the larger ones. This can be modelled by one half of a normal distribution, with its mode at zero, and its tail dropping away in the positive direction. But how to scale the rate of drop? If similar sorts of effects as those predicted in the past have been on the order of a 5% difference between conditions in classification accuracy, then we can set the standard deviation of the normal to be 5%. This distribution would imply that smaller effects are more likely than bigger ones; and that effects bigger than about 10% are unlikely.
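What the half-normal with a standard deviation of 5% implies can be made concrete with a short calculation. This is only a sketch; the function name is hypothetical, and the thresholds other than 10% are chosen purely for illustration.

```python
import math


def half_normal_tail(threshold, sd):
    """P(effect > threshold) for a half-normal with mode 0 and scale sd.

    A half-normal is a normal folded at zero, so its upper tail is twice
    the upper tail of the corresponding normal, which equals erfc(t / (sd * sqrt(2))).
    """
    if threshold <= 0:
        return 1.0
    return math.erfc(threshold / (sd * math.sqrt(2)))


sd = 5.0  # percentage points, as in the classification-accuracy example
for threshold in (5.0, 10.0, 15.0):
    tail = half_normal_tail(threshold, sd)
    print(f"P(effect > {threshold:.0f}%) = {tail:.4f}")
```

On this distribution roughly two thirds of the probability lies below 5%, and less than 5% of it lies above 10%, which is the sense in which effects bigger than about 10% are treated as unlikely.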
If an argument based on the existing literature makes these assumptions plausible, then the Bayes factor based on those assumptions is one that can be accepted generally. To play with how assumptions affect the Bayes factor, see the web site for Dienes (2008) for flash programs and Matlab code for the Bayes factor, and Baguley and Kaye (in press) for corresponding R code; and Rouder et al (2009) for another Bayes factor calculator. Bayes factors vary according to assumptions, but they cannot be made to vary ad lib: Often a wide range of assumptions leads to essentially the same conclusion. The results will depend on the assumptions about likely effect sizes given the theory. But notice these assumptions are open to public scrutiny. They can be debated and other assumptions used according to the debate. In this sense Bayes is objective. In Neyman Pearson inference, the inference depends on how the experimenter decided to stop and when he thought of the hypothesis. These concerns are not open to public scrutiny and may not even be known by the author. All assumptions relevant to Bayesian inference are available to critical debate because they are part of the public problem situation itself and not locked in the head of researchers. Strangely, starting from subjective probabilities leads to more objective conclusions than starting with objective probabilities (as relative frequencies)! A statistical philosophy would not have persisted so long if it did not produce tenable conclusions in many cases. And indeed, in many cases Bayesian and orthodox answers agree, which is reassuring. For example, Dienes et al (2009) investigated possible correlates of hypnotic suggestibility. Previous research had found that a task measuring cognitive inhibition correlated about .40 with hypnotic suggestibility. With 180 participants, Dienes et al found a correlation of -.05, with a 95% confidence interval of [-.20, .09]. The null hypothesis was accepted.
We can use the software from Dienes (2008) to calculate a Bayes factor by Fisher-z transforming the correlation so that its sampling distribution becomes normal. Based on the previous study alone, a positive correlation would be expected of around .40. However, in light of the larger literature, one can in addition say that smaller correlations are more likely than larger ones: Replicable correlates of hypnotic suggestibility, such as they are, tend to have values closer to .20 or .10 rather than .40 or .50 over many studies (e.g. Kirsch & Council, 1992). Thus, a reasonable distribution of the population (Fisher-z transformed) correlation value is a half normal with its mode at 0 and an SD of .25, effectively ruling out correlations greater than about .50, and allowing any value between 0 and .50, with smaller values more likely. This is a judgment based on knowledge of the literature, and it is of course open to debate. The assumptions are laid bare for anyone to consider. The obtained (Fisher-z transformed) correlation of -.05 has a standard error of 1/sqrt(N - 3) = .075. Feeding these numbers into the online software, the Bayes factor is .18, i.e. substantial evidence for the null hypothesis over the alternative considered. Note that this does not amount to accepting the null in any absolute sense; there is just more evidence for the null than the alternative considered. If the alternative was a normal centred on zero with a standard deviation of .10 (i.e. if one expected possible correlations between 0 and about .20 in either direction) the Bayes factor is .69, which barely changes one’s prior odds at all: The data do not discriminate the null hypothesis from the hypothesis of a range of small correlations. And of course, this is just as it should be.
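The calculation described above can be approximated without the online software. The sketch below is not the Dienes (2008) program; it is a minimal stand-in that assumes a normal likelihood for the Fisher-z transformed correlation, integrates numerically over the predicted effect sizes, and uses hypothetical function names and integration limits.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def bayes_factor(obs, se, prior_pdf, lower, upper, steps=20000):
    """Bayes factor for a theory against the point null of zero effect.

    Numerator: likelihood of the data averaged over the effect sizes the
    theory predicts (midpoint rule on [lower, upper]).
    Denominator: likelihood of the data if the true effect is exactly zero.
    """
    width = (upper - lower) / steps
    marginal = 0.0
    for i in range(steps):
        theta = lower + (i + 0.5) * width
        marginal += prior_pdf(theta) * normal_pdf(obs, theta, se) * width
    return marginal / normal_pdf(obs, 0.0, se)


n = 180
obs_z = math.atanh(-0.05)      # Fisher-z transform of the obtained correlation
se = 1 / math.sqrt(n - 3)      # standard error of a Fisher-z correlation

# Theory 1: half-normal, mode 0, SD .25 (positive correlations only).
bf_half = bayes_factor(obs_z, se, lambda t: 2 * normal_pdf(t, 0.0, 0.25), 0.0, 2.0)

# Theory 2: normal centred on zero, SD .10 (small correlations, either sign).
bf_small = bayes_factor(obs_z, se, lambda t: normal_pdf(t, 0.0, 0.10), -1.0, 1.0)

# Simple hypothesis: a population correlation of exactly .40 versus exactly 0.
bf_point = normal_pdf(obs_z, math.atanh(0.40), se) / normal_pdf(obs_z, 0.0, se)

print(f"half-normal (SD .25):   B = {bf_half:.2f}")
print(f"normal (mean 0, SD .1): B = {bf_small:.2f}")
print(f"r = .40 versus r = 0:   B = {bf_point:.2f}")
```

Under these assumptions the first two Bayes factors come out at about .18 and .69, in line with the values reported in the text, and the ratio for a correlation of exactly .40 against exactly zero is indistinguishable from zero at two decimal places.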
Finally, one can compare the evidence for a correlation of zero to the specific value of .40 obtained in the prior study (the likelihoods for 0 and .40 can be obtained from the corresponding heights in normal tables). The Bayes factor is .00 to two decimal places, indicating exceptionally strong evidence for a correlation of zero rather than the previously obtained value of .40. The data are sufficiently clear that the precise statistical philosophy does not change the ultimate conclusions, even for a null result. But using Bayes does draw one’s attention to the amount of evidence for one hypothesis relative to a specified other. Orthodox hypothesis testing does not: One accepts or rejects the null outright. The specification of what the theory predicts of course depends on the theory. Bayesians tend at heart to be against mechanical unthinking procedures, and this is both a strength and a weakness. It is ultimately a strength, though it makes it harder for the procedures to be adopted generally. Consider a psychologist who decides to use a conventional expected medium effect size (Cohen’s d) of 0.5 to scale his distribution of the predicted effect, because medium effect sizes are typical for psychologists. He uses a distribution symmetric around zero, dropping off in either direction, scaled by a Cohen’s d of 0.5 (see Rouder et al 2009 for a Bayes factor calculator that works given these assumptions). These may be excellent assumptions to make, but they should not be made as an unthinking default. Another person may convincingly argue that one can be more specific: Based on past literature, this theory typically deals with raw effects of 100 ms. Cohen’s d may change according to what other factors and covariates are in the experiment, but one expects a raw effect of about 100 ms in a certain direction.
Further, the person might argue that an effect size of say 10 ms would actually argue against the theory being the relevant explanation, because the effect would be too small. (It is not so far fetched to think psychologists could say something this exact about effect sizes based on their theories: Consider the difference between pop out and serial search in the attention literature.) Now we could model the predictions of the theory by a normal centred on 100, and we would need to debate its standard deviation. The less information we have, the more we spread the distribution out until limited by known constraints that mean it cannot be spread further (e.g. maybe we want the probability of effects less than 20 ms to be very small). The process of making these arguments means getting to know one’s theories and the existing data. In one sense Orthodox and Bayesian answers will never contradict each other because they answer entirely different questions (Royall, 1997). But a Bayes factor can indicate more support for the null hypothesis than a theory that predicts a nonzero effect when significance testing indicates one should reject the null. And vice versa, a Bayes factor can indicate more support for a theory that predicts a nonzero effect than the null hypothesis when a result is non-significant. Because the distribution of effects predicted by a theory depends on the theory, no firm rules can be given for when orthodox and Bayesian answers will differ in this respect. It all depends on the theory considered (cf Berger, 2003). Here we consider some hypothetical examples. Consider the theory that prejudice between ethnic groups can be reduced by making both racial groups part of the same in-group. A manipulation for reducing prejudice following this idea could consist of imagining being members of the same sports team. A control group could consist of imagining playing a sport with no mention of the ethnic group.
A post-manipulation mean difference in prejudice (in the right direction) is obtained with 30 participants of x raw units, equal to the standard error of difference; i.e. a non-significant t value of 1.00 is obtained. What follows from this null result? Should one reduce one’s confidence in the theory (assuming the experiment is regarded as well designed)? It depends. Let us say in previous research, instead of imagining the scenario, participants actually engaged in a common activity. A reduction in prejudice on the same scale was obtained of 2x. It seems unlikely that imagination would reduce prejudice by more than the real thing. If smaller effects are regarded as more likely than larger effects in general, then we may model predictions by half a normal, with its mode on zero, and a standard deviation of x units. In this case, the Bayes factor is 1.38. That is, the data are essentially uninformative but if anything we should be more confident in the theory after getting these null results. It would be a tragic mistake to reject the usefulness of imagination treatments for prejudice based on this experiment. Indeed, even if the mean difference had been exactly zero, the Bayes factor is 0.71, that is, one’s confidence in the theory relative to the null should be barely altered. More strongly, even if the mean difference had been x in the wrong direction, the Bayes factor is still 0.43. This does not count as substantial evidence against the theory by Jeffreys’ (1961) suggested convention of 3 (or 1/3) for indicating substantial evidence; and indeed if one felt strongly confident of the theory before collecting the data, one could normatively still be very confident afterwards. Determining predicted effect sizes is hard if a motivation for a study is simply “I wonder what would happen if…”. One can only predict effect sizes if a motivation is given for the study that relates to previous work.
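The three Bayes factors quoted for the prejudice example can be approximated in closed form. The sketch below assumes a normal approximation to the sampling distribution of the mean difference (so it ignores the small-sample t correction) and works in units where x, the standard error, equals 1; the function names are hypothetical.

```python
import math


def normal_pdf(x, sd):
    """Density at x of a normal with mean 0 and standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def bf_half_normal(mean_diff, se, prior_sd):
    """Bayes factor for a half-normal theory (mode 0) against the point null.

    Averaging a normal likelihood over a half-normal prior gives
    2 * N(mean_diff; 0, sqrt(prior_sd^2 + se^2)) * Phi(z), where Phi(z) is the
    share of the corresponding full-normal posterior lying above zero.
    """
    total_sd = math.hypot(prior_sd, se)
    z = mean_diff * prior_sd / (se * total_sd)
    marginal = 2 * normal_pdf(mean_diff, total_sd) * normal_cdf(z)
    return marginal / normal_pdf(mean_diff, se)


x = 1.0  # the obtained difference and its standard error, in raw units
for label, diff in (("x in the right direction", x),
                    ("exactly zero", 0.0),
                    ("x in the wrong direction", -x)):
    print(f"mean difference {label}: B = {bf_half_normal(diff, x, x):.2f}")
```

With these assumptions the results fall within about 0.01 of the values quoted above (1.38, 0.71 and 0.43), so none of the three outcomes amounts to substantial evidence against the theory.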
And the more previous work it relates to, the more informed the prediction of effect sizes can be. Thus, a Bayesian analysis puts pressure on formulating general theories. The more one can specify a mechanism for an effect, the more one can, by reference to past work involving the same putative general mechanism, pin down the expected effect size. And the more one can pin down an expected effect size by reference to other research, the more strongly the data can in principle support one’s theory, or else falsify it. This must be good for theory development. For example, consider obtaining a t-value of 2, a raw mean difference of x, leading to rejection of the null by orthodox statistics. If the experiment was not based on any theory at all, and not related to any past data or expectations, the vaguest prediction is that the difference should be between the extremes possible on the scale, e.g. plus and minus 10x. The Bayes factor is 0.46, marginally counting against the vague expectation of some difference. But if theory led one to predict a difference of around x, and lying between 0 and 2x, we could model the prediction of the theory as a normal with mean x and standard deviation x/2. Now the Bayes factor is 5, strongly supporting the theory. That is, the more researchers can make reasonably precise predictions, the more Bayes can reward them. (And if they do so with implausible assumptions, surely other researchers will correct them, and argue for more reasonable ways of applying the theory, leading to useful debate on plausible auxiliary hypotheses to link theory to data.) In the development of new paradigms, where there may not be past data to refer to, pilot studies can be done on the basic effect to scale the expected size of manipulations of the effect in order to calculate a Bayes factor (as is done in Dienes, Baddeley, & Jansari, submitted).

Weaknesses of the Bayesian approach

The strengths of Bayesian analyses are also their weaknesses:

1.
Calculating a Bayes factor depends on answering the following question about which there may be disagreement: What way of assigning probability distributions of effect sizes as predicted by theories would be accepted by protagonists on all sides of a debate? Answering this question might take some arguing. But isn’t this just the sort of argument that psychology has been missing out on and could really do with (cf Meehl, 1967)? People would really have to get to know their data and their theories better to argue what range of effect sizes their theory predicts. This will take effort compared to simply calculating p-values. The very effort of calculating Bayes factors will have a desirable consequence: People will think carefully about what specific (probably one-degree-of-freedom) contrasts actually address the key theoretical questions of the research, and people will not churn out e.g. all the effects of an ANOVA just because they can. Results sections will become focused, concise and more persuasive. But to begin with, psychologists may start using Bayes factors only to support key conclusions, especially based on null results, in papers otherwise based on extensive orthodox statistics. Of course, it would have to be done that way initially because editors and reviewers expect orthodox statistics. And it would in any case be good to explore the use of Bayes factors gradually. Once Bayes factors become part of the familiar tool box of researchers, their proper use can be considered in the light of that experience. An alternative response to the problem of assigning a probability distribution to effect sizes is to not take on the full Bayesian apparatus: One CAN just report likelihoods for the simple hypotheses that the population value is 1, 1.1, ... etc. (e.g. consider comparing the likelihood of a correlation of zero with a correlation of .40 in the example above).
This is “theory free” in the sense that no prior probabilities are needed for these different hypotheses (see Blume, in press; Dienes, 2008, chapter five; Royall, 1997). This procedure results in a likelihood interval, similar to a confidence interval (though one that follows the likelihood principle). The “likelihood approach” has the advantage of not committing to an objective or subjective notion of probability, and not worrying about precisely how to specify prior distributions, while committing to the likelihood principle. On the other hand, if a probability distribution over effect sizes can be agreed on, the full use of Bayes can be obtained (Jaynes, 2003). In particular, one can average out nuisance parameters, and assign relative degrees of support to different theories each consistent with a range of effect sizes (for example, the null hypothesis need not be just the hypothesis of zero, but the range of values too small to be support for a theory).

2. Bayesian procedures, because they are not concerned with long term frequencies, are not guaranteed to control Type I and Type II error probabilities (Mayo, 1996). Royall (1997) showed how the probability of making certain errors with a likelihood ratio – or Bayes factor – can be calculated in advance. In particular, for a planned number of subjects, one can determine the probability that the evidence will be weak (Bayes factor close to 1) or misleading (Bayes factor in wrong direction). These error probabilities have interesting properties compared to Type I and II error rates: No matter how many subjects one runs, Type I error is always the same, typically 5%. But for Bayes factors, the more subjects one runs, the smaller the probability of weak or misleading evidence. Further, these probabilities decrease as one runs more subjects no matter what one’s stopping rule. One can always decide to run some more subjects to firm up the evidence.
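The kind of advance calculation mentioned above can be illustrated in the simplest possible setting. The sketch below is an assumption-laden stand-in rather than Royall's own worked example: it takes two simple hypotheses about a normal mean (0 versus delta) with known standard deviation, assumes the null is true, and uses a likelihood ratio of 3 as the threshold for strong evidence, following the Jeffreys convention cited earlier. All names and numerical settings are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def prob_misleading(n, delta, sigma, k=3.0):
    """P(likelihood ratio favours the false hypothesis by at least k).

    With the null true, log LR (alternative over null) equals c*Z - c^2/2,
    where Z is standard normal and c = delta * sqrt(n) / sigma.
    """
    c = delta * math.sqrt(n) / sigma
    return normal_cdf(-c / 2 - math.log(k) / c)


def prob_weak_or_misleading(n, delta, sigma, k=3.0):
    """P(the data fail to give evidence of strength k for the true null)."""
    c = delta * math.sqrt(n) / sigma
    return normal_cdf(math.log(k) / c - c / 2)


delta, sigma = 0.5, 1.0  # illustrative effect size and standard deviation
for n in (10, 40, 160):
    weak = prob_weak_or_misleading(n, delta, sigma)
    wrong = prob_misleading(n, delta, sigma)
    print(f"n = {n:3d}: P(weak or misleading) = {weak:.3f}, P(misleading) = {wrong:.4f}")
```

In this setting the probability of ending up with weak or misleading evidence falls steadily as the number of subjects grows, whereas a fixed significance level would leave the Type I error rate at 5% however many subjects were run.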
Ultimately, however, the issue is about what is more important to us: To use a procedure with known long term error rates or to know the degree of support for our theory (the amount by which we should change our conviction in a theory)? If we want to know the degree of evidence or support for our theory, we are being irrational in relying on orthodox statistics.

Conclusion

I suggest that the arguments for Bayes are sufficiently compelling, even if not completely persuasive, that psychologists should be aware of the debates at the logical foundations of their statistics and make an informed choice between approaches for particular research questions. The choice is not just academic; it would profoundly affect what we actually do as researchers.

References

Armitage, P., Berry, G., & Matthews, J. N. S. (2002). Statistical methods in medical research. (4th Edition). Blackwell.
Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of Psychology, 100, 603-617.
Baguley, T., & Kaye, W. S. (in press). Review of “Understanding psychology as a science: An introduction to scientific and statistical inference”. British Journal of Mathematical & Statistical Psychology.
Berger, J. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18, 1-32.
Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). Journal of the American Statistical Association, 53, 259-326.
Blume, J. D. (in press). Likelihood and its Evidential Framework. In P. S. Bandyopadhyay & M. Forster (eds), Handbook of the Philosophy of Statistics. Elsevier.
Cohen, J. (1977). Statistical power analysis for behavioral sciences. Academic Press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14, 1-13.
Dienes, Z. (2008). Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan. Website: http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/
Dienes, Z., Brown, E., Hutton, S., Kirsch, I., Mazzoni, G., & Wright, D. B. (2009). Hypnotic suggestibility, cognitive inhibition, and dissociation. Consciousness & Cognition, 18, 837-847.
Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors Can Lead Researchers to Confidence Intervals, but Can’t Make Them Think. Psychological Science, 15, 119-126.
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587-606.
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds) (1997). What if there were no significance tests? Erlbaum.
Hoijtink, H., Klugkist, I., & Boelen, P. A. (Eds) (2008). Bayesian evaluation of informative hypotheses. Springer.
Howard, G. S., Maxwell, S. E., & Fleming, K. J. (2000). The Proof of the Pudding: An Illustration of the Relative Strengths of Null Hypothesis, Meta-Analysis, and Bayesian Analysis. Psychological Methods, 5(3), 315-332.
Howson, C., & Urbach, P. (2006). Scientific reasoning: The Bayesian approach. (Third edition.) Open Court.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Sage.
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge University Press.
Jeffreys, H. (1961). The Theory of Probability. Third edition. Oxford University Press.
Kerr, N. L. (1998). HARKing: Hypothesizing After the Results are Known. Personality and Social Psychology Review, 2, 196-217.
Kirsch, I. (2009). The emperor’s new drugs. The Bodley Head.
Kirsch, I., & Council, J. R. (1992). Situational and personality correlates of suggestibility. In E. Fromm & M. Nash (Eds.), Contemporary hypnosis research (pp. 267–291). New York: The Guilford Press.
Lakatos, I. (1978). The methodology of scientific research programmes: Philosophical papers, vol 1. Cambridge University Press.
Lakatos, I., & Feyerabend, P. (1999). For and against method. University of Chicago Press.
Mayo, D. (1996). Error and the growth of experimental knowledge. University of Chicago Press.
Mayo, D. G. (2004). An error statistical philosophy of evidence. In M. L. Taper & S. R. Lele, The nature of scientific evidence: Statistical, philosophical and empirical considerations. University of Chicago Press (pp 79 – 96).
Meehl, P. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.
Miller, D. (1994). Critical rationalism: A restatement and defence. Open Court.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioural sciences. Wiley.
Popper, K. (1963). Conjectures and refutations. Routledge.
Rouder, J. N., Morey, R. D., Speckman, P. L., & Pratte, M. S. (2007). Detecting chance: A solution to the null sensitivity problem in subliminal priming. Psychonomic Bulletin & Review, 14, 597-605.
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t-Tests for Accepting and Rejecting the Null Hypothesis. Psychonomic Bulletin & Review, 16, 225-237.
Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. Chapman & Hall.
Sivia, D. S., & Skilling, J. (2006). Data analysis: A Bayesian tutorial, second edition. Oxford University Press.
Taper, M. L., & Lele, S. R. (2004). The nature of scientific evidence: Statistical, philosophical and empirical considerations. University of Chicago Press.
Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error cost us jobs, justice and lives. The University of Michigan Press.