Replication and Transparency of Macro Models
Transcription
Replication and Transparency of Macro Models (and other musings and thoughts)
Johannes Pfeifer (Mannheim)
Replication and Transparency in Economic Research, January 6/7, 2016

Two Guiding Principles
- Never attribute to malice that which is adequately explained by stupidity (Hanlon's Razor)
- Let him who is without sin cast the first stone (John 8:7)

History Coding Preserving Integrity of Literature Actions Data Policies Working reproducibly Scientific Computing 1/74

[Comic: www.smbc-comics.com]

Why Macro Is Different
- At least in the area of business cycle research, we work with fairly complicated structural models but rather straightforward data
- This gives rise to a particular set of computational challenges
- Many papers tend to be very complex and (almost) require a PhD for replication

Quick History of Thought (after monetarism)
- Muth (1961): rational expectations, i.e. agents should not make systematic prediction errors given the available information
- Lucas (1976): for policy advice we need structural models that are invariant to changes in the policy experiments under consideration
- Kydland and Prescott (1982): a simple DSGE model with TFP shocks generates cyclical fluctuations resembling the ones found in the data
- "We chose not to test our model [...] this most likely would have resulted in the model being rejected..." (Kydland and Prescott, 1982)
- "The models constructed within this theoretical framework are necessarily highly abstract.
Consequently, they are necessarily false, and statistical hypothesis testing will reject them." (Prescott, 1986)
- Late 1980s/early 1990s: New Keynesians push for more sophisticated models and formal econometric tests

Challenge I: Model Solution
- The dynamic stochastic structure of the models gives rise to nonlinear stochastic difference equations that describe the evolution of the model variables
- Solving these difference equations is hard, but there are two ways out:
  - Substitute an easier problem for the original one: linearize the model
  - Use numerical techniques to solve the model
- The computer is better than humans at both tasks

Challenge II: Bringing the Model to the Data
- Estimating linearized models via maximum likelihood using the Kalman filter is straightforward
- But: the likelihood is a high-dimensional object
- Even for simple models, it can be ill-behaved, showing hardly any curvature and exhibiting many local maxima
- For more complicated models, you can think of it as an egg-crate

Challenge II: Bringing the Model to the Data (continued)
- "Dilemma of absurd parameter estimates" (An and Schorfheide, 2007): ML estimates are often at odds with information from outside of the model
- Solution: use Bayesian techniques that augment the likelihood with prior information → makes the posterior more well-behaved
- Problem: Bayesian econometrics often involves working with intractable posterior distributions → need to work out complicated integrals
- Solution: use numerical integration techniques in the computer (relying on Metropolis-Hastings, the Gibbs sampler, etc.)
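As a concrete illustration of the numerical-integration step, a random-walk Metropolis-Hastings sampler can be sketched in a few lines. This is a toy example, not any package's actual implementation: the target is a standard normal log-density standing in for an intractable log-posterior, and all function names here are made up for illustration.

```python
import math
import random

def log_target(theta):
    # Stand-in for log-likelihood + log-prior; here a standard
    # normal log-density (up to an additive constant).
    return -0.5 * theta * theta

def random_walk_metropolis(log_target, theta0, n_draws, step=0.5, seed=1):
    """Propose theta' = theta + N(0, step^2) and accept with
    probability min(1, p(theta') / p(theta))."""
    rng = random.Random(seed)
    theta, log_p = theta0, log_target(theta0)
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        log_p_prop = log_target(proposal)
        if rng.random() < math.exp(min(0.0, log_p_prop - log_p)):
            theta, log_p = proposal, log_p_prop  # accept the proposal
        draws.append(theta)  # on rejection, the old draw is repeated
    return draws

draws = random_walk_metropolis(log_target, theta0=3.0, n_draws=20000)
kept = draws[5000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

Averages of the retained draws approximate posterior moments; here `posterior_mean` should come out close to 0, the mean of the toy target.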
- Smets and Wouters (2007): milestone study showing that the forecasting power of a DSGE model estimated with Bayesian techniques is on par with a BVAR

Result
- If you work in quantitative macro, the computer is your best friend and your worst enemy!
- Nowadays macroeconomic work is almost impossible without scientific computing software
- Coding is an integral part of economic research, unless it is purely theoretical (and even then software helps with checking the algebra)
- The development of information technologies has been a driver of more sophisticated research techniques

The Problem
- Complexity has massively increased
- Training, focus on computational details, and software development have not necessarily kept pace
- McCullough and Vinod (1999): classical study showing that even commercially available software packages sometimes return wildly differing results in standard applications → thorough benchmarking needed
- But: macroeconomists rely less on standard commercially available packages like Stata
- This puts verification at the forefront

Clemens (2015), Journal of Economic Surveys
Table 1. A Proposed Standard for Classifying Any Study as a Replication.
                 Same           Same        Same
                 specification  population  sample   Examples
Replication (same methods in the follow-up study as reported in the original; sufficient conditions for a discrepancy: random chance, error, or fraud)
  Verification   Yes            Yes         Yes      Fix faulty measurement, code, data set
  Reproduction   Yes            Yes         No       Remedy sampling error, low power
Robustness (different methods; the sampling distribution has changed)
  Reanalysis     No             Yes         Yes/No   Alter specification, recode variables
  Extension      Yes            No          No       Alter place or time; drop outliers

Notes: The "same" specification, population, or sample means the same as reported in the original paper, not necessarily what was contained in the code and data used by the original paper. Thus, for example, if code used in the original paper contains an error such that it does not run exactly the regressions that the original paper said it does, new code that fixes the error is nevertheless using the "same" specifications (as described in the paper).

Coding
- Scientists spend 30% or more of their time on developing their own software (Hannay et al., 2009; Prabhu et al., 2011)
- Thus research quality and results depend heavily on the developed software
- Knowing how to do it right is as important as learning programming.
- It helps to get more reliable results
- It decreases the amount of time needed to develop software and makes the work more efficient
- It allows for replicability (which increases the validity of the results)
- Mistakes in code are dangerous not only for the quality of the project, but also for those citing it (domino effect)

[Comic: www.dilbert.com] Bad code only helps in rare cases

Case Study: Reinhart and Rogoff (2010)
- Herndon et al. (2014): "We replicate Reinhart and Rogoff (2010) and find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period."
- The result was cited, e.g., by the German finance minister when pushing for austerity in Europe

Coding Errors
- Mistakes are made even by professionals. According to McConnell (2004) and NASA:
  - The industry average is about 1 to 25 errors per 1,000 lines of code for delivered software
  - The Applications Division at Microsoft experiences about 10 to 20 defects per 1,000 lines of code during in-house testing, and 0.5 defects per 1,000 lines of code in released products
  - Space-shuttle software has achieved a level of 1 defect in 500,000 lines

Soergel (2015): Rampant software errors may undermine scientific results
- Even when a program contains an error, testing may not expose it (i.e., it may give the correct output for some inputs but not others).
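Soergel's multiplicative estimate of how many bugs survive into a reported result is simple enough to compute directly. A minimal sketch (the function name is mine; the inputs are the "optimistic scenario" figures used in the talk):

```python
def expected_relevant_errors(total_loc, share_executed, errors_per_line,
                             p_changes_result, p_result_plausible):
    # Soergel's estimate: expected number of errors per program run
    # that change the result AND still look plausible to the scientist.
    return (total_loc * share_executed * errors_per_line
            * p_changes_result * p_result_plausible)

# Optimistic scenario: 1,000 LOC, all executed, 1 error per 1,000 lines,
# 10% chance an error meaningfully changes the outcome, 50% chance the
# erroneous result looks plausible -> about 0.05, i.e. roughly a 5%
# chance of a wrong but plausible output.
risk = expected_relevant_errors(1000, 1.0, 1 / 1000, 0.10, 0.50)
```

Plugging in a large, complex analysis instead (more code, higher error rate) quickly pushes the expected count above one, which is the "effectively 100%" case discussed below.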
- Finally, many errors may cause a program to simply crash or to report an obviously implausible result, but we are really only concerned with errors that propagate downstream and are reported.
- In combination, then, we can estimate the number of errors that actually affect the result of a single run of a program as follows: number of errors per program execution = total lines of code (LOC) × proportion executed × probability of error per line × probability that the error meaningfully affects the result × probability that an erroneous result appears plausible to the scientist.
- For these purposes, using a formula to compute a value in Excel counts as a "line of code", and a spreadsheet as a whole counts as a "program"; so many scientists who may not consider themselves coders may still suffer from bugs.
- All of these values may vary widely depending on the field and the source of the software. Consider the following two scenarios.

Soergel (2015), Scenario 1: A large, complex analysis
- "Multiplying these, we expect that two errors changed the output of this program run, so the probability of a wrong output is effectively 100%. All bets are off regarding scientific conclusions drawn from such an analysis." Even optimistic scenarios look pretty bleak:

Soergel (2015), Scenario 2: A small, focused analysis, rigorously executed
- "Let's imagine a more optimistic scenario, in which we write a simple, short program, and we go to great lengths to test and debug it. In such a case, any output that is produced is in fact more likely to be plausible, because bugs producing implausible outputs are more likely to have been eliminated in testing."
- 1,000 total LOC; 100% executed; 1 error per 1,000 lines; 10% chance that a given error meaningfully changes the outcome; 50% chance that a consequent erroneous result is plausible
- Here the probability of a wrong output is 5%.

Soergel (2015): Why should we care about coding errors?
- In general, software errors produce outcomes that are inaccurate, not merely imprecise
- Errors in experiments in the hard sciences often reduce precision: the results will be a bit off
- Software bugs are different! "Small" bugs may have unbounded error propagation
- A sign error or a shift by one entry when matching data columns can render the data complete noise
- Results affected by software bugs will often be inaccurate

Solution!? Maximize the Probability of Error Detection
- Use "standard" software with a large user base where possible
- Use software with decent software-engineering standards
- Do not blindly trust someone else's code

Perturbation Techniques
- There are various codes for computing perturbation solutions to DSGE models, ranking from worst to best:
  1. solab.m (Klein, 2000) and Chris Sims's gensys
  2. The Schmitt-Grohé and Uribe (2004) toolkit
  3.
Dynare (Adjemian et al., 2011)
- If you are teaching students, first teach them the method of undetermined coefficients and Blanchard and Kahn (1980) so that they see the nuts and bolts
- Then teach them Dynare

Dynare
- Open-source, free software for solving, simulating, and estimating D(S)GE models
- Works under Matlab and Octave
- Advantages:
  - Automates many error-prone steps (never linearize by hand)
  - Large collection of mod-files available (a quasi-standard by now)
  - Big community
  - Decent software-engineering standards

Dynare: Software Engineering
- Dynare manual (http://www.dynare.org/documentation-and-support/manual)
- Dynare forum (http://www.dynare.org/phpBB3/)
- Code archive (https://github.com/DynareTeam/dynare/)
- List of known bugs (http://www.dynare.org/DynareWiki/KnownBugs)
- Testsuite (http://www.dynare.org/testsuite/)

The "Black Box" Argument
- Common argument: "Dynare is a black box"
- Counterargument: it is only a black box until you decide to open it
- All code is publicly available and the algorithms are well documented
- You don't need to reinvent the wheel
- The probability of a bug in Dynare going undetected is lower than in your own code!

Dynare Resources
- https://github.com/johannespfeifer/dsge_mod
- Johannes Pfeifer (2013a). "A guide to specifying observation equations for the estimation of DSGE models". Mimeo, University of Mannheim
- Johannes Pfeifer (2013b). "An introduction to graphs in Dynare". Mimeo.
University of Mannheim
- Macro Model Database 2.0 (Wieland et al., 2012)

Macro Model Database: A Few Words of Warning
- A great tool and a good starting point, with about 60 heavily used models
- Take their "replication" claims with a grain of salt
- Most mod-files deviate from Dynare "best practices"
- Most serious issue: parameter dependencies are often not correctly handled
- This limits the reusability of the mod-files, as they cannot directly be used for estimation → users cannot bring the models to the data as they are

Standard Software: A Caveat
- "Linus's Law": "Given enough eyeballs, all bugs are shallow"
- Problem: this only applies when many eyeballs actually read and test the code
- A large user base is not sufficient (see the "Heartbleed" OpenSSL bug, which affected about 17% of all secure web servers for two years)

Example discussed here: a paper that had required a correction, but where many papers that relied on the old, wrong code had not been corrected

Lu et al. (2013): Retractions in Economics
- Retractions in economics and business administration are extremely rare
- A moment of introspection: Maybe we work more thoroughly than other subjects? Maybe we are just more honest or have fewer opportunities for misbehavior?

Figure 1 | Retraction characteristics.
Of the 1,423 retractions indexed by the Web of Science, the percentage of total retractions is greatest in the sciences.

Necker (2014): Surveys Among Economists I
- "The correction, fabrication, or partial exclusion of data, incorrect co-authorship, or copying of others' work is admitted by 1–3.5%. The use of 'tricks to increase t-values, R2, or other statistics' is reported by 7%. Having accepted or offered gifts in exchange for (co-)authorship, access to data, or promotion is admitted by 3%. Acceptance or offering of sex or money is reported by 1–2%. One percent admits to the simultaneous submission of manuscripts to journals. [..] According to their responses, 6.3% of the participants have never engaged in a practice rejected by at least a majority of peers. John et al. (2012) report almost the same fraction for psychologists."
- Translation: we are not better than other subjects

Necker (2014): Surveys Among Economists III
- "Respondents were asked which fraction of research in the top general and top field journals (A+ or A) they believe to be subject to different types of misbehavior ('up to ... %', scale given in deciles). The fabrication of data is expected to be the least widespread. The median response is 'up to 10%.' Respondents believe that incorrect handling of others' ideas, e.g., plagiarism, is more common; the median is 'up to 20%' of published research."

List et al. (2001): Survey Among Economists II
Table 2. Summary Statistics of Responses

Research "Felonies" (Falsification)
- Self (Q 9): Have you ever falsified research data?
- Others (Q 9a): What percentage of research in the top 30 journals do you believe is falsified?
  Self (Q 9): Randomized response (n = 140): 4.49 (0.30); Direct response (n = 94): 4.26 (0.22)
  Others (Q 9a): Randomized response: 7.04 (0.85); Direct response: 5.13 (0.73)

Research "Misdemeanors" and Selling Grades
- Self (Q 10): Have you ever committed any of four "minor" infractions? Randomized response: 10.17 (0.34); Direct response: 7.45 (2.72)
- Others (Q 10a): What percentage of research in the top 30 journals do you believe is affected by these "minor" infractions? Randomized response: 16.98 (1.52); Direct response: 12.95 (1.50)
- Self (Q 11): Have you ever accepted sex, money, or gifts in exchange for grades? Randomized response: 0.40 (0.27); Direct response: 0.0 (0.0)
- Others (Q 11a): What percentage of economics faculty members do you believe have accepted sex, money, or gifts in exchange for grades? Randomized response: 4.26 (0.50); Direct response: 3.82 (0.51)

Notes: Cell contents are means (standard errors) and represent percentages. For randomized response questions, means and variances are computed as mu_RR = (Z − (1 − P)·pi)/P and var_RR = Z(1 − Z)/((n − 1)P²), where Z is the observed proportion of yes responses, P is the probability of answering the sensitive question, pi is the proportion of yes responses to the nonsensitive question (in our case a series of coin flips, hence pi = 1), and n is the sample size.

- Echoes earlier findings from the 1998 ASSA meeting

Steen et al. (2013): What should we expect? Why Are There More Scientific Retractions?
Table 1. Correlations among journal impact factor (IF) and time-to-retraction expressed in months for different infractions.

Infraction                     n     Journal IF      Months to retract  r (IF × months)  R      P
                                     mean (SD)       mean (SD)
Misconduct + poss. misconduct  889   8.71 (10.08)    43.03 (37.40)      −0.079           −2.39  0.01
Misconduct                     697   9.10 (10.24)    46.78 (38.38)      −0.120           −3.19  0.01
Possible misconduct            192   7.31 (9.38)     29.41 (29.97)       0.030            0.41  NS
Plagiarism                     200   2.63 (2.42)     26.04 (32.55)      −0.134           −1.90  0.05
Error                          437   10.98 (11.61)   26.03 (27.95)       0.029            0.60  NS
Duplicate publication          290   3.91 (6.33)     26.61 (29.63)      −0.027           −0.46  NS
All retractions                2047  7.30 (9.54)     32.91 (34.24)      −0.027           −1.22  NS

This table includes all retracted articles. "Misconduct + poss.
misconduct" includes both "Misconduct" and "Possible misconduct," which are also analyzed separately. The correlation coefficient r is tested for significance with the R statistic, which has a t-distribution. Numbers do not sum because this table does not include "other" and "unknown" infractions, and because some papers were retracted for more than one infraction. doi:10.1371/journal.pone.0068397.t001

How many retractions do you know?

From the paper's methods section: "The PubMed database of the National Center for Biotechnology Information was searched on 3 May 2012, using the limits of 'retracted publication, English language.' A total of 2,047 articles were identified, all of which were exported from PubMed and entered in an Excel database. Each article was classified according to the cause of retraction, using published retraction notices, proceedings from the Office of Research Integrity (ORI), [...]"

Digression
Q: How do economics journals deal with these issues?
A: Often not well.
At least in economics, it is almost impossible to directly spot any issues when looking at journal homepages. Examples:
- Primiceri (2005), Review of Economic Studies
- Del Negro and Primiceri (2015), Review of Economic Studies
- Jermann and Quadrini (2012b), American Economic Review
- Jermann and Quadrini (2012a), American Economic Review
- Kunce et al. (2002), American Economic Review

Gerking and Morgan (2007), American Economic Review: "Effects of Environmental and Land Use Regulation in the Oil and Gas Industry Using the Wyoming Checkerboard as a Natural Experiment: Retraction", by Shelby Gerking and William E. Morgan:
"The purpose of this note is to call attention to, and to take responsibility for, errors in a previously published paper (Mitch Kunce, Shelby Gerking, and William Morgan 2002). The main finding reported in that paper is that oil and natural gas wells are significantly more costly to drill on federal property than on private property. This note explains why the paper's results are being retracted from the literature. Findings presented in the original paper cannot be substantiated because the data furnished [...] Although IHS classifies wells by land type, wells of a given type in a given region in a given year will have the same reported cost per foot regardless of whether they were drilled on federal or private property. Thus there is no independent variation in much of the drilling cost data independent of the variables used in the regression model. While the data provided by IHS do not show a difference in drilling cost by land type conditional on the variables in the regression model, errors in our handling of the data made it appear [...]"

Further examples:
- Fernández-Villaverde, Rubio-Ramírez, et al. (2006), Econometrica
- Ackerberg et al. (2009), Econometrica
- Lackman (1982), Quarterly Journal of Economics
- Chenault (1984), Quarterly Journal of Economics

QJE (1984), Quarterly Journal of Economics: NOTICE TO OUR READERS
"The following article (with minor copy-editing differences) was published in The Quarterly Journal of Economics, vol. 97, no. 3, August 1982, pp. 541-42 under the name of Prof. Conway L. Lackman of Rutgers University. Shortly after publication, Prof. Larry Chenault of Miami University asserted to the Board of Editors that the published article was, with minor differences, a paper that Chenault had written and submitted to two other professional journals. Professor Lackman's submission to The Quarterly Journal of Economics, received on 22 September 1981, was not a typewritten original, but a xerographic document. After refereeing, the paper was accepted for publication on 3 December 1981. Prof.
Lackman thereafter received from The Quarterly Journal of Economics galley proofs, accompanied by a copy of his submission. Following the return of galley proofs, the paper was published with minor changes from Lackman's submission copy. Upon receipt of Chenault's assertions, a member of the Board [...]"

How to Deal with Known Issues?
- Sometimes there are well-known issues with published papers
- People in the inner circle of the community are well aware of these issues (cf. the "Worm Wars" over Miguel and Kremer (2004))
- But: newcomers and outsiders often are not
- Consequently, they may spend an inordinate amount of time trying to replicate or build upon problematic papers, or put too much trust in published papers
- Do we as economists perform well in preserving the integrity of the literature? How many PhD students have wasted years of their lives due to this? → high social costs

Is a Better Refereeing Process the Solution?
- In some cases, the referees obviously failed to do their job
- My experience: more often, detecting true issues requires months of hard work
- Fernández-Villaverde, Guerrón-Quintana, et al.
(2011): no indication in the paper at all that something might be off; only the codes gave it away
- In the game of refereeing, the incentives are stacked against referees thoroughly checking codes (particularly in early rounds and when the paper gets rejected)
- Any effort you put in is anonymous and will only be valued by the editor → incentives even worse than for comments
- This puts a bigger burden on post-publication peer review/checking (attention will correlate with impact)

Comments
Comments seem to have partially filled this gap:
- Kurmann and Mertens (2013) on Beaudry and Portier (2006) (~570 citations)
- Born and Pfeifer (2014) on Fernández-Villaverde, Guerrón-Quintana, et al. (2011) (~290 citations)
- Ackerberg et al. (2009) on Fernández-Villaverde, Rubio-Ramírez, et al. (2006) (~80 citations)

Comments: Costs vs. Benefits
Writing a comment is risky; the private returns are almost surely smaller than the social returns:
- You do not know the standard the journal will apply and whether the comment will get published
- Other journals often do not touch comments on papers not in their own journal
- You may alienate the original authors and make powerful enemies
- Often only original research counts towards evaluations: "You should rather do original research instead of wasting your time on other people's research"
- You might get a reputation as a "nitpicker"
Additionally:
- Some journals do not publish comments at all
- Comments are not an attractive option for lower-tier journals

Sidenote: Top vs.
lower-ranked journals
- There is not much evidence on the reliability/correctness of articles in different journal tiers
- Do top-journal articles have higher quality because they are a positive selection and attract more scrutiny upon publication?
- Or are mistakes more likely because the research is at the frontier, less standard, and more complex?
- Are lower-ranked journal articles more problematic because they face less scrutiny by readers and referees?
- Do editors at lower-tier journals have incentives to deal with messy cases, or is it better to sweep them under the rug?
- My take: lower-ranked journals have a higher share of problematic articles

Comments on Journal Homepages
- Not common in economics
- The AEA offers this for the AEJs, but strangely not for the AER
- Flies too much under the radar (https://www.aeaweb.org/articles.php?doi=10.1257/pol.6.1.167)
- Requires a login

New Instruments: Replication Wiki (http://replication.uni-goettingen.de/wiki/)
- The Replication Wiki aims at providing an authoritative database on replication issues
- It already catalogues many papers for which formal replications have been conducted
- It also offers a "talk page" where issues with papers can be discussed
- But: no anonymous comments are possible
- In particular, PhD students and early-career researchers shy away from being associated with a critique of important figures in the field → the functionality is not used that much

New Instruments: Replication Studies (Zimmermann, 2015)
- Some journals are willing to publish replication studies
- The Journal of Applied Econometrics has an exclusive list of journals for which replication studies are considered
- Econ Journal Watch, an online-only,
open-access journal, has the goal to "watch the journals for inappropriate assumption, weak chains of argument, phony claims of relevance, omissions of pertinent truths, and irreplicability (EJW also publishes replications)"
- The Journal of the Economic Science Association promises to be explicitly receptive to replication studies, but its scope is limited to experimental economics
- The International Journal of Economic Micro Data is a new online open-access journal with a replication section
- Problem: "it takes as long to write a short paper as a long one"

New Instruments: PubPeer, the online journal club (www.pubpeer.com)
- A site offering post-publication peer review
- Not much used in economics, but heavily used in the life sciences
- Gained traction after several high-profile publications in the "tabloids" Nature and Science were brought down by comments on PubPeer
- Has the potential to become the go-to portal for issues with articles, but there is still a long way to go for network effects in economics to kick in
- Big advantage: anonymous commenting (http://blog.pubpeer.com/?p=200)
- Important: with great power comes great responsibility

New Instruments: Versioning
- Having one and only one version of a published article is an anachronism from the print age
- In the internet age, in principle nothing prevents the updating of articles, provided the changes are tracked
- Might be an interesting way to deal with problems and corrections
- For an example, see Soergel (2015) at http://dx.doi.org/10.12688/f1000research.5930.2

[Comic: www.dilbert.com]

First step: Data and Replication
Data policies (https://www.aeaweb.org/aer/data.php)
The first step in macro research should be straightforward: replication
The data policy at the AER stipulates: “For econometric and simulation papers, the minimum requirement should include the data set(s) and programs used to run the final models, plus a description of how previous intermediate data sets and programs were employed to create the final data set(s). Authors are invited to submit these intermediate data files and programs as an option; if they are not provided, authors must fully cooperate with investigators seeking to conduct a replication who request them.”

Chang and Li (2015): the bleak picture
“We attempt to replicate 67 papers published in 13 well-regarded economics journals using author-provided replication files that include both data and code. [...] Aside from 6 papers that use confidential data, we obtain data and code replication files for 29 of 35 papers (83%) that are required to provide such files as a condition of publication, compared to 11 of 26 papers (42%) that are not required to provide data and code replication files. We successfully replicate the key qualitative result of 22 of 67 papers (33%) without contacting the authors. Excluding the 6 papers that use confidential data and the 2 papers that use software we do not possess, we replicate 29 of 59 papers (49%) with assistance from the authors. Because we are able to replicate less than half of the papers in our sample even with help from the authors, we assert that economics research is usually not replicable.”

Donoho (2010): why does this matter?
“An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.” (John Claerbout)

The missing estimation codes
AER replication packages typically only provide the codes for simulating the “final model”, but not the estimation codes themselves
Discussed here: three examples from the AER where only the simulation codes are available, but not the estimation codes used to obtain the parameterization for the simulation
In one case, the estimation codes would have allowed directly spotting the error that was made
In another case, only these missing estimation codes would allow checking where the obviously wrong results come from

Example from the AER
Data and replication files were only available for the baseline case of one country, not for the other countries analyzed
Replication files were not available for the estimation, only for the simulation
National accounts data were copied and pasted into a Matlab file without providing information on source, vintage, or seasonal adjustment
Figures showed that differing samples were used, but there was no mention of which ones

All code is there. Problem Solved!?
Even if all the code is there, it does not necessarily run (anymore)
Takeaway:
Clearly state the software version used, including the operating system
Example: Dynare 4.4.3 on Matlab 2015b, Windows 7, 64bit
Make sure all external files are included
If you do not have the rights to include the files in a repository, clearly state where they can be obtained
Ideally: after constructing the repository for submission, try to run it on a different machine to verify that everything is included and works

Markowetz (2015): Why should I work reproducibly?
People respond to incentives (at least according to Mankiw)
Five selfish reasons:
- reproducibility helps to avoid disaster
- reproducibility makes it easier to write papers
- reproducibility helps reviewers see it your way
- reproducibility enables continuity of your work
- reproducibility helps to build your reputation

How do I work reproducibly?
Modern journal articles are not conducive to reproducible research
Due to printed versions, size (or length) still matters
Many papers are sufficiently detailed to understand the gist of the relevant elements, but are ill-suited for replication
How often have you read a version of: “for a more detailed and readable version of the paper, see the working paper”?
Two crucial tools: 1. Technical Appendices, 2.
Replication Files

Technical Appendices
Much of the meat of a paper is relegated to appendices
Technical appendices in quantitative macro are often as long as or longer than the paper itself
Unfortunately, they still often do not contain all the required information

Technical Appendices: what should they contain?
A list of all variables and the corresponding set of equations that determine these variables in the final model
That encompasses documenting how to get from the presented (nonstationary) model to the (stationary) one usable in the computer
A clear description of the computational algorithms used, including all “shortcuts” taken
A table with all parameter values used, not just the ones determining the dynamics (try finding the labor disutility parameter in many macro papers)
Dynare allows users to easily output LaTeX code of the equations used, as well as a list of variables and a parameter table

Data Appendix and Data Files
A list of all data sources used, including the mnemonics that allow unique identification
State the exact sample used for every exercise and the exact seasonal adjustment/filtering conducted
Many filters (the Baxter and King (1999) filter, the first-difference filter) introduce artifacts at the beginning and end of the sample
State how you dealt with these, i.e. does the stated sample refer to the data before or after applying the filter?
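To see why the pre-/post-filter distinction matters, here is a minimal sketch in Python with NumPy (the random series and the flat filter weights are purely illustrative, not the actual Baxter-King weights): first differencing loses one observation at the start of the sample, while a symmetric moving-average filter with K leads and lags loses K observations at both ends.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)  # hypothetical series with 100 observations

# First-difference filter: the filtered series starts one period later
dy = np.diff(y)
print(len(dy))  # 99

# Symmetric moving-average filter with K leads/lags (Baxter-King typically
# uses K = 12 for quarterly data); flat weights here are purely illustrative
K = 12
weights = np.full(2 * K + 1, 1.0 / (2 * K + 1))
y_filtered = np.convolve(y, weights, mode="valid")
print(len(y_filtered))  # 100 - 2*12 = 76
```

A stated sample range is therefore ambiguous unless the paper says whether it refers to the 100 raw observations or to the 76 observations that survive the filter.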
Avoid only providing the final, cleaned, and treated data
Instead, provide the files that show how the final sample was created from the raw data

Avoid using Excel! (www.dilbert.com)
Avoid using Excel! Unless... (www.dilbert.com)

Simulation Studies and Random Numbers (Born and Pfeifer, 2014)
Case study: Fernández-Villaverde, Guerrón-Quintana, et al. (2011)
[Figure: σNX/σY averaged over replications, plotted against the number of replications (1,000 to 10,000); marked value 1.63, data: 0.39]
In principle, random number generator seeds should not matter
But in practice they do! Codes should always provide the seed!

Finally, the most important one...
Write code for your future self!

Summary of Best Practices
The following is based on Greg Wilson et al. (2014). “Best practices for scientific computing”. PLoS Biology 12 (1), e1001745. doi: 10.1371/journal.pbio.1001745
A list of suggestions relevant for scientific work in the field of economics:
1. Write a program for people, not computers
2. Let the computer do the work
3. Make incremental changes
4. Don’t repeat yourself
5. Plan for mistakes
6. Optimize software only after it works correctly
7. Document design and purpose, not mechanics
8.
Collaborate

Conclusion
As already mentioned, organizing and optimizing your programming workflow is highly efficient, as it:
- decreases the incidence of mistakes, and makes those that do occur easier to find,
- increases the ability to replicate the project, making it more reliable,
- makes the time spent on writing code much more efficient
Thus, as with everything else, programming doesn’t just need to be done, it needs to be done correctly
Follow the rules, optimize your time, and make it easier for yourself and for others

Thank you for your attention!

Bibliography
Ackerberg, Daniel, John Geweke, and Jinyong Hahn (2009). “Comments on ‘Convergence properties of the likelihood of computed dynamic models’”. Econometrica 77 (6), 2009–2017.
Adjemian, Stéphane et al. (2011). “Dynare: reference manual version 4”. Dynare Working Papers 1. CEPREMAP.
An, Sungbae and Frank Schorfheide (2007). “Bayesian analysis of DSGE models”. Econometric Reviews 26 (2-4), 113–172.
Baxter, Marianne and Robert G. King (1999). “Measuring business cycles: approximate band-pass filters for economic time series”. Review of Economics and Statistics 81 (4), 575–593.
Beaudry, Paul and Franck Portier (2006). “Stock prices, news, and economic fluctuations”. American Economic Review 96 (4), 1293–1307.
Blanchard, Olivier Jean and Charles M. Kahn (1980). “The solution of linear difference models under rational expectations”. Econometrica 48 (5), 1305–1311.
Born, Benjamin and Johannes Pfeifer (2014). “Risk matters: the real effects of volatility shocks: Comment”. American Economic Review 104 (12), 4231–4239.
Chang, Andrew C.
and Phillip Li (2015). “Is economics research replicable? Sixty published papers from thirteen journals say ‘usually not’”. Finance and Economics Discussion Series 2015-083. Board of Governors of the Federal Reserve System.
Chenault, Larry A. (1984). “A note on the stability limitations in ‘A stable price adjustment process’”. Quarterly Journal of Economics 99 (2), 385–386.
Clemens, Michael A. (2015). “The meaning of failed replications: a review and proposal”. Journal of Economic Surveys.
Del Negro, Marco and Giorgio E. Primiceri (2015). “Time varying structural vector autoregressions and monetary policy: a corrigendum”. Review of Economic Studies 82 (4), 1342–1345.
Donoho, David L. (2010). “An invitation to reproducible computational research”. Biostatistics 11 (3), 385–388.
Fernández-Villaverde, Jesús, Pablo A. Guerrón-Quintana, Juan F. Rubio-Ramírez, and Martín Uribe (2011). “Risk matters: the real effects of volatility shocks”. American Economic Review 101 (6), 2530–2561.
Fernández-Villaverde, Jesús, Juan F. Rubio-Ramírez, and Manuel S. Santos (2006). “Convergence properties of the likelihood of computed dynamic models”. Econometrica 74 (1), 93–119.
Gerking, Shelby and William E. Morgan (2007). “Effects of environmental and land use regulation in the oil and gas industry using the Wyoming checkerboard as a natural experiment: retraction”. American Economic Review 97 (3), 1032.
Hannay, Jo Erskine et al. (2009). “How do scientists develop and use scientific software?” Proceedings of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering. SECSE ’09. Washington, DC, USA: IEEE Computer Society, 1–8.
Herndon, Thomas, Michael Ash, and Robert Pollin (2014).
“Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff”. Cambridge Journal of Economics 38 (2), 257–279.
Jermann, Urban and Vincenzo Quadrini (2012a). “Erratum: macroeconomic effects of financial shocks”. American Economic Review 102 (2), 1186.
Jermann, Urban and Vincenzo Quadrini (2012b). “Macroeconomic effects of financial shocks”. American Economic Review 102 (1), 238–271.
Klein, Paul (2000). “Using the generalized Schur form to solve a multivariate linear rational expectations model”. Journal of Economic Dynamics and Control 24 (10), 1405–1423.
Kunce, Mitch, Shelby Gerking, and William Morgan (2002). “Effects of environmental and land use regulation in the oil and gas industry using the Wyoming checkerboard as an experimental design”. American Economic Review 92 (5), 1588–1593.
Kurmann, André and Elmar Mertens (2014). “Stock prices, news, and economic fluctuations: comment”. American Economic Review 104 (4), 1439–1445.
Kydland, Finn E. and Edward C. Prescott (1982). “Time to build and aggregate fluctuations”. Econometrica 50 (6), 1345–1370.
Lackman, Conway L. (1982). “A note on the stability limitations in ‘A stable price adjustment process’”. Quarterly Journal of Economics 97 (3), 541–542.
List, John A., Charles D. Bailey, Patricia J. Euzent, and Thomas L. Martin (2001). “Academic economists behaving badly? A survey on three areas of unethical behavior”. Economic Inquiry 39 (1), 162–170.
Lu, Susan Feng, Ginger Zhe Jin, Brian Uzzi, and Benjamin Jones (2013). “The retraction penalty: evidence from the Web of Science”. Scientific Reports 3 (3146).
Lucas, Robert E. (1976). “Econometric policy evaluation: a critique”. Carnegie-Rochester Conference Series on Public Policy 1 (1), 19–46.
Markowetz, Florian (2015). “Five selfish reasons to work reproducibly”. Genome Biology 16 (274).
McConnell, Steve (2004). Code complete. 2nd ed.
Microsoft Press.
McCullough, B. D. and H. D. Vinod (1999). “The numerical reliability of econometric software”. Journal of Economic Literature 37, 633–665.
Miguel, Edward and Michael Kremer (2004). “Worms: identifying impacts on education and health in the presence of treatment externalities”. Econometrica 72 (1), 159–217.
Muth, John F. (1961). “Rational expectations and the theory of price movements”. Econometrica 29 (3), 315–335.
Necker, Sarah (2014). “Scientific misbehavior in economics”. Research Policy 43 (10), 1747–1759.
Pfeifer, Johannes (2013a). “A guide to specifying observation equations for the estimation of DSGE models”. Mimeo. University of Mannheim.
Pfeifer, Johannes (2013b). “An introduction to graphs in Dynare”. Mimeo. University of Mannheim.
Prabhu, Prakash et al. (2011). “A survey of the practice of computational science”. Proceedings of the 24th ACM/IEEE Conference on High Performance Computing, Networking, Storage and Analysis. SC ’11. Seattle, Washington: ACM, 19:1–19:12.
Prescott, Edward C. (1986). “Theory ahead of business cycle measurement”. Federal Reserve Bank of Minneapolis Quarterly Review 10 (4), 9–21.
Primiceri, Giorgio E. (2005). “Time varying structural vector autoregressions and monetary policy”. Review of Economic Studies 72 (3), 821–852.
QJE (1984). “Notice to our readers”. Quarterly Journal of Economics 99 (2), 383–384.
Reinhart, Carmen M. and Kenneth S. Rogoff (2010). “Growth in a time of debt”. American Economic Review 100 (2), 573–578.
Schmitt-Grohé, Stephanie and Martín Uribe (2004). “Solving dynamic general equilibrium models using a second-order approximation to the policy function”. Journal of Economic Dynamics and Control 28 (4), 755–775.
Smets, Frank and Rafael Wouters (2007). “Shocks and frictions in US business cycles: a Bayesian DSGE approach”. American Economic Review 97 (3), 586–606.
Soergel, David A. W. (2015).
“Rampant software errors may undermine scientific results”. F1000Research 3 (303).
Steen, R. Grant, Arturo Casadevall, and Ferric C. Fang (2013). “Why has the number of scientific retractions increased?” PLOS ONE 8 (7), e68397.
Wieland, Volker, Tobias Cwik, Gernot J. Müller, Sebastian Schmidt, and Maik Wolters (2012). “A new comparative approach to macroeconomic modeling and policy analysis”. Journal of Economic Behavior & Organization 83, 523–541.
Wilson, Greg et al. (2014). “Best practices for scientific computing”. PLoS Biology 12 (1), e1001745.
Zimmermann, Christian (2015). “On the need for a replication journal”. FRB St. Louis Paper 2015-016.