Replication and Transparency of Macro Models
Transcription
Replication and Transparency of Macro Models (and other musings and thoughts)
Johannes Pfeifer (Mannheim)
Replication and Transparency in Economic Research, January 6/7, 2016

Two Guiding Principles
- Never attribute to malice that which is adequately explained by stupidity (Hanlon's Razor)
- Let him who is without sin cast the first stone (John 8:7)

History Coding Preserving Integrity of Literature Actions Data Policies Working reproducibly Scientific Computing 1/74

[Comic: www.smbc-comics.com]

Why Macro Is Different
- At least in the area of business cycle research, we work with fairly complicated structural models but rather straightforward data
- This gives rise to a particular set of computational challenges
- Many papers tend to be very complex and (almost) require a PhD for replication

Quick History of Thought (after monetarism)
- Muth (1961): rational expectations, i.e. agents should not make systematic prediction errors given the available information
- Lucas (1976): for policy advice we need structural models that are invariant to changes in the policy experiments under consideration
- Kydland and Prescott (1982): a simple DSGE model with TFP shocks generates cyclical fluctuations resembling the ones found in the data
- "We chose not to test our model [...] this most likely would have resulted in the model being rejected..." (Kydland and Prescott, 1982)
- "The models constructed within this theoretical framework are necessarily highly abstract.
Consequently, they are necessarily false, and statistical hypothesis testing will reject them." (Prescott, 1986)
- Late 1980s/early 1990s: New Keynesians push for more sophisticated models and formal econometric tests

Challenge I: Model Solution
- The dynamic stochastic structure of the models gives rise to nonlinear stochastic difference equations that describe the evolution of the model variables
- Solving these difference equations is hard, but there are two ways out:
  - Substitute an easier problem for the original one: linearize the model
  - Use numerical techniques to solve the model
- The computer is better than humans at both tasks

Challenge II: Bringing the Model to the Data
- Estimating linearized models via maximum likelihood using the Kalman filter is straightforward
- But: the likelihood is a high-dimensional object
- Even for simple models, it can be ill-behaved, showing hardly any curvature and exhibiting many local maxima
- For more complicated models, you can think of it as an egg-crate

Challenge II: Bringing the Model to the Data (continued)
- "Dilemma of absurd parameter estimates" (An and Schorfheide, 2007): ML estimates are often at odds with information from outside of the model
- Solution: use Bayesian techniques that augment the likelihood with prior information → makes the posterior more well-behaved
- Problem: Bayesian econometrics often involves working with intractable posterior distributions → need to work out complicated integrals
- Solution: use numerical integration techniques in the computer (relying on Metropolis-Hastings, the Gibbs sampler, etc.)
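As a concrete illustration of the numerical-integration step, a random-walk Metropolis-Hastings sampler can be sketched in a few lines. This is a toy example, not any package's actual implementation: the target is a standard normal log-density standing in for an intractable log-posterior, and all function names here are made up for illustration.

```python
import math
import random

def log_target(theta):
    # Stand-in for log-likelihood + log-prior; here a standard
    # normal log-density (up to an additive constant).
    return -0.5 * theta * theta

def random_walk_metropolis(log_target, theta0, n_draws, step=0.5, seed=1):
    """Propose theta' = theta + N(0, step^2) and accept with
    probability min(1, p(theta') / p(theta))."""
    rng = random.Random(seed)
    theta, log_p = theta0, log_target(theta0)
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        log_p_prop = log_target(proposal)
        if rng.random() < math.exp(min(0.0, log_p_prop - log_p)):
            theta, log_p = proposal, log_p_prop  # accept the proposal
        draws.append(theta)  # on rejection, the old draw is repeated
    return draws

draws = random_walk_metropolis(log_target, theta0=3.0, n_draws=20000)
kept = draws[5000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

Averages of the retained draws approximate posterior moments; here `posterior_mean` should come out close to 0, the mean of the toy target.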
- Smets and Wouters (2007): milestone study showing that the forecasting power of a DSGE model estimated with Bayesian techniques is on par with a BVAR

Result
- If you work in quantitative macro, the computer is your best friend and your worst enemy!
- Nowadays macroeconomic work is almost impossible without scientific computing software
- Coding is an integral part of economic research, unless it is purely theoretical (and even then software helps with checking the algebra)
- The development of information technologies has been a driver of more sophisticated research techniques

The Problem
- Complexity has massively increased
- Training, focus on computational details, and software development have not necessarily kept pace
- McCullough and Vinod (1999): classical study showing that even commercially available software packages sometimes return wildly differing results in standard applications → thorough benchmarking needed
- But: macroeconomists rely less on standard commercially available packages like Stata
- This puts verification at the forefront

Clemens (2015), Journal of Economic Surveys
Table 1. A Proposed Standard for Classifying Any Study as a Replication.
                 Same           Same        Same
                 specification  population  sample   Examples
Replication (same methods in the follow-up study as reported in the original; sufficient conditions for a discrepancy: random chance, error, or fraud)
  Verification   Yes            Yes         Yes      Fix faulty measurement, code, data set
  Reproduction   Yes            Yes         No       Remedy sampling error, low power
Robustness (different methods; the sampling distribution has changed)
  Reanalysis     No             Yes         Yes/No   Alter specification, recode variables
  Extension      Yes            No          No       Alter place or time; drop outliers

Notes: The "same" specification, population, or sample means the same as reported in the original paper, not necessarily what was contained in the code and data used by the original paper. Thus, for example, if code used in the original paper contains an error such that it does not run exactly the regressions that the original paper said it does, new code that fixes the error is nevertheless using the "same" specifications (as described in the paper).

Coding
- Scientists spend 30% or more of their time on developing their own software (Hannay et al., 2009; Prabhu et al., 2011)
- Thus research quality and results depend heavily on the developed software
- Knowing how to do it right is as important as learning programming.
- It helps to get more reliable results
- It decreases the amount of time needed to develop software and makes the work more efficient
- It allows for replicability (which increases the validity of the results)
- Mistakes in code are dangerous not only for the quality of the project, but also for those citing it (domino effect)

[Comic: www.dilbert.com] Bad code only helps in rare cases

Case Study: Reinhart and Rogoff (2010)
- Herndon et al. (2014): "We replicate Reinhart and Rogoff (2010) and find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period."
- The result was cited, e.g., by the German finance minister when pushing for austerity in Europe

Coding Errors
- Mistakes are made even by professionals. According to McConnell (2004) and NASA:
  - The industry average is about 1 to 25 errors per 1,000 lines of code for delivered software
  - The Applications Division at Microsoft experiences about 10 to 20 defects per 1,000 lines of code during in-house testing, and 0.5 defects per 1,000 lines of code in released products
  - Space-shuttle software has achieved a level of 1 defect in 500,000 lines

Soergel (2015): Rampant software errors may undermine scientific results
- Even when a program contains an error, testing may not expose it (i.e., it may give the correct output for some inputs but not others).
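Soergel's multiplicative estimate of how many bugs survive into a reported result is simple enough to compute directly. A minimal sketch (the function name is mine; the inputs are the "optimistic scenario" figures used in the talk):

```python
def expected_relevant_errors(total_loc, share_executed, errors_per_line,
                             p_changes_result, p_result_plausible):
    # Soergel's estimate: expected number of errors per program run
    # that change the result AND still look plausible to the scientist.
    return (total_loc * share_executed * errors_per_line
            * p_changes_result * p_result_plausible)

# Optimistic scenario: 1,000 LOC, all executed, 1 error per 1,000 lines,
# 10% chance an error meaningfully changes the outcome, 50% chance the
# erroneous result looks plausible -> about 0.05, i.e. roughly a 5%
# chance of a wrong but plausible output.
risk = expected_relevant_errors(1000, 1.0, 1 / 1000, 0.10, 0.50)
```

Plugging in a large, complex analysis instead (more code, higher error rate) quickly pushes the expected count above one, which is the "effectively 100%" case discussed below.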
- Finally, many errors may cause a program to simply crash or to report an obviously implausible result, but we are really only concerned with errors that propagate downstream and are reported.
- In combination, then, we can estimate the number of errors that actually affect the result of a single run of a program as follows: number of errors per program execution = total lines of code (LOC) × proportion executed × probability of error per line × probability that the error meaningfully affects the result × probability that an erroneous result appears plausible to the scientist.
- For these purposes, using a formula to compute a value in Excel counts as a "line of code", and a spreadsheet as a whole counts as a "program"; so many scientists who may not consider themselves coders may still suffer from bugs.
- All of these values may vary widely depending on the field and the source of the software. Consider the following two scenarios.

Soergel (2015), Scenario 1: A large, complex analysis
- "Multiplying these, we expect that two errors changed the output of this program run, so the probability of a wrong output is effectively 100%. All bets are off regarding scientific conclusions drawn from such an analysis." Even optimistic scenarios look pretty bleak:

Soergel (2015), Scenario 2: A small, focused analysis, rigorously executed
- "Let's imagine a more optimistic scenario, in which we write a simple, short program, and we go to great lengths to test and debug it. In such a case, any output that is produced is in fact more likely to be plausible, because bugs producing implausible outputs are more likely to have been eliminated in testing."
- 1,000 total LOC; 100% executed; 1 error per 1,000 lines; 10% chance that a given error meaningfully changes the outcome; 50% chance that a consequent erroneous result is plausible
- Here the probability of a wrong output is 5%.

Soergel (2015): Why should we care about coding errors?
- In general, software errors produce outcomes that are inaccurate, not merely imprecise
- Errors in experiments in the hard sciences often reduce precision: the results will be a bit off
- Software bugs are different! "Small" bugs may have unbounded error propagation
- A sign error or a shift by one entry when matching data columns can render the data complete noise
- Results affected by software bugs will often be inaccurate

Solution!? Maximize the Probability of Error Detection
- Use "standard" software with a large user base where possible
- Use software with decent software-engineering standards
- Do not blindly trust someone else's code

Perturbation Techniques
- There are various codes for computing perturbation solutions to DSGE models, ranking from worst to best:
  1. solab.m (Klein, 2000) and Chris Sims's gensys
  2. The Schmitt-Grohé and Uribe (2004) toolkit
  3.
Dynare (Adjemian et al., 2011)
- If you are teaching students, first teach them the method of undetermined coefficients and Blanchard and Kahn (1980) so that they see the nuts and bolts
- Then teach them Dynare

Dynare
- Open-source, free software for solving, simulating, and estimating D(S)GE models
- Works under Matlab and Octave
- Advantages:
  - Automates many error-prone steps (never linearize by hand)
  - Large collection of mod-files available (a quasi-standard by now)
  - Big community
  - Decent software-engineering standards

Dynare: Software Engineering
- Dynare manual (http://www.dynare.org/documentation-and-support/manual)
- Dynare forum (http://www.dynare.org/phpBB3/)
- Code archive (https://github.com/DynareTeam/dynare/)
- List of known bugs (http://www.dynare.org/DynareWiki/KnownBugs)
- Testsuite (http://www.dynare.org/testsuite/)

The "Black Box" Argument
- Common argument: "Dynare is a black box"
- Counterargument: it is only a black box until you decide to open it
- All code is publicly available and the algorithms are well documented
- You don't need to reinvent the wheel
- The probability of a bug in Dynare going undetected is lower than in your own code!

Dynare Resources
- https://github.com/johannespfeifer/dsge_mod
- Johannes Pfeifer (2013a). "A guide to specifying observation equations for the estimation of DSGE models". Mimeo, University of Mannheim
- Johannes Pfeifer (2013b). "An introduction to graphs in Dynare". Mimeo.
University of Mannheim
- Macro Model Database 2.0 (Wieland et al., 2012)

Macro Model Database: A Few Words of Warning
- A great tool and a good starting point, with about 60 heavily used models
- Take their "replication" claims with a grain of salt
- Most mod-files deviate from Dynare "best practices"
- Most serious issue: parameter dependencies are often not correctly handled
- This limits the reusability of the mod-files, as they cannot directly be used for estimation → users cannot bring the models to the data as they are

Standard Software: A Caveat
- "Linus's Law": "Given enough eyeballs, all bugs are shallow"
- Problem: this only applies when many eyeballs actually read and test the code
- A large user base is not sufficient (see the "Heartbleed" OpenSSL bug, which affected about 17% of all secure web servers for two years)

Example discussed here: a paper that had required a correction, but where many papers that relied on the old, wrong code had not been corrected

Lu et al. (2013): Retractions in Economics
- Retractions in economics and business administration are extremely rare
- A moment of introspection: Maybe we work more thoroughly than other subjects? Maybe we are just more honest or have fewer opportunities for misbehavior?

Figure 1 | Retraction characteristics.
Of the 1,423 retractions indexed by the Web of Science, the percentage of total retractions is greatest in the sciences.

Necker (2014): Surveys Among Economists I
- "The correction, fabrication, or partial exclusion of data, incorrect co-authorship, or copying of others' work is admitted by 1–3.5%. The use of 'tricks to increase t-values, R2, or other statistics' is reported by 7%. Having accepted or offered gifts in exchange for (co-)authorship, access to data, or promotion is admitted by 3%. Acceptance or offering of sex or money is reported by 1–2%. One percent admits to the simultaneous submission of manuscripts to journals. [..] According to their responses, 6.3% of the participants have never engaged in a practice rejected by at least a majority of peers. John et al. (2012) report almost the same fraction for psychologists."
- Translation: we are not better than other subjects

Necker (2014): Surveys Among Economists III
- "Respondents were asked which fraction of research in the top general and top field journals (A+ or A) they believe to be subject to different types of misbehavior ('up to ... %', scale given in deciles). The fabrication of data is expected to be the least widespread. The median response is 'up to 10%.' Respondents believe that incorrect handling of others' ideas, e.g., plagiarism, is more common; the median is 'up to 20%' of published research."

List et al. (2001): Survey Among Economists II
Table 2. Summary Statistics of Responses

Research "Felonies" (Falsification)
- Self (Q 9): Have you ever falsified research data?
- Others (Q 9a): What percentage of research in the top 30 journals do you believe is falsified?
  Self (Q 9): Randomized response (n = 140): 4.49 (0.30); Direct response (n = 94): 4.26 (0.22)
  Others (Q 9a): Randomized response: 7.04 (0.85); Direct response: 5.13 (0.73)

Research "Misdemeanors" and Selling Grades
- Self (Q 10): Have you ever committed any of four "minor" infractions? Randomized response: 10.17 (0.34); Direct response: 7.45 (2.72)
- Others (Q 10a): What percentage of research in the top 30 journals do you believe is affected by these "minor" infractions? Randomized response: 16.98 (1.52); Direct response: 12.95 (1.50)
- Self (Q 11): Have you ever accepted sex, money, or gifts in exchange for grades? Randomized response: 0.40 (0.27); Direct response: 0.0 (0.0)
- Others (Q 11a): What percentage of economics faculty members do you believe have accepted sex, money, or gifts in exchange for grades? Randomized response: 4.26 (0.50); Direct response: 3.82 (0.51)

Notes: Cell contents are means (standard errors) and represent percentages. For randomized response questions, means and variances are computed as mu_RR = (Z − (1 − P)·pi)/P and var_RR = Z(1 − Z)/((n − 1)P²), where Z is the observed proportion of yes responses, P is the probability of answering the sensitive question, pi is the proportion of yes responses to the nonsensitive question (in our case a series of coin flips, hence pi = 1), and n is the sample size.

- Echoes earlier findings from the 1998 ASSA meeting

Steen et al. (2013): What should we expect? Why Are There More Scientific Retractions?
Table 1. Correlations among journal impact factor (IF) and time-to-retraction expressed in months for different infractions.

Infraction                     n     Journal IF      Months to retract  r (IF × months)  R      P
                                     mean (SD)       mean (SD)
Misconduct + poss. misconduct  889   8.71 (10.08)    43.03 (37.40)      −0.079           −2.39  0.01
Misconduct                     697   9.10 (10.24)    46.78 (38.38)      −0.120           −3.19  0.01
Possible misconduct            192   7.31 (9.38)     29.41 (29.97)       0.030            0.41  NS
Plagiarism                     200   2.63 (2.42)     26.04 (32.55)      −0.134           −1.90  0.05
Error                          437   10.98 (11.61)   26.03 (27.95)       0.029            0.60  NS
Duplicate publication          290   3.91 (6.33)     26.61 (29.63)      −0.027           −0.46  NS
All retractions                2047  7.30 (9.54)     32.91 (34.24)      −0.027           −1.22  NS

This table includes all retracted articles. "Misconduct + poss.
misconduct" includes both "Misconduct" and "Possible misconduct," which are also analyzed separately. The correlation coefficient r is tested for significance with the R statistic, which has a t-distribution. Numbers do not sum because this table does not include "other" and "unknown" infractions, and because some papers were retracted for more than one infraction. doi:10.1371/journal.pone.0068397.t001

How many retractions do you know?

From the paper's methods section: "The PubMed database of the National Center for Biotechnology Information was searched on 3 May 2012, using the limits of 'retracted publication, English language.' A total of 2,047 articles were identified, all of which were exported from PubMed and entered in an Excel database. Each article was classified according to the cause of retraction, using published retraction notices, proceedings from the Office of Research Integrity (ORI), [...]"

Digression
Q: How do economics journals deal with these issues?
A: Often not well.
At least in economics, it is almost impossible to directly spot any issues when looking at journal homepages. Examples:
- Primiceri (2005), Review of Economic Studies
- Del Negro and Primiceri (2015), Review of Economic Studies
- Jermann and Quadrini (2012b), American Economic Review
- Jermann and Quadrini (2012a), American Economic Review
- Kunce et al. (2002), American Economic Review

Gerking and Morgan (2007), American Economic Review: "Effects of Environmental and Land Use Regulation in the Oil and Gas Industry Using the Wyoming Checkerboard as a Natural Experiment: Retraction", by Shelby Gerking and William E. Morgan:
"The purpose of this note is to call attention to, and to take responsibility for, errors in a previously published paper (Mitch Kunce, Shelby Gerking, and William Morgan 2002). The main finding reported in that paper is that oil and natural gas wells are significantly more costly to drill on federal property than on private property. This note explains why the paper's results are being retracted from the literature. Findings presented in the original paper cannot be substantiated because the data furnished [...] Although IHS classifies wells by land type, wells of a given type in a given region in a given year will have the same reported cost per foot regardless of whether they were drilled on federal or private property. Thus there is no independent variation in much of the drilling cost data independent of the variables used in the regression model. While the data provided by IHS do not show a difference in drilling cost by land type conditional on the variables in the regression model, errors in our handling of the data made it appear [...]"

Further examples:
- Fernández-Villaverde, Rubio-Ramírez, et al. (2006), Econometrica
- Ackerberg et al. (2009), Econometrica
- Lackman (1982), Quarterly Journal of Economics
- Chenault (1984), Quarterly Journal of Economics

QJE (1984), Quarterly Journal of Economics: NOTICE TO OUR READERS
"The following article (with minor copy-editing differences) was published in The Quarterly Journal of Economics, vol. 97, no. 3, August 1982, pp. 541-42 under the name of Prof. Conway L. Lackman of Rutgers University. Shortly after publication, Prof. Larry Chenault of Miami University asserted to the Board of Editors that the published article was, with minor differences, a paper that Chenault had written and submitted to two other professional journals. Professor Lackman's submission to The Quarterly Journal of Economics, received on 22 September 1981, was not a typewritten original, but a xerographic document. After refereeing, the paper was accepted for publication on 3 December 1981. Prof.
Lackman thereafter received from The Quarterly Journal of Economics galley proofs, accompanied by a copy of his submission. Following the return of galley proofs, the paper was published with minor changes from Lackman's submission copy. Upon receipt of Chenault's assertions, a member of the Board [...]"

How to Deal with Known Issues?
- Sometimes there are well-known issues with published papers
- People in the inner circle of the community are well aware of these issues (cf. the "Worm Wars" over Miguel and Kremer (2004))
- But: newcomers and outsiders often are not
- Consequently, they may spend an inordinate amount of time trying to replicate or build upon problematic papers, or put too much trust in published papers
- Do we as economists perform well in preserving the integrity of the literature? How many PhD students have wasted years of their lives due to this? → high social costs

Is a Better Refereeing Process the Solution?
- In some cases, the referees obviously failed to do their job
- My experience: more often, detecting true issues requires months of hard work
- Fernández-Villaverde, Guerrón-Quintana, et al.
(2011): no indication in the paper at all that something might be off; only the codes gave it away
- In the game of refereeing, the incentives are stacked against referees thoroughly checking codes (particularly in early rounds and when the paper gets rejected)
- Any effort you put in is anonymous and will only be valued by the editor → incentives even worse than for comments
- This puts a bigger burden on post-publication peer review/checking (attention will correlate with impact)

Comments
Comments seem to have partially filled this gap:
- Kurmann and Mertens (2013) on Beaudry and Portier (2006) (~570 citations)
- Born and Pfeifer (2014) on Fernández-Villaverde, Guerrón-Quintana, et al. (2011) (~290 citations)
- Ackerberg et al. (2009) on Fernández-Villaverde, Rubio-Ramírez, et al. (2006) (~80 citations)

Comments: Costs vs. Benefits
Writing a comment is risky; the private returns are almost surely smaller than the social returns:
- You do not know the standard the journal will apply and whether the comment will get published
- Other journals often do not touch comments on papers not in their own journal
- You may alienate the original authors and make powerful enemies
- Often only original research counts towards evaluations: "You should rather do original research instead of wasting your time on other people's research"
- You might get a reputation as a "nitpicker"
Additionally:
- Some journals do not publish comments at all
- Comments are not an attractive option for lower-tier journals

Sidenote: Top vs.
lower-ranked journals
- There is not much evidence on the reliability/correctness of articles in different journal tiers
- Do top-journal articles have higher quality because they are a positive selection and attract more scrutiny upon publication?
- Or are mistakes more likely because the research is at the frontier, less standard, and more complex?
- Are lower-ranked journal articles more problematic because they face less scrutiny by readers and referees?
- Do editors at lower-tier journals have incentives to deal with messy cases, or is it better to sweep them under the rug?
- My take: lower-ranked journals have a higher share of problematic articles

Comments on Journal Homepages
- Not common in economics
- The AEA offers this for the AEJs, but strangely not for the AER
- Flies too much under the radar (https://www.aeaweb.org/articles.php?doi=10.1257/pol.6.1.167)
- Requires a login

New Instruments: Replication Wiki (http://replication.uni-goettingen.de/wiki/)
- The Replication Wiki aims at providing an authoritative database on replication issues
- It already catalogues many papers for which formal replications have been conducted
- It also offers a "talk page" where issues with papers can be discussed
- But: no anonymous comments are possible
- In particular, PhD students and early-career researchers shy away from being associated with a critique of important figures in the field → the functionality is not used that much

New Instruments: Replication Studies (Zimmermann, 2015)
- Some journals are willing to publish replication studies
- The Journal of Applied Econometrics has an exclusive list of journals for which replication studies are considered
- Econ Journal Watch, an online-only,
open-access journal, has the goal to "watch the journals for inappropriate assumption, weak chains of argument, phony claims of relevance, omissions of pertinent truths, and irreplicability (EJW also publishes replications)"
- The Journal of the Economic Science Association promises to be explicitly receptive to replication studies, but its scope is limited to experimental economics
- The International Journal of Economic Micro Data is a new online open-access journal with a replication section
- Problem: "it takes as long to write a short paper as a long one"

New Instruments: PubPeer, the online journal club (www.pubpeer.com)
- A site offering post-publication peer review
- Not much used in economics, but heavily used in the life sciences
- Gained traction after several high-profile publications in the "tabloids" Nature and Science were brought down by comments on PubPeer
- Has the potential to become the go-to portal for issues with articles, but there is still a long way to go for network effects in economics to kick in
- Big advantage: anonymous commenting (http://blog.pubpeer.com/?p=200)
- Important: with great power comes great responsibility

New Instruments: Versioning
- Having one and only one version of a published article is an anachronism from the print age
- In the internet age, in principle nothing prevents the updating of articles, provided the changes are tracked
- Might be an interesting way to deal with problems and corrections
- For an example, see Soergel (2015) at http://dx.doi.org/10.12688/f1000research.5930.2

[Comic: www.dilbert.com]

First step: Data and Replication
Data policies (https://www.aeaweb.org/aer/data.php)
The first step in macro research should be straightforward: replication
The data policy at the AER stipulates: “For econometric and simulation papers, the minimum requirement should include the data set(s) and programs used to run the final models, plus a description of how previous intermediate data sets and programs were employed to create the final data set(s). Authors are invited to submit these intermediate data files and programs as an option; if they are not provided, authors must fully cooperate with investigators seeking to conduct a replication who request them.”

Chang and Li (2015): the bleak picture
“We attempt to replicate 67 papers published in 13 well-regarded economics journals using author-provided replication files that include both data and code. [...] Aside from 6 papers that use confidential data, we obtain data and code replication files for 29 of 35 papers (83%) that are required to provide such files as a condition of publication, compared to 11 of 26 papers (42%) that are not required to provide data and code replication files. We successfully replicate the key qualitative result of 22 of 67 papers (33%) without contacting the authors. Excluding the 6 papers that use confidential data and the 2 papers that use software we do not possess, we replicate 29 of 59 papers (49%) with assistance from the authors. Because we are able to replicate less than half of the papers in our sample even with help from the authors, we assert that economics research is usually not replicable.”

Donoho (2010): why does this matter?
“An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.” (John Claerbout)

The missing estimation codes
AER replication packages typically only provide the codes for simulating the “final model”, but not the estimation codes themselves
Discussed here: three examples from the AER where only the simulation codes are available, but not the estimation codes used to obtain the parameterization for the simulation
In one case, the estimation codes would have allowed directly spotting the error that was made
In another case, only these missing estimation codes would allow checking where the obviously wrong results come from

Example from the AER
Data and replication files were only available for the baseline case of one country, not for the other countries analyzed
Replication files were not available for the estimation, only for the simulation
National accounts data were copied and pasted into a Matlab file without providing information on source, vintage, or seasonal adjustment
Figures showed that differing samples were used, but there was no mention of which ones

All code is there. Problem Solved!?
Even if all the code is there, it does not necessarily run (anymore)
Takeaway:
Clearly state the software version used, including the operating system
Example: Dynare 4.4.3 on Matlab 2015b, Windows 7, 64bit
Make sure all external files are included
If you do not have the rights to include the files in a repository, clearly state where they can be obtained
Ideally: after constructing the repository for submission, try to run it on a different machine to verify that everything is included and works

Markowetz (2015): Why should I work reproducibly?
People respond to incentives (at least according to Mankiw)
Five selfish reasons:
- reproducibility helps to avoid disaster
- reproducibility makes it easier to write papers
- reproducibility helps reviewers see it your way
- reproducibility enables continuity of your work
- reproducibility helps to build your reputation

How do I work reproducibly?
Modern journal articles are not conducive to reproducible research
Due to printed versions, size (or length) still matters
Many papers are sufficiently detailed to understand the gist of the relevant elements, but are ill-suited for replication
How often have you read a version of: “for a more detailed and readable version of the paper, see the working paper”?
Two crucial tools: 1. Technical Appendices, 2.
Replication Files

Technical Appendices
Much of the meat of a paper is relegated to appendices
Technical appendices in quantitative macro are often as long as or longer than the paper itself
Unfortunately, they still often do not contain all the required information

Technical Appendices: what should they contain?
A list of all variables and the corresponding set of equations that determine these variables in the final model
That encompasses documenting how to get from the presented (nonstationary) model to the (stationary) one usable in the computer
A clear description of the computational algorithms used, including all “shortcuts” taken
A table with all parameter values used, not just the ones determining the dynamics (try finding the labor disutility parameter in many macro papers)
Dynare allows users to easily output LaTeX code of the equations used, as well as a list of variables and a parameter table

Data Appendix and Data Files
A list of all data sources used, including the mnemonics that allow unique identification
State the exact sample used for every exercise and the exact seasonal adjustment/filtering conducted
Many filters (the Baxter and King (1999) filter, the first-difference filter) introduce artifacts at the beginning and end of the sample
State how you dealt with these, i.e. does the stated sample refer to the data before or after applying the filter?
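To see why the pre-/post-filter distinction matters, here is a minimal sketch in Python with NumPy (the random series and the flat filter weights are purely illustrative, not the actual Baxter-King weights): first differencing loses one observation at the start of the sample, while a symmetric moving-average filter with K leads and lags loses K observations at both ends.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)  # hypothetical series with 100 observations

# First-difference filter: the filtered series starts one period later
dy = np.diff(y)
print(len(dy))  # 99

# Symmetric moving-average filter with K leads/lags (Baxter-King typically
# uses K = 12 for quarterly data); flat weights here are purely illustrative
K = 12
weights = np.full(2 * K + 1, 1.0 / (2 * K + 1))
y_filtered = np.convolve(y, weights, mode="valid")
print(len(y_filtered))  # 100 - 2*12 = 76
```

A stated sample range is therefore ambiguous unless the paper says whether it refers to the 100 raw observations or to the 76 observations that survive the filter.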
Avoid only providing the final, cleaned, and treated data
Instead, provide the files that show how the final sample was created from the raw data

Avoid using Excel! (www.dilbert.com)
Avoid using Excel! Unless... (www.dilbert.com)

Simulation Studies and Random Numbers (Born and Pfeifer, 2014)
Case study: Fernández-Villaverde, Guerrón-Quintana, et al. (2011)
[Figure: σNX/σY averaged over replications, plotted against the number of replications (1,000 to 10,000); marked value 1.63, data: 0.39]
In principle, random number generator seeds should not matter
But in practice they do! Codes should always provide the seed!

Finally, the most important one...
Write code for your future self!

Summary of Best Practices
The following is based on Greg Wilson et al. (2014). “Best practices for scientific computing”. PLoS Biology 12 (1), e1001745. doi: 10.1371/journal.pbio.1001745
A list of suggestions relevant for scientific work in the field of economics:
1. Write a program for people, not computers
2. Let the computer do the work
3. Make incremental changes
4. Don’t repeat yourself
5. Plan for mistakes
6. Optimize software only after it works correctly
7. Document design and purpose, not mechanics
8.
Collaborate

Conclusion
As already mentioned, organizing and optimizing your programming workflow is highly efficient, as it:
- decreases the incidence of mistakes, and makes those that do occur easier to find,
- increases the ability to replicate the project, making it more reliable,
- makes the time spent on writing code much more efficient
Thus, as with everything else, programming doesn’t just need to be done, it needs to be done correctly
Follow the rules, optimize your time, and make it easier for yourself and for others

Thank you for your attention!

Bibliography
Ackerberg, Daniel, John Geweke, and Jinyong Hahn (2009). “Comments on ‘Convergence properties of the likelihood of computed dynamic models’”. Econometrica 77 (6), 2009–2017.
Adjemian, Stéphane et al. (2011). “Dynare: reference manual version 4”. Dynare Working Papers 1. CEPREMAP.
An, Sungbae and Frank Schorfheide (2007). “Bayesian analysis of DSGE models”. Econometric Reviews 26 (2-4), 113–172.
Baxter, Marianne and Robert G. King (1999). “Measuring business cycles: approximate band-pass filters for economic time series”. Review of Economics and Statistics 81 (4), 575–593.
Beaudry, Paul and Franck Portier (2006). “Stock prices, news, and economic fluctuations”. American Economic Review 96 (4), 1293–1307.
Blanchard, Olivier Jean and Charles M. Kahn (1980). “The solution of linear difference models under rational expectations”. Econometrica 48 (5), 1305–1311.
Born, Benjamin and Johannes Pfeifer (2014). “Risk matters: the real effects of volatility shocks: Comment”. American Economic Review 104 (12), 4231–4239.
Chang, Andrew C.
and Phillip Li (2015). “Is economics research replicable? Sixty published papers from thirteen journals say ‘usually not’”. Finance and Economics Discussion Series 2015-083. Board of Governors of the Federal Reserve System.
Chenault, Larry A. (1984). “A note on the stability limitations in ‘A stable price adjustment process’”. Quarterly Journal of Economics 99 (2), 385–386.
Clemens, Michael A. (2015). “The meaning of failed replications: a review and proposal”. Journal of Economic Surveys.
Del Negro, Marco and Giorgio E. Primiceri (2015). “Time varying structural vector autoregressions and monetary policy: a corrigendum”. Review of Economic Studies 82 (4), 1342–1345.
Donoho, David L. (2010). “An invitation to reproducible computational research”. Biostatistics 11 (3), 385–388.
Fernández-Villaverde, Jesús, Pablo A. Guerrón-Quintana, Juan F. Rubio-Ramírez, and Martín Uribe (2011). “Risk matters: the real effects of volatility shocks”. American Economic Review 101 (6), 2530–2561.
Fernández-Villaverde, Jesús, Juan F. Rubio-Ramírez, and Manuel S. Santos (2006). “Convergence properties of the likelihood of computed dynamic models”. Econometrica 74 (1), 93–119.
Gerking, Shelby and William E. Morgan (2007). “Effects of environmental and land use regulation in the oil and gas industry using the Wyoming checkerboard as a natural experiment: retraction”. American Economic Review 97 (3), 1032.
Hannay, Jo Erskine et al. (2009). “How do scientists develop and use scientific software?” Proceedings of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering. SECSE ’09. Washington, DC, USA: IEEE Computer Society, 1–8.
Herndon, Thomas, Michael Ash, and Robert Pollin (2014).
“Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff”. Cambridge Journal of Economics 38 (2), 257–279.
Jermann, Urban and Vincenzo Quadrini (2012a). “Erratum: macroeconomic effects of financial shocks”. American Economic Review 102 (2), 1186.
Jermann, Urban and Vincenzo Quadrini (2012b). “Macroeconomic effects of financial shocks”. American Economic Review 102 (1), 238–271.
Klein, Paul (2000). “Using the generalized Schur form to solve a multivariate linear rational expectations model”. Journal of Economic Dynamics and Control 24 (10), 1405–1423.
Kunce, Mitch, Shelby Gerking, and William Morgan (2002). “Effects of environmental and land use regulation in the oil and gas industry using the Wyoming checkerboard as an experimental design”. American Economic Review 92 (5), 1588–1593.
Kurmann, André and Elmar Mertens (2014). “Stock prices, news, and economic fluctuations: comment”. American Economic Review 104 (4), 1439–1445.
Kydland, Finn E. and Edward C. Prescott (1982). “Time to build and aggregate fluctuations”. Econometrica 50 (6), 1345–1370.
Lackman, Conway L. (1982). “A note on the stability limitations in ‘A stable price adjustment process’”. Quarterly Journal of Economics 97 (3), 541–542.
List, John A., Charles D. Bailey, Patricia J. Euzent, and Thomas L. Martin (2001). “Academic economists behaving badly? A survey on three areas of unethical behavior”. Economic Inquiry 39 (1), 162–170.
Lu, Susan Feng, Ginger Zhe Jin, Brian Uzzi, and Benjamin Jones (2013). “The retraction penalty: evidence from the Web of Science”. Scientific Reports 3 (3146).
Lucas, Robert E. (1976). “Econometric policy evaluation: a critique”. Carnegie-Rochester Conference Series on Public Policy 1 (1), 19–46.
Markowetz, Florian (2015). “Five selfish reasons to work reproducibly”. Genome Biology 16 (274).
McConnell, Steve (2004). Code complete. 2nd ed.
Microsoft Press.
McCullough, B. D. and H. D. Vinod (1999). “The numerical reliability of econometric software”. Journal of Economic Literature 37, 633–665.
Miguel, Edward and Michael Kremer (2004). “Worms: identifying impacts on education and health in the presence of treatment externalities”. Econometrica 72 (1), 159–217.
Muth, John F. (1961). “Rational expectations and the theory of price movements”. Econometrica 29 (3), 315–335.
Necker, Sarah (2014). “Scientific misbehavior in economics”. Research Policy 43 (10), 1747–1759.
Pfeifer, Johannes (2013a). “A guide to specifying observation equations for the estimation of DSGE models”. Mimeo. University of Mannheim.
Pfeifer, Johannes (2013b). “An introduction to graphs in Dynare”. Mimeo. University of Mannheim.
Prabhu, Prakash et al. (2011). “A survey of the practice of computational science”. Proceedings of the 24th ACM/IEEE Conference on High Performance Computing, Networking, Storage and Analysis. SC ’11. Seattle, Washington: ACM, 19:1–19:12.
Prescott, Edward C. (1986). “Theory ahead of business cycle measurement”. Federal Reserve Bank of Minneapolis Quarterly Review 10 (4), 9–21.
Primiceri, Giorgio E. (2005). “Time varying structural vector autoregressions and monetary policy”. Review of Economic Studies 72 (3), 821–852.
QJE (1984). “Notice to our readers”. Quarterly Journal of Economics 99 (2), 383–384.
Reinhart, Carmen M. and Kenneth S. Rogoff (2010). “Growth in a time of debt”. American Economic Review 100 (2), 573–578.
Schmitt-Grohé, Stephanie and Martín Uribe (2004). “Solving dynamic general equilibrium models using a second-order approximation to the policy function”. Journal of Economic Dynamics and Control 28 (4), 755–775.
Smets, Frank and Rafael Wouters (2007). “Shocks and frictions in US business cycles: a Bayesian DSGE approach”. American Economic Review 97 (3), 586–606.
Soergel, David A. W. (2015).
“Rampant software errors may undermine scientific results”. F1000Research 3 (303).
Steen, R. Grant, Arturo Casadevall, and Ferric C. Fang (2013). “Why has the number of scientific retractions increased?” PLOS ONE 8 (7), e68397.
Wieland, Volker, Tobias Cwik, Gernot J. Müller, Sebastian Schmidt, and Maik Wolters (2012). “A new comparative approach to macroeconomic modeling and policy analysis”. Journal of Economic Behavior & Organization 83, 523–541.
Wilson, Greg et al. (2014). “Best practices for scientific computing”. PLoS Biology 12 (1), e1001745.
Zimmermann, Christian (2015). “On the need for a replication journal”. FRB St. Louis Paper 2015-016.