The case for fast-track replication papers
A recent research note by Benjamin Comer and Jason Ingram - a graduate student and an associate professor at Sam Houston State University, respectively - examined whether three sources of public data on police killings are consistent with each other. They looked at Fatal Encounters, Mapping Police Violence, and the Washington Post’s data on the topic from 2015 through the end of 2019. Overall, they found that the three sources were very similar. This is exactly the kind of research we need a lot more of - research that looks at the data we use and assesses how good it is, or how consistently different sources answer the same question. But there’s a problem. This paper is already outdated. It was outdated the moment it was submitted and is even more outdated now.
The note’s study period ends December 31st, 2019, but the data is updated daily, meaning we’re now more than two years out of date. If I used any of these datasets today, my knowledge of whether results would change depending on the source rests entirely on two-year-old data.
A two-year difference may seem small - and for a lot of criminology topics, we’re relying on research from much longer ago to guide us - but there’s already some evidence that the quality of the Washington Post’s data is much lower now than during the period Comer and Ingram studied. Ian Adams, an Assistant Professor at the University of South Carolina, found that the share of their data with an unknown victim race (i.e., the race of the person killed by the police) has increased substantially in the past couple of years.

Now, there’s an easy solution for this. If I wanted to check data past the study period, I could email the authors and ask for the code they used, then rerun that code on the data outside the studied period. But anyone who’s tried to replicate research - especially in criminology - knows this is a nearly impossible solution. I think it would work in this case, since I emailed Comer asking for replication data before writing this, and he sent it to me. That is a rarity. Nearly all the people I’ve emailed asking for replication code and data have ignored the email. And editors tend to support researchers not sharing data - I’ve never seen or experienced an editor doing otherwise. Even if I do get the code, there’s the problem of using it. The authors of this paper used SPSS, a language I don’t use. I’d have to buy SPSS or use a demo version to run their code. This could be an expensive and time-consuming procedure depending on the language used (or languages, as some papers use multiple ones), the code’s quality, and the user’s experience.
Let’s stop briefly to be explicit about the problem before going into the solution. Many papers are outdated as soon as they are published. This isn’t to say that the papers themselves are bad - just that, by the necessity of the publication process, they cut off their study period at a certain point, and outcomes after the end of that period are still of interest. This doesn’t apply to all studies, but it does to many: policy analyses, data validation studies, and descriptive research (assuming data is consistently available), among others. Importantly, I am talking about quantitative research, where the authors wrote code to clean and analyze their data. When more data is available, they can (to oversimplify a bit) just rerun their old code on the new data and get new results. It’s that easy (sort of).
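To make “rerun the old code on the new data” concrete, here’s a minimal sketch of what that can look like when an analysis is written as a function of its study period. Everything here is hypothetical - the file, the column names, and the dates aren’t from the Comer and Ingram note or any real dataset - but it shows the shape of the idea:

```python
# Minimal sketch: an analysis parameterized by its study period, so an
# update is just a new end date. All file and column names here
# ("killings.csv", "date", "source", "id") are hypothetical placeholders,
# not taken from the Comer and Ingram note or any real dataset.
import pandas as pd

def counts_by_source(path: str, end_date: str) -> pd.DataFrame:
    """Count recorded police killings per source per year, up to end_date."""
    df = pd.read_csv(path, parse_dates=["date"])
    df = df[df["date"] <= end_date]
    df["year"] = df["date"].dt.year
    # One row per year, one column per data source; similar counts across
    # columns suggest the sources are consistent with each other.
    return df.pivot_table(index="year", columns="source",
                          values="id", aggfunc="count")

# The original paper's study period:
original = counts_by_source("killings.csv", "2019-12-31")
# The fast-tracked update, two years later, is one changed argument:
update = counts_by_source("killings.csv", "2021-12-31")
```

All the intellectual work lives in the function; the update is one changed argument. Now let’s talk about some of the incentives behind why we write papers, since that’s relevant to my proposal to solve this problem.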
Why do academics write papers? It’s not for scientific advancement; it’s not for a personal understanding of the world; it’s not even because we think it’s our job as researchers. It’s for career advancement. Of course, this is an oversimplification, and all of the above reasons do play a part, but the primary reason, I believe, is career advancement. You will not get a job without papers if you’re a grad student. You will not get tenure without papers if you’re an assistant professor. If you’re an associate or full professor, the same reasons apply, but for your grad students who do all the work. If you’re an emeritus professor, move your stuff out of your office and stop hogging the space. Again, I’m oversimplifying, but you need new and impactful papers to advance your career - or at least ones that differ from an identical replication of your old studies. This is why there are few replication papers and even fewer “the same paper but with a longer post-period” papers. Science would improve by having replication and identical-replication papers, but publishing papers is so resource intensive that academics need to prioritize the ones that help them the most - and that’s not replication papers.
There are a couple of current solutions to this problem, but I consider all of them ultimately unsatisfactory. The first is that the original (or different) authors write another paper rerunning the old code on new data. This is rare, but I’ll give an example of when it happens - and why it’s not as good as my solution. The other current answer is for the original authors to rerun their old analysis but not try to get it published, for example, by writing a blog post or Twitter thread. This has several advantages: it’s quick to do, you reach a larger and more diverse audience than an academic paper does, and it avoids all the hurdles of an academic paper. But it’s also essentially worthless to your career. Blog posts, Twitter threads, and other social media outlets are not helpful to any grad student or early career professor. They may get mainstream media attention, which would be good, but otherwise they don’t go on one’s CV and won’t be cited. In other words, academia will ignore them when it comes to career advancement.
Ideally, we’d have a balance between the speed of social media posting and the credibility of academic journals. When a journal publishes a paper that could be replicated with more recent data, that paper should automatically be allowed to publish an update after some time has passed (say, when the next annual batch of data comes out, or six months after publication; the exact period would depend on the data studied). These fast-tracked updates shouldn’t be full papers. They should be short notes that can copy the original article’s data, methods, and results text and update it to correspond to the new data. All the work of the lit review, the discussion, and fighting through peer review has already been done in the original paper, so the authors should simply be able to rerun the old code and post new results. This system could be gamed, so rules should be in place. For example, authors who deliberately publish only partial results so they can wait and post updates should be rejected. And in some cases it’s not as simple as rerunning old code: if there are changes in the data or the study context, the authors must note them and ensure the results aren’t driven by anything new (such as an additional intervention after the original study period). And while the original authors should get priority in doing these updates, if they don’t do so soon enough, someone else should be allowed to - and fast-tracked too - as long as they use the original code rather than writing their own, which may be wrong. These paper updates would keep the career incentives for researchers while also providing the speed that’s important for research on current data.
So it’s a pretty simple idea: update a paper once more data comes out, and give the authors credit for it with a new, citable publication. You may think this is the same as the authors writing a new paper, just a bit shorter. So let me give an example of a paper that does this kind of update, and why my method is preferable. In 2021, the Journal of Quantitative Criminology published the first experimental study of the effect of outdoor lighting on crime. This study, by Aaron Chalfin, Benjamin Hansen, Jason Lerner, and Lucie Parker, examined very bright outdoor lights randomly assigned to housing developments in New York City and found a considerable decrease in crime in the treated areas relative to control areas over six months ending in early 2016. They studied only six months because that was when the lights were supposed to be removed, so they wrote up the paper for that period. But the lights were never removed, meaning much more data was available, allowing for a longer-term analysis of the effect of these lights.
So some of the original authors wrote a new paper, this time looking at three years of post-period data. This study, written by David Mitre-Becerril, Sarah Tahamont, Jason Lerner, and Aaron Chalfin, and published in Criminology & Public Policy, came out last week and is basically the same paper - even down to much of the text being identical in both papers. It’s great that this paper was published, as criminology needs long-term follow-up studies, and I believe it does an excellent job of providing one. But consider the amount of work required to publish it compared to what should have been necessary. This is a full article, not a research note, and runs 27 pages including everything. The authors had to write everything again, even though it was about the same experiment. It underwent peer review, meaning an editor and one to three reviewers read it and had the authors make changes. And this work is multiplied by the number of journals it was submitted to, if multiple journals reviewed it.
In my proposal, the Journal of Quantitative Criminology would invite the authors to submit an update once more data was available, and it would fast-track that update to publication without all the work of submitting a new paper. It would be published quickly. A significant time burden for papers is peer review - or more specifically, people not doing their reviews quickly. If we trust the peer-review process to have done a good job on the original paper, we should trust it for the same paper with extra data. We don’t need to subject each update to the whims of new (usually capricious) reviewers. If there’s a problem with the data or methods, that problem applies to the original paper as well, and people should take issue with the original paper.
To best reduce crime, we need research that is as up-to-date as possible and that tells us both the short- and long-term effects of policies. We can get some of this information through my proposal: allowing authors to publish updates - counted as separate, citable publications, such as research notes - that rerun their original study on more recent data.
Footnotes:

1. Of course, I should use all available sources to show how results change.
2. All this is assuming, of course, that the code was correct.