Why do so many papers think crime data ends in 2016?
First published in October, 2022
Please note that in the few weeks between when I wrote this and when I posted it, NACJD posted NIBRS data through 2020.
There’s a curious phenomenon in some recent papers: they think that FBI data ended in 2016. Always 2016. Sometimes they use a single year, and sometimes they use many years, but they always end with 2016. Why 2016?
The simple answer is that 2016 is the last year available from the National Archive of Criminal Justice Data (NACJD), which hosts a lot of crime datasets. You can think of them as the Netflix of crime data since they offer a collection of crime datasets for download. Among their collection are the FBI’s Uniform Crime Reporting (UCR) Program Data (which is really several datasets) and the National Incident-Based Reporting System (NIBRS) data1. In both cases, the most recent data the FBI has released is 2020, but NACJD only has updated data through 2016.2 What NACJD does to these data is take the hard-to-use data from the FBI and convert it into more usable formats like R, Stata, or SPSS.3 For NIBRS data, which comes in several different files (called “segments”), each dedicated to one part of the data (e.g., victims, offenders, property, etc.), NACJD also has what they call “Extract Files,” which combine data from all of the segments to create four files, each for a different unit of analysis: victim, offender, arrestee, and incident.4 Data users can do this all themselves, but it simplifies using NIBRS data by removing that step in the cleaning process.5 The downside is that it limits records to only three per group (e.g., victims, offenders, arrestees, etc.). For an excellent review of how this affects research, please see this paper by Drs. Lantz and Wenger.6
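To make the three-record cap concrete, here is a toy Python sketch of keeping only the first three records per incident. The field names (`incident_id`, `victim_seq`) and the function `cap_records` are hypothetical, and this illustrates the idea only; it is not NACJD's actual procedure.

```python
from collections import defaultdict

def cap_records(rows, key="incident_id", limit=3):
    """Keep at most `limit` rows per value of `key`, in their original order."""
    kept = []
    counts = defaultdict(int)
    for row in rows:
        if counts[row[key]] < limit:
            kept.append(row)
            counts[row[key]] += 1
    return kept

# Made-up data: incident "A" has five victims, incident "B" has one.
victims = [{"incident_id": "A", "victim_seq": i} for i in range(1, 6)]
victims.append({"incident_id": "B", "victim_seq": 1})

capped = cap_records(victims)
dropped = len(victims) - len(capped)  # victims lost to the cap
```

In this made-up example, the fourth and fifth victims of incident "A" never reach the analysis file, which is the kind of loss to keep in mind when studying incidents with many victims or offenders.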
So case closed? 2016 is the last year NACJD has, and NACJD is the principal repository for data, so it makes sense to only use data through 2016. I don’t buy it. That is a feeble argument to me. NACJD is a great resource, but they’re not the only source of FBI data. They’re not even the primary source of FBI data. The FBI is. On the FBI’s official website, you can download the raw UCR and NIBRS data (which they call “Master Files”) and use data up through the most recent year - each September/October, the FBI releases data for the previous calendar year. It’s a bit of a pain to convert these master files into machine-readable format (e.g., R, Stata), but it is certainly doable. I did so in the summer after my first year of grad school. And now I have data in machine-readable formats for every year of UCR and NIBRS data available to download publicly. So it’s certainly doable.
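To give a sense of what that conversion involves, here is a minimal Python sketch of parsing a fixed-width text file, where each field sits at a set character position on every line. The layout below (field names and positions) is invented for illustration and is not the FBI's actual layout; the real positions have to come from the documentation that accompanies the files.

```python
# Hypothetical layout: (field name, start, end), 0-based and end-exclusive.
# These positions are made up and do NOT match the real Master Files.
LAYOUT = [
    ("segment", 0, 2),
    ("state", 2, 4),
    ("ori", 4, 13),
    ("incident_date", 13, 21),
]

def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip() for name, start, end in layout}

def parse_fixed_width(lines, layout=LAYOUT):
    """Parse every non-blank line using the given layout."""
    return [parse_line(line, layout) for line in lines if line.strip()]

# Two made-up lines that follow the hypothetical layout above.
sample = [
    "01PAPA012340020200115",
    "01NYNY987650020201231",
]
records = parse_fixed_width(sample)
```

The parsing itself is simple. Most of the real work is transcribing the layout correctly and checking the parsed values against the documentation.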
FBI data is available through 2020 for all of their datasets, either through the FBI’s website or my data posted on openICPSR. What is the excuse for people to rely on NACJD data that ended in 2016 - almost six years ago at the time of this writing? Do they not trust my data? That’s fair. I’m pretty suspicious about people’s code, so I might not trust my work if I didn’t know better. But certainly, they’d trust the FBI’s data, which is readily available for download.7 Maybe these authors don’t have the programming skills to convert the data to a usable format. That may be true, but then it calls into question all the code they write to clean and analyze the data they got from NACJD. Maybe they don’t know that these other sources exist. But that can’t be right because researchers would check to see if the data they’re using is the most recent available.
I don’t want to just hypothesize without any evidence.8 The easy way to see why people used only data through 2016 is to read the data section of their papers and see their explanation. Except in rare cases where the timing of the post-period matters - say, when you’re studying a policy and a significant confounder happens in 2017 - it does not make sense to exclude several years of available data. Researchers who do this need to explain why they made this decision. If they don’t, that’s a failure on their part and on the part of the reviewers and editors who approved the paper. Is it a major failure? No, but as I’ll note below, it’s easily fixed and shouldn’t be allowed.
My plan to see what researchers say has a few restrictions. There’s not much difference in UCR data between 2016 and more recent years. Of course, having more data is better, but most agencies reported data in 2016, and most agencies still report data (or did through 2020, the last year the FBI collected UCR data).9 NIBRS, on the other hand, has far fewer agencies reporting, meaning that each additional year of data is hugely important. So for this post, I’ll only look at papers using NIBRS. And these papers must use NIBRS for a primary analysis, not merely as a source for some data used to make a point in the intro or discussion.
Publishing papers can take years, so I’ll limit the papers to only very recent ones. They need to be published in 2022, but I’ll accept ones where the “first publication” date is 2021 as long as the edition date is 2022. I’ll also include working papers posted in 2022 since this means they’re new research. For every paper, I look for the first publication date both on the website and in the PDF to exclude papers published much earlier than the edition date.10 This excludes, for example, my paper that has an issue date of 2022 but was first published in 2020 and written a couple years before that.11 This isn’t foolproof as some journals don’t have first published dates, so this method may include papers older than 2021 or 2022. And some papers use data other than NIBRS, and if those data are only as recent as 2016, then using 2016 would be reasonable. If so, the authors will say that in the data section.
This isn’t supposed to be a comprehensive literature review. Using old data is so weird that authors would mention it, so I don’t think I need to read many papers to find the answer. So I opened Google Scholar, limited papers to those published in 2022, and searched for “nibrs 2016.” I looked at papers from the first ten pages of the results.12 Google Scholar includes ten papers per page, so I looked at 100 papers. I excluded any with a first published date before 2021, as well as those that didn’t use NIBRS data for an analysis (i.e., used it only for background information) or that used years more recent than 2016. I only included papers written in English as that’s the only language I can read. I also excluded undergraduate or graduate theses/dissertations.
I found 20 papers that fit these criteria.13 In Table 1, I present some descriptive information about these papers, including the title, author(s), journal, year published, and the reason the authors gave for using NIBRS data that ends in 2016. In addition to the reason, I include the quote from the paper for that reason. I also note whether the authors cited a source for the data since this may show whether they are using NACJD data. The table is ordered alphabetically by the first author’s last name. Table 1 is available at the bottom of this post.
Before we look at the results, it’s important to note that 20 papers is quite a lot when it comes to NIBRS papers published in 2022 (including those “first published” in 2021). NIBRS is not a very common dataset - though it is increasingly being used. UCR data is still far more common and likely will be for several years at least. So it is consequential that 20 papers end with 2016 NIBRS data (or earlier) even though they were all published very recently.
Now let’s turn to the table. There were 19 articles published in 18 journals and one working paper, all written by 47 unique authors.14 These papers are primarily in criminology and economics journals but also include some sociology journals like Race & Justice. Most of these papers used multiple years of data ending in 2016, but not all. Three used only data from 2016, while another three used only data from 2015. Two papers used multiple years of data ending in 2015, while another had data ending in 2008. Twelve of the papers (60%) did not say their source for the data; all of the rest either cited NACJD or ICPSR (the umbrella organization NACJD is part of).
But the point of this post is to see why these papers didn’t use the most recent data available - relying only on the years NACJD has posted. Perhaps not surprisingly, most papers never stated their reason. Fourteen papers (70%) did not explain why they ended the data in 2016. Another five papers (25%) incorrectly say 2016 is the most recent year of data available. The final paper states, "2016 was selected as the end year to capture post-legalization trends, as trends had stabilized by this point in time.”15 Several of the papers even talk about how more years of data would be helpful to their analysis, mainly as they study rare events, so limiting data to 2016 greatly affects their sample size. They understand the benefit of having more data but nonetheless choose not to use it.
So we have 20 papers - not an insignificant share of all recent NIBRS papers - that end with data that is years out of date, and nearly all give either no reason or an incorrect reason for this decision. And let's be clear. Using old data is a decision. It may be a justifiable decision as resources are limited, and using available, cleaned data is easier than doing that work yourself. But science shouldn't be easy. It should be correct. The point of research is to depict the world accurately, which means the world as it is today.16 There's already a lag since FBI data takes over a year to come out, and the publication process is lengthy, so we shouldn't be doing anything to make data older than it can be. I don't write this post to say everyone should use the data I cleaned and posted. By all means, don't use it! But papers should use the most recent data available, even if that requires extra work such as cleaning the FBI's Master Files.
There are two fundamental problems here: 1) using old data and 2) incorrectly saying that 2016 is the last year available. Using old data is bad because all research that uses it is now less relevant to current policy and understanding than if the authors included more recent years of data. Saying that 2016 is the last year of data is wrong because research should not have factual errors. Both issues can be solved pretty simply: reviewers and editors must ensure authors use the most recent data available and provide evidence when saying that a particular year is the latest. This also means asking people who know the data to serve as reviewers. Admittedly, this is an unrealistic solution since it requires peer review to improve, which is unlikely. So a blanket rule that all papers need to explain precisely why they used particular years of data, or else be desk rejected, would also be necessary. This won’t solve the problem since people can still be wrong (e.g., “2016 is the last year of data”) or make up some justification, but this, alongside having reviewers who know the data, will help.
What about cases where the paper was written when 2016 was the last year available? Some papers (especially econ papers) can take years to publish. So by the time they finally get accepted, their data may be old. Should they have to rerun their results using the newest years of data? Yes, absolutely. I know how painful it is to rerun data and change all the tables and results. But unless there’s a good reason not to - such as a major policy change or a confounder that affects your analysis - these papers should use the most recent data. One reason (though probably less likely than just not wanting to rerun everything) that authors may not do this is that they have results they like (i.e., significant results) and don’t want to risk them changing by adding more years of data. To this, I say too bad. Our job as scientists is to measure how the world works, regardless of the results. In cases where the review process has taken so long that the accepted paper uses data that are now old, I think that authors should still have to rerun everything with the newest data included, but without sending the revised paper back to peer review. And peer review should be about the data and methods, not the results.
You may think that this is a reasonably trivial post. I shouldn’t be worried about papers that exclude a few years of recent data. But this is, I think, something that’s pretty emblematic of the problems facing a lot of criminology (and related fields) research. There’s so much corner-cutting that goes on in the research process. This may not matter for any given paper - and of the 20 papers in the table, a few may have a good reason to end in 2016 - but it causes substantial problems in the aggregate. For example, consider that peer-reviewed academic papers are not supposed to have factual errors. When there are these errors, it not only misinforms readers but demonstrates that being right is not a priority to those authors, reviewers, editors, or the field. So when five of the 20 papers I included incorrectly say that 2016 is the latest year of data available, that will harm the field's reputation. Even before getting into how the data was used or the analyses, factual errors in just describing the data are a problem. In the aggregate, this harms the field and will affect all research as our field will be considered one with low standards.
In the particular, having these kinds of basic factual errors or unexplained strange decisions calls into question the integrity of each individual article. Maybe the rest of the article is perfect, but so much research is based on blind trust. Readers almost never have access to your data, code, or thought process, and very few papers are detailed enough to understand precisely what is being done with the data or in the analysis. Readers must trust that you did everything right in your data cleaning and analysis. So when these errors or unexplained decisions occur, why should readers trust that everything else is done right? This is especially true in criminology, where outright fraud by people like Eric Stewart is met with senior members of the field closing ranks and only reluctantly giving him a slap on the wrist.
Not using the latest years of data may be minor, especially if you disagree with me on the above paragraphs. But enough little things build to major things. Where do we take a stand? This is an easy place to do so. The data is available. Including it would improve these papers. It is more work to clean the FBI Master Files - though not to use my data - but it’s not that difficult, and something being hard is no excuse for not doing it. This is an easy decision, and when we can’t even stand firm on this point, when will we?
1. NIBRS is technically part of UCR but I consider them different datasets entirely as NIBRS data can (and does) replace almost everything collected in the other UCR datasets.
2. The FBI should release 2021 data soon.
3. The original data from the FBI are in what is called fixed-width ASCII files, which are text files that require processing. So NACJD’s process takes this step out of working with the data.
4. Technically the original file you get directly from the FBI is a single file you need to split into multiple files.
5. This step is not required for all analyses.
6. Their primary findings are that limiting records to only the first three doesn’t make much difference.
7. The FBI even stopped making you wait for them to mail you a DVD of the data several years ago.
8. What am I, a theorist?
9. Starting January 1, 2021, the FBI no longer collected UCR data and only collected NIBRS data.
10. Please let me know if I miss anything, and I’ll correct this post.
11. Incidentally, that paper did use the NACJD’s NIBRS Extract files ending in 2016.
12. As may be expected, most of the studies I found were on the first several pages. Papers on later pages generally didn’t actually use NIBRS data.
13. To anyone with an issue with me identifying papers that meet my criteria, I have two fairly critical scientific concepts to introduce you to: 1) being specific and 2) providing evidence for your claims.
14. Crime & Delinquency had two papers.
15. I do not understand that reasoning.
16. The exception is research that studies historical events, which often have lessons for the modern day. But I believe this is only a tiny share of research studies.