NIBRS is a looming problem for criminology, not just for crime data
Last year marked the first year the FBI stopped collecting Uniform Crime Reporting (UCR) data and only took National Incident-Based Reporting System (NIBRS) data from police agencies. Now, 16 months after they started collecting data, we still have another 6 or 7 months to wait until they release the data. The FBI typically releases NIBRS in October or November of the year following the data collection. NIBRS data has been around since 1991, so criminologists have had over 30 years to prepare for using NIBRS data instead of relying on UCR data. I don’t think they used this time well, and we’ll soon be entering a painful era for criminology in handling this transition. One that I do not believe many are ready for or even anticipate.
NIBRS data is far more detailed than older UCR data and gives pretty detailed information about every victim, offender, and crime incident reported to the police. I have an entire book on this data but, put simply, an issue with the transition is that many agencies do not report NIBRS data - only about half of them did in 2020. While this number is growing, it’s still far too low for nationally representative numbers, meaning that we’ll have substantial dark spots in our understanding of current crime trends. Other people have already discussed what the issues will mean regarding data access, primarily focusing on the large share of agencies not reporting NIBRS data (see here for one example). In this post, however, I’ll discuss a different problem. In particular, I’ll talk about the issues this transition will introduce to the field regarding research quality, accuracy, and who can do it. There are two significant issues that I think criminology will encounter: 1) NIBRS is far more complex, so it will require strong programming skills, and 2) it is far larger, so it will require more powerful computing resources.
Let’s start with some definitions. When I say in the title that it’s a looming problem for criminology, I’m using “criminology” as a very loose term. I mean the actual people and organizations that research crime. This includes grad students and professors in criminology departments, researchers at think tanks and advocacy groups, reporters, and anyone else who creates the product of crime knowledge. This also includes people in other fields that research crime, such as economists, political scientists, sociologists, etc. Most of the language I’ll use focuses on academia since that’s what I’m most familiar with. Still, I believe it all applies to research conducted by those not affiliated with a university.
My basic argument in this post is that while using UCR data is relatively trivial and accessible to many people, NIBRS data requires far more decisions, programming skills, and computing power.
Issues with NIBRS data are also a hallmark of broader issues with using large and complex datasets, which are increasingly common in criminology. This means that we may see the use of this data concentrated among the small share of people and organizations who have the skills or resources to use it easily. And the second part is critical; people who have enough resources, such as a strong server, can largely ignore many of the technical issues that come with NIBRS, and I’ll give an example of this below. Some people and groups will undoubtedly use NIBRS data - and some already do - but this will not be equal. News organizations and think tanks have the money to hire people who understand NIBRS and have the skills to use it. Many university programs do not have the funds to hire dedicated data scientists or build strong data servers, and they don’t teach their students how to program. These programs, and the people in them, will suffer due to the NIBRS transition.

Before I explain more of the problems NIBRS will cause for criminology, let me give you a metaphor and an example of what the UCR-to-NIBRS transition means in terms of how much more complex NIBRS data is.
UCR is driving along a one-lane dirt road in the middle of nowhere. There are few places you can go, but you probably won’t get lost. You could probably make the drive on a donkey. NIBRS is driving along a major interstate near a population center. It’s a much more complicated drive with more chances for accidents but also many more places to go. Your donkey isn’t going to cut it. Many criminologists are still riding a donkey.
In terms of data work, they can probably do most simple tasks using code that technically does work but is likely far longer and more prone to issues than if an actual programmer wrote it.

To demonstrate the complexities that NIBRS introduces, let’s look at a simple example: counting the number of aggravated assaults with a gun an agency experienced in a year. For simplicity, let’s assume that the agency reported data for all 12 months of the year we’re looking at. Using UCR data, you would load the Offenses Known and Clearances by Arrest dataset (the “crime” data of UCR), select the row for that agency, and then select the aggravated assault with a gun column.
If you’re using the monthly data, you’d select the 12 columns for each month of aggravated assaults with a gun and then sum them up. It’s hard to make mistakes here.

Getting the number of aggravated assaults with a gun takes many more steps in NIBRS than in UCR, both in terms of programming steps and potentially subjective decisions. With UCR data, you could only measure gun assaults one way: the number of incidents in that month. Using NIBRS, you can measure it by victims of gun assaults, offenders of gun assaults, total gun assault offenses, and the standard number of incidents of gun assaults. You could also break it down by the characteristics of the victim, the offender, or the incident (e.g., location, time, day of the week). But let’s stick to the number of gun assault incidents to stay consistent with UCR data.
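Before moving on, here is roughly what that UCR approach looks like in R. This is a minimal sketch: the file name, the `ori` agency identifier, and the monthly column names are hypothetical stand-ins, not the actual variable names in the data.

```r
# A rough sketch of the UCR approach (hypothetical file and column names).
library(dplyr)

ucr <- readRDS("offenses_known_yearly_1960_2020.rds")

gun_assaults_2020 <- ucr %>%
  filter(ori == "XX0010000", year == 2020) %>%  # hypothetical agency ORI
  select(starts_with("act_gun_assault_")) %>%   # the 12 monthly count columns
  rowSums()                                     # sum the months for the year
```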
To get the same count from NIBRS, we need to load the Offense Segment and then select all the rows for the agency we are interested in. Then keep only rows where the offense committed is an aggravated assault. Violent crime offenses include columns for up to three weapons used. To keep only gun assaults from our total aggravated assault data, we need to check each column to see if it has one of the following possible gun types: handgun, firearm (type not stated), shotgon, and other firearm. Aggravated assaults with one of those weapons would be considered a gun assault. Since each incident can have multiple offenses, we’d need to subset the data to keep only one row per incident. Since we want yearly data, we’d then count the rows, as there’d be one row per gun assault incident in that year. So to get the same measurement of the number of incidents of gun assaults in a city in a year, we need far more steps for NIBRS than UCR. More steps mean more code and more chances to make mistakes. For example, I spelled “shotgun” wrong in the list of gun types above. If that were in my code to subset the data, it would not grab any offense where a shotgun was used.
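Here is a sketch of those same steps for NIBRS. As before, every file and column name (`incident_id`, `offense_code`, `weapon_1` through `weapon_3`) is a hypothetical stand-in; the point is the number of steps, each of which is a chance to make a mistake.

```r
# A rough sketch of the NIBRS approach (hypothetical file and column names).
library(dplyr)

gun_types <- c("handgun", "firearm (type not stated)",
               "shotgun", "other firearm")

offenses <- readRDS("nibrs_offense_segment_2020.rds")

gun_assault_incidents <- offenses %>%
  filter(ori == "XX0010000") %>%                 # hypothetical agency ORI
  filter(offense_code == "aggravated assault") %>%
  # Up to three weapon columns; a gun in any of them makes it a gun assault.
  filter(weapon_1 %in% gun_types |
         weapon_2 %in% gun_types |
         weapon_3 %in% gun_types) %>%
  distinct(incident_id) %>%                      # keep one row per incident
  nrow()                                         # one row per gun assault
```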
The first issue with NIBRS is that it is much more complex than UCR, meaning that users will need good programming skills. NIBRS data is split into six segments, each covering a different topic about a given crime, including the people involved and the characteristics of the incident. For a given analysis you may want to combine two or more segments. As shown in the example above, NIBRS is a complicated dataset with many interconnected pieces. And that example only used data from a single segment. Combining segments - such as if you want to know how many gun assaults involved victims who were seriously injured (requiring the Offense Segment and the Victim Segment) - just adds to the complexity. With UCR you were generally stuck with predefined crime categories and basic monthly counts. NIBRS actually makes you think. There’s the programming complexity, where you need to do a lot more in NIBRS to get to the same point as in UCR. You need to write much more code, and that means you’ll probably make more mistakes in your code, especially if you - as many academics do - write inefficient code. Given the large size of NIBRS, as discussed below, bad code - even if technically correct - will also take far longer to run than good code.

For the most part, the same basic code that is used for UCR will also be used for NIBRS. You’ll still be subsetting, cleaning, aggregating, graphing, and analyzing the data. There are a lot more lines of code you’ll need, but it’s mostly the same. The danger is that more code written means more chances to make a mistake. By mistakes I mean code that runs (i.e., does not cause an error message) but does not do what the user wants it to do. For example, if you mean to subset data from 2000-2020 but accidentally start in 2010, the code will run but you’ll have 10 fewer years than you wanted. The relationship between the amount of code and the chance of making a mistake is not linear. More code is increasingly more difficult to debug and to check. 10 lines of code can still introduce mistakes, but with so few lines it’s fairly easy to check the code yourself and to have someone else review it. Double it to 20 lines and it’s likely more than twice as difficult to check, though still relatively easy, since you have more opportunities to make a mistake, and each line likely builds on the past line, so a mistake anywhere will compound itself in later code. For example, subset your data wrong - such as the typo I added above for “shotgun” - and all of your results will be wrong. Considering that most NIBRS work that I’ve done had several hundred lines of code, and I am a fairly efficient programmer, this is likely many times more code than what people are used to if they’ve only worked with UCR or similarly constructed data.
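To make that kind of silent mistake concrete, here’s a toy version of the year-subset example (the `crime_data` data frame and `year` column are hypothetical):

```r
# Intended: keep 2000-2020. This runs without any error message,
# but the typo silently drops ten years of data.
crime_subset <- subset(crime_data, year >= 2010 & year <= 2020)  # meant 2000
```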
Even the most experienced programmer makes mistakes. Less experienced programmers are more likely to make mistakes and worse at catching them by rereading their code. To put this extremely politely, most academics - grad students and professors - are on the less experienced side of programming.
The code that I’ve seen is generally ad hoc, created for a specific use case (i.e., the current paper) and not written with much planning or concern for efficiency. It is code written expecting the world to end tomorrow; written to complete the task at hand as quickly as possible, even if that causes issues in the future. In other words, this is bad code.

While using NIBRS involves largely the same coding skills as using UCR, some additional skills are needed. The large file sizes require skills of their own, as discussed more below. Different segment files are also at different units of analysis, meaning that just deciding which unit to use gets more confusing, and this differs from UCR data, which in most cases gives you no choice. Consider, for example, a single incident that has two offenses, two victims, and four offenders. If you wanted to use all of this information but only at the incident level, you’d need to make rules (not write code here, just make decisions) about how to go from the more detailed units - the offense level, victim level, and offender level - to the incident level. This includes decisions like: which victim do you keep? Do you take the average of the victims’ characteristics? Is each victim counted? How do you deal with some victims having unknown values for certain traits but other victims having known values? This gets more complicated the more variables you include and the more segments you use. With complicated decisions comes more complicated code, as the sketch below shows. And, again, complexity is dangerous when it comes to introducing mistakes in the code.
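As a sketch of what those decisions look like in code, suppose we have a victim-level table (hypothetically named `victims`, with one row per victim) and want one row per incident. Every line of the `summarise()` call below encodes a judgment call, and none of them is obviously the right one:

```r
# Collapsing victim-level rows to incident-level (hypothetical names).
library(dplyr)

incidents <- victims %>%
  group_by(incident_id) %>%
  summarise(
    n_victims  = n(),                       # count every victim
    victim_age = mean(age, na.rm = TRUE),   # average? oldest? first listed?
    any_injury = any(injury == "serious",   # injured if ANY victim was;
                     na.rm = TRUE),         # na.rm quietly drops unknowns -
                                            # another decision you're making
    victim_sex = first(sex),                # arbitrary: the first row's value
    .groups    = "drop"
  )
```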
The second issue is that NIBRS data are far larger than UCR data, meaning that your standard old laptop probably isn’t good enough. A single file with all yearly UCR Offenses Known and Clearances by Arrest data that includes all agencies for all years available (1960-2020) is about 1.2GB as a Stata file (70MB as an R file).
In comparison, the 2020 Offense Segment of NIBRS is 6.9GB (90MB as an R file). Combining all of the NIBRS data that’s available (1991-2020) from all of the segments is about 400GB as Stata files and 7.5GB as an R file. 2020 had data from about half of agencies, and the ones that submitted data are generally smaller than the ones that didn’t. So as non-reporting agencies start sending data to NIBRS, we’ll likely see the file size balloon. I wouldn’t be surprised to see 20GB files once we approach 90% of agencies reporting. Since NIBRS data consists of several segments, a single year of data could potentially be greater than 100GB.

A reasonable response from you is that I’m exaggerating the file size since you’ll never load all the data into R (or Stata, or whatever software you use). You’ll probably do it one file at a time and then subset it so it’s just the rows and columns you want, which will be a much smaller file size than the original file. This is true. Using the example from above, if we used Stata for our gun assault analysis, we’d load the initial 6.9GB file and then quickly subset it to only gun assaults in a single agency, which is likely a tiny file of a few thousand rows maximum.
Here’s the issue with this. If you don’t have a strong computer, or if your school or company doesn’t provide a good server, then you need to take the time to write and debug code that handles your data without crashing the software by working on files that are too large. Computers are generally not great at handling very big files and then appropriately dealing with subsetting. In R, for example, if you open enough very large files, R will run out of memory and stop working (or become very slow) even if you delete or subset the data after each file is opened.
There are ways to handle these big datasets without using a strong computer (and I spent a lot of time in grad school learning how), but it is both time intensive and requires programming skills not often taught in normal lessons. If your school (or company) has enough resources, you never have to do this. With enough money, you can literally buy the time otherwise spent on handling big data and use it on something else.

As an example, for a recent project I wanted to create an R object with 205 million rows and about a dozen columns. The data was on registered voters, and I wanted to aggregate some variables to the Census tract level. There’s absolutely no reason I had to create this file other than that it was the easiest, laziest way to get the final aggregated data I wanted while doing the least amount of thinking and coding possible. I could have written code that did the same task in much smaller pieces, say three million rows at a time instead of all 205 million. This would have taken much longer to write and likely much longer for R to run all of the code. Instead I used Princeton’s server, chose to get a few thousand GB of RAM for the task, and had it done quickly. While an extreme example, this is the fork that many will experience when trying to use NIBRS. Some will be able to use their resources to ignore file size issues. Others will have to take the time to write code that can handle large files well - plus the far greater amount of time needed to become a good enough programmer to even get to that point.
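For what it’s worth, here is one way to do that kind of piece-by-piece processing in R, using readr’s `read_csv_chunked()`. This is a minimal sketch assuming the segment is stored as a CSV and reusing the hypothetical column names from the gun assault example above:

```r
# Process a file too big for memory a few million rows at a time.
# File and column names are hypothetical.
library(readr)
library(dplyr)

count_chunk <- function(chunk, pos) {
  chunk %>%
    filter(offense_code == "aggravated assault") %>%
    group_by(ori) %>%
    summarise(n = n_distinct(incident_id), .groups = "drop")
}

per_chunk <- read_csv_chunked(
  "nibrs_offense_segment_2020.csv",
  callback   = DataFrameCallback$new(count_chunk),
  chunk_size = 3e6                        # three million rows at a time
)

# Re-aggregate across chunks. Note the subtlety: an incident whose rows
# straddle a chunk boundary could be counted twice - exactly the kind of
# extra thinking that chunked code forces on you.
totals <- per_chunk %>%
  group_by(ori) %>%
  summarise(n = sum(n), .groups = "drop")
```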
So, in summary, the transition to NIBRS is going to cause problems for criminology because it’s bigger and more complex than the UCR data it is replacing. Using it effectively will require both better programming skills and more computing resources. More specifically, I have three proposals to alleviate these problems.
First, schools should require and provide programming and data analysis training for their grad students. This should be either a class (ideally multiple classes) in the department or in another department at that school. Everyone involved in research should get better at programming, but improving training for grad students is far more likely to happen than requiring professors and other researchers to do so. Grad school is also the ideal time to learn new skills, since that is the point of going, and these students are likely the ones doing the data work for their professors anyway. Giving them formal training in how to do their job will make them better at it. This proposal is not without its limitations. Adding a new class requires having professors in the department who can teach it well - and if they teach it poorly, that’s probably worse than not teaching it at all. This means more work for the professor to learn the language taught, or hiring someone who can. And adding more classes is more work for students, even if the class is taught in a different department. It may also mean reducing other classes offered or required.
This kind of training can also only go so far. Research is more than just data analysis. And everyone involved in research has more work to do than just handling data. Grad students have classes to attend, stipends to complain about, and graduate chairs to plot against. And so on. Professors have classes to teach, tweets to write, and department chairs to plot against. So my second proposal is that departments and non-academic organizations should hire dedicated data scientists to handle the data. This isn’t a very radical proposal. Many academic organizations have data scientists, though usually called something different and not always entirely focused on data, as part of research “labs”. Non-academic organizations do the same with employees whose sole or primary focus is working with the data.
My proposal, therefore, is to expand these roles and make them general interest rather than hired for a specific grant or project. Most research is not grant funded. And most research requires relatively little programming time for an experienced programmer. Consider, for example, how many papers use a single publicly available dataset for the entire paper.
While a grad student or professor may be able to handle this data, even being able to use it quickly, it’s likely that a dedicated programmer would be able to complete the task quicker. The tradeoff here is that it may lead to even more concentration of power if better-resourced organizations can hire data scientists while poorer ones continue to rely on grad students and other current employees. And while a trained programmer would likely make fewer programming mistakes, they’ll likely make more mistakes in handling and interpreting the data since they won’t understand its nuances. For example, the NIBRS gun assault example above requires understanding the data well, not just knowing how to program well.

Improving data skills for researchers - including grad students and non-academic researchers - or hiring dedicated data scientists will improve the efficiency and accuracy of research. Papers will be done quicker than they are now and will likely have fewer mistakes in the code, meaning that we can trust the results more. But this won’t solve the issue of having data too big to be handled on standard computers; data that require powerful servers. There’s no easy solution to this. Teaching students and other researchers how to handle big data will help, but still comes at the cost of better-resourced people and organizations getting to avoid this work by paying for more powerful servers. Organizations can always buy strong servers themselves, though this is unlikely to happen for organizations with limited funding. One potential solution, and my third proposal, is for organizations to band together to create a group server. This is already happening within universities, where multiple departments may share a school-wide server. If departments from different schools contribute funding, they could buy a powerful server that they’d be unlikely to afford on their own. Since these servers are cloud-based, there should be few problems with allowing access, at least on the technical side. Funding may be complicated if multiple organizations want to share the server but have different abilities to pay for it. This is a place where organizations such as the American Society of Criminology and the Academy of Criminal Justice Sciences - and more regional societies - can play an important role in organizing or funding this proposal.
Do I think these suggestions will actually get enacted? Mostly no. A simple test that I have for seeing the importance of a job in a certain group is to see the importance of the person doing it. Who handles the data in most academic research? The graduate or undergraduate RA. Professors tend to stick with writing the paper and doing the statistics.
When a job is given to the least powerful people, that is a job the culture of the group considers unimportant. It also suggests that the cost of any increased difficulty in doing the job - in working with NIBRS data instead of UCR data - will fall on these low-power people rather than the professors in charge. Without costs falling on the professors, there are unlikely to be any changes, especially as some of my proposals would be financially expensive.

So what do I think will be the likely outcome of the NIBRS transition for criminology? First, there will be an acceleration of the concentration of research among a small number of well-resourced people and groups. These groups can afford the powerful servers that let people largely ignore big data issues and can hire people who are able to focus on the data. To be clear, NIBRS is an important dataset but by no means the only one used in crime research. Not being able to use this data well or easily will hurt an organization, but likely not by very much. Still, crime data is getting increasingly enormous, and that’s a trend that is only going to continue, likely faster than it is now. Nearly all of the problems and solutions for dealing with NIBRS will apply to other very large datasets. Schools and other organizations that don’t have the skills or resources to handle these large datasets may not be at too much of a disadvantage now, but that gap will continue to grow. The NIBRS transition should be a wake-up call. Will it?
Please note that I use “trivial” relative to using NIBRS. Using UCR still requires skills, particularly in understanding the data’s limitations and the programming skills to manipulate it correctly (e.g., subset, aggregate, etc.). The simpler tasks in UCR, however, do not require any programming skills. This is good in the sense that it opens the data up to more people, but bad since these people won’t be equipped with the skills to handle NIBRS data.
There is already some degree of concentration happening as it seems like a small number of researchers use NIBRS regularly while many others use UCR instead.
A server here is simply a cloud-based computer running the software that you’re using; it is often far more powerful than a local computer because it can have hundreds or thousands of GB of RAM and more storage space than most personal computers.
Again, here I use criminologists to mean anyone who does crime research, even if they don’t consider themselves criminologists.
Aggravated assault with a gun is a UCR crime so it is included in the data, but you get no information other than how many occurred in that agency in that month.
Again for simplicity we’ll assume that any incident with a gun assault is a gun assault incident, so we’re not following the UCR Hierarchy Rule.
I consider inefficient code to be code that uses many more lines than necessary to do the same work - for example, copying and pasting code instead of writing a for loop or a function. This is by necessity a rather vague definition, as sometimes it is better to write longer code that’s easier to understand than very concise code.
I don’t have enough experience with people outside of academia to judge their skills. Though their work does seem more complex in terms of programming skills required.
This is after compressing the Stata file using the compress command which greatly reduces the file size.
This is also a compressed Stata file.
The documentation says that this is handled automatically, but in my experience that is incorrect, or at least only partially correct.
The money is almost certainly not the researcher’s own money but is the program or school’s money to host a server. You could - and should - buy as much RAM as possible for your personal computer but that still won’t compare with a real server.
A Google Scholar search for “monitoring the future criminology” brings up 185,000 results. This can’t say how many studies actually use this data, or use it together with other datasets, but it’s likely that many thousands of these results are for studies that use only this data.
This is, of course, not true in all scenarios but is common based on my observations and experience.
There are also few costs in criminology on writing good code or even on having major errors in the code. I will discuss this point in more detail in a later post.