Most code you write for research is a waste of time (because someone else has already done it)
It is very rare in criminology research to use quantitative data that no one has ever used before. It does happen - primarily with surveys that the authors wrote themselves (including modified versions of past surveys), but also when researchers collect administrative data directly, such as through data sharing agreements or FOIA requests. The majority of papers, however, rely on data that other people have used before - often thousands of times before, since many criminology papers use publicly available administrative data from government agencies such as the FBI, BJS, or local police departments.
So how does the data cleaning part of papers that use these common datasets go? Precisely the same as if the data were utterly original and no one had touched it before. Let’s use the FBI’s Uniform Crime Reporting Program Data as an example. If you wanted to use this data, you’d download it from openICPSR and write some code to clean it. You’d grab just the rows and columns you want, aggregate them to a particular unit, check for outliers, make some graphs, and so on for whatever your use case. Now imagine I wanted to do the same study but with a tiny variation. My data cleaning process would be nearly identical to yours. I’d spend all that time working with the data even though I’m mostly redoing work you already did.
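To make the duplication concrete, here is a minimal Python sketch of the kind of per-project cleaning described above - filtering to the offenses of interest, aggregating to a unit of analysis, and flagging outliers. The field names (`agency`, `year`, `offense`, `count`) are hypothetical stand-ins, not actual UCR column names.

```python
from collections import defaultdict

def clean_and_aggregate(rows, keep_offenses):
    """Keep only the offenses of interest, then aggregate counts to agency-year."""
    totals = defaultdict(int)
    for row in rows:
        if row["offense"] in keep_offenses:
            totals[(row["agency"], row["year"])] += row["count"]
    return dict(totals)

def flag_outliers(totals, threshold=2):
    """Flag units whose count exceeds `threshold` times the mean count."""
    if not totals:
        return []
    mean = sum(totals.values()) / len(totals)
    return [key for key, n in totals.items() if n > threshold * mean]

# Hypothetical raw records; real UCR data has far more rows and columns.
raw = [
    {"agency": "PD-A", "year": 2019, "offense": "murder", "count": 3},
    {"agency": "PD-A", "year": 2019, "offense": "theft", "count": 140},
    {"agency": "PD-B", "year": 2019, "offense": "murder", "count": 2},
]
murders = clean_and_aggregate(raw, keep_offenses={"murder"})
print(murders)  # {('PD-A', 2019): 3, ('PD-B', 2019): 2}
```

Every researcher using the same file writes some version of this - with their own offense lists, units, and outlier rules - which is exactly the duplicated effort at issue.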
There are two major issues with people writing their own code when using data that others have used. First, it’s an enormous waste of time. Most criminology research is not that complicated. It’s a lot of correlational studies analyzing a small number of datasets - often only a single dataset. Even studies that try to measure causal relationships are often relatively simple to think about and analyze, using difference-in-differences approaches or natural experiments. What takes a lot of time on the analysis side (so excluding writing up the paper) is working with the data. Crime data is messy. It’s usually rather miserable to work with. Criminology papers tend to take forever, at least in part, because the available data is hard to work with - even for the most experienced programmer - and the people working with it are often not good at programming. So researchers - or, to be more specific, their graduate students - spend enormous amounts of time wrestling with unfriendly data. Take the time it takes to clean up data for one project and multiply it by the number of studies that use that data: that is the amount of time wasted by people redoing work that others have already done. It is likely dozens or hundreds of hours - if not more - per person each year.
And who is doing this work? For many studies, it’s graduate students, the least qualified group to be doing research. This leads to the second major issue: having many people write the same code makes it more likely that someone will make a mistake. If ten people write code to clean the same dataset, there is a good chance that someone will make a mistake. If 100 people do, it’s guaranteed that someone will. Considering how many researchers are not formally trained in programming, there are probably a lot of programming mistakes in papers.
The most obvious solution is to require - or at least encourage - people to make their code public when they publish a paper. If your code were public in the example above, I could have saved a lot of time by reusing it instead of writing my own. I do not think this is a useful solution to this problem. It could even be worse than people writing their own code every time. I believe this for two reasons: 1) most code is written for a specific use case, so it won’t actually solve the problem of messy data, and 2) many researchers are bad programmers, and repeating bad code is dangerous. This solution is also practically more complex than the one I propose below, primarily due to the perverse incentives to hoard data cleaning code.
The first issue is a big one: for most research you’ll only use a tiny part of the data. For example, if you’re using police crime data, you’ll usually use only a small number of crimes - often murder or index crimes - and ignore the rest. So you’ll probably look at only a relatively small number of rows and columns from the entire dataset. I’ll be fine if I copy your code and look at the same rows and columns. But as soon as I’m interested in other parts of the data, your code may not apply. It’s a bit like extrapolating your regression results beyond the bounds of your data. If, for example, the variables you looked at had no missing data at all, your code wouldn’t do anything to handle missingness. If I rely on your code but use it for variables with missing data, I’d have a problem, because nothing in the code handles that missing data. Given how complex crime data is - and how many data issues there are - you need to carefully check every variable to ensure there aren’t any issues.
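A hedged illustration of that extrapolation problem: code written for a variable with no missing values gives silently wrong answers when reused on one that has them. The `-1` missing-data sentinel here is invented, though sentinel codes like this are common in administrative data.

```python
# "counts" stands in for any monthly crime count column;
# -1 is a hypothetical code meaning "not reported".

def yearly_total_naive(counts):
    """Correct only if every month is reported -- the original author's case."""
    return sum(counts)

def yearly_total_safe(counts, missing=-1):
    """Drop the sentinel values; reused code needs checks the original never did."""
    reported = [c for c in counts if c != missing]
    return sum(reported) if reported else None

complete = [4, 2, 5, 3, 1, 0, 2, 3, 4, 1, 2, 5]
with_gaps = [4, 2, -1, 3, -1, 0, 2, 3, 4, 1, 2, 5]

print(yearly_total_naive(with_gaps))  # 24 -- silently wrong: the sentinels subtract 2
print(yearly_total_safe(with_gaps))   # 26
```

The naive version runs without error on the gappy data, which is what makes this failure mode so easy to miss when borrowing someone else’s code.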
The second issue is that most researchers are not programmers. They’re not formally trained to write code and tend to write little beyond the bare necessity for their specific project. Their code is often slower, much longer, and more fragile than necessary. By fragile I mean that it’s prone to breaking if something is changed - for example, hard coding variable names instead of writing functions. This makes the code hard to reuse because you’ll need to change every one of those hard-coded variables. Miss even one - and I’ve certainly done this myself when reusing old code - and you’ll cause a major error, such as selecting the wrong variable. This kind of code is also the most prone to mistakes. Even the most experienced programmers - ones with the time to write careful code and have other people review it - make errors, so code written by inexperienced people who aren’t focused on writing good code is even more likely to contain one. And with that error, the results of the paper will be wrong.
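A small sketch of the fragility described above, with invented variable names. The first style hard codes the variable in several places; the second makes it a parameter, so there is exactly one place to change when the code is reused.

```python
data = {"murder": [3, 5, 2], "robbery": [10, 12, 9]}

# Fragile style: the column name appears in every line. Repointing this
# script at "robbery" means editing each occurrence -- miss one and you
# silently mix statistics from two different variables.
murder_mean = sum(data["murder"]) / len(data["murder"])
murder_max = max(data["murder"])  # forget to update this line and it's wrong

# Reusable style: the column is a parameter, changed in a single place.
def summarize(data, column):
    """Return basic summary statistics for one column."""
    values = data[column]
    return {"mean": sum(values) / len(values), "max": max(values)}

print(summarize(data, "robbery"))
```

Neither version is sophisticated; the point is that the function version fails loudly (a `KeyError` on a bad column name) instead of quietly computing the wrong thing.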
The danger in using code released by paper authors as a solution to replicating data cleaning work is that an error in the original code will likely be repeated by everyone who uses it. Making the code public is a way to find these errors, but I doubt that people will check. Some will, but those people are likely to be fairly experienced programmers anyway, and so will have the least need for other people’s code in the first place. And there’s a lot of evidence that people often do not do their due diligence when checking the data they’re using or the papers they’re citing. Consider, for example, how many people use county-level UCR data two decades after Maltz and Targonski warned against it, or how often retracted papers continue to be cited. So I have little faith that people will comprehensively examine the code they find to ensure that there are no errors.
Currently, there is little work being done by academics to improve the accessibility of data beyond the relatively small amount of replication data that is released. There are exceptions to this, such as Matt Ashby’s Crime Open Database or my work with most FBI data. The main reason for this, I believe, is that criminology strongly disincentivizes data work and data sharing. The currency of criminology is published papers. You get an academic job by writing many papers - better quality is a plus, but quantity seems more important.
Good data work - which includes carefully checking every variable and the overall data and producing documentation about the data - takes a long time. And while it’s certainly helpful for research, this work doesn’t directly translate to papers, so it is infrequently done. Given how important papers are, taking the time to do this work means a substantial amount of time not working on papers. And when people release the data after all of this work, anyone who uses it gets to shortcut that work. If you spend six months on some data and release it publicly, I can use it immediately, saving me six months of work - six months I could have spent on papers that get me a job (or tenure, grants, prestige, etc.). So if someone does work on data for a while, they have a substantial career incentive to keep it to themselves.
The data work that academics do is largely done on their own time, out of a desire to improve crime data or as ancillary to a paper. And usually, they’re the only ones working on that particular dataset. If they stop for whatever reason, that data likely will not be updated again. As an example, in 2020, David Abrams, a law professor at Penn, started a website tracking near real-time crime data from over two dozen large US cities. This site allowed users to visualize crime data from many different cities and see how it changed over time. That project led to a published paper, and while the site is still available, it has not been updated since 2021.
This isn’t to say that there is no good crime data work being done, just that it’s rarely academics doing it. It’s mostly think tanks (which do research, but I consider them outside of academia), news organizations, advocacy groups, and the government doing this work. To name but a few, there’s the excellent Prison Population Forecaster from the Urban Institute and a dashboard tracking near real-time murder data from many large cities by Jeff Asher and Ben Horwitz. There’s also an explosion of data journalism, allowing users to interact with crime data without any programming knowledge (see here for one example). Government agencies also provide useful crime data tools like the (flawed) Crime Data Explorer from the FBI and the National Crime Victimization Data Dashboard from the BJS. These projects are great and do a tremendous amount of good for our access to and understanding of crime data. Academics have much to offer in understanding the data and in conducting research assessing how good the data is. By limiting ourselves primarily to conducting research instead of improving crime data - which is, in essence, improving the infrastructure for all future research - we’re doing a disservice to our fellow academics and to the field of crime research at large.
In some cases, academics receive funding to focus on good crime data collection. An instructive case of what happens when money and focus are spent on improving crime data is the Jail Data Initiative, part of the Public Safety Lab at NYU. This project is “scraping daily county jail rosters and criminal case records in over 1,000 counties” and has quickly created the best source of jail data that I’m aware of. There’s very little data available about jails, particularly about individual people in jails, so this is filling a massive gap. Importantly, this is not work that (as far as I’m aware) requires extensive data-sharing agreements with agencies or special relationships with people who have the data. Everything they’re scraping is public. It just requires a tremendous amount of work to collect this data, especially to collect it every day, given how frequently websites change (and thus how frequently the scraping breaks). Their work has been used in several reports (see the link earlier for a list of them) and academic articles. This project is funded by private donors, including Arnold Ventures, the Chan Zuckerberg Initiative, and Pew Charitable Trusts, so funders are willing to pay for projects that improve criminal justice data quality and access.
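As a rough sketch of what roster scraping involves - the HTML structure and fields below are invented; real county roster pages differ wildly, which is a big part of why this work is so labor intensive - here is a minimal standard-library parser that pulls rows out of a roster table. A real pipeline would fetch each page daily and cope with the inevitable layout changes.

```python
from html.parser import HTMLParser

class RosterParser(HTMLParser):
    """Collect the text of every table cell, grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Stand-in for a page that would be fetched daily; the layout is hypothetical.
page = """
<table>
  <tr><td>Doe, Jane</td><td>2023-01-04</td></tr>
  <tr><td>Roe, Richard</td><td>2023-01-05</td></tr>
</table>
"""
parser = RosterParser()
parser.feed(page)
print(parser.rows)  # [['Doe, Jane', '2023-01-04'], ['Roe, Richard', '2023-01-05']]
```

The parsing itself is trivial; the hard part the Jail Data Initiative solves is doing this across 1,000+ counties, every day, as each site changes under them.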
Work like the Jail Data Initiative is an excellent example of my proposal. My solution to the problem of insufficient data and the massive waste of effort from people cleaning the same data: make crime data easier to use. This sounds pretty dumb (or at least way too obvious), so let me explain. Let’s make the data cleaner at the start - that is, from the point where researchers download it. Then everyone using it will have less work to do and can focus their attention on the question they actually have. Cleaner data means users need to write less code, which means fewer opportunities to make mistakes.
Cleaning up data is a good idea, but how do I propose doing it? Public and private funders should pay people to make crime data easier to use. This includes essential things like converting data into different formats, such as from ASCII files to R or Stata file formats. Lots of valuable data is hidden in tables on government websites or in PDFs, so it can also include scraping that data and making it available in a machine-readable format. If this proposal sounds a lot like what ICPSR or NACJD does, well, you’re right. I’m partly advocating for more of what they do - and for going beyond it.
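For a sense of what that format-conversion work looks like, here is a minimal sketch of parsing one fixed-width ASCII record of the kind many government files use, where the column layout lives only in a codebook. The layout below is invented for illustration, not taken from any real codebook.

```python
# Hypothetical codebook layout: (field name, start, end), 1-indexed and inclusive,
# which is how codebooks typically describe fixed-width columns.
LAYOUT = [("state", 1, 2), ("agency", 3, 9), ("murders", 10, 13)]

def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of named fields."""
    return {name: line[start - 1:end].strip() for name, start, end in layout}

record = "01PD00123  17"
print(parse_line(record))  # {'state': '01', 'agency': 'PD00123', 'murders': '17'}
```

Doing this once, carefully, against the full codebook - and releasing the result in R, Stata, and CSV formats - spares every subsequent researcher from re-deriving the column positions themselves.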
They make administrative datasets publicly available while providing a helpful codebook detailing broad information about the data and each variable. What I’m proposing goes further than what ICPSR/NACJD does. They are what I consider a neutral party for data: they get the data and release it without judgment or guidance on using it well. This is a good thing, since we’d have far fewer datasets available without their hard work. What I think we also need is a sort of positive party for data. This involves collecting data from hard-to-use, often disparate places, as the Jail Data Initiative does. Lots of crime data is like this, scattered across many different agencies’ websites or PDFs, and could be gathered - but doing so requires (often ample) resources.
Then what I think is genuinely key to this is to create extensive documentation not just about what is in the data but about how to use the data. To give an example from my book on using NIBRS data, the data tells you the hour in which each crime happened. In most codebooks, this would be the end of the description of that variable. But NIBRS has a weird feature where midnight-1 am and noon-1 pm are, by far, the most common hours for a crime to happen. This is, in my opinion as someone knowledgeable about the data (and crime data in general), not genuine but merely a quirk of the data. This is something that each user could potentially discover on their own but is far more likely to be found by someone who sets out to understand the data and its issues. While having people focus on how to use the data will illuminate many of these problems, they won’t be able to understand every part of the data, especially for big datasets with hundreds or thousands of variables. And they may spend less time on any individual variable than someone whose research uses only a tiny part of the data. Many researchers approach crime data like someone using a flashlight in a dark room. They’ll be able to see the specific spots they shine the light on (the parts of the data they use in the study) but have large dark areas everywhere else. Someone who focuses on the data to understand it entirely, not for a specific project, is instead standing in the room waiting for their eyes to adjust to the darkness. It takes much longer, and they’ll likely bump into furniture a lot more than with a flashlight (learning about problems in the data by encountering them), but they’ll end up with a much better understanding of the whole room (the data).
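Documentation like this can also ship as code. Here is a hedged sketch of turning that NIBRS quirk into a reusable check - treating the midnight and noon spikes as likely placeholders is my judgment call for illustration, not an official coding rule, and the hour coding (0 = midnight-1 am, 12 = noon-1 pm) is assumed.

```python
from collections import Counter

# Hours with implausible spikes, likely "unknown time" placeholders.
SUSPECT_HOURS = {0, 12}

def hour_distribution(hours):
    """Count incidents per hour -- the spikes at 0 and 12 show up immediately."""
    return Counter(hours)

def recode_suspect_hours(hours):
    """Replace likely-placeholder hours with None so they aren't mistaken
    for genuine midnight or noon incidents."""
    return [None if h in SUSPECT_HOURS else h for h in hours]

hours = [0, 0, 0, 3, 12, 12, 17, 22]
print(recode_suspect_hours(hours))  # [None, None, None, 3, None, None, 17, 22]
```

A data-focused maintainer can write, test, and document a check like this once; a flashlight-wielding researcher who never tabulates the hour variable may never notice the problem at all.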
For funders wanting to increase research on a topic, paying to improve data quality and access is probably the most cost-effective option. It won’t work for private data or data requiring special access, but for public yet hard-to-use data it can be a game changer. Most researchers won’t spend dozens (or hundreds, or thousands) of hours trying to get data. But they will download data someone else has already spent that time on and use it in their research. Most funding I’ve seen (from private and public donors) is research-specific, funding researchers to study a particular topic. Funding for data would do the same, but indirectly: it would make research more accessible and let the strong paper-dominated incentives of criminology do the rest. Throwing money at this problem, however, is not the same as fixing it. And there are some dangers.
A likely negative outcome of funding data projects is that people not qualified to work on the data will be funded, and any mistake they make will be compounded in all research that uses the data. Since I’ve spent much of this post arguing that normal academics writing code isn’t a good solution, partially because they are not programmers, funding should go to people who focus on a particular dataset and who are good programmers. These don’t necessarily have to be the same person; a team is often better. As an added check, all code used to gather, clean, or otherwise interact with the data should also be made publicly available. I want to be clear that I’m not proposing hiring a bunch of people with no understanding of the data. Crime data is tricky because of the nuances of what it means and how it’s created, not just because the dataset is hard to use. The work should be done by criminologists (broadly defined, including people from other fields such as econ, political science, and others) with programming skills and experience with the particular data.
Another necessary tradeoff of making the data easier to use is that when you lower the bar to use the data, you also lower the bar of who will use it. This can lead to people conducting research with the data who don’t spend the time to understand it and its inherent problems. See, for example, all of the papers that use county-level UCR data. Crime data is very tricky. There are important nuances to understanding and believing the data that require knowing how, for example, police operate (and thus input the data) and how the criminal justice system works. Make the data too easy to use, and you invite people to use it poorly. Given (my hope) that criminology data affects policies, flawed research can affect people’s lives in meaningful ways. And I use the word “invite” deliberately. People who make this data available and easier to use do bear some responsibility for what it is used for, even though they’ll likely have no control over how it’s used.
To reiterate: most crime data is hard to use due to essential nuances in the data and complex data formats. Most researchers are not programmers, so they have limited ability to use complex data. A solution is for public and private funders to pay for projects that improve crime data access and usability, including producing documentation on how to use the data correctly. It’s just like the old saying: if you build a machine-readable format of easy-to-use data with extensive documentation, they will come and research it.
Qualitative data is frequently collected by the researchers writing the paper, so in those cases it would be original data.
And I work with crime data for a living.
Except for their professors who don’t know how to program at all.
To be clear, I think that the code used in a paper should be released when the paper is published. The main benefit I see in this is code checking: trying to find errors in the code, and correcting the paper if errors are found.
And that’s mostly just the final dataset used in analyses, not the original data.
Another possible addition to this list is the Criminal Justice Administrative Records System but I exclude it because, based on my understanding of the data, nothing is publicly available.
The website was built by Kathy Qian and David Feng.
See, for example, this paper on how accurate arrest data is in NIBRS.
This project is not simply for data collection. It also involves conducting research on the data.
The fundamental solution to improving criminology’s use of data is always to make criminologists better programmers. So this shouldn’t be an excuse to not do that. But even excellent programmers would benefit from using crime data that’s been cleaned to be easy to use.
In the academic sense, not the fun sense of party.
Likely a placeholder for when the actual hour is unknown.
Like leading to policies that make it more likely they are victimized.