How bad is real estate crime data? Really bad
Recently, a popular tweet shed light on the fact that real estate agents are unable to disclose crime data about the neighborhood you're interested in, primarily due to racism concerns.
The article linked in the tweet from the U.S. News & World Report said the following:
Crime statistics and details about schools can be interpreted as references to race – a violation of the Fair Housing Act – which is why Relman says a real estate agent won’t tell you about crime in a neighborhood.
For example, a real estate agent that said, “That’s a pretty high-crime neighborhood,” would be a race-based violation of a federally protected class. The idea is, I guess, that crime is something that non-Whites do, so saying a place has lots of crime is akin to saying that this neighborhood has many non-White people. And talking about the racial breakdown of a neighborhood is illegal? That does not sound right to me, but the point of this post isn’t to discuss the law. Instead, it’s to look at the crime data that these real estate websites used to provide, and which some still do.
In the wake of George Floyd’s murder, numerous leading real estate websites decided to remove the crime data previously available on their platforms (see here for an article about this change). They argue that the data is flawed and (most importantly) that crime data can be racially biased and cause racial bias in where people choose to buy homes. It appears they didn’t take these issues into account prior to using the data, and their application of it is one of the most flawed I’ve ever seen.
A number of press releases from these companies reveal a concerning lack of understanding when it comes to crime data. Here’s what Christian Taubman, Redfin’s “Chief Growth Officer,” said was one alternative they considered for better crime data:
To get around the gaps with reported crimes, the main other data source we considered was the National Crime Victimization Survey from the Bureau of Justice Statistics. By virtue of being a survey, this has the advantage of being able to capture both officially reported and unreported crimes. However, also by virtue of being a survey, if there’s racial bias in respondents’ answers this will get reflected directly in the data. And there are troubling signs of this: in the 2019 survey, people reporting crimes were more likely to describe their offender as young, male, and Black than would be expected given the representation of those groups in the population.
There are some fundamental issues here, such as assuming that a group’s share of offenders differing from its share of the population is caused by bias. It may be, but that’s a more complicated question. But here’s the main issue: as the name suggests, the National Crime Victimization Survey is intended to provide nationwide estimates. You cannot get neighborhood-level victimization data from this survey. They never used this data, but it shows how little these companies care about measuring crime correctly. And they demonstrated the same carelessness in the crime data they did use.
So, what were the crime measures they used?
From what I've observed – and I had the chance to review some of these sites before they removed their crime data – these companies utilize FBI Uniform Crime Reporting (UCR) data and somehow transform it into neighborhood-level crime rates. Here’s the problem: it is impossible to get neighborhood-level data from UCR. UCR data is available only at the police agency level. You cannot disaggregate it down to a smaller geographic unit. This should not be a complex concept, but apparently, it is.
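To make the limitation concrete, here is a minimal sketch of what a UCR-style record contains. The agency name and numbers are invented; the point is that crime counts are attached to a whole agency, so the finest-grained rate you can compute is agency-wide.

```python
# Hypothetical UCR-style record: counts are reported per agency,
# not per address, block, or Census tract. Agency and numbers are made up.
ucr_record = {
    "agency": "Springfield Police Department",
    "population_covered": 150_000,
    "violent_crimes": 600,
}

# The only rate the data supports is one number for the entire jurisdiction.
rate_per_100k = (
    ucr_record["violent_crimes"] / ucr_record["population_covered"] * 100_000
)
print(rate_per_100k)  # 400.0 for the whole agency; nothing finer exists
```

There is simply no field in the record that says where within the jurisdiction each crime occurred, which is why any neighborhood-level number derived from it has to be manufactured.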
In this post, our focus will be on the methodology adopted by Neighborhood Scout, a company that continues to supply crime data to a wide range of users including real estate agents, property managers, investors, and individual home buyers. They not only offer crime rates for each neighborhood but benchmark them against all other neighborhoods in the U.S. to show how each ranks. They even include your chances of becoming a victim of a violent crime. If this sounds impossible, well, it is.
Their main crime description page is available here. And here’s an example of what their crime report looks like.
Here's how they characterize their product (and it's important to note, they're effectively selling their research findings).
NeighborhoodScout Crime Risk Reports provide an instant, objective assessment of property and violent crime risks and rates for every U.S. address and neighborhood. We offer seamless national coverage and up to 90% accuracy.
This is a striking assertion, especially since they are explicitly providing crime data at both the neighborhood and address level, a level of detail that far surpasses what can be derived from UCR data. And they do use UCR data, as they say on this page.
We start by collecting the raw crime data from all 18,000 law enforcement agencies in the United States. We then assign these reported crimes from each of these law enforcement agencies to the specific local communities the agency covers, and hence in which community the crimes have occurred, using a custom relational database that our team built from the ground up.
The 18k agencies are clearly UCR data, and they say elsewhere that it is FBI data. They convert this to neighborhood information, which, as they say here, is at the Census tract level, the Census’s rough approximation of a neighborhood. This is important because the Census provides a great deal of demographic information at the tract level, which is likely how they go from agency to neighborhood level.
So how do they get their neighborhood-level data? From algorithms, of course!
First, they take the UCR data and, “using a custom relational database that our team built from the ground up,” assign the crime to the “specific local communities the agency covers.” Considering overlapping jurisdiction, this is no easy task, especially for special agencies like schools or Sheriff’s Offices. Once they do this, here’s how they describe their next steps:
Once we have this modified set, we build upon it, producing sub-zip code crime hazard data with risk indices for violent crime, property crime, motor vehicle theft, crime density, and more. We then develop algorithms to statistically estimate the incidences of both violent and property crimes for each neighborhood in America.
The resultant formulae produce numbers of crimes and crime rates for neighborhoods with upwards of 90% accuracy. We deploy 80 proprietary formulas to increase the accuracy of our predictions, and apply them based on city or town characteristics to produce the best model fit in each case. This method produces the best crime risk information for every neighborhood in America.
Still with me? I'm having difficulty myself. How do they generate “sub-zip code crime hazard data”? Moreover, what exactly does “crime hazard data” mean? The “80 proprietary formulas” are the algorithms they use to move from agency to neighborhood-level data. While I may not be a statistician, I can't recall ever coming across a paper that employs 80 formulas to accomplish any task. There's a wealth of information to parse in these two paragraphs, and the process they're describing is quite obscure.
What do their algorithms do? How do they get neighborhood-level data before using any of their magical algorithms? How do they know it’s 90% accurate? Accurate compared to what? It raises the question, how can 'city or town characteristics' have an impact on data that's already at the agency level, which is often a local city police agency?
Here's my interpretation of what's occurring. They take the UCR data, apply some kind of algorithm to handle missing values, and claim to achieve data with “90% accuracy”. Given that decades of criminology research have not yet yielded an effective method for handling missing data, I'm skeptical that this group has managed to succeed where dedicated researchers have struggled. But for the sake of argument, let's assume they have effectively addressed the issue of missing data.
They then take the UCR agency-level crime data and divide it by the population of the corresponding Census tract, which in essence represents a neighborhood. For instance, if a tract constitutes 10% of the city’s population, it would theoretically receive 10% of the crime data. However, it's not quite as simple as that. Based on their assertion that the data is informed by city characteristics, I suspect they may be applying a form of weighting to adjust population figures according to demographic attributes that correlate with crime. Thus, a tract with 10% of the population might be assigned a higher crime figure if its residents exhibit characteristics linked with crime, and a lower figure if those characteristics are absent.
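The allocation scheme I'm guessing at can be sketched in a few lines. Everything here is hypothetical: the crime count, the tract populations, and especially the “risk weights” standing in for whatever demographic adjustment they may be applying. The one thing the sketch makes clear is that the output is an artifact of the weights, not a measurement of where crime occurred.

```python
# Sketch of the suspected method: spread an agency's total UCR crime count
# across Census tracts in proportion to each tract's weighted population
# share. All numbers and weights below are invented for illustration.

agency_crimes = 1_000  # total crimes the agency reported to UCR

tracts = {
    # tract_id: (population, hypothetical "risk weight" from tract traits)
    "tract_A": (10_000, 1.0),
    "tract_B": (10_000, 1.5),  # traits assumed "correlated with crime"
    "tract_C": (20_000, 0.8),
}

# Weighted population for each tract, then each tract's share of the total
weighted = {t: pop * w for t, (pop, w) in tracts.items()}
total_weight = sum(weighted.values())
allocated = {t: agency_crimes * w / total_weight for t, w in weighted.items()}

# tract_A and tract_B have identical populations, but tract_B is assigned
# more crimes purely because of its weight.
print({t: round(n, 1) for t, n in allocated.items()})
```

Note that the allocation always sums back to the agency total by construction, so it can look internally consistent no matter how wrong the neighborhood-level split is.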
This also gets more complicated when you consider how hard it is to match a Census tract to the police agencies with jurisdiction there. They account for this by including “crimes that truly occur within any city or town, not just crimes reported by a single municipal agency.” But it’s a much more complicated task than they give it credit for. Census tracts are unique within a county in a state but can span multiple cities at once. Police agency jurisdictions are also complex. A Sheriff’s Office, for example, may technically be an agency for the entire county but, in practice, does not work in areas where a local police agency already operates. So your local Sheriff will probably stay in unincorporated areas or in cities that contract with the Sheriff for policing. And that ignores special agency types such as university police or transportation system police, whose jurisdictions are often small, weird, and overlapping with larger agencies.
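A toy illustration of the jurisdiction problem, with invented names: a single tract can sit under several overlapping agencies at once, and each of those agencies reports one UCR total for its entire jurisdiction.

```python
# Hypothetical example: one Census tract, several overlapping jurisdictions.
# All agency names are made up for illustration.
tract_agencies = {
    "tract_42": [
        "City Police Department",    # primary municipal agency
        "County Sheriff's Office",   # county-wide jurisdiction on paper
        "State University Police",   # campus overlaps part of the tract
        "Transit Authority Police",  # rail line runs through it
    ],
}

# Each agency reports a single UCR total for its whole jurisdiction, so
# there is no principled way to decide what share of each agency's crimes
# "belongs" to tract_42.
print(len(tract_agencies["tract_42"]))  # 4 agencies, 4 unsplittable totals
```

Any tract-level number therefore bakes in an arbitrary decision about how to divide four jurisdiction-wide totals among the places they overlap.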
This is the method that I am assuming they follow, though admittedly, they do not give enough information to figure out precisely what they do. Maybe I’m completely off-base here. Maybe their methods are great. Their algorithms may genuinely be a very accurate representation of crime at the neighborhood level. Achieving this feat would be a remarkable accomplishment. One that, as far as I’m aware, no academic has managed to do.
While criminologists dispute how to aggregate data up to the county level, they’ve allegedly solved the issue of de-aggregating data to a microscopic geographic level. Indeed, down to the address! If so, let them prove it. Release their “algorithms” and their raw data to researchers so we can compare them against other data to see how accurate their results are.
Neighborhood Scout certainly charges enough for their results. Ten monthly reports will cost you $39, while 200 reports are $120. That’s quite a chunk of change for results that are likely far from reflecting the reality of crime in that neighborhood. And when selling a product that claims to have near-perfect data (“90% accuracy”!) to individuals and organizations who make decisions based on this data, they need to be correct.
As I noted above, Neighborhood Scout is only one example of what these companies do. And it is very consistent with how the companies that no longer show these data used to do it. This is probably how many people ever see crime data: as wildly inaccurate representations of what crime looks like in their neighborhoods. Plenty of research shows that people do not understand crime trends, either in their communities or nationally (see here for one example). How much of this misperception about crime is due to getting information from misleading sites like this? Probably quite a bit. How many people made a decision based on this likely wildly inaccurate data? Again, probably quite a few.
And that’s saying something.
Redfin is a large real estate brokerage company.
Some companies also use local crime data with the actual location of each crime, but for the most part, it is UCR data.