Fishing for a date
Is teenage dating related to drug use? This is exactly the kind of question you’ll see studied in pretty much every criminology journal. But it’s a more complicated question than it initially seems. You can’t run an ethical experiment assigning teens to date, though maybe you can find some obscure natural experiment where dating “ability” varies unrelated to drug usage (though that’ll probably apply to only a very select group). And to my knowledge, no administrative dataset tracks whether (and how often) teenagers date, though you may find some data on teen drug usage. So what is a criminologist to do? Find a survey that measures dating and drug usage among teens and hope it has enough covariates to convince reviewers that there aren’t any major confounders. This is risky because you may have a small sample or miss key covariates, rendering results null when there may be a real relationship. So I offer an important solution: p-fishing.
P-fishing is running many tests and only (or mostly) reporting the significant ones. To be clear, this is an unethical practice that shouldn’t be - but is - done in criminology (and other fields). How common is this? It’s unclear, but a recent paper by Chin, Pickett, Vazire, and Holcombe, published in the Journal of Quantitative Criminology in 2021, suggests that it’s quite common. That paper, available here, surveyed 1,612 criminologists and asked how often they engaged in what the authors call “questionable research practices.” They didn’t ask about p-fishing specifically, but the closest question was whether respondents reported “a set of results as the complete set of analyses when other analyses were also conducted.” A stunning 53% of respondents said they had done this. Given that this study is a self-report survey and that it’s stupid to admit you’re a bad researcher, I think it’s a selective sample of the less-bad researchers, which makes these results conservative.
It’s easy to see why researchers would p-fish. It doesn’t feel as unethical as other things you could do, like p-hacking or making up results. You’re not doing anything wrong - your regressions are right, after all. You’re just reporting the best results. And no one cares about null results anyway. But the significant problem with p-fishing is that if you have to run a bunch of regressions to find the results you like, then the relationship is not strong (or not there at all), and you’re reporting that it is. Our job as researchers is to explain how the world works as best we can, and that requires honestly reporting results just as they are, without picking and choosing the ones we want to report.
So why is p-fishing so common in criminology and other fields? Because it is so easy to do and so hard to detect. In this post, I will demonstrate how easy it is to p-fish. And for good measure, I’ll also detail some of the ways people can oppose p-fishing - and how easy it is to ignore these measures.
My research question is whether more dating is related to more drug use. To answer it, I’ll use a huge nationally representative sample of high school students: Monitoring the Future. I’ll be using data from the 2019 survey. This is an enormously popular dataset for social scientists. A Google Scholar search for “monitoring the future survey” yielded 78,600 results for 2022 alone. If only 1% of those results used the data in their study, it’d still be among the most heavily used datasets ever. The survey is very long and has many questions, making it perfect for research - and for fishing.
The enormous breadth of the questions asked makes it perfect for p-fishing. Don’t get significant results from one question? Use a slightly different one. I’ll be looking at 29 different outcomes: every dichotomous question (i.e., did you use this drug in the given time period or not) for every drug type and every time period asked about.
My method is to use OLS with each binary outcome measure predicted by how often the respondent goes on dates, controlling for all the variables discussed below. There are six possible dating frequency values: never (the reference category), once a month, 2-3 times a month, once a week, 2-3 times a week, and 3+ times a week. So there are actually quite a bit more than 29 results, as each regression has five dating coefficients to compare to the reference group.
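In each regression, the dating variable enters as a set of indicator (dummy) columns, one per category above the reference group. A minimal sketch in pandas - the column name and exact category labels here are illustrative, not the actual Monitoring the Future variable names:

```python
import pandas as pd

# Hypothetical labels for the dating-frequency question -- illustrative,
# not the actual Monitoring the Future coding.
levels = ["Never", "Once a month", "2-3 times a month",
          "Once a week", "2-3 times a week", "3+ times a week"]
responses = pd.Categorical(
    ["Never", "Once a week", "3+ times a week", "2-3 times a month", "Never"],
    categories=levels,
)

# drop_first=True omits "Never", so each remaining indicator column is a
# comparison against that reference category.
dummies = pd.get_dummies(responses, prefix="dating", drop_first=True)
print(list(dummies.columns))
```

With `drop_first=True`, each coefficient on an indicator column is interpreted relative to the omitted “Never” group, which is where the extra per-regression comparisons come from.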
I threw in a battery of relevant control variables, which most other studies also include. Which ones? So many papers never explain precisely what variables they include, so that’s a rather rude question to ask me. But to prove how trustworthy my research is, I’ll tell you. I include respondent race and sex, their mother’s and father’s education levels, whether they live in an MSA, their Census region, and whether their mother and father live with them. I also include some variables that, at this point in an actual paper, I’d claim as my improvement on past research, since they’re vital variables missing from earlier studies: the number of siblings the respondent has, the respondent’s political party, how often they attend religious services, and their self-perceived intelligence.
Notice that I keep saying “respondent” instead of “teenager who happened to be at school when the survey was conducted and who may or may not have answered truthfully.” And will I mention that the outcome I’m interested in (drug usage) is probably correlated with attending school and taking the survey? Of course not. And my reviewers are likely people who used this data too, so they’re incentivized not to question its quality.
I present the results of these 29 regressions at the bottom of this page. In a regular p-fishing paper, I’d include only the most significant results, but I include all of them here to demonstrate an even more nefarious way to use them.
Before we go on, let’s talk about exactly how I did my analyses. With 29 separate regressions, you may be concerned (and a good reviewer may bring up) that I have a typo somewhere in all that code, meaning that one or more of my results is wrong. I wrote a for loop that goes through each of the 29 outcome columns and runs my regression (including controls) on it. 29 regressions, and it’s as simple as running a for loop. If I wanted to add another 100 outcomes, including them in the loop would be a trifle. I can run an arbitrarily large number of outcomes, limited only by what’s included in the data, with almost no extra work. The barrier to p-fishing is basically non-existent.
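That loop can be sketched in a few lines. Everything below is synthetic stand-in data with made-up column names, and a plain least-squares fit stands in for the full model - a sketch of the mechanics, not the actual analysis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the survey: a dating measure, a couple of
# controls, and 29 binary drug-use outcomes (all names are made up).
df = pd.DataFrame({
    "dating_freq": rng.integers(0, 6, n),  # 0 = never ... 5 = 3+ times a week
    "sex": rng.integers(0, 2, n),
    "age": rng.integers(14, 19, n),
})
outcome_cols = [f"drug_{i}" for i in range(29)]
for col in outcome_cols:
    df[col] = rng.integers(0, 2, n)

# Shared design matrix: intercept + dating dummies (never = reference) + controls.
X = pd.get_dummies(pd.Categorical(df["dating_freq"], categories=range(6)),
                   prefix="dating", drop_first=True).astype(float)
X[["sex", "age"]] = df[["sex", "age"]].astype(float)
X.insert(0, "const", 1.0)
Xm = X.to_numpy()

# The entire fishing loop: one least-squares fit per outcome column.
results = {}
for col in outcome_cols:
    beta, *_ = np.linalg.lstsq(Xm, df[col].to_numpy(dtype=float), rcond=None)
    results[col] = dict(zip(X.columns, beta))

print(len(results), "regressions from one loop")
```

Swapping in another 100 outcomes means extending `outcome_cols` by one line; that is the entire cost of fishing at scale.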
Looking at the regression results at the bottom of this page, you should be amazed at my incredibly publishable results. Most things are significant, and usually very significant (more * means more important!). I won’t even talk about effect sizes, because who cares about those in a criminology paper. I ran 29 different tests with the intent to p-fish, so you really shouldn’t trust these results (not to mention the flaws of just including whatever covariates were available, which may not be all of the relevant ones). Now let’s talk about how you, an astute reader or reviewer, would try to catch this problem with the paper - and how you’ll fail.
The main concern you’ll have is that I ran so many tests. I don’t think anyone reading a study with 29 tests would trust it unless it adjusts for multiple hypothesis tests. Even so, you may reject the study due to its blatant fishing attempt. Here’s the solution, and why I kept the non-significant results at the bottom of the page: I can turn this fishing expedition into multiple salami-sliced papers, each with a much smaller number of outcomes.
Conduct a small number of tests, say five or so, and no one will care. Conduct 29, and people will say something. Just split those 29 tests into six papers of roughly five outcomes each, and no one will ever know you’re p-fishing. Criminology loves salami slicing, so this is a great solution: more papers, and it avoids any questions about poor research practices. With the non-significant results, I can even sprinkle a non-significant outcome into every paper so people don’t accuse me of p-fishing. I’m showing non-significant results; I don’t just care about significance, because I’m a good researcher! How often does this happen? I can’t give a firm number, but it’s pretty standard. Just look at the CV of a random criminologist - especially someone who primarily uses secondary survey data like Monitoring the Future. Many have multiple papers on the same topic using the same data, which are nearly indistinguishable from each other.
I can even split the results into meaningful outcome groups so readers aren’t suspicious that I’m writing multiple nearly identical papers. If I say that I’m looking only at “light drugs,” “hard drugs,” or “psychedelic drug outcomes,” readers will think I’m making a reasonable choice to limit my outcomes to these categories. And when I cite each previous paper, I show not only that this research is important (other people are doing it!) but also that my data and methods are valid. And since it takes so long to publish papers, no one will know that I’m salami slicing for years - not until many or all of the papers are finally published.
Another concern you may have is that I ran so many tests that some are significant due to random chance. Realistically, few of you probably have this concern. Very few criminology papers ever correct for multiple hypothesis tests, and those that do generally still have significant results, making these corrections seem convenient (would they be reported if they made results non-significant?). So did I correct for the 29 tests? Of course not. Reviewers won’t ask, and neither will the editor. Why would I make my results less publishable for no benefit to me? I’m here to publish papers. And let’s not pretend that readers will care; they’ll still cite my paper and probably won’t even read it closely enough to know whether I made the corrections. But the above solution of writing papers with five or so outcomes each also solves this problem. With so few outcomes, few readers or researchers would even think the outcomes need to be corrected for multiple tests.
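To put numbers on the “random chance” worry: with 29 independent tests at α = 0.05, the chance of at least one false positive is 1 − 0.95^29, roughly 77%. And correcting is only a few lines of work. Here is a sketch of the Holm-Bonferroni step-down adjustment in plain Python, applied to made-up p-values (this is one standard correction, not necessarily the one any particular paper would use):

```python
# Family-wise error: with 29 independent tests at alpha = 0.05, the chance
# of at least one false positive is 1 - 0.95**29 -- roughly 77%.
fwer = 1 - 0.95 ** 29

def holm_bonferroni(pvals, alpha=0.05):
    """Return a reject/keep flag per p-value under the Holm step-down rule."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value against alpha / (m - rank).
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one fails, every larger p-value fails too
    return reject

# Made-up p-values: four are "significant" uncorrected; far fewer survive.
pvals = [0.0001, 0.004, 0.03, 0.04] + [0.2] * 25
uncorrected = sum(p < 0.05 for p in pvals)
corrected = sum(holm_bonferroni(pvals))
print(round(fwer, 2), uncorrected, corrected)
```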
Now, you may think that even with all this blatant p-fishing, my results don’t matter because this is just a correlational study. But it’s not. It is a causal study. I discuss this in more detail in this post, but here’s the general issue. I (and basically all researchers doing similar studies) will clearly say it’s not causal. I’ll say that I found a “relationship” or that the results trend together. Maybe I’ll even suggest that “future research” should try to measure causality. In other words, I’ll build a wall proving that I’m doing research right: staying within the bounds of what I measured. Then I’ll chip away at that wall in the rest of the paper.
I’ll start in the intro by discussing the direction I expect this relationship to go: I expect that increases in dating are related to increases in drug usage for XYZ reasons. I’ll build those causal pathways without claiming the relationship is genuinely causal. Of course, correlations don’t have policy implications, and I (and the reviewers and editors) need to have policy implications. So in the discussion section, I’ll treat my findings as causal: to reduce drug usage, we should do something about dating. Read the paper carefully, and you can see that it’s correlational. Skim it or skip the data and methods section - as many people seem to do - and it’s much more causal. And let’s not forget that academic papers do not stay in academia. If a policy maker or any non-researcher (media, advocate, etc.) reads it, they’ll probably not distinguish between correlation and causation, and the paper often doesn’t do much to help.
I’ve now shown an easy way to p-fish your way into several papers. And with how enormous some datasets are (Monitoring the Future has hundreds of variables), you can do this infinitely. Add new outcome variables to your loop, and you have years worth of publishable results in only a few minutes. Results, to be clear, that almost no one reading or reviewing your paper will know you fished for.
Before I get into what I think are potential solutions, let’s talk about a separate issue. I p-fished my way to these 29 results, and since I’m telling you that I did, you probably should be wary of them. But what if I had done this honestly and still reached the same results? The covariates I chose are reasonable, so an honest researcher would probably select the same (or similar) ones. And the outcomes I chose are also reasonable - drug use is, after all, a popular field of study. The problem (beyond the fundamental issues with these kinds of correlational cross-sectional studies) is that I ran 29 tests at once and will likely report only some of them.
Now imagine an honest researcher who takes my same approach, using the same data, but decides to split results into multiple papers based on a reasonable breakdown of outcomes (e.g., soft drugs, hard drugs, etc.). They group these outcomes a few at a time (say 4 or 5 per paper) and run results for each group as they write each paper. Check 4-5 outcomes, write a paper, repeat for the next paper. They never check all 29 at one time, and they report all results, so they are not p-fishing. By the end, however, they’ll have checked 29 outcomes just as I did, without ever correcting for multiple hypothesis tests. They will have honestly fallen into the trap of reporting significant results that are likely significant only by chance. This honest version is probably not too common, as Chin et al. (2021) found that using questionable research practices is unrelated to the respondent’s career stage or degree of methods training, and that people later in their careers support questionable research practices more than early-career people. Still, it is likely to occur, and it has the same result as my unethical approach here. So we need solutions that address both.
As I see it, the core problem is that researchers are too obsessed with statistical significance, regardless of the quality of how the relationship is measured. A statistically significant correlation is seen as publishable, while a non-significant one is not. P-fishing is two problems in one: running a bunch of tests, and reporting only some (usually just the significant) results. Let’s talk about the second problem first, because I think it’s easier to address (and because running a lot of tests is potentially not that bad if you properly correct for it).
First, let’s start with the light, easy stuff that probably won’t do anything but is a shame not to even attempt. Journals should make it explicit that authors must explain all the analyses they ran, even ones they don’t report. When submitting a paper, authors should assert that they have done this, and editors and reviewers must insist upon it during review. Published papers should also have to state that all results are discussed, even if some are not reported in full. This could include saying they ran a regression but later found data issues, so they aren’t reporting the results. Authors could always lie, or justify that the non-reported results will go into a different paper or are just exploratory and so don’t count. So I doubt this will do much, but it should be done nonetheless. Having official written journal policies against p-fishing (and other questionable research practices) is literally the least journals can do - and most still don’t do it.
We shouldn’t expect the researchers to be honest about whether they p-fish. The honest ones wouldn’t p-fish at all. The ones that do would be fine lying about it. So this may help reduce the issue of people who genuinely don’t know that you should report all results, but I doubt this is a large share of people. So we need a way to detect when p-fishing happens.
Pre-registration reports and sharing code are two suggestions I’ve seen, but both are gameable. Authors can simply run results before writing the pre-registration report, or share only the code for the results they end up using. So my suggestion is a bit abstract and will require more effort from a group that, in my experience, is largely opposed to doing work: reviewers and editors. What these people should do (and readers as well, though ideally p-fishing papers never get past the review stage) is consider all the possible data and tests that could have been run and compare that to what is presented in the paper.
This is an admittedly abstract exercise that will undoubtedly fail, but it’s an additional hurdle that’ll raise the standards of all research. Consider, for example, that the data I used in this post is from 2019 and only 2019, even though Monitoring the Future data is available every year from 1976-2020. Many papers use only a single year of study, so this probably won’t be questioned. But it should. Why am I only using a single year of data? Is it because results are best when I use that data and not a different year? Why didn’t I use multiple years, increasing my sample size, making my results more precise, and allowing analyses of subsets of respondents? This also applies to single-city or single-crime studies, which are very common. Why choose these specific samples or outcomes when others are readily available? These are the kinds of questions reviewers and editors should be asking. You may think it’s not very collegial to be suspicious of papers and expect unethical behavior. But the standard of research should be high, particularly in a field with as much policy relevance as criminology.
Another concern you may have is that this is a slippery slope to expecting expansive studies that are far more work than what came before - and which may exacerbate differences between established academics and those who desperately need published papers to advance their careers. Rising research standards will inevitably harm those who have fewer resources or need papers more than others; there’s no getting around this (other than potentially getting more resources to these people). But I’m also not suggesting that a (for example) single-city study now needs to use data from all cities similar to the one studied. When data is available but not used, though, that should raise some red flags. If you could do the same study in 5 cities with publicly available data that meets your study criteria and you choose only one city, that makes me suspicious. And even if you chose that one city without any intent to p-fish - maybe you don’t have sufficient programming skills to do all 5 - higher standards would still improve the research. Should we accept lower-quality research because it’s easier? I think not.
The next suggestion addresses running multiple tests, and it may be controversial, but I think it will be effective. We must p-fish to the extreme! Now that I’ve written this post, people can see the relationship between dating and 29 drug use variables, controlling for relevant covariates. Someone who wants to study this relationship can’t simply use Monitoring the Future data and run a regression. It’s no longer novel, so the standard to publish is higher. Higher standards are good. I readily admit that I don’t believe the results I found are real. It’s simply a correlation using a cross-sectional sample of teenagers who happened to be at school on a particular day, with whatever covariates happened to be in the data. Does that sound convincing to you? I hope not, but many criminology journals publish similar results.
So this kind of defensive first-strike p-fishing would help raise standards and prevent frivolous results from being published. Here’s what I propose: take the large, commonly used datasets and just run every regression. With covariates, without covariates, combined outcomes, separate outcomes; throw the kitchen sink at it. Then report all of these results. It’d be a massive report or website with likely thousands or tens of thousands of regressions. The basic idea is that you take the absolute lowest-hanging fruit - and therefore the easiest to manipulate - of just running regressions on convenient data, and make it impossible to pass off as novel research. Now researchers have to do better.
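The specification grid for such a report is mechanically trivial to enumerate. A sketch of how quickly it blows up - every name here is illustrative, not an actual Monitoring the Future variable or subsample:

```python
from itertools import product

# Hypothetical specification grid for a "run everything" report.
outcomes = [f"drug_{i}" for i in range(29)]
covariate_sets = ["none", "demographics", "demographics+family", "kitchen_sink"]
samples = ["full", "boys_only", "girls_only"]

# Every combination of outcome, covariate set, and subsample is one regression.
specs = list(product(outcomes, covariate_sets, samples))
print(len(specs))  # 29 outcomes x 4 covariate sets x 3 samples = 348 regressions
```

And that is with only one dating predictor, one dataset, and one survey year; add more years or more predictors and the grid reaches the thousands quickly.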
Let me be clear: we do not need papers that only have a simple correlation. As I’ve shown here, that can be done in a for loop for an infinitely large number of outcomes at a time. Descriptive research is essential, but it’s also easy to do. Or at least descriptive research from these massive secondary datasets is easy. People don’t need a Ph.D. to do them. When something can be automated like this, we should automate it. Write some code to run every analysis we want (which can be rerun for every new version of the data) and release it so people can see it. The fact that it’s running so many analyses should be a glaring warning sign not to take results too seriously. But the results will still be guideposts for researchers to conduct better studies on the topic.
People who staked their careers on these weak correlational studies using these big survey datasets will hate these suggestions. It means more work for them. But people who want to understand the world better and people who want to produce research that can be applicable - and accurate - in the real world should like these suggestions. This will not solve the problems of flawed research, but I think it will help.
Footnotes:
- Some of this may include legitimate reasons not to report results, like enormous confidence intervals making the measure imprecise or learning about data problems after running the analysis, but these are likely rare.
- Monitoring the Future also surveys middle school students, but I only use the 12th-grader sample.
- I also have some code to pretty up the results, like adding brackets to the CIs and rounding numbers.
- Except in food science, which is an amusing field to read papers in.
- These are also suggestions to combat p-hacking.