What an Analysis of 6 Million Articles Reveals About the State of U.S. Newspapers

As newsrooms continue to shrink, a new study finds mixed news about the survival of investigative reporting.

September 08, 2021

by Shinhee Kang

Stanford GSB researchers trained a neural network to identify muckraking newspaper articles. | iStock/industryview

Since the late aughts, wave after wave of layoffs has pummeled the U.S. media industry. Employment at newspapers plummeted a staggering 57% between 2008 and 2020. As downsizing, cost-cutting, and closures continue, the future of the newspaper business seems in peril.

But does this mean the deterioration of journalistic content is imminent? And, more specifically, are newspapers producing less of the investigative stories readers rely on to uncover corruption, expose wrongdoing, and hold powerful individuals and institutions accountable?

“The worry is, who else will investigate a scandal in some small regional city?” says Gregory J. Martin, an assistant professor of political economy at Stanford Graduate School of Business. “The existence of this kind of coverage is important for accountability of elected officials and has positive effects on the functioning of representative democracy.”

Martin’s concern about the future of investigative reporting was shared by his fellow Stanford GSB researchers Shoshana Vasserman, an assistant professor of economics, and Eray Turkel, a PhD candidate studying applied microeconomics. Tracking the output of investigative reporting was difficult using existing tools, so they trained a neural network to identify investigative newspaper articles. Applying their model to millions of articles, they then sought to see whether a decade of shrinking newsrooms has led to a decline in the production of such content.

Their findings, recently published in Proceedings of the National Academy of Sciences, were not entirely dire. “On the one hand, things aren’t as bad as we thought,” Martin says. “On the other hand, very recently, it seems like the trend is quite down. And that seems to line up with major layoffs and downsizing. It’s mixed news.”

Investigating “Investigativeness”

The researchers looked at more than 5.9 million articles published between 2010 and 2020 by 50 U.S. newspapers with a track record of publishing investigative stories. To train the classification algorithm to identify which of these articles were based on investigative reporting, the team processed their full text and metadata to create a set of descriptive features indicative of “investigativeness.”

Even with a sensitive classification tool, sifting through millions of articles to find hard-hitting investigations was a challenge. A “very, very small fraction of the data that we had was investigative,” Turkel says. Given that just one to two percent of newspaper journalism can be characterized as investigative, it’s “inherently hard for any kind of machine learning method,” Martin adds. And, because investigative articles cover a wide range of topics, they don’t contain a predictable set of keywords that a computer can easily pick up on, unlike, say, sports stories. “It’s more contextual,” Martin says.
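That imbalance is exactly why a rare-class task like this is hard: with only one to two percent of articles being investigative, raw accuracy is a misleading yardstick. A quick illustration of the point, with hypothetical numbers (the study's actual corpus counts and metrics are not given in this article):

```python
# Hypothetical corpus: 1,000,000 articles, 1.5% truly investigative
total = 1_000_000
positives = int(total * 0.015)   # 15,000 investigative articles
negatives = total - positives    # 985,000 everything else

# A degenerate classifier that labels every article "not investigative"
# still scores very high accuracy while finding nothing at all:
accuracy = negatives / total     # 0.985
recall = 0 / positives           # 0.0 -- it recovers zero investigative pieces

# This is why rare-class precision and recall, not overall accuracy,
# are the meaningful measures for a task like this one.
```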


One part of the approach was to look at an article’s impact on topics discussed in future news stories. “If an article came out that uncovered a scandal around an institution or agency, and it’s important, you will presumably see other articles in the future mentioning this story or using the same kind of words,” Turkel explains. The classifier also used labels based on articles that had received investigative journalism awards. “We took 1,000 or so award-winning investigative articles to see how similar an article is in relation to them, in terms of their style or topic.”
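The article doesn't spell out how that similarity to award-winning exemplars was computed; one common approach is cosine similarity over word-count vectors. A minimal sketch under that assumption (the function names and toy texts below are illustrative, not from the study, and a real system would also strip stopwords and weight terms):

```python
import math
from collections import Counter

def bow(text):
    """Lowercase bag-of-words counts for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def max_similarity_to_exemplars(article, exemplars):
    """Score an article by its closest match among exemplar articles."""
    a = bow(article)
    return max(cosine(a, bow(e)) for e in exemplars)

# Toy stand-ins for award-winning investigative exemplars
exemplars = [
    "records show the agency concealed the contract from auditors",
    "court documents reveal officials diverted public funds",
]
score = max_similarity_to_exemplars(
    "leaked documents reveal the mayor diverted city funds", exemplars
)
```

An investigative-sounding article scores closer to the exemplars than, say, a game recap would, which is the intuition behind using award winners as anchors for the label set.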

Other signs of investigativeness were keywords such as “corruption” and references to documents, such as court records and Freedom of Information Act releases. Another predictive feature was a search for articles that appear to be part of a series. “If a paper is going to invest in an investigation, they’ll produce multiple pieces out of it because the cost of doing the investigation is high,” Martin says.
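Signals like these lend themselves to simple handcrafted features. As a sketch of what such feature extraction could look like (the cue lists, helper name, and headline heuristic below are illustrative guesses, not the study's actual feature set):

```python
import re

# Illustrative cue lists; the study's actual lexicon is not given in the article
KEYWORDS = {"corruption", "scandal", "probe", "fraud"}
DOCUMENT_CUES = [
    r"court records?",
    r"freedom of information act",
    r"\bFOIA\b",
    r"subpoena",
]

def investigative_features(text, headline, prior_headlines):
    """Extract a few 'investigativeness' cues from one article."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        # Count of investigative keywords present in the article body
        "keyword_hits": len(KEYWORDS & words),
        # References to primary documents (court records, FOIA releases, ...)
        "document_refs": sum(
            len(re.findall(pat, text, flags=re.IGNORECASE))
            for pat in DOCUMENT_CUES
        ),
        # A crude series signal: the headline repeats an earlier headline's
        # prefix, as follow-ups to a costly investigation often do
        "part_of_series": any(
            headline.lower().split(":")[0] == h.lower().split(":")[0]
            for h in prior_headlines
        ),
    }

feats = investigative_features(
    "Court records obtained under the Freedom of Information Act point to corruption.",
    "City Hall probe: the missing millions",
    ["City Hall probe: a contract under scrutiny"],
)
```

Features like these would then be fed, alongside the text itself, into the classifier rather than used as hard rules on their own.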

The team’s algorithm performed well at spotting quality investigative journalism. Even though it was trained on a narrowly defined set of award-winning investigative articles, the classifier could accurately identify articles with clear investigative qualities as well as identify highly productive investigative reporters and outlets that specialize in investigative work.

First, the Good News

Surprisingly, the number of investigative articles published held steady over most of the decade. “Going into this, I expected a steep decline,” Martin says. “What we actually find is pretty stable.”

One implication of these findings is that newsroom acquisitions by investment groups did not result in sustained declines in the production of investigative articles. But, the authors are careful to note, their findings also indicate that reorganizing and downsizing are slow-moving processes, so we may not be seeing the full extent of the consequences at this point.

As Turkel notes, the study is limited to 50 newspapers that have survived the past decade. Another caveat is that there was a significant drop in investigative content starting in 2019, corresponding with recent rounds of layoffs. For example, after the Austin American-Statesman was bought by a national publishing group in 2018, it went from publishing more than 15 investigative stories a month to two. “The metric picked that up well — it drops dramatically — so we can see that in the quality of investigative content they produce,” Martin says.

The study’s dataset of millions of articles and their predicted investigative news scores is publicly available. Turkel says other researchers have expressed interest in using the data to measure the impacts of newspaper closures, acquisitions, takeovers, and consolidations.

The Stanford GSB team plans to continue to use its data and algorithm to better understand the state of the news industry, particularly how readers respond to investigative reporting. In partnership with the nonprofit Mozilla, the researchers are launching a platform where web users can volunteer their browsing data for study. “We’re interested in how much quality drives reading and consumption,” Martin says. “We’d like to be able to say something about the value of investigative journalism to readers.”
