Date: 29.05.2025

AI linked to explosion of low-quality biomedical research papers

Analysis flags hundreds of studies that seem to follow a template, reporting correlations between complex health conditions and single variables based on publicly available data sets.

Source: Nature.com, available at: www.nature.com

The scientific literature is at risk of becoming flooded with papers that make misleading health claims based on openly available data that are easy to process using artificial intelligence (AI) tools, researchers have warned.

In a study published in PLoS Biology on 8 May, scientists analysed more than 300 papers that used data from the US National Health and Nutrition Examination Survey (NHANES), an open data set of health records. The papers all seemed to follow a similar template, associating one variable (for example, vitamin D levels or sleep quality) with a complex disorder such as depression or heart disease, ignoring the fact that these conditions have many contributing factors.

“We have a sudden explosion in publication rates [of papers] that are extremely formulaic that could easily have been generated by large language models,” says study co-author Matt Spick, a biomedical scientist at the University of Surrey in Guildford, UK.

Spick and his colleagues found that the associations in many of the papers did not hold up to statistical scrutiny, and that some studies seemed to have cherry-picked data.

“Imagine you’re trying to pass an exam that has a particular pass rate, and you add as many questions as you want. You see which ones you got right, and you remove the ones that you got wrong. That’s basically what they’re doing,” explains Charlie Harrison, a computational biologist at Aberystwyth University, UK, who also worked on the study.

Ioana Alina Cristea, a clinical psychologist and meta-researcher at the University of Padua, Italy, agrees that the papers “seem to be written with a recipe”.

“We need these systematic evaluations to get some way to gauge the extent of the problem,” she says.

Surge in studies

NHANES is a long-running survey that collects data from thousands of people in the United States about their health, diet and lifestyle. The data set is publicly available and ready to plug into coding or AI systems for analysis, which has led to an increase in studies based on NHANES data over the past two years, Spick says. In 2024 alone, more than 2,200 association studies using NHANES data were published, and more than 1,200 have been published so far this year, according to the PubMed index of biomedical literature.

Harrison, Spick and their colleagues focused on a sample of 341 studies published between 2014 and 2024 that were based on NHANES data. The papers appeared in 147 journals produced by a range of publishers, including Frontiers Media, Elsevier and Springer Nature (Nature’s news team is editorially independent of its publisher).

The researchers identified 169 variables in these papers that were suggested to have statistically significant associations with health conditions. In some cases, the same variables seemed to have been reported as either causes or outcomes in different studies. For example, one paper suggested that levels of an inflammatory protein in the blood were associated with developing gum disease, whereas another linked an increase in the level of the same protein to a carbohydrate-rich diet. “They’ve got these fingerprints all over them of basically being produced to a formula,” says Spick.

The authors further analysed a subset of 28 papers that associated single variables with depression — the condition that appeared most frequently in their sample. They ran a statistical correction test to help to identify results that seem meaningful but could occur by chance. After this test, the reported associations remained significant in only 13 of the 28 papers. “The significance of the relationships no longer holds. They’re not valid any more,” says Harrison.
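To illustrate how such a correction can strip significance from results, here is a minimal Python sketch. The article does not name the test the authors used; the Benjamini-Hochberg false discovery rate procedure is assumed here as one common choice, and the p-values are hypothetical, not taken from the study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it survives the correction.

    Assumption: Benjamini-Hochberg is used only as an illustrative
    correction; the article does not say which test the authors ran.
    """
    m = len(p_values)
    # Rank the p-values from smallest to largest, remembering positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Every p-value ranked at or below that k is kept as significant.
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            keep[idx] = True
    return keep


# Hypothetical p-values: five of the six fall below the usual 0.05 cut-off.
example_p_values = [0.001, 0.012, 0.030, 0.040, 0.045, 0.20]
survivors = benjamini_hochberg(example_p_values)
print(sum(p < 0.05 for p in example_p_values), "nominally significant")
print(sum(survivors), "still significant after correction")
```

In this made-up example, several results that look significant on their own no longer pass once the number of tests is taken into account, which is the kind of drop the authors describe.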

Some papers also omitted parts of the available NHANES data set in their analyses, focusing on only certain years or age groups without giving a reason. Out of 14 papers looking at links between a marker of blood inflammation and conditions including diabetes and hearing loss, only 4 used complete NHANES data sets. And most of the papers analysed in the study limited their scope to a few years of data. “It would be difficult for that to happen accidentally,” says Spick. He suggests that in some cases, data could have been selected or omitted to achieve a positive association, or to generate several papers from a single data set. “You can work your way through all of the possible combinations to find something that shows some statistical significance.”
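Spick's point about working through combinations can be made concrete with simple arithmetic. The sketch below assumes that each tested combination is an independent test at a 0.05 significance threshold, which is a simplification (real subsets of one survey overlap), and the function name is made up for illustration.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    looks significant purely by chance, when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests


# Hypothetical numbers of year ranges, age groups or variables tried.
for num_tests in (1, 5, 14, 50):
    probability = chance_of_false_positive(num_tests)
    print(f"{num_tests:>3} tests: {probability:.0%} chance of a spurious hit")
```

Under these assumptions, the chance of at least one spurious "finding" already exceeds one half by 14 tests, so selectively reporting the subset that worked can produce a positive association from noise.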

Easy target

Although the study didn’t explore whether any of these papers could have been produced by paper mills (companies that churn out fake scientific papers to order), the fact that NHANES is easy to plug into AI systems would make it an easy target for those aiming to mass-produce low-quality papers, the authors say. They found that for their sample of papers, the publication rate started to increase markedly in 2022, around the time that large language models started to become more sophisticated and mainstream. And 190 of the papers, more than half of those sampled, were published in 2024.

The researchers suggest that public databases such as NHANES should ask researchers to register their study plans before giving them access to data. Such measures would be “an auditable step to try to stop people wholesale mining these kinds of data sets”, says Harrison. “When they’re exploited like this, it drowns out any meaningful finding.”

Cristea agrees that action is needed to halt the proliferation of questionable single-association studies. “It’s not informative any more to know that a single factor is related to depression, for example, because there are so many” other factors, she says. “It’s not going to lead to treatments.”

Nature 641, 1080-1081 (2025)

 
