Missing metadata — data that provides information about other data — might not sound like a big deal, but it’s a costly problem that’s hindering humanity’s plans to protect the planet’s biodiversity.
“The way I see it, it’s pretty simple,” said Rachel Toczydlowski, a postdoctoral researcher in Michigan State University’s College of Natural Science. “If we want to monitor and conserve global genetic diversity — the most fundamental level of biodiversity — we need to improve our data archiving practices ASAP.”
Put another way, if humans want to be better stewards of their planet, they need to be better stewards of their data when cataloging organisms.
Toczydlowski is the lead author of a new study, published August 16 in the journal Proceedings of the National Academy of Sciences, by researchers from 14 institutions in three countries. The team audited the largest global repository for storing genetic sequence data to see if the entries included the basic metadata needed to make them useful for monitoring genetic diversity. More than half of the datasets they examined were missing that metadata, such as when and where a sample was collected.
When properly archived, these datasets allow researchers to track genetic diversity, which is a barometer of how well organisms are equipped to deal with a changing planet.
“Just as an ecosystem can be made up of thousands of species, every individual plant or animal has thousands of genes in its genome that help it to adapt and survive in its unique environment,” Toczydlowski wrote with Eric Crandall, the senior author on the study and an assistant research professor of biology at Pennsylvania State University.
Organisms with high genetic diversity are therefore very adaptable. Those lacking genetic diversity are more vulnerable to changing conditions, such as warming and drying environments or the appearance of an invasive species, and to poor health resulting from inbreeding. Genetic diversity affects the health of species, which in turn affects the health of ecosystems. Having diversity across all these levels is critical for a healthy planet, Toczydlowski said.
Researchers therefore want to know how much genetic diversity is in a given place at a given time to understand the health of those organisms and their environment. Tracking changes in genetic diversity over time would also let ecologists forecast how ecosystems will fare in the future and prepare accordingly. Conservationists, for example, could use the information to determine which organisms would be best suited to launch successful restoration efforts in disrupted ecosystems. But that goal can be met only if the available data are complete.
To get an idea of how much metadata was missing, the team surveyed thousands of datasets, representing more than 325,000 individual organisms from nearly 17,000 different species, from the International Nucleotide Sequence Database Collaboration, the largest data repository of its kind. The researchers found that 86% of these samples were missing important metadata.
“Researchers spend incredible amounts of time and money to generate genomic sequence data, and these data can provide novel insights into basically every field of biology, from conservation to ecology to behavior to evolution,” said Gideon Bradburd, an assistant professor in the Department of Integrative Biology and a co-author on the study. Toczydlowski works in Bradburd’s lab, which is also part of MSU’s Ecology, Evolution and Behavior program.
“But, if the context of the data — the location and time at which individuals are sampled — is dissociated from these genetic resources, they become much less useful,” Bradburd said. “Especially for conservation monitoring.”
“At a personal level, that was one of the most frustrating things for me,” Toczydlowski said. “As an evolutionary biologist, I know the hundreds of hours that it takes to generate a single dataset.”
Time goes into obtaining permits to collect samples, traveling to field sites and then actually tracking down the samples in the wild. And all of that comes before researchers return to the lab to extract the DNA they want to sequence, which costs about $50 per sample.
That may not sound like much, but when added up over all the samples from this study that researchers cannot reuse in future analyses because of missing metadata, the sum is in the tens of millions of dollars.
“For me, though, the loss is about more than the money,” Toczydlowski said. “It’s more akin to losing a really special family recipe. A lot of time and care went into generating it but no one else can benefit from that if it’s not written down.”
“Almost every photo that people take with their smartphones contains metadata that describes the time and place the photo was taken, so it comes as a surprise that expensive genetic sequence data do not have similar information attached,” said Crandall of Penn State. “The system for providing these metadata is difficult to learn quickly, and currently there just aren’t enough incentives for researchers to spend their valuable time on this.”
There is good news, though. Undergraduate and graduate students on the team were able to find a good chunk of that missing metadata published elsewhere in the scientific literature.
“They were able to resurrect about 20,000 individual samples that couldn’t have been used in future conservation monitoring otherwise,” Toczydlowski said. And the fact that these students were able to contribute is, in itself, a silver lining.
When the pandemic struck, Toczydlowski and her colleagues started discussing what they should do with grant money that was set to expire and had been earmarked for attending conferences. With travel and gatherings off the table, the team pivoted and put the money toward enlisting students to track down missing metadata about when, where and how samples used to generate genetic sequence data were collected.
“I thought they’d think that this was kind of boring,” Toczydlowski said. “It isn’t doing field work in Hawaii. It’s spending hours in front of a computer reading papers and finding data.”
But the students were fully engaged and enthusiastic, she said. Many commented that they made changes to their own research practices as a result of this experience. They also were able to work and network with other students, postdoctoral researchers and faculty from across the U.S., Australia and New Zealand. “We had all these people at different stages of their careers working on this project,” she said. “I find that to be pretty rare and it’s something that I’m really proud of.”
Moving forward, though, scientists will need to keep better records. The students were unable to locate the missing metadata for 67% of the datasets they worked on. This effort also took more than 2,000 human hours of work and likely wouldn’t have happened if the pandemic hadn’t stalled other research projects.
The team hopes its new paper and call for change will inspire better practices, but it doesn’t want the onus to fall solely on the researchers who collect the data.
“I think researchers are asked to do way more things than we have the time to do well, and there really aren’t incentives to do careful data archiving,” Toczydlowski said. “We need to ask the journals, funding agencies and data centers to help researchers put the data out there in a standard and robust way.”