A new study from Michigan State University makes inroads in learning to “read” the genome, a key goal of modern biology.
The results, published in eLife, show that the DNA content of our genomes resembles a complex biological language, composed of coding regions and regulatory regions. Protein-coding regions in DNA can be compared to a traffic signal – carrying a simple stop or go message – but the regulatory regions in DNA are more like poetry.
“The regulatory sites in DNA operate like a light switch to turn a gene on and off. In animals, it’s extremely complex,” said David Arnosti, MSU professor of biochemistry and molecular biology and the paper’s lead author. “There might be hundreds of protein factors in the cell that bind to the gene and impact activity. And there might be hundreds of binding places.”
He compares the “language” used in these regulatory sites to poetry.
“It may be Emily Dickinson, or Shakespeare or Allen Ginsberg; but all are using ‘words’ to evoke thoughts and emotions, to control the message,” he said. “As we enter an era where the DNA sequences of entire human populations are increasingly accessible, we would like to know the functional significance of changes in gene regulatory regions.”
Arnosti conducted the study with Rupinder Sayal, now assistant professor of biochemistry at DAV University (India); Jacqueline Dresch, now professor of mathematics at Clark University; and Irina Pushel, now a pre-doctoral researcher at the Stowers Institute for Medical Research.
The team studied a set of regulatory proteins responsible for switching genes on and off in the Drosophila (fruit fly) embryo. A regulatory factor called Dorsal controls a network of genes crucial for the development of these embryos. Dorsal binds to the regulatory region, or “enhancer,” of a gene called rhomboid; this enhancer has been well studied and is known to be a fairly typical example of an enhancer region. Dorsal is closely related to a gene that is critical for human immunity and for inflammation in disease.
“We analyzed dozens of variants of this gene and quantitatively measured expression in about 1,000 embryos, creating a quantitative data set that could be used to train mathematical models, utilizing parameter optimization,” Arnosti said. “Our study shows that the regulatory properties of specific control proteins are accessible by employing quantitative experiments and mathematical models.”
By applying an ensemble of models, the research team was able to identify conserved regulatory properties in other sequences, a step toward being able to “read” the genome.
“Using this approach, we will eventually be able to do the same thing you would do in English class – pick up a book of haiku or Shakespeare and understand that ‘this is a love poem,’ or ‘this is an elegy,’ because we’ll understand how the words – the DNA elements – are used in different contexts to convey different meanings on the regulation of genes,” Arnosti said.
Similar studies will be required to learn how mutations found across the genome may impact gene expression, leading to better diagnosis and treatment of disease.
“For example, we can compare gene expression in a tumor and in normal tissue from the same patient to figure out what went wrong in this gene network,” Arnosti said. “Having the power to read regulatory potential from DNA sequences can contribute to using precision medicine – prescribing certain drugs or treatments that would work specifically on that patient’s cancer.”
This research could also contribute to evolutionary studies that survey genomes where no research has been done before, revealing important regulatory properties that may aid the development of new products or disease treatments.
The work was funded by a grant from the National Institutes of Health.