Every cell in your body contains the same genetic sequence, yet each cell expresses only a subset of those genes. These cell-specific gene expression patterns, which ensure that a neuron differs from a skin cell, are partly determined by the three-dimensional structure of the genetic material, which controls the accessibility of each gene.
MIT chemists have developed a new way to determine those 3D genome structures using generative artificial intelligence. Their technique can predict thousands of structures in just minutes, making it much faster than existing experimental methods for analyzing the structures.
Using this technique, researchers could more easily study how the 3D organization of the genome affects individual cells’ gene expression patterns and functions.
“Our objective was to predict the three-dimensional genome configuration from the foundational DNA sequence,” says Bin Zhang, an associate professor of chemistry and the senior author of the study. “Having achieved this, which brings this method on par with the forefront experimental techniques, it truly opens up numerous intriguing possibilities.”
MIT graduate students Greg Schuette and Zhuohan Lao are the lead authors of the paper, which appears today in Science Advances.
From sequence to structure
Inside the cell nucleus, DNA and proteins form a complex called chromatin, which has several levels of organization, allowing cells to pack 2 meters of DNA into a nucleus that is only one-hundredth of a millimeter in diameter. Long strands of DNA wind around proteins called histones, giving rise to a structure somewhat like beads on a string.
Chemical tags known as epigenetic modifications can be attached to DNA at specific locations, and these tags, which vary by cell type, affect the folding of the chromatin and the accessibility of nearby genes. These differences in chromatin conformation help determine which genes are expressed in different cell types, or at different times within a given cell.
Over the past 20 years, scientists have developed experimental techniques for determining chromatin structures. One widely used technique, known as Hi-C, works by linking together neighboring DNA strands in the cell’s nucleus. Researchers can then determine which segments are located near each other by breaking the DNA into many small pieces and sequencing it.
This method can be used on large populations of cells to calculate an average structure for a section of chromatin, or on single cells to determine structures within that specific cell. However, Hi-C and similar techniques are labor-intensive, and it can take about a week to generate data from one cell.
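To make the idea of "which segments are near each other" concrete, here is a minimal Python sketch of the bookkeeping behind a contact map: it bins a genomic region and counts how often two bins show up linked together. The function name, bin size, and read-pair coordinates are all invented for illustration; this is not the Hi-C or Dip-C processing pipeline.

```python
def contact_map(pairs, region_start, region_end, bin_size):
    """Count linked fragment pairs per pair of genomic bins (toy example)."""
    n_bins = (region_end - region_start + bin_size - 1) // bin_size
    counts = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        # Ignore pairs with either end outside the region of interest.
        if not (region_start <= pos_a < region_end):
            continue
        if not (region_start <= pos_b < region_end):
            continue
        i = (pos_a - region_start) // bin_size
        j = (pos_b - region_start) // bin_size
        counts[i][j] += 1
        if i != j:
            counts[j][i] += 1  # keep the map symmetric
    return counts


# Made-up base-pair coordinates of fragments found linked together.
toy_pairs = [
    (1_000, 45_000),
    (2_500, 47_000),
    (12_000, 13_500),
    (30_000, 99_000),
    (5_000, 150_000),  # second end falls outside the region and is skipped
]
cmap = contact_map(toy_pairs, region_start=0, region_end=100_000, bin_size=20_000)
for row in cmap:
    print(row)
```

Summing such maps over many cells yields a population average, while a map built from one cell's pairs describes only that cell.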
To overcome these limitations, Zhang and his team developed a model that takes advantage of recent advances in generative AI to provide a fast, accurate way to predict chromatin structures in single cells. The model can quickly analyze DNA sequences and predict the chromatin structures that those sequences might produce in a cell.
“Deep learning excels in pattern identification,” Zhang says. “It enables us to examine extensive DNA segments, spanning thousands of base pairs, and discern the essential information encoded in those DNA base pairs.”
ChromoGen, the model that the researchers developed, has two components. The first, a deep learning model trained to “read” the genome, analyzes the information encoded in the underlying DNA sequence and in chromatin accessibility data, the latter of which is widely available and cell type-specific.
The second component is a generative AI model that predicts physically accurate chromatin conformations, having been trained on more than 11 million chromatin conformations. These data were generated from experiments using Dip-C (a variant of Hi-C) on 16 cells from a line of human B lymphocytes.
When integrated, the first component informs the generative model how the cell type-specific environment influences the formation of different chromatin structures, and this scheme effectively captures sequence-structure relationships. For each sequence, the researchers use their model to generate many possible structures. That is because DNA is a very disordered molecule, so a single DNA sequence can give rise to many different possible conformations.
“A significant complication in predicting the structure of the genome is that there isn’t one definitive solution we are targeting. There’s a range of structures, regardless of the segment of the genome you are examining. Predicting that intricate, high-dimensional statistical distribution is something that is extremely challenging,” Schuette notes.
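One way to picture a distribution of structures rather than a single answer is to summarize an ensemble of conformations statistically. The Python sketch below is only an illustration under stated assumptions: the "structures" are toy 3D random walks standing in for sampled conformations (they are not ChromoGen output), and the bead count, step length, and distance cutoff are arbitrary. It reports, for each pair of beads, the fraction of conformations in which the two lie within the cutoff.

```python
import math
import random


def random_walk_conformation(n_beads, rng, step=1.0):
    """Toy stand-in for one sampled structure: a 3D random walk of beads."""
    coords = [(0.0, 0.0, 0.0)]
    for _ in range(n_beads - 1):
        # Draw a random direction and take a fixed-length step along it.
        dx, dy, dz = (rng.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(dx * dx + dy * dy + dz * dz) or 1.0
        x, y, z = coords[-1]
        coords.append((x + step * dx / norm,
                       y + step * dy / norm,
                       z + step * dz / norm))
    return coords


def contact_probability(ensemble, cutoff):
    """Fraction of conformations in which each bead pair is within `cutoff`."""
    n = len(ensemble[0])
    hits = [[0] * n for _ in range(n)]
    for coords in ensemble:
        for i in range(n):
            for j in range(n):
                if math.dist(coords[i], coords[j]) <= cutoff:
                    hits[i][j] += 1
    return [[count / len(ensemble) for count in row] for row in hits]


rng = random.Random(0)  # fixed seed so the toy ensemble is reproducible
ensemble = [random_walk_conformation(20, rng) for _ in range(200)]
probs = contact_probability(ensemble, cutoff=1.5)
print("neighboring beads:", probs[0][1])
print("chain ends:", probs[0][19])
```

No single conformation in the ensemble looks like the averaged map, which is why a model that samples many structures per sequence captures information that one "best" structure would not.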
Swift analysis
Once trained, the model can generate predictions on a much faster timescale than Hi-C or other experimental techniques.
After training their model, the researchers used it to generate structure predictions for more than 2,000 DNA sequences and then compared them with the experimentally determined structures for those sequences. They found that the structures generated by the model were the same as or very similar to those seen in the experimental data.
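The article does not say how similarity was measured. As one generic possibility, the Python sketch below compares two ensembles by averaging their pairwise distance maps and computing a Pearson correlation between them. The metric is an assumption chosen for illustration, not the study's evaluation procedure, and the four-bead coordinates are made up.

```python
import math


def mean_distance_map(ensemble):
    """Average bead-to-bead distance over all conformations in an ensemble."""
    n = len(ensemble[0])
    dmap = [[0.0] * n for _ in range(n)]
    for coords in ensemble:
        for i in range(n):
            for j in range(n):
                dmap[i][j] += math.dist(coords[i], coords[j]) / len(ensemble)
    return dmap


def upper_triangle(matrix):
    """Flatten the entries above the diagonal (each bead pair once)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def similarity(ensemble_a, ensemble_b):
    """Correlation between the mean distance maps of two ensembles."""
    return pearson(upper_triangle(mean_distance_map(ensemble_a)),
                   upper_triangle(mean_distance_map(ensemble_b)))


# Made-up 4-bead conformations (arbitrary units).
experimental = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)],    # extended
    [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)],    # folded back
]
predicted_close = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0.2, 0)],
    [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0.1, 1, 0)],
]
predicted_far = [
    [(0, 0, 0), (2, 0, 0), (0, 0.1, 0), (2, 0.1, 0)],  # zigzag
]

score_close = similarity(experimental, predicted_close)
score_far = similarity(experimental, predicted_far)
print(round(score_close, 3), round(score_far, 3))
```

A score near 1 means the two ensembles place the same bead pairs close together and far apart on average; it says nothing about whether the spread of individual conformations also matches.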
“We typically examine hundreds or even thousands of configurations for each sequence, offering a reasonable representation of the diversity of structures that a specific region may possess,” Zhang notes. “If you conduct your experiment multiple times in various cells, you will likely obtain a markedly different conformation. That is the aim of our model.”
The researchers also found that the model could make accurate predictions for data from cell types other than the one it was trained on. This suggests that the model could be useful for analyzing how chromatin structures differ between cell types, and how those differences affect their function. The model could also be used to explore different chromatin states that can exist within a single cell, and how those changes affect gene expression.
“ChromoGen provides a novel framework for AI-driven exploration of genome folding principles and demonstrates that generative AI can connect genomic and epigenomic features with 3D genome structure, paving the way for future studies on the variability of genome structure and function across a wide range of biological contexts,” says Jian Ma, a professor of computational biology at Carnegie Mellon University, who was not involved in the research.
Another possible application would be to explore how mutations in a particular DNA sequence change the chromatin conformation, which could shed light on how such mutations may cause disease.
“There are numerous compelling questions that I believe we can address using this type of model,” Zhang says.
The researchers have made all of their data and the model available to others who wish to use it.
The research was funded by the National Institutes of Health.