Using machine learning, MIT chemical engineers have built a computational model that can predict how well a given molecule will dissolve in an organic solvent, a key step in the synthesis of nearly any pharmaceutical. This kind of prediction could make it much easier to develop new ways to produce drugs and other useful molecules.
The new model, which predicts how much of a solute will dissolve in a particular solvent, should help chemists choose the right solvent for any given reaction in their synthesis, the researchers say. Common organic solvents include ethanol and acetone, and there are hundreds of others that can also be used in chemical reactions.
“Predicting solubility really is a bottleneck in synthetic planning and chemical manufacturing, especially for drugs, so there’s been a longstanding interest in improving solubility predictions,” says Lucas Attia, an MIT graduate student and one of the lead authors of the new study.
The researchers have made their model freely available, and many companies and labs have already started using it. The model could be particularly useful for identifying solvents that are less hazardous than some of the solvents most commonly used in industry, the researchers say.
“There are some solvents that are known to dissolve a wide range of substances. They’re really useful, but they’re damaging to the environment and to people, so many companies require that the use of these solvents be minimized,” says Jackson Burns, an MIT graduate student who is also a lead author of the paper. “Our model is extremely useful for identifying alternative solvents that will hopefully be much less damaging to the environment.”
William Green, the Hoyt Hottel Professor of Chemical Engineering and director of the MIT Energy Initiative, is the senior author of the study, which appears today in Nature Communications. Patrick Doyle, the Robert T. Haslam Professor of Chemical Engineering, is also a co-author of the paper.
Tackling solubility
The new model grew out of a project that Attia and Burns worked on together in an MIT course on applying machine learning to problems in chemical engineering. Traditionally, chemists have estimated solubility with a tool known as the Abraham Solvation Model, which estimates a molecule’s overall solubility by summing up the contributions of its chemical structures. While these predictions are useful, their accuracy is limited.
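For context, Abraham-type models are linear free-energy relationships: a handful of solute descriptors, each weighted by a solvent-dependent coefficient, are summed to estimate a property such as solubility. A commonly cited general form (included here for illustration, following the usual convention rather than anything specific to this study) is:

```latex
% General Abraham-type linear free-energy relationship (for context only).
% SP is the solute property being correlated, e.g. a solubility or partition ratio.
\log SP = c + e\,E + s\,S + a\,A + b\,B + v\,V
```

Here E, S, A, B, and V are solute descriptors (excess molar refraction, dipolarity/polarizability, hydrogen-bond acidity, hydrogen-bond basicity, and molecular volume), and the lowercase terms are coefficients fitted for each solvent.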
In recent years, researchers have turned to machine learning to make more accurate solubility predictions. Before Burns and Attia began their work, the state-of-the-art solubility model was one developed in Green’s lab in 2022.
That model, known as SolProp, works by predicting a set of related properties and combining them, using thermodynamics, to arrive at an estimate of solubility. However, it has difficulty predicting solubility for solutes it has never seen before.
“In drug and chemical discovery workflows, when you’re developing a new molecule, you want to be able to predict its solubility ahead of time,” Attia says.
Part of the reason existing solubility models have struggled is the lack of a comprehensive dataset to train on. In 2023, however, a new dataset called BigSolDB was released, compiling data from nearly 800 published papers, including solubility measurements for about 800 molecules dissolved in more than 100 organic solvents commonly used in synthetic chemistry.
Attia and Burns decided to train two different types of models on this data. Both represent the chemical structures of molecules with numerical representations known as embeddings, which include information such as each molecule’s number of atoms and which atoms are bonded to which. The models can then use these representations to predict a variety of chemical properties.
One of the models used in this study, called FastProp and developed by Burns and others in Green’s lab, uses “static embeddings,” meaning the embedding for each molecule is already determined before the model does any learning.
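As a rough illustration of the idea (this is not FastProp’s actual descriptor set), a static embedding can be a fixed-length feature vector computed once per molecule and reused unchanged throughout training:

```python
# Minimal sketch of a "static embedding": a fixed descriptor vector computed once
# per molecule before any model training. Uses RDKit descriptors for illustration
# only; this is not the descriptor set FastProp actually uses.
from rdkit import Chem
from rdkit.Chem import Descriptors

def static_embedding(smiles: str) -> list[float]:
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(mol),          # molecular weight
        Descriptors.MolLogP(mol),        # estimated lipophilicity
        Descriptors.TPSA(mol),           # topological polar surface area
        Descriptors.NumHDonors(mol),     # hydrogen-bond donors
        Descriptors.NumHAcceptors(mol),  # hydrogen-bond acceptors
    ]

# The same vector is fed to the model at every training step; it never changes.
print(static_embedding("CCO"))  # ethanol
```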
The other model, ChemProp, learns an embedding for each molecule as it trains, at the same time that it learns to associate the features of the embedding with properties such as solubility. This model, developed across several MIT labs, has already been used for tasks including antibiotic discovery, lipid nanoparticle design, and predicting chemical reaction rates.
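Conceptually, a learned embedding is produced by layers whose weights are updated during training, so the representation itself adapts to the property being predicted. The toy sketch below illustrates that idea only; it is not ChemProp’s message-passing architecture:

```python
# Conceptual sketch of a learned embedding: the featurizer is trained jointly
# with the property head, so the molecular representation adapts to the task.
# This is a toy stand-in, not ChemProp's actual architecture.
import torch
import torch.nn as nn

class LearnedEmbeddingModel(nn.Module):
    def __init__(self, n_input_features: int, embedding_dim: int = 64):
        super().__init__()
        # Trainable featurizer: maps raw molecular features to an embedding.
        self.featurizer = nn.Sequential(
            nn.Linear(n_input_features, embedding_dim), nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim), nn.ReLU(),
        )
        # Property head: maps the embedding to a solubility prediction.
        self.head = nn.Linear(embedding_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.featurizer(x))

model = LearnedEmbeddingModel(n_input_features=128)
# Both the featurizer and the head receive gradients during training.
loss = nn.functional.mse_loss(model(torch.randn(8, 128)), torch.randn(8, 1))
loss.backward()
```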
The researchers trained both types of models on more than 40,000 data points from BigSolDB, including information on the effects of temperature, which strongly influences solubility. They then tested the models on about 1,000 solutes that had been withheld from the training data. The models’ predictions turned out to be two to three times more accurate than those of SolProp, the previous best model, and the new models were especially accurate at predicting how solubility varies with temperature.
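Holding out entire solutes is what makes this test different from a random split: the model is scored on molecules it has never seen in any solvent. A minimal sketch of such a split using scikit-learn (the column names here are hypothetical, not BigSolDB’s actual schema):

```python
# Minimal sketch of withholding entire solutes for testing, so the model is
# evaluated on molecules absent from the training set. Column names and values
# are illustrative only.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "solute_smiles":  ["CCO", "CCO", "c1ccccc1O", "c1ccccc1O", "CC(=O)O"],
    "solvent_smiles": ["CC(C)=O", "CCCCCC", "CCO", "CC(C)=O", "CCO"],
    "temperature_K":  [298.0, 313.0, 298.0, 323.0, 298.0],
    "log_solubility": [-0.5, -0.3, -1.2, -1.0, -0.2],
})

# Group by solute so every measurement of a given solute lands entirely in
# either the training set or the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["solute_smiles"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
assert set(train["solute_smiles"]).isdisjoint(test["solute_smiles"])
```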
“Being able to accurately reproduce those small variations in solubility due to temperature, even in the presence of substantial overall experimental noise, was a really encouraging sign that the network had correctly learned an underlying solubility prediction function,” Burns says.
Precise predictions
The researchers had expected the ChemProp-based model, which learns new representations as it goes, to make more accurate predictions. To their surprise, the two models performed essentially the same. That suggests the main limitation on their performance is the quality of the data, and that the models are performing about as well as the data allows, the researchers say.
“ChemProp should outperform any static embedding whenever enough data is available,” Burns says. “We were surprised to see that the static and learned embeddings were statistically indistinguishable in performance across the different subsets, which suggests to us that the data limitations in this space dominated model performance.”
The models could become more accurate, the researchers say, if better training and testing data were available, ideally data collected by one person or a group of people all trained to run the experiments the same way.
“One of the big limitations of using these kinds of compiled datasets is that different labs use different methods and experimental conditions when they perform solubility measurements. That contributes to the variability between different datasets,” Attia says.
Because the FastProp-based model makes predictions faster and has code that is easier for other users to adapt, the researchers made it available to the public under the name FastSolv. Several pharmaceutical companies have already started using it.
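A hypothetical usage sketch of the released model is shown below; the import path, function name, and column names are assumptions for illustration, so consult the FastSolv project’s documentation for the actual interface:

```python
# Hypothetical usage sketch for the publicly released FastSolv model. The entry
# point and column names below are assumed for illustration; check the project's
# documentation for the real interface.
import pandas as pd
from fastsolv import fastsolv  # assumed entry point

queries = pd.DataFrame({
    "solute_smiles":  ["c1ccccc1O"],   # phenol
    "solvent_smiles": ["CCO"],         # ethanol
    "temperature":    [298.15],        # Kelvin
})

predictions = fastsolv(queries)  # assumed to return predicted log-solubilities
print(predictions)
```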
“There are applications throughout the drug discovery pipeline,” Burns says. “We’re also excited to see where people take this model beyond formulation and drug discovery.”
The research was funded, in part, by the U.S. Department of Energy.