In the 17th century, the German astronomer Johannes Kepler discovered the laws of planetary motion, which made it possible to predict accurately where the planets in our solar system would appear in the sky as they orbited the sun. But it wasn’t until many years later that Isaac Newton formulated the universal law of gravitation, which explained why those laws hold. Though inspired by Kepler’s work, Newton’s went far beyond it: the same equations could be applied to phenomena as varied as the trajectory of a cannonball, the moon’s pull on Earth’s tides, and how to launch a satellite from Earth to the moon or other celestial bodies.
Today’s artificial intelligence systems have become very good at making the kind of specific predictions that correspond to Kepler’s orbital forecasts. But do they know why these predictions work, with the kind of deep understanding that comes from foundational principles like Newton’s laws? As the world grows ever more dependent on these AI systems, researchers are trying to measure how they actually operate and how deep their understanding of the real world really is.
Researchers at MIT’s Laboratory for Information and Decision Systems (LIDS) and Harvard University have now devised a way to test how well these predictive systems understand their subject matter, and whether they can apply knowledge from one domain to a related one. So far, for the examples they analyzed, the answer is: not much.
The findings were presented at the International Conference on Machine Learning, held last month in Vancouver, British Columbia, by Harvard postdoc Keyon Vafa; MIT graduate student in electrical engineering and computer science and LIDS affiliate Peter G. Chang; MIT assistant professor and LIDS principal investigator Ashesh Rambachan; and MIT professor, LIDS principal investigator, and senior author Sendhil Mullainathan.
“Humans have always made the leap from good predictions to world models,” says Vafa, the study’s lead author. So the question their team set out to answer was: “have foundation models, has AI, been able to make that leap from predictions to world models? We’re not asking whether they’re capable, can they, or will they; it’s simply about whether they’ve done it so far,” he says.
“We know how to evaluate whether an algorithm predicts well. But what we need is a way to evaluate whether it understands well,” says Mullainathan, the Peter de Florez Professor with dual appointments in the MIT departments of Economics and Electrical Engineering and Computer Science and the senior author of the study. “Even defining what understanding means was a challenge.”
In the analogy of Kepler versus Newton, Vafa says, “both made models that worked extremely well on one task, and that worked in essentially the same way on that task. What Newton offered were ideas that could generalize to new tasks.” Applied to the predictions made by various AI systems, that capacity would mean developing a world model that lets a system “go beyond the task at hand and generalize to new kinds of problems and paradigms.”
Another analogy that helps illustrate the point is the difference between centuries of accumulated knowledge about how to selectively breed crops and animals, and Gregor Mendel’s insight into the underlying laws of genetic inheritance.
“There is a lot of excitement in the field about using foundation models not only to perform tasks but to learn something about the world,” particularly in the natural sciences, he says. “It would need to have a world model that it could adapt to any possible task.”
Are AI systems anywhere close to being able to make such generalizations? To test the question, the team looked at different examples of predictive AI systems at varying levels of complexity. On the very simplest examples, the systems succeeded in creating a realistic model of the simulated environment, but as the examples got more complex, that ability faded quickly.
The team developed a new metric, a way of measuring quantitatively how well a system approximates real-world conditions. They call the measurement inductive bias: a tendency, or bias, toward responses that reflect reality, based on inferences drawn from vast amounts of data on specific cases.
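As a rough illustration of what such a metric could look like, the sketch below checks whether a model’s predictions respect the state structure of a known world model: histories of observations that lead to the same underlying state should receive nearly identical predictive distributions, while histories that lead to different states should not. This is a toy stand-in, not the authors’ formal definition of inductive bias, and the names tv_distance, true_state, model_predict, and state_consistency_score are hypothetical.

```python
from itertools import combinations

def tv_distance(p, q):
    """Total-variation distance between two next-observation distributions,
    each given as a dict mapping observations to probabilities."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def state_consistency_score(histories, true_state, model_predict):
    """Toy probe: average disagreement between predictions for histories that
    end in *different* true states, minus the disagreement for histories that
    end in the *same* true state. Higher means the model's predictions line
    up better with the true state structure."""
    same, diff = [], []
    for h1, h2 in combinations(histories, 2):
        gap = tv_distance(model_predict(h1), model_predict(h2))
        (same if true_state(h1) == true_state(h2) else diff).append(gap)
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(diff) - mean(same)

# A model that ignores its input entirely cannot separate states, so it
# scores zero here no matter how the true states are defined.
print(state_consistency_score(
    ["LL", "LR", "RR"],
    true_state=lambda h: h.count("R") - h.count("L"),
    model_predict=lambda h: {"left": 0.5, "right": 0.5},
))  # 0.0
```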
The simplest examples they looked at are known as lattice models. In a one-dimensional lattice, movement is confined to a line. Vafa compares it to a frog jumping between lily pads arranged in a row. As the frog jumps or sits still, it calls out what it is doing: right, left, or stay. If it reaches the last lily pad in the row, it can only stay or go back. If someone, or an AI model, can only hear the calls, without knowing how many lily pads there are, can it figure out the configuration? The answer is yes: predictive models do well at reconstructing the “world” in such simple cases. But even with lattices, as the dimensionality increases, the systems fail to make that leap.
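To make the lily-pad picture concrete, here is a minimal sketch of the one-dimensional lattice world and of how a listener who hears only the calls can, in principle, recover the number of pads: track the frog’s position relative to its unknown starting pad, and the spread of positions visited converges to the lattice size. The functions simulate_frog and infer_num_pads are illustrative names, not code from the study.

```python
import random

def simulate_frog(num_pads, num_steps, seed=0):
    """Simulate a frog on a 1-D lattice of lily pads, emitting its moves.

    The frog starts on a random pad and at each step calls out 'left',
    'right', or 'stay'; at either end of the row it never calls a move
    that would carry it off the lattice."""
    rng = random.Random(seed)
    pos = rng.randrange(num_pads)
    calls = []
    for _ in range(num_steps):
        options = ["stay"]
        if pos > 0:
            options.append("left")
        if pos < num_pads - 1:
            options.append("right")
        move = rng.choice(options)
        pos += {"left": -1, "right": 1, "stay": 0}[move]
        calls.append(move)
    return calls

def infer_num_pads(calls):
    """Reconstruct the lattice size from the stream of calls alone.

    Tracking the frog's position relative to its unknown start, the spread
    between the leftmost and rightmost positions visited is a lower bound
    on the number of pads and converges to it as the walk gets long."""
    rel, lo, hi = 0, 0, 0
    for move in calls:
        rel += {"left": -1, "right": 1, "stay": 0}[move]
        lo, hi = min(lo, rel), max(hi, rel)
    return hi - lo + 1

calls = simulate_frog(num_pads=7, num_steps=5000)
print(infer_num_pads(calls))  # prints 7 once the walk has visited both ends
```

A predictive model with a strong inductive bias toward this world would, in effect, have to encode the same information that infer_num_pads extracts explicitly.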
“For instance, with a two-state or three-state lattice, we showed that the model does have a pretty good inductive bias toward the actual state,” says Chang. “But as we increase the number of states, it starts to diverge from real-world models.”
A more complex problem is a system that can play the board game Othello, in which players take turns placing black or white pieces on a grid. The AI models can accurately predict what moves are allowable at a given point, but they turn out to do a poor job of inferring the overall arrangement of pieces on the board, including pieces that are currently blocked from play.
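For a sense of what “inferring the arrangement of pieces” demands, the sketch below reconstructs the full board from a sequence of moves using the standard Othello flipping rules, including pieces that are boxed in and irrelevant to the next legal move. It is an illustrative re-implementation of the game’s rules, not the probe used in the study.

```python
# Illustrative ground-truth Othello tracker: given the move sequence an AI
# model observes, reconstruct the full board, blocked pieces included.

DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def initial_board():
    """Standard 8x8 starting position: two black and two white pieces."""
    board = [["."] * 8 for _ in range(8)]
    board[3][3], board[4][4] = "W", "W"
    board[3][4], board[4][3] = "B", "B"
    return board

def flips(board, row, col, player):
    """Opponent pieces that would be flipped by `player` playing at (row, col)."""
    opponent = "W" if player == "B" else "B"
    flipped = []
    for dr, dc in DIRS:
        line, r, c = [], row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            line.append((r, c))
            r, c = r + dr, c + dc
        # The line counts only if it is capped by one of the player's pieces.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(line)
    return flipped

def apply_move(board, row, col, player):
    """Place a piece and flip the captured lines (the move is assumed legal)."""
    for r, c in flips(board, row, col, player) + [(row, col)]:
        board[r][c] = player

board = initial_board()
# A legal three-move opening, given as (row, col) moves alternating Black, White.
for i, (r, c) in enumerate([(2, 3), (2, 2), (2, 1)]):
    apply_move(board, r, c, "B" if i % 2 == 0 else "W")
print("\n".join(" ".join(row) for row in board))
```

The contrast drawn in the study is that a model can get the next legal moves right without ever recovering a board description like this one.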
The team then looked at five different categories of predictive models actually in use, and again, the more complex the systems involved, the more poorly the predictive processes matched the true underlying world model.
With this new metric of inductive bias, “our hope is to provide a kind of testing ground where you can evaluate different models and training approaches on problems whose true world models are known,” Vafa says. If a model does well in those cases, where the underlying reality is clear, then we can have more confidence that its predictions may prove useful even in cases “where we don’t know the truth,” he adds.
People are already trying to use these predictive AI systems to aid in scientific discovery, including predicting the properties of chemical compounds that have never been synthesized, of potential pharmaceutical compounds, and of the folding behavior and properties of unknown protein molecules. “For the more realistic problems,” Vafa says, “even for basic mechanics, we found that there is still a long way to go.”
Chang says, “There has been a lot of hype around foundation models, as people try to build domain-specific foundation models: biology-based, physics-based, robotics-based, and other kinds of domains where a great deal of data has been collected,” and then train those models to make predictions, “hoping they will acquire some domain-specific knowledge that can be applied to other downstream tasks.”
This work shows that there is much left to do, but it also points a way forward. “Our paper shows that we can use our metrics to evaluate how much a representation is learning, so we can come up with better ways of training foundation models, or at least evaluate the models we are building now,” Chang says. “In engineering, once we have a metric for something, people are very good at optimizing that metric.”