Large language models (LLMs) excel at using textual reasoning to understand the context of a document and provide a logical answer about its contents. But these same LLMs often struggle to correctly answer even the simplest math problems.
Textual reasoning is usually a less-than-ideal way to work through computational or algorithmic tasks. While some LLMs can generate code, such as Python, to handle symbolic queries, the models don’t always know when to use code, or which kind of code would work best.
It seems LLMs may need a mentor to steer them toward the most effective technique.
Enter CodeSteer, a smart assistant developed by MIT researchers that guides an LLM to switch between code and text generation until it correctly answers a query.
CodeSteer, itself a smaller LLM, automatically generates a series of prompts to iteratively steer a larger LLM. After each round, it reviews the model’s current and previous answers and offers guidance on how to fix or refine the solution until it deems the answer correct.
The researchers found that augmenting a larger LLM with CodeSteer boosted its accuracy on symbolic tasks, such as multiplying numbers, playing Sudoku, and stacking blocks, by more than 30 percent. It also enabled less sophisticated models to outperform more advanced models with enhanced reasoning skills.
This advance could improve the problem-solving capabilities of LLMs for complex tasks that are especially difficult to solve with textual reasoning alone, such as planning paths for robots in uncertain environments or scheduling shipments in a global supply chain.
“There is a race to develop better and better models that are capable of doing everything, but we’ve taken a complementary approach. Researchers have spent years developing effective technologies and tools to tackle problems in many domains. We want to enable LLMs to select the right tools and methods, and make use of others’ expertise to enhance their own capabilities,” says Chuchu Fan, an associate professor of aeronautics and astronautics (AeroAstro) and principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS).
Fan, the senior author of the study, is joined on a paper about the work by LIDS graduate student Yongchao Chen; AeroAstro graduate student Yilun Hao; University of Illinois at Urbana-Champaign graduate student Yueying Liu; and MIT-IBM Watson AI Lab Research Scientist Yang Zhang. The research will be presented at the International Conference on Machine Learning.
An LLM “mentor”
Ask an LLM which number is bigger, 9.11 or 9.9, and it will often answer incorrectly when it relies on textual reasoning. But ask it to use code to answer the same question, and it can generate and execute a Python script to compare the two numbers, easily solving the problem.
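The kind of script the model might produce for this comparison is only a few lines long. The snippet below is an illustrative sketch, not output from the study:

```python
# Comparing the two decimals directly in code sidesteps the textual
# pitfall of assuming 9.11 is larger because 11 > 9.
a, b = 9.11, 9.9
print(a > b)      # False: 9.11 is not greater
print(max(a, b))  # 9.9, the larger of the two numbers
```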
Originally trained to understand and predict human language, LLMs are more likely to answer queries using text, even when code would be more effective. And while they have learned to generate code through fine-tuning, these models often produce an incorrect or less efficient version of the code.
Rather than trying to retrain a powerful LLM like GPT-4 or Claude to improve these capabilities, the MIT researchers fine-tune a smaller, lightweight LLM to guide a larger model in switching between text and code. Fine-tuning the smaller model doesn’t change the larger LLM, so there is no risk of undermining the larger model’s other abilities.
“We were also inspired by human behavior. In sports, a coach may not be better than the star athlete on the team, but the coach can still give helpful suggestions to guide the athlete. This steering method works for LLMs, too,” says Chen.
This mentor, CodeSteer, works in tandem with the larger LLM. It first reviews a query and determines whether text or code is better suited to the problem, and which sort of code would be best.
Then it generates a prompt for the larger LLM, telling it to use a coding method or textual reasoning to answer the query. The larger model follows this prompt to answer the query and sends the result back to CodeSteer, which reviews it.
If the answer is not correct, CodeSteer will continue prompting the LLM to try different approaches that might fix the problem, such as incorporating a search algorithm or constraints into its Python code, until the answer is correct.
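In outline, this feedback loop might look something like the sketch below. The three helper callables (propose_guidance, query_large_model, and judge_answer) are hypothetical stand-ins for the smaller steering model, the larger LLM, and CodeSteer’s evaluation step, not the researchers’ actual interface:

```python
# A minimal sketch of a CodeSteer-style steering loop, under the
# assumption that the three callables wrap the two models and the
# answer-evaluation step.
def steer(question, propose_guidance, query_large_model, judge_answer,
          max_rounds=5):
    history = []
    for _ in range(max_rounds):
        # The smaller model reviews the question and all prior attempts,
        # then decides whether to request code or textual reasoning (and,
        # in later rounds, suggests fixes such as adding a search
        # algorithm or constraints to the generated Python).
        guidance = propose_guidance(question, history)
        # The larger model follows that prompt and produces an answer.
        answer = query_large_model(question, guidance)
        history.append((guidance, answer))
        # If the answer is judged correct, stop; otherwise iterate.
        if judge_answer(question, answer, history):
            return answer
    return history[-1][1]  # give up and return the final attempt
```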
“We found that, oftentimes, the larger LLM will try to be lazy and use shorter, less efficient code that will not carry out the correct symbolic calculation. We designed CodeSteer to avoid this phenomenon,” Chen says.
A symbolic checker evaluates the code’s complexity and sends a signal to CodeSteer if the code is too simple or inefficient. The researchers also incorporated a self-answer checker into CodeSteer, which prompts the LLM to generate code that calculates the answer, to verify that it is correct.
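As a rough illustration of how a “too simple” signal could work (a heuristic of our own devising, not the authors’ implementation), a static check can flag generated code that performs no real computation and likely hard-codes its answer:

```python
import ast

def looks_too_simple(code: str) -> bool:
    """Illustrative heuristic: flag code that does no computational work."""
    tree = ast.parse(code)
    # Treat loops, function calls, and comprehensions as evidence of
    # actual computation; code with none of these likely hard-codes
    # a result instead of calculating it.
    work = sum(isinstance(node, (ast.For, ast.While, ast.Call,
                                 ast.ListComp, ast.GeneratorExp))
               for node in ast.walk(tree))
    return work == 0

print(looks_too_simple("answer = 42"))              # True: hard-coded result
print(looks_too_simple("total = sum(range(100))"))  # False: real computation
```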
Addressing intricate tasks
While creating CodeSteer, the researchers were unable to locate suitable symbolic datasets for fine-tuning and testing the model, as many existing benchmarks do not specify whether a particular query could be best addressed with text or code.
So they gathered a collection of 37 complex symbolic tasks, including spatial reasoning, mathematics, order reasoning, and optimization, and built their own dataset, called SymBench. They implemented a fine-tuning approach that uses SymBench to maximize the performance of CodeSteer.
In their experiments, CodeSteer outperformed all nine baseline methods they evaluated and boosted average accuracy from 53.3 percent to 86.4 percent. It maintained similar performance even on unseen tasks, and on a variety of LLMs.
Moreover, a general-purpose model augmented with CodeSteer can achieve higher accuracy than state-of-the-art models designed for complex reasoning and planning, while requiring much less computation.
“Our method uses an LLM’s own capabilities. By augmenting an LLM with the ability to smartly use coding, we can take a model that is already very strong and improve its performance even more,” Chen says.
In the future, the researchers want to streamline CodeSteer to speed up its iterative prompting process. In addition, they are studying how to effectively fine-tune a unified model with the ability to switch between textual reasoning and code generation, rather than relying on a separate assistant.
“The authors present an elegant solution to the critical challenge of tool utilization in LLMs. This simple yet impactful method enables state-of-the-art LLMs to achieve significant performance improvements without requiring direct fine-tuning,” says Jinsung Yoon, a staff research scientist at Google Cloud AI, who was not involved with this work. “This research represents a substantial contribution that promises to significantly enhance the application of LLMs to a diverse range of tasks with which they currently struggle.”
“Their success in training a smaller, specialized model to strategically guide larger, advanced models is particularly impactful,” adds Chi Wang, a senior staff scientist at Google DeepMind who was not involved with this work. “This intelligent collaboration among diverse AI ‘agents’ paves the way for more robust and versatile applications in complex real-world scenarios.”
This research is partially funded by the U.S. Office of Naval Research and the MIT-IBM Watson AI Lab.