Conversational agents like ChatGPT and Claude have seen an astonishing surge in use over the last three years because they can assist you with an extensive array of tasks. Whether you’re composing Shakespearean sonnets, troubleshooting code, or seeking the answer to an obscure trivia question, these AI systems seem to have everything covered. What’s behind this adaptability? Billions, or even trillions, of words of text data scattered across the internet.
However, that data alone is insufficient to teach a robot to be a practical domestic or industrial aide. To understand how to manipulate, stack, and arrange various items in different settings, robots need demonstrations. You can think of robot training data as a collection of instructional videos that walk the systems through each step of a task. Gathering these demonstrations on real robots is labor-intensive and not perfectly reproducible, so engineers have created training data either with AI-generated simulations (which often fail to capture real-world physics) or by painstakingly crafting each digital environment from the ground up.
Scholars at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Toyota Research Institute may have discovered a method to generate the varied, authentic training environments robots require. Their “steerable scene generation” technique constructs digital environments such as kitchens, lounges, and eateries that engineers can utilize to replicate numerous real-life interactions and situations. Trained on over 44 million 3D rooms populated with models like tables and dishes, the tool positions existing assets in fresh scenes and subsequently refines each one into a physically accurate, lifelike setting.
Steerable scene generation constructs these 3D realms by “steering” a diffusion model — an AI system that fabricates visuals from random noise — towards a scenario you would encounter in daily life. The researchers employed this generative model to “in-paint” an environment, integrating specific components throughout the scene. Picture a blank canvas abruptly transforming into a kitchen dotted with 3D objects, which are gradually reorganized into a setting that emulates real-world physics. For instance, the system ensures that a fork does not pass through a bowl on a table — a frequent flaw in 3D graphics known as “clipping,” where models overlap or intersect.
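To make that loop concrete, here is a deliberately simplified Python sketch (not the authors’ code): the learned diffusion model is replaced by a toy “denoiser” that pulls noisy object positions toward a fixed tabletop layout, and the physics refinement is replaced by a pairwise separation pass that removes clipping between disk-shaped objects. All names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's code): the learned diffusion
# model is replaced by a pull toward a fixed four-object tabletop layout,
# and every object is treated as a disk of the same radius.
TARGET = np.array([[0.2, 0.2], [0.5, 0.5], [0.8, 0.3], [0.4, 0.8]])  # (x, y) per object
RADIUS = 0.08

def denoise_step(x, step_size=0.15):
    """One reverse-diffusion-style step: nudge noisy positions toward the predicted layout."""
    return x + step_size * (TARGET - x)

def resolve_clipping(x, iters=20):
    """Physics-style projection: push overlapping objects apart so nothing interpenetrates."""
    for _ in range(iters):
        for i in range(len(x)):
            for j in range(i + 1, len(x)):
                d = x[j] - x[i]
                dist = np.linalg.norm(d) + 1e-9
                overlap = 2 * RADIUS - dist
                if overlap > 0:                       # the two disks intersect ("clipping")
                    shift = 0.5 * overlap * d / dist
                    x[i] -= shift                     # move them apart symmetrically
                    x[j] += shift
    return x

x = rng.normal(0.5, 0.5, size=TARGET.shape)  # a "blank canvas": pure noise
for _ in range(30):                          # iterative denoising toward a plausible scene
    x = denoise_step(x)
x = resolve_clipping(x)                      # final refinement into a physically consistent layout
print(np.round(x, 3))
```

In the real system the denoiser is a learned model and the refinement uses a proper physics simulation, but the structure the article describes is the same: start from noise, denoise iteratively, then settle the result into a layout where objects no longer intersect.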
How precisely steerable scene generation steers its creations toward realism depends on the strategy you choose. Its primary strategy is “Monte Carlo tree search” (MCTS), in which the model builds a series of alternative scenes, filling them out in different ways toward a particular objective (such as maximizing physical realism or packing in as many edible items as possible). MCTS is the same strategy the AI program AlphaGo used to defeat human opponents in Go (a board game akin to chess): the system considers potential sequences of moves before committing to the most advantageous one.
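Below is a minimal, self-contained illustration of that idea, assuming a toy scene representation (a 3-by-3 grid of table slots) and a reward that simply counts placed objects, echoing the pack-in-as-many-items-as-possible objective; it is a sketch of MCTS framed as sequential scene building, not the paper’s implementation.

```python
import math, random

# Toy sequential scene-building problem (an illustrative assumption, not the
# paper's scene representation): a "scene" is a set of occupied slots on a
# 3x3 table grid, each action places one more object, and the reward counts
# placed objects, echoing the pack-in-as-many-items-as-possible objective.
SLOTS = range(9)
MAX_STEPS = 6              # how many placement decisions one scene allows

def actions(state):
    return [s for s in SLOTS if s not in state]

def reward(state):
    return len(state)      # more placed objects = a higher-scoring scene

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}             # action -> child Node
        self.visits, self.value = 0, 0.0

    def ucb_child(self, c=1.4):
        # Pick the child balancing exploitation (mean value) and exploration.
        return max(self.children.items(),
                   key=lambda kv: kv[1].value / (kv[1].visits + 1e-9)
                   + c * math.sqrt(math.log(self.visits + 1) / (kv[1].visits + 1e-9)))

def rollout(state, depth):
    # Randomly "fill out" the partial scene to the end and score it.
    state = set(state)
    while depth < MAX_STEPS and actions(state):
        state.add(random.choice(actions(state)))
        depth += 1
    return reward(state)

def mcts(iterations=500):
    root = Node(frozenset())
    for _ in range(iterations):
        node, depth = root, 0
        # 1) Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(actions(node.state)):
            _, node = node.ucb_child()
            depth += 1
        # 2) Expansion: try one untried placement, building on the partial scene.
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried and depth < MAX_STEPS:
            a = random.choice(untried)
            node.children[a] = Node(node.state | {a}, parent=node)
            node, depth = node.children[a], depth + 1
        # 3) Simulation and 4) backpropagation of the scene's score.
        value = rollout(node.state, depth)
        while node:
            node.visits += 1
            node.value += value
            node = node.parent
    # Read off the most-visited path as the final, fully built scene.
    scene, node = set(), root
    while node.children:
        a, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        scene.add(a)
    return scene

print(sorted(mcts()))
```

The same skeleton applies if the reward instead scores physical realism or any other preference; only the scoring function changes.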
“We are the first to apply MCTS to scene generation by framing the task as a sequential decision-making challenge,” states MIT Department of Electrical Engineering and Computer Science (EECS) PhD student Nicholas Pfaff, a researcher at CSAIL and principal author of a paper detailing the project. “We continually build upon partial scenes to yield improved or more preferred scenes over time. Consequently, MCTS generates scenes that are more intricate than the diffusion model was originally trained on.”
In one particularly revealing experiment, MCTS added the maximum number of objects to a straightforward restaurant scene. It showcased up to 34 items on a table, featuring enormous stacks of dim sum dishes, after being trained on scenes averaging only 17 objects.
Steerable scene generation also lets you create diverse training scenarios through reinforcement learning, which essentially teaches a diffusion model to achieve a goal via trial and error. After pre-training on the initial data, the model enters a second training phase in which you define a reward (essentially, a desired outcome with a score indicating how close the model is to that goal). The model then learns on its own to generate scenes with higher scores, often producing scenarios that differ significantly from those it was originally trained on.
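As a rough illustration of that second phase, the sketch below fine-tunes a tiny categorical “generator” over how many objects a scene contains, using a REINFORCE-style update; the actual system post-trains a scene diffusion model, but the score-the-sample-and-reinforce loop is the same basic idea. The target count and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generator" (an assumption for illustration): a categorical distribution
# over how many objects a scene contains, 0 through 10, starting uniform like
# a pre-trained prior. The real system fine-tunes a scene diffusion model.
logits = np.zeros(11)
TARGET_COUNT = 9          # arbitrary choice: reward scenes packed with objects
LEARNING_RATE = 0.5

def sample_scene():
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs), probs

def reward(num_objects):
    # The "desired outcome": a score that is higher the closer we get to the target.
    return -abs(num_objects - TARGET_COUNT)

for step in range(2000):
    n, probs = sample_scene()
    baseline = np.dot(probs, [reward(k) for k in range(len(logits))])  # expected reward
    advantage = reward(n) - baseline
    grad_log_prob = -probs
    grad_log_prob[n] += 1.0                                # gradient of log p(n) for a categorical
    logits += LEARNING_RATE * advantage * grad_log_prob    # REINFORCE-style update

probs = np.exp(logits - logits.max()); probs /= probs.sum()
print("most likely object count after fine-tuning:", int(probs.argmax()))
```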
Users can also prompt the system directly by typing in specific visual descriptions (like “a kitchen with four apples and a bowl on the table”), and steerable scene generation will render the requested scene with precision. For example, the tool followed users’ prompts accurately 98 percent of the time when building scenes of pantry shelves, and 86 percent of the time for messy breakfast tables. Both scores are at least a 10 percent improvement over comparable methods like “MiDiffusion” and “DiffuScene.”
The system can also complete specific scenes via prompting or light directions (like “come up with a different scene arrangement using the same objects”). You could ask it to place apples on several plates on a kitchen table, for instance, or to put board games and books on a shelf. It essentially “fills in the blank,” placing items in empty spots while keeping the rest of a scene intact.
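A hypothetical, heavily simplified version of that fill-in-the-blank behavior might look like the Python below: objects the user has fixed stay exactly where they are, and the requested items are rejection-sampled into vacant, non-colliding spots on the table. The object names and geometry are placeholders, not the system’s actual interface.

```python
import numpy as np

rng = np.random.default_rng(1)

RADIUS = 0.08                                    # every object is treated as a disk (assumption)
fixed = {"bowl": np.array([0.5, 0.5])}           # the part of the scene the user wants preserved
request = ["apple"] * 4                          # "four apples ... on the table"

def collides(p, placed):
    """True if a candidate position overlaps any already-placed object."""
    return any(np.linalg.norm(p - q) < 2 * RADIUS for q in placed.values())

placed = dict(fixed)                             # the fixed objects never move
for i, name in enumerate(request):
    for _ in range(1000):                        # rejection-sample a vacant, non-colliding spot
        candidate = rng.uniform(RADIUS, 1 - RADIUS, size=2)
        if not collides(candidate, placed):
            placed[f"{name}_{i}"] = candidate
            break

for name, position in placed.items():
    print(f"{name:>8}: {np.round(position, 2)}")
```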
As per the researchers, the strength of their project lies in its capacity to generate numerous scenes that roboticists can genuinely use. “A crucial insight from our findings is that it’s acceptable for the scenes we pre-trained on to not precisely mirror the scenes we actually need,” remarks Pfaff. “By utilizing our steering techniques, we can transcend that broad distribution and sample from a ‘better’ one. In other words, generating the varied, realistic, and task-oriented scenes that we truly wish to train our robots in.”
These expansive scenes became the testing grounds where they could capture a virtual robot engaging with different items. The machine carefully positioned forks and knives into a cutlery holder, for example, and rearranged bread onto plates in various 3D environments. Each simulation appeared fluid and realistic, mirroring the real-world, adaptable robots that steerable scene generation could eventually assist in training.
While the system could represent a promising path forward in producing ample diverse training data for robots, the researchers claim their work stands as more of a proof of concept. In the future, they aim to harness generative AI to create entirely new objects and scenes, rather than relying on a fixed library of assets. They also intend to incorporate articulated objects that the robot could open or twist (like cabinets or jars filled with food) to enhance the interactivity of the scenes.
To enhance their virtual environments further, Pfaff and his colleagues may include real-world objects using a library of objects and scenes gathered from images on the internet, leveraging their previous work on “Scalable Real2Sim.” By broadening the diversity and realism of AI-constructed robot testing grounds, the team aspires to cultivate a community of users that will generate substantial data, which could then serve as a massive dataset to educate dexterous robots in various skills.
“Currently, producing realistic scenes for simulation can be quite a challenging task; procedural generation can readily yield numerous scenes, but they are unlikely to accurately represent the environments the robot would encounter in the real world. Manually crafting bespoke scenes is both time-consuming and costly,” states Jeremy Binagia, an applied scientist at Amazon Robotics who was not involved in the paper. “Steerable scene generation presents a superior approach: train a generative model on a vast array of pre-existing scenes and tailor it (using a method like reinforcement learning) to specific downstream applications. In contrast to prior works relying on off-the-shelf vision-language models or focusing solely on arranging objects in a 2D grid, this method guarantees physical feasibility and accommodates full 3D translation and rotation, facilitating the generation of much more captivating scenes.”
“Steerable scene generation with post-training and inference-time search provides a novel and effective framework for automating scene generation at scale,” remarks Toyota Research Institute roboticist Rick Cory SM ’08, PhD ’10, who was also not involved in the paper. “Moreover, it can create ‘never-before-seen’ scenes that are deemed crucial for downstream tasks. In the future, merging this framework with extensive internet data could reach a significant milestone towards efficient training of robots for deployment in the real world.”
Pfaff authored the paper alongside senior author Russ Tedrake, who is the Toyota Professor of Electrical Engineering and Computer Science, Aeronautics and Astronautics, and Mechanical Engineering at MIT; a senior vice president of large behavior models at the Toyota Research Institute; and a CSAIL principal investigator. Other contributors included Toyota Research Institute robotics researcher Hongkai Dai SM ’12, PhD ’16; team lead and Senior Research Scientist Sergey Zakharov; and Carnegie Mellon University PhD student Shun Iwase. Their efforts were partially funded by Amazon and the Toyota Research Institute. The researchers presented their work at the Conference on Robot Learning (CoRL) in September.