Humans naturally learn by making connections between sight and sound. For example, we can watch someone playing the cello and recognize that the cellist's movements are producing the music we hear.
A novel technique devised by researchers from MIT and other institutions enhances an AI model’s capacity to learn in a similar manner. This advancement could prove beneficial in areas such as journalism and film production, where the model might assist in curating multimodal content through automated video and audio retrieval.
In the long run, this research could enhance a robot’s capability to comprehend real-world settings, where auditory and visual signals are frequently interconnected.
Building on previous work from their research group, the scientists devised a method that facilitates machine-learning models in aligning corresponding audio and visual data from video segments without requiring human annotations.
They modified how their initial model is trained so it acquires a more detailed correspondence between a specific video frame and the accompanying audio at that moment. The scientists also implemented certain structural adjustments that aid the system in balancing two distinct learning objectives, thereby boosting efficiency.
Collectively, these relatively straightforward enhancements increase the precision of their method in video retrieval tasks and in identifying actions in audiovisual scenes. For example, the updated approach could automatically and accurately link the sound of a door slamming with the visual of it closing in a video segment.
“We are developing AI systems that can interpret the world similarly to humans, processing both audio and visual information simultaneously and seamlessly integrating both modalities. Looking ahead, if we can merge this audio-visual technology into some of the tools we utilize daily, like large language models, it could unlock numerous new applications,” states Andrew Rouditchenko, a graduate student at MIT and co-author of a paper on this study.
Joining him on the paper are lead author Edson Araujo, a graduate student at Goethe University in Germany; Yuan Gong, a former MIT postdoctoral researcher; Saurabhchand Bhati, a current postdoctoral fellow at MIT; Samuel Thomas, Brian Kingsbury, and Leonid Karlinsky from IBM Research; Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Hilde Kuehne, a computer science professor at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The findings will be presented at the Conference on Computer Vision and Pattern Recognition.
Synchronization
This research builds on a machine-learning technique the researchers introduced a few years back, which offered an efficient method to train a multimodal model to concurrently process audio and visual data without needing human labels.
The researchers input unlabeled video clips into this model, known as CAV-MAE, which encodes the visual and audio information separately into representations referred to as tokens. Using the natural audio from the recording, the model autonomously learns to associate matching pairs of audio and visual tokens close together within its internal representation space.
They found that using two learning objectives balances the model's learning process, which enables CAV-MAE to understand the corresponding audio and visual data while improving its ability to retrieve video clips that match user queries.
However, CAV-MAE treats audio and visual samples as a single unit, meaning that a 10-second video segment and the sound of a door slamming are linked together, even if that audio event occurs within just one second of the video.
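To make this clip-level pairing concrete, here is a minimal sketch in which a whole clip's audio and video are each pooled into a single embedding and aligned with a standard contrastive loss. The encoders, dimensions, and pooling are illustrative assumptions, not the authors' CAV-MAE implementation.

```python
# Minimal sketch of clip-level audio-visual contrastive alignment
# (a simplification for illustration; not the authors' CAV-MAE code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipEncoder(nn.Module):
    """Placeholder encoder: maps a flattened clip (audio or video) to one embedding."""
    def __init__(self, in_dim, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings

def clip_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """InfoNCE-style loss: matching (audio, video) pairs from the same clip
    are pulled together; pairs from different clips are pushed apart."""
    logits = audio_emb @ video_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0))          # the diagonal holds the true pairs
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage: 8 ten-second clips, with audio and video features pooled per clip.
audio = torch.randn(8, 1024)   # e.g. pooled spectrogram features for the whole clip
video = torch.randn(8, 2048)   # e.g. pooled frame features for the whole clip
audio_enc, video_enc = ClipEncoder(1024), ClipEncoder(2048)
loss = clip_contrastive_loss(audio_enc(audio), video_enc(video))
print(loss.item())
# Note: a door slam one second into the clip still shares this single clip-level
# embedding with all ten seconds of video -- the coarseness CAV-MAE Sync addresses.
```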
In their refined model, called CAV-MAE Sync, the researchers divide the audio into smaller segments before the model calculates its data representations, allowing it to create separate representations corresponding to each smaller audio segment.
During its training, the model learns to connect a single video frame with the audio that occurs during that exact frame.
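A rough sketch of this finer-grained pairing follows: the clip's audio is chopped into short windows, and each sampled video frame is matched against the window covering the same moment. The segment count, pooling, and random embeddings are placeholder assumptions, not the CAV-MAE Sync code.

```python
# Sketch of finer-grained pairing: split a clip's audio into short windows and
# pair each sampled video frame with the audio window from the same moment.
import torch
import torch.nn.functional as F

def split_audio(spectrogram, num_segments):
    """Chop a clip-level spectrogram (time, freq) into equal-length windows."""
    time_steps = spectrogram.size(0) - spectrogram.size(0) % num_segments
    return spectrogram[:time_steps].reshape(num_segments, -1, spectrogram.size(1))

def frame_level_loss(frame_embs, segment_embs, temperature=0.07):
    """Contrastive loss over (frame, audio-window) pairs taken at the same time index,
    instead of one pair per whole clip."""
    f = F.normalize(frame_embs, dim=-1)
    a = F.normalize(segment_embs, dim=-1)
    logits = f @ a.t() / temperature
    targets = torch.arange(f.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage: 10 frames sampled from a clip, audio split into 10 matching windows.
spec = torch.randn(1000, 128)                   # (time, mel bins) for the whole clip
segments = split_audio(spec, num_segments=10)   # (10, 100, 128)
segment_embs = segments.mean(dim=1)             # placeholder pooling -> (10, 128)
frame_embs = torch.randn(10, 128)               # placeholder per-frame embeddings
print(frame_level_loss(frame_embs, segment_embs).item())
```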
“By doing so, the model cultivates a more refined correspondence, which proves beneficial later when we consolidate this information,” Araujo remarks.
They also introduced architectural enhancements that assist the model in balancing its two learning objectives.
Introducing “wiggle room”
The model employs a contrastive objective, where it learns to connect similar audio and visual data, alongside a reconstruction objective aimed at recovering specific audio and visual information based on user queries.
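One way to picture how two such objectives can be combined in a single training loss is sketched below, with a masked mean-squared-error term standing in for the reconstruction objective. The weighting and the reconstruction details are assumptions for illustration, not the paper's exact formulation.

```python
# Rough sketch of balancing a contrastive term with a reconstruction term in one
# training loss. The weight and masked-reconstruction details are assumptions.
import torch
import torch.nn.functional as F

def combined_loss(audio_emb, video_emb, decoded_patches, target_patches, mask,
                  contrastive_weight=0.1, temperature=0.07):
    # Contrastive term: align paired audio and video embeddings.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0))
    contrastive = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Reconstruction term: mean squared error computed on masked patches only.
    recon = ((decoded_patches - target_patches) ** 2).mean(dim=-1)   # (B, num_patches)
    recon = (recon * mask).sum() / mask.sum().clamp(min=1)

    return contrastive_weight * contrastive + recon

# Toy usage with random tensors standing in for model outputs.
B, P, D = 4, 16, 64
loss = combined_loss(torch.randn(B, 256), torch.randn(B, 256),
                     torch.randn(B, P, D), torch.randn(B, P, D),
                     (torch.rand(B, P) > 0.25).float())
print(loss.item())
```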
In CAV-MAE Sync, the researchers added two novel types of data representations, or tokens, to refine the model’s learning capability.
These include specialized “global tokens” that assist with the contrastive learning objective and dedicated “register tokens” that help the model concentrate on key aspects for the reconstruction objective.
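The simplified sketch below suggests how a handful of learnable global and register tokens might be prepended to the patch-token sequence, so that the contrastive and reconstruction objectives read from different parts of the encoder's output. Token counts, dimensions, and the small transformer are placeholders rather than the model's actual architecture.

```python
# Sketch of adding learnable "global" and "register" tokens ahead of the patch
# tokens, giving the two objectives separate slack in the representation.
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    def __init__(self, dim=256, num_global=1, num_register=4, depth=2, heads=4):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, dim))      # read by the contrastive head
        self.register_tokens = nn.Parameter(torch.zeros(1, num_register, dim))  # extra capacity for reconstruction
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_global = num_global
        self.num_register = num_register

    def forward(self, patch_tokens):
        b = patch_tokens.size(0)
        extra = torch.cat([self.global_tokens, self.register_tokens], dim=1).expand(b, -1, -1)
        x = self.encoder(torch.cat([extra, patch_tokens], dim=1))
        global_out = x[:, :self.num_global]                      # fed to the contrastive objective
        patch_out = x[:, self.num_global + self.num_register:]   # fed to the reconstruction decoder
        return global_out, patch_out

# Toy usage: a batch of 2 samples with 32 patch tokens each.
enc = TokenAugmentedEncoder()
g, p = enc(torch.randn(2, 32, 256))
print(g.shape, p.shape)   # (2, 1, 256) and (2, 32, 256)
```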
“Essentially, we introduce a bit more flexibility to the model, allowing it to perform each of these two tasks, contrastive and reconstructive, a bit more independently. This enhancement benefited overall performance,” Araujo explains.
While the researchers instinctively felt these improvements would elevate the performance of CAV-MAE Sync, a careful combination of strategies was required to steer the model in the intended direction.
“Given that we have multiple modalities, we require a strong model for both modalities individually, but we also need them to converge and collaborate,” Rouditchenko states.
Ultimately, their enhancements amplified the model’s capability to retrieve videos based on an audio query and predict the classification of an audiovisual scene, such as a dog barking or an instrument being played.
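As a rough illustration of what audio-based retrieval involves at inference time (not the paper's evaluation code), indexed video embeddings can be ranked by cosine similarity to the embedding of an audio query; the embeddings below are random stand-ins for model outputs.

```python
# Minimal sketch of audio-to-video retrieval by embedding similarity.
import torch
import torch.nn.functional as F

def retrieve(audio_query_emb, video_embs, top_k=5):
    """Rank stored video embeddings by cosine similarity to an audio query embedding."""
    q = F.normalize(audio_query_emb, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    scores = v @ q                      # cosine similarity per indexed video
    return torch.topk(scores, k=top_k)

video_embs = torch.randn(1000, 256)     # embeddings for 1,000 indexed clips
audio_query = torch.randn(256)          # embedding of, say, a dog-bark query
scores, indices = retrieve(audio_query, video_embs)
print(indices.tolist())                 # indices of the best-matching clips
```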
The approach was more accurate than their earlier work, and it even outperformed more complex, state-of-the-art methods that require larger amounts of training data.
“Sometimes, very straightforward ideas or small patterns observed in the data carry significant value when applied to a model you are developing,” Araujo notes.
In the future, the researchers aspire to integrate new models that generate enhanced data representations into CAV-MAE Sync, which could further elevate performance. They also aim to enable their system to manage text data, representing a crucial step towards creating an audiovisual large language model.
This research is funded, in part, by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.