Meta’s research division has introduced SAM 2 (Segment Anything Model 2), an AI system that marks a substantial advancement in video analysis.
This new model builds on its predecessor SAM’s image segmentation capabilities, extending them into the more complex domain of video.
Video segmentation – the ability to identify and track specific objects in a moving scene – has long been a challenge for AI.
While humans can effortlessly follow a car as it moves through traffic or a person walking through a crowd, AI systems tend to struggle.
This is a massive problem for driverless cars and other autonomous vehicles (AVs), which need to track moving 3D objects in their environments.
SAM 2 aims to bridge this gap, bringing AI’s understanding of video closer to human-level perception.
The system can identify and track virtually any object throughout a video with minimal user input – sometimes as little as a single click.
This opens up a world of possibilities in fields ranging from film editing to scientific research.
Here’s how Meta created SAM 2:
- The team created a technique called Promptable Visual Segmentation (PVS), allowing users to guide the AI with simple cues, such as clicks, on any video frame (a minimal prompting sketch follows the list below). This means the system can adapt to a wide range of scenarios, from tracking a specific person in a crowd to following the movement of a bird’s wing in flight.
- They built a model architecture that includes components for processing individual frames, storing information about objects over time, and generating precise segmentations. A key element is the memory module, which allows SAM 2 to maintain consistent tracking even when objects temporarily disappear from view (illustrated in the memory-bank sketch below).
- A massive new dataset was created, containing over 50,000 videos and roughly 35 million mask annotations – dwarfing previous video segmentation datasets. This dataset, named SA-V, covers a wide spectrum of object types, sizes, and scenarios, enhancing the model’s ability to generalize to new situations.
- The model underwent extensive training and testing across 17 diverse video datasets, from dashcam footage to medical imaging. SAM 2 outperformed existing state-of-the-art methods in semi-supervised video object segmentation tasks, achieving an average improvement of 7.5% in J&F scores (a standard metric for segmentation quality, sketched below).
Above: Image segmentation of complex video clips separates distinct objects in seconds.
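To make the prompting workflow concrete, here is a minimal sketch of click-based video segmentation. It mirrors the example usage published with Meta's open-source SAM 2 code, but the config name, checkpoint path, and exact function signatures shown here are assumptions that may differ between releases.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor  # from Meta's open-source SAM 2 repo

# Config and checkpoint names are placeholders; use the files shipped with the release you install.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Load the video (or a directory of extracted frames) and build the tracking state.
    state = predictor.init_state(video_path="./my_video_frames")

    # A single positive click (label = 1) on frame 0 is enough to select an object.
    predictor.add_new_points_or_box(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[420, 260]], dtype=np.float32),  # (x, y) pixel coordinates
        labels=np.array([1], dtype=np.int32),              # 1 = foreground, 0 = background
    )

    # Propagate the prompt through the rest of the video to get per-frame masks.
    masks_per_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```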
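The role of the memory module can be illustrated with a simplified, self-contained sketch: features and predicted masks from recent and prompted frames are kept in a small bank, and the current frame's features attend to that bank before a mask is decoded. Everything here (class names, the fusion step, the single-head attention) is a hypothetical approximation of the idea, not Meta's implementation.

```python
from collections import deque
import torch
import torch.nn.functional as F

class MemoryBank:
    """Holds fused (feature, mask) entries from recent frames; the oldest entries drop out first."""
    def __init__(self, max_frames: int = 6):
        self.entries = deque(maxlen=max_frames)

    def add(self, frame_features: torch.Tensor, mask_logits: torch.Tensor) -> None:
        # frame_features: (tokens, dim); mask_logits: (tokens, 1).
        # SAM 2 uses a learned memory encoder to fuse these; a broadcast add keeps the sketch simple.
        self.entries.append(frame_features + mask_logits)

    def as_tensor(self) -> torch.Tensor:
        # One flat sequence of memory tokens: (stored_frames * tokens, dim).
        return torch.cat(list(self.entries), dim=0)


def condition_on_memory(current_features: torch.Tensor, bank: MemoryBank) -> torch.Tensor:
    """Single-head cross-attention of current-frame tokens over the memory bank."""
    if not bank.entries:
        return current_features                      # first frame: nothing to attend to yet
    memory = bank.as_tensor()
    scores = current_features @ memory.T / current_features.shape[-1] ** 0.5
    attended = F.softmax(scores, dim=-1) @ memory
    return current_features + attended               # residual keeps current-frame detail
```

Because prompted frames can be kept in the bank alongside recent ones, an object that leaves the frame can be re-identified when it reappears, which is the occlusion behaviour described in the list above.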
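For reference, the J&F score averages region similarity (J, the Jaccard index, i.e. mask IoU) and contour accuracy (F, a boundary F-measure). Below is a simplified NumPy/SciPy version for a single frame; the official DAVIS-style evaluation uses a more careful boundary-matching procedure and averages over objects and frames.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """F: precision/recall of boundary pixels, matched within a small pixel tolerance."""
    def edge(mask: np.ndarray) -> np.ndarray:
        return mask & binary_dilation(~mask)          # mask pixels touching the background
    pb, gb = edge(pred), edge(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    precision = (pb & binary_dilation(gb, iterations=tol)).sum() / pb.sum() if pb.sum() else 0.0
    recall = (gb & binary_dilation(pb, iterations=tol)).sum() / gb.sum() if gb.sum() else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """J&F: the mean of region and boundary scores, as reported for video object segmentation."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))
```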
The potential applications of this technology span many fields:
- In film production, SAM 2 could streamline visual effects work, saving time in post-production
- Scientists could track cells in microscopy footage or monitor environmental changes in satellite imagery
- For AVs, including driverless cars, SAM 2 could enhance object detection in complex traffic scenarios
- Wildlife conservationists could employ SAM 2 to monitor animal populations in vast areas
- In AR/VR, it may enable more accurate interactions with virtual objects in live video
True to Meta’s commitment to open research, SAM 2 is being released as open-source software.
This includes not just the model, but also the dataset used to train it.
Researchers are already exploring ways to handle longer videos, improve performance on fine details, and reduce the computational power required to run the model.
As video segmentation technology matures, it’s sure to transform how we interact with and analyze video content.
From making complex editing tasks more accessible to enabling new forms of visual analysis, SAM 2 pushes the boundaries of what AI can do with video.