SAM Architecture

The design of SAM is highly flexible and enables seamless integration with other systems. It has been trained on millions of images and masks using a model-in-the-loop “data engine” approach. Researchers iteratively annotated images and updated the model, resulting in SAM’s advanced capabilities and understanding of objects.

SAM was trained on over 1.1 billion segmentation masks collected from approximately 11 million licensed and privacy-preserving images. The dataset was built using SAM’s ambiguity-aware design and a grid of points, allowing it to automatically annotate new images.

SAM utilizes a sophisticated architecture that enables efficient and accurate object segmentation. The model is designed with a three-step process involving an image encoder, a prompt encoder, and a mask decoder.

Image Encoder

The image encoder is responsible for capturing essential features from the input image. It processes the image through deep neural networks, extracting high-level representations that encode relevant information about objects and their context. This encoding step allows SAM to understand the visual content of the image.

Prompt Encoder

The prompt encoder plays a crucial role in SAM’s promptable design. It processes the user-provided prompts, which can be bounding boxes, points, or text, and transforms them into meaningful representations. These representations serve as guidance for SAM to understand the desired object to be segmented. By encoding prompts, SAM gains the ability to perform a wide range of segmentation tasks without requiring additional training.

Mask Decoder

The mask decoder takes the encoded information from both the image encoder and the prompt encoder to generate precise segmentation masks. It leverages a lightweight network that efficiently processes the encoded features and produces detailed object boundaries. The mask decoder’s computational efficiency enables it to run seamlessly in a web browser, allowing for near real-time segmentation results.