Segment Anything Model
SAM: zero-shot image segmentation via points, boxes, masks.
Skill metadata
| Field | Value |
|---|---|
| Source | Bundled (installed by default) |
| Path | skills/mlops/models/segment-anything |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | segment-anything, transformers>=4.30.0, torch>=1.7.0 |
| Tags | Multimodal, Image Segmentation, Computer Vision, SAM, Zero-Shot |
Reference: full SKILL.md
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
Segment Anything Model (SAM)
Comprehensive guide to using Meta AI’s Segment Anything Model for zero-shot image segmentation.
When to use SAM
Use SAM when:
- You need to segment arbitrary objects without task-specific training
- You're building interactive annotation tools with point/box prompts
- You're generating training data for other vision models
- You need zero-shot transfer to new image domains
- You're building object detection/segmentation pipelines
- You're processing medical, satellite, or other domain-specific images
Key features:
- Zero-shot segmentation: Works on any image domain without fine-tuning
- Flexible prompts: Points, bounding boxes, or previous masks
- Automatic segmentation: Generate all object masks automatically
- High quality: Trained on 1.1 billion masks from 11 million images
- Multiple model sizes: ViT-B (fastest), ViT-L, ViT-H (most accurate)
- ONNX export: Deploy in browsers and edge devices
Consider these alternatives instead:
- YOLO/Detectron2: For real-time object detection with classes
- Mask2Former: For semantic/panoptic segmentation with categories
- GroundingDINO + SAM: For text-prompted segmentation
- SAM 2: For video segmentation tasks
Quick start
Installation
Section titled “Installation”# From GitHubpip install git+https://github.com/facebookresearch/segment-anything.git
# Optional dependenciespip install opencv-python pycocotools matplotlib
# Or use HuggingFace transformerspip install transformersDownload checkpoints
Section titled “Download checkpoints”# ViT-H (largest, most accurate) - 2.4GBwget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
# ViT-L (medium) - 1.2GBwget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth
# ViT-B (smallest, fastest) - 375MBwget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pthBasic usage with SamPredictor
Section titled “Basic usage with SamPredictor”import numpy as npfrom segment_anything import sam_model_registry, SamPredictor
# Load modelsam = sam_model_registry["vit_h"](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/models/segment-anything/checkpoint="sam_vit_h_4b8939.pth")sam.to(device="cuda")
# Create predictorpredictor = SamPredictor(sam)
# Set image (computes embeddings once)image = cv2.imread("image.jpg")image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)predictor.set_image(image)
# Predict with point promptsinput_point = np.array([[500, 375]]) # (x, y) coordinatesinput_label = np.array([1]) # 1 = foreground, 0 = background
masks, scores, logits = predictor.predict( point_coords=input_point, point_labels=input_label, multimask_output=True # Returns 3 mask options)
# Select best maskbest_mask = masks[np.argmax(scores)]HuggingFace Transformers
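To sanity-check a prediction, overlay the chosen mask on the image. A minimal matplotlib sketch, assuming the `image`, `best_mask`, and `input_point` variables from the block above:
```python
import matplotlib.pyplot as plt

# Paint the best mask as a translucent blue overlay (sketch)
overlay = np.zeros((*best_mask.shape, 4))
overlay[best_mask] = [0.12, 0.56, 1.0, 0.5]

plt.imshow(image)
plt.imshow(overlay)
plt.scatter(input_point[:, 0], input_point[:, 1], c="lime", marker="*", s=200)
plt.axis("off")
plt.show()
```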
HuggingFace Transformers
```python
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

# Load model and processor
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to("cuda")

# Process image with point prompt
image = Image.open("image.jpg")
input_points = [[[450, 600]]]  # Batch of points

inputs = processor(image, input_points=input_points, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate masks
with torch.no_grad():
    outputs = model(**inputs)

# Post-process masks to original size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)
```
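`post_process_masks` returns one boolean tensor per image, shaped `(num_prompts, num_candidates, H, W)`, and `outputs.iou_scores` ranks the candidates. A small sketch of picking the best candidate, under those shape assumptions:
```python
# Pick the highest-scoring candidate mask (sketch; shapes as noted above)
iou_scores = outputs.iou_scores[0, 0]      # (num_candidates,) for image 0, prompt 0
best_idx = int(iou_scores.argmax())
best_mask = masks[0][0][best_idx].numpy()  # (H, W) boolean mask
```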
Core concepts
Model architecture
```
SAM Architecture:
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Image Encoder  │────▶│ Prompt Encoder  │────▶│  Mask Decoder   │
│      (ViT)      │     │ (Points/Boxes)  │     │  (Transformer)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │
  Image Embeddings      Prompt Embeddings         Masks + IoU
  (computed once)       (per prompt)              predictions
```
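The practical consequence of this split is that `set_image` pays the encoder cost once, after which each prompt only runs the lightweight decoder. A short sketch using the `SamPredictor` API; the `(1, 256, 64, 64)` embedding shape is what the default 1024-pixel input resolution produces:
```python
# Heavy ViT encoder runs once per image
predictor.set_image(image)
embedding = predictor.get_image_embedding()
print(embedding.shape)  # expected: torch.Size([1, 256, 64, 64])

# Each prompt afterwards reuses the cached embedding (decoder only)
for pt in [[200, 300], [400, 500]]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([pt]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
```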
Model variants
| Model | Registry key | Size | Speed | Accuracy |
|---|---|---|---|---|
| ViT-H | vit_h | 2.4 GB | Slowest | Best |
| ViT-L | vit_l | 1.2 GB | Medium | Good |
| ViT-B | vit_b | 375 MB | Fastest | Good |
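The registry keys above map directly to `sam_model_registry`, so switching variants is a one-line change. A small helper sketch (checkpoint filenames as listed under Download checkpoints; the `load_sam` helper is illustrative, not part of the library):
```python
# Hypothetical helper: pick a SAM variant by registry key
CHECKPOINTS = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
}

def load_sam(variant: str = "vit_b", device: str = "cuda"):
    sam = sam_model_registry[variant](checkpoint=CHECKPOINTS[variant])
    return sam.to(device=device)
```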
Prompt types
| Prompt | Description | Use Case |
|---|---|---|
| Point (foreground) | Click on object | Single object selection |
| Point (background) | Click outside object | Exclude regions |
| Bounding box | Rectangle around object | Larger objects |
| Previous mask | Low-res mask input | Iterative refinement |
Interactive segmentation
Point prompts
```python
# Single foreground point
input_point = np.array([[500, 375]])
input_label = np.array([1])

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True
)

# Multiple points (foreground + background)
input_points = np.array([[500, 375], [600, 400], [450, 300]])
input_labels = np.array([1, 1, 0])  # 2 foreground, 1 background

masks, scores, logits = predictor.predict(
    point_coords=input_points,
    point_labels=input_labels,
    multimask_output=False  # Single mask when prompts are clear
)
```
Box prompts
```python
# Bounding box [x1, y1, x2, y2]
input_box = np.array([425, 600, 700, 875])

masks, scores, logits = predictor.predict(
    box=input_box,
    multimask_output=False
)
```
Combined prompts
```python
# Box + points for precise control
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    box=np.array([400, 300, 700, 600]),
    multimask_output=False
)
```
Iterative refinement
```python
# Initial prediction
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True
)

# Refine with additional point using previous mask
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375], [550, 400]]),
    point_labels=np.array([1, 0]),                     # Add background point
    mask_input=logits[np.argmax(scores)][None, :, :],  # Use best mask logits
    multimask_output=False
)
```
Automatic mask generation
Basic automatic segmentation
```python
from segment_anything import SamAutomaticMaskGenerator

# Create generator
mask_generator = SamAutomaticMaskGenerator(sam)

# Generate all masks
masks = mask_generator.generate(image)

# Each mask contains:
# - segmentation: binary mask
# - bbox: [x, y, w, h]
# - area: pixel count
# - predicted_iou: quality score
# - stability_score: robustness score
# - point_coords: generating point
```
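To eyeball the result, each mask can be painted a random translucent color over the image. A minimal numpy/matplotlib sketch, assuming `image` (RGB) and the `masks` list from above:
```python
import matplotlib.pyplot as plt

# Paint every generated mask a random translucent color (sketch)
overlay = np.zeros((*image.shape[:2], 4))
for m in sorted(masks, key=lambda x: x["area"], reverse=True):
    overlay[m["segmentation"]] = np.concatenate([np.random.random(3), [0.5]])

plt.imshow(image)
plt.imshow(overlay)
plt.axis("off")
plt.show()
```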
Section titled “Customized generation”mask_generator = SamAutomaticMaskGenerator( model=sam, points_per_side=32, # Grid density (more = more masks) pred_iou_thresh=0.88, # Quality threshold stability_score_thresh=0.95, # Stability threshold crop_n_layers=1, # Multi-scale crops crop_n_points_downscale_factor=2, min_mask_region_area=100, # Remove tiny masks)
masks = mask_generator.generate(image)Filtering masks
```python
# Sort by area (largest first)
masks = sorted(masks, key=lambda x: x['area'], reverse=True)

# Filter by predicted IoU
high_quality = [m for m in masks if m['predicted_iou'] > 0.9]

# Filter by stability score
stable_masks = [m for m in masks if m['stability_score'] > 0.95]
```
Batched inference
Section titled “Batched inference”Multiple images
Section titled “Multiple images”# Process multiple images efficientlyimages = [cv2.imread(f"image_{i}.jpg") for i in range(10)]
all_masks = []for image in images: predictor.set_image(image) masks, _, _ = predictor.predict( point_coords=np.array([[500, 375]]), point_labels=np.array([1]), multimask_output=True ) all_masks.append(masks)Multiple prompts per image
Section titled “Multiple prompts per image”# Process multiple prompts efficiently (one image encoding)predictor.set_image(image)
# Batch of point promptspoints = [ np.array([[100, 100]]), np.array([[200, 200]]), np.array([[300, 300]])]
all_masks = []for point in points: masks, scores, _ = predictor.predict( point_coords=point, point_labels=np.array([1]), multimask_output=True ) all_masks.append(masks[np.argmax(scores)])ONNX deployment
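For a genuinely batched decoder pass (instead of a Python loop), `SamPredictor` also exposes `predict_torch`, which takes torch tensors whose coordinates have been mapped into the model's input frame. This sketch follows the pattern from the official example notebook; the box values are placeholders:
```python
import torch

# Several boxes, one decoder pass; coordinates must be transformed first
input_boxes = torch.tensor([
    [75, 275, 1725, 850],    # placeholder boxes [x1, y1, x2, y2]
    [425, 600, 700, 875],
], device=predictor.device)

transformed_boxes = predictor.transform.apply_boxes_torch(input_boxes, image.shape[:2])
masks, scores, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=transformed_boxes,
    multimask_output=False,
)  # masks: (num_boxes, 1, H, W) tensor
```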
Section titled “ONNX deployment”Export model
Section titled “Export model”python scripts/export_onnx_model.py \ --checkpoint sam_vit_h_4b8939.pth \ --model-type vit_h \ --output sam_onnx.onnx \ --return-single-maskUse ONNX model
Section titled “Use ONNX model”import onnxruntime
# Load ONNX modelort_session = onnxruntime.InferenceSession("sam_onnx.onnx")
# Run inference (image embeddings computed separately)masks = ort_session.run( None, { "image_embeddings": image_embeddings, "point_coords": point_coords, "point_labels": point_labels, "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32), "has_mask_input": np.array([0], dtype=np.float32), "orig_im_size": np.array([h, w], dtype=np.float32) })Common workflows
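The exported graph contains only the prompt encoder and mask decoder, so the `image_embeddings` input above still comes from the PyTorch image encoder. A sketch of preparing the inputs, following the official ONNX example (note the dummy padding point with label -1 when no box prompt is given):
```python
# Embeddings come from the PyTorch encoder (sketch, per the ONNX example notebook)
predictor = SamPredictor(sam)
predictor.set_image(image)
image_embeddings = predictor.get_image_embedding().cpu().numpy()

# Pad the point prompt with a dummy (0, 0) / label -1 entry, then map
# coordinates into the model's input frame
input_point = np.array([[500, 375]])
input_label = np.array([1])
point_coords = np.concatenate([input_point, np.array([[0.0, 0.0]])], axis=0)[None, :, :]
point_labels = np.concatenate([input_label, np.array([-1])], axis=0)[None, :].astype(np.float32)
point_coords = predictor.transform.apply_coords(point_coords, image.shape[:2]).astype(np.float32)
```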
Section titled “Common workflows”Workflow 1: Annotation tool
Section titled “Workflow 1: Annotation tool”import cv2
# Load modelpredictor = SamPredictor(sam)predictor.set_image(image)
def on_click(event, x, y, flags, param): if event == cv2.EVENT_LBUTTONDOWN: # Foreground point masks, scores, _ = predictor.predict( point_coords=np.array([[x, y]]), point_labels=np.array([1]), multimask_output=True ) # Display best mask display_mask(masks[np.argmax(scores)])Workflow 2: Object extraction
Section titled “Workflow 2: Object extraction”def extract_object(image, point): """Extract object at point with transparent background.""" predictor.set_image(image)
masks, scores, _ = predictor.predict( point_coords=np.array([point]), point_labels=np.array([1]), multimask_output=True )
best_mask = masks[np.argmax(scores)]
# Create RGBA output rgba = np.zeros((image.shape[0], image.shape[1], 4), dtype=np.uint8) rgba[:, :, :3] = image rgba[:, :, 3] = best_mask * 255
return rgbaWorkflow 3: Medical image segmentation
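Usage is then a one-liner; note that `cv2.imwrite` expects BGRA channel order, hence the conversion (the point coordinates are placeholders):
```python
# Cut out the object under a point and save it with transparency
rgba = extract_object(image, (500, 375))
cv2.imwrite("object.png", cv2.cvtColor(rgba, cv2.COLOR_RGBA2BGRA))
```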
Section titled “Workflow 3: Medical image segmentation”# Process medical images (grayscale to RGB)medical_image = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)rgb_image = cv2.cvtColor(medical_image, cv2.COLOR_GRAY2RGB)
predictor.set_image(rgb_image)
# Segment region of interestmasks, scores, _ = predictor.predict( box=np.array([x1, y1, x2, y2]), # ROI bounding box multimask_output=True)Output format
Section titled “Output format”Mask data structure
Section titled “Mask data structure”# SamAutomaticMaskGenerator output{ "segmentation": np.ndarray, # H×W binary mask "bbox": [x, y, w, h], # Bounding box "area": int, # Pixel count "predicted_iou": float, # 0-1 quality score "stability_score": float, # 0-1 robustness score "crop_box": [x, y, w, h], # Generation crop region "point_coords": [[x, y]], # Input point}COCO RLE format
Section titled “COCO RLE format”from pycocotools import mask as mask_utils
# Encode mask to RLErle = mask_utils.encode(np.asfortranarray(mask.astype(np.uint8)))rle["counts"] = rle["counts"].decode("utf-8")
# Decode RLE to maskdecoded_mask = mask_utils.decode(rle)Performance optimization
GPU memory
Section titled “GPU memory”# Use smaller model for limited VRAMsam = sam_model_registry["vit_b"](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/models/segment-anything/checkpoint="sam_vit_b_01ec64.pth")
# Process images in batches# Clear CUDA cache between large batchestorch.cuda.empty_cache()Speed optimization
Section titled “Speed optimization”# Use half precisionsam = sam.half()
# Reduce points for automatic generationmask_generator = SamAutomaticMaskGenerator( model=sam, points_per_side=16, # Default is 32)
# Use ONNX for deployment# Export with --return-single-mask for faster inferenceCommon issues
Section titled “Common issues”| Issue | Solution |
|---|---|
| Out of memory | Use ViT-B model, reduce image size |
| Slow inference | Use ViT-B, reduce points_per_side |
| Poor mask quality | Try different prompts, use box + points |
| Edge artifacts | Use stability_score filtering |
| Small objects missed | Increase points_per_side |
References
- Advanced Usage - Batching, fine-tuning, integration
- Troubleshooting - Common issues and solutions
Resources
- GitHub: https://github.com/facebookresearch/segment-anything
- Paper: https://arxiv.org/abs/2304.02643
- Demo: https://segment-anything.com
- SAM 2 (Video): https://github.com/facebookresearch/segment-anything-2
- HuggingFace: https://huggingface.co/facebook/sam-vit-huge