Segment Anything Model (SAM)
Learning Objectives
- Understand SAM's "promptable segmentation" paradigm
- Learn the Image Encoder, Prompt Encoder, and Mask Decoder architecture
- Understand SAM's training data and data-engine methodology
- Learn how to apply SAM in practice
1. SAM Overview
1.1 Foundation Model for Segmentation
SAM (Segment Anything Model) is a vision foundation model released by
Meta AI in 2023 that can segment any object in any image.
SAM's innovation:

Traditional segmentation models:
- Only specific classes (person, car, etc.)
- Only objects present in the training data
- Per-class models, or a fixed number of classes

SAM:
- Can segment any object
- Target objects are specified with prompts
- Zero-shot: handles novel objects out of the box

Prompt types:
- Point: click location (foreground/background)
- Box: bounding box
- Mask: rough mask (refinement)
- Text: text description (SAM 2, Grounding SAM)
1.2 The SA-1B Dataset

Scale:
- 11M images
- 1.1B (1.1 billion) masks
- ~100 masks per image on average

Collection method (the Data Engine):

Phase 1: Assisted Manual (4.3M masks)
- Professional annotators label with SAM's assistance
- SAM proposes a mask → a human corrects it

Phase 2: Semi-Automatic (5.9M masks)
- SAM automatically generates the masks it is confident about
- Humans label only the remaining objects

Phase 3: Fully Automatic (1.1B masks)
- Masks generated automatically from a 32×32 grid of point prompts
- Filtered afterwards to select the final masks
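Phase 3 prompts SAM with a regular grid of points covering the whole image. A minimal sketch of building such a grid in normalized coordinates (the helper name is ours, not SAM's):

```python
import numpy as np

def build_point_grid(points_per_side: int) -> np.ndarray:
    """Evenly spaced point prompts in normalized [0, 1] coordinates.

    Points sit at cell centers, so a 32-per-side grid yields
    32 * 32 = 1024 prompts spread uniformly over the image.
    """
    offset = 1.0 / (2 * points_per_side)
    coords_1d = np.linspace(offset, 1.0 - offset, points_per_side)
    xs, ys = np.meshgrid(coords_1d, coords_1d)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)  # (N, 2)

grid = build_point_grid(32)
print(grid.shape)                 # (1024, 2)
print(grid.min(), grid.max())     # 0.015625 0.984375
```

Multiplying these coordinates by the image width/height gives the pixel locations actually fed to the model.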
2. SAM Architecture
2.1 Overall Architecture
SAM encodes an image once with a heavy encoder, then answers any number
of prompts with a lightweight decoder:

  Input Image (1024×1024)
        │
        ▼
  Image Encoder (MAE pre-trained ViT-H/16)
    • 1024×1024 input → 64×64 feature map
    • 632M parameters
    • runs once per image (the expensive step)
        │
        ▼
  Image Embeddings (64×64×256)
        │
        │        Prompts ──► Prompt Encoder
        │                      • sparse (points/boxes) → N×256 embeddings
        │                      • dense (mask) → conv downscale, 256×64×64
        │                               │
        ▼                               ▼
  Mask Decoder (lightweight Transformer)
    • 2-layer Transformer decoder
    • cross-attention: prompts ↔ image
    • self-attention: prompt tokens
    • 4M parameters (very light)
        │
        ├──► 3 mask outputs (256×256, upscaled)
        └──► IoU scores (confidence)
2.2 Image Encoder
"""
SAM Image Encoder: MAE pre-trained ViT-H
νΉμ§:
- ViT-H/16: 632M parameters
- μ
λ ₯: 1024Γ1024 (κ³ ν΄μλ)
- μΆλ ₯: 64Γ64Γ256 feature map
- Positional Embedding: Windowed + Global attention
μ MAE pre-training?
- λ§μ€νΉ κΈ°λ° νμ΅μΌλ‘ dense predictionμ μ ν©
- μκΈ° μ§λ νμ΅μΌλ‘ λκ·λͺ¨ λ°μ΄ν° νμ©
- Patch-level νν νμ΅μ ν¨κ³Όμ
"""
import torch
import torch.nn as nn
class SAMImageEncoder(nn.Module):
"""
SAMμ Image Encoder (κ°μν λ²μ )
μ€μ λ‘λ ViT-Hλ₯Ό μ¬μ©νμ§λ§,
μ¬κΈ°μλ ꡬ쑰 μ΄ν΄λ₯Ό μν κ°μν
"""
def __init__(
self,
img_size: int = 1024,
patch_size: int = 16,
embed_dim: int = 1280, # ViT-H
depth: int = 32,
num_heads: int = 16,
out_chans: int = 256,
):
super().__init__()
self.patch_embed = nn.Conv2d(3, embed_dim, patch_size, patch_size)
self.pos_embed = nn.Parameter(
torch.zeros(1, (img_size // patch_size) ** 2, embed_dim)
)
        # Transformer blocks (TransformerBlock is assumed defined
        # elsewhere; the real model interleaves windowed and global
        # attention blocks)
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads)
            for _ in range(depth)
        ])
        # NOTE: the official code uses a channels-first "LayerNorm2d"
        # (assumed defined elsewhere); plain nn.LayerNorm normalizes the
        # last dimension and fails on (B, C, H, W) tensors
        self.neck = nn.Sequential(
            nn.Conv2d(embed_dim, out_chans, kernel_size=1),
            LayerNorm2d(out_chans),
            nn.Conv2d(out_chans, out_chans, kernel_size=3, padding=1),
            LayerNorm2d(out_chans),
        )
def forward(self, x):
# x: (B, 3, 1024, 1024)
x = self.patch_embed(x) # (B, embed_dim, 64, 64)
x = x.flatten(2).transpose(1, 2) # (B, 4096, embed_dim)
x = x + self.pos_embed
for block in self.blocks:
x = block(x)
# Reshape back to 2D
B, N, C = x.shape
H = W = int(N ** 0.5)
x = x.transpose(1, 2).reshape(B, C, H, W)
x = self.neck(x) # (B, 256, 64, 64)
return x
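The sketch above leaves `TransformerBlock` (and the channels-first `LayerNorm2d` used in the neck) undefined. A minimal stand-in, assuming a standard pre-norm ViT block rather than SAM's windowed/global variant:

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dim of a (B, C, H, W) tensor."""
    def __init__(self, num_channels: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        u = x.mean(1, keepdim=True)
        s = (x - u).pow(2).mean(1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.eps)
        return x * self.weight[:, None, None] + self.bias[:, None, None]

class TransformerBlock(nn.Module):
    """Standard pre-norm ViT block: self-attention + MLP, with residuals."""
    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Shape check on a tiny configuration
block = TransformerBlock(dim=64, num_heads=4)
out = block(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

With these two helpers defined, the simplified `SAMImageEncoder` above runs end to end.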
2.3 Prompt Encoder
class SAMPromptEncoder(nn.Module):
"""
SAM Prompt Encoder
ν둬ννΈ μ’
λ₯:
1. Points: (x, y) + label (foreground/background)
2. Boxes: (x1, y1, x2, y2)
3. Masks: μ΄μ λ§μ€ν¬ (refinementμ©)
"""
def __init__(self, embed_dim: int = 256, image_size: int = 1024):
super().__init__()
self.embed_dim = embed_dim
self.image_size = image_size
# Point embeddings
self.point_embeddings = nn.ModuleList([
nn.Embedding(1, embed_dim), # foreground
nn.Embedding(1, embed_dim), # background
])
        # Positional encoding for coordinates (PositionalEncoding is
        # assumed defined elsewhere; SAM uses random Fourier features)
        self.pe_layer = PositionalEncoding(embed_dim, image_size)
# Box corner embeddings
self.box_embeddings = nn.Embedding(2, embed_dim) # top-left, bottom-right
        # Mask encoder (for dense prompts); LayerNorm2d is the
        # channels-first LayerNorm used by the official code
        self.mask_downscaler = nn.Sequential(
            nn.Conv2d(1, embed_dim // 4, kernel_size=2, stride=2),
            LayerNorm2d(embed_dim // 4),
            nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, kernel_size=2, stride=2),
            LayerNorm2d(embed_dim),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
        )
# No-mask embedding
self.no_mask_embed = nn.Embedding(1, embed_dim)
def forward(self, points=None, boxes=None, masks=None):
"""
Args:
points: (B, N, 2) μ’ν + (B, N) λ μ΄λΈ
boxes: (B, 4) λ°μ΄λ© λ°μ€
masks: (B, 1, H, W) μ΄μ λ§μ€ν¬
Returns:
sparse_embeddings: (B, N_prompts, embed_dim)
dense_embeddings: (B, embed_dim, H, W)
"""
sparse_embeddings = []
        # Point prompts
        if points is not None:
            coords, labels = points
            point_embed = self.pe_layer(coords)  # positional encoding
            fg = self.point_embeddings[0].weight  # (1, embed_dim)
            bg = self.point_embeddings[1].weight
            for i in range(coords.shape[1]):
                # pick the type embedding per point: label 1 = foreground
                label = labels[:, i:i + 1]  # (B, 1)
                type_embed = torch.where(label == 1, fg, bg)
                sparse_embeddings.append(point_embed[:, i] + type_embed)
# Box prompts
if boxes is not None:
# Box = 2 corner points
corners = boxes.reshape(-1, 2, 2) # (B, 2, 2)
corner_embed = self.pe_layer(corners)
corner_embed += self.box_embeddings.weight
sparse_embeddings.extend([corner_embed[:, 0], corner_embed[:, 1]])
sparse_embeddings = torch.stack(sparse_embeddings, dim=1) if sparse_embeddings else None
# Dense prompt (mask)
if masks is not None:
dense_embeddings = self.mask_downscaler(masks)
else:
# No mask: learnable embedding
dense_embeddings = self.no_mask_embed.weight.reshape(1, -1, 1, 1)
dense_embeddings = dense_embeddings.expand(-1, -1, 64, 64)
return sparse_embeddings, dense_embeddings
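The `pe_layer` above is referenced but never defined. SAM encodes coordinates with random Fourier features: coordinates are mapped to [-1, 1], projected by a fixed random Gaussian matrix, then passed through sin/cos. A minimal sketch (the class name and constructor signature match the simplified encoder above, not the official code):

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Random Fourier-feature positional encoding for 2D coordinates."""
    def __init__(self, embed_dim: int, image_size: int, scale: float = 1.0):
        super().__init__()
        self.image_size = image_size
        # Fixed random projection: 2 coords -> embed_dim / 2 frequencies
        self.register_buffer(
            "gaussian_matrix", scale * torch.randn(2, embed_dim // 2)
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (..., 2) pixel coordinates
        x = coords / self.image_size           # map to [0, 1]
        x = 2 * x - 1                          # map to [-1, 1]
        x = 2 * torch.pi * (x @ self.gaussian_matrix)
        # sin/cos pair doubles the dim back to embed_dim
        return torch.cat([torch.sin(x), torch.cos(x)], dim=-1)

pe = PositionalEncoding(embed_dim=256, image_size=1024)
out = pe(torch.tensor([[[512.0, 512.0], [100.0, 900.0]]]))  # (1, 2, 2)
print(out.shape)  # torch.Size([1, 2, 256])
```

Because the projection matrix is fixed (a buffer, not a parameter), nearby points get similar encodings, which is what lets the decoder reason about spatial position.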
2.4 Mask Decoder
class SAMMaskDecoder(nn.Module):
"""
SAM Mask Decoder
ꡬ쑰:
- 2-layer Transformer decoder
- Cross-attention: tokens β image
- Self-attention: tokens
- 3κ°μ λ§μ€ν¬ μΆλ ₯ (multi-scale)
- IoU prediction head
"""
def __init__(
self,
embed_dim: int = 256,
num_heads: int = 8,
        num_mask_tokens: int = 4,  # 1 IoU token + 3 mask tokens
):
super().__init__()
        # Output tokens (learnable): 1 IoU token + 3 mask tokens
self.mask_tokens = nn.Embedding(num_mask_tokens, embed_dim)
# Transformer layers
self.transformer = TwoWayTransformer(
depth=2,
embed_dim=embed_dim,
num_heads=num_heads,
)
# Output heads
self.iou_prediction_head = nn.Sequential(
nn.Linear(embed_dim, embed_dim),
nn.GELU(),
nn.Linear(embed_dim, num_mask_tokens - 1), # 3 IoU scores
)
self.mask_prediction_head = nn.Sequential(
nn.ConvTranspose2d(embed_dim, embed_dim // 4, kernel_size=2, stride=2),
nn.GELU(),
nn.ConvTranspose2d(embed_dim // 4, embed_dim // 8, kernel_size=2, stride=2),
nn.GELU(),
nn.Conv2d(embed_dim // 8, num_mask_tokens - 1, kernel_size=1),
)
def forward(self, image_embeddings, sparse_embeddings, dense_embeddings):
"""
Args:
image_embeddings: (B, 256, 64, 64)
sparse_embeddings: (B, N_prompts, 256)
dense_embeddings: (B, 256, 64, 64)
Returns:
masks: (B, 3, 256, 256)
iou_predictions: (B, 3)
"""
# Combine sparse and mask tokens
mask_tokens = self.mask_tokens.weight.unsqueeze(0).expand(
sparse_embeddings.shape[0], -1, -1
)
tokens = torch.cat([mask_tokens, sparse_embeddings], dim=1)
        # Add the dense prompt embeddings to the image features
        # (the real model also adds a positional encoding here; reusing
        # the dense embeddings as image_pe is a simplification)
        image_pe = dense_embeddings
        src = image_embeddings + dense_embeddings  # (B, 256, 64, 64)
        # Flatten to token form for the Transformer
        B, C, H, W = src.shape
        src = src.flatten(2).transpose(1, 2)       # (B, 4096, 256)
        pos = image_pe.flatten(2).transpose(1, 2)
        # Two-way Transformer: cross-attention between tokens and image
        tokens, src = self.transformer(tokens, src, pos)
        # Token 0 predicts IoU; tokens 1..3 are the mask tokens (unused
        # here: this simplified head predicts masks directly from the
        # image features, whereas the official decoder applies per-mask
        # hypernetworks to the mask tokens)
        iou_predictions = self.iou_prediction_head(tokens[:, 0])
        # Upscale the image features and predict the masks
        src = src.transpose(1, 2).reshape(B, C, H, W)
        masks = self.mask_prediction_head(src)  # (B, 3, 256, 256)
        return masks, iou_predictions
class TwoWayTransformer(nn.Module):
"""
Two-way Transformer for SAM
νΉμ§:
- Token β Image cross-attention
- Image β Token cross-attention
- Token self-attention
"""
def __init__(self, depth, embed_dim, num_heads):
super().__init__()
self.layers = nn.ModuleList([
TwoWayAttentionBlock(embed_dim, num_heads)
for _ in range(depth)
])
def forward(self, tokens, image, image_pe):
for layer in self.layers:
tokens, image = layer(tokens, image, image_pe)
return tokens, image
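`TwoWayAttentionBlock` is referenced but not defined above. A minimal stand-in covering the three attention steps, using plain `nn.MultiheadAttention` instead of the official downsampled attention (layer layout is a simplification):

```python
import torch
import torch.nn as nn

class TwoWayAttentionBlock(nn.Module):
    """One two-way layer: tokens and image features update each other."""
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_t2i = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_i2t = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.norms = nn.ModuleList([nn.LayerNorm(embed_dim) for _ in range(4)])

    def forward(self, tokens, image, image_pe):
        # 1) token self-attention
        tokens = self.norms[0](tokens + self.self_attn(tokens, tokens, tokens)[0])
        # 2) tokens attend to image features (image_pe added as position info)
        k = image + image_pe
        tokens = self.norms[1](tokens + self.cross_t2i(tokens, k, image)[0])
        # 3) token MLP
        tokens = self.norms[2](tokens + self.mlp(tokens))
        # 4) image features attend back to the tokens
        image = self.norms[3](image + self.cross_i2t(k, tokens, tokens)[0])
        return tokens, image

blk = TwoWayAttentionBlock(embed_dim=32, num_heads=4)
t, im = blk(torch.randn(2, 5, 32), torch.randn(2, 64, 32), torch.randn(2, 64, 32))
print(t.shape, im.shape)  # torch.Size([2, 5, 32]) torch.Size([2, 64, 32])
```

Step 4 is what makes the transformer "two-way": the image features are updated by the prompts, not just read by them.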
3. Using SAM
3.1 Basic Usage
from segment_anything import SamPredictor, sam_model_registry
import cv2
import numpy as np
# Load the model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
predictor = SamPredictor(sam)
# Set the image
image = cv2.imread("image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image)
# Segment with a point prompt
input_point = np.array([[500, 375]])  # click location
input_label = np.array([1])  # 1: foreground, 0: background
masks, scores, logits = predictor.predict(
point_coords=input_point,
point_labels=input_label,
    multimask_output=True,  # return 3 candidate masks
)
# Pick the mask with the highest score
best_mask = masks[np.argmax(scores)]
3.2 More Prompt Types
# 1. Multiple points
input_points = np.array([[500, 375], [600, 400], [450, 350]])
input_labels = np.array([1, 1, 0]) # 2 foreground, 1 background
masks, scores, _ = predictor.predict(
point_coords=input_points,
point_labels=input_labels,
    multimask_output=False,  # single mask
)
# 2. Box prompt
input_box = np.array([100, 100, 500, 400]) # x1, y1, x2, y2
masks, scores, _ = predictor.predict(
box=input_box,
multimask_output=False,
)
# 3. Point + Box combined
masks, scores, _ = predictor.predict(
point_coords=input_point,
point_labels=input_label,
box=input_box,
multimask_output=False,
)
# 4. Iterative refinement (reuse the previous mask)
masks, scores, logits = predictor.predict(
point_coords=input_point,
point_labels=input_label,
    mask_input=logits[np.argmax(scores)][None, :, :],  # previous low-res logits
multimask_output=False,
)
3.3 Automatic Mask Generation
from segment_anything import SamAutomaticMaskGenerator
# Automatic mask generator
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,  # 32×32 grid of point prompts
    pred_iou_thresh=0.88,  # predicted-IoU threshold
    stability_score_thresh=0.95,  # stability-score threshold
    min_mask_region_area=100,  # minimum mask area (pixels)
)
# Generate masks for everything in the image
masks = mask_generator.generate(image)
# κ²°κ³Ό: list of dicts
# {
# 'segmentation': binary mask,
# 'area': mask area,
# 'bbox': bounding box,
# 'predicted_iou': IoU score,
# 'stability_score': stability score,
# 'crop_box': crop used for generation,
# }
print(f"Found {len(masks)} masks")
# Visualization
import matplotlib.pyplot as plt
def show_masks(image, masks):
plt.figure(figsize=(15, 10))
plt.imshow(image)
for mask in masks:
m = mask['segmentation']
color = np.random.random(3)
colored_mask = np.zeros((*m.shape, 4))
colored_mask[m] = [*color, 0.5]
plt.imshow(colored_mask)
plt.axis('off')
plt.show()
show_masks(image, masks)
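The stability_score_thresh used by SamAutomaticMaskGenerator above measures how robust a mask is to the binarization cutoff: threshold the predicted logits at +δ and −δ and compare the two regions. A minimal NumPy sketch of that computation:

```python
import numpy as np

def stability_score(mask_logits: np.ndarray, offset: float = 1.0) -> float:
    """IoU between the mask thresholded at +offset and at -offset.

    A score near 1.0 means small changes to the cutoff barely change
    the mask, i.e. the prediction has sharp, confident boundaries.
    """
    high = mask_logits > offset    # shrunken mask
    low = mask_logits > -offset    # expanded mask
    intersection = np.logical_and(high, low).sum()
    union = np.logical_or(high, low).sum()
    return float(intersection / union) if union > 0 else 0.0

# A confident prediction: logits far from the 0 cutoff
confident = np.where(np.arange(100).reshape(10, 10) < 50, 8.0, -8.0)
print(stability_score(confident))  # 1.0

# An uncertain prediction: logits hovering around 0
uncertain = np.random.uniform(-0.5, 0.5, size=(10, 10))
print(stability_score(uncertain))  # 0.0 (no pixel clears the +1 cutoff)
```

Masks whose score falls below stability_score_thresh (0.95 above) are discarded during automatic generation.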
3.4 SAM via Hugging Face Transformers
from transformers import SamModel, SamProcessor
import torch
from PIL import Image
# Load the model and processor
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
# Load the image
image = Image.open("image.jpg").convert("RGB")
# Point prompt
input_points = [[[500, 375]]] # batch of points
inputs = processor(image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
# Post-process
masks = processor.image_processor.post_process_masks(
outputs.pred_masks.cpu(),
inputs["original_sizes"].cpu(),
inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
4. SAM 2 (2024)
4.1 What's New in SAM 2
SAM vs SAM 2:

SAM (2023):
- Images only
- Each frame is processed independently
- Video: a prompt is needed on every frame

SAM 2 (2024):
- Unified image + video model
- Temporal consistency through memory attention
- Prompt once → track through the whole video

New components:
- Memory Encoder: encodes information from past frames
- Memory Bank: stores past masks and their features
- Memory Attention: the current frame attends to the stored past information
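The memory components can be pictured as a small FIFO of past-frame features that the current frame cross-attends to. A toy sketch of that idea only (not the actual SAM 2 implementation; all names and sizes here are illustrative):

```python
from collections import deque

import torch
import torch.nn as nn

class ToyMemoryBank(nn.Module):
    """Keeps features of the last `capacity` frames; the current frame
    attends over everything stored. A caricature of SAM 2's memory
    attention, for intuition only."""
    def __init__(self, embed_dim: int = 64, capacity: int = 6):
        super().__init__()
        self.bank = deque(maxlen=capacity)  # FIFO of (1, N, D) features
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

    def condition(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (1, N, D) features of the current frame
        if not self.bank:
            return frame_feats  # first frame: nothing to attend to
        memory = torch.cat(list(self.bank), dim=1)  # (1, N * k, D)
        attended, _ = self.attn(frame_feats, memory, memory)
        return frame_feats + attended

    def store(self, frame_feats: torch.Tensor) -> None:
        self.bank.append(frame_feats.detach())

bank = ToyMemoryBank()
for t in range(8):  # 8 video frames
    feats = torch.randn(1, 16, 64)
    conditioned = bank.condition(feats)
    bank.store(conditioned)
print(len(bank.bank), conditioned.shape)  # 6 torch.Size([1, 16, 64])
```

The bounded FIFO is the key design point: memory cost stays constant no matter how long the video runs.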
4.2 Using SAM 2 on Video
import torch
from sam2.build_sam import build_sam2_video_predictor

# The builder takes a model config plus a checkpoint; exact config names
# vary between sam2 releases
predictor = build_sam2_video_predictor(
    "configs/sam2/sam2_hiera_l.yaml",
    "sam2_hiera_large.pt",
    device="cuda",
)

# Load the video frames
video_path = "video.mp4"
with torch.inference_mode():
    state = predictor.init_state(video_path)

    # Prompt on the first frame
    _, _, masks = predictor.add_new_points_or_box(
        state,
        frame_idx=0,
        obj_id=1,
        points=[[500, 375]],
        labels=[1],
    )

    # Automatically propagate to the remaining frames
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        # masks: segmentation results for this frame
        print(f"Frame {frame_idx}: {len(object_ids)} objects")
5. SAM Applications
5.1 Grounding SAM (Text → Segment)
"""
Grounding SAM = Grounding DINO + SAM
1. Grounding DINO: ν
μ€νΈ β λ°μ΄λ© λ°μ€
2. SAM: λ°μ΄λ© λ°μ€ β μΈκ·Έλ©ν
μ΄μ
κ²°κ³Ό: ν
μ€νΈ ν둬ννΈλ‘ μΈκ·Έλ©ν
μ΄μ
"""
from groundingdino.util.inference import load_model, predict
from segment_anything import SamPredictor, sam_model_registry
# Detect boxes with Grounding DINO (load_model takes a model config and
# a checkpoint; the paths here are placeholders)
grounding_dino = load_model("GroundingDINO_SwinB_cfg.py", "groundingdino_swinb.pth")
boxes, logits, phrases = predict(
    grounding_dino,
    image,
    caption="a cat",
    box_threshold=0.3,
    text_threshold=0.25,
)
# Segment with SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)
masks = []
# NOTE: Grounding DINO returns normalized (cx, cy, w, h) boxes; convert
# them to absolute (x1, y1, x2, y2) pixels before passing them to SAM
for box in boxes:
    mask, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    masks.append(mask)
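Grounding DINO outputs boxes as normalized (cx, cy, w, h), while SAM expects absolute (x1, y1, x2, y2) pixels. A small conversion helper (the function name is ours), assuming NumPy arrays:

```python
import numpy as np

def cxcywh_norm_to_xyxy_abs(boxes: np.ndarray, img_w: int, img_h: int) -> np.ndarray:
    """Normalized (cx, cy, w, h) -> absolute pixel (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=np.float64)
    cx = boxes[:, 0] * img_w
    cy = boxes[:, 1] * img_h
    w = boxes[:, 2] * img_w
    h = boxes[:, 3] * img_h
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

# A box centered in a 640×480 image, a quarter as wide and half as tall
boxes = np.array([[0.5, 0.5, 0.25, 0.5]])
print(cxcywh_norm_to_xyxy_abs(boxes, 640, 480))  # [[240. 120. 400. 360.]]
```

The converted boxes can then be passed one at a time to `predictor.predict(box=...)` as in the loop above.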
"""
SAM κΈ°λ° μΈν°λν°λΈ λ μ΄λΈλ§ λꡬ
1. μ΄λ―Έμ§ λ‘λ
2. μ¬μ©μκ° ν¬μΈνΈ/λ°μ€ ν΄λ¦
3. SAMμ΄ μ€μκ° λ§μ€ν¬ μμ±
4. μ¬μ©μκ° μμ (positive/negative points)
5. μ΅μ’
λ§μ€ν¬ μ μ₯
"""
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry
class SAMAnnotator:
def __init__(self, sam_checkpoint):
self.sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
self.predictor = SamPredictor(self.sam)
self.points = []
self.labels = []
def set_image(self, image):
self.image = image.copy()
self.predictor.set_image(image)
self.points = []
self.labels = []
def add_point(self, x, y, is_foreground=True):
self.points.append([x, y])
self.labels.append(1 if is_foreground else 0)
return self.predict()
def predict(self):
if not self.points:
return None
masks, scores, _ = self.predictor.predict(
point_coords=np.array(self.points),
point_labels=np.array(self.labels),
multimask_output=False,
)
return masks[0]
def reset(self):
self.points = []
self.labels = []
# Usage example (together with an OpenCV mouse callback)
# annotator = SAMAnnotator("sam_vit_h.pth")
# annotator.set_image(image)
# mask = annotator.add_point(500, 375, is_foreground=True)
5.3 Medical Imaging
"""
Medical image segmentation.

SAM's strengths:
- Zero-shot segmentation of unseen organs and lesions
- Precise masks from just an expert's point clicks

MedSAM: SAM fine-tuned on medical images
"""
# MedSAM usage sketch (illustrative API; check the MedSAM repository
# for the actual interface)
from medsam import MedSAMPredictor

predictor = MedSAMPredictor("medsam_checkpoint.pth")

# Load a CT/MRI volume (load_medical_image is a placeholder helper)
medical_image = load_medical_image("ct_scan.nii")

# Segment slice by slice
for slice_idx in range(medical_image.shape[0]):
    slice_img = medical_image[slice_idx]
    predictor.set_image(slice_img)
    # An expert clicks the lesion location (tumor_x, tumor_y)
    mask, _, _ = predictor.predict(
        point_coords=np.array([[tumor_x, tumor_y]]),
        point_labels=np.array([1]),
    )
Summary

SAM core components

| Component      | Role                     | Notes                          |
|----------------|--------------------------|--------------------------------|
| Image Encoder  | image feature extraction | MAE ViT-H, 632M params         |
| Prompt Encoder | prompt encoding          | supports Point/Box/Mask        |
| Mask Decoder   | mask generation          | 2-layer Transformer, 4M params |

Prompt types
- Point: click location (foreground/background)
- Box: bounding box
- Mask: previous mask (refinement)
- Text: supported via Grounding SAM

Applications

| Use case               | Method                      |
|------------------------|-----------------------------|
| Interactive Annotation | fast labeling with clicks   |
| Automatic Segmentation | all objects via grid points |
| Video Tracking         | object tracking with SAM 2  |
| Medical Imaging        | specialized via MedSAM      |

Next Steps
References

Papers
- Kirillov et al. (2023). "Segment Anything"
- Ravi et al. (2024). "SAM 2: Segment Anything in Images and Videos"
- Liu et al. (2023). "Grounding DINO"
- Ma et al. (2023). "Segment Anything in Medical Images" (MedSAM)

Code