CLIP — OpenAI's model connecting vision and language
OpenAI’s model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
Skill metadata
| Field | Value |
|---|---|
| Source | Optional — install with `hermes skills install official/mlops/clip` |
| Path | optional-skills/mlops/clip |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | transformers, torch, pillow |
| Tags | Multimodal, CLIP, Vision-Language, Zero-Shot, Image Classification, OpenAI, Image Search, Cross-Modal Retrieval, Content Moderation |
Reference: full SKILL.md
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
CLIP - Contrastive Language-Image Pre-Training
OpenAI's model that learns visual concepts from natural language supervision.
When to use CLIP
Use for:
- Zero-shot image classification (no training data needed)
- Image-text similarity/matching
- Semantic image search
- Content moderation (detect NSFW, violence)
- Visual question answering
- Cross-modal retrieval (image→text, text→image)
Metrics:
- 25,300+ GitHub stars
- Trained on 400M image-text pairs
- Matches ResNet-50 on ImageNet (zero-shot)
- MIT License
Use alternatives instead:
- BLIP-2: Better captioning
- LLaVA: Vision-language chat
- Segment Anything: Image segmentation
Quick start
Installation
```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```

Zero-shot classification
```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
labels = ["a dog", "a cat", "a bird", "a car"]
text = clip.tokenize(labels).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity, scaled and softmaxed into per-label probabilities
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```
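The skill metadata lists transformers rather than the standalone clip package as a dependency. If you prefer to stay on Hugging Face transformers, the same zero-shot run looks roughly like this (a sketch; openai/clip-vit-base-patch32 is the standard Hub checkpoint corresponding to ViT-B/32):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the Hub checkpoint equivalent to ViT-B/32
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a dog", "a cat", "a bird", "a car"]
inputs = processor(
    text=labels,
    images=Image.open("photo.jpg"),
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: one row per image, one column per label
probs = outputs.logits_per_image.softmax(dim=-1)
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```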
Available models

```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")
```

| Model | Parameters | Speed | Quality |
|---|---|---|---|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |
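The table above is abridged; the installed clip package can list every downloadable checkpoint at runtime:

```python
import clip

# Prints names such as "RN50", "RN101", "ViT-B/32", "ViT-B/16", "ViT-L/14", ...
print(clip.available_models())
```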
Image-text similarity
```python
# Encode one image and one caption
text = clip.tokenize(["a photo of a dog"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize so the dot product is a cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```

Semantic image search
```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```

Content moderation
```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content",
]
text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()

print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
```
Batch processing

```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
```

Integration with vector databases
```python
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path],
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))

results = collection.query(
    query_embeddings=[text_embedding.squeeze(0).cpu().numpy().tolist()],
    n_results=5,
)
```
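The comment above also mentions FAISS; a minimal FAISS alternative (a sketch assuming faiss-cpu is installed and reusing the normalized image_embeddings tensor and image_paths list from the semantic image search example) could look like:

```python
import faiss

# Inner-product index over L2-normalized CLIP embeddings = cosine similarity search
dim = image_embeddings.shape[1]  # 512 for ViT-B/32
index = faiss.IndexFlatIP(dim)
index.add(image_embeddings.cpu().numpy().astype("float32"))

# Encode and normalize a text query, then search for the top 3 images
with torch.no_grad():
    q = model.encode_text(clip.tokenize(["a sunset"]).to(device))
    q /= q.norm(dim=-1, keepdim=True)

scores, indices = index.search(q.cpu().numpy().astype("float32"), 3)
for idx, score in zip(indices[0], scores[0]):
    print(f"{image_paths[idx]}: {score:.3f}")
```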
Best practices

- Use ViT-B/32 for most cases - Good balance of speed and quality
- Normalize embeddings - Required for cosine similarity
- Batch processing - More efficient
- Cache embeddings - Expensive to recompute
- Use descriptive labels - Better zero-shot performance (see the prompt-template sketch after this list)
- GPU recommended - 10-50× faster
- Preprocess images - Use provided preprocess function
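For the descriptive-labels point, the CLIP repository's prompt-engineering guidance suggests wrapping bare class names in templates such as "a photo of a {label}", which typically improves zero-shot accuracy. A small sketch:

```python
# Prompt templates usually beat bare class names for zero-shot classification
class_names = ["dog", "cat", "bird", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

text = clip.tokenize(prompts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
```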
Performance
| Operation | CPU | GPU (V100) |
|---|---|---|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |
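These figures are indicative; actual latency depends on the checkpoint, batch size, and hardware, so it is worth measuring on your own machine. A quick timing sketch (reusing model, image, and device from the quick start):

```python
import time

# Average single-image encoding latency over a few repetitions
n_runs = 20
with torch.no_grad():
    model.encode_image(image)  # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.encode_image(image)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / n_runs

print(f"encode_image: {elapsed * 1000:.1f} ms per image on {device}")
```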
Limitations
- Not for fine-grained tasks - Best for broad categories
- Requires descriptive text - Vague labels perform poorly
- Biased on web data - May have dataset biases
- No bounding boxes - Whole image only
- Limited spatial understanding - Position/counting weak
Resources
- GitHub: https://github.com/openai/CLIP ⭐ 25,300+
- Paper: https://arxiv.org/abs/2103.00020
- Colab: https://colab.research.google.com/github/openai/clip/
- License: MIT