---
name: multimodal-rag
description: CLIP, SigLIP 2, Voyage multimodal-3 patterns for image+text retrieval, cross-modal search, and multimodal document chunking. Use when building RAG with images, implementing visual search, or hybrid retrieval.
context: fork
agent: multimodal-specialist
version: 1.0.0
author: OrchestKit
user-invocable: false
tags: [rag, multimodal, image-retrieval, clip, embeddings, vector-search, 2026]
---
# Multimodal RAG (2026)
Build retrieval-augmented generation systems that handle images, text, and mixed content.
## Overview
- Image + text retrieval (product search, documentation)
- Cross-modal search (text query -> image results)
- Multimodal document processing (PDFs with charts)
- Visual question answering with context
- Image similarity and deduplication
- Hybrid search pipelines
## Architecture Approaches
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Joint Embedding (CLIP) | Direct comparison | Limited context | Pure image search |
| Caption-based | Works with text LLMs | Lossy conversion | Existing text RAG |
| Hybrid | Best accuracy | More complex | Production systems |
## Embedding Models (2026)
| Model | Context | Modalities | Best For |
|---|---|---|---|
| Voyage multimodal-3 | 32K tokens | Text + Image | Long documents |
| SigLIP 2 | Standard | Text + Image | Large-scale retrieval |
| CLIP ViT-L/14 | 77 tokens | Text + Image | General purpose |
| ImageBind | Standard | 6 modalities | Audio/video included |
| ColPali | Document | Text + Image | PDF/document RAG |
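SigLIP 2 loads through the same `transformers` pattern as CLIP. A minimal sketch, assuming the `google/siglip2-base-patch16-224` checkpoint ID (verify the exact variant on the Hugging Face Hub):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Checkpoint ID is an assumption -- check the Hub for available SigLIP 2 variants
model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

def embed_image_siglip(image_path: str) -> list[float]:
    """SigLIP 2 image embedding, L2-normalized for cosine similarity."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    return features[0].tolist()
```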
## CLIP-Based Image Embeddings

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    """Generate a CLIP embedding for an image."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)
    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

def embed_text(text: str) -> list[float]:
    """Generate a CLIP embedding for a text query."""
    # truncation=True guards against CLIP's 77-token text limit
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

# Cross-modal search: text -> images
def search_images(query: str, image_embeddings: list, top_k: int = 5):
    """Search images using a text query."""
    query_embedding = embed_text(query)
    # Cosine similarity (embeddings are already L2-normalized)
    similarities = [
        np.dot(query_embedding, img_emb)
        for img_emb in image_embeddings
    ]
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return top_indices, [similarities[i] for i in top_indices]
```
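The same normalized embeddings cover the image-similarity and deduplication use case from the overview. A minimal sketch reusing `embed_image`; the 0.95 threshold is an illustrative assumption, tune it on your own data:

```python
import numpy as np

def find_duplicates(image_paths: list[str], threshold: float = 0.95) -> list[tuple[str, str]]:
    """Flag image pairs whose cosine similarity exceeds the threshold."""
    embeddings = np.array([embed_image(p) for p in image_paths])
    # Embeddings are L2-normalized, so the dot product is cosine similarity
    sims = embeddings @ embeddings.T
    duplicates = []
    for i in range(len(image_paths)):
        for j in range(i + 1, len(image_paths)):
            if sims[i, j] >= threshold:
                duplicates.append((image_paths[i], image_paths[j]))
    return duplicates
```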
## Voyage Multimodal-3 (Long Context)

Voyage's Python client accepts each input as a list of interleaved text strings and PIL images (base64 content dicts also work). One embedding comes back per input:

```python
import voyageai
from PIL import Image

client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def embed_multimodal_voyage(
    texts: list[str] | None = None,
    images: list[str] | None = None,  # file paths
) -> list[list[float]]:
    """Embed texts and/or images with a 32K token context window."""
    inputs: list[list] = []
    if texts:
        inputs.extend([[t] for t in texts])  # one input per text
    if images:
        inputs.extend([[Image.open(p)] for p in images])  # one input per image
    response = client.multimodal_embed(
        inputs=inputs,
        model="voyage-multimodal-3",
    )
    return response.embeddings
```
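A quick usage sketch (the file path is hypothetical):

```python
# Embed a caption and a chart image together; the path is a placeholder
embeddings = embed_multimodal_voyage(
    texts=["Q3 revenue grew 18% quarter over quarter"],
    images=["reports/q3_revenue_chart.png"],
)
print(len(embeddings), len(embeddings[0]))  # number of inputs, embedding dimension
```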
## Hybrid RAG Pipeline

```python
from typing import Optional

class MultimodalRAG:
    """Production multimodal RAG with hybrid retrieval."""

    def __init__(self, vector_db, vision_model, text_model):
        self.vector_db = vector_db
        self.vision_model = vision_model
        self.text_model = text_model

    async def index_document(
        self,
        doc_id: str,
        text: Optional[str] = None,
        image_path: Optional[str] = None,
        metadata: Optional[dict] = None,
    ):
        """Index a document with text and/or image."""
        embeddings = []
        if text:
            text_emb = embed_text(text)
            embeddings.append(("text", text_emb))
        if image_path:
            # Option 1: Direct image embedding
            img_emb = embed_image(image_path)
            embeddings.append(("image", img_emb))
            # Option 2: Generate a caption for text search
            caption = await self.generate_caption(image_path)
            caption_emb = embed_text(caption)
            embeddings.append(("caption", caption_emb))
        # Store with a shared document ID
        for emb_type, emb in embeddings:
            await self.vector_db.upsert(
                id=f"{doc_id}_{emb_type}",
                embedding=emb,
                metadata={
                    "doc_id": doc_id,
                    "type": emb_type,
                    "image_url": image_path,
                    "text": text,
                    **(metadata or {}),
                },
            )

    async def generate_caption(self, image_path: str) -> str:
        """Generate a text caption for image indexing."""
        # Use a vision-capable model (e.g., GPT-4o or Claude) for high-quality captions
        response = await self.vision_model.analyze(
            image_path,
            prompt="Describe this image in detail for search indexing. "
                   "Include objects, text, colors, and context.",
        )
        return response

    async def retrieve(
        self,
        query: str,
        query_image: Optional[str] = None,
        top_k: int = 10,
    ) -> list[dict]:
        """Hybrid retrieval with an optional image query."""
        results = []
        # Text query embedding
        text_emb = embed_text(query)
        text_results = await self.vector_db.search(
            embedding=text_emb,
            top_k=top_k,
        )
        results.extend(text_results)
        # Image query embedding (if provided)
        if query_image:
            img_emb = embed_image(query_image)
            img_results = await self.vector_db.search(
                embedding=img_emb,
                top_k=top_k,
            )
            results.extend(img_results)
        # Dedupe by doc_id, keeping the highest score per document
        seen = {}
        for r in results:
            doc_id = r["metadata"]["doc_id"]
            if doc_id not in seen or r["score"] > seen[doc_id]["score"]:
                seen[doc_id] = r
        return sorted(seen.values(), key=lambda x: x["score"], reverse=True)[:top_k]
```
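A usage sketch; `vector_db` and `vision_model` are hypothetical stand-ins for whatever clients your stack provides:

```python
import asyncio

async def main():
    # vector_db and vision_model are placeholders wired to your own services
    rag = MultimodalRAG(vector_db, vision_model, text_model=None)
    await rag.index_document(
        doc_id="catalog-042",
        text="Red trail-running shoe with carbon plate",
        image_path="images/shoe_042.jpg",
    )
    results = await rag.retrieve("lightweight red running shoes", top_k=5)
    for r in results:
        print(r["metadata"]["doc_id"], round(r["score"], 3))

asyncio.run(main())
```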
## Multimodal Document Chunking

```python
from dataclasses import dataclass
from typing import Literal, Optional

import fitz  # PyMuPDF

@dataclass
class Chunk:
    content: str
    chunk_type: Literal["text", "image", "table", "chart"]
    page: int
    image_path: Optional[str] = None
    embedding: Optional[list[float]] = None

def chunk_multimodal_document(pdf_path: str) -> list[Chunk]:
    """Chunk a PDF, preserving images and text in reading order."""
    doc = fitz.open(pdf_path)
    chunks = []
    for page_num, page in enumerate(doc):
        # "dict" mode yields text and image blocks in reading order,
        # with image bytes and extension included on image blocks
        blocks = page.get_text("dict")["blocks"]
        current_text = ""
        for i, block in enumerate(blocks):
            if block["type"] == 0:  # text block
                for line in block["lines"]:
                    current_text += "".join(s["text"] for s in line["spans"]) + "\n"
            else:  # image block (type == 1)
                # Flush the accumulated text as its own chunk
                if current_text.strip():
                    chunks.append(Chunk(
                        content=current_text.strip(),
                        chunk_type="text",
                        page=page_num,
                    ))
                    current_text = ""
                # Save the embedded image to disk
                img_path = f"/tmp/page{page_num}_img{i}.{block['ext']}"
                with open(img_path, "wb") as f:
                    f.write(block["image"])
                # Caption the image so it stays text-searchable
                # (VLM helper, as in generate_caption above)
                caption = generate_image_caption(img_path)
                chunks.append(Chunk(
                    content=caption,
                    chunk_type="image",
                    page=page_num,
                    image_path=img_path,
                ))
        # Final text chunk for the page
        if current_text.strip():
            chunks.append(Chunk(
                content=current_text.strip(),
                chunk_type="text",
                page=page_num,
            ))
    return chunks
```
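Once chunked, each piece can be embedded by type. A sketch reusing the CLIP helpers from above: image chunks get an image embedding, while captions and text go through the text encoder:

```python
def embed_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """Attach an embedding to each chunk, routed by chunk type."""
    for chunk in chunks:
        if chunk.chunk_type == "image" and chunk.image_path:
            chunk.embedding = embed_image(chunk.image_path)
        else:
            # Text, tables, charts, and image captions share the text encoder
            chunk.embedding = embed_text(chunk.content)
    return chunks
```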
## Vector Database Setup (Milvus)

```python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

def setup_multimodal_collection():
    """Create a Milvus collection for multimodal embeddings."""
    connections.connect("default", host="localhost", port="19530")
    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=256),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=32),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="image_url", dtype=DataType.VARCHAR, max_length=1024),
        FieldSchema(name="page", dtype=DataType.INT64),
    ]
    schema = CollectionSchema(fields, "Multimodal document collection")
    collection = Collection("multimodal_docs", schema)
    # Create index for vector search
    index_params = {
        "metric_type": "COSINE",
        "index_type": "HNSW",
        "params": {"M": 16, "efConstruction": 256},
    }
    collection.create_index("embedding", index_params)
    return collection
```
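A usage sketch against that collection, assuming the 768-dim CLIP helpers from above; the inserted values follow the schema's column order, and the document IDs are hypothetical:

```python
collection = setup_multimodal_collection()

# Column-based insert: one list per field, in schema order
collection.insert([
    ["doc1_text"],                     # id
    [embed_text("Q3 revenue chart")],  # embedding
    ["doc1"],                          # doc_id
    ["text"],                          # chunk_type
    ["Q3 revenue chart"],              # content
    [""],                              # image_url
    [3],                               # page
])
collection.load()  # load into memory before searching

hits = collection.search(
    data=[embed_text("quarterly revenue")],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=5,
    output_fields=["doc_id", "chunk_type", "content"],
)
for hit in hits[0]:
    print(hit.distance, hit.entity.get("doc_id"))
```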
## Multimodal Generation

```python
import base64
import mimetypes

import anthropic

client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Return (base64-encoded data, media type) for an image file."""
    media_type = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode(), media_type

async def generate_with_context(
    query: str,
    retrieved_chunks: list[Chunk],
    model: str = "claude-opus-4-5-20251124",
) -> str:
    """Generate a response using multimodal context."""
    content = []
    # Add retrieved images first (attention positioning)
    for chunk in retrieved_chunks:
        if chunk.chunk_type == "image" and chunk.image_path:
            base64_data, media_type = encode_image_base64(chunk.image_path)
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64_data,
                },
            })
    # Add text context
    text_context = "\n\n".join(
        f"[Page {c.page}]: {c.content}"
        for c in retrieved_chunks if c.chunk_type == "text"
    )
    content.append({
        "type": "text",
        "text": f"""Use the following context to answer the question.

Context:
{text_context}

Question: {query}

Provide a detailed answer based on the context and images provided.""",
    })
    response = await client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": content}],
    )
    return response.content[0].text
```
## Key Decisions
| Decision | Recommendation |
|---|---|
| Long documents | Voyage multimodal-3 (32K context) |
| Scale retrieval | SigLIP 2 (optimized for large-scale) |
| PDF processing | ColPali (document-native) |
| Multi-modal search | Hybrid: CLIP + text embeddings |
| Production DB | Milvus or Pinecone with hybrid |
## Common Mistakes
- Embedding images without captions (limits text search)
- Not deduplicating by document ID
- Missing image URL storage (can't display results)
- Using only image OR text embeddings (use both)
- Ignoring chunk boundaries (split mid-paragraph)
- Not validating image retrieval quality (see the sketch below)
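For that last point, even a small labeled set catches regressions. A minimal recall@k sketch; the `search` entry point and the query/doc pairs are hypothetical stand-ins for your own pipeline:

```python
def recall_at_k(eval_set: list[tuple[str, str]], k: int = 5) -> float:
    """Fraction of queries whose expected doc_id appears in the top-k results."""
    hits = 0
    for query, expected_doc_id in eval_set:
        results = search(query, top_k=k)  # your retrieval entry point
        if expected_doc_id in [r["metadata"]["doc_id"] for r in results]:
            hits += 1
    return hits / len(eval_set)

# Hypothetical labeled pairs: (query, doc_id that should surface)
eval_set = [
    ("red running shoes", "catalog-042"),
    ("Q3 revenue chart", "report-q3"),
]
print(f"recall@5: {recall_at_k(eval_set):.2f}")
```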
## Related Skills

- `vision-language-models` - Image analysis
- `embeddings` - Text embedding patterns
- `rag-retrieval` - Text RAG patterns
- `contextual-retrieval` - Hybrid BM25+vector
## Capability Details

### image-embeddings

Keywords: CLIP, image embedding, visual features, SigLIP

Solves:
- Convert images to vector representations
- Enable image similarity search
- Cross-modal retrieval
### cross-modal-search

Keywords: text to image, image to text, cross-modal

Solves:
- Find images from text queries
- Find text from image queries
- Bridge modalities
### multimodal-chunking

Keywords: chunk PDF, split document, extract images

Solves:
- Process documents with mixed content
- Preserve image-text relationships
- Handle tables and charts
### hybrid-retrieval

Keywords: hybrid search, fusion, multi-embedding

Solves:
- Combine text and image search
- Improve retrieval accuracy
- Handle diverse queries