Extract structured metadata from documents using AI. Classify content types, extract topics and tools. Supports async batch processing.
Usage
After installing, this skill will be available to your AI coding assistant.
Verify installation:
skills list
Skill Instructions
name: document-indexing
description: Extract structured metadata from documents using AI. Classify content types, extract topics and tools. Supports async batch processing.
Document Indexing
Overview
Extract structured metadata from fetched documents using an LLM:
- Content type: blog, tutorial, guide, reference, etc.
- Topics & Tools: Main subjects and technologies
- Structure: Code examples, procedures, narrative
Creates DocumentMetadata records for search and clustering.
Quick Start
# Index single document
kurt index 5494cc13
# Batch index (async, 5-10x faster)
kurt index --url-prefix https://example.com/
# Re-index with custom concurrency
kurt index --url-prefix https://example.com/ --force --max-concurrent 10
Prerequisite: documents must be FETCHED first (kurt content fetch)
Commands
# Single
kurt index <doc-id>
kurt index <doc-id> --force
# Batch (async parallel)
kurt index --url-prefix <url>
kurt index --url-contains <string>
kurt index --max-concurrent 10 # Default: 5
# Filters
kurt index --status FETCHED --url-prefix <url>
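When indexing is driven from a script rather than typed by hand, the flags listed above can be assembled programmatically. The following is a minimal sketch, not part of Kurt itself: `build_index_command` is a hypothetical helper name, and it uses only the flags shown in this section.

```python
from typing import List, Optional


def build_index_command(
    doc_id: Optional[str] = None,
    url_prefix: Optional[str] = None,
    url_contains: Optional[str] = None,
    status: Optional[str] = None,
    force: bool = False,
    max_concurrent: Optional[int] = None,
) -> List[str]:
    """Assemble a `kurt index` argument list (hypothetical helper).

    Only flags documented in the Commands section are emitted.
    """
    args = ["kurt", "index"]
    if doc_id is not None:
        args.append(doc_id)  # single-document mode
    if status is not None:
        args += ["--status", status]
    if url_prefix is not None:
        args += ["--url-prefix", url_prefix]
    if url_contains is not None:
        args += ["--url-contains", url_contains]
    if force:
        args.append("--force")  # re-index documents that already have metadata
    if max_concurrent is not None:
        args += ["--max-concurrent", str(max_concurrent)]
    return args
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building an argument list instead of a shell string avoids quoting problems with URLs.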
Content Types
BLOG | TUTORIAL | GUIDE | REFERENCE | WHITEPAPER | CASE_STUDY | FAQ | CHANGELOG | MARKETING | OTHER
Extracted Metadata
{
"content_type": "TUTORIAL",
"extracted_title": "Machine Learning Guide",
"primary_topics": ["Machine Learning", "Python"],
"tools_technologies": ["TensorFlow", "Pandas"],
"has_code_examples": true,
"has_step_by_step_procedures": true,
"has_narrative_structure": false
}
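Downstream code that consumes this JSON may want typed access to the fields. The sketch below shows one way to do that. The field names and content-type values are taken from this document, but `ContentType`, `MetadataView`, and `parse_metadata` are hypothetical names and do not represent Kurt's actual `DocumentMetadata` model. Falling back to `OTHER` for unrecognized values is also an assumption.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import List


class ContentType(str, Enum):
    BLOG = "BLOG"
    TUTORIAL = "TUTORIAL"
    GUIDE = "GUIDE"
    REFERENCE = "REFERENCE"
    WHITEPAPER = "WHITEPAPER"
    CASE_STUDY = "CASE_STUDY"
    FAQ = "FAQ"
    CHANGELOG = "CHANGELOG"
    MARKETING = "MARKETING"
    OTHER = "OTHER"


@dataclass(frozen=True)
class MetadataView:
    """Typed view over the extracted metadata (illustrative only)."""
    content_type: ContentType
    extracted_title: str
    primary_topics: List[str]
    tools_technologies: List[str]
    has_code_examples: bool
    has_step_by_step_procedures: bool
    has_narrative_structure: bool


def parse_metadata(raw: str) -> MetadataView:
    data = json.loads(raw)
    try:
        content_type = ContentType(data["content_type"])
    except ValueError:
        # Assumption: unknown labels are bucketed as OTHER.
        content_type = ContentType.OTHER
    return MetadataView(
        content_type=content_type,
        extracted_title=data["extracted_title"],
        primary_topics=list(data["primary_topics"]),
        tools_technologies=list(data["tools_technologies"]),
        has_code_examples=bool(data["has_code_examples"]),
        has_step_by_step_procedures=bool(data["has_step_by_step_procedures"]),
        has_narrative_structure=bool(data["has_narrative_structure"]),
    )


EXAMPLE = """{
  "content_type": "TUTORIAL",
  "extracted_title": "Machine Learning Guide",
  "primary_topics": ["Machine Learning", "Python"],
  "tools_technologies": ["TensorFlow", "Pandas"],
  "has_code_examples": true,
  "has_step_by_step_procedures": true,
  "has_narrative_structure": false
}"""

metadata = parse_metadata(EXAMPLE)
```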
Performance
- Sequential: ~3-5s per document
- Parallel (5 concurrent): ~1s per document avg
- Example: 92 docs in 30s (parallel) vs 5 mins (sequential)
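The speedup comes from overlapping the time spent waiting on each LLM call. A common way to cap the number of calls in flight is an `asyncio.Semaphore`, sketched below. This is an illustration of the general pattern, not Kurt's implementation: `index_one` is a hypothetical stand-in that simply sleeps, and `index_all` is a hypothetical name.

```python
import asyncio
from typing import List, Tuple


async def index_one(doc_id: str, delay: float) -> str:
    """Stand-in for one LLM extraction call (hypothetical; just sleeps)."""
    await asyncio.sleep(delay)
    return doc_id


async def index_all(
    doc_ids: List[str], max_concurrent: int = 5, delay: float = 0.01
) -> Tuple[List[str], int]:
    """Run index_one over doc_ids with at most max_concurrent in flight.

    Returns the results in input order and the peak concurrency observed.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def bounded(doc_id: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # blocks while max_concurrent calls are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await index_one(doc_id, delay)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(bounded(d) for d in doc_ids))
    return list(results), peak
```

Raising the limit shortens wall-clock time only until the LLM provider's rate limit is reached, which is why the limit is tunable.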
Python API
from kurt.indexing import extract_document_metadata, batch_extract_document_metadata
import asyncio
# Single
result = extract_document_metadata("abc-123")
# Batch
results = asyncio.run(batch_extract_document_metadata(
    ["abc-123", "def-456"],
    max_concurrent=5
))
Troubleshooting
| Issue | Solution |
|---|---|
| "Document not FETCHED" | Run kurt content fetch <id> first |
| "Content file not found" | Re-fetch document |
| Slow batch | Increase --max-concurrent |
| Rate limits | Reduce --max-concurrent |
Next Steps
- ingest-content-skill - Fetch documents first
- document-management-skill - Query and manage documents