Surgically reloads a single PDF to Qdrant by deleting old chunks and re-uploading with fixes. Use when user wants to reload, refresh, fix, or update a specific PDF without reloading the entire collection.
After installing, this skill will be available to your AI coding assistant.
Verify installation:
npx agent-skills-cli list
Skill Instructions
name: qdrant-pdf-reloader
description: Surgically reloads a single PDF to Qdrant by deleting old chunks and re-uploading with fixes. Use when user wants to reload, refresh, fix, or update a specific PDF without reloading the entire collection.
Qdrant PDF Reloader
This skill helps users surgically reload a single PDF document to Qdrant without reloading the entire collection, using the LOAD_DB/reload_single_pdf.py script.
When to Use This Skill
Activate this skill automatically when the user:
- Wants to reload/refresh a specific PDF in Qdrant
- Needs to fix text cleaning issues in a particular document
- Wants to update a document with improved processing
- Needs to regenerate contextual embeddings for one PDF
- Asks to "reload [filename]", "refresh [filename]", or "fix [filename]"
- Uses keywords like "reload PDF", "refresh document", "update single file"
What is Surgical Reload?
A surgical reload is a targeted operation that:
- Deletes all existing chunks for the specified PDF from Qdrant
- Re-processes the PDF with current text cleaner and chunking logic
- Generates fresh contextual embeddings (document context + chunk context)
- Uploads the new chunks to Qdrant
Benefits:
- Much faster than reloading the entire collection (seconds vs. hours)
- Only affects the target PDF - other documents remain unchanged
- Applies latest fixes (text cleaner, TOC filtering, contextual embeddings)
- No risk of losing other documents' data
How to Use
Step 1: Identify the PDF to Reload
Ask the user which PDF file needs to be reloaded. The filename must match exactly (e.g., document-name.pdf).
Step 2: Verify PDF Exists
Check that the PDF file exists in the scraped_content/raw/pdfs/ directory:
ls scraped_content/raw/pdfs/document-name.pdf
If not found, inform the user and ask them to verify the filename.
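A cross-platform alternative to `ls` is a quick `pathlib` check (a sketch; the directory follows the layout above, and `document-name.pdf` is a placeholder):

```python
from pathlib import Path

pdf_name = "document-name.pdf"  # placeholder filename
pdf_path = Path("scraped_content/raw/pdfs") / pdf_name
if not pdf_path.is_file():
    # Exact, case-sensitive match is required, so report the full path back.
    print(f"Not found: {pdf_path} - ask the user to verify the filename (case-sensitive).")
```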
Step 3: Run the Reload Script
Execute the surgical reload:
cd LOAD_DB
python reload_single_pdf.py document-name.pdf
The script takes the PDF filename as the only argument.
Step 4: Monitor Progress
The script will output progress through these stages:
- Deletion: Removing old chunks from Qdrant
- Processing: Loading PDF, cleaning text, chunking
- Context Generation: Creating document and chunk contexts
- Upload: Uploading new chunks with embeddings
Step 5: Report Results
After completion, summarize:
- Number of old chunks deleted
- Number of new chunks uploaded
- Any errors or warnings encountered
- Confirm successful reload
Script Features
Automatic Text Cleaning
The script applies the fixed text cleaner which:
- Normalizes whitespace
- Removes page headers/footers
- Cleans special characters
- Filters out Table of Contents (TOC) chunks
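The exact rules live in the text_cleaner module; a minimal sketch of one plausible TOC heuristic (lines of dot leaders ending in a page number) might look like:

```python
import re

def looks_like_toc(text: str) -> bool:
    """Heuristic sketch: treat a chunk as Table of Contents if most
    non-empty lines end with dot leaders followed by a page number.
    The real text_cleaner may apply different or additional rules."""
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    toc_lines = sum(1 for l in lines if re.search(r"\.{3,}\s*\d+\s*$", l))
    return toc_lines / len(lines) > 0.5
```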
Contextual Embeddings
The script generates contextual metadata:
- Master Context: High-level overview of the entire Texas childcare system
- Document Context: Summary of the specific document (purpose, key topics)
- Chunk Context: Summary of the previous chunk for continuity
These contexts are included in embeddings to improve retrieval accuracy.
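One way the three tiers could be combined into the text that gets embedded (a sketch; the actual contextual_processor may format the prompt differently):

```python
def build_embedding_text(master_ctx: str, doc_ctx: str, chunk_ctx: str, chunk_text: str) -> str:
    # Prepend the three context tiers to the chunk before embedding;
    # empty tiers (e.g. no previous chunk for the first chunk) are skipped.
    parts = [master_ctx, doc_ctx, chunk_ctx, chunk_text]
    return "\n\n".join(p for p in parts if p)
```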
Special Handling
Table-heavy PDFs (configured in config.TABLE_PDFS):
- Extracted using Docling (table-aware extraction)
- Tables extracted as markdown and kept as separate chunks
- Narrative text grouped into ~1000 char chunks
- Item-level chunking (no overlap, semantic boundaries)
- Examples: income eligibility tables, evaluation reports
Standard PDFs (PyMuPDF extraction):
- Single-page: Loaded as one chunk to preserve structure
- Multi-page: Split using RecursiveCharacterTextSplitter
  - Chunk size: 1000 characters
  - Overlap: 200 characters
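The 1000/200 sliding-window behavior can be sketched as follows (the real script uses LangChain's RecursiveCharacterTextSplitter, which additionally prefers paragraph and sentence boundaries over fixed offsets):

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Naive fixed-window version of the 1000-char / 200-overlap split.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```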
Collection Target
By default, reloads to the contextual collection (tro-child-1 or similar).
Example Workflows
Example 1: Fix Text Cleaning Issue
User: "The income eligibility PDF has garbled text. Can you reload it with the fixed text cleaner?"
Claude: I'll surgically reload that PDF with the updated text cleaner.
[Runs command]
cd LOAD_DB
python reload_single_pdf.py bcy-26-income-eligibility-and-maximum-psoc-twc.pdf
[Script output shows progress]
✓ Deleted 12 old chunks
✓ Loaded 8 pages
✓ Created 10 chunks after filtering (removed 2 TOC chunks)
✓ Generated contextual metadata
✓ Uploaded 10 new chunks
Reload complete! The document has been refreshed with:
- Fixed text cleaning
- TOC filtering
- Fresh contextual embeddings
Old chunks: 12 deleted
New chunks: 10 uploaded
Example 2: Update After PDF Replacement
User: "I replaced the payment rates PDF with a corrected version. Please reload it."
Claude: I'll reload the payment rates PDF with the new version.
[Runs command]
cd LOAD_DB
python reload_single_pdf.py bcy25-child-care-provider-payment-rates-twc.pdf
✓ Surgical reload complete!
Deleted: 5 old chunks (single-page table)
Uploaded: 5 new chunks
The updated PDF is now in Qdrant with fresh contextual embeddings.
Example 3: Regenerate Contextual Embeddings
User: "Can you regenerate the contextual embeddings for the state plan document?"
Claude: I'll reload the state plan to regenerate contextual metadata.
cd LOAD_DB
python reload_single_pdf.py tx-ccdf-state-plan-ffy2025-2027-approved.pdf
This will:
1. Delete existing chunks
2. Re-process the PDF
3. Generate new document and chunk contexts
4. Upload with fresh embeddings
[After completion]
✓ Reloaded with new contextual embeddings:
- Document context: [summary of document]
- Chunk contexts: Generated for all 87 chunks
Example 4: Fix Single-Page Table
User: "The PSOC chart isn't being retrieved correctly. It's a single-page table PDF."
Claude: Single-page PDFs are loaded as one chunk to preserve table structure. Let me reload it.
cd LOAD_DB
python reload_single_pdf.py bcy2025-psoc-chart-twc.pdf
✓ Single-page PDF: loading as 1 chunk (3,245 characters)
✓ Uploaded 1 chunk with full table content
The table is now loaded as a single coherent chunk, which should improve retrieval.
When to Use vs. Full Reload
Use Surgical Reload When:
- ✅ Fixing a specific document
- ✅ One or a few PDFs need updating
- ✅ Quick fix needed (seconds/minutes)
- ✅ Other documents are correct and shouldn't be touched
Use Full Reload When:
- ❌ Major changes to chunking logic or embeddings model
- ❌ Many/most documents need updating
- ❌ Collection structure changed
- ❌ Starting fresh with new documents
Rule of thumb: if updating fewer than ~10 PDFs, use surgical reload; beyond that, or when making system-wide changes, use full reload.
Technical Details
File Location
- Script: LOAD_DB/reload_single_pdf.py
- PDF Directory: scraped_content/raw/pdfs/
- Collection: QDRANT_COLLECTION_NAME_CONTEXTUAL from config
Architecture
The reload script uses a modular architecture:
- Extractors (LOAD_DB/extractors/): Factory pattern for PyMuPDF vs. Docling selection
- Shared Utilities (LOAD_DB/shared/): Common processing and upload logic
- PyMuPDFExtractor: Fast text extraction for standard PDFs
- DoclingExtractor: Table-aware extraction with item-level chunking for table-heavy PDFs
Dependencies
The script requires:
- LangChain (PyMuPDFLoader, RecursiveCharacterTextSplitter)
- OpenAI embeddings (text-embedding-3-small)
- Qdrant client
- GROQ API (for contextual metadata generation)
- Docling (for table-aware PDFs in config.TABLE_PDFS)
- Local modules:
  - extractors: Factory pattern and extractor classes
  - shared: Processing utilities and upload logic
  - contextual_processor: Context generation
  - text_cleaner: Text cleaning and TOC detection
  - prompts: Master context template
Process Flow
1. DELETE PHASE
└─ Scroll through collection
└─ Find chunks where filename == pdf_filename
└─ Delete all matching chunks
2. EXTRACTION PHASE (via Factory Pattern)
├─ Check if PDF in config.TABLE_PDFS
├─ If YES: Use DoclingExtractor
│ ├─ Convert PDF with Docling
│ ├─ Extract tables as markdown
│ ├─ Group items by page and sort by y-position
│ └─ Create item-level chunks (tables + narrative)
└─ If NO: Use PyMuPDFExtractor
└─ Load PDF with PyMuPDFLoader (standard extraction)
3. PROCESS PHASE (via shared utilities)
└─ Clean text on each page (clean_documents)
└─ Enrich metadata (enrich_metadata)
└─ Split into chunks if multi-page (text_splitter)
└─ Filter out TOC chunks (filter_toc_chunks)
└─ Add chunk metadata (add_chunk_metadata)
4. CONTEXT PHASE (Contextual Mode)
└─ Generate document context from first 2000 chars
└─ Generate chunk context for each chunk (uses previous chunk)
└─ Add master context, document context, chunk context to metadata
5. UPLOAD PHASE (via shared uploader)
└─ Generate OpenAI embeddings from enriched text
└─ Create Qdrant points with embeddings + metadata
└─ Store only original content in page_content
└─ Upload in batches (100 per batch)
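The final upload step above ("100 per batch") can be sketched as a simple generator; the batch size matches the flow above, though the real uploader in LOAD_DB/shared may differ in detail:

```python
def batched(points: list, batch_size: int = 100):
    # Yield successive batches of at most batch_size points, matching
    # the upload phase's 100-points-per-request behavior.
    for i in range(0, len(points), batch_size):
        yield points[i:i + batch_size]
```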
Metadata Added to Chunks
Each uploaded chunk includes:
{
    'text': chunk.page_content,
    'filename': pdf_filename,
    'content_type': 'pdf',
    'page': page_number,
    'total_pages': total_page_count,
    'chunk_index': chunk_number,
    'total_chunks': total_chunk_count,
    'chunk_type': 'table' or 'narrative',  # Docling only
    'extractor': 'docling' or 'pymupdf',
    'has_context': True,
    'master_context': master_context_text,
    'document_context': document_summary,
    'chunk_context': previous_chunk_summary,
    'source_url': url,  # if available from metadata.json
}
Error Handling
PDF Not Found
If the PDF doesn't exist in scraped_content/raw/pdfs/:
- Verify the filename (exact match, case-sensitive)
- Check if the PDF was scraped/downloaded
- Suggest running the scraper if needed
Deletion Errors
If deletion fails:
- Check Qdrant connection
- Verify API credentials
- Check if chunks actually exist (use retrieve_chunks_by_filename.py)
Processing Errors
If PDF processing fails:
- Verify PDF is not corrupted
- Check if PDF is readable (try opening manually)
- Review error logs for specifics
Upload Errors
If upload fails:
- Check OpenAI API key (for embeddings)
- Verify Qdrant connection
- Check if collection exists
Advanced Usage
Verify Before Reloading
Use the qdrant-chunk-retriever skill to check existing chunks before reloading.
Verify After Reloading
After reloading, use the qdrant-chunk-retriever skill to confirm:
- Chunk count matches expectations
- Text is cleaned properly
- Context fields are populated
- Extractor type is correct (docling vs pymupdf)
- Chunk types are set (for Docling PDFs)
Compare Before/After
- Use the qdrant-file-exporter skill to export chunks before reload
- Reload the PDF:
cd LOAD_DB
python reload_single_pdf.py doc.pdf
- Use the qdrant-file-exporter skill again to export chunks after reload
- Compare the exported files to see what changed
Notes
- Contextual mode is default: All reloads use contextual embeddings
- Chunk IDs are deterministic: Based on hash of (filename, chunk_index, page)
- Idempotent operation: Running multiple times produces same result
- No undo: Once chunks are deleted, they're gone (reload completes the operation)
- Collection must exist: Script doesn't create collections, only updates existing ones
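The deterministic-ID note can be illustrated with a hypothetical scheme (the actual hash construction in reload_single_pdf.py may differ): derive a stable UUID from (filename, chunk_index, page), so a reload overwrites points rather than duplicating them.

```python
import hashlib
import uuid

def chunk_point_id(filename: str, chunk_index: int, page: int) -> str:
    # Hypothetical: a stable UUID derived from an MD5 of the identifying
    # triple. Same inputs always yield the same point ID.
    digest = hashlib.md5(f"{filename}|{chunk_index}|{page}".encode()).hexdigest()
    return str(uuid.UUID(digest))
```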
Related Tools
- Skills:
- qdrant-chunk-retriever: Verify chunks before/after reload
- qdrant-doc-deleter: Delete without reloading (if you just want to remove)
- qdrant-file-exporter: Export all chunks from a PDF to a text file
- Scripts:
- LOAD_DB/load_pdf_qdrant.py: Full collection reload (all PDFs)
- LOAD_DB/verify_qdrant.py: Verify collection statistics
- Shared Modules:
- LOAD_DB/extractors/: PDF extraction with factory pattern
- LOAD_DB/shared/: Common processing and upload utilities