markitdown

@fgarofalo56/markitdown

2 forks

Updated 4/29/2026

Expert guidance for converting files to Markdown using Microsoft's MarkItDown utility. Convert PDF, Word, PowerPoint, Excel, images, audio, HTML, CSV, JSON, XML, ZIP, and EPub files to LLM-friendly Markdown format. Use when processing documents for AI analysis, extracting content from files, or preparing data for language models.

Installation

$npx agent-skills-cli install @fgarofalo56/markitdown

Claude Code

Cursor

Copilot

Codex

Antigravity

Details

Repositoryfgarofalo56/Suppercharge_Microsoft_Fabric

Path.github/skills/markitdown/SKILL.md

Branchmain

Scoped Name@fgarofalo56/markitdown

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions

name: markitdown description: Expert guidance for converting files to Markdown using Microsoft's MarkItDown utility. Convert PDF, Word, PowerPoint, Excel, images, audio, HTML, CSV, JSON, XML, ZIP, and EPub files to LLM-friendly Markdown format. Use when processing documents for AI analysis, extracting content from files, or preparing data for language models.

MarkItDown File Conversion Expert

Expert guidance for converting diverse file formats to Markdown using Microsoft's MarkItDown utility. Transform documents, media, web content, and structured data into token-efficient Markdown optimized for LLM consumption.

Core Capabilities

Document Conversion - PDF, Word, PowerPoint, Excel
Media Processing - Images (OCR), Audio (transcription)
Web Content - HTML, YouTube URLs
Structured Data - CSV, JSON, XML
Archive Handling - ZIP files, EPub e-books
LLM Optimization - Token-efficient output for AI models

Quick Reference

Supported Formats

Category	Formats
Documents	PDF, DOCX, PPTX, XLSX/XLS
Media	Images (PNG, JPG, GIF), Audio (WAV, MP3)
Web	HTML, YouTube URLs
Data	CSV, JSON, XML
Archives	ZIP, EPub

Key Features

Structure Preservation - Maintains headings, lists, tables, links
Token Efficiency - Optimized for LLM consumption
OCR Support - Extract text from images
Audio Transcription - Convert speech to text
EXIF Metadata - Extract image metadata
Azure Integration - Advanced PDF processing
LLM Descriptions - Image descriptions via OpenAI

Instructions

Installation

Python 3.10+ Required

Full Installation (All Features):

pip install 'markitdown[all]'

Selective Installation:

# PDF support only
pip install 'markitdown[pdf]'

# Office documents
pip install 'markitdown[docx,pptx,xlsx]'

# Media files
pip install 'markitdown[audio,image]'

# Multiple features
pip install 'markitdown[pdf,docx,pptx,xlsx]'

Command-Line Usage

**Basic Con```bash

Output to stdout

markitdown document.pdf

Save to file

markitdown document.pdf > output.md

Specify output file

markitdown document.pdf -o output.md


**Batch Con```bash
# Convert multiple files
for file in *.pdf; do
  markitdown "$file" > "${file%.pdf}.md"
done

Python API Usage

**Basic Con```python from markitdown import MarkItDown

Initialize converter

md = MarkItDown()

Convert file

result = md.convert("document.pdf") print(result.text_content)

Save to file

with open("output.md", "w") as f: f.write(result.text_content)


**Convert from Stream:**
```python
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
    print(result.text_content)

Batch Processing:

import os
from markitdown import MarkItDown

md = MarkItDown()
input_dir = "./documents"
output_dir = "./markdown"

for filename in os.listdir(input_dir):
    input_path = os.path.join(input_dir, filename)
    if os.path.isfile(input_path):
        result = md.convert(input_path)

        # Create output filename
        base_name = os.path.splitext(filename)[0]
        output_path = os.path.join(output_dir, f"{base_name}.md")

        with open(output_path, "w") as f:
            f.write(result.text_content)

        print(f"Converted: {filename}")

Advanced Features

Azure Document Intelligence

Enhanced PDF processing with Azure AI.

Setup:

pip install 'markitdown[azure-doc-intelligence]'

Configuration:

from markitdown import MarkItDown

md = MarkItDown(
    azure_doc_intelligence_endpoint="https://your-resource.cognitiveservices.azure.com",
    azure_doc_intelligence_key="your-key-here"
)

result = md.convert("complex-document.pdf")

Benefits:

Better table extraction
Improved layout understanding
Enhanced text recognition
Complex document handling

LLM-Powered Image Descriptions

Generate AI descriptions for images in documents.

Setup:

pip install 'markitdown[llm]'

Configuration:

from markitdown import MarkItDown

md = MarkItDown(
    llm_model="gpt-4o",  # or gpt-4o-mini
    llm_client=None       # Uses default Azure OpenAI or OpenAI
)

# Convert PowerPoint with image descriptions
result = md.convert("presentation.pptx")

Supported:

PowerPoint slides with images
Standalone image files
Images in other document types

Docker Usage

Build Image:

docker build -t markitdown:latest .

**Run Con```bash docker run --rm -i markitdown:latest < document.pdf > output.md


**Volume Mounting:**
```bash
docker run --rm -v $(pwd):/data markitdown:latest /data/document.pdf > output.md

Common Workflows

Document Processing Pipeline

Workflow:

1. Receive document (PDF, Word, Excel, etc.)
2. Convert to Markdown with MarkItDown
3. Process Markdown with LLM
4. Extract insights or generate summaries
5. Present results

Python Implementation:

from markitdown import MarkItDown

def process_document(file_path):
    # Convert to Markdown
    md = MarkItDown()
    result = md.convert(file_path)
    markdown_content = result.text_content

    # Process with LLM (your code here)
    # insights = analyze_with_llm(markdown_content)

    return markdown_content

# Process multiple documents
documents = ["report.pdf", "data.xlsx", "slides.pptx"]
for doc in documents:
    content = process_document(doc)
    print(f"Processed: {doc}")

Batch Document Conversion

Script:

import os
from pathlib import Path
from markitdown import MarkItDown

def batch_convert(input_folder, output_folder):
    md = MarkItDown()

    # Create output folder
    Path(output_folder).mkdir(parents=True, exist_ok=True)

    # Get all supported files
    supported_extensions = [
        '.pdf', '.docx', '.pptx', '.xlsx', '.xls',
        '.html', '.csv', '.json', '.xml', '.zip', '.epub',
        '.png', '.jpg', '.jpeg', '.gif',
        '.wav', '.mp3'
    ]

    files = [
        f for f in os.listdir(input_folder)
        if any(f.lower().endswith(ext) for ext in supported_extensions)
    ]

    for filename in files:
        input_path = os.path.join(input_folder, filename)
        output_filename = Path(filename).stem + ".md"
        output_path = os.path.join(output_folder, output_filename)

        try:
            result = md.convert(input_path)
            with open(output_path, "w", encoding="utf-8") as f:
                f.write(result.text_content)
            print(f"✓ Converted: {filename}")
        except Exception as e:
            print(f"✗ Failed: {filename} - {e}")

# Usage
batch_convert("./input_docs", "./output_markdown")

Audio Transcription Pipeline

Workflow:

from markitdown import MarkItDown

def transcribe_audio(audio_file):
    md = MarkItDown()
    result = md.convert(audio_file)

    # result.text_content contains transcription
    return result.text_content

# Process audio file
transcript = transcribe_audio("meeting.wav")

# Save transcript
with open("meeting_transcript.md", "w") as f:
    f.write(f"# Meeting Transcript\n\n{transcript}")

Image OCR Extraction

Workflow:

from markitdown import MarkItDown

def extract_text_from_image(image_file):
    md = MarkItDown()
    result = md.convert(image_file)

    # Includes EXIF metadata and OCR text
    return result.text_content

# Extract text from scanned document
text = extract_text_from_image("scanned_page.png")
print(text)

Best Practices

File Processing

Check File Size - Large files take longer to process
Validate Format - Ensure file extension matches content
Handle Errors - Implement try-catch for conversion failures
Test Output - Verify Markdown quality for critical documents
Use Streams - For large files, use convert_stream() for memory efficiency

LLM Integration

Token Limits - Check Markdown length before sending to LLM
Chunking - Split long documents for processing
Preserve Structure - Markdown headers help LLMs understand context
Clean Output - Remove unnecessary formatting if needed
Metadata - Include file name and type in LLM context

Performance Optimization

Batch Processing - Convert multiple files in parallel
Cache Results - Store converted Markdown to avoid re-processing
Selective Features - Only install needed dependencies
Stream Processing - Use streams for memory efficiency
Azure Services - Use Azure Document Intelligence for complex PDFs

Supported Format Details

PDF

Text extraction
Table preservation
Link extraction
Structure maintenance
Enhanced: Azure Document Intelligence for complex layouts

Word (DOCX)

Headings hierarchy
Lists (ordered/unordered)
Tables
Images (with optional descriptions)
Links

PowerPoint (PPTX)

Slide titles
Content structure
Tables
Images (with optional LLM descriptions)
Notes

Excel (XLSX/XLS)

Sheet names
Tables with headers
Data preservation
Multiple sheets

Images

EXIF metadata extraction
OCR text extraction
Optional LLM descriptions (with OpenAI)
Supported: PNG, JPG, JPEG, GIF

Audio

Transcription to text
Speaker detection
Timestamp support
Formats: WAV, MP3

HTML

Content extraction
Link preservation
Table conversion
Header hierarchy

CSV/JSON/XML

Structure-aware conversion
Data table formatting
Key-value preservation

YouTube

Video metadata
Transcript extraction
Description and comments

ZIP/EPub

Archive extraction
Recursive processing
Content aggregation

Troubleshooting

Installation Issues

Problem: Dependency conflicts

Solution:

# Use virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

pip install 'markitdown[all]'

Conversion Failures

Problem: "Unsupported file format"

Solutions:

Verify file extension matches content
Check file isn't corrupted
Ensure required dependencies installed
Try opening file in native application first

Problem: "Memory error" on large files

Solutions:

Use stream processing: convert_stream()
Process file in chunks
Increase available memory
Use Azure Document Intelligence for PDFs

Output Quality

Problem: Poor OCR results

Solutions:

Ensure image is high resolution
Check image is not rotated
Use Azure Document Intelligence
Pre-process image (contrast, brightness)

Problem: Missing tables or structure

Solutions:

Use Azure Document Intelligence for PDFs
Verify source document is well-formatted
Check PDF isn't scanned image (needs OCR)

When to Use This Skill

Converting documents for LLM analysis
Extracting text from PDFs and images
Processing Office documents programmatically
Transcribing audio to text
Building document processing pipelines
Preparing data for AI/ML workflows
Creating searchable archives from documents
Extracting structured data from files

Keywords

markitdown, file conversion, markdown, pdf to markdown, document processing, ocr, audio transcription, llm preprocessing, document extraction, text extraction, file parsing, microsoft, autogen, python utility, batch conversion

More by fgarofalo56

View all

azure-event-grid

Event routing with Azure Event Grid. Configure topics, subscriptions, event handlers, and dead-lettering. Use for event-driven architectures, serverless triggers, and reactive systems on Azure.

grafana-dashboards

Build monitoring dashboards with Grafana. Covers panel types, queries, variables, alerting, provisioning, and data sources like Prometheus and InfluxDB. Use for infrastructure monitoring, observability, and metrics visualization.

accessibility-wcag

Build accessible web applications meeting WCAG 2.1 AA standards. Covers ARIA attributes, keyboard navigation, screen reader optimization, focus management, color contrast, semantic HTML, and accessible components. Use for a11y audits, accessible UI development, and inclusive design.

instructor

Structured outputs with Instructor. Extract typed data from LLMs using Pydantic models and validation. Use for data extraction, structured generation, and type-safe LLM responses.