audiobook

@Motium-AI/audiobook

Updated 4/1/2026

Transform technical documents into long-form audiobooks. Uses 4-agent heavy analysis, TTS optimization, Michael Caine oration style, and stop-slop enforcement. Generates ElevenLabs-ready output with SSML pause tags and full text normalization. Use when asked to "create an audiobook", "turn this into audio", or "/audiobook".

Installation

npx agent-skills-cli install @Motium-AI/audiobook

Details

Repository: Motium-AI/claude-code-toolkit

Path: config/skills/audiobook/SKILL.md

Branch: main

Scoped Name: @Motium-AI/audiobook

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions

name: audiobook
description: Transform technical documents into long-form audiobooks. Uses 4-agent heavy analysis, TTS optimization, Michael Caine oration style, and stop-slop enforcement. Generates ElevenLabs-ready output with SSML pause tags and full text normalization. Use when asked to "create an audiobook", "turn this into audio", or "/audiobook".

Audiobook Creation Skill

Transform technical documents into compelling audiobooks using multi-agent analysis, narrative synthesis, and ElevenLabs-optimized output.

Input

$ARGUMENTS

Workflow Overview

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: CONTENT DISCOVERY                                     │
│     └─► Identify source documents                               │
│     └─► Read and analyze each document                          │
│     └─► Identify overlapping themes and unique insights         │
├─────────────────────────────────────────────────────────────────┤
│  PHASE 2: HEAVY ANALYSIS (4 Parallel Opus Agents)               │
│     └─► First Principles: Core insights, deletion candidates    │
│     └─► AGI-Pilled: Narrative structure, central metaphor       │
│     └─► TTS Production Expert: ElevenLabs constraints           │
│     └─► Stop-Slop Expert: AI pattern watchlist, voice traps     │
├─────────────────────────────────────────────────────────────────┤
│  PHASE 3: SYNTHESIS                                             │
│     └─► Resolve tradeoffs between agent recommendations         │
│     └─► Define chapter structure with word counts               │
│     └─► Establish production rules (format, TTS, stop-slop)     │
├─────────────────────────────────────────────────────────────────┤
│  PHASE 4: WRITING                                               │
│     └─► Write preamble (90 words, topic + tone + preview)       │
│     └─► Write all chapters as continuous prose                  │
│     └─► Apply stop-slop watchlist pass                          │
├─────────────────────────────────────────────────────────────────┤
│  PHASE 5: TTS POST-PROCESSING                                   │
│     └─► Strip ALL markdown (ElevenLabs reads it aloud)          │
│     └─► Normalize numbers, dates, abbreviations to words        │
│     └─► Insert SSML <break> tags at chapter transitions         │
│     └─► Self-score against 5 dimensions (target 40+/50)         │
│     └─► Output single TTS-ready file                            │
├─────────────────────────────────────────────────────────────────┤
│  PHASE 6: AUDIO GENERATION (optional, if user requests)         │
│     └─► Generate via ElevenLabs Studio or API                   │
│     └─► Post-process: normalize, concatenate, package M4B       │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Content Discovery

Step 1.1: Identify Source Documents

Parse the user's request to identify:

Source documents (paths or descriptions)
Desired voice/style (default: Michael Caine oration style)
Target length (default: 7,000-10,000 words)
Output filename

Step 1.2: Read and Analyze Sources

For each source document:

Read the full content
Identify key themes, insights, and narrative threads
Note overlapping content between documents
Flag unique insights that must not be lost in synthesis

Phase 2: Heavy Analysis (4 Parallel Agents)

CRITICAL: Launch ALL 4 agents in a SINGLE message with multiple Task tool calls. This makes them run in parallel.

Architecture: Use Task(), not TeamCreate. The agents produce independent analysis artifacts (core insights, narrative structure, TTS constraints, slop watchlist), and the coordinator synthesizes the tradeoffs in Phase 3.

Agent 1: First Principles Analysis

Task(
  subagent_type="general-purpose",
  description="First Principles Analysis",
  model="opus",
  prompt="""You are distilling source documents into core audiobook insights.

Apply the Elon Musk algorithm:
1. **Question every theme** - Why does this need to be in the audiobook?
2. **Delete** - Remove anything redundant, tangential, or dilutive
3. **Simplify** - Merge overlapping concepts into singular, powerful statements
4. **Sequence** - What's the most compelling order for the listener?

SOURCE DOCUMENTS:
[PASTE SUMMARIES OF SOURCE DOCUMENTS]

## Your Mission

From an audiobook perspective:
- Which themes are ESSENTIAL vs. nice-to-have?
- What content overlaps and can be merged?
- What order creates the best narrative tension?
- What's the single most compelling insight to open with?
- What's the satisfying conclusion that ties everything together?

## Output

1. **10 Core Insights** (ranked by importance)
2. **Deletion List** (content that weakens the narrative)
3. **Proposed Narrative Arc** (beginning → middle → end)
4. **Central Hook** (the 1-sentence premise that grabs attention)
"""
)

Agent 2: AGI-Pilled Analysis

Task(
  subagent_type="general-purpose",
  description="AGI-Pilled Analysis",
  model="opus",
  prompt="""You are designing the MOST COMPELLING audiobook possible from this material.

**Core beliefs:**
- Listeners are intelligent - don't condescend
- Emotion drives retention more than information
- One transforming metaphor beats five scattered ones
- The best audiobooks feel like conversations, not lectures
- Every chapter should leave the listener wanting the next

SOURCE DOCUMENTS:
[PASTE SUMMARIES OF SOURCE DOCUMENTS]

## Your Mission

Design the audiobook structure:
1. **Central Metaphor**: One image that transforms across chapters
2. **Emotional Arc**: What does the listener FEEL at each stage?
3. **Chapter Outline**: 7-12 chapters with evocative titles
4. **Meta-awareness**: Where can the narrative acknowledge itself?
5. **Recurring Anchors**: What motifs appear 3+ times with deepening meaning?

## Narrative Techniques to Consider

- **Frame narrative**: The narrator has a stake in the story
- **Temporal fracture**: Break chronology for dramatic effect
- **Bleeding meta-awareness**: Hint at revelations before confirming
- **Deletion litany**: Lists read with rhythmic relish
- **Quiet endings**: Not every chapter needs a punchline

## Output

- **Proposed Title** (evocative, not descriptive)
- **Central Metaphor** and how it transforms across chapters
- **Chapter Outline** with emotional beats for each
- **3 Recurring Anchors** with their appearance schedule
"""
)

Agent 3: TTS Production Expert (ElevenLabs-Optimized)

Task(
  subagent_type="general-purpose",
  description="ElevenLabs TTS Production Expert",
  model="opus",
  prompt="""You are a TTS audio production expert optimizing this audiobook for ElevenLabs text-to-speech.

SOURCE DOCUMENTS:
[PASTE SUMMARIES OF SOURCE DOCUMENTS]

## ElevenLabs-Specific Constraints

**Target model: eleven_multilingual_v2** (most stable for long-form, 10K char limit per chunk)

**CRITICAL: ElevenLabs reads markdown aloud.** Every **, *, _, #, `, >, []() will be spoken as "asterisk", "hashtag", "backtick" etc. The output must be 100% clean prose with zero markdown syntax.

**Pause Control (SSML break tags -- works on multilingual_v2):**
- <break time="0.5s" /> -- micro pause (replaces comma-level breath)
- <break time="1.0s" /> -- sentence-level pause
- <break time="1.5s" /> -- paragraph transition
- <break time="2.5s" /> -- chapter transition (max 3s)
- DO NOT overuse: more than 3-4 break tags per paragraph causes artifacts (speedup, noise, garbled audio)
- Em-dashes (--) work as reliable short pauses on all models
- Ellipsis (...) adds hesitation/nervousness -- use sparingly and intentionally

**Text Normalization (MANDATORY -- ElevenLabs misreads raw numbers/symbols):**
- ALL numbers spelled out: "fifty" not "50", "three thousand" not "3,000"
- Currencies: "one hundred dollars" not "$100"
- Dates: "March fifteenth" not "3/15"
- Percentages: "seventy percent" not "70%"
- Times: "ten thirty A M" not "10:30 AM"
- Abbreviations expanded: "Doctor" not "Dr.", "for example" not "e.g."
- MAX two numbers per sentence -- split if more
- URLs removed entirely or described
- Acronyms: either spell out "C E O" or leave as "CEO" (ElevenLabs handles common ones)

**Technical Content Translation:**
- No file paths -- give everything human names
- No code blocks or syntax
- No backtick-quoted terms
- Technical terms get first-mention introductions then shorthand names

**Length & Pacing:**
- 800-1,200 words per chapter (ceiling 1,500)
- 90-word preamble before chapter 1
- Total target: 7,000-10,000 words (~42,000-60,000 characters)

**Chunking for API (if generating programmatically):**
- Split at chapter boundaries (never mid-sentence)
- Each chunk under 10,000 characters for multilingual_v2
- Use request stitching: pass previous_text/next_text between chunks for prosody continuity

**Natural Flow:**
- Contractions always (it's, don't, won't)
- Parenthetical asides under 8 words, using dashes not parentheses
- Groups of items: max 3 named, then summarize
- Shorthand names after first introduction
- No semicolons (awkward pause length)
- Sentence length: average 15-20 words, vary between 5 and 35

## Output

1. **Production Spec** (word counts per chapter, preamble requirements)
2. **Formatting Rules** (pause placement, number handling, term translations)
3. **TTS Anti-patterns** (specific constructs from this source that ElevenLabs handles poorly)
4. **Translation Table** (technical term -> spoken equivalent for every term in the source)
5. **Sample Preamble** (90 words demonstrating the format)
"""
)

Agent 4: Stop-Slop Expert

Task(
  subagent_type="general-purpose",
  description="Stop-Slop Enforcement Expert review",
  model="opus",
  prompt="""You are a stop-slop enforcement expert. Your mission: eliminate predictable AI writing patterns from the audiobook.

SOURCE DOCUMENTS:
[PASTE SUMMARIES OF SOURCE DOCUMENTS]

## Stop-Slop Core Rules

1. **Cut filler phrases** - Remove throat-clearing openers and emphasis crutches
2. **Break formulaic structures** - Avoid binary contrasts, dramatic fragmentation, rhetorical setups
3. **Vary rhythm** - Mix sentence lengths. Two items beat three.
4. **Trust readers** - State facts directly. Skip softening, justification, hand-holding.
5. **Cut quotables** - If it sounds like a pull-quote, rewrite it.

## Your Mission

Create a comprehensive watchlist for THIS specific audiobook:

### Content Traps
- AI-typical transitions and openers
- Overused emphasis phrases
- Generic profundity markers

### Voice Traps
- Wisdom-dispenser constructions ("X isn't about Y. It's about Z.")
- Anthropomorphism treadmill (agents "wrestle," "grapple," "confront")
- Caine-voice parody risks ("Not a lot of people know that" echoes)

### Structural Traps
- Binary contrast reveals ("Not X. Y.") - max once per book
- Orphan dramatic fragments ("Gone." "Nothing." "Zero.")
- Three-item lists (two items beat three)
- Punchy closers on every chapter (vary: some end quietly)
- Recap tax (orient through new material, not summaries)

## Output

1. **35-Item Watchlist** (specific phrases and patterns to avoid)
2. **Voice Traps** (patterns that undermine the Caine voice)
3. **Structural Traps** (patterns that make structure predictable)
4. **Scoring Rubric** (5 dimensions, 10 points each, target 40+/50)
"""
)

Phase 3: Synthesis

After all 4 agents return, synthesize their outputs:

3.1 Resolve Tradeoffs

For each tension between agents, document:

TRADEOFF: [topic]
- Agent X says: [position] because [reasoning]
- Agent Y says: [position] because [reasoning]
- Resolution: [chosen approach with rationale]

Common tradeoffs:

Title style: Markdown headings vs. plain prose (MUST be plain prose -- ElevenLabs reads # as "hashtag")
Meta-narrative placement: Confined to one chapter vs. bleeding throughout
Chronology: Strict order vs. temporal fracture for dramatic effect
Metaphor count: Multiple metaphors vs. one that transforms

3.2 Define Chapter Structure

Create a table:

Chapter	Title	Word Count	Emotional Beat	Key Content
Preamble	-	90	Invitation	Topic, tone, preview
1	"..."	800-1,200	Hook	Opening insight
...	...	...	...	...
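
The chapter table can be sanity-checked against the length targets before any prose is written. The sketch below is illustrative only: `check_word_budget` is a hypothetical helper, and its defaults (90-word preamble, 1,500-word chapter ceiling, 7,000-10,000 word total) are taken from the targets stated in this skill.

```python
def check_word_budget(chapter_targets, preamble_words=90,
                      total_range=(7000, 10000), chapter_ceiling=1500):
    """Return a list of budget problems; an empty list means the plan fits."""
    problems = []
    for number, words in enumerate(chapter_targets, start=1):
        if words > chapter_ceiling:
            problems.append(
                f"chapter {number}: {words} words exceeds the {chapter_ceiling}-word ceiling"
            )
    total = preamble_words + sum(chapter_targets)
    low, high = total_range
    if not low <= total <= high:
        problems.append(f"total {total} words is outside {low}-{high}")
    return problems


if __name__ == "__main__":
    # Eight chapters of 1,000 words plus the preamble fits the target range.
    print(check_word_budget([1000] * 8))
```

An empty result means the outline is within budget; otherwise each string names the chapter or total that needs adjusting.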

3.3 Establish Production Rules

Document final rules for:

Format: Title style, pause markers, paragraph spacing
TTS: Number normalization, technical term translations, contractions
Stop-slop: Top 15 patterns to avoid during writing

Phase 4: Writing

4.1 Write Preamble (90 words)

The preamble must:

State the topic without spoiling insights
Set the tone (reflective, conversational, surprising)
Preview what the listener will experience
NOT include: "In this audiobook," "We will explore," or similar meta-framing

4.2 Write Chapters

For each chapter:

Open with the chapter title as plain text on its own line (NO markdown heading syntax)
Write continuous prose (no subheadings, no bullets, no formatting)
Honor the emotional beat for this chapter
Apply TTS constraints throughout -- especially number normalization
End chapters with a final sentence, then leave space for break tag insertion

4.3 Apply Stop-Slop Pass

After writing, scan for and eliminate:

Every item on the 35-item watchlist
Voice traps specific to the chosen narrator style
Structural predictability (vary endings, avoid binary reveals)

Phase 5: TTS Post-Processing

5.1 Markdown Stripping (CRITICAL)

ElevenLabs does NOT strip markdown. It reads formatting characters aloud.

Verify the output contains ZERO instances of:

** or * (bold/italic) -- spoken as "asterisk"
# (headings) -- spoken as "hashtag"
` (backticks) -- produces glottal pop or silence
> (blockquotes) -- spoken as "greater than"
[]() (links) -- spoken as "bracket parenthesis"
- or * (list markers) -- dash is audible
--- (horizontal rules) -- may read as "dash dash dash"
_ (underscore emphasis) -- spoken as "underscore"
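
The checklist above could be automated with a small scan along these lines. This is a minimal sketch, not part of the skill: the pattern names and the `find_markdown_residue` helper are assumptions, and the regular expressions are illustrative rather than a full markdown parser. SSML break tags and double-hyphen asides are deliberately not flagged.

```python
import re

# One pattern per checklist item (illustrative, not exhaustive).
MARKDOWN_PATTERNS = {
    "asterisk": re.compile(r"\*"),
    "heading": re.compile(r"^[ \t]*#+\s", re.MULTILINE),
    "backtick": re.compile(r"`"),
    "blockquote": re.compile(r"^[ \t]*>", re.MULTILINE),
    "link": re.compile(r"\[[^\]]*\]\([^)]*\)"),
    "list marker": re.compile(r"^[ \t]*[-+]\s", re.MULTILINE),
    "horizontal rule": re.compile(r"^[ \t]*-{3,}[ \t]*$", re.MULTILINE),
    "underscore emphasis": re.compile(r"(?<!\w)_[^_\n]+_(?!\w)"),
}


def find_markdown_residue(text: str) -> dict[str, int]:
    """Count matches for each markdown pattern still present in the text."""
    counts = {}
    for name, pattern in MARKDOWN_PATTERNS.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[name] = hits
    return counts


if __name__ == "__main__":
    print(find_markdown_residue("# Title\n**bold** and `code`"))
```

An empty dictionary means none of the listed characters were found; any other result names what still needs stripping.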

5.2 Text Normalization Checklist

Before outputting, verify ALL of these are normalized:

Raw	Normalized
Numbers < 10	Spelled out ("three" not "3")
Numbers >= 10	Spelled out in speech form ("forty-two" not "42")
Large numbers	Natural speech ("roughly ninety-six thousand" not "96,000")
Percentages	"seventy percent" not "70%"
Currency	"fifty dollars" not "$50"
Dates	"March fifteenth" not "3/15"
Times	"ten thirty A M" not "10:30 AM"
Abbreviations	"Doctor" not "Dr.", "for example" not "e.g.", "et cetera" not "etc."
File paths	Human names ("the Skill file" not "SKILL.md")
Acronyms	Decided per-term ("the language model" not "LLM", "NASA" stays as-is)
URLs	Removed or described ("the ElevenLabs documentation" not the URL)
@ and &	"at" and "and"
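
A rough illustration of the number rows in this table follows. It is a sketch under stated assumptions: `number_to_words` and `normalize_numbers` are hypothetical helpers, they cover whole numbers from 0 to 999,999 only, and decimals, dates, times, and abbreviations would need their own rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
NUM = r"\d{1,3}(?:,\d{3})+|\d+"  # "3,000" or "42"


def number_to_words(n: int) -> str:
    """Spell out a whole number in the range 0..999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999,999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def _spoken(match: re.Match, suffix: str = "") -> str:
    value = int(match.group(1).replace(",", ""))
    if suffix == " dollars" and value == 1:
        suffix = " dollar"
    return number_to_words(value) + suffix


def normalize_numbers(text: str) -> str:
    """Replace whole-number currency, percentages, and bare integers with words."""
    text = re.sub(rf"\$({NUM})", lambda m: _spoken(m, " dollars"), text)
    text = re.sub(rf"({NUM})%", lambda m: _spoken(m, " percent"), text)
    return re.sub(rf"\b({NUM})\b", lambda m: _spoken(m), text)


if __name__ == "__main__":
    print(normalize_numbers("It cost $100 and 70% of 3,000 users left."))
```

Currency and percentages are handled before bare integers so the symbol is consumed together with its number.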

5.3 Insert SSML Break Tags

Insert break tags at structural transitions in the final output:

...final sentence of chapter one. <break time="2.5s" />

Chapter Two. Title Here

First sentence of chapter two...

Rules:

<break time="2.5s" /> between chapters (one tag, not multiple)
<break time="0.5s" /> at the start and end of the entire text (prevents artifacts)
Do NOT scatter break tags throughout paragraphs -- em-dashes and periods handle intra-paragraph pacing
Maximum 3-4 break tags per chapter (overuse causes ElevenLabs to speed up or garble)
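
The placement rules above can be sketched as a single join. `assemble_tts_text` is a hypothetical helper, and putting each tag on its own line follows the example output layout shown later in this skill.

```python
CHAPTER_BREAK = '<break time="2.5s" />'
EDGE_BREAK = '<break time="0.5s" />'


def assemble_tts_text(preamble: str, chapters: list[str]) -> str:
    """Join the preamble and chapters with one break tag per transition."""
    sections = [preamble.strip()] + [chapter.strip() for chapter in chapters]
    body = f"\n\n{CHAPTER_BREAK}\n\n".join(sections)
    # A short break at each end of the whole text, as the rules require.
    return f"{EDGE_BREAK}\n\n{body}\n\n{EDGE_BREAK}\n"


if __name__ == "__main__":
    print(assemble_tts_text(
        "Seven chapters.",
        ["Chapter One. The Runbook Nobody Reads\n\nEvery engineering team has one."],
    ))
```

Because tags are added only by the join, no break tags end up inside paragraphs.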

5.4 Self-Score (Target: 40+/50)

Dimension	Question	Score
Directness (1-10)	Can you delete the first sentence of any paragraph without loss? If so, that sentence is filler.
Rhythm (1-10)	Sentence-length standard deviation above 4 words?
Trust (1-10)	Count evaluative adjectives (profound, elegant, devastating). Each costs 0.5 points.
Authenticity (1-10)	Would a blind reader guess AI or human?
Density (1-10)	Fewer than 8 distinct points per 1,000 words = too dilute.

If score < 40: Revise weak dimensions before outputting.
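
Two of the five dimensions are mechanical enough to compute. The sketch below is one possible reading of the rubric, not a definitive scorer: the helper names are hypothetical, the sentence splitter is naive, the population standard deviation is an assumption (the rubric does not say which form to use), and starting Trust at 10 with a floor of 1 is also an assumption.

```python
import re
import statistics

# The three adjectives named in the rubric; extend the set as needed.
EVALUATIVE_ADJECTIVES = {"profound", "elegant", "devastating"}


def sentence_lengths(text: str) -> list[int]:
    """Word count per sentence, splitting after ., !, or ? plus whitespace."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def rhythm_passes(text: str, threshold: float = 4.0) -> bool:
    """True when the sentence-length standard deviation exceeds the threshold."""
    lengths = sentence_lengths(text)
    return len(lengths) > 1 and statistics.pstdev(lengths) > threshold


def trust_score(text: str) -> float:
    """Start at 10 and subtract 0.5 per evaluative adjective, floored at 1."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for word in words if word in EVALUATIVE_ADJECTIVES)
    return max(1.0, 10.0 - 0.5 * hits)


if __name__ == "__main__":
    sample = "Short one. This sentence runs a good deal longer than the first one did."
    print(rhythm_passes(sample), trust_score(sample))
```

Directness, Authenticity, and Density still need a human or model judgment; this only covers the countable parts.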

5.5 Output

Write the final audiobook to a single file (.txt preferred over .md to signal TTS-readiness):

Plain prose throughout -- ZERO markdown syntax
SSML <break> tags at chapter transitions only
Double hyphens (--) for Caine-style asides (ElevenLabs reads these as natural pauses)
Blank lines between paragraphs
All numbers, dates, currencies, abbreviations spelled out as words

Phase 6: Audio Generation (Optional)

If the user requests actual audio generation, offer two paths:

Option A: ElevenLabs Studio (Recommended for One-Off)

Tell user to upload the .txt file to ElevenLabs Studio (studio.elevenlabs.io)
Recommended settings:
- Model: eleven_multilingual_v2 (most stable for long-form)
- Voice: Browse Voice Library for British/male/deep/conversational -- or use Michael Caine's Iconic Marketplace voice if licensed
- Stability: 0.75 (higher = smoother, less variation)
- Similarity: 0.80 (high for premade voices)
- Style Exaggeration: 0.0 (keep at zero for audiobook narration)
- Speaker Boost: On
Studio auto-detects chapters from the text structure
Review sentence-by-sentence, regenerate any bad sections, lock good ones
Export per-chapter MP3s
Cost: ~$22/month on Creator plan covers a full audiobook

Option B: API Pipeline (Programmatic)

If generating via script:

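from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Placeholder input: one string per chapter, split at chapter boundaries,
# each under 10,000 characters. Load your own chapter texts here.
chunks = ["Chapter One. ...", "Chapter Two. ..."]

for i, current_chunk in enumerate(chunks):
    prev_chunk = chunks[i - 1] if i > 0 else None
    next_chunk = chunks[i + 1] if i + 1 < len(chunks) else None

    # Request stitching: neighbouring text keeps prosody seamless across joins
    audio = client.text_to_speech.convert(
        voice_id="VOICE_ID",
        model_id="eleven_multilingual_v2",
        text=current_chunk,
        previous_text=prev_chunk[-1000:] if prev_chunk else None,
        next_text=next_chunk[:1000] if next_chunk else None,
        output_format="mp3_44100_128",
        voice_settings={
            "stability": 0.75,
            "similarity_boost": 0.80,
            "style": 0.0,
            "use_speaker_boost": True,
        },
    )

    # In current SDK versions convert() returns an iterator of byte chunks
    with open(f"chapter_{i + 1:02d}.mp3", "wb") as f:
        for part in audio:
            f.write(part)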

Request stitching is essential -- without it, each chunk starts with a fresh cadence and the joins are audible. Pass ~1,000 characters of context in previous_text and next_text.
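
The chunking and context windows described here might be prepared as follows. This is a sketch with hypothetical helper names: it packs whole paragraphs (so a split never lands mid-sentence) rather than whole chapters, and it keeps an oversized single paragraph intact instead of splitting it further.

```python
def split_into_chunks(text: str, limit: int = 10_000) -> list[str]:
    """Greedily pack paragraphs into chunks shorter than the character limit."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) < limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph  # an oversized paragraph is kept whole here
    if current:
        chunks.append(current)
    return chunks


def stitching_context(chunks: list[str], i: int, window: int = 1000):
    """Return (previous_text, next_text) for chunk i, with None at the edges."""
    previous_text = chunks[i - 1][-window:] if i > 0 else None
    next_text = chunks[i + 1][:window] if i + 1 < len(chunks) else None
    return previous_text, next_text


if __name__ == "__main__":
    demo = "\n\n".join(["word " * 300] * 10)
    parts = split_into_chunks(demo)
    print([len(p) for p in parts], stitching_context(parts, 0)[0])
```

The two values returned by `stitching_context` map directly onto the `previous_text` and `next_text` arguments in the request.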

Post-processing pipeline:

# 1. Concatenate chapter MP3s (glob order is lexical, so zero-pad: chapter_01.mp3, chapter_02.mp3, ...)
for f in chapter_*.mp3; do echo "file '$f'"; done > filelist.txt
ffmpeg -f concat -safe 0 -i filelist.txt -c copy full_audiobook.mp3

# 2. Normalize audio levels to -16 LUFS (a common target for podcasts and spoken-word audio)
ffmpeg -i full_audiobook.mp3 -af "loudnorm=I=-16:dual_mono=true" -ar 44100 -b:a 128k audiobook_normalized.mp3

# 3. Package as M4B audiobook with chapters (optional; chapters.txt must be in FFMETADATA format)
ffmpeg -i audiobook_normalized.mp3 -i chapters.txt -map_metadata 1 -c:a aac -b:a 128k audiobook.m4b
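
Step 3 reads chapter markers from chapters.txt, which ffmpeg expects in its FFMETADATA format. One way to produce that file is sketched below. `build_ffmetadata` is a hypothetical helper, the per-chapter durations are assumed to be measured separately (for example with ffprobe), and titles containing =, ;, #, or a backslash would need escaping, which this sketch omits.

```python
def build_ffmetadata(chapters: list[tuple[str, float]]) -> str:
    """Build FFMETADATA text from ordered (title, duration_seconds) pairs."""
    lines = [";FFMETADATA1"]
    start_ms = 0
    for title, duration in chapters:
        end_ms = start_ms + round(duration * 1000)
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START and END are in milliseconds
            f"START={start_ms}",
            f"END={end_ms}",
            f"title={title}",
        ]
        start_ms = end_ms  # the next chapter starts where this one ends
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Durations here are made-up placeholders, not measured values.
    demo = [("Chapter One. The Runbook Nobody Reads", 412.5),
            ("Chapter Two. Three Tiers Down", 398.0)]
    with open("chapters.txt", "w", encoding="utf-8") as handle:
        handle.write(build_ffmetadata(demo))
```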

Cost: ~$3.60 (Flash v2.5) to ~$7.20 (Multilingual v2) for a 10,000-word audiobook via API.
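
The arithmetic behind these figures, as a small sketch. The per-1,000-character rates are the ones listed in the quick reference table, six characters per word matches the word-to-character ratio used elsewhere in this skill, and `estimate_cost` is a hypothetical helper.

```python
# Rates per 1,000 characters, copied from the quick reference table.
RATE_PER_1K_CHARS = {
    "eleven_multilingual_v2": 0.12,
    "eleven_flash_v2_5": 0.06,
}


def estimate_cost(word_count: int, model_id: str, chars_per_word: float = 6.0) -> float:
    """Estimate API cost in dollars for a text of the given word count."""
    characters = word_count * chars_per_word
    return round(characters / 1000 * RATE_PER_1K_CHARS[model_id], 2)


if __name__ == "__main__":
    for model in RATE_PER_1K_CHARS:
        print(model, estimate_cost(10_000, model))
```

Regenerated sections are billed again, so actual spend may run above the estimate.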

ElevenLabs Model Quick Reference

Model	Char Limit	SSML Breaks	Best For	Cost/1K chars
eleven_multilingual_v2	10,000	YES	Audiobooks (most stable long-form)	$0.12
eleven_turbo_v2_5	40,000	YES	Good quality, half cost	$0.06
eleven_flash_v2_5	40,000	YES	Budget, real-time	$0.06
eleven_v3	5,000	NO (use [pause] audio tags)	Expressive/emotional	$0.12

Recommended for audiobooks: eleven_multilingual_v2 -- explicitly described by ElevenLabs as most stable on long-form generations.

Michael Caine voice: Available through ElevenLabs' Iconic Marketplace (partnership announced January 2026). Requires commercial licensing. For unlicensed alternatives, search the Voice Library for: British, male, deep, mature, conversational.

Voice Guide: Michael Caine Style

The Caine voice is:

Conversational, not lecturing
Uses contractions naturally (it's, don't, won't)
Employs asides with dashes -- like this, you see --
Builds to points through stories, not announcements
Occasionally pauses mid-thought to reflect
Never condescends, treats listener as intelligent
Underplays revelations (lets facts speak)

Caine voice traps to avoid:

Over-using "you see" or "the thing is"
Echoing his famous lines ("Not a lot of people know that")
Forced Cockney rhythms
Explaining jokes or metaphors

Sample Caine rhythm: "I spent twenty-five minutes reading the same paragraph. Not because it was difficult -- it wasn't. Because somewhere between the third and fourth reading, I forgot I'd read it at all. My notes said confidence high. They also said nothing useful."

Stop-Slop Reference: Top 35 Patterns

Filler Phrases (Cut These)

"Here's the thing" / "The reality is" / "Let's be clear"
"Let that sink in" / "Think about that"
"At the end of the day"
"It's worth noting that"
"Interestingly enough"
"Needless to say"
"In other words"
"Simply put" / "Put simply"
"The fact of the matter is"
"It goes without saying"

Profundity Markers (Rewrite These)

"ever-evolving landscape"
"profound implications"
"game-changer" / "paradigm shift"
"raises important questions"
"And that changes everything"
"transformative impact"
"at its core"
"fundamentally"

Structural Tells (Vary These)

"Not X. Y." binary contrast reveals (max 1x per book)
Orphan dramatic fragments ("Gone." "Nothing." "Zero.")
Three-item lists (two items beat three)
Em-dash before a reveal
Punchy one-liner chapter endings
"Actually" as reveal signal
Recap paragraphs at section starts

Voice Traps (Avoid These)

Wisdom-dispenser: "Memory isn't about X. It's about Y."
Anthropomorphism treadmill: agents "wrestle," "grapple," "confront"
Aside inflation (max 2 dashes per 250 words)
Explaining metaphors after deploying them
Rhetorical questions that you immediately answer
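
The phrase-level items and the aside limit lend themselves to a mechanical first pass. The sketch below is illustrative only: `WATCHLIST` holds a subset of the phrases listed above, both helper names are hypothetical, and structural tells such as binary reveals still need a reader's judgment.

```python
# A subset of the filler phrases and profundity markers from this reference.
WATCHLIST = [
    "here's the thing", "the reality is", "let that sink in",
    "at the end of the day", "it's worth noting that", "needless to say",
    "profound implications", "game-changer", "paradigm shift", "at its core",
]


def scan_watchlist(text: str) -> dict[str, int]:
    """Count case-insensitive occurrences of each watchlist phrase."""
    lowered = text.lower().replace("\u2019", "'")  # fold curly apostrophes
    return {phrase: lowered.count(phrase) for phrase in WATCHLIST if phrase in lowered}


def aside_dash_rate(text: str) -> float:
    """Double-hyphen dashes per 250 words; the reference caps this at 2."""
    words = sum(1 for token in text.split() if token != "--")
    dashes = text.count("--")
    return dashes * 250 / words if words else 0.0


if __name__ == "__main__":
    draft = "Here's the thing -- at its core, it works."
    print(scan_watchlist(draft), aside_dash_rate(draft))
```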

Flow Killers (Eliminate These)

Consecutive sentences of matching length (break one)
Paragraph starting with "But" after "And" paragraph
"In fact" as intensifier
"Of course" as dismissive cushion
"Perhaps" / "Maybe" hedging without purpose

Example Output Structure

Note: This is the FINAL output format -- plain text, zero markdown, SSML breaks at chapter joins only.

<break time="0.5s" />

How does a plain text file make an AI agent do real work -- not answer a question, but forty-step, error-recovering, runs-while-you-sleep work? This audiobook examines OpenClaw and its three core systems. Along the way, it finds that two independent projects arrived at the same architecture from opposite starting positions. Seven chapters.

<break time="2.5s" />

Chapter One. The Runbook Nobody Reads

Every engineering team has one. A document in the wiki titled something like Runbook -- Production Incident Response. It was written after the last outage by someone who swore the procedures would be followed this time...

<break time="2.5s" />

Chapter Two. Three Tiers Down

A single skill file doesn't weigh much. A few hundred words of instructions, a configuration header listing what tools the agent can use...

<break time="2.5s" />

...

<break time="2.5s" />

Chapter Seven. Convergence at the Carapace

Under the carapace, the gills are working. The organism is alive. The rest is surface.

<break time="0.5s" />

Triggers

/audiobook
"create an audiobook"
"turn this into audio"
"make this TTS-ready"
"audiobook from documents"
