
Building a Documentation MCP Server with docpull

How docpull converts documentation sites to AI-ready Markdown, and how to build an MCP server on top of it for semantic search in Claude Code.


AI assistants are only as good as the context you give them. When you're deep in a coding session, you shouldn't have to leave your editor to look up documentation. docpull solves the first half of that problem — it converts any documentation site into clean, structured Markdown. This post covers what docpull does, then walks through building an MCP server on top of it so Claude Code can search your docs mid-session.

What is docpull?

docpull is a Python CLI that crawls documentation sites and outputs clean Markdown files with structured YAML frontmatter. It's designed specifically for AI workflows — RAG pipelines, LLM training data, and knowledge bases.

pip install docpull

Point it at any docs site:

docpull https://orm.drizzle.team --profile rag -o docs/drizzle/

Every page becomes a Markdown file with metadata:

---
title: "Insert"
source: https://orm.drizzle.team/docs/insert
fetched: 2024-01-15T10:30:00Z
---
 
# Insert
 
To insert data with Drizzle, use the `insert()` method...

The source field enables citations back to the original page. The title provides context for chunking. The Markdown body is ready for embedding.

Profiles

docpull ships with three built-in profiles that optimize for different use cases:

Profile   Purpose                Details
rag       RAG pipelines & LLMs   Streaming dedup, metadata-rich output, full crawl
mirror    Site archiving         Full crawl with caching enabled
quick     Testing & sampling     Limited to 50 pages, depth 2
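
For example, to sanity-check a site before committing to a full crawl, the quick profile keeps the run small (the output path here is just an example):

# Sample run, capped at 50 pages at depth 2
docpull https://orm.drizzle.team --profile quick -o docs/drizzle-sample/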

Key features

Feature              Why It Matters
Streaming dedup      SimHash with O(1) lookups. Reduces index size by ~3x.
JS rendering         Playwright support for SPAs. Most modern docs need this.
Incremental updates  ETag caching. Only re-fetch changed pages.
Security             HTTPS-only, respects robots.txt, SSRF protection.

Incremental updates are especially useful for keeping your index fresh:

# Weekly update — only re-fetches changed pages
docpull https://orm.drizzle.team --profile rag -o docs/drizzle/ --incremental

From Markdown to MCP

Clean Markdown files in a folder are a good start, but they don't help your AI assistant mid-session. You need semantic search, exact pattern matching, and a way to serve results over MCP. Here's how to build that retrieval layer on top of docpull.

Architecture

docpull                    MCP server                   Claude Code
─────────────────          ─────────────────            ─────────────────
Fetch docs           →     Chunk + embed          →     search_docs
Clean markdown       →     Store in pgvector      →     grep_docs
YAML frontmatter     →     Serve over MCP         →     list_libraries

docpull handles data acquisition. The MCP server handles retrieval.

Database schema

Store chunks and embeddings in Postgres with pgvector:

CREATE TABLE doc_chunks (
  id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  library    TEXT NOT NULL,
  file_path  TEXT NOT NULL,
  content    TEXT NOT NULL,
  embedding  VECTOR(1536),
  metadata   JSONB,
  created_at TIMESTAMPTZ DEFAULT now()
);
 
CREATE INDEX idx_library ON doc_chunks(library);
CREATE INDEX idx_embedding ON doc_chunks
  USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

IVFFlat keeps similarity search fast. The lists = 100 parameter works well for a few thousand chunks.
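
For reference, this is the shape of query that index accelerates. In this sketch, $1 stands for the query embedding vector and the library filter is optional:

-- Top 5 chunks nearest to the query embedding by cosine distance
SELECT content, metadata->>'source' AS source
FROM doc_chunks
WHERE library = 'drizzle'
ORDER BY embedding <=> $1
LIMIT 5;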

Ingestion

The ingestion script reads docpull output and populates the database:

async function ingest(docsDir: string, library: string) {
  const files = await glob(`${docsDir}/**/*.md`);
 
  for (const file of files) {
    const content = await Bun.file(file).text();
    const { data: frontmatter, content: body } = matter(content);
    const chunks = chunkMarkdown(body, { maxTokens: 500, overlap: 50 });
 
    for (const chunk of chunks) {
      const embedding = await embed(chunk.text);
 
      await db.insert(docChunks).values({
        library,
        filePath: file,
        content: chunk.text,
        embedding: toVector(embedding),
        metadata: {
          title: frontmatter.title,
          source: frontmatter.source,
          heading: chunk.heading,
        },
      });
    }
  }
}

Chunking with 50-token overlap prevents context loss at boundaries. The source and title from docpull's frontmatter get stored as metadata so you can cite back to the original docs.
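
chunkMarkdown isn't shown above. A rough sketch splits on blank lines, tracks the nearest heading for metadata, and starts a new chunk once the token budget is hit, carrying some overlap forward. Token counts are approximated by word counts here, which is an assumption:

interface Chunk {
  text: string;
  heading: string | null;
}

function chunkMarkdown(
  body: string,
  opts: { maxTokens: number; overlap: number }
): Chunk[] {
  const blocks = body.split(/\n\n+/);
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let tokens = 0;
  let heading: string | null = null;

  // Crude token estimate; swap in a real tokenizer if counts matter
  const count = (s: string) => s.split(/\s+/).filter(Boolean).length;

  for (const block of blocks) {
    const match = block.match(/^#{1,6}\s+(.*)/);
    if (match) heading = match[1];

    if (tokens + count(block) > opts.maxTokens && current.length > 0) {
      chunks.push({ text: current.join("\n\n"), heading });
      // Carry the tail of the previous chunk forward as overlap
      const tail = current.join(" ").split(/\s+/).slice(-opts.overlap).join(" ");
      current = tail ? [tail] : [];
      tokens = count(tail);
    }

    current.push(block);
    tokens += count(block);
  }

  if (current.length > 0) chunks.push({ text: current.join("\n\n"), heading });
  return chunks;
}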

MCP tools

The server exposes three tools.
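
Each tool takes a plain Zod shape as its schema. The exact fields are up to you; here's a sketch of what the two search tools might accept (the names and descriptions are illustrative):

import { z } from "zod";

const searchSchema = {
  query: z.string().describe("Natural-language question to search for"),
  library: z.string().optional().describe("Limit results to one indexed library"),
  limit: z.number().int().positive().optional().describe("Max chunks to return"),
};

const grepSchema = {
  pattern: z.string().describe("Exact substring, e.g. a method name"),
  library: z.string().optional(),
};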

search_docs — semantic search using text-embedding-3-small and pgvector cosine similarity:

server.tool("search_docs", schema, async ({ query, library, limit }) => {
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
 
  const results = await db
    .select()
    .from(docChunks)
    .where(library ? eq(docChunks.library, library) : undefined)
    .orderBy(sql`embedding <=> ${toVector(embedding)}`)
    .limit(limit ?? 5);
 
  return { content: [{ type: "text", text: formatResults(results) }] };
});

grep_docs — exact pattern matching when you need specific method names or symbols. No embedding lookup, just SQL LIKE:

server.tool("grep_docs", schema, async ({ pattern, library }) => {
  const results = await db
    .select()
    .from(docChunks)
    .where(
      and(
        like(docChunks.content, `%${pattern}%`),
        library ? eq(docChunks.library, library) : undefined
      )
    )
    .limit(10);
 
  return { content: [{ type: "text", text: formatResults(results) }] };
});
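
One caveat: % and _ are LIKE wildcards, so a raw pattern can match more than intended. A small helper (hypothetical, not part of the snippet above) keeps matches literal:

// Escape LIKE wildcards so user-supplied patterns match literally
function escapeLike(pattern: string): string {
  return pattern.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

// usage: like(docChunks.content, `%${escapeLike(pattern)}%`)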

list_libraries — returns all indexed libraries:

server.tool("list_libraries", {}, async () => {
  const libraries = await db
    .selectDistinct({ library: docChunks.library })
    .from(docChunks);
 
  return {
    content: [{ type: "text", text: libraries.map((l) => l.library).join("\n") }],
  };
});
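
Wiring it all together is mostly boilerplate. Here's a minimal sketch using the TypeScript MCP SDK over stdio; the three tool registrations above slot into the marked spot:

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

const server = new McpServer({ name: "docs", version: "1.0.0" });

// ... register search_docs, grep_docs, and list_libraries here ...

// Claude Code spawns this process and talks to it over stdin/stdout,
// so the deployment options below are just different ways to launch
// the same script.
const transport = new StdioServerTransport();
await server.connect(transport);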

Performance

Benchmarks on a few thousand chunks:

Query Type        Latency
Semantic search   ~500ms
Grep              ~70ms

The 500ms includes embedding generation. Grep skips that entirely.

Deployment

Option 1: SSH/stdio — No HTTP server. Claude Code connects directly over SSH:

{
  "mcpServers": {
    "docs": {
      "command": "ssh",
      "args": ["user@server", "cd docpull && bun run src/server.ts"]
    }
  }
}

Option 2: Local — Run the server locally over stdio:

{
  "mcpServers": {
    "docs": {
      "command": "bun",
      "args": ["run", "/path/to/docpull/src/server.ts"]
    }
  }
}

Getting started

git clone https://github.com/raintree-technology/docpull
cd docpull
bun install
 
# Fetch docs
pip install docpull
docpull https://docs.example.com --profile rag -o docs/example/
 
# Set up Postgres with pgvector, then:
bun run db:push
bun run ingest docs/example example
 
# Run server
bun run src/server.ts

Add the server to your Claude Code config and you have semantic search over your documentation.
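
If you'd rather not edit the config by hand, Claude Code's claude mcp add command registers the same stdio server (the server name and path here are placeholders):

claude mcp add docs -- bun run /path/to/docpull/src/server.ts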