Building a Documentation MCP Server with docpull
How docpull converts documentation sites to AI-ready Markdown, and how to build an MCP server on top of it for semantic search in Claude Code.
AI assistants are only as good as the context you give them. When you're deep in a coding session, you shouldn't have to leave your editor to look up documentation. docpull solves the first half of that problem — it converts any documentation site into clean, structured Markdown. This post covers what docpull does, then walks through building an MCP server on top of it so Claude Code can search your docs mid-session.
What is docpull?
docpull is a Python CLI that crawls documentation sites and outputs clean Markdown files with structured YAML frontmatter. It's designed specifically for AI workflows — RAG pipelines, LLM training data, and knowledge bases.
```bash
pip install docpull
```

Point it at any docs site:
```bash
docpull https://orm.drizzle.team --profile rag -o docs/drizzle/
```

Every page becomes a Markdown file with metadata:
```markdown
---
title: "Insert"
source: https://orm.drizzle.team/docs/insert
fetched: 2024-01-15T10:30:00Z
---

# Insert

To insert data with Drizzle, use the `insert()` method...
```

The `source` field enables citations back to the original page. The `title` provides context for chunking. The Markdown body is ready for embedding.
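For instance, here is a minimal sketch of reading that frontmatter with the gray-matter package (the file path is hypothetical):

```typescript
import { readFileSync } from "node:fs";
import matter from "gray-matter";

// Split a docpull output file into YAML frontmatter and Markdown body.
const { data, content } = matter(readFileSync("docs/drizzle/insert.md", "utf8"));

console.log(data.title);  // "Insert"
console.log(data.source); // URL for citing back to the original page
// `content` is the clean Markdown body, ready for chunking and embedding.
```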
Profiles
docpull ships with three built-in profiles that optimize for different use cases:
| Profile | Purpose | Details |
|---|---|---|
| rag | RAG pipelines & LLMs | Streaming dedup, metadata-rich output, full crawl |
| mirror | Site archiving | Full crawl with caching enabled |
| quick | Testing & sampling | Limited to 50 pages, depth 2 |
Key features
| Feature | Why It Matters |
|---|---|
| Streaming dedup | SimHash with O(1) lookups. Reduces index size by ~3x. |
| JS rendering | Playwright support for SPAs. Most modern docs need this. |
| Incremental updates | ETag caching. Only re-fetch changed pages. |
| Security | HTTPS-only, respects robots.txt, SSRF protection. |
Incremental updates are especially useful for keeping your index fresh:
```bash
# Weekly update — only re-fetches changed pages
docpull https://orm.drizzle.team --profile rag -o docs/drizzle/ --incremental
```

From Markdown to MCP
Clean Markdown files in a folder are a good start, but they don't help your AI assistant mid-session. You need semantic search, exact pattern matching, and a way to serve results over MCP. Here's how to build that retrieval layer on top of docpull.
Architecture
```text
docpull                  MCP server               Claude Code
─────────────────        ─────────────────        ─────────────────
Fetch docs          →    Chunk + embed       →    search_docs
Clean markdown      →    Store in pgvector   →    grep_docs
YAML frontmatter    →    Serve over MCP      →    list_libraries
```
docpull handles data acquisition. The MCP server handles retrieval.
Database schema
Store chunks and embeddings in Postgres with pgvector:
```sql
-- Requires the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  library TEXT NOT NULL,
  file_path TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding VECTOR(1536),
  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX idx_library ON doc_chunks(library);
CREATE INDEX idx_embedding ON doc_chunks
  USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```

IVFFlat keeps similarity search fast. The `lists = 100` parameter works well for a few thousand chunks.
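The TypeScript code below accesses this table through Drizzle ORM as `docChunks`. A sketch of what that schema definition could look like, assuming drizzle-orm's built-in pgvector column type (the file path is an assumption):

```typescript
// db/schema.ts (hypothetical path)
import { pgTable, uuid, text, jsonb, timestamp, vector } from "drizzle-orm/pg-core";

export const docChunks = pgTable("doc_chunks", {
  id: uuid("id").primaryKey().defaultRandom(),
  library: text("library").notNull(),
  filePath: text("file_path").notNull(),
  content: text("content").notNull(),
  embedding: vector("embedding", { dimensions: 1536 }),
  metadata: jsonb("metadata"),
  createdAt: timestamp("created_at", { withTimezone: true }).defaultNow(),
});
```

With this column type, inserts can pass a plain number[] and Drizzle handles the pgvector serialization; a string form is only needed when an embedding appears inside raw SQL, as in search_docs below.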
Ingestion
The ingestion script reads docpull output and populates the database:
```typescript
import { glob } from "glob";
import matter from "gray-matter";
import { db } from "./db";                    // drizzle instance (hypothetical path)
import { docChunks } from "./db/schema";      // schema sketched above
import { chunkMarkdown, embed } from "./lib"; // helpers sketched below

async function ingest(docsDir: string, library: string) {
  const files = await glob(`${docsDir}/**/*.md`);
  for (const file of files) {
    const content = await Bun.file(file).text();
    const { data: frontmatter, content: body } = matter(content);
    const chunks = chunkMarkdown(body, { maxTokens: 500, overlap: 50 });
    for (const chunk of chunks) {
      const embedding = await embed(chunk.text);
      await db.insert(docChunks).values({
        library,
        filePath: file,
        content: chunk.text,
        embedding, // number[]; the vector column serializes it
        metadata: {
          title: frontmatter.title,
          source: frontmatter.source,
          heading: chunk.heading,
        },
      });
    }
  }
}
```

Chunking with 50-token overlap prevents context loss at boundaries. The source and title from docpull's frontmatter are stored as metadata so results can cite back to the original docs.
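chunkMarkdown, embed, and toVector are left undefined above. A rough sketch of what they might look like, using a ~4-characters-per-token heuristic for chunk sizing (these names and signatures are assumptions, not docpull APIs):

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

type Chunk = { text: string; heading?: string };

// Split Markdown on headings, then window long sections with overlap.
export function chunkMarkdown(
  body: string,
  opts: { maxTokens: number; overlap: number },
): Chunk[] {
  const maxChars = opts.maxTokens * 4;     // ~4 chars per token
  const overlapChars = opts.overlap * 4;
  const chunks: Chunk[] = [];
  let heading: string | undefined;
  let buffer = "";

  const flush = () => {
    for (let i = 0; i < buffer.length; i += maxChars - overlapChars) {
      const text = buffer.slice(i, i + maxChars).trim();
      if (text) chunks.push({ text, heading });
    }
    buffer = "";
  };

  for (const line of body.split("\n")) {
    const match = line.match(/^#{1,6}\s+(.*)/);
    if (match) {
      flush();            // emit the previous section
      heading = match[1]; // track the current heading for metadata
    }
    buffer += line + "\n";
  }
  flush();
  return chunks;
}

// Embed a chunk of text; returns a 1536-dimension vector.
export async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// pgvector's text form, e.g. "[0.01,-0.2,...]"; used when an embedding
// appears inside raw SQL (the typed vector column takes number[] directly).
export const toVector = (v: number[]) => `[${v.join(",")}]`;
```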
MCP tools
The server exposes three tools.
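The post doesn't show the scaffolding around them; here is a minimal sketch using the official MCP TypeScript SDK over stdio (the server name and the zod shape are assumptions):

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import OpenAI from "openai";

const server = new McpServer({ name: "docpull-docs", version: "1.0.0" });
const openai = new OpenAI();

// search_docs takes this shape; grep_docs would swap `query` for `pattern`.
const schema = {
  query: z.string(),
  library: z.string().optional(),
  limit: z.number().optional(),
};

// ...tool registrations (below)...

const transport = new StdioServerTransport();
await server.connect(transport);
```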
search_docs — semantic search using text-embedding-3-small and pgvector cosine similarity:
server.tool("search_docs", schema, async ({ query, library, limit }) => {
const embedding = await openai.embeddings.create({
model: "text-embedding-3-small",
input: query,
});
const results = await db
.select()
.from(docChunks)
.where(library ? eq(docChunks.library, library) : undefined)
.orderBy(sql`embedding <=> ${toVector(embedding)}`)
.limit(limit ?? 5);
return { content: [{ type: "text", text: formatResults(results) }] };
});grep_docs — exact pattern matching when you need specific method names or symbols. No embedding lookup, just SQL LIKE:
server.tool("grep_docs", schema, async ({ pattern, library }) => {
const results = await db
.select()
.from(docChunks)
.where(
and(
like(docChunks.content, `%${pattern}%`),
library ? eq(docChunks.library, library) : undefined
)
)
.limit(10);
return { content: [{ type: "text", text: formatResults(results) }] };
});list_libraries — returns all indexed libraries:
server.tool("list_libraries", {}, async () => {
const libraries = await db
.selectDistinct({ library: docChunks.library })
.from(docChunks);
return {
content: [{ type: "text", text: libraries.map((l) => l.library).join("\n") }],
};
});Performance
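Each tool returns its rows through a formatResults helper that isn't shown. A plausible sketch that uses the title, heading, and source stored at ingestion time, so every hit carries a citation:

```typescript
// Hypothetical helper: render rows as readable, citable snippets.
type ResultRow = {
  content: string;
  metadata: { title?: string; source?: string; heading?: string } | null;
};

function formatResults(rows: ResultRow[]): string {
  if (rows.length === 0) return "No results.";
  return rows
    .map((row, i) => {
      const meta = row.metadata ?? {};
      const where = [meta.title, meta.heading].filter(Boolean).join(" > ");
      return `${i + 1}. ${where}\n${row.content}\nSource: ${meta.source ?? "unknown"}`;
    })
    .join("\n\n");
}
```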
Performance

Benchmarks on a few thousand chunks:
| Query Type | Latency |
|---|---|
| Semantic search | ~500ms |
| Grep | ~70ms |
The 500ms includes embedding generation. Grep skips that entirely.
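If recall matters more than latency, pgvector's IVFFlat index can probe more than one list per query. A small sketch of setting that per session through Drizzle (the value 10 is arbitrary; tune it against your data):

```typescript
import { sql } from "drizzle-orm";
import { db } from "./db"; // drizzle instance from earlier

// IVFFlat scans `probes` lists per query (default 1).
// Higher values improve recall at the cost of latency.
await db.execute(sql`SET ivfflat.probes = 10`);
```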
Deployment
Option 1: SSH/stdio — No HTTP server. Claude Code connects directly over SSH:
```json
{
  "mcpServers": {
    "docs": {
      "command": "ssh",
      "args": ["user@server", "cd docpull && bun run src/server.ts"]
    }
  }
}
```

Option 2: Local — Run the server locally over stdio:
```json
{
  "mcpServers": {
    "docs": {
      "command": "bun",
      "args": ["run", "/path/to/docpull/src/server.ts"]
    }
  }
}
```

Getting Started
```bash
git clone https://github.com/raintree-technology/docpull
cd docpull
bun install

# Fetch docs
pip install docpull
docpull https://docs.example.com --profile rag -o docs/example/

# Set up Postgres with pgvector, then:
bun run db:push
bun run ingest docs/example example

# Run server
bun run src/server.ts
```

Add the server to your Claude Code config and you have semantic search over your documentation.