KatDocExtract (Mistral-compatible Document OCR Wrapper using DeepSeek-OCR)

A Mistral-API compatible document parsing server that converts PDFs, images, and Office documents into clean, formatted Markdown using DeepSeek-OCR.

This project is designed as a drop-in OCR replacement for LLM pipelines (RAG, chatbots, Open WebUI, etc.) that need high-quality extraction from complex documents.

Important: This repository does not ship the model runtime. The Docker Compose setup includes a vLLM container, but you can also point the wrapper at an external vLLM instance if you already run one.
vLLM recipe (DeepSeek-OCR): https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-OCR.html

Features

Mistral Compatibility
The API structure mirrors Mistral OCR-style endpoints, making integration with existing tools easy.
Smart Hybrid PDF Parsing
- Digital PDFs: Extracts text from the PDF text layer (fast & accurate), while using Vision AI only for figures/charts.
- Scanned PDFs / broken encoding: Automatically detects scanned pages or bad encoding (“mojibake”) and switches to full Vision OCR.
Office Support (.docx, .pptx, …)
Converts Office docs to PDF via LibreOffice, then processes the resulting PDF.
High Performance (Async Map-Reduce)
Pages and figure crops are processed concurrently, bounded by a configurable vLLM concurrency limit.
Figure Understanding Detects charts/diagrams, crops them, and asks the Vision model to describe them, inserting the description into the Markdown flow.

How it works (Architecture)

vLLM serves deepseek-ai/DeepSeek-OCR as an OpenAI-compatible /v1 endpoint.
This server accepts documents and:
- For PDFs: decides per page whether to use text-layer extraction or Vision OCR fallback.
- For figures: crops and re-queries the model for figure descriptions (optional).
Returns a structured response with Markdown pages (and optional image/table objects).

Prerequisites

Recommended (Docker Compose)

Docker + Docker Compose
NVIDIA GPU + drivers (strongly recommended for vLLM)

Manual (no Docker)

Python 3.13+ recommended
uv (recommended) or pip/venv
Optional but recommended:
- LibreOffice (soffice) for Office docs (.doc/.docx/.ppt/.pptx/.odt/.odp)
System libs:
- Uses PyMuPDF (fitz) for PDF rendering

Note: The provided Dockerfile already includes LibreOffice and common fonts.

Installation

There are two ways to run this:

Recommended: Docker Compose (includes vLLM + the wrapper)
Manual: run vLLM + the API yourself (useful for dev or existing infra)

Option A: Docker Compose (recommended)

This starts both services:

vllm (OpenAI-compatible /v1 endpoint serving deepseek-ai/DeepSeek-OCR)
ocr-api (this wrapper, exposed on http://localhost:3001)

1) Configure

Copy and edit the environment file:

cp .env.example .env
# Edit .env (optionally set VLLM_MODEL, CONCURRENT_REQUEST_LIMIT, heuristic knobs, etc.)

Notes:

The wrapper talks to vLLM inside the Compose network. compose.yml sets:
- VLLM_BASE_URL=http://vllm:8000/v1
Hugging Face cache is persisted:
- ~/.cache/huggingface:/root/.cache/huggingface
File storage is persisted:
- ./files_store:/data/files_store

2) Start the stack

docker compose up -d --build

3) Open the Web UI

http://localhost:3001

Useful commands

View logs:

docker compose logs -f

Stop:

docker compose down

Option B: Manual setup (no Docker)

Use this if you already run vLLM elsewhere, want to develop locally, or prefer a venv-based workflow.

1) Run your own vLLM backend (required)

This wrapper expects an OpenAI-compatible server at VLLM_BASE_URL (defaults to http://localhost:8000/v1).

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Serve DeepSeek-OCR (OpenAI-compatible)

vllm serve deepseek-ai/DeepSeek-OCR \
  --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0

Why these flags?

The custom logits processor is important for best OCR/Markdown behavior.
Prefix caching and multimodal processor caching usually don’t help OCR-style single-turn calls and may add overhead.

2) Install and run KatDocExtract

We recommend using uv.

Install

uv sync

Configure

Copy and edit the environment file:

cp .env.example .env
# Edit .env and set VLLM_BASE_URL (and optionally VLLM_MODEL)

.env.example (important fields)

VLLM_BASE_URL – your vLLM OpenAI-compatible endpoint (e.g. http://localhost:8000/v1)
VLLM_MODEL – model name (default deepseek-ai/DeepSeek-OCR)
CONCURRENT_REQUEST_LIMIT – max concurrent requests to vLLM (tune for GPU memory)

Run

uv run uvicorn app:app --host 0.0.0.0 --port 8001 --reload

Open the Web UI:

http://localhost:8001

API Usage

Endpoint: `POST /v1/ocr`

This is the main endpoint. It accepts a document object plus optional parsing controls.

Minimal example (remote PDF URL):

curl -X POST "http://localhost:3001/v1/ocr" \
  -H "Content-Type: application/json" \
  -d '{
    "document": {
      "type": "document_url",
      "document_url": "https://example.com/report.pdf"
    },
    "pdf_mode": "auto",
    "inline_figure_text": true
  }'

Minimal example (remote image URL):

curl -X POST "http://localhost:3001/v1/ocr" \
  -H "Content-Type: application/json" \
  -d '{
    "document": {
      "type": "image_url",
      "image_url": {"url": "https://example.com/scan.png"}
    }
  }'

Tip: This server also supports data: URLs (base64-encoded) for document_url and image_url. The built-in web UI uses data: URLs under the hood.

PDF modes and heuristics

`pdf_mode`

"auto" (default): per-page decision using heuristics
"text": always use the PDF text layer (fast; fails on scans/bad encoding)
"ocr": always render pages and run Vision OCR (slower; best for scanned PDFs)

Heuristic knobs (in `.env`)

MIN_TEXT_DENSITY If a page’s text covers less than this fraction of the page, it is treated as scanned.
MAX_MOJIBAKE_RATIO If too many replacement/control characters appear in the text layer, the page falls back to OCR.
PDF_TEXT_RENDER_DPI DPI used when rendering PDF pages for OCR fallback and figure crops.

Figure extraction

If figure extraction is enabled (default behavior in the UI flow), the pipeline will:

Detect image regions (either from PDF blocks in text-layer mode, or from DeepSeek OCR RefDet tags in OCR mode)
Crop those regions
Optionally ask DeepSeek-OCR to describe them and insert the description inline in Markdown

Useful request fields:

inline_figure_text (bool, default true): inline descriptions as blockquotes instead of image links
figure_prompt (string): override the default prompt for figure description
image_min_size (int): ignore very small crops
image_limit (int): only keep the largest N images per page

Office documents

Office formats are supported by converting them to PDF using LibreOffice.

If you use Docker, LibreOffice is already included in the image.

Manual install example for Debian/Ubuntu:

sudo apt-get update && sudo apt-get install -y libreoffice

If LibreOffice is missing, Office conversion requests will return an error.

Performance tuning

Concurrency

CONCURRENT_REQUEST_LIMIT controls how many requests this server sends to vLLM concurrently.

If you see GPU OOM, reduce it (e.g. 10–30).
If your GPU has headroom, increase it for throughput.

DPI tradeoff

Higher PDF_TEXT_RENDER_DPI improves OCR quality but increases:

rendering time
GPU compute
memory usage (larger images)

150 DPI is a good starting point; 200–300 may help for small text in scans.

Troubleshooting

vLLM connection errors

Verify VLLM_BASE_URL points to your running vLLM server and includes /v1
Confirm the model name matches what you serve (VLLM_MODEL)

Timeouts on large PDFs

Reduce CONCURRENT_REQUEST_LIMIT
Consider lowering DPI
Ensure your vLLM server timeout is high enough (this project uses a long client timeout)

Office conversion fails

Ensure soffice / LibreOffice is installed and available in PATH
Check server logs for LibreOffice stderr output

Development notes

FastAPI app entrypoint: app.py
PDF strategy: PdfEngine + heuristics.PageAnalyzer
OCR + figure tasks: pipeline.py
vLLM OpenAI-compatible client: deepseek_client.py

License

TBD

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github		.github
static		static
templates		templates
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
app.py		app.py
compose.yml		compose.yml
config.py		config.py
deepseek_client.py		deepseek_client.py
document_io.py		document_io.py
files_api.py		files_api.py
files_signing.py		files_signing.py
files_store.py		files_store.py
heuristics.py		heuristics.py
models.py		models.py
parse_utils.py		parse_utils.py
pdf_engine.py		pdf_engine.py
pipeline.py		pipeline.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

KatDocExtract (Mistral-compatible Document OCR Wrapper using DeepSeek-OCR)

Features

How it works (Architecture)

Prerequisites

Recommended (Docker Compose)

Manual (no Docker)

Installation

Option A: Docker Compose (recommended)

1) Configure

2) Start the stack

3) Open the Web UI

Useful commands

Option B: Manual setup (no Docker)

1) Run your own vLLM backend (required)

Install vLLM

Serve DeepSeek-OCR (OpenAI-compatible)

2) Install and run KatDocExtract

Install

Configure

Run

API Usage

Endpoint: POST /v1/ocr

PDF modes and heuristics

pdf_mode

Heuristic knobs (in .env)

Figure extraction

Office documents

Performance tuning

Concurrency

DPI tradeoff

Troubleshooting

vLLM connection errors

Timeouts on large PDFs

Office conversion fails

Development notes

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Endpoint: `POST /v1/ocr`

`pdf_mode`

Heuristic knobs (in `.env`)

Packages