fffoivos (Collaborator) commented on Mar 17, 2026
- Fixes HTML interstitial handling so challenge/viewer pages are not recorded as successful downloads.
- Adds browser-gated download support with standard, auto, and browser routes plus policy-driven selection. Closes #90 ("Standard static scraping methods are failing to retrieve documents, due to non-static URL structures").
- Adds a guided installer and browser dependency wiring.
- Simplifies the DeepSeek OCR stack.
- Fixes editable installs and chunk merging.
- Expands docs with pipeline architecture, stage references, and Pages support.
docs: document pipeline artifact contract and runtime outputs
Flip DEFAULT_RUNTIME_BACKEND from "transformers" to "vllm". The transformers backend is currently broken: DeepSeek-OCR-2's bundled modeling_deepseekv2.py imports LlamaFlashAttention2 from transformers.models.llama.modeling_llama, which was removed upstream in transformers >= 4.46. vllm 0.18.0 transitively pulls in transformers >= 4.57, so anyone running the documented setup with the old default hits an ImportError before any OCR happens.

Drop the explicit transformers and tokenizers pins from the deepseek extra: both come in transitively via vllm at the versions vllm requires, so the explicit pins were redundant.

Add docs/operations/deepseek_runtime_contract.md documenting the supported backend, the page-level skip guard, and how to add a new backend.

Verified on 2× A100 SXM4 40GB: 10 OpenArchives PDFs, 683 pages, exact_fill scheduling, vLLM wall time 276 s (0.65-0.76 s/page per GPU).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the RuntimeError that aborted the entire batch when a single PDF produced empty markdown with a per-document empty_markdown=True metric and a warning log. The other documents in the batch now finish successfully and the empty case is observable downstream via the metrics JSON. Forward-port of 6ce0d9c from codex/ocr-env-fix, rebased onto current dev's runner.py layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
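The batch-survival behavior described above can be sketched as follows. This is a minimal illustration, not the runner's actual code; `record_document_result` and the metrics shape are hypothetical names standing in for the real per-document metrics JSON.

```python
import logging

logger = logging.getLogger("ocr.runner")

def record_document_result(doc_id: str, markdown: str, metrics: dict) -> dict:
    """Flag an empty OCR result per-document instead of raising.

    A RuntimeError here would abort the whole batch; setting
    empty_markdown=True and logging a warning lets the other documents
    finish while keeping the empty case observable downstream.
    """
    entry = {"doc_id": doc_id, "empty_markdown": not markdown.strip()}
    if entry["empty_markdown"]:
        logger.warning("document %s produced empty markdown; continuing batch", doc_id)
    metrics[doc_id] = entry
    return entry
```

The key design point is that emptiness becomes data in the metrics JSON rather than control flow.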
Forward-port test_runner_resolves_standard_vllm_defaults_when_omitted from codex/ocr-vllm-defaults-refactor. Asserts that calling run_for_files with runtime_backend="vllm" and explicit None for render_dpi / gpu_memory_utilization resolves to DEFAULT_RENDER_DPI and DEFAULT_GPU_MEMORY_UTILIZATION respectively. Pins the contract that "None means default" rather than "None means unset". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
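The "None means default" contract the test pins can be sketched like this. The constant values below are illustrative placeholders, not the real values from defaults.py.

```python
# Illustrative values; the real constants live in defaults.py.
DEFAULT_RENDER_DPI = 200
DEFAULT_GPU_MEMORY_UTILIZATION = 0.9

def resolve_runtime_args(render_dpi=None, gpu_memory_utilization=None):
    """'None means default': an explicit None resolves to the module
    default rather than being treated as 'unset' and passed through."""
    return (
        DEFAULT_RENDER_DPI if render_dpi is None else render_dpi,
        DEFAULT_GPU_MEMORY_UTILIZATION
        if gpu_memory_utilization is None
        else gpu_memory_utilization,
    )
```

Callers can therefore pass explicit None without having to know the defaults themselves.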
…r-pipeline-20260425

Brings 8 new Rust modules into the cleaner crate as additive code. The modules compile and their internal tests can run, but they are NOT yet wired into Corpus.clean()'s production call path — that's Stage 3.

New modules:
- charset_module.rs (+707) — analyze_charset, non_empty_line_stats
- cmark_gfm_oracle.rs (+408) — cmark-gfm subprocess oracle for verification
- latex_module.rs (+1674) — LaTeX-syntax-aware detection + cropping
- md_format.rs (+595) — Pilot A (format_parsed) + dual_verify
- md_format_surgical.rs (+1053) — Pilot B (parser-backed surgical Phase A)
- md_module.rs (+1641) — MD-syntax-aware Phase A detectors
- md_verify.rs (+1158) — pulldown-cmark equivalence verifier
- normalize.rs (+2022) — separated normalize passes (fold, bucket, etc.)

New deps in Cargo.toml: comrak 0.26 (no default features), pulldown-cmark 0.11 (html-only). Both are pure-Rust, no system deps. Removes the duplicate [tool.maturin] table that was triggering "unused manifest key" warnings (it lives in pyproject.toml).

lib.rs adjustments:
- New module declarations (mod foo;) so the modules compile.
- #![allow(dead_code)] at crate level — many of the new modules' helpers are unused until Stage 3 wires them in; this suppresses the noise.
- New module-doc-comment header explaining the cleaner/noise crate boundary and the production Phase A choice (Pilot B + dual_verify).
- Does NOT add new PyO3 surface registrations — those land in Stage 3 alongside the cleaning_module.rs rewrite that exposes them.
- Pilot A's format_parsed_py and the dev-only dual_verify_py exports are intentionally NOT registered — they will not ship to production.

Build status:
- cargo check passes with 2 pre-existing dev warnings (DetailedTableIssueReportEntry privacy + non_local_definitions in table_analysis_module). These warnings were on dev before this commit; they will be cleaned up in Stage 1.2.
Stage 1.2 (next commit on this branch): excise dead alternatives from the imported modules — Pilot A's format_parsed (md_format.rs) and the LineBased path's normalize_md_syntax (md_module.rs). dual_verify stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ython tooling + docs
This is the visible behavior change: Corpus.clean()'s Phase A now defaults
to Pilot B (PhaseAMode::ParserSurgicalVerified → format_surgical_checked),
parser-backed and dual-verifier-protected. Per CLEANER_PIPELINE_CLEANUP_PLAN_2026-04-25.
## Rust crate changes
cleaning_module.rs: 744 → 3063 lines. Major rewrite per the cleanup plan:
- Per-char ops collapsed to 2 groups (Group 1 STRIP / Group 2 FOLD).
fold_codepoint absorbs Adobe Symbol PUA decode + µ→μ. soft-hyphen strip
absorbed into is_unicode_noise_char (per-line). Pre-pass shrinks to
HTML entities + base64 image strip + Phase A.
- Group 1 STRIP narrowed to non-European / extraction-noise ranges. Latin-1
Supp + Latin-Ext-A + Cyrillic + Cyrillic-Supp now KEPT entirely.
Latin-Ext-B kept except Romanian comma-below {Ș, ș, Ț, ț}.
- Unified Rule B regex covers GLYPH<…>, glyph<c=…,font=/…>, /[A-Z]{6}+FontName,
/uniXXXX, /g(id)?N. Rule A's 50 PostScript-name literals contribute to the
same count+coverage gate (≥10 hits AND ≥9% coverage → line-drop). Bare-
word matchers (GLYPH, hyphenminus, font, glyph as plain words) deleted.
- R1∪R2 residue range narrowed to U+0180..U+024F minus Romanian to match
Group 1's policy.
- Per-rule counters in CleanStats: rule_a_match_count, rule_b_match_count,
residue_line_drop_count. Production drivers source these directly,
eliminating second matcher invocation per row.
- Per-doc 4-way char accounting: content_chars_kept,
chars_dropped_by_{line_drop, normalization, per_char_filter}, plus marker
passthrough/added split.
- PhaseAMode enum + core_clean_text_with_stats_with_mode + PyO3 phase_a_mode
arg. Default flipped to ParserSurgicalVerified.
- format_surgical_checked populates phase_a_fallback_reason and
phase_a_dialect_ambiguous_input in CleanStats.
- Corpus.clean / clean_text policy parity: both call build_script_char_sets.
Fixes silent bug where directory pipeline stripped punct/digits when
callers passed restricted scripts_to_keep.
- Post-loop \n{3+} → \n\n collapse (CommonMark renders any blank-line run
as one block separator; bytes go into chars_dropped_by_normalization).
- Bug 1 fix: token-category exporter byte-vs-char offsets (Greek-prefixed
input was silently dropping rows). Now emits CHAR offsets at the export
boundary; internal byte offsets retained for Rust slicing.
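The Rule A/B count+coverage gate described above can be sketched in Python. The regex and thresholds below are illustrative stand-ins, not the real unified Rule B pattern from cleaning_module.rs.

```python
import re

# Illustrative pattern in the spirit of the unified Rule B regex
# (GLYPH<...>, /uniXXXX, /g(id)?N); the real pattern and thresholds
# live in cleaning_module.rs.
RULE_B = re.compile(r"GLYPH<[^>]*>|/uni[0-9A-Fa-f]{4}|/g(?:id)?\d+")

def line_drop(line: str, min_hits: int = 10, min_coverage: float = 0.09) -> bool:
    """Drop a line only when glyph-noise matches are both numerous
    (>= min_hits) and cover a meaningful fraction of the line's chars
    (>= min_coverage). Either condition alone is not enough, which
    protects lines that merely mention a glyph token in passing."""
    matches = RULE_B.findall(line)
    if len(matches) < min_hits:
        return False
    covered = sum(len(m) for m in matches)
    return covered / max(len(line), 1) >= min_coverage
```

Combining a count gate with a coverage gate is what keeps the rule from firing on legitimate prose that happens to contain a few matching tokens.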
lib.rs: full PyO3 surface registration for the new modules:
clean_text, clean_text_with_stats, analyze_charset, non_empty_line_stats,
crop_latex_repetitions_py, verify_md_preview_equivalent_py,
verify_md_structural_py, phase_a_alteration_stats, apply_phase_a,
phase_a_stats_jsonl_line, cmark_gfm_verify_py, format_surgical_py,
format_surgical_checked_py, phase_a_policy_py.
DELIBERATELY EXCLUDED (per user direction "only keep Pilot B"): the dev-only
format_parsed_py (Pilot A) and dual_verify_py (dev-only oracle exposure)
PyO3 registrations. The Rust dual_verify function STAYS — it's used by
format_surgical_checked.
noise crate: +1,360 lines (token-category review/debug exports + 3-counter
infrastructure used by the new cleaning_scripts/). Cleaner crate has zero
Cargo.toml dep on noise — boundary enforced at compile time.
## Python orchestration tooling (cleaning_scripts/ — 8 new files, +1933 LOC)
- analyze_cleaning_concentration.py — per-dataset / per-doc cleaning concentration
- analyze_cleaning_distributions.py
- analyze_quality_vs_deletions.py
- clean_and_stats_rowsharded.py — production HPLT driver, per-row clean+stats
- pull_deletion_band_samples.py — stratified band sampler
- regenerate_samples.py
- smoke_tests/test_rust_extensions_smoke.py — exercises new PyO3 surface
- validate_gzipped_shards.py — verifies post-clean shards byte-identical to
squash(clean_text_with_stats(raw, …))
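The byte-identity check that validate_gzipped_shards performs can be sketched as below. `shard_matches_recompute` is a hypothetical name; the real script's recompute step runs the full squash + clean_text_with_stats pipeline, which a plain string stands in for here.

```python
import gzip
import io

def shard_matches_recompute(shard_gz_bytes: bytes, recomputed_text: str) -> bool:
    """Decompress a post-clean shard and compare it byte-for-byte
    against freshly recomputed cleaned text. Any divergence, even a
    single byte, fails the check."""
    with gzip.open(io.BytesIO(shard_gz_bytes), "rb") as fh:
        stored = fh.read()
    return stored == recomputed_text.encode("utf-8")
```

Comparing decompressed bytes rather than gzip streams avoids false negatives from non-deterministic gzip headers (mtime, compression level).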
## Token-category review tooling (src/glossapi/scripts/ — 6 new files, +3003 LOC)
- aggregate_token_category_reviews.py
- build_token_category_review_bundle.py
- export_token_category_debug.py
- export_token_category_debug_parquet.py
- review_token_category_with_gemini.py — Gemini review driver
- token_category_debug_common.py
google-genai>=1.30.0 added to core deps for the Gemini reviewer.
## Architecture / changelog docs (6 files, +2683 LOC)
- rust/glossapi_rs_cleaner/CHANGES_2026_04_22.md (3-counter wave)
- rust/glossapi_rs_cleaner/CHANGES_2026_04_25.md (Pilot B + cleanup wave)
- rust/glossapi_rs_cleaner/docs/MD_MODULE_ARCHITECTURE.md
- rust/glossapi_rs_cleaner/docs/MD_MODULE_ARCHITECTURE_IMPLEMENTATION_REVIEW_2026-04-24.md
- rust/glossapi_rs_cleaner/docs/PHASE_A_PARSER_BACKED_IMPLEMENTATION_REVIEW_2026-04-24.md
- rust/glossapi_rs_cleaner/docs/PHASE_A_PARSER_BACKED_INDEX.md
## phase_clean.py: +240 lines
Python-side wiring for PhaseAMode arg, clean_text_with_stats call,
build_script_char_sets policy parity. Does NOT touch clean_ocr() — that's
intentional per noise-profile rule (DeepSeek OCR doesn't produce Docling-
class glyph/mojibake noise; routing the new Docling-tuned machinery through
clean_ocr would be wrong by default).
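The noise-profile routing rule above can be made concrete with a small sketch. The function and the string tags are illustrative, not the real API; the actual decision is made by which Corpus method the caller invokes.

```python
def select_phase_a(source: str):
    """Route text to Phase A machinery based on its extraction source.

    Docling-extracted text gets the parser-backed Phase A (Pilot B,
    dual-verifier-protected). DeepSeek OCR output gets none of it,
    because it doesn't produce Docling-class glyph/mojibake noise and
    running the Docling-tuned machinery on it would be wrong by default.
    """
    if source == "docling":
        return "parser_surgical_verified"
    if source == "deepseek_ocr":
        return None  # clean_ocr() path deliberately untouched
    raise ValueError(f"unknown extraction source: {source}")
```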
## tests/test_corpus_clean_enhancements.py
Cleanup branch's version (1912 lines vs dev's 1724) — 188 line additions
covering the new policy parity + per-rule counters + Pilot B fallback
shape + the new build_token_category_review_bundle + the new score schema.
## Phase 1 changes preserved
- DEFAULT_RUNTIME_BACKEND = "vllm" stays (defaults.py untouched by cleanup branch)
- pyproject.toml's deepseek extra stays cleaned-up (transformers/tokenizers
pins remain dropped — vllm pulls them transitively)
## What's NOT in this commit (deferred)
- Stage 4 (clean_ocr surgical carve-outs — surface empty_markdown field):
separate commit.
- Stage B (cleaner-side ocr_render.py extraction from faa1362): separate PR.
- Final dead-code excision (delete format_parsed body in md_format.rs +
normalize_md_syntax in md_module.rs): final cleanup pass after integration.
## Build status
cargo check: both crates build with 2 pre-existing dev warnings (table_analysis
private-interface + non_local_definitions in pymethods macro). Production
behavior verified compiles; runtime validation requires a real corpus run
(per user's Stage 3 acceptance: 100-doc end-to-end on
openarchives.gr.part-00000.parquet showing 0 doc drops, ≥17% chars removed,
no Greek-text quality regressions on a 5-page spot-check).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
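The Stage 3 acceptance thresholds quoted above (0 doc drops, >= 17% chars removed) can be expressed as a simple check. This is an illustrative sketch; the Greek-text quality spot-check is manual and not modeled here.

```python
def acceptance_ok(total_docs: int, dropped_docs: int,
                  chars_in: int, chars_out: int) -> bool:
    """Stage 3 acceptance gate: no documents dropped end-to-end, and
    the cleaner removed at least 17% of input characters."""
    removed_frac = (chars_in - chars_out) / chars_in if chars_in else 0.0
    return dropped_docs == 0 and removed_frac >= 0.17
```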
Spotted by cargo test failure on test_empty_content_with_remove_op:

1. table_remover_module.rs (4-line fix): add an early-return on empty input so an empty file with a remove op yields empty output instead of a `<!-- table-removed -->` marker. Per cleanup wave's Bug 2 set.
2. directory_processor.rs: route Corpus.clean()'s analysis report path through cleaning_module::build_script_char_sets — the same policy builder as clean_text / clean_text_with_stats. Fixes the Point 8 silent bug where the directory pipeline stripped ASCII punct + digits when callers passed restricted scripts_to_keep. Plus pub(crate) on DetailedTableIssueReportEntry to fix the privacy-leak warning.
3. table_analysis_module.rs: add #![allow(non_local_definitions)] for the pyo3 0.19 #[pymethods] macro. Per cleanup wave's lint-posture fix — silences the warning until pyo3 is upgraded.

cargo test --release on the cleaner crate now reports: 385 passed; 0 failed; 3 ignored (matches the cleanup branch's measured test outcome).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
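The empty-input early-return in fix 1 is small enough to sketch in a few lines. This is a Python illustration of the control-flow shape, not the Rust table remover itself; `apply_remove_op` and its trivial body are stand-ins.

```python
def apply_remove_op(text: str, marker: str = "<!-- table-removed -->") -> str:
    """Shape of the table_remover fix: bail out on empty input before
    any marker is emitted, so an empty file with a remove op produces
    empty output rather than a stray removal marker."""
    if not text:
        return ""
    # Stand-in for the real removal logic, which replaces the matched
    # table body with the marker.
    return marker
```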
Per user direction "We dont need Gemini reviewer scripts. This is still bloat for me. Just keep the basics." — remove the 6 src/glossapi/scripts/ files that exist exclusively for the Gemini-driven review-bundle workflow, plus the google-genai dep and the test that imported them.

Removed:
- src/glossapi/scripts/aggregate_token_category_reviews.py (456 LOC)
- src/glossapi/scripts/build_token_category_review_bundle.py (574 LOC)
- src/glossapi/scripts/export_token_category_debug.py (71 LOC)
- src/glossapi/scripts/export_token_category_debug_parquet.py (184 LOC)
- src/glossapi/scripts/review_token_category_with_gemini.py (1429 LOC)
- src/glossapi/scripts/token_category_debug_common.py (289 LOC)

Dependency removed:
- google-genai>=1.30.0 from pyproject.toml core deps (was only pulled in for the Gemini reviewer)

Test removed:
- test_build_token_category_review_bundle_materializes_cases (referenced the dropped script directly)

What stays:
- glossapi_rs_noise crate's match_token_category_debug_text PyO3 surface (kept for any future debug/discovery caller; no current Python script uses it after this commit)
- Corpus.clean_token_category_debug Python method (uses the PyO3 surface for per-page category breakdown)
- test_clean_token_category_debug_exports_synthetic_pages test (covers the PyO3 surface end-to-end)
- cleaning_scripts/clean_and_stats_rowsharded.py's mention of token_category in a code comment (no actual call — it describes prior matcher behavior that was eliminated by Point 7 of the cleanup plan)

Net delta: 3003 LOC removed across 6 script files + 88 LOC of tests removed + 1 dep removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… Pilot B remains)

Per user direction "only keep Pilot B for md reformatting not a etc", remove all dead Phase-A alternatives now that Pilot B (format_surgical_checked) is the production default.

## md_format.rs (594 → 299 lines)

Removed:
- format_parsed (the Pilot A function — abandoned per CHANGES_2026_04_25.md after the 2026-04-24 90-doc audit found 50/66 failures from comrak's whole-doc round-trip over-normalizing list markers, link forms, and escapes; Pilot B's surgical approach supersedes it).
- format_parsed_py (PyO3 export of Pilot A).
- dual_verify_py (PyO3 export — dev-only oracle exposure; the underlying dual_verify Rust function STAYS as crate-internal because format_surgical_checked depends on it).
- All ~30 Pilot-A test fixtures.

Kept (the only reason this module still exists):
- dual_verify + DualVerifyReport + pulldown_render / comrak_render / collapse_ws helpers + common_prefix_len. Used by md_format_surgical::format_surgical_checked.

## md_module.rs (1641 → 1071 lines)

Removed:
- normalize_md_syntax (the LineBased Phase A orchestrator).
- normalize_md_syntax_with_stats (instrumented LineBased variant).
- apply_phase_a (PyO3 wrapper of normalize_md_syntax).
- phase_a_stats_jsonl_line (PyO3 export — JSONL writer using LineBased stats; the bench script that called it was already dropped from the cleanup branch's "production-essential triage").
- phase_a_alteration_stats (PyO3 export — dict writer for LineBased).
- push_json_str / fmt_finite_f64 (helpers used only by the LineBased PyO3 exports).
- PhaseAStats struct (return type of normalize_md_syntax_with_stats).
- 25 LineBased tests.
- pyo3 imports (no PyO3 surface remains in this module).

Kept:
- non_destructive_canonicalize: used by md_verify. Its final Phase A step now calls format_surgical (Pilot B unchecked) directly instead of normalize_md_syntax — keeping the same "maximal canonical form" purpose.
- is_code_fence_marker: called from cleaning_module's per-line loop.
- All MD-syntax-aware helpers (leading_columns, normalize_separator_line, scan_gfm_table_separators, parse_gfm_separator_row, count_gfm_row_cells, collapse_blank_line_runs, reflow_paragraphs, reflow_paragraphs_with_count, can_join_lines, line_is_hard_break) — used internally by Pilot B's machinery.

## cleaning_module.rs

- core_clean_text_with_stats_with_mode's match arm collapsed: only Pilot B (format_surgical_checked) is called now.
- phase_a_mode parameter renamed to _phase_a_mode (kept in the signature for back-compat with PyO3 callers + tests; accepts any value, ignores it).
- 6 LineBased-pinned tests removed (accounting_normalization_tracks_separator_collapse, accounting_escaped_underscore_run_buckets_but_stays_as_underscores, accounting_long_escaped_underscore_run_buckets_to_20, accounting_mixed_doc_invariant_holds, core_clean_text_composite_roundtrip, core_clean_text_normalizes_separator_line) plus the linebased_clean_text + linebased_clean_text_with_stats test helpers.
- The PhaseAMode enum is intentionally KEPT (all 3 variants stay) for back-compat with: (a) PyO3 callers still passing a `phase_a_mode` kwarg, and (b) the phase_a_mode kwarg signature on clean_text + clean_text_with_stats. All variants now route to the same Pilot B path. A follow-up PR can collapse the enum + drop the kwarg cleanly.

## lib.rs

PyO3 registrations removed for the dropped functions:
- format_parsed_py, dual_verify_py (dropped earlier in this PR).
- format_surgical_py (Pilot B WITHOUT the oracle — dev-only, not appropriate for production exposure).
- apply_phase_a, phase_a_alteration_stats, phase_a_stats_jsonl_line (LineBased instrumentation — the function bodies are gone).

The production PyO3 surface is now: clean_text, clean_text_with_stats, analyze_charset, non_empty_line_stats, crop_latex_repetitions_py, verify_md_preview_equivalent_py, verify_md_structural_py, cmark_gfm_verify_py, format_surgical_checked_py, phase_a_policy_py, plus the existing pipeline + table + directory_processor surfaces.

## Tests

cargo test --release: 325 passed; 0 failed; 3 ignored (was 385 passed before the excision; the 60 removed tests covered LineBased and Pilot A code paths that no longer exist).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
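The "accepts any value, ignores it" back-compat shape for the mode parameter can be sketched in Python. The enum variant names and the identity-like cleaner body below are illustrative, not the real Rust definitions.

```python
from enum import Enum

class PhaseAMode(Enum):
    """Kept-for-back-compat enum; variant names are illustrative.
    After this commit all variants route to the same Pilot B path."""
    LINE_BASED = "line_based"
    PARSER_FULL = "parser_full"
    PARSER_SURGICAL_VERIFIED = "parser_surgical_verified"

def format_surgical_checked(text: str) -> str:
    # Stand-in for Pilot B; the real function is parser-backed and
    # dual-verifier-protected.
    return text.strip()

def core_clean_text(text: str, _phase_a_mode=None) -> str:
    """The mode argument stays in the signature so existing PyO3
    callers don't break, but its value is ignored: only Pilot B runs."""
    return format_surgical_checked(text)
```

Keeping the parameter while ignoring it lets the enum and kwarg be removed in a later, purely mechanical follow-up PR.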
…om phase_clean.py

Per the cleaner-integration plan's Stage B + the cleaner correctness axes memory rule "latest changes are canonical".

## What's extracted

11 functions moved from `src/glossapi/corpus/phase_clean.py` into the new `src/glossapi/corpus/ocr_render.py`:
- _gap_has_at_most_n_nonwhitespace_chars
- _clean_fill_for_removed_span
- _merge_labeled_raw_spans
- _summarize_merged_labeled_spans
- _render_page_from_merged_labeled_spans
- _render_page_with_labeled_spans_result
- _render_page_with_labeled_spans
- _annotate_page_with_labeled_spans
- _utf8_prefix_byte_offsets
- _span_repeat_count
- _build_match_index_rows

These collectively own the analyzer/renderer separation: phase_clean.py decides WHAT spans exist; ocr_render.py renders HOW those spans become page text and debug sidecars.

## Body resolution

Per the function-by-function comparison done in the cleaner-integration audit, 5 functions were an EXACT match vs faa1362 and 6 were a DIFF. Per the "latest changes canonical" rule, this PR uses **dev's bodies for all 11** (dev's Apr 14 OCR speedup wave is later than faa1362's Apr 12 work). The faa1362-only helper `_build_debug_match_open_tag` is NOT brought over because dev's `_render_page_from_merged_labeled_spans` (73 lines vs faa1362's 44) inlines the equivalent logic and doesn't need the helper.

## Also added

`src/glossapi/corpus/text_surface_metrics.py` (48 lines, from faa1362 verbatim — a new module): `sanitized_char_count` + `_strip_latex_envs_for_char_count`. Shared "published-surface" metric helpers used by the export-facing metadata refresh.

## phase_clean.py changes

- 11 function definitions removed (~330 lines).
- New imports: `from .ocr_render import (...)` (11 names) and `from .text_surface_metrics import sanitized_char_count`.
- Net: phase_clean.py 4929 → 4597 lines.

## docs/architecture/ocr_cleaning_runtime.md

Padded from 118 → 186 lines with the previously-missing sections:
- Code Layout (now accurate — describes the modules that exist)
- Stage Boundary: clean_ocr() vs clean()
- Field Ownership (OCR-owned vs clean/export-owned parquet fields)

## Verification

- python3 ast.parse: ocr_render.py OK, text_surface_metrics.py OK, phase_clean.py OK.
- Direct module import + sanitized_char_count smoke: works correctly.
- pytest tests/test_corpus_clean_enhancements.py: 67 passed, 2 failed. Both failures (test_clean_flags_uppercase_glyph_noise and test_clean_token_category_debug_exports_synthetic_pages) are PRE-EXISTING dev failures — verified by running the same tests against unmodified origin/development. NOT introduced by this extraction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
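The char-vs-byte offset mapping behind a helper like `_utf8_prefix_byte_offsets` can be sketched as below. This is an assumed shape inferred from the name, not the module's actual body: a prefix table that lets byte offsets produced on the Rust side be translated into char offsets at export boundaries.

```python
def utf8_prefix_byte_offsets(text: str) -> list:
    """offsets[i] is the UTF-8 byte length of the first i characters
    of text. With this table, a Rust byte offset can be mapped back to
    a Python char offset (and vice versa) without re-encoding slices,
    which matters for non-ASCII (e.g. Greek) text where the two
    offset systems diverge."""
    offsets = [0]
    total = 0
    for ch in text:
        total += len(ch.encode("utf-8"))
        offsets.append(total)
    return offsets
```

For pure-ASCII text the table is just 0..len(text); for Greek text each character contributes 2 bytes, which is exactly the divergence behind the Bug 1 exporter fix described in the earlier cleanup commit.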
Promote the previously-uncommitted survey from the cleanup branch worktree
into docs/architecture/, matching the project's lowercase-with-underscores
naming convention (was MD_LIBRARY_SURVEY_LEARNINGS_2026-04-24.md, now
markdown_library_survey.md).
Content rounded for fit:
- Title is now noun-phrase Title Case ("Markdown Library Survey") with no
date stamp, matching peers (ocr_cleaning_runtime.md, etc.).
- Reframed from "addendum + recommendations" to "design rationale +
outcomes" — every recommendation now annotated as ✅ landed or ⏳ open
so a future reader sees what shipped vs what remains.
- Section "Strategic value of a wholesale parser-backed direction" trimmed
(the question it asked has been answered — Pilot B shipped).
- "Open implementation directions" trimmed to only items still open
(pseudo-table semantic transform, raw-readability metrics, lint-style
diagnostics).
- Added "See also" with cross-references to the production files
(md_format_surgical.rs, md_format.rs, cmark_gfm_oracle.rs,
ocr_cleaning_runtime.md).
Updated docs/architecture/index.md to link to the new doc under "pressure
points are documented separately in".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… word-repeat window default

Forward-port of two Apr 17 uncommitted edits from the deleted ocr-env-fix worktree.

## 1. Skip chunk markdown when canonical doc exists

When the OCR runner emits both a canonical merged `doc.md` and per-page-range chunk outputs `doc__pNNNNN-NNNNN.md`, the cleaner used to clean BOTH — double-counting the same content in metadata. Both `Corpus.clean()` and `Corpus.clean_ocr()` now skip `__p`-suffixed chunks when the canonical doc is present in the same input directory.

Test: test_clean_ocr_ignores_chunk_markdown_when_canonical_doc_exists.

## 2. Widen OCR word-repeat window default 96 → 520

The legacy `word_window=96` default missed accent-shifted Greek repetitions whose period is wider than 32 chars. Two new constants:

    DEFAULT_OCR_WORD_REPEAT_MAX_PERIOD = 130
    DEFAULT_OCR_WORD_REPEAT_WINDOW = DEFAULT_OCR_WORD_REPEAT_MAX_PERIOD * 4  # 520

are now the default for `word_window` in `Corpus.clean_ocr()`, `Corpus.clean_ocr_debug()`, and `Corpus.clean_ocr_numeric_debug()`. Regression test: test_long_accent_shift_repeat_needs_wider_default_window proves the legacy 96 misses a real accent-shift case while 520 catches it.

## Verification

- pytest test_clean_ocr_ignores_chunk_markdown_when_canonical_doc_exists: passed.
- pytest test_long_accent_shift_repeat_needs_wider_default_window: passed.
- python3 -m ast: src/glossapi/corpus/phase_clean.py + tests parse OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
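The chunk-skip rule in part 1 can be sketched as a filename filter. The function name and the exact chunk regex width are assumptions for illustration; the real logic lives inside Corpus.clean() / Corpus.clean_ocr().

```python
import re
from pathlib import PurePath

# Chunk names look like doc__p00001-00010.md; the 5-digit width is an
# assumption matching the doc__pNNNNN-NNNNN.md pattern described above.
CHUNK_RE = re.compile(r"^(?P<stem>.+)__p\d{5}-\d{5}$")

def files_to_clean(md_names):
    """Drop __p-suffixed chunk outputs when the canonical merged doc is
    present alongside them, so the same content isn't cleaned (and
    counted in metadata) twice. Orphan chunks with no canonical doc
    are kept."""
    stems = {PurePath(n).stem for n in md_names}
    keep = []
    for name in md_names:
        m = CHUNK_RE.match(PurePath(name).stem)
        if m and m.group("stem") in stems:
            continue  # canonical doc exists in the same directory; skip
        keep.append(name)
    return keep
```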