feat: figure-level page-render extraction with caption-on-bottom table support by KuangjuX · Pull Request #1 · 917Dhj/DeepPaperNote

KuangjuX · 2026-05-06T08:51:26Z

Summary

This PR teaches extract_pdf_assets.py to render each Figure / Table from
the page pixmap with a caption-anchored bounding box. The new path
supplements the legacy xref-level extraction, which is retained as a
fallback but often produced unusable fragments such as 58×224 arrow icons.

The work is split into two commits so the table fix is reviewable on its own:

feat: add figure-level page-render extraction for complete figure cropping (df07e11)
- Locates Figure / Table captions on each page via regex.
- Computes a bbox that unions the embedded image rects and vector drawings
  between the previous caption boundary and the current caption.
- Renders that bbox at 200 DPI and emits a new figure_assets list.
- plan_figures.py matches plan items against figure_assets by
  normalized label and prefers them over xref candidates.
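
As a rough illustration of the bbox step in the first commit, the sketch below unions the rects found between the previous caption boundary and the current caption, then converts the result to pixel dimensions at 200 DPI. It is not the code in extract_pdf_assets.py: the helper names (`union_bbox`, `rects_between`, `pixel_size`) and the plain-tuple rect format are assumptions made so the example runs without PyMuPDF.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical rect format: (x0, y0, x1, y1) in PDF points, y growing downward.
BBox = Tuple[float, float, float, float]


def rects_between(rects: Iterable[BBox], top: float, bottom: float) -> List[BBox]:
    """Keep rects that lie fully inside the vertical band [top, bottom]."""
    return [r for r in rects if r[1] >= top and r[3] <= bottom]


def union_bbox(rects: Iterable[BBox]) -> Optional[BBox]:
    """Smallest box covering every rect, or None when there are no rects."""
    rects = list(rects)
    if not rects:
        return None
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def pixel_size(bbox: BBox, dpi: int = 200) -> Tuple[int, int]:
    """Pixel dimensions of bbox at the given DPI (one PDF point is 1/72 inch)."""
    scale = dpi / 72.0
    return (round((bbox[2] - bbox[0]) * scale), round((bbox[3] - bbox[1]) * scale))


# Image and drawing rects on one page; the last one sits below the caption.
page_rects = [(100, 200, 300, 400), (50, 250, 200, 500), (0, 700, 10, 710)]
# Band between the previous caption boundary (y=150) and this caption (y=520).
region = union_bbox(rects_between(page_rects, top=150, bottom=520))
```

Here `region` covers only the two rects above the caption; the stray rect at y=700 is excluded because it falls outside the band.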
fix: correctly crop tables with caption-on-bottom and LaTeX cell fragmentation (740739a)
- Tags each caption with kind: figure | table and stops merging
  multi-line captions when a numeric data row or oversized vertical gap is
  encountered, so the caption bbox no longer absorbs the first table row.
- Adds row clustering (_cluster_lines_into_rows + _row_is_table_like)
  to fold the per-cell PyMuPDF lines that LaTeX tables emit (often 100+
  single-cell lines per table) into one logical row.
- Adds a stricter _find_paragraph_blocks (≥200 chars, ≥3 lines, <40 %
  numeric-heavy lines) so table-shaped blocks are no longer mistaken for
  prose.
- Splits bbox estimation into _estimate_figure_bbox_above_caption (legacy
  figure path, unchanged behaviour) and _estimate_table_bbox (probes
  both above and below the caption, picks the side with more table-like
  rows). Each estimator falls back to the other if its primary direction
  returns no usable bbox.
- extract_figure_regions selects the estimator based on anchor kind and
  records kind on each emitted asset (additive change; no existing field
  of figure_assets is removed or renamed, so plan_figures.py /
  materialize_figure_asset.py continue to work unchanged).
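
The row-clustering idea from the second commit can be sketched as follows. This is a minimal approximation, not the PR's `_cluster_lines_into_rows` / `_row_is_table_like`: the `(x0, y0, y1, text)` line format, the 2-point vertical tolerance, and the 0.5 numeric ratio are all assumptions chosen for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical line format: (x0, y0, y1, text) in PDF points.
Line = Tuple[float, float, float, str]

_NUMERIC = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")


def _centre(line: Line) -> float:
    return (line[1] + line[2]) / 2


def cluster_lines_into_rows(lines: List[Line], y_tol: float = 2.0) -> List[List[Line]]:
    """Fold lines that share a vertical band into one logical row."""
    rows: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: (_centre(l), l[0])):
        # Compare against the first line of the current row to avoid drift.
        if rows and abs(_centre(line) - _centre(rows[-1][0])) <= y_tol:
            rows[-1].append(line)
        else:
            rows.append([line])
    # Order cells left to right inside each row.
    return [sorted(row, key=lambda l: l[0]) for row in rows]


def row_is_table_like(row: List[Line], min_cells: int = 3,
                      numeric_ratio: float = 0.5) -> bool:
    """Accept rows with several separated cells or mostly numeric tokens."""
    if len(row) >= min_cells:
        return True
    tokens = " ".join(l[3] for l in row).split()
    if not tokens:
        return False
    numeric = sum(1 for t in tokens if _NUMERIC.match(t))
    return numeric / len(tokens) > numeric_ratio


# Three single-cell lines from one table row, plus one prose line below.
lines = [
    (50, 100.0, 110.0, "DS-Ulysses"),
    (150, 100.5, 110.5, "629.9"),
    (220, 99.8, 109.8, "418.3"),
    (50, 130.0, 140.0, "This is a sentence of ordinary prose."),
]
rows = cluster_lines_into_rows(lines)
```

With per-cell lines folded this way, a probe can count table-like rows on each side of a caption instead of testing each single-cell line in isolation.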

Why a single PR?

The table fix relies on the figure-level scaffolding from the first commit
(_find_caption_blocks, _find_body_text_blocks, extract_figure_regions).
Splitting it would make the second commit unbuildable on top of upstream
main. The two commits are kept separate inside the PR so reviewers can
evaluate them independently.

Test plan

- Re-extracted the SkVM paper (arXiv 2604.03088): figure-level path
  produces 19 complete crops covering Figure 1–16 + Table 1, vs. 6
  fragmented xref outputs from the legacy path.
- Re-extracted the LoongTrain SC'24 paper (arXiv 2406.18485):
  Tables 2/3/4/5 now include the full column header, all data rows, and
  the complete "Table N." caption text. Previous output only captured
  the bottom data rows + the lower half of the caption.
- Verified Figures 2/4/6/8/9/10/11/12/13/14 on the same paper still
  extract correctly (no regression on figure-above-caption layout).
- plan_figures.py / materialize_figure_asset.py unchanged; existing
  fields of figure_assets preserved.

Notes for review

- figure_assets[*] gains a new optional "kind" field
  ("figure" | "table"). All existing fields are preserved verbatim.
- The legacy xref-level extraction is intentionally retained as a fallback
  so users who already depend on it are not affected.
- No new third-party dependencies; PyMuPDF (fitz) was already required
  and remains the only PDF backend.

…pping

The existing xref-level extraction often produces unusable fragments (e.g. 58×224 arrow icons) because PDF stores figures as many small embedded image objects or as pure vector art. This commit adds a second extraction strategy that runs alongside the legacy xref path:

1. Locate Figure/Table captions on each page via regex over text blocks.
2. Collect bounding boxes of all xref images and vector drawings between the previous caption boundary and the current caption.
3. Render the computed region from the page pixmap at 200 DPI, producing a complete, human-readable figure PNG including the full multi-line caption text.

Downstream changes in plan_figures.py:

- Accept the new `figure_assets` list from extract_pdf_assets.py.
- Match figure plan items to figure-level assets by normalized label.
- Prefer figure-level matches over legacy xref candidates, setting `insert_mode` to `"figure_asset"` when a direct match is found.

Tested on the SkVM paper (arXiv 2604.03088): old pipeline extracted 6 xref fragments (including 58×224 artifacts); new pipeline produces 19 complete figure-level crops covering Figure 1–16 and Table 1.

Co-authored-by: Cursor <cursoragent@cursor.com>

…mentation

The figure-level extractor introduced in df07e11 worked well for figures but produced badly truncated screenshots for tables in many ACM/IEEE papers (e.g. LoongTrain SC'24 Tables 2–5 only showed the bottom data rows plus the lower half of the caption, losing the column header and upper data rows entirely).

Three layered bugs caused this:

1. Direction was wrong. `_estimate_figure_bbox` always assumed "caption sits below the visual content", but academic tables are commonly typeset with caption-on-bottom (`\begin{tabular}` precedes `\caption`). Treating Table captions like Figure captions cropped the area above the caption, which is the body paragraph above the table rather than the table itself.
2. Body-block detection swallowed the table. PyMuPDF groups an entire tabular column ("DS-Ulysses 629.9 418.3 ...") into a single text block, and the legacy `_find_body_text_blocks` filter accepted any block longer than 40 characters as a "body paragraph". The probe walking up from the caption then immediately broke out as soon as the table region was reached, because every line inside the table was considered "inside a body block".
3. Row granularity was wrong. LaTeX-rendered tables emit one PyMuPDF "line" per cell (e.g. 136 single-cell lines for one tabular block). Per-line `_looks_like_data_row` checks therefore never fired, and the probe could not tell that it was walking through actual data.

Changes:

- Tag every caption anchor with `kind: "figure" | "table"` and stop multi-line caption merging when a numeric data row or an over-large vertical gap is encountered, so the caption bbox no longer absorbs the first table row.
- Add `_cluster_lines_into_rows` to fold sibling text lines that share the same vertical band into one logical row, plus `_row_is_table_like` which accepts either rows with ≥3 separated cells or rows whose tokens are dominated by numbers.
- Add `_find_paragraph_blocks` with a stricter heuristic (≥200 chars, ≥3 lines, <40 % numeric-heavy lines) so table-shaped blocks are no longer mistaken for prose. The table probe uses this stricter set; the figure probe keeps the original behaviour for backward compatibility.
- Split bbox estimation: `_estimate_figure_bbox_above_caption` keeps the existing "image above caption" logic for figures, while a new `_estimate_table_bbox` probes both above and below the caption, picks the side with more table-like rows, and unions in nearby drawing/image rects (\hline, frames). Both estimators fall back to each other when the primary direction returns no usable bbox.
- `extract_figure_regions` now picks the estimator based on the anchor kind and records the `kind` in each emitted asset (additive change; downstream plan_figures.py / materialize_figure_asset.py consume the same fields as before).

Verified on LoongTrain SC'24 (arXiv 2406.18485):

- Tables 2/3/4/5 now include the full column header, all data rows, and the complete "Table N." caption text.
- Figures 2/4/6/8/9/10/11/12/13/14 still extract correctly; no regression on the figure-above-caption layout.

Co-authored-by: Cursor <cursoragent@cursor.com>
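
To make the stricter paragraph filter from the commit message above concrete (≥200 chars, ≥3 lines, <40 % numeric-heavy lines), here is a small sketch. The thresholds come from the commit message; the function names and the definition of a "numeric-heavy" line (at least half of its tokens contain a digit) are assumptions, and the real `_find_paragraph_blocks` may define them differently.

```python
import re
from typing import List

_HAS_DIGIT = re.compile(r"\d")


def line_is_numeric_heavy(line: str, token_ratio: float = 0.5) -> bool:
    """Assumed definition: at least token_ratio of the tokens contain a digit."""
    tokens = line.split()
    if not tokens:
        return False
    with_digits = sum(1 for t in tokens if _HAS_DIGIT.search(t))
    return with_digits / len(tokens) >= token_ratio


def is_paragraph_block(lines: List[str], min_chars: int = 200, min_lines: int = 3,
                       max_numeric_frac: float = 0.40) -> bool:
    """True only for blocks that look like prose under the three thresholds."""
    lines = [ln for ln in lines if ln.strip()]
    if len(lines) < min_lines:
        return False
    if sum(len(ln.strip()) for ln in lines) < min_chars:
        return False
    heavy = sum(1 for ln in lines if line_is_numeric_heavy(ln))
    return heavy / len(lines) < max_numeric_frac


prose = ["alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu"] * 4
table_column = ["DS-Ulysses 629.9 418.3 211.0"] * 10
```

Under these assumptions `prose` passes as a paragraph, while `table_column` is rejected even though it is long enough, because every line is numeric-heavy.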

917Dhj

Thanks a lot for working on this. I tested the PR on a set of papers, mostly CS/AI conference-style papers with many figures and tables.

My overall feeling is: I’m happy with the new extraction method, and I’d like to keep that part. It does improve the asset pool by adding captioned figure_assets, especially for papers where the old xref-only extraction misses most useful figures/tables.

However, I’d like the insertion and review logic to stay aligned with the original DeepPaperNote workflow.

The original design is intentionally placeholder-first:

- scripts extract evidence and candidate assets;
- figure/table placeholders are preserved first;
- extracted images still need to be reviewed;
- only images that pass the quality/semantic check should be inserted into the final note.

So I don’t think the PR should change the insertion decision from the script side. In particular, a label match between a planned figure/table and an extracted asset should not automatically set:

insert_mode: "figure_asset"

Some extracted assets are useful, but some are still duplicated, too tight, too loose, or not suitable for the final note. That is expected for an extraction step, but it means the output should remain a candidate pool rather than becoming an insertion decision.

Could you please keep the new figure/table extraction method, but adjust plan_figures.py so that it preserves the original placeholder-first behavior?

Concretely, I’d prefer:

- keep insert_mode: "placeholder" by default;
- attach matched figure_assets as candidates, e.g. figure_asset_candidate or candidate_assets;
- leave the final decision of whether to materialize an image to the existing review/model-side workflow.

That way we get the benefit of your improved extraction method, while keeping DeepPaperNote’s original insertion and quality-review logic intact.
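
A minimal sketch of what the requested placeholder-first behaviour might look like is given below. It is not the current plan_figures.py: the function name `attach_candidates` and the `label` / `path` fields are assumptions, while `insert_mode: "placeholder"` and `candidate_assets` follow the names suggested in this comment.

```python
from typing import Any, Dict, List


def attach_candidates(plan_items: List[Dict[str, Any]],
                      figure_assets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach matching assets as candidates without deciding on insertion."""
    by_label: Dict[str, List[Dict[str, Any]]] = {}
    for asset in figure_assets:
        # Duplicate labels are kept side by side for the later review step.
        by_label.setdefault(asset["label"], []).append(asset)

    planned = []
    for item in plan_items:
        entry = dict(item)  # do not mutate the caller's plan item
        entry["insert_mode"] = "placeholder"  # the script never promotes an asset
        entry["candidate_assets"] = list(by_label.get(item["label"], []))
        planned.append(entry)
    return planned


plan = [{"label": "table 2"}, {"label": "figure 7"}]
assets = [
    {"label": "table 2", "path": "table_2_a.png"},
    {"label": "table 2", "path": "table_2_b.png"},
]
result = attach_candidates(plan, assets)
```

Every item stays a placeholder; a label match only populates `candidate_assets`, so the existing review workflow still decides whether any image is materialized.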

A few edge cases I noticed during testing:

- duplicate labels, such as repeated Table 2 / Figure 3;
- inconsistent labels such as Fig. 1 and Figure 1;
- some table crops are still a bit tight or loose.

These are fine as candidate-generation issues, but they are exactly why I’d prefer not to automatically promote extracted assets into inserted note images.
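
For the label edge cases above, one possible normalization is sketched here. The regex and the `normalize_label` / `find_duplicates` helpers are hypothetical and are not taken from plan_figures.py; they only show how Fig. 1 and Figure 1 could map to the same key and how repeated labels could be surfaced for review.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional

_LABEL = re.compile(r"^\s*(fig(?:ure)?|tab(?:le)?)\.?\s*(\d+)", re.IGNORECASE)


def normalize_label(text: str) -> Optional[str]:
    """Map 'Fig. 1', 'FIGURE 1:' and similar to 'figure 1'; None if no match."""
    match = _LABEL.match(text)
    if not match:
        return None
    kind = "figure" if match.group(1).lower().startswith("fig") else "table"
    return f"{kind} {int(match.group(2))}"


def find_duplicates(labels: Iterable[str]) -> List[str]:
    """Normalized labels that occur more than once, in sorted order."""
    counts = Counter(n for n in map(normalize_label, labels) if n is not None)
    return sorted(label for label, count in counts.items() if count > 1)


captions = ["Fig. 1", "Figure 1: Overview", "Table 2.", "Tab. 2", "Figure 3"]
duplicates = find_duplicates(captions)
```

Duplicates found this way would still be passed along as candidates rather than resolved automatically.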

So my requested change is mainly architectural rather than rejecting the extraction work: please keep the extraction improvement, but keep insertion/review decisions following the original placeholder-first workflow.

Thanks again — I think this is a useful improvement once that boundary is restored.

KuangjuX and others added 2 commits May 6, 2026 16:50

KuangjuX changed the title ~~feat: add figure-level page-render extraction for complete figure cropping~~ feat: figure-level page-render extraction with caption-on-bottom table support May 8, 2026

917Dhj requested changes May 11, 2026

KuangjuX wants to merge 2 commits into 917Dhj:main from KuangjuX:feat/figure-level-extraction
