feat: figure-level page-render extraction with caption-on-bottom table support #1
KuangjuX wants to merge 2 commits into
…pping

The existing xref-level extraction often produces unusable fragments (e.g. 58×224 arrow icons) because PDF stores figures as many small embedded image objects or as pure vector art. This commit adds a second extraction strategy that runs alongside the legacy xref path:

1. Locate Figure/Table captions on each page via regex over text blocks.
2. Collect bounding boxes of all xref images and vector drawings between the previous caption boundary and the current caption.
3. Render the computed region from the page pixmap at 200 DPI, producing a complete, human-readable figure PNG including the full multi-line caption text.

Downstream changes in plan_figures.py:

- Accept the new `figure_assets` list from extract_pdf_assets.py.
- Match figure plan items to figure-level assets by normalized label.
- Prefer figure-level matches over legacy xref candidates, setting `insert_mode` to `"figure_asset"` when a direct match is found.

Tested on the SkVM paper (arXiv 2604.03088): old pipeline extracted 6 xref fragments (including 58×224 artifacts); new pipeline produces 19 complete figure-level crops covering Figure 1–16 and Table 1.

Co-authored-by: Cursor <cursoragent@cursor.com>
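The three extraction steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `extract_pdf_assets.py`: the caption regex, the `Rect` type, and the helper names `find_caption_anchors` and `region_for_caption` are hypothetical, and text blocks are modelled as plain `(x0, y0, x1, y1, text)` tuples.

```python
import re
from typing import NamedTuple, Optional

# Assumed caption pattern; the real regex in the PR may differ.
CAPTION_RE = re.compile(r"^\s*(Figure|Fig\.|Table)\s*(\d+)\s*[.:]", re.IGNORECASE)


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def find_caption_anchors(blocks):
    """Step 1: keep the text blocks that start like a Figure/Table caption."""
    anchors = []
    for x0, y0, x1, y1, text in blocks:
        match = CAPTION_RE.match(text)
        if match:
            kind = "table" if match.group(1).lower() == "table" else "figure"
            anchors.append(
                {"kind": kind, "number": int(match.group(2)), "bbox": Rect(x0, y0, x1, y1)}
            )
    # Sort top to bottom so "previous caption" is simply the previous entry.
    return sorted(anchors, key=lambda a: a["bbox"].y0)


def region_for_caption(anchors, index, visual_rects, page_top=0.0) -> Optional[Rect]:
    """Step 2: union the image/drawing rects between the previous caption and this one."""
    caption = anchors[index]["bbox"]
    upper = anchors[index - 1]["bbox"].y1 if index > 0 else page_top
    inside = [r for r in visual_rects if r.y0 >= upper and r.y1 <= caption.y0]
    if not inside:
        return None
    rects = inside + [caption]  # include the caption so its text is part of the crop
    return Rect(
        min(r.x0 for r in rects),
        min(r.y0 for r in rects),
        max(r.x1 for r in rects),
        max(r.y1 for r in rects),
    )


blocks = [
    (50, 100, 400, 120, "Body text that mentions Figure 1 in passing."),
    (50, 300, 400, 320, "Figure 1: Overview."),
    (50, 600, 400, 620, "Figure 2. Details"),
]
visual = [Rect(60, 150, 200, 280), Rect(210, 160, 390, 290), Rect(60, 350, 390, 580)]
anchors = find_caption_anchors(blocks)
first_region = region_for_caption(anchors, 0, visual)
second_region = region_for_caption(anchors, 1, visual)
```

For step 3, a rectangle like `first_region` could then be rendered with PyMuPDF, for example via `page.get_pixmap(clip=fitz.Rect(*first_region), dpi=200)`; how the PR actually performs the render is not shown here.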
…mentation

The figure-level extractor introduced in df07e11 worked well for figures but produced badly truncated screenshots for tables in many ACM/IEEE papers (e.g. LoongTrain SC'24 Tables 2–5 only showed the bottom data rows plus the lower half of the caption, losing the column header and upper data rows entirely).

Three layered bugs caused this:

1. Direction was wrong. `_estimate_figure_bbox` always assumed "caption sits below the visual content", but academic tables are commonly typeset with caption-on-bottom (`\begin{tabular}` precedes `\caption`). Treating Table captions like Figure captions cropped the area above the caption, which is the body paragraph above the table rather than the table itself.
2. Body-block detection swallowed the table. PyMuPDF groups an entire tabular column ("DS-Ulysses 629.9 418.3 ...") into a single text block, and the legacy `_find_body_text_blocks` filter accepted any block longer than 40 characters as a "body paragraph". The probe walking up from the caption then immediately broke out as soon as the table region was reached, because every line inside the table was considered "inside a body block".
3. Row granularity was wrong. LaTeX-rendered tables emit one PyMuPDF "line" per cell (e.g. 136 single-cell lines for one tabular block). Per-line `_looks_like_data_row` checks therefore never fired, and the probe could not tell that it was walking through actual data.

Changes:

- Tag every caption anchor with `kind: "figure" | "table"` and stop multi-line caption merging when a numeric data row or an over-large vertical gap is encountered, so the caption bbox no longer absorbs the first table row.
- Add `_cluster_lines_into_rows` to fold sibling text lines that share the same vertical band into one logical row, plus `_row_is_table_like` which accepts either rows with ≥3 separated cells or rows whose tokens are dominated by numbers.
- Add `_find_paragraph_blocks` with a stricter heuristic (≥200 chars, ≥3 lines, <40 % numeric-heavy lines) so table-shaped blocks are no longer mistaken for prose. The table probe uses this stricter set; the figure probe keeps the original behaviour for backward compatibility.
- Split bbox estimation: `_estimate_figure_bbox_above_caption` keeps the existing "image above caption" logic for figures, while a new `_estimate_table_bbox` probes both above and below the caption, picks the side with more table-like rows, and unions in nearby drawing/image rects (\hline, frames). Both estimators fall back to each other when the primary direction returns no usable bbox.
- `extract_figure_regions` now picks the estimator based on the anchor kind and records the `kind` in each emitted asset (additive change; downstream plan_figures.py / materialize_figure_asset.py consume the same fields as before).

Verified on LoongTrain SC'24 (arXiv 2406.18485):

- Tables 2/3/4/5 now include the full column header, all data rows, and the complete "Table N." caption text.
- Figures 2/4/6/8/9/10/11/12/13/14 still extract correctly; no regression on the figure-above-caption layout.

Co-authored-by: Cursor <cursoragent@cursor.com>
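To make the row and paragraph heuristics concrete, here is a small stand-alone sketch. It is not the PR's implementation: the function names mirror the ones in the commit message without the leading underscore, lines are modelled as `(x0, y0, x1, y1, text)` tuples, and the vertical tolerance, the numeric-token regex, and the "more than half the tokens are numeric" rule are assumptions. Only the ≥3 cells, ≥200 chars, ≥3 lines and <40 % thresholds come from the commit message.

```python
import re

# Assumed definition of a "numeric" token: 629.9, 1,024, -3, 12%, ...
NUMERIC_RE = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")


def numeric_share(text):
    """Fraction of whitespace-separated tokens that look numeric."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if NUMERIC_RE.match(t)) / len(tokens)


def cluster_lines_into_rows(lines, y_tolerance=2.0):
    """Group lines whose vertical centres lie within y_tolerance of the row's first line."""
    rows = []
    for line in sorted(lines, key=lambda l: ((l[1] + l[3]) / 2, l[0])):
        center = (line[1] + line[3]) / 2
        if rows and abs(center - rows[-1]["center"]) <= y_tolerance:
            rows[-1]["cells"].append(line)
        else:
            rows.append({"center": center, "cells": [line]})
    # Return each row as its cells ordered left to right.
    return [sorted(row["cells"], key=lambda l: l[0]) for row in rows]


def row_is_table_like(cells, min_cells=3, min_numeric_share=0.5):
    """A row counts as table-like with >=3 separated cells or mostly numeric tokens."""
    if len(cells) >= min_cells:
        return True
    joined = " ".join(cell[4] for cell in cells)
    return numeric_share(joined) > min_numeric_share


def is_paragraph_block(line_texts, min_chars=200, min_lines=3, max_numeric_lines=0.4):
    """Stricter prose test: long enough, enough lines, few numeric-heavy lines."""
    if sum(len(t) for t in line_texts) < min_chars or len(line_texts) < min_lines:
        return False
    heavy = sum(1 for t in line_texts if numeric_share(t) > 0.5)
    return heavy / len(line_texts) < max_numeric_lines


lines = [
    (50, 100.0, 120, 110.0, "DS-Ulysses"),
    (150, 100.5, 190, 110.5, "629.9"),
    (220, 99.8, 260, 109.8, "418.3"),
    (50, 130.0, 300, 140.0, "We evaluate the method on"),
    (50, 160.0, 120, 170.0, "12.5 13.1"),
]
rows = cluster_lines_into_rows(lines)
flags = [row_is_table_like(row) for row in rows]
```

In this toy input the three single-cell lines fold into one logical row, which is the situation the per-line check in the earlier commit could not detect.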
917Dhj left a comment
Thanks a lot for working on this. I tested the PR on a set of papers, mostly CS/AI conference-style papers with many figures and tables.
My overall feeling is: I’m happy with the new extraction method, and I’d like to keep that part. It does improve the asset pool by adding captioned figure_assets, especially for papers where the old xref-only extraction misses most useful figures/tables.
However, I’d like the insertion and review logic to stay aligned with the original DeepPaperNote workflow.
The original design is intentionally placeholder-first:
- scripts extract evidence and candidate assets;
- figure/table placeholders are preserved first;
- extracted images still need to be reviewed;
- only images that pass the quality/semantic check should be inserted into the final note.
So I don’t think the PR should change the insertion decision from the script side. In particular, a label match between a planned figure/table and an extracted asset should not automatically set:
insert_mode: "figure_asset"
Some extracted assets are useful, but some are still duplicated, too tight, too loose, or not suitable for the final note. That is expected for an extraction step, but it means the output should remain a candidate pool rather than becoming an insertion decision.
Could you please keep the new figure/table extraction method, but adjust plan_figures.py so that it preserves the original placeholder-first behavior?
Concretely, I’d prefer:
- keep `insert_mode: "placeholder"` by default;
- attach matched `figure_assets` as candidates, e.g. `figure_asset_candidate` or `candidate_assets`;
- leave the final decision of whether to materialize an image to the existing review/model-side workflow.
That way we get the benefit of your improved extraction method, while keeping DeepPaperNote’s original insertion and quality-review logic intact.
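A rough sketch of what that could look like on the planning side, purely for illustration: `attach_candidates` is a hypothetical helper, the `label` and `path` keys are assumed field names, and `candidate_assets` is one of the two names suggested above. The actual structure of plan items in `plan_figures.py` may differ.

```python
from typing import Callable


def attach_candidates(
    plan_item: dict,
    figure_assets: list,
    key: Callable[[str], str] = lambda s: s.strip().lower(),
) -> dict:
    """Return a copy of plan_item that stays a placeholder but lists matching assets."""
    wanted = key(plan_item["label"])
    matches = [a for a in figure_assets if key(a.get("label", "")) == wanted]
    updated = dict(plan_item)
    # A label match is recorded as evidence only; it never changes the insert mode.
    updated["insert_mode"] = "placeholder"
    updated["candidate_assets"] = matches
    return updated


item = {"label": "Figure 1", "insert_mode": "figure_asset"}
assets = [
    {"label": "figure 1", "path": "fig1.png"},
    {"label": "Table 2", "path": "table2.png"},
]
planned = attach_candidates(item, assets)
```

The review step would then decide per candidate whether anything gets materialized, which keeps the script side limited to building the candidate pool.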
A few edge cases I noticed during testing:
- duplicate labels, such as repeated `Table 2` / `Figure 3`;
- inconsistent labels such as `Fig. 1` and `Figure 1`;
- some table crops are still a bit tight or loose.
These are fine as candidate-generation issues, but they are exactly why I’d prefer not to automatically promote extracted assets into inserted note images.
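For the first two edge cases, one possible way to handle them at the candidate stage is sketched below. The regex and the `normalize_label` / `group_by_label` helpers are hypothetical and not taken from the PR; the `label` and `page` keys are assumed field names.

```python
import re
from collections import defaultdict

# Assumed label pattern covering "Fig. 1", "Figure 1", "Tab. 2", "Table 2".
LABEL_RE = re.compile(r"^\s*(fig(?:ure)?|tab(?:le)?)\.?\s*(\d+)", re.IGNORECASE)


def normalize_label(label):
    """Map label variants to a canonical 'figure N' / 'table N' key, or None."""
    match = LABEL_RE.match(label)
    if match is None:
        return None
    kind = "figure" if match.group(1).lower().startswith("fig") else "table"
    return f"{kind} {int(match.group(2))}"


def group_by_label(assets):
    """Keep every asset, grouped by canonical label, so duplicates stay visible."""
    groups = defaultdict(list)
    for asset in assets:
        groups[normalize_label(asset.get("label", ""))].append(asset)
    return dict(groups)


assets = [
    {"label": "Table 2", "page": 3},
    {"label": "Table 2", "page": 4},
    {"label": "Fig. 1", "page": 1},
]
groups = group_by_label(assets)
```

Grouping rather than deduplicating leaves the choice between repeated `Table 2` crops to the review step instead of the script.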
So my requested change is mainly architectural rather than rejecting the extraction work: please keep the extraction improvement, but keep insertion/review decisions following the original placeholder-first workflow.
Thanks again — I think this is a useful improvement once that boundary is restored.
Summary
This PR teaches `extract_pdf_assets.py` to render each Figure / Table from the page pixmap with a caption-anchored bounding box, replacing (and augmenting) the legacy xref-level extraction that often produced unusable fragments such as 58×224 arrow icons.
The work is split into two commits so the table fix is reviewable on its own:
1. `feat: add figure-level page-render extraction for complete figure cropping` (df07e11)
   - Locates `Figure`/`Table` captions on each page via regex.
   - Collects the bounding boxes of xref images and vector drawings between the previous caption boundary and the current caption.
   - … `figure_assets` list.
   - `plan_figures.py` matches plan items against `figure_assets` by normalized label and prefers them over xref candidates.
2. `fix: correctly crop tables with caption-on-bottom and LaTeX cell fragmentation` (740739a)
   - Tags every caption anchor with `kind: figure | table` and stops merging multi-line captions when a numeric data row or oversized vertical gap is encountered, so the caption bbox no longer absorbs the first table row.
   - Adds row clustering (`_cluster_lines_into_rows` + `_row_is_table_like`) to fold the per-cell PyMuPDF lines that LaTeX tables emit (often 100+ single-cell lines per table) into one logical row.
   - Adds `_find_paragraph_blocks` with a stricter heuristic (≥200 chars, ≥3 lines, <40 % numeric-heavy lines) so table-shaped blocks are no longer mistaken for prose.
   - Splits bbox estimation into `_estimate_figure_bbox_above_caption` (legacy figure path, unchanged behaviour) and `_estimate_table_bbox` (probes both above and below the caption, picks the side with more table-like rows). Each estimator falls back to the other if its primary direction returns no usable bbox.
   - `extract_figure_regions` selects the estimator based on anchor kind and records `kind` on each emitted asset (additive change; no existing field of `figure_assets` is removed or renamed, so `plan_figures.py` / `materialize_figure_asset.py` continue to work unchanged).

Why a single PR?
The table fix relies on the figure-level scaffolding from the first commit (`_find_caption_blocks`, `_find_body_text_blocks`, `extract_figure_regions`). Splitting it would make the second commit unbuildable on top of upstream main. The two commits are kept separate inside the PR so reviewers can evaluate them independently.
Test plan
- SkVM paper (arXiv 2604.03088): produces 19 complete crops covering Figure 1–16 + Table 1, vs. 6 fragmented xref outputs from the legacy path.
- LoongTrain SC'24 (arXiv 2406.18485): Tables 2/3/4/5 now include the full column header, all data rows, and the complete "Table N." caption text. Previous output only captured the bottom data rows + the lower half of the caption.
- Figures 2/4/6/8/9/10/11/12/13/14 still extract correctly (no regression on figure-above-caption layout).
- `plan_figures.py` / `materialize_figure_asset.py` unchanged; existing fields of `figure_assets` preserved.

Notes for review
- `figure_assets[*]` gains a new optional `"kind"` field (`"figure" | "table"`). All existing fields are preserved verbatim.
- … so users who already depend on it are not affected.
- PyMuPDF (`fitz`) was already required and remains the only PDF backend.
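Because `kind` is optional, a consumer that also has to read older output could treat a missing value defensively. A minimal sketch, assuming that assets without `kind` should be treated as figures (that default is an assumption, and `split_by_kind` is a hypothetical helper, not part of the PR):

```python
def split_by_kind(figure_assets, default_kind="figure"):
    """Bucket assets by their optional 'kind' field, tolerating its absence."""
    buckets = {"figure": [], "table": []}
    for asset in figure_assets:
        kind = asset.get("kind", default_kind)  # older assets carry no 'kind'
        buckets.setdefault(kind, []).append(asset)
    return buckets


assets = [
    {"label": "Figure 1"},                   # produced before 'kind' existed
    {"label": "Table 2", "kind": "table"},
]
buckets = split_by_kind(assets)
```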