feat: figure-level page-render extraction with caption-on-bottom table support #1

Open
KuangjuX wants to merge 2 commits into 917Dhj:main from KuangjuX:feat/figure-level-extraction
@KuangjuX KuangjuX commented May 6, 2026

Summary

This PR teaches extract_pdf_assets.py to render each Figure / Table from
the page pixmap with a caption-anchored bounding box. The new path augments
(rather than replaces) the legacy xref-level extraction, which often produced
unusable fragments such as 58×224 arrow icons.

The work is split into two commits so the table fix is reviewable on its own:

  1. feat: add figure-level page-render extraction for complete figure cropping (df07e11)
    • Locates Figure / Table captions on each page via regex.
    • Computes a bbox that unions the embedded image rects and vector drawings
      between the previous caption boundary and the current caption.
    • Renders that bbox at 200 DPI and emits a new figure_assets list.
    • plan_figures.py matches plan items against figure_assets by
      normalized label and prefers them over xref candidates.
  2. fix: correctly crop tables with caption-on-bottom and LaTeX cell fragmentation (740739a)
    • Tags each caption with kind: figure | table and stops merging
      multi-line captions when a numeric data row or oversized vertical gap is
      encountered, so the caption bbox no longer absorbs the first table row.
    • Adds row clustering (_cluster_lines_into_rows + _row_is_table_like)
      to fold the per-cell PyMuPDF lines that LaTeX tables emit (often 100+
      single-cell lines per table) into one logical row.
    • Adds a stricter _find_paragraph_blocks (≥200 chars, ≥3 lines, <40 %
      numeric-heavy lines) so table-shaped blocks are no longer mistaken for
      prose.
    • Splits bbox estimation into _estimate_figure_bbox_above_caption (legacy
      figure path, unchanged behaviour) and _estimate_table_bbox (probes
      both above and below the caption, picks the side with more table-like
      rows). Each estimator falls back to the other if its primary direction
      returns no usable bbox.
    • extract_figure_regions selects the estimator based on anchor kind and
      records kind on each emitted asset (additive change; no existing field
      of figure_assets is removed or renamed, so plan_figures.py /
      materialize_figure_asset.py continue to work unchanged).
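For readers unfamiliar with the approach, caption anchoring and kind tagging can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the regex, the function name find_caption_anchors, and the anchor fields line_index and label are all assumptions made for the example.

```python
import re

# Hypothetical pattern: the PR does not show its actual caption regex.
CAPTION_RE = re.compile(
    r"^\s*(?P<word>Figure|Fig\.|Table)\s*(?P<num>\d+)\s*[.:]",
    re.IGNORECASE,
)


def find_caption_anchors(lines):
    """Return one anchor dict per text line that starts like a caption."""
    anchors = []
    for index, text in enumerate(lines):
        match = CAPTION_RE.match(text)
        if match is None:
            continue  # in-text references such as "see Figure 3" are skipped
        word = match.group("word").lower()
        kind = "table" if word == "table" else "figure"
        anchors.append({
            "line_index": index,
            "kind": kind,
            "label": f"{kind.capitalize()} {int(match.group('num'))}",
        })
    return anchors


# Made-up sample lines, for illustration only.
lines = [
    "Figure 3: Throughput under varying sequence length.",
    "As shown in Figure 3, the speedup grows with length.",
    "Table 2. End-to-end training time (s).",
    "Fig. 4. Ablation results.",
]
anchors = find_caption_anchors(lines)
```

Anchoring the pattern at the start of the line is what keeps in-text references from being treated as captions in this sketch.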

Why a single PR?

The table fix relies on the figure-level scaffolding from the first commit
(_find_caption_blocks, _find_body_text_blocks, extract_figure_regions).
Splitting it would make the second commit unbuildable on top of upstream
main. The two commits are kept separate inside the PR so reviewers can
evaluate them independently.

Test plan

  • Re-extracted the SkVM paper (arXiv 2604.03088): figure-level path
    produces 19 complete crops covering Figure 1–16 + Table 1, vs. 6
    fragmented xref outputs from the legacy path.
  • Re-extracted the LoongTrain SC'24 paper (arXiv 2406.18485):
    Tables 2/3/4/5 now include the full column header, all data rows, and
    the complete "Table N." caption text. Previous output only captured
    the bottom data rows + the lower half of the caption.
  • Verified Figures 2/4/6/8/9/10/11/12/13/14 on the same paper still
    extract correctly (no regression on figure-above-caption layout).
  • plan_figures.py / materialize_figure_asset.py unchanged; existing
    fields of figure_assets preserved.

Notes for review

  • figure_assets[*] gains a new optional "kind" field
    ("figure" | "table"). All existing fields are preserved verbatim.
  • The legacy xref-level extraction is intentionally retained as a fallback
    so users who already depend on it are not affected.
  • No new third-party dependencies; PyMuPDF (fitz) was already required
    and remains the only PDF backend.

KuangjuX and others added 2 commits May 6, 2026 16:50
…pping

The existing xref-level extraction often produces unusable fragments (e.g.
58×224 arrow icons) because PDF stores figures as many small embedded
image objects or as pure vector art. This commit adds a second extraction
strategy that runs alongside the legacy xref path:

1. Locate Figure/Table captions on each page via regex over text blocks.
2. Collect bounding boxes of all xref images and vector drawings between
   the previous caption boundary and the current caption.
3. Render the computed region from the page pixmap at 200 DPI, producing
   a complete, human-readable figure PNG including the full multi-line
   caption text.
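Steps 2 and 3 amount to a rectangle union followed by a clipped render. The sketch below assumes rects are plain (x0, y0, x1, y1) tuples in PDF points; the helper names union_rects, rects_between, and render_region are hypothetical and the sample coordinates are made up. Only Page.get_pixmap, fitz.Matrix, and fitz.Rect are real PyMuPDF calls.

```python
def union_rects(rects):
    """Smallest (x0, y0, x1, y1) box containing every input rect, or None."""
    if not rects:
        return None
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def rects_between(rects, top, bottom):
    """Keep rects that lie fully inside the vertical band [top, bottom]."""
    return [r for r in rects if r[1] >= top and r[3] <= bottom]


def render_region(page, bbox, dpi=200):
    """Render bbox of a PyMuPDF page to a Pixmap (PDF units are 1/72 inch)."""
    import fitz  # PyMuPDF; imported lazily so the helpers above run without it

    zoom = dpi / 72.0
    return page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=fitz.Rect(*bbox))


# Made-up geometry: two image rects above the caption, one far below it.
image_rects = [(100, 120, 300, 260), (310, 130, 500, 250), (90, 700, 200, 740)]
previous_boundary_y, caption_top_y = 80, 280
caption_rect = (95, 280, 505, 310)

content_bbox = union_rects(
    rects_between(image_rects, previous_boundary_y, caption_top_y)
)
# Union in the caption so the crop includes the full caption text.
crop_bbox = union_rects([content_bbox, caption_rect])
```

With a real page, render_region(page, crop_bbox).save("figure.png") would write the crop to disk.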

Downstream changes in plan_figures.py:
- Accept the new `figure_assets` list from extract_pdf_assets.py.
- Match figure plan items to figure-level assets by normalized label.
- Prefer figure-level matches over legacy xref candidates, setting
  `insert_mode` to `"figure_asset"` when a direct match is found.

Tested on the SkVM paper (arXiv 2604.03088): old pipeline extracted 6
xref fragments (including 58×224 artifacts); new pipeline produces 19
complete figure-level crops covering Figure 1–16 and Table 1.

Co-authored-by: Cursor <cursoragent@cursor.com>
…mentation

The figure-level extractor introduced in df07e11 worked well for figures
but produced badly truncated screenshots for tables in many ACM/IEEE
papers (e.g. LoongTrain SC'24 Tables 2–5 only showed the bottom data
rows plus the lower half of the caption, losing the column header and
upper data rows entirely).

Three layered bugs caused this:

1. Direction was wrong. `_estimate_figure_bbox` always assumed
   "caption sits below the visual content", but academic tables are
   commonly typeset with caption-on-bottom (`\begin{tabular}` precedes
   `\caption`). Treating Table captions like Figure captions cropped
   the area above the caption, which is the body paragraph above the
   table rather than the table itself.
2. Body-block detection swallowed the table. PyMuPDF groups an entire
   tabular column ("DS-Ulysses 629.9 418.3 ...") into a single text
   block, and the legacy `_find_body_text_blocks` filter accepted any
   block longer than 40 characters as a "body paragraph". The probe
   walking up from the caption then immediately broke out as soon as
   the table region was reached, because every line inside the table
   was considered "inside a body block".
3. Row granularity was wrong. LaTeX-rendered tables emit one PyMuPDF
   "line" per cell (e.g. 136 single-cell lines for one tabular block).
   Per-line `_looks_like_data_row` checks therefore never fired, and
   the probe could not tell that it was walking through actual data.
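The row-granularity problem in item 3 can be illustrated with a small clustering sketch. It assumes each line is an (x0, y0, x1, y1, text) tuple; the tolerance of 2 points and the "more than half numeric" threshold are assumptions for the example, not values taken from this commit.

```python
import re

# Tokens such as "629.9", "-3", "1,024" or "40%" count as numeric here.
NUMERIC_TOKEN = re.compile(r"^[+-]?\d+(?:[.,]\d+)*%?$")


def cluster_lines_into_rows(lines, y_tolerance=2.0):
    """Group lines whose vertical centres are within y_tolerance of a row."""
    rows = []
    ordered = sorted(lines, key=lambda ln: ((ln[1] + ln[3]) / 2, ln[0]))
    for line in ordered:
        centre = (line[1] + line[3]) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= y_tolerance:
            rows[-1]["cells"].append(line)
        else:
            rows.append({"centre": centre, "cells": [line]})
    for row in rows:
        row["cells"].sort(key=lambda ln: ln[0])  # left-to-right cell order
    return rows


def row_is_table_like(row, min_cells=3, numeric_ratio=0.5):
    """Accept rows with enough separate cells or mostly numeric tokens."""
    cells = row["cells"]
    if len(cells) >= min_cells:
        return True
    tokens = " ".join(cell[4] for cell in cells).split()
    if not tokens:
        return False
    numeric = sum(1 for token in tokens if NUMERIC_TOKEN.match(token))
    return numeric / len(tokens) > numeric_ratio


# Three single-cell lines on one baseline plus one prose line (made-up geometry).
lines = [
    (50, 100.0, 110, 110.0, "DS-Ulysses"),
    (150, 100.5, 180, 110.5, "629.9"),
    (220, 99.8, 250, 109.8, "418.3"),
    (50, 130.0, 400, 140.0, "We evaluate on 64 GPUs."),
]
rows = cluster_lines_into_rows(lines)
```

Checked one line at a time, "DS-Ulysses" would not look like data. After clustering, the three cells form a single row that passes the cell-count test.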

Changes:

- Tag every caption anchor with `kind: "figure" | "table"` and stop
  multi-line caption merging when a numeric data row or an over-large
  vertical gap is encountered, so the caption bbox no longer absorbs
  the first table row.
- Add `_cluster_lines_into_rows` to fold sibling text lines that share
  the same vertical band into one logical row, plus `_row_is_table_like`
  which accepts either rows with ≥3 separated cells or rows whose
  tokens are dominated by numbers.
- Add `_find_paragraph_blocks` with a stricter heuristic (≥200 chars,
  ≥3 lines, <40 % numeric-heavy lines) so table-shaped blocks are no
  longer mistaken for prose. The table probe uses this stricter set;
  the figure probe keeps the original behaviour for backward
  compatibility.
- Split bbox estimation: `_estimate_figure_bbox_above_caption` keeps
  the existing "image above caption" logic for figures, while a new
  `_estimate_table_bbox` probes both above and below the caption,
  picks the side with more table-like rows, and unions in nearby
  drawing/image rects (\hline, frames). Both estimators fall back to
  each other when the primary direction returns no usable bbox.
- `extract_figure_regions` now picks the estimator based on the anchor
  kind and records the `kind` in each emitted asset (additive change;
  downstream plan_figures.py / materialize_figure_asset.py consume the
  same fields as before).
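The stricter paragraph heuristic and the side selection could look roughly like the sketch below. The thresholds (≥200 chars, ≥3 lines, <40 % numeric-heavy lines) come from the list above. The definition of a "numeric-heavy" line, the tie-breaking rule, and the function names are assumptions made for this illustration.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def line_is_numeric_heavy(line, ratio=0.5):
    """Assumed definition: more than half of the tokens contain a number."""
    tokens = line.split()
    if not tokens:
        return False
    with_number = sum(1 for token in tokens if NUMBER.search(token))
    return with_number / len(tokens) > ratio


def is_paragraph_block(lines, min_chars=200, min_lines=3, max_numeric_share=0.40):
    """True only for long, multi-line blocks that are mostly non-numeric."""
    if sum(len(line) for line in lines) < min_chars or len(lines) < min_lines:
        return False
    heavy = sum(1 for line in lines if line_is_numeric_heavy(line))
    return heavy / len(lines) < max_numeric_share


def pick_table_side(rows_above, rows_below):
    """Choose the side with more table-like rows; None means use the fallback."""
    if rows_above == 0 and rows_below == 0:
        return None
    # Assumption: ties go to "below"; the commit does not state a tie rule.
    return "above" if rows_above > rows_below else "below"


# Made-up sample blocks, for illustration only.
prose = [
    "Long-sequence training stresses both memory and communication, so the system",
    "splits attention heads and sequence chunks across devices while overlapping",
    "communication with computation to keep every accelerator busy during training.",
]
table_column = ["DS-Ulysses 629.9 418.3 " + "0.0 " * 20] * 4
```

The table-shaped block is long enough to pass the length checks, so it is the numeric-share test that rejects it.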

Verified on LoongTrain SC'24 (arXiv 2406.18485):

- Tables 2/3/4/5 now include the full column header, all data rows,
  and the complete "Table N." caption text.
- Figures 2/4/6/8/9/10/11/12/13/14 still extract correctly; no
  regression on the figure-above-caption layout.

Co-authored-by: Cursor <cursoragent@cursor.com>
@KuangjuX KuangjuX changed the title from "feat: add figure-level page-render extraction for complete figure cropping" to "feat: figure-level page-render extraction with caption-on-bottom table support" on May 8, 2026
Owner

@917Dhj 917Dhj left a comment

Thanks a lot for working on this. I tested the PR on a set of papers, mostly CS/AI conference-style papers with many figures and tables.

My overall feeling is: I’m happy with the new extraction method, and I’d like to keep that part. It does improve the asset pool by adding captioned figure_assets, especially for papers where the old xref-only extraction misses most useful figures/tables.

However, I’d like the insertion and review logic to stay aligned with the original DeepPaperNote workflow.

The original design is intentionally placeholder-first:

  • scripts extract evidence and candidate assets;
  • figure/table placeholders are preserved first;
  • extracted images still need to be reviewed;
  • only images that pass the quality/semantic check should be inserted into the final note.

So I don’t think the PR should change the insertion decision from the script side. In particular, a label match between a planned figure/table and an extracted asset should not automatically set:

insert_mode: "figure_asset"

Some extracted assets are useful, but some are still duplicated, too tight, too loose, or not suitable for the final note. That is expected for an extraction step, but it means the output should remain a candidate pool rather than becoming an insertion decision.

Could you please keep the new figure/table extraction method, but adjust plan_figures.py so that it preserves the original placeholder-first behavior?

Concretely, I’d prefer:

  • keep insert_mode: "placeholder" by default;
  • attach matched figure_assets as candidates, e.g. figure_asset_candidate or candidate_assets;
  • leave the final decision of whether to materialize an image to the existing review/model-side workflow.
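To make the request concrete, here is a rough sketch of what I mean. The key names label and path, and the function name, are only placeholders for illustration; the actual structure in plan_figures.py may differ.

```python
def plan_item_with_candidates(plan_item, figure_assets):
    """Attach label-matched assets as candidates; never promote them."""
    wanted = plan_item["label"].strip().lower()
    matches = [
        asset
        for asset in figure_assets
        if asset.get("label", "").strip().lower() == wanted
    ]
    planned = dict(plan_item)  # leave the caller's dict untouched
    planned["insert_mode"] = "placeholder"  # the default stays placeholder-first
    planned["candidate_assets"] = matches  # review decides what gets inserted
    return planned


# Made-up sample assets, for illustration only.
figure_assets = [
    {"label": "Table 2", "kind": "table", "path": "assets/table_2.png"},
    {"label": "Figure 1", "kind": "figure", "path": "assets/figure_1.png"},
]
planned = plan_item_with_candidates({"label": "Table 2"}, figure_assets)
unmatched = plan_item_with_candidates({"label": "Figure 9"}, figure_assets)
```

A label match only fills candidate_assets. An item with no match looks the same apart from an empty candidate list.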

That way we get the benefit of your improved extraction method, while keeping DeepPaperNote’s original insertion and quality-review logic intact.

A few edge cases I noticed during testing:

  • duplicate labels, such as repeated Table 2 / Figure 3;
  • inconsistent labels such as Fig. 1 and Figure 1;
  • some table crops are still a bit tight or loose.
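For the first two cases, one possible way to keep them visible at the candidate stage is sketched below. The regex, the helper names, and the label and page keys are assumptions for the example, not existing DeepPaperNote code.

```python
import re
from collections import defaultdict

# Hypothetical pattern covering "Fig. 1", "Figure 1", "Tab. 2", "Table 2".
LABEL_RE = re.compile(r"^\s*(fig(?:ure)?|tab(?:le)?)\.?\s*(\d+)", re.IGNORECASE)


def normalize_label(label):
    """Map label variants to a canonical key, or None if unrecognised."""
    match = LABEL_RE.match(label)
    if match is None:
        return None
    kind = "figure" if match.group(1).lower().startswith("fig") else "table"
    return f"{kind} {int(match.group(2))}"


def group_by_label(assets):
    """Group assets by canonical label so duplicates stay together."""
    groups = defaultdict(list)
    for asset in assets:
        key = normalize_label(asset.get("label", ""))
        if key is not None:
            groups[key].append(asset)
    return dict(groups)


# Made-up sample assets, for illustration only.
assets = [
    {"label": "Fig. 1", "page": 2},
    {"label": "Figure 1", "page": 2},
    {"label": "Table 2", "page": 5},
    {"label": "Table 2", "page": 6},
]
groups = group_by_label(assets)
```

Each group with more than one entry would then be passed to review as competing candidates instead of being resolved by the script.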

These are fine as candidate-generation issues, but they are exactly why I’d prefer not to automatically promote extracted assets into inserted note images.

So my requested change is mainly architectural rather than rejecting the extraction work: please keep the extraction improvement, but keep insertion/review decisions following the original placeholder-first workflow.

Thanks again — I think this is a useful improvement once that boundary is restored.
