Close Office and Google Docs paste sanitizer fidelity gaps #5896

@rtibbles

Description

This issue is not open for contribution. Visit Contributing guidelines to learn about the contributing process and how to find suitable issues.

Overview

utils/pasteTransform.js covers the common Microsoft Office cases but leaves fidelity gaps when pasting from Word, Excel, SharePoint, and Google Docs. Most visibly, Word multi-level lists collapse to flat paragraphs. This task closes those gaps in the existing sanitizer.

Complexity: Medium
Target branch: hotfixes

Context

We evaluated existing TipTap/ProseMirror plugins for source-specific paste cleanup and chose to keep the homegrown sanitizer: @tiptap-pro/extension-paste-handler (the only feature-complete option) is paid; @intevation/tiptap-extension-office-paste is narrower than ours and has open list-in-table bugs; wordsoap is 7+ years dormant; CKEditor 5's filter is GPL and tightly coupled to its model layer.

The Change

Extend the homegrown sanitizer to close known fidelity gaps. Each is a distinct branch in the sanitizer and can be implemented and reviewed independently:

  1. Word multi-level lists collapse to flat paragraphs. Word emits <p class="MsoListParagraph"> with the indent level encoded in mso-list:l0 level2 lfo1 style hints. We strip both, so the structure disappears and the list becomes a sequence of indistinguishable paragraphs.
  2. Excel/SharePoint class-driven styles disappear. Excel emits class="xl63" referring to definitions in a pasted <style> block we ignore. Bold/colored cells paste as plain text.
  3. Google Docs paragraph fragmentation. GDocs wraps every visual line in <p style="margin:0">. Single logical paragraphs explode into many.
  4. Google Docs <hr> pastes as a literal dashed line of text. GDocs encodes the rule as a styled paragraph of dashes.
  5. ARIA-encoded headings paste as plain paragraphs. SharePoint / Word Online emit <p role="heading" aria-level="2"> instead of <h2>, losing document structure.
  6. Word bookmark anchors survive as empty/dead anchors. <a name="_Toc...">, _Ref, _GoBack, footnote/endnote markers leave behind dead hyperlinks.
  7. <ol list-style-type> choice is lost. Word "i, ii, iii" lists paste as default "1, 2, 3".
  8. Table-cell mark normalization. Bold/italic inside <td> survives in raw HTML but gets stripped on first re-serialize because ProseMirror doesn't find the mark at the schema-valid position.
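For gap 1, the core of the fix is rebuilding nesting from a flat run of paragraphs that each carry only a depth number. A minimal sketch of that step, assuming the depth has already been parsed from each paragraph's mso-list hint. `nestByLevel` and the `{ level, text }` item shape are hypothetical and not part of utils/pasteTransform.js; the sketch does not decide bullet vs ordered and builds plain objects rather than DOM nodes.

```javascript
// Hypothetical helper, for illustration only.
// Input: items in document order, each { level, text }, with level starting at 1.
// Output: a tree of { text, children } nodes mirroring the intended list nesting.
function nestByLevel(items) {
  const root = { text: null, children: [] };
  // stack[i] holds the most recent node at depth i; stack[0] is the root.
  const stack = [root];
  for (const { level, text } of items) {
    // Clamp the depth so a jump from level 1 to level 3 nests only one step deeper.
    const depth = Math.min(Math.max(level, 1), stack.length);
    // Drop anything deeper than the new item's parent.
    stack.length = depth;
    const node = { text, children: [] };
    stack[depth - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

// Usage: four paragraphs at levels 1, 2, 2, 1.
const outline = nestByLevel([
  { level: 1, text: 'Intro' },
  { level: 2, text: 'Background' },
  { level: 2, text: 'Scope' },
  { level: 1, text: 'Method' },
]);
console.log(JSON.stringify(outline, null, 2));
```

Clamping skipped levels is a design choice of this sketch, not something the issue specifies; an implementation could instead insert empty intermediate lists.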

Acceptance Criteria pairs each gap with a checkbox for tracking.

How to Get There

Each gap is reproducible against the TipTap editor in any Studio rich-text field (e.g. question/answer/hint editors in the exercise authoring flow):

  1. Open a source document in Word / Excel / SharePoint / Google Docs containing the relevant construct (a multi-level list, a styled cell, a multi-line paragraph, etc.).
  2. Select the content and copy.
  3. Paste into a TipTap editor field in Studio.
  4. Observe the symptom described in The Change.

Out of Scope

  • Adopting a third-party paste-sanitization plugin (see Context).
  • Image handling on paste — covered by the separate strip-images work and its follow-ups.
  • Any user-visible string for sanitization actions (string freeze).
  • Outlook-specific paste patterns (quoted-reply chains, signature wrappers, cid: images) — separate issue if a real case surfaces.

Acceptance Criteria

General

  • Word multi-level lists (source). Convert sequences of <p class="MsoListParagraph"> into nested <ol>/<ul>.
    • Parse depth from the level\d+ token in mso-list:l\d+ level\d+ lfo\d+.
    • Bullet vs ordered: presence of an mso-list:Ignore-glyph child indicates bullet; otherwise ordered.
    • Must run before the mso-* style strip (which would otherwise drop the indent token).
  • Class-driven inline styles (source). Apply rules from pasted <style> blocks as inline styles on the matching elements, then drop the blocks.
    • Resolve single-class selectors only (e.g. .xl63 {...}).
    • Existing inline styles win on conflict.
  • Google Docs paragraph fragmentation (source). Merge consecutive sibling <p> with margin:0 under the same parent into one <p>, joined by <br>.
    • Match margin:0 or paired margin-top:0 + margin-bottom:0.
    • Anchor recognition to the surrounding GDocs <b id="docs-internal-guid-..."> wrapper when present.
  • Google Docs <hr> detection (empirical — no authoritative spec; verify by capturing a real paste). Replace <p> whose stripped text matches /^[-—–]{3,}$/ with <hr>.
  • ARIA headings (source). Replace <p role="heading" aria-level="N"> (and <div role="heading">) with <hN>.
    • Clamp N to 1-6.
    • Preserve inner content and marks.
  • Bookmark anchors (source). Strip <a> whose name/id matches /^_(Toc|Ref|Hlt|GoBack|ftn|edn)\w*/i.
    • Re-parent any inner content in place; remove element entirely if empty.
  • Ordered-list style (source). Preserve list-style-type on <ol> through the mso-* strip.
    • Cover lower-roman, upper-roman, lower-alpha, upper-alpha, decimal.
  • Table-cell schema conformance (source). Re-frame Office-emitted cell markup to ProseMirror's schema-valid form so marks survive serialize/parse.
    • Inside <td>/<th>: <span style="font-weight:bold"> → <strong>, font-style:italic → <em>, text-decoration:underline → <u>.
    • Unwrap a single direct-child <p> so marks attach at a valid position.
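Several of the criteria above reduce to a small string-level decision that can be unit-tested without a DOM. The sketch below collects four of them as pure functions. The function names are hypothetical; the patterns are the ones stated in the criteria. The fallback to level 2 for a missing or non-numeric aria-level is an assumption of this sketch (it mirrors ARIA's default for role="heading"), not something the issue specifies.

```javascript
// Hypothetical helper names, for illustration only.

// Word lists: read the depth from e.g. "mso-list:l0 level2 lfo1".
// Returns the level as a number, or null when the hint is absent.
function parseMsoListLevel(style) {
  const match = /mso-list:\s*l\d+\s+level(\d+)\s+lfo\d+/i.exec(style || '');
  return match ? Number(match[1]) : null;
}

// ARIA headings: map aria-level to a heading level clamped to 1..6.
// Assumption: a missing or non-numeric value falls back to 2.
function clampHeadingLevel(ariaLevel) {
  const n = parseInt(ariaLevel, 10);
  if (Number.isNaN(n)) return 2;
  return Math.min(6, Math.max(1, n));
}

// Bookmark anchors: true when a name/id looks like a Word-generated bookmark.
const BOOKMARK_NAME = /^_(Toc|Ref|Hlt|GoBack|ftn|edn)\w*/i;
function isWordBookmark(nameOrId) {
  return BOOKMARK_NAME.test(nameOrId || '');
}

// Google Docs rule: true when a paragraph's trimmed text is only
// three or more hyphens, em dashes, or en dashes.
const RULE_TEXT = /^[-—–]{3,}$/;
function looksLikeRule(text) {
  return RULE_TEXT.test((text || '').trim());
}
```

Keeping these checks separate from DOM traversal lets each one be tested against attribute values lifted from captured paste fixtures.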

Testing

  • Each AC is covered by a unit test in __tests__/pasteTransform.spec.js, using an HTML fixture captured from a real paste of that source.
  • Idempotency holds across all fixtures: running the sanitizer on its own output changes nothing further.
  • Pre-existing sanitizer behavior remains green.
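The idempotency criterion can be checked with one generic helper rather than a hand-written assertion per fixture. A minimal sketch, assuming the sanitizer is exposed as a string-to-string function; `findNonIdempotent` and the `fixtures` object shape are hypothetical, and the toy transforms in the usage lines stand in for the real sanitizer.

```javascript
// Hypothetical helper, for illustration only.
// transform: (html: string) => string
// fixtures: { [name]: html }
// Returns the names of fixtures where a second pass changes the output.
function findNonIdempotent(transform, fixtures) {
  return Object.entries(fixtures)
    .filter(([, html]) => {
      const once = transform(html);
      return transform(once) !== once;
    })
    .map(([name]) => name);
}

// Usage with toy transforms standing in for the sanitizer.
const sampleFixtures = { word: '<p>a   b</p>', gdocs: '<p>c</p>' };
const collapseSpaces = (html) => html.replace(/\s+/g, ' ');
const appendMarker = (html) => html + '!';
console.log(findNonIdempotent(collapseSpaces, sampleFixtures));
console.log(findNonIdempotent(appendMarker, sampleFixtures));
```

In a Jest spec, the returned array can be asserted empty with `expect(...).toEqual([])`, so a failure names the offending fixtures.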

References

Per-mechanism source links are inline in the corresponding Acceptance Criteria above.

AI usage

I used Claude (Opus 4.7) for research and drafting:

  • Surveying existing TipTap and ProseMirror paste sanitization plugins to confirm we should keep the homegrown sanitizer rather than adopt a dependency.
  • Gap analysis comparing TipTap Pro's documented transformations against the current sanitizer to identify what's missing.
  • Sourcing the per-mechanism authoritative references linked in each AC (CKEditor 5 source, Microsoft Q&A, ProseMirror forum threads).
  • Drafting the issue section by section.

I reviewed each section and each linked reference before submitting. Every AC except "Google Docs <hr> detection" is anchored to a production OSS implementation or official documentation; that one is explicitly flagged as empirical so the implementer captures a real paste to verify.
