
feat: decode inline Arrow IPC + warehouse-compat fallback #329

Open
jamesbroadhead wants to merge 13 commits into main from stack/arrow-3-inline-arrow-fix

Conversation

@jamesbroadhead (Contributor)

Summary

Serverless warehouses return ARROW_STREAM + INLINE results as base64 Arrow IPC in result.attachment rather than result.data_array. The previous code path discarded inline data for any ARROW_STREAM response (designed for EXTERNAL_LINKS), so these warehouses silently returned empty results.

This PR makes the analytics plugin work across classic and serverless warehouses by handling both dispositions for ARROW_STREAM, decoding inline Arrow IPC attachments server-side, and falling back to JSON_ARRAY when a warehouse rejects ARROW_STREAM + INLINE.

Stack

Based on #328. Merge order: #327 → #328 → this PR.

Changes

  • Inline Arrow IPC decoding (connectors/sql-warehouse/arrow-schema.ts + client.ts): detects result.attachment and decodes via apache-arrow's tableFromIPC, producing the same row-object shape as JSON_ARRAY regardless of warehouse backend. apache-arrow@21.1.0 added as a server dependency.
  • Format fallback: ARROW_STREAM + INLINE requests automatically fall back to JSON_ARRAY if a classic warehouse rejects them. Explicit format requests are respected without fallback.
  • Zod-validated SSE wire protocol (shared/sse/analytics.ts): single source of truth between server (AnalyticsPlugin._handleQueryRoute) and client (useAnalyticsQuery). Malformed payloads surface a clear error instead of silent undefined allocation.
  • Default remains JSON_ARRAY for compatibility.
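
The fallback behavior above can be sketched as a small wrapper. This is a hedged illustration, not the plugin's actual API: `executeWithFormatFallback` and `isInlineArrowRejection` are illustrative names, and the rejection matcher is an assumption about what the warehouse error looks like.

```typescript
// Hedged sketch of the fallback described above; names are illustrative.
type Format = "JSON_ARRAY" | "ARROW_STREAM";
interface StatementResult { rows: Record<string, unknown>[] }

// Illustrative matcher: assume the warehouse error mentions the rejected disposition.
function isInlineArrowRejection(err: unknown): boolean {
  return err instanceof Error && /INLINE/.test(err.message);
}

async function executeWithFormatFallback(
  execute: (format: Format) => Promise<StatementResult>,
  format: Format,
  explicit: boolean, // true when the caller pinned the format
): Promise<StatementResult> {
  try {
    return await execute(format);
  } catch (err) {
    // Auto-degrade only for non-explicit ARROW_STREAM requests the warehouse rejected;
    // explicit format requests are respected and the error is rethrown.
    if (!explicit && format === "ARROW_STREAM" && isInlineArrowRejection(err)) {
      return execute("JSON_ARRAY");
    }
    throw err;
  }
}
```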

Tests

Per-area additions:

  • connectors/sql-warehouse/tests/arrow-schema.test.ts (new, 514 lines)
  • connectors/sql-warehouse/tests/client.test.ts (new, 382 lines) — uses real base64 Arrow IPC captured from a live serverless warehouse
  • appkit-ui/src/react/hooks/__tests__/use-analytics-query.test.ts (new, 143 lines)
  • shared/src/sse/analytics.test.ts (new, 87 lines)
  • plugins/analytics/tests/analytics.test.ts (+449)

Test plan

  • Deploy an app against a serverless warehouse that returns ARROW_STREAM + INLINE attachments — verify useAnalyticsQuery returns rows
  • Deploy against a classic warehouse — verify JSON_ARRAY default works and ARROW_STREAM + EXTERNAL_LINKS still works
  • Verify automatic fallback when a warehouse rejects ARROW_STREAM + INLINE

Fixes #242. Replaces #256.

This pull request was AI-assisted by Isaac.

Renames the client-side analytics format model from "JSON"/"ARROW" to
"JSON_ARRAY"/"ARROW_STREAM" to match the Statement Execution API enum
verbatim — no more local-name to API-name translation.

Pure mechanical rename. No behavior change. Internal type values only;
the lowercase user-facing values passed to useChartData ("json", "arrow",
"auto") are unchanged.

Carved out of #256 (#327 is layer 1, this is layer 2). The actual
inline-Arrow-IPC + warehouse-fallback fix sits on top of this in layer 3.

Note: this is a breaking change for any direct consumer of
useAnalyticsQuery passing explicit format: "JSON" or "ARROW" — they will
need to update to "JSON_ARRAY" / "ARROW_STREAM". Consumers using
useChartData (lowercase "json"/"arrow"/"auto") are unaffected.

Co-authored-by: Isaac
Widen AnalyticsFormat to also include the pre-rename "JSON" and "ARROW"
spellings, both marked @deprecated with a JSDoc note describing the
removal condition (no consumer on appkit/appkit-ui < 0.33.0). Add a
normalizeAnalyticsFormat helper and call it at the analytics route
handler entry point so all downstream code (cache key, format
branching, formatParameters) continues to operate on the canonical
"JSON_ARRAY" | "ARROW_STREAM" values.

InferResultByFormat is widened to also match "ARROW" so callers
passing the legacy spelling still get TypedArrowTable<...> inferred.

This lifts the breaking-change carve-out from the rename, so callers
of useAnalyticsQuery({ format: "JSON" | "ARROW" }) keep working with
only an IDE deprecation hint.
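
A minimal sketch of what a normalizer like this could look like, assuming the shape described above (the actual helper lives in the analytics route handler and this body is illustrative, not the PR's code):

```typescript
// Hedged sketch: widened union with deprecated legacy spellings, normalized
// at the route-handler entry point so downstream code sees canonical values.
/** "JSON" / "ARROW" are the pre-rename spellings, kept only for compatibility. */
type AnalyticsFormat = "JSON_ARRAY" | "ARROW_STREAM" | "JSON" | "ARROW";
type CanonicalFormat = "JSON_ARRAY" | "ARROW_STREAM";

function normalizeAnalyticsFormat(format: AnalyticsFormat): CanonicalFormat {
  switch (format) {
    case "JSON": return "JSON_ARRAY";    // legacy alias
    case "ARROW": return "ARROW_STREAM"; // legacy alias
    default: return format;              // already canonical
  }
}
```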

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>

Stack: layer 3 of 3 carved from #256.
- #327 — coverage backfill (layer 1)
- #328 — AnalyticsFormat rename to API enum names (layer 2)
- (this PR) — the actual fix

Fixes #242

Co-authored-by: Isaac
jamesbroadhead force-pushed the stack/arrow-3-inline-arrow-fix branch from ca69f8e to 08c5486 (May 11, 2026 16:38)
Six issues surfaced by GPT 5.4 xhigh + Gemini 3.1 Pro parallel review
followed by an adversarial debate round (reviewer: GPT, critic: Gemini,
meta: Claude Opus).

1. Raise SSE event-size cap from 8 MiB to 12 MiB on both server
   (streamDefaults.maxEventSize) and client (connectSSE.maxBufferSize).
   The inline Arrow attachment cap (MAX_INLINE_ATTACHMENT_BYTES) stays
   at 8 MiB *decoded*; base64 encoding + JSON + SSE framing inflate that
   to ~10.6 MiB on the wire, so 12 MiB leaves enough headroom for legal
   8-MiB-decoded payloads to traverse the buffer.
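
   The ~10.6 MiB figure follows from base64 arithmetic alone (every 3 decoded
   bytes become 4 ASCII characters), before JSON and SSE framing add their
   overhead:

```typescript
// Back-of-envelope check of the headroom math: base64 encodes each 3-byte
// group of the decoded payload as 4 characters on the wire.
const MiB = 1024 * 1024;
const decodedCap = 8 * MiB;                      // MAX_INLINE_ATTACHMENT_BYTES
const base64Bytes = 4 * Math.ceil(decodedCap / 3);
console.log((base64Bytes / MiB).toFixed(2));     // "10.67"
```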

2. Empty `data_array: []` is truthy, so zero-row ARROW_STREAM responses
   skipped empty-table synthesis and fell through to the JSON row
   transform — callers requesting Arrow got [] JSON rows. Length-check
   explicitly.
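
   The bug in one line — an empty array is truthy in JavaScript, so a bare
   `if (result.data_array)` guard treats a zero-row response as populated:

```typescript
// Why the explicit length check matters for zero-row ARROW_STREAM responses.
const dataArray: unknown[] = [];
console.log(Boolean(dataArray));   // true  — the buggy branch condition fires
console.log(dataArray.length > 0); // false — the fixed, explicit check
```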

3. The arrow-fix commit dropped lowercase legacy "json" / "arrow" from
   DataFormat / resolveFormat(), silently breaking existing useChartData
   callers passing those spellings. Restore them as @deprecated aliases
   on the DataFormat union; resolveFormat() normalizes them to the
   canonical "JSON_ARRAY" / "ARROW_STREAM" return values.

4. The JSON_ARRAY -> ARROW_STREAM retry in DESCRIBE QUERY only fired on
   thrown exceptions. Some warehouses signal the rejection as
   `status.state === "FAILED"` instead. Extract the rejection-matcher
   helper and retry on both paths before degrading the typegen result
   to `unknown`.
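
   A hedged sketch of retrying on both signaling paths — the names
   (`describeWithRetry`, `isFormatRejection`) and the matcher's regex are
   illustrative, not the PR's extracted helper:

```typescript
// Sketch of finding 4: the same rejection can arrive as a thrown exception
// or as an in-band FAILED status, so both paths must trigger the retry.
type Format = "JSON_ARRAY" | "ARROW_STREAM";
interface StatementStatus { state: "SUCCEEDED" | "FAILED"; message?: string }

// Illustrative matcher over the warehouse's rejection text.
function isFormatRejection(message: string | undefined): boolean {
  return /format/i.test(message ?? "");
}

async function describeWithRetry(
  runDescribe: (format: Format) => Promise<StatementStatus>,
): Promise<StatementStatus> {
  let status: StatementStatus;
  try {
    status = await runDescribe("JSON_ARRAY");
  } catch (err) {
    // Path 1: rejection surfaced as a thrown exception.
    if (isFormatRejection((err as Error).message)) return runDescribe("ARROW_STREAM");
    throw err;
  }
  // Path 2: rejection surfaced in-band as status.state === "FAILED".
  if (status.state === "FAILED" && isFormatRejection(status.message)) {
    return runDescribe("ARROW_STREAM");
  }
  return status;
}
```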

5. analytics.test.ts:946 asserted `format: "JSON"` returns 400, but the
   route now accepts "JSON" as a legacy alias (normalized to
   JSON_ARRAY). Use a truly unsupported value ("CSV") so the test still
   exercises the malformed-format path.

6. Restore `zod: 4.3.6` to @databricks/appkit dependencies. main has it;
   the rebase conflict-resolution accepted the branch's older deps list
   which lacked it. appkit imports `zod` directly from several files
   (analytics.ts, agent tools, tests).

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
The original commit added zod@3.23.8 to shared for the new SSE wire
protocol schema. With zod restored on appkit at 4.3.6 (matching main),
the workspace now had two different zod majors resolving in different
packages — a latent peer-dep / type-incompatibility foot-gun even
though the schema itself was already cross-major-compatible.

Bump shared's zod to 4.3.6 so the whole workspace lands on one major.
The schema's two-arg `z.record(z.string(), z.unknown())` form is the
zod 4 spelling, so no functional change is needed; drop the now-stale
"keeps it valid under either major" comment.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>

# Conflicts:
#	packages/appkit-ui/src/react/hooks/types.ts
#	packages/appkit-ui/src/react/hooks/use-chart-data.ts
#	packages/appkit/src/plugins/analytics/analytics.ts
#	packages/appkit/src/plugins/analytics/types.ts
Restoring zod@4.3.6 to appkit and bumping shared's zod from 3.23.8 to
4.3.6 left the lockfile out of sync with package.json, breaking CI's
pnpm install --frozen-lockfile step on every job. Regenerate the
lockfile so both specifier entries match the manifests.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
The merge with main left two separate import statements for "./types" —
one for the type-only specifiers and a duplicate value import of
normalizeAnalyticsFormat. Biome rejected this as both an
organize-imports failure and a noRedeclare error. Merge them into a
single mixed type/value import.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
The merge resolution in client.ts dropped the logger.error call from
the executeStatement catch block — main has it, our pre-merge branch
had it, the resolved version lost it. Without that line the
"error log redaction" tests fail because the connector no longer
surfaces the failure message to the log spy.

Restore the call. Test plan: the two sql-warehouse.test.ts redaction
tests pass locally; behavior matches the comment "executeStatement's
catch ... is the single point that logs (gated on isAborted)".

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
…e SSE message

Address Mario's design feedback: SSE is for short control messages, not
bulk binary. Inline Arrow IPC payloads from serverless warehouses no
longer ride the SSE channel as base64; they are stashed server-side and
fetched out-of-band through the existing /arrow-result/:jobId endpoint
with the canonical application/vnd.apache.arrow.stream content-type.

Wire protocol
- Discriminated union shrinks from three variants to two: the
  arrow_inline message type is gone. Both INLINE and EXTERNAL_LINKS
  ARROW_STREAM responses now flow as a single `arrow` message whose
  statement_id discriminates dispatch: warehouse-issued ids hit the
  warehouse path, synthetic "inline-<uuid>" ids hit the stash. The
  client sees one path.
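
  A hedged sketch of that two-variant shape (field names illustrative; the
  real schema is zod-validated in shared/sse/analytics.ts):

```typescript
// Post-rework wire shape: one `arrow` variant covers both dispositions, and
// the statement_id prefix is the server-side dispatch key.
type AnalyticsMessage =
  | { type: "result"; data?: Record<string, unknown>[] } // JSON_ARRAY rows over SSE
  | { type: "arrow"; statement_id: string };             // bulk bytes fetched out-of-band

// Synthetic "inline-<uuid>" ids mark the stash path; everything else is a
// warehouse-issued statement id.
function isInlineStashId(statementId: string): boolean {
  return statementId.startsWith("inline-");
}
```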

Server
- New InlineArrowStash: TTL'd (10 min), bounded-memory (256 MiB),
  drain-on-read, per-user-keyed map of decoded Arrow IPC bytes. Stash
  key is the request's user id (or "global" for SP contexts) and is
  symmetric between put and take.
- AnalyticsPlugin holds one stash instance and uses it in two places:
  - _executeWithFormatFallback decodes result.attachment once, puts the
    bytes in the stash, and emits an arrow message with the synthetic
    id. Bulk bytes never traverse SSE.
  - _handleArrowRoute prefix-dispatches on the jobId: "inline-" drains
    the stash and serves with application/vnd.apache.arrow.stream + a
    no-store cache header; other ids fall through to the existing
    warehouse-fetch path unchanged.
- Connector's MAX_INLINE_ATTACHMENT_BYTES raised from 8 MiB to 25 MiB
  (the Databricks API hard cap on INLINE) since the SSE event-size
  budget no longer constrains it.

Client
- useAnalyticsQuery loses the arrow_inline branch and the local base64
  decoder. Both inline and external-links responses fetch through
  /api/analytics/arrow-result/:id; the prefix branch lives server-side.
- The dead client-side MAX_INLINE_ATTACHMENT_BYTES guard goes away.

SSE buffers
- streamDefaults.maxEventSize: 12 MiB -> 1 MiB
- connectSSE.maxBufferSize:    12 MiB -> 1 MiB
SSE now carries only short JSON control messages (result rows, arrow
envelope with statement id, error frames). Multi-MiB caps are no longer
needed and would mask buffer regressions.

Tests
- New InlineArrowStash unit tests (TTL eviction, max-bytes LRU, drain-
  on-read, per-user scoping).
- Reworked the route's "emits arrow_inline" test into a stash + arrow-
  message assertion: the SSE payload must not contain the base64 bytes
  or the arrow_inline type literal, and the decoded bytes must be in
  the stash keyed by the same synthetic id.
- New /arrow-result tests cover the inline path: success drain, 410 on
  unknown id, 410 on user mismatch.
- Client tests rewritten to assert both warehouse and inline-prefixed
  ids fetch through the same /arrow-result URL with no local decoding.
- Shared schema tests assert the retired arrow_inline type no longer
  parses.
- The /arrow-result content-type for warehouse hits stays
  application/octet-stream (no behavior change there).

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Four findings surfaced by the GPT pass on the reworked PR:

1. ARROW_STREAM cache replay returned drained inline-* ids (HIGH).
   The previous code capped the cache TTL at 10 min for ARROW_STREAM,
   which made sense for EXTERNAL_LINKS pre-signed URLs that expire in
   ~15 min but is broken for inline ids: the stash drains on the first
   /arrow-result fetch, so any cache hit replays an id whose bytes are
   gone and reliably 410s. Bypass cache entirely for ARROW_STREAM (TTL
   = 0); JSON_ARRAY responses still cache normally.

2. Stash evict-on-fit invalidated already-issued ids (MEDIUM).
   The earlier `evictUntilFits` dropped the oldest entries when a new
   payload would push total bytes past `maxBytes`, but those oldest
   entries had ids that were already in flight to clients. Replace
   eviction with rejection: `put()` now returns `string | null` and the
   caller falls back to EXTERNAL_LINKS when the stash is full. Every id
   we hand out stays valid until naturally drained or expired.
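
   The stash contract described above — TTL'd, bounded by total bytes,
   reject-when-full, drain-on-read — can be sketched as follows. This is a
   minimal illustration, not the PR's class: the real stash keys entries by
   user id (or "global" for SP contexts), which this sketch folds into the id.

```typescript
// Hedged sketch of InlineArrowStash: put() rejects (returns null) when full
// instead of evicting, so every issued id stays valid until drained or expired.
class InlineArrowStash {
  private entries = new Map<string, { bytes: Uint8Array; expiresAt: number }>();
  private totalBytes = 0;

  constructor(
    private maxBytes = 256 * 1024 * 1024, // bounded-memory budget
    private ttlMs = 10 * 60 * 1000,       // entry TTL
  ) {}

  /** Store decoded Arrow IPC bytes; null means "full — fall back to EXTERNAL_LINKS". */
  put(key: string, bytes: Uint8Array): string | null {
    this.evictExpired();
    if (this.totalBytes + bytes.byteLength > this.maxBytes) return null;
    const id = `inline-${key}-${Math.random().toString(36).slice(2)}`;
    this.entries.set(id, { bytes, expiresAt: Date.now() + this.ttlMs });
    this.totalBytes += bytes.byteLength;
    return id;
  }

  /** Drain-on-read: a second take() of the same id returns undefined. */
  take(id: string): Uint8Array | undefined {
    this.evictExpired();
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    this.entries.delete(id);
    this.totalBytes -= entry.bytes.byteLength;
    return entry.bytes;
  }

  private evictExpired(): void {
    const now = Date.now();
    for (const [id, e] of this.entries) {
      if (e.expiresAt <= now) {
        this.entries.delete(id);
        this.totalBytes -= e.bytes.byteLength;
      }
    }
  }
}
```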

3. Aborted stream still decoded + stashed (MEDIUM).
   If the client cancels the SSE between query completion and stash
   write, we still decoded the base64 attachment and held the bytes
   until TTL eviction. Re-check `signal.aborted` before decode/put so
   canceled streams exit cleanly.

4. Empty result message wrote `undefined` to the hook's state (LOW).
   The wire schema makes `data` optional; an empty result set may omit
   it. Normalize the missing case to `[]` so consumers can rely on
   `data` being either `null` (no message yet) or a value of the
   inferred result type.
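
   The normalization itself is a one-liner; sketched here with an
   illustrative name:

```typescript
// Finding 4 in one line: the wire schema leaves `data` optional, so the hook
// normalizes an omitted field to [] rather than writing undefined into state.
function normalizeResultData<T>(data?: T[]): T[] {
  return data ?? [];
}
```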

Also documents the process-local-memory constraint on the stash in its
docstring: a `GET /arrow-result/inline-*` that lands on a different
replica than the original SSE request will 410. Multi-replica
deployments need sticky sessions or a shared external store, neither in
scope for this PR.

Tests:
- `inline-arrow-stash`: replaced the eviction test with a rejection
  test that asserts `put()` returns null when the stash is full and
  that previously-issued ids remain takeable.
- `useAnalyticsQuery`: new test asserts an empty result message
  normalizes to [].

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
- agents.ts had two unused imports that biome's noUnusedImports rule
  flags as errors in CI. Drop them; behavior unchanged.
- inline-arrow-stash.test.ts: introduce a mustPut() helper that asserts
  the non-null contract for successful puts, so the new
  `put(): string | null` return type does not poison every downstream
  take() call with a string-vs-string|null TS error.
- Minor formatter touch-ups picked up by biome --write.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>


Successfully merging this pull request may close these issues.

Analytics plugin incompatible with ARROW_STREAM-only warehouses
