## Summary
Under high CloudFetch concurrency, query-runner pods retain ~2 GiB of Arrow buffers in the Go heap indefinitely after the burst ends. Memory is released only on process restart; there is no progressive release during sustained low load.
## Environment

- Driver: `databricks-sql-go` v1.7.1
- Go: 1.20+
- Arrow: `github.com/apache/arrow/go/v14`
- Server: production Databricks SQL warehouse
- Region: us-east-1
## Reproduction signature
A single customer org ran one Databricks query repeatedly over a 6-hour window:
| Property | Value |
|---|---|
| Result size | 3.19 GB |
| Chunks per query | 932 |
| Avg chunk size | ~3.4 MB |
| Max chunk size observed | 4.9 MB |
| Total runs in window | 26 |
| Peak concurrent runs | 6 |
| Total bytes pulled | ~135 GB |
| Window duration | ~6 hours |
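For reference, the workload is just the standard `database/sql` path at modest concurrency. A minimal sketch of its shape (the DSN and query are hypothetical; angle-bracket placeholders are redactions):

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"sync"

	_ "github.com/databricks/databricks-sql-go" // registers the "databricks" driver
)

func main() {
	// Hypothetical DSN; real values redacted.
	db, err := sql.Open("databricks", "token:<pat>@<host>:443/sql/1.0/warehouses/<id>")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 26 runs of the same large query, at most 6 in flight (matching the table above).
	sem := make(chan struct{}, 6)
	var wg sync.WaitGroup
	for i := 0; i < 26; i++ {
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			rows, err := db.QueryContext(context.Background(), "SELECT * FROM <big_table>") // ~3.19 GB result
			if err != nil {
				log.Print(err)
				return
			}
			defer rows.Close()
			for rows.Next() {
				// scan and discard; the driver pulls CloudFetch chunks underneath
			}
		}()
	}
	wg.Wait()
}
```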
## Observed behavior
Heap timeline (PST):

| Time | Bytes/hr | Heap | Note |
|---|---|---|---|
| May 3 23:00 | 22.4 GB | 0.3 GiB | burst starts (7 big queries) |
| May 4 02:00 | 19.8 GB | 0.3 GiB | burst (6 big queries) |
| May 4 04:00 | 49.4 GB | 2.0 GiB | burst peak (12 big queries) |
| May 4 05:00 | 28.2 GB | 2.0 GiB | burst ends (9 big queries) |
| May 4 06:00 → May 5 09:00 | 1-3 GB | 2.0 GiB | 28 hours, ZERO >1 GB queries, heap flat |
| May 5 09:58 (deploy) | | → 0.3 GiB | pod restart releases all buffers |
| May 5 10:00 → 16:00 | 1-2 GB | 0.3 GiB | normal traffic, heap stays low |
Stack trace for the ~2 GiB of retained allocations:

    (*cloudFetchDownloadTask).Run.func1
    getArrowRecords
    (*ipc.Reader).Next
    (*ipc.Reader).next
    newRecord
    (*arrayLoaderContext).loadArray
    (*arrayLoaderContext).loadBinary
    (*arrayLoaderContext).buffer
    (*ipcSource).buffer
    NewResizableBuffer
    (*Buffer).Resize / Reserve
    (*GoAllocator).Allocate
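For anyone wanting to capture an equivalent heap profile, a stock `net/http/pprof` endpoint is sufficient (generic Go sketch, nothing driver-specific and not our exact collection setup):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"time"
)

func main() {
	// Serve pprof on a side port. During the plateau, capture a profile with:
	//   go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	time.Sleep(time.Hour) // stand-in for the real workload
}
```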
## Why we believe this is retention rather than active workload
- No big queries during the plateau: Across the 28-hour heap plateau, we recorded zero queries returning >1 GB and zero repeat runs of the original 3.19 GB query. Workload was 1-3 GB/hr (10-50× lower than burst hours).
- Sharp release on pod restart: Heap dropped from 2 GiB to baseline within minutes of a pod restart.
## Questions
- Under high CloudFetch concurrency (≥6 simultaneous downloads), does the driver retain Arrow records or backing buffers in connection-level / batch-iterator state past the consumer's `Release()` call? (A sketch of the consumption pattern we mean follows this list.)
- Is there an internal cache or buffer-reuse mechanism inside `cloudFetchDownloadTask` / `getArrowRecords` / the IPC reader path that grows monotonically with peak concurrency and never evicts?
- Are there config options to bound buffer retention or force release after an idle period?
- Are there memory-related fixes between v1.7.1 and v1.11.0 that address this pattern?
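For context on the first question, this is the consumption pattern we mean: every record is released as soon as it has been processed. A simplified sketch, assuming the `GetArrowBatches` API from `github.com/databricks/databricks-sql-go/rows` (the `process` callback is a hypothetical stand-in for the real consumer):

```go
package consumer

import (
	"context"
	"database/sql"
	"database/sql/driver"

	"github.com/apache/arrow/go/v14/arrow"
	dbsqlrows "github.com/databricks/databricks-sql-go/rows"
)

// consume runs query and hands each Arrow record to process, releasing the
// record immediately afterwards. process must not retain the record.
func consume(ctx context.Context, db *sql.DB, query string, process func(arrow.Record)) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	return conn.Raw(func(dc any) error {
		rows, err := dc.(driver.QueryerContext).QueryContext(ctx, query, nil)
		if err != nil {
			return err
		}
		defer rows.Close()

		iter, err := rows.(dbsqlrows.Rows).GetArrowBatches(ctx)
		if err != nil {
			return err
		}
		defer iter.Close()

		for iter.HasNext() {
			rec, err := iter.Next()
			if err != nil {
				return err
			}
			process(rec)
			rec.Release() // drop our reference as soon as we're done
		}
		return nil
	})
}
```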
Happy to provide additional profiles, a controlled reproduction with `memory.NewCheckedAllocator`, or whatever else helps debug.
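Concretely, the checked-allocator reproduction we have in mind would decode a saved CloudFetch chunk through the same `ipc.Reader` path as the stack above and assert that all buffers are returned on release. A sketch, assuming the chunk is a raw Arrow IPC stream (which the stack suggests) saved to a hypothetical `chunk.arrow` file:

```go
package driverrepro

import (
	"bytes"
	"os"
	"testing"

	"github.com/apache/arrow/go/v14/arrow/ipc"
	"github.com/apache/arrow/go/v14/arrow/memory"
)

// Decode a captured chunk with a CheckedAllocator and verify that releasing
// the reader (and with it every record) returns the allocator to zero.
func TestChunkBuffersReleased(t *testing.T) {
	raw, err := os.ReadFile("chunk.arrow") // hypothetical capture of one downloaded chunk
	if err != nil {
		t.Skip("no captured chunk available")
	}

	mem := memory.NewCheckedAllocator(memory.NewGoAllocator())
	defer mem.AssertSize(t, 0) // fails if any Arrow buffer is still retained

	rdr, err := ipc.NewReader(bytes.NewReader(raw), ipc.WithAllocator(mem))
	if err != nil {
		t.Fatal(err)
	}
	for rdr.Next() {
		rec := rdr.Record()  // owned by the reader; no Retain, so no extra Release needed
		_ = rec.NumRows()    // stand-in for real consumption
	}
	if err := rdr.Err(); err != nil {
		t.Fatal(err)
	}
	rdr.Release() // should free everything the reader allocated
}
```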