Memory issue: Heap retention after CloudFetch burst on v1.7.1 #356

@swvooda

Summary

Under high CloudFetch concurrency, query-runner pods retain ~2 GiB of Arrow buffers in the Go heap indefinitely after the burst ends. Memory is released only on process restart; there is no progressive release during sustained low load.

Environment

  • Driver: databricks-sql-go v1.7.1
  • Go: 1.20+
  • Arrow: github.com/apache/arrow/go/v14
  • Server: production Databricks SQL warehouse
  • Region: us-east-1

Reproduction signature

A single customer org ran one Databricks query repeatedly over a 6-hour window:

Property                  Value
Result size               3.19 GB
Chunks per query          932
Avg chunk size            ~3.4 MB
Max chunk size observed   4.9 MB
Total runs in window      26
Peak concurrent runs      6
Total bytes pulled        ~135 GB
Window duration           ~6 hours
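
For context, the runner consumes results through the standard database/sql loop. A minimal sketch of the burst shape (the DSN, query, and table name are placeholders, not the production values):

package main

import (
    "database/sql"
    "log"
    "sync"

    _ "github.com/databricks/databricks-sql-go" // registers the "databricks" driver
)

func main() {
    // Placeholder DSN; real runs point at the production warehouse.
    db, err := sql.Open("databricks", "token:<pat>@<host>:443/sql/1.0/warehouses/<id>")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    const concurrency = 6 // observed peak concurrent runs
    var wg sync.WaitGroup
    for i := 0; i < concurrency; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Each run pulls the ~3.19 GB result via CloudFetch (~932 chunks).
            rows, err := db.Query("SELECT * FROM big_table") // placeholder query
            if err != nil {
                log.Println(err)
                return
            }
            defer rows.Close()
            for rows.Next() {
                // Scan omitted for brevity; the runner streams and discards rows,
                // holding no references to fetched data.
            }
        }()
    }
    wg.Wait()
    // After the burst ends, heap should return to baseline; on v1.7.1 it stays at ~2 GiB.
}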

Observed behavior

Heap timeline (PST):

Time                        Bytes/hr   Heap      Note
May 3 23:00                 22.4 GB    0.3 GiB   burst starts (7 big queries)
May 4 02:00                 19.8 GB    0.3 GiB   burst (6 big queries)
May 4 04:00                 49.4 GB    2.0 GiB   burst peak (12 big queries)
May 4 05:00                 28.2 GB    2.0 GiB   burst ends (9 big queries)
May 4 06:00 - May 5 09:00   1-3 GB     2.0 GiB   28 hours, zero >1 GB queries, heap flat
May 5 09:58                 (deploy)   0.3 GiB   pod restart releases all buffers
May 5 10:00 - May 5 16:00   1-2 GB     0.3 GiB   normal traffic, heap stays low

Allocation stack holding the retained ~2 GiB:

(*cloudFetchDownloadTask).Run.func1
  getArrowRecords
    (*ipc.Reader).Next
      (*ipc.Reader).next
        newRecord
          (*arrayLoaderContext).loadArray
            (*arrayLoaderContext).loadBinary
              (*arrayLoaderContext).buffer
                (*ipcSource).buffer
                  NewResizableBuffer
                    (*Buffer).Resize / Reserve
                      (*GoAllocator).Allocate
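
For anyone reproducing this: a stack like the one above can be pulled from a live pod with Go's built-in heap profiler. A minimal setup (the side port is an arbitrary choice):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
    // Serve pprof on a side port so heap profiles can be taken from a running pod.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... query-runner main loop ...
    select {}
}

Running go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap during the plateau shows which stacks hold the live bytes.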

Why we believe this is retention rather than active workload

  1. No big queries during the plateau: Across the 28-hour heap plateau, we recorded zero queries returning >1 GB and zero repeat runs of the original 3.19 GB query. Workload was 1-3 GB/hr (10-50× lower than burst hours).
  2. Sharp release on pod restart: heap dropped from 2 GiB to baseline within minutes of a pod restart.

Questions

  1. Under high CloudFetch concurrency (≥6 simultaneous downloads), does the driver retain Arrow records or backing buffers in connection-level / batch-iterator state past the consumer's Release() call?
  2. Is there an internal cache or buffer-reuse mechanism inside cloudFetchDownloadTask / getArrowRecords / the IPC reader path that grows monotonically with peak concurrency and never evicts?
  3. Are there config options to bound buffer retention or force release after an idle period?
  4. Are there memory-related fixes between v1.7.1 and v1.11.0 that address this pattern?

Happy to provide additional profiles, a controlled reproduction with memory.NewCheckedAllocator (sketched below), or whatever else helps with debugging.
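
A rough shape of that harness, assuming a downloaded chunk body can be fed to the IPC reader directly (checkChunk and its argument are hypothetical names):

package repro

import (
    "bytes"
    "fmt"

    "github.com/apache/arrow/go/v14/arrow/ipc"
    "github.com/apache/arrow/go/v14/arrow/memory"
)

// checkChunk decodes one Arrow IPC stream (e.g. a CloudFetch chunk body)
// through a CheckedAllocator and verifies that no buffers remain allocated
// once the consumer has released every record and the reader itself.
func checkChunk(chunk []byte) error {
    alloc := memory.NewCheckedAllocator(memory.NewGoAllocator())

    rdr, err := ipc.NewReader(bytes.NewReader(chunk), ipc.WithAllocator(alloc))
    if err != nil {
        return err
    }
    for rdr.Next() {
        rec := rdr.Record()
        rec.Retain()
        // ... consume the record ...
        rec.Release()
    }
    err = rdr.Err()
    rdr.Release()
    if err != nil {
        return err
    }

    // If anything on the reader path holds buffers past Release(),
    // the outstanding byte count here is non-zero.
    if n := alloc.CurrentAlloc(); n != 0 {
        return fmt.Errorf("retained %d bytes of Arrow buffers", n)
    }
    return nil
}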
