Memory issue: Heap retention after CloudFetch burst on v1.7.1 #356

@swvooda

Summary

Under high CloudFetch concurrency, query-runner pods retain ~2 GiB of Arrow buffers in the Go heap indefinitely after the burst ends. Memory is released only on process restart; there is no progressive release during sustained low load.

Environment

  • Driver: databricks-sql-go v1.7.1
  • Go: 1.20+
  • Arrow: github.com/apache/arrow/go/v14
  • Server: production Databricks SQL warehouse
  • Region: us-east-1

Reproduction signature

A single customer org ran one Databricks query repeatedly over a 6-hour window:

Property                  Value
Result size               3.19 GB
Chunks per query          932
Avg chunk size            ~3.4 MB
Max chunk size observed   4.9 MB
Total runs in window      26
Peak concurrent runs      6
Total bytes pulled        ~135 GB
Window duration           ~6 hours
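
For context, the runner consumes results through the standard database/sql loop. A minimal sketch of the burst shape (the DSN, query, and table name are placeholders, not the production values):

package main

import (
    "database/sql"
    "log"
    "sync"

    _ "github.com/databricks/databricks-sql-go" // registers the "databricks" driver
)

func main() {
    // Placeholder DSN; real runs point at the production warehouse.
    db, err := sql.Open("databricks", "token:<pat>@<host>:443/sql/1.0/warehouses/<id>")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    const concurrency = 6 // observed peak concurrent runs
    var wg sync.WaitGroup
    for i := 0; i < concurrency; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Each run pulls the ~3.19 GB result via CloudFetch (~932 chunks).
            rows, err := db.Query("SELECT * FROM big_table") // placeholder query
            if err != nil {
                log.Println(err)
                return
            }
            defer rows.Close()
            for rows.Next() {
                // Scan omitted for brevity; the runner streams and discards rows,
                // holding no references to fetched data.
            }
        }()
    }
    wg.Wait()
    // After the burst ends, heap should return to baseline; on v1.7.1 it stays at ~2 GiB.
}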

Observed behavior

Heap timeline (PST):

Time                        Bytes/hr   Heap      Note
May 3 23:00                 22.4 GB    0.3 GiB   burst starts (7 big queries)
May 4 02:00                 19.8 GB    0.3 GiB   burst (6 big queries)
May 4 04:00                 49.4 GB    2.0 GiB   burst peak (12 big queries)
May 4 05:00                 28.2 GB    2.0 GiB   burst ends (9 big queries)
May 4 06:00 - May 5 09:00   1-3 GB     2.0 GiB   28 hours, zero >1 GB queries, heap flat
May 5 09:58                 (deploy)   0.3 GiB   pod restart releases all buffers
May 5 10:00 - May 5 16:00   1-2 GB     0.3 GiB   normal traffic, heap stays low

Allocation stack holding the retained ~2 GiB:

(*cloudFetchDownloadTask).Run.func1
  getArrowRecords
    (*ipc.Reader).Next
      (*ipc.Reader).next
        newRecord
          (*arrayLoaderContext).loadArray
            (*arrayLoaderContext).loadBinary
              (*arrayLoaderContext).buffer
                (*ipcSource).buffer
                  NewResizableBuffer
                    (*Buffer).Resize / Reserve
                      (*GoAllocator).Allocate
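
For anyone reproducing this: a stack like the one above can be pulled from a live pod with Go's built-in heap profiler. A minimal setup (the side port is an arbitrary choice):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
    // Serve pprof on a side port so heap profiles can be taken from a running pod.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... query-runner main loop ...
    select {}
}

Running go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap during the plateau shows which stacks hold the live bytes.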

Why we believe this is retention rather than active workload

  1. No big queries during the plateau: Across the 28-hour heap plateau, we recorded zero queries returning >1 GB and zero repeat runs of the original 3.19 GB query. Workload was 1-3 GB/hr (10-50× lower than burst hours).
  2. Sharp release on pod restart: heap dropped from 2 GiB to baseline within minutes of a pod restart.

Questions

  1. Under high CloudFetch concurrency (≥6 simultaneous downloads), does the driver retain Arrow records or backing buffers in connection-level / batch-iterator state past the consumer's Release() call?
  2. Is there an internal cache or buffer-reuse mechanism inside cloudFetchDownloadTask / getArrowRecords / the IPC reader path that grows monotonically with peak concurrency and never evicts?
  3. Are there config options to bound buffer retention or force release after an idle period?
  4. Are there memory-related fixes between v1.7.1 and v1.11.0 that address this pattern?

Happy to provide additional profiles, a controlled reproduction with memory.NewCheckedAllocator (sketched below), or whatever else helps with debugging.
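
A rough shape of that harness, assuming a downloaded chunk body can be fed to the IPC reader directly (checkChunk and its argument are hypothetical names):

package repro

import (
    "bytes"
    "fmt"

    "github.com/apache/arrow/go/v14/arrow/ipc"
    "github.com/apache/arrow/go/v14/arrow/memory"
)

// checkChunk decodes one Arrow IPC stream (e.g. a CloudFetch chunk body)
// through a CheckedAllocator and verifies that no buffers remain allocated
// once the consumer has released every record and the reader itself.
func checkChunk(chunk []byte) error {
    alloc := memory.NewCheckedAllocator(memory.NewGoAllocator())

    rdr, err := ipc.NewReader(bytes.NewReader(chunk), ipc.WithAllocator(alloc))
    if err != nil {
        return err
    }
    for rdr.Next() {
        rec := rdr.Record()
        rec.Retain()
        // ... consume the record ...
        rec.Release()
    }
    err = rdr.Err()
    rdr.Release()
    if err != nil {
        return err
    }

    // If anything on the reader path holds buffers past Release(),
    // the outstanding byte count here is non-zero.
    if n := alloc.CurrentAlloc(); n != 0 {
        return fmt.Errorf("retained %d bytes of Arrow buffers", n)
    }
    return nil
}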
