## Summary
Filtering the Runs list by tag in the webapp causes ClickHouse to time out (`Timeout error`, the 30s default `request_timeout` from `@clickhouse/client`). For projects with non-trivial run volume, the tag filter is effectively unusable — the UI appears to crash.
## Environment
- Self-hosted trigger.dev v4 (helm chart)
- ClickHouse 25.7.2
- `task_runs_v2`: ~12M rows, 22 active parts, 2.2 GiB on disk, partitioned by `toYYYYMM(created_at)` across 4 months
- Reproduces consistently on any tag filter with a moderately old cursor
## Failing query (actual log)
```sql
SELECT run_id FROM trigger_dev.task_runs_v2 FINAL
WHERE organization_id = {organizationId: String}
  AND project_id = {projectId: String}
  AND environment_id = {environmentId: String}
  AND task_identifier IN {tasks: Array(String)}
  AND status IN {statuses: Array(String)}
  AND hasAny(tags, {tags: Array(String)})
  AND created_at >= fromUnixTimestamp64Milli({period: Int64})
  AND run_id < {runId: String}
ORDER BY created_at DESC, run_id DESC
LIMIT 2
```
Source: `apps/webapp/app/services/runsRepository/clickhouseRunsRepository.server.ts` (paginator "has-next" probe, `LIMIT 2`).
## Root cause
Three compounding factors make this query pathologically slow:
1. **`FINAL` on `ReplacingMergeTree`** — forces a merge at query time across every part matching the `(organization_id, project_id, environment_id)` prefix of the `ORDER BY` key. The existing `idx_tags` `tokenbf_v1` skip index helps at the part level, but `FINAL` still reads whole parts to resolve `_version`/`_is_deleted`.
2. **`hasAny(tags, ...)` applied post-`FINAL`** — because the predicate is evaluated after the `FINAL` merge, many granules are read and merged before the tag predicate eliminates them, so the bloom filter's benefit is largely lost.
3. **Cursor pagination with `LIMIT 2` and no upper bound on `created_at`** — when the cursor (`run_id < {runId}`) sits far back in time and the tag hit density is low, ClickHouse scans backward through many granules looking for 2 matching rows. Each scan repeats the `FINAL` + tag work.
Observed in production: the query exceeds the hard-coded 30s `@clickhouse/client` default `request_timeout` (no override in `internal-packages/clickhouse/src/client/client.ts`), so the webapp logs `{"name":"ClickHouse","error":{"message":"Timeout error."}}` and returns an error to the UI.
## Reproduction
1. Self-hosted instance with ≥10M rows in `task_runs_v2` across 3+ monthly partitions.
2. Open the Runs list in the dashboard.
3. Apply a tag filter where matches are sparse (e.g. 1 tag that appears in <1% of runs).
4. Paginate past the first page (the cursor advances into older data).
5. Observe a 30s hang, then a UI error. Webapp logs show the ClickHouse `Timeout error` with the query above.
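For anyone reproducing this, the read amplification is visible in `system.query_log` (assuming query logging is enabled, which is the default). A sketch — the `ILIKE` pattern is just a loose way to find the probe query, not anything from the codebase:

```sql
-- Compare read_rows vs result_rows for the tag-filter probe:
-- a huge gap confirms granules are being read and merged only to be
-- discarded by the hasAny() predicate.
SELECT
    event_time,
    type,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read,
    result_rows
FROM system.query_log
WHERE type IN ('QueryFinish', 'ExceptionWhileProcessing')
  AND query ILIKE '%task_runs_v2%hasAny%'
ORDER BY event_time DESC
LIMIT 5
```

In our case this shows tens of millions of `read_rows` for a `result_rows` of at most 2.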
## Proposed fixes (any of these would help)
1. **Push the tag filter into `PREWHERE`** — let ClickHouse apply the bloom-filter-friendly predicate before the `FINAL` merge work:

   ```sql
   SELECT run_id FROM trigger_dev.task_runs_v2 FINAL
   PREWHERE hasAny(tags, {tagsFilter: Array(String)})
   WHERE organization_id = ... AND ...
   ```
2. **Drop `FINAL`; dedupe via the `argMax(_version)` aggregation pattern** — avoids the merge at query time. Requires a `GROUP BY` on `run_id` and `argMax(status, _version)` etc., plus a `_is_deleted = 0` filter. Heavier per row, but avoids the worst case.
3. **Expose `request_timeout` on `ClickhouseConfig`** — `internal-packages/clickhouse/src/client/client.ts:59` currently doesn't pass `request_timeout` to `createClient`. Allow self-hosters to bump it via env while a proper query fix lands.
4. **Bound the upper side of `created_at` on cursor queries** — carry the cursor row's `created_at` alongside `run_id` and add `created_at <= {cursorCreatedAt}` to prune partitions.
Fixes (1) + (3) are the smallest surface-area changes and would unblock most users.
## Workaround we're applying
Bumping ClickHouse pod resources (requests 4 CPU / 16 Gi, limits 8 CPU / 32 Gi) to reduce the chance of hitting the 30s client timeout. This doesn't fix the underlying quadratic-ish scan pattern — the tag filter will still degrade as the table grows.
Happy to open a PR for (1) + (3) if there's interest.