## Summary
Filtering the Runs list by tag in the webapp causes ClickHouse to time out (`Timeout error`, the 30s default `request_timeout` from `@clickhouse/client`). For projects with non-trivial run volume, the tag filter is effectively unusable — the UI appears to crash.
## Environment
- Self-hosted trigger.dev v4 (helm chart)
- ClickHouse 25.7.2
- `task_runs_v2`: ~12M rows, 22 active parts, 2.2 GiB on disk, partitioned by `toYYYYMM(created_at)` across 4 months
- Reproduces consistently on any tag filter with a moderately old cursor
## Failing query (actual log)
```sql
SELECT run_id FROM trigger_dev.task_runs_v2 FINAL
WHERE organization_id = {organizationId: String}
  AND project_id = {projectId: String}
  AND environment_id = {environmentId: String}
  AND task_identifier IN {tasks: Array(String)}
  AND status IN {statuses: Array(String)}
  AND hasAny(tags, {tags: Array(String)})
  AND created_at >= fromUnixTimestamp64Milli({period: Int64})
  AND run_id < {runId: String}
ORDER BY created_at DESC, run_id DESC
LIMIT 2
```
Source: `apps/webapp/app/services/runsRepository/clickhouseRunsRepository.server.ts` (paginator "has-next" probe, `LIMIT 2`).
## Root cause
Three compounding factors make this query pathologically slow:
1. **`FINAL` on `ReplacingMergeTree`** — forces a merge at query time across every part matching the `(organization_id, project_id, environment_id)` prefix of the `ORDER BY` key. The existing `idx_tags` `tokenbf_v1` skip index helps at the part level, but `FINAL` still reads whole parts to resolve `_version`/`_is_deleted`.
2. **`hasAny(tags, ...)` applied post-`FINAL`** — because the predicate is evaluated after the `FINAL` merge, many granules are read and merged before the tag predicate eliminates them, so the bloom filter's benefit is largely lost.
3. **Cursor pagination with `LIMIT 2` and no upper bound on `created_at`** — when the cursor (`run_id < {runId}`) sits far back in time and the tag hit density is low, ClickHouse scans backward through many granules looking for 2 matching rows. Each scan repeats the `FINAL` + tag work.
Observed in production: the query exceeds the hard-coded 30s `@clickhouse/client` default `request_timeout` (no override in `internal-packages/clickhouse/src/client/client.ts`), so the webapp logs `{"name":"ClickHouse","error":{"message":"Timeout error."}}` and returns an error to the UI.
## Reproduction
1. Self-hosted instance with ≥10M rows in `task_runs_v2` across 3+ monthly partitions.
2. Open the Runs list in the dashboard.
3. Apply a tag filter where matches are sparse (e.g. 1 tag that appears in <1% of runs).
4. Paginate past the first page (the cursor advances into older data).
5. Observe a 30s hang, then a UI error. Webapp logs show the ClickHouse `Timeout error` with the query above.
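For anyone reproducing this, the read amplification is visible in `system.query_log` (assuming query logging is enabled, which is the default). A sketch — the `ILIKE` pattern is just a loose way to find the probe query, not anything from the codebase:

```sql
-- Compare read_rows vs result_rows for the tag-filter probe:
-- a huge gap confirms granules are being read and merged only to be
-- discarded by the hasAny() predicate.
SELECT
    event_time,
    type,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read,
    result_rows
FROM system.query_log
WHERE type IN ('QueryFinish', 'ExceptionWhileProcessing')
  AND query ILIKE '%task_runs_v2%hasAny%'
ORDER BY event_time DESC
LIMIT 5
```

In our case this shows tens of millions of `read_rows` for a `result_rows` of at most 2.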
## Proposed fixes (any of these would help)
1. **Push the tag filter into `PREWHERE`** — let ClickHouse apply the bloom-filter-friendly predicate before the `FINAL` merge work:

   ```sql
   SELECT run_id FROM trigger_dev.task_runs_v2 FINAL
   PREWHERE hasAny(tags, {tagsFilter: Array(String)})
   WHERE organization_id = ... AND ...
   ```
2. **Drop `FINAL`; dedupe via the `argMax(_version)` aggregation pattern** — avoids the merge at query time. Requires a `GROUP BY` on `run_id` and `argMax(status, _version)` etc., plus a `_is_deleted = 0` filter. Heavier per row, but avoids the worst case.
3. **Expose `request_timeout` on `ClickhouseConfig`** — `internal-packages/clickhouse/src/client/client.ts:59` currently doesn't pass `request_timeout` to `createClient`. Allow self-hosters to bump it via env while a proper query fix lands.
4. **Bound the upper side of `created_at` on cursor queries** — carry the cursor row's `created_at` alongside `run_id` and add `created_at <= {cursorCreatedAt}` to prune partitions.
Fixes (1) + (3) are the smallest surface-area changes and would unblock most users.
## Workaround we're applying
Bumping ClickHouse pod resources (requests 4 CPU / 16 Gi, limits 8 CPU / 32 Gi) to reduce the chance of hitting the 30s client timeout. This doesn't fix the underlying quadratic-ish scan pattern — the tag filter will still degrade as the table grows.
Happy to open a PR for (1) + (3) if there's interest.