Skip to content

feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396

Open
amir-deris wants to merge 4 commits intomainfrom
amir/plt-323-migrate-app-metrics-to-otel-meter-api
Open

feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396
amir-deris wants to merge 4 commits intomainfrom
amir/plt-323-migrate-app-metrics-to-otel-meter-api

Conversation

@amir-deris
Copy link
Copy Markdown
Contributor

@amir-deris amir-deris commented May 5, 2026

Summary

Migrates telemetry in the app and app/ante packages from the legacy telemetry/utilmetrics helpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3265 for evmrpc.

  • Adds app/metrics.go with a struct-based OTel instrument set registered via otel.Meter("app"), initialized once in NewApp after SetupOtelMetricsProvider()
  • Adds app/ante/metrics.go with a lazily-initialized anteMetrics struct via sync.OnceValue (fires on first CheckTx, after the global MeterProvider is set)
  • Replaces telemetry.MeasureSince / utilmetrics calls in abci.go and invariance.go with dual-emitted OTel instruments; legacy calls remain with TODO(PLT-327) markers until dashboards are migrated
  • Threads ctx/ctx.Context() through all .Record() and .Add() call sites to support OTel's context-based propagation
  • Replaces the per-BeginBlock GaugeSeidVersionAndCommit call with an app_build_info observable gauge that fires on Prometheus scrape — no per-block overhead

New metrics (OTel naming convention, exported via the process-wide MeterProvider)

ABCI phase durations — histograms bucketed at SLO thresholds (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s):

  • app_abci_begin_block_duration_seconds
  • app_abci_end_block_duration_seconds
  • app_abci_module_end_block_duration_seconds
  • app_abci_check_tx_duration_seconds
  • app_abci_deliver_tx_duration_seconds
  • app_abci_deliver_batch_tx_duration_seconds
  • app_block_process_duration_seconds — block tx processing duration by execution type

Transaction counters and gas:

  • app_tx_count_total
  • app_tx_process_type_total — by execution type label
  • app_tx_gas_total — cumulative gas by type label
  • app_tx_gas_used / app_tx_gas_wanted — per-tx gauges

App-level flow counters:

  • app_optimistic_processing_total — cache hit (enabled=true) vs miss
  • app_failed_total_gas_wanted_check_total
  • app_giga_fallback_to_v2_total

Light invariance:

  • app_lightinvariance_supply_duration_seconds
  • app_lightinvariance_supply_invalid_key_total
  • app_lightinvariance_supply_unmarshal_failure_total

Build info:

  • app_build_info — observable gauge, always 1, labels: seid_version, commit

Ante:

  • app_pending_nonce_total — pending nonce events by type (added, expired, rejected, accepted)

Migration note

Legacy metrics (telemetry.MeasureSince, utilmetrics.MeasureDeliverTxDuration, etc.) are dual-emitted during the migration window so existing dashboards continue to receive data. Each legacy call site is annotated TODO(PLT-327) and will be removed once dashboards are verified against the new OTel series.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedMay 6, 2026, 9:03 PM

@amir-deris amir-deris changed the title Added otel metrics for app and app/ante feat(app): migrate app and ante telemetry to OpenTelemetry Meter API May 6, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 74.44934% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.12%. Comparing base (8310d93) to head (b7b4868).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
app/invariance.go 30.76% 18 Missing ⚠️
app/metrics.go 90.26% 9 Missing and 2 partials ⚠️
app/abci.go 76.74% 10 Missing ⚠️
app/app.go 66.66% 9 Missing ⚠️
app/ante/evm_checktx.go 0.00% 8 Missing ⚠️
app/ante/metrics.go 80.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3396      +/-   ##
==========================================
+ Coverage   59.10%   59.12%   +0.02%     
==========================================
  Files        2101     2103       +2     
  Lines      173195   173882     +687     
==========================================
+ Hits       102360   102801     +441     
- Misses      61955    62184     +229     
- Partials     8880     8897      +17     
Flag Coverage Δ
sei-chain-pr 50.52% <74.44%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
app/ante/metrics.go 80.00% <80.00%> (ø)
app/ante/evm_checktx.go 34.72% <0.00%> (-0.66%) ⬇️
app/app.go 65.52% <66.66%> (-4.34%) ⬇️
app/abci.go 63.85% <76.74%> (+1.35%) ⬆️
app/metrics.go 90.26% <90.26%> (ø)
app/invariance.go 55.81% <30.76%> (+14.77%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@bdchatham bdchatham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migration shape is right and dual-emit is sound. Three blockers worth fixing before merge — all silent (no runtime error, just wrong data):

  1. Counter and histogram names will double-suffix under the Prometheus exporter (_total_total, _seconds_seconds).
  2. app_tx_count_total double-counts every tx via the unlabeled + labeled Add pair.
  3. txGasUsed / txGasWanted as Int64Gauge is last-write-wins under OCC concurrency.

Plus should-fixes around bucket density at the p99=2.5s SLO threshold, the lazy sync.OnceValue init in app/ante/, and the raw proposer label encoding. Inline below.

Dashboard rewrites land alongside the PLT-327 removal step; existing legacy refs in clusters/prod/monitoring/grafana-dashboards-protocol.yaml are summary-typed ({quantile="0.5"}), so query rewrites are histogram_quantile-based, not 1:1. No alert/recording rules touch these names — the dual-emit window is sufficient buffer.

Comment thread app/metrics.go
))

appMetrics.txCount = must(meter.Int64Counter(
"app_tx_count_total",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[blocker] The Prometheus exporter appends _total to OTel counters automatically; this instrument will export as app_tx_count_total_total. Same hits every _total-suffixed counter here (app_tx_process_type_total, app_tx_gas_total, app_optimistic_processing_total, app_failed_total_gas_wanted_check_total, app_giga_fallback_to_v2_total, app_lightinvariance_supply_*_total) and the histograms — _seconds + WithUnit("s")*_seconds_seconds.

Drop the _total suffix from counter names and _seconds from histogram names; the exporter adds them. Verify against /metrics from a node running this build before any dashboard work.

Comment thread app/metrics.go
// histogramBuckets aligns with block-time SLO thresholds
// (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s) expressed in seconds.
var histogramBuckets = metric.WithExplicitBucketBoundaries(
0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0,
Copy link
Copy Markdown
Contributor

@bdchatham bdchatham May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[should-fix] Buckets jump 2.5 → 5.0; with the SLO at p99=2.5s, any p99 between 2.5s and 5s reports inside [2.5, 5.0] with up to ~2.5s of histogram_quantile interpolation error. Add 3.0 (ideally 4.0) to tighten resolution at the threshold.

Separately: this set is reused for commitDuration and invarianceDuration. Unless they all really share the same proper bucket sizing, consider separating these.

Comment thread app/metrics.go
metric.WithDescription("Cumulative transaction gas by type (gas_used, gas_wanted)"),
))

appMetrics.txGasUsed = must(meter.Int64Gauge(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[blocker] Synchronous Int64Gauge is last-write-wins per scrape. Under OCC concurrent execution (ProcessTXsWithOCCV2), thousands of DeliverTx per 15-30s scrape window means the published value is "the gas of one arbitrary tx" — not aggregable, not meaningful. txGasWanted (below) has the same issue.

app_tx_gas_total (lines 126-129) already covers cumulative gas, and the _count series of app_abci_deliver_tx_duration_seconds covers tx rate. If you want a per-tx gas distribution, convert these to Float64Histogram. Otherwise drop both — they're carrying forward the telemetry.SetGauge anti-pattern.

Comment thread app/abci.go
Comment on lines +163 to +164
appMetrics.txCount.Add(ctx.Context(), 1)
appMetrics.txCount.Add(ctx.Context(), 1, otelmetrics.WithAttributes(attribute.String("result", resultStr)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[blocker] Double-counts every tx: line 163 is an unlabeled Add(1) and line 164 is a labeled Add(1, result=...). sum(app_tx_count_total) returns 2× actual tx count; sum by (result)(app_tx_count_total) is off by N (the unlabeled adds land in a series with no result label).

Legacy emitted two distinct metric names (tx_count and tx_<result>); 1:1 port onto a labeled counter doesn't compose. Drop line 163, or split into two metrics (app_tx_count_total + app_tx_result_total).

Comment thread app/ante/metrics.go

// getAnteMetrics returns the package-level OTel instruments, initializing them
// lazily on first call (after the global MeterProvider is set in NewApp).
var getAnteMetrics = sync.OnceValue(func() *anteMetrics {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[should-fix] Adds an atomic load to every CheckTx (hot path), and — more importantly — if anything calls getAnteMetrics() before SetupOtelMetricsProvider() runs, the noop provider binds permanently for the process lifetime.

Better: eager init at the same point initAppMetrics() runs in NewApp (app/app.go:491). Mirror the app pattern — package-level var anteMetrics anteMetrics + func InitAnteMetrics() called explicitly after SetupOtelMetricsProvider.

Comment thread app/app.go
metrics.IncrFailedTotalGasWantedCheck(string(req.Header.ProposerAddress))
utilmetrics.IncrFailedTotalGasWantedCheck(string(req.Header.ProposerAddress)) // TODO(PLT-327): remove once app_failed_total_gas_wanted_check_total verified
appMetrics.failedGasWantedCheck.Add(ctx.Context(), 1,
otelmetric.WithAttributes(attribute.String("proposer", string(req.Header.ProposerAddress))))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[should-fix] string(req.Header.ProposerAddress) is a raw 20-byte consensus address; non-UTF8 bytes make the Prom label value untrustworthy and unreadable in dashboards. Cardinality is bounded (~50 validators on pacific-1), so not a blocker — but the value isn't useful as-is.

Either drop the label (proposer attribution belongs in a span event or log, not a metric label), or hex-encode: attribute.String("proposer", hex.EncodeToString(req.Header.ProposerAddress)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants