feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396
feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396amir-deris wants to merge 4 commits intomainfrom
Conversation
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3396 +/- ##
==========================================
+ Coverage 59.10% 59.12% +0.02%
==========================================
Files 2101 2103 +2
Lines 173195 173882 +687
==========================================
+ Hits 102360 102801 +441
- Misses 61955 62184 +229
- Partials 8880 8897 +17
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
bdchatham
left a comment
There was a problem hiding this comment.
Migration shape is right and dual-emit is sound. Three blockers worth fixing before merge — all silent (no runtime error, just wrong data):
- Counter and histogram names will double-suffix under the Prometheus exporter (
_total_total,_seconds_seconds). app_tx_count_totaldouble-counts every tx via the unlabeled + labeledAddpair.txGasUsed/txGasWantedasInt64Gaugeis last-write-wins under OCC concurrency.
Plus should-fixes around bucket density at the p99=2.5s SLO threshold, the lazy sync.OnceValue init in app/ante/, and the raw proposer label encoding. Inline below.
Dashboard rewrites land alongside the PLT-327 removal step; existing legacy refs in clusters/prod/monitoring/grafana-dashboards-protocol.yaml are summary-typed ({quantile="0.5"}), so query rewrites are histogram_quantile-based, not 1:1. No alert/recording rules touch these names — the dual-emit window is sufficient buffer.
| )) | ||
|
|
||
| appMetrics.txCount = must(meter.Int64Counter( | ||
| "app_tx_count_total", |
There was a problem hiding this comment.
[blocker] The Prometheus exporter appends _total to OTel counters automatically; this instrument will export as app_tx_count_total_total. Same hits every _total-suffixed counter here (app_tx_process_type_total, app_tx_gas_total, app_optimistic_processing_total, app_failed_total_gas_wanted_check_total, app_giga_fallback_to_v2_total, app_lightinvariance_supply_*_total) and the histograms — _seconds + WithUnit("s") → *_seconds_seconds.
Drop the _total suffix from counter names and _seconds from histogram names; the exporter adds them. Verify against /metrics from a node running this build before any dashboard work.
| // histogramBuckets aligns with block-time SLO thresholds | ||
| // (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s) expressed in seconds. | ||
| var histogramBuckets = metric.WithExplicitBucketBoundaries( | ||
| 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, |
There was a problem hiding this comment.
[should-fix] Buckets jump 2.5 → 5.0; with the SLO at p99=2.5s, any p99 between 2.5s and 5s reports inside [2.5, 5.0] with up to ~2.5s of histogram_quantile interpolation error. Add 3.0 (ideally 4.0) to tighten resolution at the threshold.
Separately: this set is reused for commitDuration and invarianceDuration. Unless they all really share the same proper bucket sizing, consider separating these.
| metric.WithDescription("Cumulative transaction gas by type (gas_used, gas_wanted)"), | ||
| )) | ||
|
|
||
| appMetrics.txGasUsed = must(meter.Int64Gauge( |
There was a problem hiding this comment.
[blocker] Synchronous Int64Gauge is last-write-wins per scrape. Under OCC concurrent execution (ProcessTXsWithOCCV2), thousands of DeliverTx per 15-30s scrape window means the published value is "the gas of one arbitrary tx" — not aggregable, not meaningful. txGasWanted (below) has the same issue.
app_tx_gas_total (lines 126-129) already covers cumulative gas, and the _count series of app_abci_deliver_tx_duration_seconds covers tx rate. If you want a per-tx gas distribution, convert these to Float64Histogram. Otherwise drop both — they're carrying forward the telemetry.SetGauge anti-pattern.
| appMetrics.txCount.Add(ctx.Context(), 1) | ||
| appMetrics.txCount.Add(ctx.Context(), 1, otelmetrics.WithAttributes(attribute.String("result", resultStr))) |
There was a problem hiding this comment.
[blocker] Double-counts every tx: line 163 is an unlabeled Add(1) and line 164 is a labeled Add(1, result=...). sum(app_tx_count_total) returns 2× actual tx count; sum by (result)(app_tx_count_total) is off by N (the unlabeled adds land in a series with no result label).
Legacy emitted two distinct metric names (tx_count and tx_<result>); 1:1 port onto a labeled counter doesn't compose. Drop line 163, or split into two metrics (app_tx_count_total + app_tx_result_total).
|
|
||
| // getAnteMetrics returns the package-level OTel instruments, initializing them | ||
| // lazily on first call (after the global MeterProvider is set in NewApp). | ||
| var getAnteMetrics = sync.OnceValue(func() *anteMetrics { |
There was a problem hiding this comment.
[should-fix] Adds an atomic load to every CheckTx (hot path), and — more importantly — if anything calls getAnteMetrics() before SetupOtelMetricsProvider() runs, the noop provider binds permanently for the process lifetime.
Better: eager init at the same point initAppMetrics() runs in NewApp (app/app.go:491). Mirror the app pattern — package-level var anteMetrics anteMetrics + func InitAnteMetrics() called explicitly after SetupOtelMetricsProvider.
| metrics.IncrFailedTotalGasWantedCheck(string(req.Header.ProposerAddress)) | ||
| utilmetrics.IncrFailedTotalGasWantedCheck(string(req.Header.ProposerAddress)) // TODO(PLT-327): remove once app_failed_total_gas_wanted_check_total verified | ||
| appMetrics.failedGasWantedCheck.Add(ctx.Context(), 1, | ||
| otelmetric.WithAttributes(attribute.String("proposer", string(req.Header.ProposerAddress)))) |
There was a problem hiding this comment.
[should-fix] string(req.Header.ProposerAddress) is a raw 20-byte consensus address; non-UTF8 bytes make the Prom label value untrustworthy and unreadable in dashboards. Cardinality is bounded (~50 validators on pacific-1), so not a blocker — but the value isn't useful as-is.
Either drop the label (proposer attribution belongs in a span event or log, not a metric label), or hex-encode: attribute.String("proposer", hex.EncodeToString(req.Header.ProposerAddress)).
Summary
Migrates telemetry in the
appandapp/antepackages from the legacytelemetry/utilmetricshelpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3265 forevmrpc.app/metrics.gowith a struct-based OTel instrument set registered viaotel.Meter("app"), initialized once inNewAppafterSetupOtelMetricsProvider()app/ante/metrics.gowith a lazily-initializedanteMetricsstruct viasync.OnceValue(fires on first CheckTx, after the global MeterProvider is set)telemetry.MeasureSince/utilmetricscalls inabci.goandinvariance.gowith dual-emitted OTel instruments; legacy calls remain withTODO(PLT-327)markers until dashboards are migratedctx/ctx.Context()through all.Record()and.Add()call sites to support OTel's context-based propagationGaugeSeidVersionAndCommitcall with anapp_build_infoobservable gauge that fires on Prometheus scrape — no per-block overheadNew metrics (OTel naming convention, exported via the process-wide MeterProvider)
ABCI phase durations — histograms bucketed at SLO thresholds (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s):
app_abci_begin_block_duration_secondsapp_abci_end_block_duration_secondsapp_abci_module_end_block_duration_secondsapp_abci_check_tx_duration_secondsapp_abci_deliver_tx_duration_secondsapp_abci_deliver_batch_tx_duration_secondsapp_block_process_duration_seconds— block tx processing duration by execution typeTransaction counters and gas:
app_tx_count_totalapp_tx_process_type_total— by execution type labelapp_tx_gas_total— cumulative gas by type labelapp_tx_gas_used/app_tx_gas_wanted— per-tx gaugesApp-level flow counters:
app_optimistic_processing_total— cache hit (enabled=true) vs missapp_failed_total_gas_wanted_check_totalapp_giga_fallback_to_v2_totalLight invariance:
app_lightinvariance_supply_duration_secondsapp_lightinvariance_supply_invalid_key_totalapp_lightinvariance_supply_unmarshal_failure_totalBuild info:
app_build_info— observable gauge, always 1, labels:seid_version,commitAnte:
app_pending_nonce_total— pending nonce events by type (added, expired, rejected, accepted)Migration note
Legacy metrics (
telemetry.MeasureSince,utilmetrics.MeasureDeliverTxDuration, etc.) are dual-emitted during the migration window so existing dashboards continue to receive data. Each legacy call site is annotatedTODO(PLT-327)and will be removed once dashboards are verified against the new OTel series.