
Track superseded mempool errors separately #4385

Open
tilacog wants to merge 43 commits into main from mempool-metric-superseded

Conversation

tilacog (Contributor) commented May 5, 2026

Description

Mempools::execute() runs all configured mempools concurrently and returns the first one that succeeds. Previously, errors from mempools that lost the race were counted as real failures, even though the overall submission was successful. Dropped mempools were never recorded.

This skewed per-mempool counts in two ways:

  • counting errors alongside a successful submission, and
  • omitting counts for dropped mempools.

This PR keeps the racing behavior but changes how observation works.

Changes

Behavior

  • On first success, emit mempool_succeeded for the winner and mempool_superseded for every other configured mempool.
  • On all-failed, emit mempool_failed for each mempool with its own error label.
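The labeling rule above can be sketched as a small pure function (the enum and function names here are illustrative, not the driver's actual API; the real code emits these as Prometheus labels via the observe module):

```rust
// Sketch of the per-mempool labeling rule described above. `Label` and
// `label` are illustrative names, not the PR's actual identifiers.
#[derive(Debug, PartialEq)]
enum Label {
    Success,    // this mempool won the race
    Superseded, // another mempool won first
    Failed,     // no mempool succeeded; the specific error label applies
}

/// Derive the metric label for mempool `i` given the race outcome:
/// `winner` is the index of the first mempool to succeed, if any.
fn label(winner: Option<usize>, i: usize) -> Label {
    match winner {
        Some(w) if w == i => Label::Success,
        // Every non-winner is Superseded, even if it had already errored.
        Some(_) => Label::Superseded,
        // All-failed: each mempool records its own failure.
        None => Label::Failed,
    }
}

fn main() {
    // One winner (index 0) out of three configured mempools.
    assert_eq!(label(Some(0), 0), Label::Success);
    assert_eq!(label(Some(0), 1), Label::Superseded);
    // No winner at all.
    assert_eq!(label(None, 2), Label::Failed);
}
```

With N configured mempools this emits exactly N events per settlement, which is what keeps the alert ratio symmetric between the happy and failure paths.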

Code specific

  • Replace select_ok with FuturesUnordered in Mempools::execute so the consumer can observe each completion (crates/driver/src/domain/mempools.rs).
  • Split observe::mempool_executed into mempool_succeeded(&SubmissionSuccess) and mempool_failed(&mempools::Error), dropping the Result<&S, &E> indirection now that each call site already knows which branch it is on. Behavior and emitted metrics are unchanged by the split.
  • Add mempool_superseded(&Mempool, winner: &Mempool, &Settlement) which increments driver_mempool_submission with result="Superseded".

How to test

Existing driver unit tests cover the race semantics; this PR does not change the externally observable submission outcome, only how observation is sequenced and labeled. To verify manually:

  1. Run the driver against a config with at least two mempools.
  2. Trigger a settlement that succeeds via the public mempool.
  3. Confirm Prometheus shows one result="Success" increment for the winner and one result="Superseded" increment for the loser; no Revert/Expired/Other from the loser.
  4. Trigger a settlement that fails on every mempool and confirm each mempool gets its own non-Superseded failure label.

Alert query update needed when deploying

Per-mempool success counts both wins and races-lost (so happy and failure paths both emit N events for N configured mempools, keeping the ratio symmetric). Superseded stays as a separate label so dashboards can still distinguish wins from race-losses per mempool.

sum by (network) (increase(driver_mempool_submission{cow_fi_environment="prod",result=~"Success|Superseded"}[2h]))
/
sum by (network) (increase(driver_mempool_submission{cow_fi_environment="prod",result!="Disabled"}[2h])) < 0.6

@tilacog tilacog force-pushed the mempool-metric-superseded branch 2 times, most recently from a798fc3 to 9905e6e on May 6, 2026 15:39
A comment from @tilacog was marked as outdated.

tilacog added 7 commits May 6, 2026 12:54
When `Mempools::execute()` runs mempools in parallel, errors from mempools
whose results were discarded after another mempool succeeded were still
recorded against `driver_mempool_submission`, biasing the per-mempool
success ratio with timing-dependent shadowed failures.

Replace `select_ok` with `FuturesUnordered` + manual loop so observation
runs in the consuming context. Errors that occur before another mempool
succeeds are now recorded under a new `Superseded` label via
`observe::mempool_superseded`, which also records the winning mempool in
the trace fields. Errors in the all-failed case keep their existing
labels (Revert / Expired / Other / Disabled).

Alert query update needed when deploying:

    sum by (network) (increase(driver_mempool_submission{cow_fi_environment="prod",result="Success"}[2h]))
    /
    sum by (network) (increase(driver_mempool_submission{cow_fi_environment="prod",result!~"Disabled|Superseded"}[2h])) < 0.6
`mempool_executed` took a `Result<&SubmissionSuccess, &mempools::Error>`
and re-matched the same discriminant several times to pick the log level,
metric label, and block-passed labels. Replace it with two functions,
`mempool_succeeded(&SubmissionSuccess)` and `mempool_failed(&mempools::Error)`,
so each branch is straight-line and call sites pick the correct observer
directly. Behavior and emitted metrics are unchanged.
@tilacog tilacog force-pushed the mempool-metric-superseded branch from 9905e6e to d9fb0cb on May 6, 2026 15:55
@cowprotocol cowprotocol deleted a comment from github-actions Bot May 6, 2026
fleupold (Contributor) left a comment


Is there a reason you are not using the PR template for the description?

I agree with the change; however, I'd like to suggest that we interpret "superseded" events as successes with regard to how you envision changing the metric. A superseded submission should be considered a successful one.

This way we receive N (# of mempools) events in the happy case and N events in the failure case, allowing us to keep our alert metric as a ratio of successful to failed ones (otherwise failed events would be weighted N times more than successful ones).

Every loser in a mempool race is now marked Superseded, whether it
failed before the winner finished or was still in flight when the
winner landed. The old code only labelled already-failed losers as
superseded and quietly dropped ones still in flight; the
shadowed_errors accumulator that carried their errors across is gone.

Minor cleanup:

- Error::blocks_passed on the domain type returns the block delta
from submission to the terminal event for variants that carry
block-level timing. This replaces the inline match in mempool_failed.
- error_label is shared between mempool_failed and the per-attempt
counter so the Prometheus labels stay in sync.

The all-failed path also swaps the expect for an explicit
Error::Other fallback instead of panicking on the (currently
unreachable) empty-errors case.
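The blocks_passed helper this commit describes could look roughly like the following; the variant names and fields are simplified stand-ins, not the real mempools::Error definition:

```rust
// Simplified stand-in for the domain error type; the real variants in
// crates/driver/src/domain/mempools.rs carry richer data.
#[derive(Debug)]
enum Error {
    // Variants that carry block-level timing.
    Revert { submitted_at: u64, reverted_at: u64 },
    Expired { submitted_at: u64, deadline: u64 },
    // Variants without block-level timing.
    Disabled,
    Other,
}

impl Error {
    /// Block delta from submission to the terminal event, for variants
    /// that track it; `None` otherwise.
    fn blocks_passed(&self) -> Option<u64> {
        match self {
            Error::Revert { submitted_at, reverted_at } => {
                Some(reverted_at.saturating_sub(*submitted_at))
            }
            Error::Expired { submitted_at, deadline } => {
                Some(deadline.saturating_sub(*submitted_at))
            }
            Error::Disabled | Error::Other => None,
        }
    }
}

fn main() {
    let err = Error::Revert { submitted_at: 100, reverted_at: 103 };
    assert_eq!(err.blocks_passed(), Some(3));
    assert_eq!(Error::Disabled.blocks_passed(), None);
}
```

Centralizing the delta here is what lets mempool_failed and the per-attempt counter share one source of truth for the block-passed labels.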
tilacog (Contributor, Author) commented May 8, 2026

Is there a reason you are not using the PR template for the description?

Apologies, I was in a rush and didn't account for that. I've updated the description to match the template.

I agree with the change; however, I'd like to suggest that we interpret "superseded" events as successes with regard to how you envision changing the metric. A superseded submission should be considered a successful one.

This way we receive N (# of mempools) events in the happy case and N events in the failure case, allowing us to keep our alert metric as a ratio of successful to failed ones (otherwise failed events would be weighted N times more than successful ones).

Agree. I've adjusted the suggested metric.

@tilacog tilacog marked this pull request as ready for review May 8, 2026 19:33
@tilacog tilacog requested a review from a team as a code owner May 8, 2026 19:33
gemini-code-assist (Bot) left a comment


Code Review

This pull request refactors the mempool execution logic to use FuturesUnordered, enabling more detailed tracking of success, failure, and superseded states. It also adds a blocks_passed method to the Error enum for improved block-level timing metrics. A high-severity logic error was identified where disabled mempools are incorrectly reported as 'Superseded' if another mempool wins the race, which would artificially inflate success rate metrics. A correction was suggested to preserve the 'Disabled' status during the racing process.

Comment thread crates/driver/src/domain/mempools.rs Outdated
@tilacog tilacog enabled auto-merge May 11, 2026 11:45
Eight more comment threads on crates/driver/src/domain/mempools.rs (all but one marked outdated)
tilacog and others added 7 commits May 11, 2026 09:59
Disabled is a configuration skip, not a submission failure. Split it into
its own observer so failure-rate metrics aren't polluted.
Co-authored-by: José Duarte <15343819+jmg-duarte@users.noreply.github.com>
@tilacog tilacog requested a review from MartinquaXD May 11, 2026 17:06
Two comment threads on crates/driver/src/infra/observe/mod.rs (outdated)
@tilacog tilacog added this pull request to the merge queue May 12, 2026
@jmg-duarte jmg-duarte removed this pull request from the merge queue due to a manual request May 12, 2026
MartinquaXD (Contributor) left a comment


The idea to fix the metrics makes sense to me but the logic seems more complicated than necessary.

Wouldn't it also work to construct a hashmap mapping all mempools to a submission outcome?

  1. initialize all mempools as superseded
  2. run new FuturesUnordered logic
  3. match on each submit result and update the hashmap accordingly

At the end, all submission futures that finished will have updated their own entry in the mapping, and all futures that didn't finish were cancelled because some other pool succeeded first, so the original superseded label is correct.

tilacog (Contributor, Author) commented May 12, 2026

I agree this got more complex than it should.

At the end, all submission futures that finished will have updated their own entry in the mapping, and all futures that didn't finish were cancelled because some other pool succeeded first, so the original superseded label is correct.

If I'm reading this right, I think this arrangement still leaves us with a possible final state of [winner(1), errored(N), superseded(M)] (out of [1+N+M] mempools).

On the hashmap point, I don't think it changes the design much. We still need to iterate over all mempools to set the disabled ones, then iterate by value to overwrite the errored ones with superseded on a success. That's basically the current design, just with a different container.

I'll try to simplify the current code a bit.

tilacog added 3 commits May 12, 2026 12:42
Helper added little abstraction value — only one caller, short body.
Inlining keeps the disabled-filter rationale next to the race loop it
guards.
Comment thread crates/driver/src/infra/mempool/mod.rs Outdated
Comment on lines +85 to +90
impl PartialEq for Mempool {
fn eq(&self, other: &Self) -> bool {
self.config == other.config
}
}

tilacog (Contributor, Author) commented May 12, 2026

I hope this is correct/appropriate, because deriving PartialEq on Mempool simplifies the filtering in the race post-processing.

Config carries several fields, most importantly the Mempool's Url and Name.

@tilacog tilacog added this pull request to the merge queue May 12, 2026
@tilacog tilacog removed this pull request from the merge queue due to a manual request May 12, 2026
MartinquaXD (Contributor) commented

We still need to iterate over all mempools to set the disabled ones, then iterate by value to overwrite the errored ones with superseded on a success.

I think the difference is that the current approach spreads around the responsibilities of managing the metrics too much. With the approach suggested below there are 3 steps:

  1. initialize counters
  2. poll futures and update counters based on the results
  3. interpret stats and update metrics

There is no need to pre-emptively partition out the disabled pools (also no need for a debug assert), no need to collect the errors into a vector to return the last one, and the logic of interpreting the results (interpreting errors as superseded when there is a success result) happens inside a single function where the reasoning can be documented nicely.

        // initialize all pools with superseded as that is the correct state when we
        // don't get to update the pools when one succeeds.
        let mut stats = self.mempools.iter().map(|mem| (&mem, Superseded)).collect();

        let (submission, _remaining_futures) = select_ok(self.mempools.iter().map(|mempool| {
            async move {
                let result = self
                    .submit(mempool, settlement, submission_deadline, mode)
                    .instrument(tracing::info_span!("mempool", kind = mempool.to_string()))
                    .await;
                match &result {
                    Ok(_) => stats[&mempool] = Success,
                    Err(Disabled) => stats[&mempool] = Disabled,
                    Err(_) => stats[&mempool] = Error,
                }
                result
            }
            .boxed()
        }))
        .await?;

        // if stats.values().any(|val| val == Success) we mark every error as superseded
        // otherwise we record the given label
        self.update_metrics(stats);

        Ok(submission.tx_hash)

tilacog added 6 commits May 12, 2026 19:38
We established this during the review; this fixes that small regression.
Replace the pre-race disabled filter with an Outcome enum that records
per-mempool state (Pending/Success/Failed/Disabled) as futures settle.
update_metrics observes per-mempool labels after the race, and
reconstruct_result collapses the stats into the caller's Result —
first Success in config order wins, else first non-Disabled error,
else Error::Disabled. This makes the surfaced error deterministic and
keeps Disabled out of the Superseded bucket.
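A minimal sketch of the Outcome bookkeeping this commit describes, assuming illustrative type and function names (the real code observes labels via update_metrics rather than returning them):

```rust
// Assumed shapes for illustration; the real types live in
// crates/driver/src/domain/mempools.rs.
#[derive(Debug, PartialEq)]
enum Outcome {
    Pending,  // future never settled: cancelled once another mempool won
    Success,
    Failed,
    Disabled, // configuration skip, not a submission failure
}

#[derive(Debug, PartialEq)]
enum MetricLabel {
    Success,
    Superseded,
    Failed,
    Disabled,
}

/// Interpret per-mempool outcomes after the race. If any mempool won,
/// Pending and Failed entries become Superseded; Disabled is preserved
/// so it never lands in the Superseded bucket.
fn labels(outcomes: &[Outcome]) -> Vec<MetricLabel> {
    let won = outcomes.iter().any(|o| *o == Outcome::Success);
    outcomes
        .iter()
        .map(|o| match o {
            Outcome::Success => MetricLabel::Success,
            Outcome::Disabled => MetricLabel::Disabled,
            Outcome::Pending | Outcome::Failed if won => MetricLabel::Superseded,
            Outcome::Pending | Outcome::Failed => MetricLabel::Failed,
        })
        .collect()
}

fn main() {
    // One winner: the failed and still-pending mempools are Superseded.
    assert_eq!(
        labels(&[Outcome::Success, Outcome::Failed, Outcome::Pending]),
        vec![MetricLabel::Success, MetricLabel::Superseded, MetricLabel::Superseded]
    );
    // No winner: Disabled stays Disabled, failures keep their own label.
    assert_eq!(
        labels(&[Outcome::Failed, Outcome::Disabled]),
        vec![MetricLabel::Failed, MetricLabel::Disabled]
    );
}
```

Initializing every entry as Pending and letting each future overwrite its own slot is what makes cancelled futures end up Superseded without any explicit bookkeeping for them.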
tilacog (Contributor, Author) commented May 15, 2026

Ok, here's what I landed on after addressing @MartinquaXD's comments:

I didn't use a DashMap or HashMap because hashing Mempool isn't straightforward (due to a float field in Config), so I kept a vector of (Mempool, Outcome) pairs instead and handled them by position.

I had to use FuturesUnordered instead of select_ok because sending submit's results out of the async closure wasn't possible due to lifetime constraints, plus mempool::Error isn't Clone since anyhow::Error isn't either... and I didn't want to add more complexity there.

So I followed the suggestions as closely as I could:

I used an Outcome enum to capture (and own) errors and submission results by consuming the results of the submit calls, processed them in the metrics pipeline, and finally reconstructed the returned result to avoid breaking the contract with execute callers (currently just Competition::process_settle_request). I added a few unit tests for that part since they were easy to write.

I think this is in good shape to merge. If we don't see any correctness issues, I'd recommend merging right away and improving the code later, because I think this will help with our current alerts on high mempool submission errors.

@tilacog tilacog requested review from MartinquaXD and extrawurst May 15, 2026 18:45
tilacog added 3 commits May 15, 2026 15:50
Short-circuits on the first Success instead of carrying it through the
remaining outcomes. Drops the manual_try_fold allow since fold_while
expresses the state machine directly.
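The short-circuiting fold this commit describes can be sketched with std's try_fold and ControlFlow standing in for itertools' fold_while; the Outcome shape and error strings here are illustrative stand-ins:

```rust
use std::ops::ControlFlow;

// Illustrative stand-ins; the real Outcome owns the SubmissionSuccess
// and mempools::Error values.
#[derive(Debug)]
enum Outcome {
    Success(u64), // tx-hash stand-in
    Failed,
    Disabled,
}

/// Collapse per-mempool outcomes (in config order) into the single
/// Result that execute returns: first Success wins, else the first
/// non-Disabled error, else the Disabled error. `Break` short-circuits
/// on the first Success, playing the role of fold_while.
fn reconstruct(outcomes: &[Outcome]) -> Result<u64, &'static str> {
    let folded = outcomes.iter().try_fold(
        Err("disabled"),
        |acc: Result<u64, &'static str>, o| match o {
            Outcome::Success(tx) => ControlFlow::Break(Ok(*tx)),
            // Keep the first real error; all errors are the same string
            // here, while the real code carries distinct causes.
            Outcome::Failed => ControlFlow::Continue(match acc {
                Err("disabled") => Err("failed"),
                other => other,
            }),
            Outcome::Disabled => ControlFlow::Continue(acc),
        },
    );
    match folded {
        ControlFlow::Break(result) | ControlFlow::Continue(result) => result,
    }
}

fn main() {
    assert_eq!(reconstruct(&[Outcome::Failed, Outcome::Success(7)]), Ok(7));
    assert_eq!(reconstruct(&[Outcome::Disabled, Outcome::Failed]), Err("failed"));
    assert_eq!(reconstruct(&[Outcome::Disabled]), Err("disabled"));
}
```

Because the outcomes vector is in config order, folding it left-to-right makes the surfaced result deterministic even though FuturesUnordered completes futures out of order.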