Skip to content

feat(gateway): add DB-backed resource_version CAS for stored objects #1255

@johntmyers

Description

@johntmyers

Problem Statement

OpenShell's generic object store currently persists top-level objects by rewriting the full protobuf payload on put_message, with the database only enforcing uniqueness on id and (object_type, name). That works for a single writer, but it is not safe once multiple gateway instances can mutate the same stored object.

PR #1242 adds sandbox provider attach/detach and currently relies on whole-object sandbox mutation. A single-gateway mutex can reduce same-process races, but it does not protect HA writers, compute watch/reconcile updates, or other gateway code paths that concurrently rewrite the same sandbox row.

The result is classic lost-update behavior: the last full-payload write wins, even if it was based on stale state. We need a DB-backed compare-and-swap/resource-version mechanism at the object-store boundary so correctness does not depend on in-process locks.

Technical Context

What the spike found

  • The gateway persistence layer stores opaque protobuf payloads with a narrow indexed schema.
  • Generic object writes go through unconditional overwrite helpers:
    • Store::put(...) in crates/openshell-server/src/persistence/mod.rs
    • Store::put_message(...) in crates/openshell-server/src/persistence/mod.rs
  • Both SQLite and Postgres currently upsert by (object_type, name) and replace payload wholesale.
  • The current behavior preserves the original row id while accepting a new caller-supplied id on name conflict, which is another signal that updates are name-based overwrites rather than identity/version-checked mutations.
  • ObjectMeta comments already mention "resource version", but the proto does not actually contain a field for it today.

Sandbox mutation paths exposed to lost updates

Location Current behavior
crates/openshell-server/src/grpc/sandbox.rs Attach/detach provider does get sandbox by name -> mutate providers -> put_message.
crates/openshell-server/src/compute/mod.rs Compute watch/session/delete paths load a sandbox record, mutate status/phase/lifecycle data, and rewrite the object.
crates/openshell-server/src/grpc/policy.rs Policy backfill and policy status reporting mutate sandbox fields via full-object writes.
crates/openshell-server/src/grpc/provider.rs Provider update is a read/merge/write, while provider delete scans sandboxes then deletes the provider.
Settings persistence Currently relies on a process-local mutex, which is not an HA-safe boundary.

Existing patterns to reuse

  • Policy revisions already use a DB-enforced optimistic-concurrency pattern based on a unique (object_type, scope, version) key and bounded retries.
  • The codebase already contains tests showing that read/modify/write without a lock can lose writes.
  • Process-local locks can stay temporarily to reduce retries, but correctness needs to move to the database.

Proposed Design

Add an authoritative resource_version to every generic object row and require mutating stored objects through compare-and-swap writes keyed by stable object identity (id) plus expected resource_version.

This should be a store-level correctness primitive, not a sandbox-specific lock.

Core design

  • Add resource_version to the objects table in both SQLite and Postgres.
  • Treat the DB row's resource_version as authoritative.
  • On insert, set resource_version = 1.
  • On update, require WHERE id = ? AND resource_version = ?, and atomically bump to the next version.
  • On delete, support the same WHERE id = ? AND resource_version = ? pattern.
  • Read paths must hydrate decoded objects with the DB row's resource_version, not trust whatever is currently embedded in the stored payload.
  • Update paths must stop using name-based upsert for mutable top-level objects. Resolve by name if needed, then persist by stable id.

Scope boundary

This issue should implement single-object CAS and migrate the high-risk sandbox-related mutation paths first. It should not try to redesign the entire persistence model or normalize every resource into dedicated tables.

Policy revision storage should stay append-only as-is. It already has the right optimistic-concurrency pattern for versioned policy rows.

API Shape

Proto / object metadata

Add an actual resource-version field to shared object metadata.

message ObjectMeta {
  string id = 1;
  string name = 2;
  int64 created_at_ms = 3;
  map<string, string> labels = 4;
  uint64 resource_version = 5;
}

Notes:

  • This is wire-compatible for existing clients.
  • Existing comments already promise resource-version semantics, so this makes the schema match the intent.
  • If we want Kubernetes-style opacity later, we can still treat the number as an opaque token at the API boundary.

Persistence API

Add explicit CAS primitives instead of a generic mutation DSL.

pub enum WriteCondition {
    MustCreate,
    MatchResourceVersion(u64),
    Unconditional,
}

pub struct WriteResult {
    pub resource_version: u64,
    pub created_at_ms: i64,
    pub updated_at_ms: i64,
}

pub enum PersistenceError {
    // existing variants...
    Conflict {
        current_resource_version: Option<u64>,
    },
}

Suggested store methods:

pub async fn put_if(
    &self,
    object_type: &str,
    id: &str,
    name: &str,
    payload: &[u8],
    labels: Option<&str>,
    condition: WriteCondition,
) -> PersistenceResult<WriteResult>;

pub async fn delete_if(
    &self,
    object_type: &str,
    id: &str,
    expected_resource_version: u64,
) -> PersistenceResult<bool>;

Typed wrappers should provide the same semantics for protobuf objects, and typed reads should hydrate metadata.resource_version from the DB row.

Pragmatic constraint:

  • Keep put / put_message temporarily for bootstrap and legacy paths.
  • New mutable paths should use CAS helpers.
  • Once migrated, deprecate unconditional overwrite helpers for top-level mutable objects.

SQLite / Postgres Implementation Details

Schema

Add a new non-null column in both backends:

  • SQLite: resource_version INTEGER NOT NULL DEFAULT 1
  • Postgres: resource_version BIGINT NOT NULL DEFAULT 1

No new secondary index should be necessary. Updates will use the primary key id, plus a version check in the predicate.

Insert path

For MustCreate:

  • Insert the row with resource_version = 1.
  • Preserve current unique constraints:
    • id primary key
    • (object_type, name) unique when name IS NOT NULL

Update path

For MatchResourceVersion(v):

  • Update by object_type + id + expected resource_version.
  • Set payload, labels, updated_at_ms.
  • Set resource_version = resource_version + 1.
  • Use RETURNING resource_version, created_at_ms, updated_at_ms where supported to avoid a second round trip on success.

If the update affects zero rows:

  • Follow up with a cheap existence check by object_type + id.
  • Return NotFound if the row is gone.
  • Return Conflict { current_resource_version } if the row still exists but the version changed.

Delete path

For CAS deletes:

  • DELETE ... WHERE object_type = ? AND id = ? AND resource_version = ?.
  • On zero rows, distinguish not-found from conflict the same way as updates.

Read path

Extend ObjectRecord with resource_version.

When decoding a protobuf object:

  • Decode the payload.
  • Override or populate metadata.resource_version from the DB row before returning it.

This matters because older payloads will not contain the field even after the schema migration.

Migration / Backcompat Needs

Database migration

Add a migration for both SQLite and Postgres:

  • Add the column.
  • Backfill existing rows to 1.

No payload rewrite is required.

Proto / client compatibility

  • Adding metadata.resource_version is backward-compatible in protobuf.
  • Older clients will ignore the field.
  • Newer clients will start seeing it in Get* / List* responses without needing request changes.

Internal compatibility

  • Existing persisted payloads remain readable.
  • Existing write call sites can keep compiling while we migrate them incrementally.
  • get_message* should populate resource_version from the row so callers do not need to know whether a payload predates the migration.

Mixed-version rollout caveat

Old binaries will still perform unconditional full-row overwrites without version checks. Because of that:

  • Do not enable HA writers during a mixed-version rollout.
  • Keep the deployment in single-writer mode until every gateway instance is on the CAS-capable build.
  • Call this out explicitly in rollout notes.

Retry Semantics

The DB layer should not silently retry stale writes. A stale write must surface as a typed conflict so the caller can re-read and re-apply its semantic merge.

Recommended behavior by call site:

  • Sandbox attach/detach:
    • Bounded retry, re-read latest sandbox by name, re-run add/remove/dedupe, retry CAS.
    • Safe because desired state is set-like and idempotent.
  • Compute watch updates / supervisor session state / startup resume:
    • Bounded retry by sandbox id, re-apply derived status/phase changes to the latest row.
    • If retries exhaust, log and rely on the next watch/reconcile event to converge.
  • ReportPolicyStatus sandbox updates:
    • Bounded retry by sandbox id when setting current_policy_version.
  • Settings mutations:
    • Replace correctness reliance on settings_mutex with CAS on the settings object row.
    • The mutex may stay temporarily as a retry-reduction optimization, but not as the only guard.
  • Provider update:
    • Bounded retry, re-read latest provider, re-run merge-map logic, retry CAS.
  • Delete/reconcile sweeps:
    • Use CAS delete where feasible.
    • On conflict, skip and let the next sweep handle the object instead of forcing a delete.

Recommended retry budget:

  • 3 to 5 attempts.
  • tokio::task::yield_now() or small jitter between attempts.
  • Return ABORTED after exhaustion for gRPC handlers that cannot safely continue.

Rollout Plan

  1. Add resource_version to ObjectMeta, ObjectRecord, and both DB schemas.
  2. Add store-level CAS update/delete primitives and typed conflict errors.
  3. Hydrate metadata.resource_version on all typed reads.
  4. Migrate the highest-risk sandbox writers first:
    • provider attach/detach
    • compute watch/reconcile/session status writes
    • policy backfill of spec.policy
    • policy status update of current_policy_version
  5. Migrate provider update and settings save paths onto the same CAS primitives.
  6. Keep any short-term single-gateway lock from PR feat(providers): support sandbox provider attach lifecycle #1242 only as a best-effort local serialization aid, not as the correctness boundary.
  7. Add conflict telemetry and concurrent tests on both SQLite and Postgres.
  8. After all sandbox-related writers are CAS-backed, allow HA gateway writers.
  9. Deprecate unconditional overwrite helpers for mutable top-level objects and migrate remaining call sites in follow-up work.

Definition of Done

  • objects.resource_version exists in both SQLite and Postgres, with migration/backfill coverage.
  • openshell.datamodel.v1.ObjectMeta includes resource_version.
  • ObjectRecord includes resource_version.
  • Store exposes CAS update/delete primitives and returns typed conflict errors.
  • Typed read helpers hydrate metadata.resource_version from the DB row.
  • Sandbox attach/detach no longer relies on unconditional put_message.
  • Compute sandbox status/phase writers no longer rely on unconditional full-row overwrites.
  • Policy-related sandbox mutations (spec.policy backfill and current_policy_version) use CAS.
  • Provider update and settings save paths use CAS.
  • Concurrency tests cover attach/detach, attach/detach vs compute status update, policy status update vs attach/detach, provider update conflicts, settings conflicts, and SQLite/Postgres backends.
  • E2E coverage demonstrates safe sandbox mutation behavior with concurrent writers.
  • Metrics/logging exist for conflict count, retry count, and retry exhaustion.

Notable Risks / Open Questions

  • Single-row CAS does not fully solve cross-object invariants. delete_provider_record() can still race with sandbox attach unless we add a higher-level transactional composition or redesign the attachment relationship.
  • Mixed-version rollouts are unsafe for HA unless explicitly gated to a single writer.
  • put currently updates by (object_type, name) rather than by stable id. Migrated call sites must stop depending on name-based overwrite semantics.
  • Settings persistence currently generates a fresh UUID on every save attempt before the name-based upsert preserves the old row. That helper should be cleaned up as part of the CAS migration so settings rows retain stable identity cleanly.
  • Conflict-heavy paths could become noisy without metrics and bounded retries. We need observability before enabling HA writers.
  • We should decide whether future external mutation RPCs need explicit expected_resource_version request fields. This issue can defer that if gateway-internal retries are sufficient for current APIs.
  • Policy revision rows remain append-only and should not be collapsed into the generic object CAS model.

Scope Assessment

  • Complexity: High
  • Confidence: Medium-High
  • Estimated files to change: 12-18
  • Issue type: feat

Created from a focused principal-engineer spike. No code changes were made as part of the spike.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:gatewayGateway server and control-plane workarea:sandboxSandbox runtime and isolation work

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions