Merged
40 commits
70305e6
change name to be agent agnostic
viktorbeck98 May 4, 2026
9bdb4e2
feat: add PersistencyLoadError and dump/load protocol to EventDataStr…
viktorbeck98 May 4, 2026
2327e4c
test: strengthen abstract contract verification for EventDataStructure
viktorbeck98 May 4, 2026
7bb165d
feat: add to_state/from_state serialization to SingleStabilityTracker
viktorbeck98 May 4, 2026
18990b4
feat: implement dump/load on EventTracker via MessagePack
viktorbeck98 May 4, 2026
a72b03a
docs: document converter_function and event_id limitations in EventTr…
viktorbeck98 May 4, 2026
6aa82f5
feat: implement dump/load on EventDataFrame via Parquet
viktorbeck98 May 4, 2026
b0d9c7e
docs: document event_id/template limitations in EventDataFrame load
viktorbeck98 May 4, 2026
357e7a8
feat: implement dump/load on ChunkedEventDataFrame via Parquet + msgp…
viktorbeck98 May 4, 2026
bec28ee
feat: add _dirty_count counter and reset_dirty_count() to EventPersis…
viktorbeck98 May 4, 2026
b842599
feat: add PersistencySaverConfig and _SaveTimer
viktorbeck98 May 4, 2026
5b227b1
feat: implement PersistencySaver save() and load() with fsspec
viktorbeck98 May 4, 2026
36962a2
fix: collapse _tick dead branch, coerce event ID types on load, impro…
viktorbeck98 May 4, 2026
997acdc
test: add trigger tests for PersistencySaver timer, dirty threshold, …
viktorbeck98 May 4, 2026
c869b68
fix: join timer thread in stop(), rename misleading dirty-threshold test
viktorbeck98 May 4, 2026
dc7a2f6
feat: add context manager protocol to Component for saver cleanup
viktorbeck98 May 4, 2026
0e7f2ac
fix: remove redundant hasattr, add _Stoppable Protocol type for saver
viktorbeck98 May 4, 2026
e07853f
test: add integration tests for full save/load cycle across backends
viktorbeck98 May 4, 2026
3d36643
test: add cell-value assertion to DataFrame integration test
viktorbeck98 May 4, 2026
9d3d5bb
chore: add fsspec, msgpack, pyarrow deps with optional s3/gcs/azure e…
viktorbeck98 May 4, 2026
2ef2b47
adapt gitignore
viktorbeck98 May 6, 2026
03e8162
minor changes
viktorbeck98 May 6, 2026
5c704fd
chore: never commit design docs
viktorbeck98 May 6, 2026
204bd1f
feat: change dirty_threshold default to None in PersistencySaverConfig
viktorbeck98 May 6, 2026
ac2f1fc
feat: add PersistConfig model and persist field to CoreDetectorConfig
viktorbeck98 May 6, 2026
5cd8d39
feat: add _register_persistency() helper to CoreDetector
viktorbeck98 May 6, 2026
7b1a58a
style: fix import ordering in detector.py
viktorbeck98 May 6, 2026
515ee3b
feat: handle persist block in config serialization and suppress spuri…
viktorbeck98 May 6, 2026
7f9cd7e
style: update MissingParamsWarning message to include global and persist
viktorbeck98 May 6, 2026
8bbc6df
feat: wire _register_persistency() into NewValueDetector, NewValueCom…
viktorbeck98 May 6, 2026
9b47ffd
fix: preserve persist config across set_configuration() rebuild
viktorbeck98 May 6, 2026
b893e98
fix: filter non-serializable event_data_kwargs and make stop() idempo…
viktorbeck98 May 6, 2026
1049ab0
refactor: rename dirty counter to events_since_save / events_until_save
viktorbeck98 May 6, 2026
bceb931
docs: document persist block, PersistencySaver API, and storage optio…
viktorbeck98 May 6, 2026
a58d016
docs: add persist block and detector wiring guidance to AGENTS.md
viktorbeck98 May 6, 2026
f80041e
feat: implement events_until_save trigger in PersistencySaver
viktorbeck98 May 6, 2026
57830a2
chore: merge development into feat/save_persistency
viktorbeck98 May 6, 2026
353b282
refactor: drop explicit pyarrow API in EventDataFrame
viktorbeck98 May 12, 2026
8534b8c
docs: explain msgpack config header layout in ChunkedEventDataFrame.dump
viktorbeck98 May 12, 2026
aa19b10
refactor: treat persistency as a sub-library and slim CoreDetector
viktorbeck98 May 15, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -202,3 +202,6 @@ test.py

# claude code
CLAUDE.md
docs/superpowers/
docs/design/
.claude/
41 changes: 40 additions & 1 deletion CLAUDE.md → AGENTS.md
@@ -1,4 +1,4 @@
# CLAUDE.md
# AGENTS.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

@@ -70,6 +70,12 @@ detectors:
  MyDetector:
    method_type: new_value_detector
    auto_config: false          # true = auto-discover variables from training data
    persist:                    # optional; omit to disable state saving
      path: ./state             # base path; detector name is appended automatically
      interval_seconds: 300     # save every N seconds
      events_until_save: null   # also save after N ingested events (null = disabled)
      auto_load: false          # restore saved state on construction
      storage_options: {}       # fsspec credentials (S3, Azure, GCS, etc.)
    events:
      login_failure:            # named event ID (string) or integer EventID
        instance_label:         # arbitrary instance name
@@ -149,6 +155,33 @@ class MyDetector(CoreDetector):

Same pattern applies for `CoreParser` — implement `parse(input_: LogSchema, output_: ParserSchema) -> bool`.

### Wiring persist support into a new detector

Detectors that maintain an `EventPersistency` instance must do two things to support the `persist:` config block:

**1. Call `_register_persistency()` at the end of `__init__`:**

```python
def __init__(self, name="MyDetector", config=MyDetectorConfig()):
    super().__init__(name=name, config=config)
    self.persistency = EventPersistency(event_data_class=EventStabilityTracker)
    self._register_persistency(self.persistency)  # must be last
```

**2. Preserve `config.persist` across `set_configuration()` rebuilds:**

`set_configuration()` replaces `self.config` via `from_dict()`, which produces a config with no `persist` key — silently dropping the user's persist settings. Save and restore it:

```python
def set_configuration(self) -> None:
    old_persist = self.config.persist
    # ... build config_dict, call from_dict() ...
    self.config = MyDetectorConfig.from_dict(config_dict, self.name)
    self.config.persist = old_persist
```

Omitting either step means a `persist:` block in the YAML is silently ignored with no error.
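
A minimal sketch of why step 2 matters, using stand-in classes only. `ToyConfig` and `ToyDetector` are hypothetical and not part of the library; the only behaviour borrowed from the text above is that `from_dict()` returns a config without persist settings.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a detector config (not the library class)."""

    events: dict[str, Any]
    persist: Optional[dict[str, Any]] = None

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ToyConfig":
        # Assumption taken from the text above: the rebuilt config
        # carries no persist settings.
        return cls(events=dict(data.get("events", {})))


class ToyDetector:
    def __init__(self, config: ToyConfig) -> None:
        self.config = config

    def set_configuration_lossy(self, config_dict: dict[str, Any]) -> None:
        # Rebuild only: persist is dropped without any error.
        self.config = ToyConfig.from_dict(config_dict)

    def set_configuration(self, config_dict: dict[str, Any]) -> None:
        # Save and restore persist around the rebuild.
        old_persist = self.config.persist
        self.config = ToyConfig.from_dict(config_dict)
        self.config.persist = old_persist


persist = {"path": "./state", "interval_seconds": 300}
new_events = {"events": {"login_failure": {}}}

lossy = ToyDetector(ToyConfig(events={}, persist=persist))
lossy.set_configuration_lossy(new_events)

safe = ToyDetector(ToyConfig(events={}, persist=persist))
safe.set_configuration(new_events)
```

In this toy model `lossy.config.persist` ends up as `None`, while `safe.config.persist` still holds the original settings.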

## Code Quality

Pre-commit hooks enforce:
@@ -158,3 +191,9 @@ Pre-commit hooks enforce:
- **docformatter** docstring style

Python 3.12 is required (see `.python-version`).


# Git
NEVER include "Co-Authored-By ..." in your commit or PR messages.

Design documents (files under `docs/design/`) must NEVER be committed to the repository.
230 changes: 150 additions & 80 deletions docs/auxiliar/persistency.md
Contributor

Let's leave it like this for now, but I think this documentation can be a little hard to follow if you don't know how the persistency works.

@@ -1,127 +1,197 @@
# Persistency

The persistency module provides event-based state management for detectors. It allows detectors to accumulate, store, and query data across their lifecycle — during training, detection, and auto-configuration.
The persistency module gives a detector a place to remember things about the
events it sees. State is keyed by `EventID` and survives across training, the
detection loop, and (optionally) restarts on disk.

## EventPersistency
This page is structured to read top-to-bottom: first the mental model, then a
quick start, then the API surface.

`EventPersistency` is the main entry point. It manages one storage backend instance per event ID, so each event type maintains its own isolated state.
## Mental model

### Creating an instance
Persistency has three moving parts. Understanding what each one does makes the
rest of the page much easier to follow.

```python
from detectmatelibrary.common.persistency import EventPersistency
### 1. Events

persistency = EventPersistency(
    event_data_class=MyBackend,           # storage backend class (see below)
    variable_blacklist=["Content"],       # variable names to exclude (optional)
    event_data_kwargs={"max_rows": 1000}  # extra kwargs forwarded to the backend (optional)
)
```
Logs are grouped by `EventID`. Two events with the same ID share a template
but have their own variable values. Persistency stores **one independent state
object per event ID**, so an `EventStabilityTracker` for `EventID=4733` does
not interfere with one for `EventID=4624`.

| Parameter | Description |
|---|---|
| `event_data_class` | An `EventDataStructure` subclass that defines how data is stored and queried. |
| `variable_blacklist` | Variable names to exclude from storage. Defaults to `["Content"]`. |
| `event_data_kwargs` | A dictionary of keyword arguments forwarded to the backend constructor. |
### 2. Backends (`EventDataStructure`)

A backend is the thing that actually stores the per-event state. Persistency
owns the dict `{event_id: backend}`; the backend itself decides *how* data is
kept.

Two families ship today:

- **DataFrame backends** (`EventDataFrame`, `ChunkedEventDataFrame`) keep the
raw rows. Use these when a detector needs to scan history.
- **Tracker backends** (`EventStabilityTracker`) keep only derived features
(e.g. "this variable has been constant for the last 10k events"). Use these
when you only need a summary, not the raw history — they cost a fraction of
the memory.

All backends implement the same four-method contract: `add_data`, `get_data`,
`dump`, `load`. That contract is what `EventPersistency` and
`PersistencySaver` rely on — anything you add later only has to follow it.
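
To make the four-method contract concrete, here is a toy backend that follows it. This is a sketch under stated assumptions: `ValueCountBackend` is hypothetical, the method signatures are guesses rather than the library's `EventDataStructure` interface, and JSON stands in for the MessagePack and Parquet formats the real backends use.

```python
import json
from collections import Counter
from typing import Any


class ValueCountBackend:
    """Toy backend; signatures are assumptions, not the library's API."""

    def __init__(self) -> None:
        self._counts: dict[str, Counter] = {}

    def add_data(self, named_variables: dict[str, Any]) -> None:
        # Keep only a derived feature: how often each value was seen.
        for name, value in named_variables.items():
            self._counts.setdefault(name, Counter())[str(value)] += 1

    def get_data(self) -> dict[str, dict[str, int]]:
        return {name: dict(counts) for name, counts in self._counts.items()}

    def dump(self) -> bytes:
        # JSON is a stand-in serialization format for this sketch.
        return json.dumps(self.get_data(), sort_keys=True).encode("utf-8")

    @classmethod
    def load(cls, payload: bytes) -> "ValueCountBackend":
        backend = cls()
        for name, counts in json.loads(payload.decode("utf-8")).items():
            backend._counts[name] = Counter(counts)
        return backend


# One independent backend per event ID, as described under "Events".
backends: dict[str, ValueCountBackend] = {}
ingested = [
    ("4624", {"LogonType": "3"}),
    ("4624", {"LogonType": "3"}),
    ("4733", {"Group": "admins"}),
]
for event_id, variables in ingested:
    backends.setdefault(event_id, ValueCountBackend()).add_data(variables)

# Round trip: dump one backend and rebuild it from the bytes.
restored = ValueCountBackend.load(backends["4624"].dump())
```

State for `"4624"` and `"4733"` never mixes, and the round trip through `dump` and `load` reproduces the same data.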

### 3. Saver lifecycle (`PersistencySaver`)

### Storing data
`EventPersistency` itself is in-memory. To survive a process restart, the
state has to be written somewhere. `PersistencySaver` wraps an
`EventPersistency` and:

- writes to disk (or any `fsspec` URI) on two triggers — a wall-clock interval
and an event-count threshold;
- optionally `auto_load`s previously saved state during construction;
- exposes `start()` / `stop()` so the background timer can be torn down
cleanly. `stop()` is idempotent and is called automatically when a
`Component` is used as a context manager.

In practice a detector never instantiates `PersistencySaver` directly: it sets
a `persist:` block in its config and `CoreDetector` wires the saver up via
[`init_persistency`](../../src/detectmatelibrary/common/persist.py).
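
The lifecycle described above (background timer, idempotent `stop()`, cleanup through a context manager) can be sketched with the standard library alone. `ToySaver` and `ToyComponent` are hypothetical stand-ins and make no claim about how `PersistencySaver` or `Component` are implemented.

```python
import threading
from typing import Optional


class ToySaver:
    """Hypothetical stand-in; not the library's PersistencySaver."""

    def __init__(self, interval_seconds: float) -> None:
        self._interval = interval_seconds
        self._stop_event = threading.Event()
        self._lock = threading.Lock()
        self._thread: Optional[threading.Thread] = None
        self._stopped = False
        self.saves = 0

    def save(self) -> None:
        with self._lock:  # serialize timer saves and manual saves
            self.saves += 1

    def _run(self) -> None:
        # wait() returns False on timeout (time to save), True once stopped.
        while not self._stop_event.wait(self._interval):
            self.save()

    def start(self) -> None:
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        if self._stopped:  # second and later calls are no-ops
            return
        self._stopped = True
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()
        self.save()  # final flush


class ToyComponent:
    """Hypothetical component that stops its saver on exit."""

    def __init__(self, saver: Optional[ToySaver] = None) -> None:
        self.saver = saver

    def __enter__(self) -> "ToyComponent":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if self.saver is not None:
            self.saver.stop()
        return False  # never swallow exceptions


saver = ToySaver(interval_seconds=3600)
saver.start()
with ToyComponent(saver):
    pass  # detector work would happen here
saver.stop()  # already stopped by __exit__, so this does nothing
```

Because the interval is an hour, the only save in this run is the final flush, and the extra `stop()` call does not trigger a second one.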

---

## Quick start

```python
persistency.ingest_event(
    event_id=event_id,
    event_template=template,
    variables=positional_vars,  # optional positional variables
    named_variables=named_vars  # optional named variables
from detectmatelibrary.utils import persistency

ep = persistency.EventPersistency(
    event_data_class=persistency.EventStabilityTracker,
)

ep.ingest_event(
    event_id="4624",
    event_template="An account was successfully logged on.",
    named_variables={"AccountName": "alice", "LogonType": "3"},
)

tracker = ep.get_event_data("4624") # or ep["4624"]
```

Each call appends data to the backend associated with the given `event_id`. If no backend exists for that ID yet, one is created automatically.
That snippet covers the whole in-memory API: pick a backend class, ingest
events, query state.

### Retrieving data
---

```python
# Single event
data = persistency.get_event_data(event_id)
## API reference

# All events
all_data = persistency.get_events_data() # dict[event_id -> backend]
### `EventPersistency`

# Templates
template = persistency.get_event_template(event_id)
all_templates = persistency.get_event_templates()
| Parameter | Description |
|---|---|
| `event_data_class` | An `EventDataStructure` subclass; one instance is created per event ID. |
| `variable_blacklist` | Variable names to skip when ingesting. Defaults to `["Content"]`. |
| `event_data_kwargs` | Extra kwargs forwarded to each backend instance. |

# Bracket access
backend = persistency[event_id]
```
Common methods:

## Storage backends
```python
ep.ingest_event(event_id, event_template, variables=..., named_variables=...)

ep.get_event_data(event_id) # backend for a single event
ep.get_events_data() # dict[event_id -> backend]
ep.get_event_template(event_id)
ep.get_event_templates()
ep.get_events_seen() # all event IDs ever ingested
ep[event_id] # alias for get_event_data
```

The backend determines how ingested data is stored and what queries are available. Choose the backend that fits your detector's needs.
### Available backends

### DataFrame backends
| Class | Use when |
|---|---|
| `persistency.EventDataFrame` | You need history and a Pandas DataFrame is the natural shape. |
| `persistency.ChunkedEventDataFrame` | High-volume / streaming workloads — Polars-backed with row-retention and automatic compaction. |
| `persistency.EventStabilityTracker` | You only care about how variables behave over time (`STATIC` / `STABLE` / `UNSTABLE` / `RANDOM`). Cheapest memory footprint. |

Store raw event data in tabular form. Useful when a detector needs to query or iterate over historical values.
All three are re-exported from the top of the package — `persistency.X` is the
canonical import; the deeply nested submodules are an implementation detail.

- **`EventDataFrame`** — Pandas-backed storage. Simple and familiar.
- **`ChunkedEventDataFrame`** — Polars-backed storage with configurable row retention and automatic compaction. Suited for high-volume or streaming workloads.
### Persisting to disk

```python
from detectmatelibrary.common.persistency.event_data_structures.dataframes import (
    EventDataFrame,
    ChunkedEventDataFrame,
saver = persistency.PersistencySaver(
    ep,
    persistency.PersistencySaverConfig(
        path="./state/my-detector",
        save_interval_seconds=300,
        events_until_save=10_000,  # save after this many ingests, too
        auto_load=False,
        storage_options={},  # forwarded to fsspec
    ),
)
saver.start()
# ... detector runs ...
saver.stop() # final flush, stops the background timer
```

### Tracker backends

Track variable behavior over time rather than storing raw data. Useful when a detector needs to understand how variables evolve (e.g., whether they converge to constant values). Is optimized for space efficiency since only extracted features from the logs are stored.
`PersistencySaver.save()` is thread-safe, and `stop()` is idempotent. The two
save triggers (`save_interval_seconds` and `events_until_save`) are
independent — whichever fires first wins.
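
A rough illustration of two independent save triggers where whichever fires first wins. `ToySaveTrigger` is hypothetical: it polls an injected clock instead of running a background timer, and resetting both counters after every save is an assumption, not documented `PersistencySaver` behaviour.

```python
import time
from typing import Callable, Optional


class ToySaveTrigger:
    """Hypothetical illustration; not PersistencySaver's implementation."""

    def __init__(
        self,
        save: Callable[[], None],
        interval_seconds: Optional[float],
        events_until_save: Optional[int],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._save = save
        self._interval = interval_seconds
        self._threshold = events_until_save
        self._clock = clock
        self._last_save = clock()
        self._events_since_save = 0

    def on_event(self) -> None:
        self._events_since_save += 1
        self.poll()

    def poll(self) -> None:
        now = self._clock()
        interval_due = (
            self._interval is not None
            and now - self._last_save >= self._interval
        )
        count_due = (
            self._threshold is not None
            and self._events_since_save >= self._threshold
        )
        if interval_due or count_due:
            self._save()
            # Assumption: a save resets both triggers.
            self._last_save = now
            self._events_since_save = 0


fake_now = [0.0]  # injected clock, so the example runs without sleeping
saves: list[float] = []
trigger = ToySaveTrigger(
    save=lambda: saves.append(fake_now[0]),
    interval_seconds=300,
    events_until_save=3,
    clock=lambda: fake_now[0],
)

for _ in range(3):
    trigger.on_event()  # the third event reaches the count threshold

fake_now[0] = 300.0
trigger.poll()  # no new events, but the interval has elapsed
```

The first save comes from the event count at `t=0`, the second from the interval at `t=300`. Passing `None` for either setting disables that trigger in this sketch.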

- **`EventStabilityTracker`** — Classifies each variable as `STATIC`, `STABLE`, `UNSTABLE`, `RANDOM`, or `INSUFFICIENT_DATA` based on how its values change over time.
#### Restoring state

```python
from detectmatelibrary.common.persistency.event_data_structures.trackers import (
    EventStabilityTracker,
saver = persistency.PersistencySaver(
    ep,
    persistency.PersistencySaverConfig(path="./state/my-detector", auto_load=True),
)
# ep is now pre-populated from disk
```

## Usage in detectors
If `auto_load=True` and no saved state exists, the constructor raises
`persistency.PersistencyLoadError` immediately — fail-fast rather than
silently starting empty.

Persistency is **optional**. A detector can function without it. When a detector does need to maintain state across events — for example, to learn normal values during training and flag deviations during detection — it can integrate persistency by following this pattern:
### Storage backends (fsspec)

### 1. Initialize in `__init__`
`PersistencySaverConfig.path` accepts any URI fsspec understands: a local path
(`./state`), `s3://bucket/key`, `gs://...`, `az://...`, and so on. Provider
credentials and tuning knobs go in `storage_options`.

Create one or more `EventPersistency` instances with the appropriate backend.
---

```python
class MyDetector(CoreDetector):
    def __init__(self, name="MyDetector", config=MyDetectorConfig()):
        super().__init__(name=name, ...)
        self.persistency = EventPersistency(
            event_data_class=EventStabilityTracker,
        )
```
## Using persistency inside a detector

### 2. Accumulate state in `train()`
The recommended path: declare `persist:` in the detector's config and let
`CoreDetector._register_persistency` build the saver for you. See
[Saving state (persist)](../detectors.md#saving-state-persist) for the config
schema.

During training, ingest each event so the backend builds up its internal state.
In detector code, the pattern is:

```python
def train(self, input_):
    variables = self.get_configured_variables(input_, self.config.events)
    self.persistency.ingest_event(
        event_id=input_["EventID"],
        event_template=input_["template"],
        named_variables=variables,
    )
```
from detectmatelibrary.common.detector import CoreDetector
from detectmatelibrary.utils import persistency

### 3. Query state in `detect()`
class MyDetector(CoreDetector):
    def __init__(self, name="MyDetector", config=MyDetectorConfig()):
        super().__init__(name=name, config=config)
        self.persistency = persistency.EventPersistency(
            event_data_class=persistency.EventStabilityTracker,
        )
        self._register_persistency(self.persistency)

During detection, query the accumulated state to decide whether the incoming event is anomalous.
    def train(self, input_):
        self.persistency.ingest_event(
            event_id=input_["EventID"],
            event_template=input_["template"],
            named_variables={...},
        )

```python
def detect(self, input_, output_):
    for event_id, backend in self.persistency.get_events_data().items():
        stored_data = backend.get_data()
        # compare input_ against stored_data to produce alerts
    def detect(self, input_, output_):
        tracker = self.persistency.get_events_data().get(input_["EventID"])
        # compare against tracker to produce alerts
```

`_register_persistency` is a one-line wrapper around
[`init_persistency`](../../src/detectmatelibrary/common/persist.py); the helper
honours `config.persist` and returns `None` (so `self.saver` stays `None`)
when persistence is disabled.