Skip to content

feat: add SNMP provider#6480

Closed
danichese wants to merge 3 commits into
keephq:mainfrom
danichese:codex/snmp-provider-2112
Closed

feat: add SNMP provider#6480
danichese wants to merge 3 commits into
keephq:mainfrom
danichese:codex/snmp-provider-2112

Conversation

@danichese
Copy link
Copy Markdown

@danichese danichese commented May 14, 2026

Closes #2112

/claim #2112

Summary

  • Adds a webhook-first snmp provider discovered by Keep's existing provider factory conventions.
  • Converts decoded SNMP trap/event JSON into AlertDto objects with normalized host, source, status, severity, timestamps, stable IDs/fingerprints, labels, and structured SNMP details.
  • Supports common payload aliases for host/source, trap OID, generic trap, timestamp, and varbind fields.
  • Preserves named varbinds and OID-based varbinds in labels, plus normalized snmp_varbinds and raw snmp_event fields for debugging and filtering.
  • Adds mock payloads, focused unit tests, provider docs, repository-script generated snippet content, and an externally hosted demo video.

Why webhook-first

SNMP traps usually arrive over UDP, often on privileged port 162. Keeping this provider webhook-first avoids native listener lifecycle, threading, tenancy, container networking, and privileged-port complexity inside Keep. Operators can run standard trap receivers such as snmptrapd, snmptt, Telegraf, or another relay to decode traps and forward JSON to /alerts/event/snmp.

This intentionally does not add pysnmp, SNMP polling, MIB parsing, SNMPv3 credential storage, or a native UDP listener.

Validation

Passed:

wsl.exe bash -lc "cd '/mnt/c/Users/Dan/Documents/New project/keep' && /tmp/keep-snmp-venv/bin/python -m pytest tests/test_snmp_provider.py -q"
# 12 passed, 1 warning

wsl.exe bash -lc "cd '/mnt/c/Users/Dan/Documents/New project/keep' && /tmp/keep-snmp-venv/bin/python -m pytest tests/test_snmp_provider.py::test_provider_discovery_includes_snmp tests/test_snmp_provider.py::test_provider_factory_can_load_snmp -q"
# 2 passed, 1 warning

wsl.exe bash -lc "cd '/mnt/c/Users/Dan/Documents/New project/keep' && /tmp/keep-snmp-venv/bin/python -m ruff check keep/providers/snmp_provider tests/test_snmp_provider.py"
# All checks passed

wsl.exe bash -lc "cd '/mnt/c/Users/Dan/Documents/New project/keep' && /tmp/keep-snmp-venv/bin/python -m black --check keep/providers/snmp_provider tests/test_snmp_provider.py"
# 4 files would be left unchanged

git diff --cached --check
# passed before commit

Docs snippet generation:

wsl.exe bash -lc "cd '/mnt/c/Users/Dan/Documents/New project/keep' && /tmp/keep-snmp-venv/bin/python scripts/docs_render_provider_snippets.py"

The committed snmp-snippet-autogenerated.mdx webhook section was produced by the repository script. The full --validate mode currently stops before completion on the existing main-branch indentation issue in keep/providers/anthropic_provider/anthropic_provider.py around line 93.

Attempted broader provider factory test:

wsl.exe bash -lc "cd '/mnt/c/Users/Dan/Documents/New project/keep' && /tmp/keep-snmp-venv/bin/python -m pytest tests/test_provider_factory.py -q"

That run also fails before reaching SNMP because of the same unrelated existing anthropic_provider indentation issue.

GitHub Actions note: the larger PR workflows (Unit Tests, Integration Tests, Test docs, Test workflow examples, Release JSON Schema, Tests (E2E)) currently show action_required with no jobs, meaning they are awaiting maintainer approval to run for this forked PR. The available checks (Validate PR title, Validate PR to Issue link, GitGuardian, CLA, CodeRabbit skip) pass.

Demo Evidence

Demo video: Watch the MP4 demo

Executable formatter demo used the sample linkDown payload and produced:

{
  "name": "SNMP linkDown on 10.0.0.15",
  "host": "10.0.0.15",
  "source": ["snmp"],
  "status": "firing",
  "severity": "critical",
  "trap_oid": "1.3.6.1.6.3.1.1.5.3",
  "ifDescr": "eth0"
}

Reproduce against a running Keep API:

curl -X POST "$KEEP_API_URL/alerts/event/snmp" \
  -H "Content-Type: application/json" \
  -H "X-API-KEY: $KEEP_WEBHOOK_API_KEY" \
  -d '{
    "version": "2c",
    "source_ip": "10.0.0.15",
    "agent_address": "10.0.0.15",
    "community": "public",
    "trap_oid": "1.3.6.1.6.3.1.1.5.3",
    "generic_trap": "linkDown",
    "timestamp": "2026-05-14T12:00:00Z",
    "varbinds": [
      {
        "oid": "1.3.6.1.2.1.2.2.1.2.1",
        "name": "ifDescr",
        "value": "eth0"
      }
    ]
  }'

Expected result: a source snmp alert for host 10.0.0.15, status firing, severity critical, with trap_oid and ifDescr visible in alert details/labels.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 14, 2026

CLA assistant check
All committers have signed the CLA.

@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. Feature A new feature Provider Providers related issues labels May 14, 2026
@danichese
Copy link
Copy Markdown
Author

Follow-up update based on pre-review cleanup:

  • Regenerated docs/snippets/providers/snmp-snippet-autogenerated.mdx with scripts/docs_render_provider_snippets.py; the SNMP webhook section now comes from the repo generator.
  • Removed the committed demo MP4 from the PR diff and moved it to a public release asset: https://github.com/danichese/keep/releases/download/snmp-provider-demo-6480/snmp-provider-demo.mp4
  • Re-ran focused SNMP tests, isolated provider discovery checks, Ruff, Black, and cached diff whitespace check successfully.
  • Checked PR workflows: the large Actions runs are currently �ction_required with no jobs, awaiting maintainer approval for this forked PR. No SNMP-related CI failure logs are available yet.

Copy link
Copy Markdown

@digzrow-coder digzrow-coder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lifecycle bug in the default fingerprint/id. _stable_id() includes both trap_oid and generic_trap in the hashed payload, while linkDown maps to firing and linkUp maps to resolved. That means the recovery trap gets a different fingerprint from the outage trap:

  • linkDown: trap_oid=1.3.6.1.6.3.1.1.5.3, generic_trap=linkDown
  • linkUp: trap_oid=1.3.6.1.6.3.1.1.5.4, generic_trap=linkUp

Since the provider then sets fingerprint=alert_id, Keep will ingest the linkUp as a separate resolved alert instead of resolving the existing linkDown alert for the same host/interface. The current tests only assert stable ids for identical payloads, so they miss the firing -> resolved pair.

For link state traps the fingerprint needs to be based on the resource identity (host + interface identity from varbinds, e.g. ifIndex/ifDescr/ifName) rather than the transition state. The trap type should drive status/severity, but it should not split the dedup key for the same resource.

@danichese
Copy link
Copy Markdown
Author

There is a lifecycle bug in the default fingerprint/id. _stable_id() includes both trap_oid and generic_trap in the hashed payload, while linkDown maps to firing and linkUp maps to resolved. That means the recovery trap gets a different fingerprint from the outage trap:

  • linkDown: trap_oid=1.3.6.1.6.3.1.1.5.3, generic_trap=linkDown
  • linkUp: trap_oid=1.3.6.1.6.3.1.1.5.4, generic_trap=linkUp

Since the provider then sets fingerprint=alert_id, Keep will ingest the linkUp as a separate resolved alert instead of resolving the existing linkDown alert for the same host/interface. The current tests only assert stable ids for identical payloads, so they miss the firing -> resolved pair.

For link state traps the fingerprint needs to be based on the resource identity (host + interface identity from varbinds, e.g. ifIndex/ifDescr/ifName) rather than the transition state. The trap type should drive status/severity, but it should not split the dedup key for the same resource.

Good spot on the fingerprint issue, you're right. Including generic_trap in the dedup key breaks the firing/resolved lifecycle for link state traps.
Updating _build_fingerprint to key on host and interface identity (ifDescr/ifName/ifIndex from varbinds) and excluding trap_oid and generic_trap from the hash. They'll still drive status and severity as before, just not the dedup key.
Also adding a test for the firing/resolved pair to cover this properly. Will push the fix shortly.

@shahargl
Copy link
Copy Markdown
Member

N/A

@shahargl shahargl closed this May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🙋 Bounty claim Feature A new feature Provider Providers related issues size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[🔌 Provider]: SNMP provider

4 participants