[EXPERIMENTAL] Mimir integration with VictoriaMetrics#1607

Closed
gsanchietti wants to merge 39 commits into main from alertmanager_victoria

Conversation

@gsanchietti
Member

@gsanchietti gsanchietti commented Apr 22, 2026

Raise Mimir alerts using VictoriaMetrics and VictoriaLogs.

Keep netdata with its own alerting for backward compatibility.

TODO:

  • refine list of alerts
  • clean up the PR by removing experimental tools and commits
  • implement a ping latency monitor using Telegraf and update the related API and UI (see https://docs.influxdata.com/telegraf/v1/input-plugins/ping/)
  • implement dashboard charts using VictoriaMetrics: the transmitted ping value in the report is incorrect
  • make VictoriaLogs work or drop it
  • bind VictoriaMetrics and VictoriaLogs to 127.0.0.1, configurable from UCI

@gsanchietti gsanchietti self-assigned this Apr 22, 2026
@gsanchietti gsanchietti changed the title from "Alertmanager victoria" to "[EXPERIMENTAL] Mimir integration with Victoria metrics" Apr 22, 2026
@gsanchietti gsanchietti changed the title from "[EXPERIMENTAL] Mimir integration with Victoria metrics" to "[EXPERIMENTAL] Mimir integration with VictoriaMetrics" Apr 22, 2026
@gsanchietti gsanchietti force-pushed the alertmanager_victoria branch 5 times, most recently from 010e499 to e94f63d April 23, 2026 14:52
Tbaile and others added 15 commits April 30, 2026 11:48
The netdata alert script is inside the configuration directory and is
preserved across upgrades.
This was a leftover from when the mwan alert was moved
from the netdata chart to the mwan hooks.
This is required to keep the alert firing inside
Mimir.
- Add vmalert init script (vmalert.initd) to start/stop vmalert service
- Add vmalert UCI configuration file (vmalert.conf) with datasource settings
- Add comprehensive alert rules for host and hardware monitoring (vmalert-rules/host.yaml)
  - CPU usage alerts (warning at 70%, critical at 85%)
  - Memory usage alerts (warning at 80%, critical at 90%)
  - Disk space alerts (warning at 80%, critical at 90%)
  - Disk inodes alerts
  - System load alerts
  - Network error and drop alerts
  - Process zombie and blocked alerts
- Update Makefile to install vmalert configuration and rules
- Add detailed documentation of vmalert setup and metrics mapping
- Alerts are currently in blackhole mode (evaluated but not forwarded)
- Rules adapted for Telegraf metric names instead of Prometheus names
- Support for Mimir integration when configured via ns-plug

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add telegraf-services Python script that queries ubus to collect the
  running state of all procd-managed services. Outputs JSON for
  telegraf inputs.exec with data_format = json_v2.

- Track known persistent services in /var/run/telegraf-services-known.json
  so services that disappear from the ubus list (stopped or failed to
  start) are still emitted with running=0, keeping VictoriaMetrics alert
  rules effective even when procd removes the instance entirely.

- Add services.conf telegraf input using inputs.exec + json_v2 parser.
  Tags each metric with service, instance, and has_respawn so vmalert
  rules can target only persistent daemons (has_respawn=true).

- Add services.yaml vmalert rule: ServiceDown fires when
  procd_service_running{has_respawn="true"} == 0 for 2 minutes.
  Uses alertgroup=services label (not service=host) so the metric's
  own service label (e.g. nginx, rpcd) is preserved in the alert.

- Add comprehensive telegraf/README.md documenting architecture, all
  metrics, service monitoring design, PromQL query examples, and
  manual test procedures.

- Add scripts/test-service-monitor.sh for end-to-end simulation:
  injects a bad nginx config, verifies metric drops to 0, waits for
  ServiceDown to fire, then restores nginx and confirms alert resolves.

- Update victoria-metrics/README_VMALERT.md with service monitoring
  section and cross-references.

End-to-end verified on 192.168.100.238:
  nginx broken config → procd removes instance from ubus →
  telegraf-services emits running=0 via state file →
  VictoriaMetrics stores metric →
  ServiceDown{service=nginx} fires after 2m →
  nginx restored → alert resolves

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
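
The state-file approach described in the commit above can be sketched as follows. This is a minimal illustration, not the PR's actual telegraf-services script: the function name `collect_service_metrics`, the state-file layout (a JSON list of `service/instance` keys), and the assumed shape of the ubus service list (an `instances` map whose entries carry `running` and an optional `respawn` object) are all assumptions.

```python
import json
from pathlib import Path


def collect_service_metrics(ubus_services, state_path):
    """Merge the live service list with the remembered persistent services.

    ubus_services: dict shaped like the assumed output of the ubus service
    list, e.g. {"nginx": {"instances": {"instance1": {"running": True,
    "respawn": {...}}}}}.
    state_path: hypothetical state file holding known "service/instance" keys.
    """
    state_file = Path(state_path)
    try:
        known = set(json.loads(state_file.read_text()))
    except (FileNotFoundError, json.JSONDecodeError):
        known = set()

    metrics = []
    seen = set()
    for service, data in sorted(ubus_services.items()):
        for instance, info in sorted(data.get("instances", {}).items()):
            key = f"{service}/{instance}"
            seen.add(key)
            has_respawn = bool(info.get("respawn"))
            if has_respawn:
                # Remember persistent daemons so they can be reported later
                # even if procd drops them from the list.
                known.add(key)
            metrics.append({
                "service": service,
                "instance": instance,
                "has_respawn": "true" if has_respawn else "false",
                "running": 1 if info.get("running") else 0,
            })

    # Known persistent services missing from the live list are emitted
    # with running=0 so that an alert rule can still match them.
    for key in sorted(known - seen):
        service, instance = key.split("/", 1)
        metrics.append({
            "service": service,
            "instance": instance,
            "has_respawn": "true",
            "running": 0,
        })

    state_file.write_text(json.dumps(sorted(known)))
    return metrics
```

Only services with a respawn setting are remembered, which mirrors the commit's intent of alerting on persistent daemons only; one-shot services that exit normally never enter the state file.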
- Add telegraf-mwan Python script that reads /var/run/mwan3/iface_state/
  to collect WAN interface connectivity state. Each file in that directory
  is named after an mwan3 interface and contains 'online' or 'offline'.
  No-ops silently when mwan3 is not running (directory absent).

- Add mwan.conf telegraf input using inputs.exec + json_v2 parser,
  producing mwan_interface metrics tagged with interface and status.

- Add mwan.yaml vmalert rule: WANDown fires when
  mwan_interface_online == 0 for 2 minutes. The interface and status
  labels come directly from the metric, so each WAN interface fires its
  own distinct alert.

- Update telegraf/README.md with WAN Monitoring section: architecture,
  metric reference, query examples, alert details, and manual test
  procedure.

- Update Makefile to install telegraf-mwan and mwan.conf.

End-to-end verified on 192.168.100.238:
  wan2 iface_state=offline → mwan_interface_online{interface=wan2}=0 →
  WANDown{interface=wan2, alertgroup=mwan} fired after 2 minutes ✓

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
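
The directory-based collection described in the commit above could look roughly like this. It is a sketch under stated assumptions, not the PR's telegraf-mwan script: the function name `read_mwan_state` and the exact output field names are hypothetical, chosen so that a field named `online` in a measurement named `mwan_interface` would yield the `mwan_interface_online` metric mentioned in the commit.

```python
from pathlib import Path


def read_mwan_state(state_dir="/var/run/mwan3/iface_state"):
    """Return one record per mwan3 interface found in state_dir.

    Each file is named after an interface and contains 'online' or
    'offline'. A missing directory (mwan3 not running) yields an empty list.
    """
    root = Path(state_dir)
    if not root.is_dir():
        return []

    metrics = []
    for entry in sorted(root.iterdir()):
        if not entry.is_file():
            continue
        status = entry.read_text().strip()
        metrics.append({
            "interface": entry.name,
            "status": status,
            # Numeric field so an alert rule can compare against 0.
            "online": 1 if status == "online" else 0,
        })
    return metrics
```

A caller would serialize the returned list with `json.dumps` and print it for the exec input to parse; printing nothing when the list is empty matches the silent no-op behaviour the commit describes.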
Netdata health checks and alert forwarding are now disabled permanently.
Alerts are handled by vmalert (via Telegraf → VictoriaMetrics) instead.

Changes:
- 30_ns-plug_alerts: emptied to disable all health.d rules, set health=no,
  remove health_alarm_notify.conf from system on upgrade. Keep python plugins
  for non-alerting metrics (fping latency, dashboards).
- Deleted: health_alarm_notify.conf, netadata_enable_alerts, netadata_disable_alerts
- ns-plug-alert: removed netdata subcommand and ~140 lines of NETDATA_ALERT_MAP,
  cmd_netdata, _netdata_fire, _netdata_resolve helper functions
- Makefile: removed install of netadata hooks and health_alarm_notify.conf,
  removed /etc/netdata dir creation
- README.md: updated Alerts section to reflect vmalert-based alerting

Persistence: 30_ns-plug_alerts runs at every sysupgrade + fresh install,
ensuring netdata alerting stays disabled across updates.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gsanchietti gsanchietti force-pushed the alertmanager_victoria branch 2 times, most recently from 2caa025 to 4158b17 April 30, 2026 13:44
gsanchietti and others added 2 commits April 30, 2026 15:57
Migrate ping monitoring from netdata's fping plugin to telegraf's native
ping input plugin. This provides better performance, no external
dependencies, and improved system compatibility.

Changes:
- Add ns.telegraf API handler for ping monitor configuration
- Add ns.telegraf.json ACL definition for telegraf-manager role
- Add telegraf.conf.d/ping.conf with native ping plugin configuration
- Remove ns.netdata API handler and ACL (netdata integration)
- Update ns-api Makefile to install new API handler
- Update telegraf Makefile to install ping.conf and add inputs.ping tag

The new API provides the same interface:
- get-configuration: retrieve current ping hosts
- set-hosts: configure hosts to ping

The ping plugin uses the native method (method="native"), which sends ICMP
packets directly without an external ping command and therefore requires the
CAP_NET_RAW capability or root privileges. Metrics are tagged with
influxdb_db="ping-metrics" for proper InfluxDB database routing.

BREAKING CHANGE: ns.netdata API is removed. Clients must migrate to
ns.telegraf API for ping monitor configuration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
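
To make the set-hosts flow in the commit above concrete, here is a hedged sketch of how a handler might render the ping input configuration from a list of hosts. The helper name `render_ping_conf` is hypothetical and the real ping.conf in the PR is not shown here; the `urls` and `method` options and the per-input tags table are standard Telegraf ping plugin settings, and the tag value comes from the commit message.

```python
import json


def render_ping_conf(hosts):
    """Render a Telegraf ping input section for the given hosts (sketch)."""
    # json.dumps yields double-quoted strings, which are valid TOML
    # basic strings for plain hostnames and IP addresses.
    urls = ", ".join(json.dumps(host) for host in hosts)
    return (
        "[[inputs.ping]]\n"
        f"  urls = [{urls}]\n"
        '  method = "native"\n'
        "  [inputs.ping.tags]\n"
        '    influxdb_db = "ping-metrics"\n'
    )
```

For example, `render_ping_conf(["8.8.8.8", "1.1.1.1"])` produces a section whose `urls` line lists both addresses; a real handler would also need to validate the hosts and reload Telegraf, which this sketch leaves out.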
Migrate the following APIs:
- ns.report: latency-and-quality-report
- ns.dashboard: interface-traffic

Replace data from Netdata with data from VictoriaMetrics:
Netdata is now deprecated and will be removed in the future.

Assisted-by:: Copilot:Sonnet4.6
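
As an illustration of what reading report data from VictoriaMetrics involves, the sketch below builds a range query URL for its Prometheus-compatible `/api/v1/query_range` endpoint. The helper name `build_range_query`, the base URL (VictoriaMetrics' default single-node port 8428 on localhost), and the metric name used in the usage note are assumptions; the queries actually used by ns.report and ns.dashboard are not shown in this PR page.

```python
from urllib.parse import urlencode


def build_range_query(base_url, query, start, end, step="60s"):
    """Build a query_range URL for a Prometheus-compatible API (sketch).

    start and end are Unix timestamps in seconds; step is the resolution.
    """
    params = urlencode({
        "query": query,
        "start": start,
        "end": end,
        "step": step,
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"
```

A latency report could then request something like `build_range_query("http://127.0.0.1:8428", 'ping_average_response_ms{url="8.8.8.8"}', start, end)` and fetch the result with `urllib.request.urlopen`, assuming that metric name matches what Telegraf writes.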
@gsanchietti gsanchietti force-pushed the alertmanager_victoria branch from 4158b17 to 15cffa5 April 30, 2026 13:57
@gsanchietti
Member Author

Replaced by #1633

@gsanchietti gsanchietti deleted the alertmanager_victoria branch May 7, 2026 07:02

2 participants