[EXPERIMENTAL] Mimir integration with VictoriaMetrics by gsanchietti · Pull Request #1607 · NethServer/nethsecurity

gsanchietti · 2026-04-22T10:42:13Z

Raise Mimir alerts using VictoriaMetrics and VictoriaLogs.

Keep netdata with its own alerting for backward compatibility.

TODO:

- refine list of alerts
- cleanup the PR by removing experimental tools and commits
- implement ping latency monitor using telegraf, update also related API and UI (see https://docs.influxdata.com/telegraf/v1/input-plugins/ping/)
- implement dashboard charts using VictoriaMetrics: report transmitted ping is incorrect
- make VictoriaLogs work or drop it
- bind victoria metrics and logs to 127.0.0.1, configurable from uci

The netdata alert script is inside the configuration directory, so it is preserved across upgrades

This was a leftover when mwan alert was moved from netdata chart to mwan hooks

This is required to keep the alert firing inside Mimir

- Add vmalert init script (vmalert.initd) to start/stop vmalert service
- Add vmalert UCI configuration file (vmalert.conf) with datasource settings
- Add comprehensive alert rules for host and hardware monitoring (vmalert-rules/host.yaml)
  - CPU usage alerts (warning at 70%, critical at 85%)
  - Memory usage alerts (warning at 80%, critical at 90%)
  - Disk space alerts (warning at 80%, critical at 90%)
  - Disk inodes alerts
  - System load alerts
  - Network error and drop alerts
  - Process zombie and blocked alerts
- Update Makefile to install vmalert configuration and rules
- Add detailed documentation of vmalert setup and metrics mapping
- Alerts are currently in blackhole mode (evaluated but not forwarded)
- Rules adapted for Telegraf metric names instead of Prometheus names
- Support for Mimir integration when configured via ns-plug

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
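The two-level thresholds listed above can be illustrated with a small sketch. This is not the content of host.yaml; the `THRESHOLDS` table and `severity` helper are hypothetical names, and treating the thresholds as inclusive (`>=`) is an assumption, since the commit message only says "warning at" and "critical at".

```python
# Hypothetical illustration of the warning/critical thresholds named in the
# commit message. The real rules live in vmalert-rules/host.yaml.
THRESHOLDS = {
    "cpu": (70, 85),     # warning at 70%, critical at 85%
    "memory": (80, 90),  # warning at 80%, critical at 90%
    "disk": (80, 90),    # warning at 80%, critical at 90%
}


def severity(kind, used_percent):
    """Return "critical", "warning", or None for a usage percentage.

    Assumption: thresholds are inclusive; the actual rule expressions
    may use a strict comparison instead.
    """
    warning, critical = THRESHOLDS[kind]
    if used_percent >= critical:
        return "critical"
    if used_percent >= warning:
        return "warning"
    return None
```

Checking the critical level first means a value above both thresholds is reported once, at the higher severity.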

- Add telegraf-services Python script that queries ubus to collect the running state of all procd-managed services. Outputs JSON for telegraf inputs.exec with data_format = json_v2.
- Track known persistent services in /var/run/telegraf-services-known.json so services that disappear from the ubus list (stopped or failed to start) are still emitted with running=0, keeping VictoriaMetrics alert rules effective even when procd removes the instance entirely.
- Add services.conf telegraf input using inputs.exec + json_v2 parser. Tags each metric with service, instance, and has_respawn so vmalert rules can target only persistent daemons (has_respawn=true).
- Add services.yaml vmalert rule: ServiceDown fires when procd_service_running{has_respawn="true"} == 0 for 2 minutes. Uses alertgroup=services label (not service=host) so the metric's own service label (e.g. nginx, rpcd) is preserved in the alert.
- Add comprehensive telegraf/README.md documenting architecture, all metrics, service monitoring design, PromQL query examples, and manual test procedures.
- Add scripts/test-service-monitor.sh for end-to-end simulation: injects a bad nginx config, verifies metric drops to 0, waits for ServiceDown to fire, then restores nginx and confirms alert resolves.
- Update victoria-metrics/README_VMALERT.md with service monitoring section and cross-references.

End-to-end verified on 192.168.100.238: nginx broken config → procd removes instance from ubus → telegraf-services emits running=0 via state file → VictoriaMetrics stores metric → ServiceDown{service=nginx} fires after 2m → nginx restored → alert resolves

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
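The state-file idea described above can be sketched as follows. This is not the telegraf-services script from the PR. The function names are hypothetical, the input is assumed to have the shape of `ubus call service list` output (a service name mapping to an `instances` dict whose entries carry `running` and, for persistent daemons, a `respawn` key), and the state file layout is an assumption.

```python
import json
from pathlib import Path


def flatten_services(ubus_payload):
    """Turn a service-list style dict into one row per (service, instance)."""
    rows = {}
    for service, data in ubus_payload.items():
        for instance, info in data.get("instances", {}).items():
            rows[(service, instance)] = {
                "service": service,
                "instance": instance,
                # Assumption: a "respawn" key marks a persistent daemon.
                "has_respawn": "true" if "respawn" in info else "false",
                "running": 1 if info.get("running") else 0,
            }
    return rows


def merge_with_known(rows, state_file):
    """Re-emit persistent services missing from `rows` with running=0."""
    known = {}
    if state_file.exists():
        for entry in json.loads(state_file.read_text()):
            known[(entry["service"], entry["instance"])] = entry
    # Remember every persistent instance seen in the current listing.
    for key, row in rows.items():
        if row["has_respawn"] == "true":
            known[key] = {"service": row["service"], "instance": row["instance"]}
    state_file.write_text(json.dumps(list(known.values())))
    merged = dict(rows)
    for key, entry in known.items():
        if key not in merged:
            # The instance vanished from the listing: report it as down.
            merged[key] = {**entry, "has_respawn": "true", "running": 0}
    return sorted(merged.values(), key=lambda r: (r["service"], r["instance"]))
```

Without the state file, a service that procd drops entirely would simply stop producing samples, and a rule comparing the metric to 0 would have nothing to match.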

- Add telegraf-mwan Python script that reads /var/run/mwan3/iface_state/ to collect WAN interface connectivity state. Each file in that directory is named after an mwan3 interface and contains 'online' or 'offline'. No-ops silently when mwan3 is not running (directory absent).
- Add mwan.conf telegraf input using inputs.exec + json_v2 parser, producing mwan_interface metrics tagged with interface and status.
- Add mwan.yaml vmalert rule: WANDown fires when mwan_interface_online == 0 for 2 minutes. The interface and status labels come directly from the metric, so each WAN interface fires its own distinct alert.
- Update telegraf/README.md with WAN Monitoring section: architecture, metric reference, query examples, alert details, and manual test procedure.
- Update Makefile to install telegraf-mwan and mwan.conf.

End-to-end verified on 192.168.100.238: wan2 iface_state=offline → mwan_interface_online{interface=wan2}=0 → WANDown{interface=wan2, alertgroup=mwan} fired after 2 minutes ✓

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
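A minimal sketch of the collector described above, assuming only what the commit message states: one file per interface under /var/run/mwan3/iface_state/, each containing 'online' or 'offline'. The function name `collect_mwan` and the JSON field names are hypothetical and may differ from the real telegraf-mwan script.

```python
import json
from pathlib import Path


def collect_mwan(state_dir):
    """Return one record per mwan3 interface state file.

    Returns an empty list when the directory is absent, which mirrors
    the "no-op when mwan3 is not running" behaviour described above.
    """
    if not state_dir.is_dir():
        return []
    records = []
    for path in sorted(state_dir.iterdir()):
        if not path.is_file():
            continue
        status = path.read_text().strip()
        records.append({
            "interface": path.name,  # file name is the interface name
            "status": status,
            "online": 1 if status == "online" else 0,
        })
    return records


if __name__ == "__main__":
    # Emit JSON on stdout for a telegraf inputs.exec style consumer.
    print(json.dumps(collect_mwan(Path("/var/run/mwan3/iface_state"))))
```

Emitting a numeric `online` field alongside the textual `status` keeps the alert expression a plain comparison against 0.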

Netdata health checks and alert forwarding are now disabled permanently. Alerts are handled by vmalert (via Telegraf → VictoriaMetrics) instead.

Changes:
- 30_ns-plug_alerts: emptied to disable all health.d rules, set health=no, remove health_alarm_notify.conf from system on upgrade. Keep python plugins for non-alerting metrics (fping latency, dashboards).
- Deleted: health_alarm_notify.conf, netadata_enable_alerts, netadata_disable_alerts
- ns-plug-alert: removed netdata subcommand and ~140 lines of NETDATA_ALERT_MAP, cmd_netdata, _netdata_fire, _netdata_resolve helper functions
- Makefile: removed install of netadata hooks and health_alarm_notify.conf, removed /etc/netdata dir creation
- README.md: updated Alerts section to reflect vmalert-based alerting

Persistence: 30_ns-plug_alerts runs at every sysupgrade + fresh install, ensuring netdata alerting stays disabled across updates.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Migrate ping monitoring from netdata's fping plugin to telegraf's native ping input plugin. This provides better performance, no external dependencies, and improved system compatibility.

Changes:
- Add ns.telegraf API handler for ping monitor configuration
- Add ns.telegraf.json ACL definition for telegraf-manager role
- Add telegraf.conf.d/ping.conf with native ping plugin configuration
- Remove ns.netdata API handler and ACL (netdata integration)
- Update ns-api Makefile to install new API handler
- Update telegraf Makefile to install ping.conf and add inputs.ping tag

The new API provides the same interface:
- get-configuration: retrieve current ping hosts
- set-hosts: configure hosts to ping

The ping plugin uses native method (method="native") which sends ICMP packets directly without external ping command, requiring CAP_NET_RAW capability or root privileges. Metrics are tagged with influxdb_db="ping-metrics" for proper InfluxDB database routing.

BREAKING CHANGE: ns.netdata API is removed. Clients must migrate to ns.telegraf API for ping monitor configuration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
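As a rough illustration of what a set-hosts call could end up writing, the sketch below renders a telegraf ping input from a host list. It is not the ns.telegraf handler from the PR. The function name `render_ping_conf` is hypothetical, and the exact layout of ping.conf is an assumption; only the `urls` and `method` options of inputs.ping and the `influxdb_db` tag value come from the plugin documentation and the commit message.

```python
import json


def render_ping_conf(hosts):
    """Render a telegraf [[inputs.ping]] section for the given hosts.

    Assumption: the real ping.conf may contain additional options.
    json.dumps is used only to quote each host as a double-quoted string.
    """
    urls = ", ".join(json.dumps(host) for host in hosts)
    lines = [
        "[[inputs.ping]]",
        f"  urls = [{urls}]",
        '  method = "native"',  # send ICMP directly, no external ping binary
        "  [inputs.ping.tags]",
        '    influxdb_db = "ping-metrics"',
        "",
    ]
    return "\n".join(lines)
```

For example, `render_ping_conf(["1.1.1.1", "example.org"])` yields a section whose `urls` list contains both hosts.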

Migrate the following APIs:
- ns.report: latency-and-quality-report
- ns.dashboard: interface-traffic

Replace data from Netdata with data from VictoriaMetrics: netdata is now deprecated and will be removed in the future.

Assisted-by:: Copilot:Sonnet4.6

gsanchietti · 2026-04-30T14:42:55Z

Replaced by #1633

gsanchietti self-assigned this Apr 22, 2026

gsanchietti changed the title ~~Alertmanager victoria~~ [EXPERIMENTAL] Mimir integration with Victoria metrics Apr 22, 2026

gsanchietti changed the title ~~[EXPERIMENTAL] Mimir integration with Victoria metrics~~ [EXPERIMENTAL] Mimir integration with VictoriaMetrics Apr 22, 2026

gsanchietti force-pushed the alertmanager_victoria branch 5 times, most recently from 010e499 to e94f63d Compare April 23, 2026 14:52

Tbaile added 22 commits April 30, 2026 11:48

feat: added ns-stats

231c952

controller fixes

0e8276a

removed deprecations

96007a5

added complete management of metrics and data

2ade2d3

moved telegraf configuration where it's more appropriate

fb6ea96

added correct values for netifyd

56d6dff

added grafana for monitoring

8ad8810

removed unneeded fields

15ebde1

unlimited restart

7dde19e

separated telegraf plugins

c72714d

expanded inputs

fb514ff

refactor: telegraf should configure netifyd

4ba8b61

fixed build issue

7b2c75e

added logs

175593b

uploading all logs

4f063e4

fixed some config issues

3095eab

added ulogd

a119fa4

separated telegraf concerns

3476feb

added vlogscli and vmalert

50e78b1

fixed build issue with victoria logs

445f60c

chore: updated banip

d6be974

separated telegraf parsers

2973e28

Tbaile and others added 15 commits April 30, 2026 11:48

updated netifyd ingester with http

bb80503

added outputs.sql

ed341e8

synced banip

b64518c

updated log interval to 60s

3a2fc00

updated sink with a testing env

c0ccfe1

adjusting telegraf inputs

4722369

removed netifyd logging, too much data

101d11a

feat: add mimir alerting

a04979f

fix: make sure to update netdata alert conf

9c48d52

The netdata alert script is inside the configuration directory, so it is preserved across upgrades

chore: remove unused mwan alert

aa8b46c

This was a leftover when mwan alert was moved from netdata chart to mwan hooks

feat: repeat mwan down alert

f22d0f6

This is required to keep the alert firing inside Mimir

gsanchietti force-pushed the alertmanager_victoria branch 2 times, most recently from 2caa025 to 4158b17 Compare April 30, 2026 13:44

gsanchietti and others added 2 commits April 30, 2026 15:57

gsanchietti force-pushed the alertmanager_victoria branch from 4158b17 to 15cffa5 Compare April 30, 2026 13:57

gsanchietti closed this Apr 30, 2026

gsanchietti deleted the alertmanager_victoria branch May 7, 2026 07:02

gsanchietti wants to merge 39 commits into main from alertmanager_victoria

2 participants
