feat: complete redesign – process-level net+file I/O monitoring #8
Open
Full rewrite fixing all identified issues:
- eBPF: file I/O events now emitted; pwrite*/pwritev* direction fixed to
  DIR_WRITE; new hooks for writev/readv/pipe/dup/fork; EXIT only on main
  thread; resources properly released via link.Close() in Objects.Close()
- Model: separate Net and File IOCounters per process; sync/atomic.Uint64
  (Go 1.19+) replacing go.uber.org/atomic; io.ReadAll replacing ioutil
- Store: RWMutex-protected map eliminates old ConnectionsMap race condition
- Collector: context-driven lifecycle with graceful Stop(); file/net I/O
  routing by FDClass; /proc/net/* async resolver (100ms refresh)
- CLI: signal.NotifyContext graceful shutdown; log/slog replacing PBLogger;
  kernel version check (>= 4.9); cobra + viper config
- TUI: bubbletea Elm architecture; Net↓/Net↑/File R/File W/Conns columns
- Web: stdlib net/http; WebSocket hub; Prometheus collector interface;
  Alpine.js+Chart.js SPA embedded via go:embed
- Config: all previously-TODO fields implemented (FilterPIDs, FilterNames,
  IncludeFileIO, IncludeLocal)
- Tests: events decode, IOCounter concurrent, Store concurrent, NetResolver
  fixture parsing – all pass with -race
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
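The Model and Store bullets above can be illustrated with a minimal sketch. It assumes a per-process struct holding separate Net and File counters built on sync/atomic.Uint64, and a PID map guarded by sync.RWMutex. The type, field, and method names here (IOCounter, Process, Store, GetOrCreate) are illustrative guesses, not necessarily the names used in this PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// IOCounter holds byte totals for one class of I/O.
// Names are hypothetical; only the use of atomic.Uint64 comes from the PR text.
type IOCounter struct {
	rx atomic.Uint64
	tx atomic.Uint64
}

func (c *IOCounter) AddRx(n uint64)  { c.rx.Add(n) }
func (c *IOCounter) AddTx(n uint64)  { c.tx.Add(n) }
func (c *IOCounter) TotalRx() uint64 { return c.rx.Load() }
func (c *IOCounter) TotalTx() uint64 { return c.tx.Load() }

// Process keeps network and file counters separate.
type Process struct {
	PID  uint32
	Net  IOCounter
	File IOCounter
}

// Store guards the PID map with an RWMutex so readers do not block each other.
type Store struct {
	mu    sync.RWMutex
	procs map[uint32]*Process
}

func NewStore() *Store { return &Store{procs: make(map[uint32]*Process)} }

// GetOrCreate returns the entry for pid, creating it on first use.
func (s *Store) GetOrCreate(pid uint32) *Process {
	s.mu.RLock()
	p, ok := s.procs[pid]
	s.mu.RUnlock()
	if ok {
		return p
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if p, ok = s.procs[pid]; ok { // re-check: another goroutine may have created it
		return p
	}
	p = &Process{PID: pid}
	s.procs[pid] = p
	return p
}

func main() {
	s := NewStore()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				s.GetOrCreate(42).Net.AddRx(10)
			}
		}()
	}
	wg.Wait()
	fmt.Println(s.GetOrCreate(42).Net.TotalRx()) // 8 goroutines * 1000 adds * 10 bytes
}
```

Storing *Process pointers in the map means the atomic counters are never copied, so callers can update them after the map lock has been released.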
The loader was patching the 'events' map to BPF_MAP_TYPE_RINGBUF (27) on kernel >= 5.8, but the C code calls bpf_perf_event_output(), which only accepts PERF_EVENT_ARRAY. The BPF verifier rejects the mismatch with "cannot pass map_type 27 into func bpf_perf_event_output#25".

Fix: remove the ring buffer upgrade entirely and always use PERF_EVENT_ARRAY, which is supported from kernel 4.9+ and works correctly with the existing bpf_perf_event_output() calls in pbmon.c. Ring buffer support would require rewriting the C helper calls to bpf_ringbuf_output() and is left for a future iteration.

Also remove the now-unused ringbuf import, UseRingBuf() method, and ringBufReader from reader.go.
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Ctrl+C fix: bubbletea runs in raw terminal mode, where Ctrl+C is delivered as byte 0x03 on stdin (a KeyMsg), NOT as SIGINT. signal.NotifyContext never sees the signal, so <-ctx.Done() in main blocks until a second Ctrl+C, pressed after the TUI exits and normal terminal mode is restored, generates SIGINT.

Fix: tui.Start now accepts a context.CancelFunc. Pressing q or Ctrl+C calls cancel() before tea.Quit, which unblocks main and triggers the ordered shutdown (coll.Stop → wg.Wait).

No-data fix: eBPF only populates pfid_class_map for FDs created while pbmon is running (via socket/open/accept tracepoints). Pre-existing connections emit IO events with FDClass=UNKNOWN. The old switch in handleIO had no case for UNKNOWN, so proc.Net and proc.File were never incremented for those connections.

Fix: for UNKNOWN-class events, resolve the FD type via the /proc/PID/fd/FD symlink (reusing the existing NetResolver.LookupByFD path). If the symlink points to a socket inode present in the /proc/net cache, treat the event as FDClassSocket and route bytes to proc.Net. The resolved ConnectionInfo is stored on the Connection so subsequent events for the same FD skip the /proc read (one syscall per connection lifetime).

Also remove two dead import-keeping dummy variables.
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Run with: pbmon --debug --log /tmp/pbmon.log --no-tui
Loader (internal/bpf/loader.go):
- Load() now accepts *slog.Logger; each tracepoint attach attempt is
logged (success/fail/not-in-ELF) so you can see exactly which hooks
are active.
- Summary line at INFO level: "eBPF tracepoints attached attached=N ..."
- If zero tracepoints attach, Load() now returns an error immediately
instead of silently returning a useless Objects.
Collector (internal/collector/collector.go):
- readLoop: logs when it starts, when the first event ever arrives
(with raw byte length), every 1 000 events prints channel/drop stats.
- workerLoop: logs decode errors with raw length.
- handleIO: logs every FDClass=UNKNOWN event + whether /proc resolved
it to a socket; atomic counters track totals.
- snapshotLoop: every 10 s logs process count, cumulative event counts
by type (IO/FD/Proc), unknown/resolved class counts, and dropped.
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Switch to dual-mode BPF output to resolve silent delivery failures in environments where bpf_perf_event_output is blocked (containers/VMs with restricted perf_event_open):
- pbmon.c: compile-time #ifdef USE_RINGBUF selects map type and output
  helper; all three emit functions (io/fd/proc) updated accordingly
- Makefile: builds both pbmon_bpf.o (perf) and pbmon_bpf_ringbuf.o
- bpf_objects.go: embeds both ELF objects as PbmonELF / PbmonELFRingBuf
- loader.go: detects kernel ring buffer support via features.HaveMapType;
  loads ringbuf ELF on kernel >= 5.8, perf ELF otherwise
- reader.go: inspects objs.Events.Type() at runtime; uses ringbuf.Reader
  for RingBuf maps and perf.Reader for PerfEventArray maps
- collector.go: passes logger to NewReaderWithLogger
- .gitignore: add eBPF .o artifacts and pbmon binary
Ring buffer (BPF_MAP_TYPE_RINGBUF) uses shared memory + futex delivery, completely bypassing the perf subsystem, so it works in environments where perf_event_open is blocked.
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
bpf_helpers.h included <linux/bpf.h>, which is not available on macOS, breaking `make ebpf` when cross-compiling with -target bpf from macOS.
- bpf_helpers.h: remove <linux/bpf.h> dependency; embed all needed
  constants inline (BPF_MAP_TYPE_*, BPF_FUNC_*, BPF_ANY, BPF_F_CURRENT_CPU)
  with values taken verbatim from linux/bpf.h in the Linux kernel source
- Makefile: make SYSROOT_INC conditional: only pass -I$(SYSROOT_INC) when
  the directory exists (absent on macOS), so the build succeeds without
  any system Linux headers installed
Compilation on Linux is unchanged; macOS with clang and -target bpf now works out of the box without a Linux sysroot.
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Apple's clang (Xcode) does not ship the BPF backend, so `-target bpf` fails with "No available targets are compatible with triple 'bpf'". On Darwin, auto-use `$(brew --prefix llvm)/bin/clang` when Homebrew LLVM is installed, with a clear error pointing to `brew install llvm` if it is not. Linux behavior is unchanged (CLANG defaults to system clang). https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Root cause: the key handler called m.cancel() (the signal.NotifyContext
stop function) from *inside* bubbletea's event loop. This cancelled the
context that was also passed to tea.WithContext(ctx), causing a race:
bubbletea's context-watcher goroutine tried to send a quit message back
into the event channel while the event loop was mid-Update, resulting in
a hang or the terminal being left in a broken state.
Fix:
- Key handler now only returns tea.Quit — no cancel call inside Update
- tui.Start() calls cancel() unconditionally *after* p.Run() returns,
so the rest of the app always receives the shutdown signal regardless
of whether the TUI exited via q/Ctrl+C or an external signal
- main.go: derive a separate context.WithCancel ctx from the
signal.NotifyContext; pass cancel (not stop) to tui.Start so the
TUI's exit correctly propagates to the main shutdown path
- Remove AppModel.cancel and AppModel.ctx fields (no longer needed)
Shutdown sequence is now deterministic:
q/Ctrl+C → tea.Quit → p.Run() returns → cancel() → <-ctx.Done()
→ coll.Stop() → wg.Wait() → clean exit
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
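The shutdown sequence above can be sketched with the standard library alone. This is an assumption-laden illustration, not the PR's code: startTUI stands in for tui.Start, the run callback stands in for bubbletea's p.Run(), and a plain goroutine stands in for the collector. What it shows is the ordering: cancel() fires only after the UI loop has returned, and the context passed to the rest of the app is derived from the signal context.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// startTUI is a hypothetical stand-in for tui.Start. run blocks until the UI
// exits; cancel is called after it returns, never from inside the UI loop.
func startTUI(run func(), cancel context.CancelFunc) {
	defer cancel()
	run()
}

// shutdownOrder wires the contexts together and records the order of steps.
func shutdownOrder(run func()) []string {
	// Outer context: cancelled by SIGINT/SIGTERM.
	sigCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
	// Inner context: also cancellable by the TUI exiting.
	ctx, cancel := context.WithCancel(sigCtx)
	defer cancel()

	var (
		mu    sync.Mutex
		steps []string
	)
	record := func(s string) {
		mu.Lock()
		steps = append(steps, s)
		mu.Unlock()
	}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // stand-in for the collector goroutine
		defer wg.Done()
		<-ctx.Done()
		record("collector stopped")
	}()

	startTUI(func() { run(); record("tui exited") }, cancel)
	<-ctx.Done() // unblocks because startTUI cancelled ctx
	wg.Wait()
	record("clean exit")
	return steps
}

func main() {
	// The UI "exits" immediately here, as if q had been pressed.
	fmt.Println(shutdownOrder(func() {}))
}
```

Because an external signal cancels the outer context, the same wait on ctx.Done() covers both exit paths.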
For FDs with FDClassUnknown (opened before pbmon started), the code was calling LookupByFD (reads /proc/PID/fd/FD + inode lookup) on *every* IO event for that FD, and logging a debug line each time. For a busy FD this produces thousands of log lines per second and burns CPU on /proc reads.

Root cause: Info()==nil couldn't distinguish "never tried" from "tried and confirmed not a socket".

Fix: add Connection.netLookupDone (atomic.Uint32) that is set after the first failed /proc lookup. Subsequent events for the same FD skip the lookup and log nothing. The debug log "will not retry" fires exactly once per FD instead of once per event. The positive path (socket resolved on first try) is unchanged.
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Without --installed, brew --prefix always exits 0 and prints a path even when the formula is not installed, so the missing-llvm error was never triggered. --installed makes brew exit non-zero when the formula is absent, giving an empty _BREW_PREFIX and the correct error message. https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Adds `pbmon test` subcommand that runs a full end-to-end self-test:
Phases:
1. Pre-BPF – starts TCP echo connections and opens temp files before
BPF loads; records exact TX/RX bytes per FD
2. Load BPF – calls collector.New(), starts web server (--web-port)
3. Post-BPF – opens NEW TCP connections and temp files after BPF loads;
runs traffic for (duration - 5s) seconds
4. Settle – waits 3s for the store snapshot to stabilise
5. Verify – for each post-BPF FD: asserts conn.IO.TotalTx/Rx() ≥ 95%
of expected bytes; pre-BPF FDs are informational only
Traffic generation:
- Network: multiple goroutines connect to an in-process TCP echo server
and send/receive random-sized chunks; exact TX = RX for each FD
- File I/O: temp files written then read back; BPF sees write()/read()
syscalls with exact byte counts
- Two independent RNG streams (seed+offset) prevent correlated patterns
- All FDs kept open until after verification (prevents FD reuse)
Pre-BPF vs post-BPF distinction:
- Pre-BPF: FDs opened before Load(); BPF has FDClassUnknown; partial
capture expected via /proc/net inode lookup
- Post-BPF: FDs opened after Load(); BPF classifies at creation; full
capture required (±5% tolerance)
GitHub Actions (.github/workflows/integration-test.yml):
- ubuntu-latest (Linux 5.15+, sudo-capable) — no privileged container needed
- Builds eBPF objects with `make ebpf` (clang from apt)
- Builds binary with `make build`
- Runs `sudo ./pbmon test --duration 25s --web-port 0`
- Separate unit-test job that skips eBPF compilation
Usage:
sudo ./pbmon test # 20s, web on :18080
sudo ./pbmon test --duration 30s # longer simulation
sudo ./pbmon test --web-port 0 # no web server
sudo ./pbmon test --seed 42 # reproducible run
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
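The Verify phase described above reduces to a threshold check per FD. The following is a minimal sketch under stated assumptions: the fdResult struct and the verify and minAccepted helpers are hypothetical names, and only the 95% floor for post-BPF FDs and the "pre-BPF is informational" rule come from the commit message.

```go
package main

import "fmt"

// fdResult is a hypothetical record of what the traffic generator sent
// versus what the store reported for one FD.
type fdResult struct {
	Name     string
	PreBPF   bool // opened before BPF load: informational only
	Expected uint64
	Observed uint64
}

// minAccepted returns the smallest byte count that is at least 95% of
// expected. expected - expected/20 equals ceil(0.95 * expected) for
// integers and cannot overflow.
func minAccepted(expected uint64) uint64 {
	return expected - expected/20
}

// verify returns one message per post-BPF FD that fell below the floor.
func verify(results []fdResult) []string {
	var failures []string
	for _, r := range results {
		if r.PreBPF {
			continue // partial capture is expected, so never a failure
		}
		if r.Observed < minAccepted(r.Expected) {
			failures = append(failures,
				fmt.Sprintf("%s: observed %d < required %d", r.Name, r.Observed, minAccepted(r.Expected)))
		}
	}
	return failures
}

func main() {
	results := []fdResult{
		{Name: "tcp-post-1", Expected: 1000, Observed: 960},
		{Name: "tcp-post-2", Expected: 1000, Observed: 900},
		{Name: "tcp-pre-1", PreBPF: true, Expected: 1000, Observed: 100},
	}
	for _, f := range verify(results) {
		fmt.Println("FAIL", f)
	}
}
```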
internal/bpf and internal/collector both embed compiled BPF ELF objects (pbmon_bpf.o / pbmon_bpf_ringbuf.o) via //go:embed. Those files are only produced by `make ebpf` (requires clang), which the lightweight unit-test job does not run.

Fix: test only the packages that have no BPF compile-time dependency:
./internal/model/... ./internal/store/... ./config/... ./ui/...

The integration-test job compiles the BPF objects first and covers internal/bpf + internal/collector end-to-end via `pbmon test`.
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
The previous "give up after one miss" approach broke pre-BPF monitoring:
- Socket FDs: netLookupDone was set on the first cache miss (e.g. inode
not yet in /proc/net), so those FDs were permanently ignored even though
the connection was real and the cache would populate moments later.
- File/pipe FDs: symlink is not "socket:[...]" so LookupByFD returned nil,
netLookupDone was set, and effectiveClass stayed FDClassUnknown → bytes
were never routed to proc.File.
Fix — replace netLookupDone with resolvedClass (atomic.Uint32):
1. ProbeFDClass() reads the /proc/PID/fd/FD symlink and classifies it:
- "socket:[inode]" → (FDClassSocket, inode)
- "pipe:[inode]" → (FDClassPipe, 0)
- "/absolute/path" → (FDClassFile, 0)
- unreadable → (FDClassUnknown, 0)
2. handleIO for FDClassUnknown FDs:
- If resolvedClass is already set → use it directly (fast path).
- Otherwise probe the symlink:
- Socket + inode in cache → resolve, set resolvedClass, route to Net.
- Socket + inode not in cache yet → trigger refresh, retry next event
(no permanent give-up).
- File / pipe → set resolvedClass immediately, route to File.
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
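The classification and caching rules above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the numeric FDClass values are invented (zero is used as "not resolved yet"), classifyFDLink works on an already-read symlink target so it can be exercised without /proc, and inodeCached is a hypothetical stand-in for the /proc/net cache lookup.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"sync/atomic"
)

type FDClass uint32

const (
	FDClassUnknown FDClass = iota // zero value doubles as "not resolved yet"
	FDClassSocket
	FDClassPipe
	FDClassFile
)

// classifyFDLink maps the target of a /proc/PID/fd/FD symlink to a class
// and, for sockets, the inode number.
func classifyFDLink(target string) (FDClass, uint64) {
	switch {
	case strings.HasPrefix(target, "socket:[") && strings.HasSuffix(target, "]"):
		s := strings.TrimSuffix(strings.TrimPrefix(target, "socket:["), "]")
		inode, err := strconv.ParseUint(s, 10, 64)
		if err != nil {
			return FDClassUnknown, 0
		}
		return FDClassSocket, inode
	case strings.HasPrefix(target, "pipe:["):
		return FDClassPipe, 0
	case strings.HasPrefix(target, "/"):
		return FDClassFile, 0
	default:
		return FDClassUnknown, 0
	}
}

// Connection caches the resolved class for an FD that BPF reported as unknown.
type Connection struct {
	resolvedClass atomic.Uint32
}

// effectiveClass applies the fast path first, then probes. A socket whose
// inode is not cached yet stays unresolved so the next event retries.
func (c *Connection) effectiveClass(target string, inodeCached func(uint64) bool) FDClass {
	if rc := FDClass(c.resolvedClass.Load()); rc != FDClassUnknown {
		return rc
	}
	class, inode := classifyFDLink(target)
	if class == FDClassSocket && !inodeCached(inode) {
		return FDClassUnknown // no permanent give-up
	}
	c.resolvedClass.Store(uint32(class))
	return class
}

func main() {
	target, err := os.Readlink("/proc/self/fd/0")
	if err != nil {
		fmt.Println("readlink failed (expected off Linux):", err)
		return
	}
	class, inode := classifyFDLink(target)
	fmt.Println(target, class, inode)
}
```

Keeping the string classification separate from the readlink call makes the rules testable with fixture strings, in the same spirit as the NetResolver fixture tests mentioned earlier.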
TUI:
- Right panel now shows block-char sparklines (▁▂▃▄▅▆▇█) for Net↓, Net↑,
and (when --file-io) File R / File W above the connection list.
- Connection class column now uses the resolved class for pre-BPF FDs
(falls back to FDClassUnknown only when still unresolved).
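A block-character sparkline of the kind described for the right panel can be rendered with a few lines of Go. This sketch is an assumption about the approach, not the TUI's actual code: it scales each sample relative to the largest value in the window, and the function name sparkline is hypothetical.

```go
package main

import "fmt"

// blocks are the eight block characters named in the commit message,
// from lowest to highest.
var blocks = []rune("▁▂▃▄▅▆▇█")

// sparkline maps each sample to a block character, scaled against the
// maximum sample in the slice. An all-zero window renders as the lowest block.
func sparkline(values []uint64) string {
	if len(values) == 0 {
		return ""
	}
	var max uint64
	for _, v := range values {
		if v > max {
			max = v
		}
	}
	out := make([]rune, len(values))
	for i, v := range values {
		idx := 0
		if max > 0 {
			idx = int(v * uint64(len(blocks)-1) / max)
		}
		out[i] = blocks[idx]
	}
	return string(out)
}

func main() {
	fmt.Println(sparkline([]uint64{0, 1, 2, 3, 4, 5, 6, 7}))
}
```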
Web:
- Chart now has four datasets: Net↓, Net↑, File R, File W with distinct
colors matching the process table headers.
- New GET /api/processes/{pid} endpoint returns per-process bandwidth history
(net_rx_history, net_tx_history, file_rd_history, file_wr_history as
arrays of bytes/s values). Called on process selection to pre-populate
the chart with up to 60 s of history before live ticks begin appending.
- Tooltip enabled (index mode) so hovering shows all four series at once.
- Connection badge class display also uses resolved class for pre-BPF FDs.
https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
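The per-process history returned by /api/processes/{pid} implies a bounded buffer of bytes/s samples. A minimal sketch follows, assuming one sample per second and a fixed-capacity ring. The History type and processHistory struct are hypothetical; only the four JSON field names come from the commit message. The ring is not safe for concurrent use and requires a positive capacity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// History is a fixed-capacity ring of bytes/s samples.
type History struct {
	buf   []uint64
	next  int // index the next sample is written to
	count int // number of valid samples, at most len(buf)
}

// NewHistory allocates a ring; capacity must be greater than zero.
func NewHistory(capacity int) *History {
	return &History{buf: make([]uint64, capacity)}
}

// Push appends a sample, overwriting the oldest one when the ring is full.
func (h *History) Push(v uint64) {
	h.buf[h.next] = v
	h.next = (h.next + 1) % len(h.buf)
	if h.count < len(h.buf) {
		h.count++
	}
}

// Snapshot returns a copy of the samples, oldest first.
func (h *History) Snapshot() []uint64 {
	out := make([]uint64, 0, h.count)
	start := (h.next - h.count + len(h.buf)) % len(h.buf)
	for i := 0; i < h.count; i++ {
		out = append(out, h.buf[(start+i)%len(h.buf)])
	}
	return out
}

// processHistory mirrors the JSON field names listed in the commit message.
type processHistory struct {
	NetRx  []uint64 `json:"net_rx_history"`
	NetTx  []uint64 `json:"net_tx_history"`
	FileRd []uint64 `json:"file_rd_history"`
	FileWr []uint64 `json:"file_wr_history"`
}

func main() {
	rx := NewHistory(60) // 60 samples, assuming one per second
	for i := uint64(1); i <= 5; i++ {
		rx.Push(i * 1024)
	}
	empty := NewHistory(60)
	body, err := json.Marshal(processHistory{
		NetRx:  rx.Snapshot(),
		NetTx:  empty.Snapshot(),
		FileRd: empty.Snapshot(),
		FileWr: empty.Snapshot(),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```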