
feat: complete redesign – process-level net+file I/O monitoring #8

Open
Ivlyth wants to merge 14 commits into main from claude/review-project-issues-CsaTq

Conversation


Ivlyth commented Apr 16, 2026

Full rewrite fixing all identified issues:

  • eBPF: file I/O events now emitted; pwrite*/pwritev* direction fixed to
    DIR_WRITE; new hooks for writev/readv/pipe/dup/fork; EXIT only on main
    thread; resources properly released via link.Close() in Objects.Close()
  • Model: separate Net and File IOCounters per process; sync/atomic.Uint64
    (Go 1.19+) replacing go.uber.org/atomic; io.ReadAll replacing ioutil
  • Store: RWMutex-protected map eliminates old ConnectionsMap race condition
  • Collector: context-driven lifecycle with graceful Stop(); file/net I/O
    routing by FDClass; /proc/net/* async resolver (100ms refresh)
  • CLI: signal.NotifyContext graceful shutdown; log/slog replacing PBLogger;
    kernel version check (>= 4.9); cobra + viper config
  • TUI: bubbletea Elm architecture; Net↓/Net↑/File R/File W/Conns columns
  • Web: stdlib net/http; WebSocket hub; Prometheus collector interface;
    Alpine.js+Chart.js SPA embedded via go:embed
  • Config: all previously-TODO fields implemented (FilterPIDs, FilterNames,
    IncludeFileIO, IncludeLocal)
  • Tests: events decode, IOCounter concurrent, Store concurrent, NetResolver
    fixture parsing – all pass with -race

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3

claude added 14 commits April 13, 2026 01:49
The loader was patching the 'events' map to BPF_MAP_TYPE_RINGBUF (27)
on kernel >= 5.8, but the C code calls bpf_perf_event_output() which
only accepts PERF_EVENT_ARRAY. The BPF verifier rejects the mismatch
with "cannot pass map_type 27 into func bpf_perf_event_output#25".

Fix: remove the ring buffer upgrade entirely and always use
PERF_EVENT_ARRAY, which is supported from kernel 4.9+ and works
correctly with the existing bpf_perf_event_output() calls in pbmon.c.
Ring buffer support would require rewriting the C helper calls to
bpf_ringbuf_output() and is left for a future iteration.

Also remove the now-unused ringbuf import, UseRingBuf() method, and
ringBufReader from reader.go.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Ctrl+C fix:
  bubbletea runs in raw terminal mode where Ctrl+C is delivered as byte
  0x03 to stdin (a KeyMsg), NOT as SIGINT. signal.NotifyContext never
  sees the signal, so <-ctx.Done() in main blocks until a second Ctrl+C
  after the TUI exits restores normal terminal mode and generates SIGINT.
  Fix: tui.Start now accepts a context.CancelFunc. Pressing q or Ctrl+C
  calls cancel() before tea.Quit, which unblocks main and triggers the
  ordered shutdown (coll.Stop → wg.Wait).

No-data fix:
  eBPF only populates pfid_class_map for FDs created while pbmon is
  running (via socket/open/accept tracepoints). Pre-existing connections
  emit IO events with FDClass=UNKNOWN. The old switch in handleIO had no
  case for UNKNOWN, so proc.Net and proc.File were never incremented for
  those connections.
  Fix: for UNKNOWN-class events, resolve the FD type via /proc/PID/fd/FD
  symlink (reusing the existing NetResolver.LookupByFD path). If the
  symlink points to a socket inode present in the /proc/net cache, treat
  the event as FDClassSocket and route bytes to proc.Net. The resolved
  ConnectionInfo is stored on the Connection so subsequent events for the
  same FD skip the /proc read (one syscall per connection lifetime).

Also remove two dead import-keeping dummy variables.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Run with: pbmon --debug --log /tmp/pbmon.log --no-tui

Loader (internal/bpf/loader.go):
  - Load() now accepts *slog.Logger; each tracepoint attach attempt is
    logged (success/fail/not-in-ELF) so you can see exactly which hooks
    are active.
  - Summary line at INFO level: "eBPF tracepoints attached attached=N ..."
  - If zero tracepoints attach, Load() now returns an error immediately
    instead of silently returning a useless Objects.

Collector (internal/collector/collector.go):
  - readLoop: logs when it starts and when the first event ever arrives
    (with raw byte length), and prints channel/drop stats every 1,000 events.
  - workerLoop: logs decode errors with raw length.
  - handleIO: logs every FDClass=UNKNOWN event + whether /proc resolved
    it to a socket; atomic counters track totals.
  - snapshotLoop: every 10 s logs process count, cumulative event counts
    by type (IO/FD/Proc), unknown/resolved class counts, and dropped.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Switch to dual-mode BPF output to resolve silent delivery failures in
environments where bpf_perf_event_output is blocked (containers/VMs
with restricted perf_event_open):

- pbmon.c: compile-time #ifdef USE_RINGBUF selects map type and output
  helper; all three emit functions (io/fd/proc) updated accordingly
- Makefile: builds both pbmon_bpf.o (perf) and pbmon_bpf_ringbuf.o
- bpf_objects.go: embeds both ELF objects as PbmonELF / PbmonELFRingBuf
- loader.go: detects kernel ring buffer support via features.HaveMapType;
  loads ringbuf ELF on kernel >= 5.8, perf ELF otherwise
- reader.go: inspects objs.Events.Type() at runtime; uses ringbuf.Reader
  for RingBuf maps and perf.Reader for PerfEventArray maps
- collector.go: passes logger to NewReaderWithLogger
- .gitignore: add eBPF .o artifacts and pbmon binary

Ring buffer (BPF_MAP_TYPE_RINGBUF) uses shared memory + futex delivery,
completely bypassing the perf subsystem, so it works in environments
where perf_event_open is blocked.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
bpf_helpers.h included <linux/bpf.h> which is not available on macOS,
breaking `make ebpf` when cross-compiling with -target bpf from macOS.

- bpf_helpers.h: remove <linux/bpf.h> dependency; embed all needed
  constants inline (BPF_MAP_TYPE_*, BPF_FUNC_*, BPF_ANY, BPF_F_CURRENT_CPU)
  with values taken verbatim from linux/bpf.h in the Linux kernel source
- Makefile: make SYSROOT_INC conditional — only pass -I$(SYSROOT_INC)
  when the directory exists (absent on macOS), so the build succeeds
  without any system Linux headers installed

Compilation on Linux is unchanged; macOS with clang and -target bpf now
works out of the box without a Linux sysroot.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Apple's clang (Xcode) does not ship the BPF backend, so
`-target bpf` fails with "No available targets are compatible
with triple 'bpf'".

On Darwin, auto-use `$(brew --prefix llvm)/bin/clang` when
Homebrew LLVM is installed, with a clear error pointing to
`brew install llvm` if it is not.

Linux behavior is unchanged (CLANG defaults to system clang).

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Root cause: the key handler called m.cancel() (the signal.NotifyContext
stop function) from *inside* bubbletea's event loop. This cancelled the
context that was also passed to tea.WithContext(ctx), causing a race:
bubbletea's context-watcher goroutine tried to send a quit message back
into the event channel while the event loop was mid-Update, resulting in
a hang or the terminal being left in a broken state.

Fix:
- Key handler now only returns tea.Quit — no cancel call inside Update
- tui.Start() calls cancel() unconditionally *after* p.Run() returns,
  so the rest of the app always receives the shutdown signal regardless
  of whether the TUI exited via q/Ctrl+C or an external signal
- main.go: derive a separate context.WithCancel ctx from the
  signal.NotifyContext; pass cancel (not stop) to tui.Start so the
  TUI's exit correctly propagates to the main shutdown path
- Remove AppModel.cancel and AppModel.ctx fields (no longer needed)

Shutdown sequence is now deterministic:
  q/Ctrl+C → tea.Quit → p.Run() returns → cancel() → <-ctx.Done()
             → coll.Stop() → wg.Wait() → clean exit

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
For FDs with FDClassUnknown (opened before pbmon started), the code was
calling LookupByFD (reads /proc/PID/fd/FD + inode lookup) on *every*
IO event for that FD, and logging a debug line each time. For a busy FD
this produces thousands of log lines per second and burns CPU on /proc reads.

Root cause: Info()==nil couldn't distinguish "never tried" from "tried
and confirmed not a socket".

Fix: add Connection.netLookupDone (atomic.Uint32) that is set after the
first failed /proc lookup. Subsequent events for the same FD skip the
lookup and log nothing. The debug log "will not retry" fires exactly once
per FD instead of once per event.

The positive path (socket resolved on first try) is unchanged.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Without --installed, brew --prefix always exits 0 and prints a path
even when the formula is not installed, so the missing-llvm error was
never triggered. --installed makes brew exit non-zero when the formula
is absent, giving an empty _BREW_PREFIX and the correct error message.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
Adds `pbmon test` subcommand that runs a full end-to-end self-test:

Phases:
  1. Pre-BPF  – starts TCP echo connections and opens temp files before
                BPF loads; records exact TX/RX bytes per FD
  2. Load BPF – calls collector.New(), starts web server (--web-port)
  3. Post-BPF – opens NEW TCP connections and temp files after BPF loads;
                runs traffic for (duration - 5s) seconds
  4. Settle   – waits 3s for the store snapshot to stabilise
  5. Verify   – for each post-BPF FD: asserts conn.IO.TotalTx/Rx() ≥ 95%
                of expected bytes; pre-BPF FDs are informational only

Traffic generation:
  - Network: multiple goroutines connect to an in-process TCP echo server
    and send/receive random-sized chunks; exact TX = RX for each FD
  - File I/O: temp files written then read back; BPF sees write()/read()
    syscalls with exact byte counts
  - Two independent RNG streams (seed+offset) prevent correlated patterns
  - All FDs kept open until after verification (prevents FD reuse)

Pre-BPF vs post-BPF distinction:
  - Pre-BPF: FDs opened before Load(); BPF has FDClassUnknown; partial
    capture expected via /proc/net inode lookup
  - Post-BPF: FDs opened after Load(); BPF classifies at creation; full
    capture required (±5% tolerance)

GitHub Actions (.github/workflows/integration-test.yml):
  - ubuntu-latest (Linux 5.15+, sudo-capable) — no privileged container needed
  - Builds eBPF objects with `make ebpf` (clang from apt)
  - Builds binary with `make build`
  - Runs `sudo ./pbmon test --duration 25s --web-port 0`
  - Separate unit-test job that skips eBPF compilation

Usage:
  sudo ./pbmon test                          # 20s, web on :18080
  sudo ./pbmon test --duration 30s           # longer simulation
  sudo ./pbmon test --web-port 0             # no web server
  sudo ./pbmon test --seed 42                # reproducible run

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
internal/bpf and internal/collector both embed compiled BPF ELF objects
(pbmon_bpf.o / pbmon_bpf_ringbuf.o) via //go:embed. Those files are only
produced by `make ebpf` (requires clang), which the lightweight unit-test
job does not run.

Fix: test only the packages that have no BPF compile-time dependency:
  ./internal/model/... ./internal/store/... ./config/... ./ui/...

The integration-test job compiles the BPF objects first and covers
internal/bpf + internal/collector end-to-end via `pbmon test`.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
The previous "give up after one miss" approach broke pre-BPF monitoring:
- Socket FDs: netLookupDone was set on the first cache miss (e.g. inode
  not yet in /proc/net), so those FDs were permanently ignored even though
  the connection was real and the cache would populate moments later.
- File/pipe FDs: symlink is not "socket:[...]" so LookupByFD returned nil,
  netLookupDone was set, and effectiveClass stayed FDClassUnknown → bytes
  were never routed to proc.File.

Fix — replace netLookupDone with resolvedClass (atomic.Uint32):

1. ProbeFDClass() reads the /proc/PID/fd/FD symlink and classifies it:
   - "socket:[inode]" → (FDClassSocket, inode)
   - "pipe:[inode]"   → (FDClassPipe, 0)
   - "/absolute/path" → (FDClassFile, 0)
   - unreadable       → (FDClassUnknown, 0)

2. handleIO for FDClassUnknown FDs:
   - If resolvedClass is already set → use it directly (fast path).
   - Otherwise probe the symlink:
     - Socket + inode in cache → resolve, set resolvedClass, route to Net.
     - Socket + inode not in cache yet → trigger refresh, retry next event
       (no permanent give-up).
     - File / pipe → set resolvedClass immediately, route to File.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3
TUI:
- Right panel now shows block-char sparklines (▁▂▃▄▅▆▇█) for Net↓, Net↑,
  and (when --file-io) File R / File W above the connection list.
- Connection class column now uses the resolved class for pre-BPF FDs
  (falls back to FDClassUnknown only when still unresolved).

Web:
- Chart now has four datasets: Net↓, Net↑, File R, File W with distinct
  colors matching the process table headers.
- New GET /api/processes/{pid} endpoint returns per-process bandwidth history
  (net_rx_history, net_tx_history, file_rd_history, file_wr_history as
  arrays of bytes/s values). Called on process selection to pre-populate
  the chart with up to 60 s of history before live ticks begin appending.
- Tooltip enabled (index mode) so hovering shows all four series at once.
- Connection badge class display also uses resolved class for pre-BPF FDs.

https://claude.ai/code/session_019uf2n6Y1AJffXGrc1bXhY3