Why
find + du + stat + xargs is the prehistoric pipeline for
filesystem queries. SQL would express the same things much better
and let you JOIN against other dbfy sources (incidents, audit,
even another filesystem mount).
Examples this enables:
-- Top 10 biggest files under /var
SELECT full_path, size FROM fs.entries
WHERE parent_dir LIKE '/var/%' AND kind = 'file'
ORDER BY size DESC LIMIT 10;
-- Per-directory disk usage (du in SQL, recursively summed)
SELECT parent_dir, sum(size) AS bytes_used, count(*) AS files
FROM fs.entries
WHERE kind = 'file'
GROUP BY parent_dir ORDER BY bytes_used DESC LIMIT 20;
-- Stale logs >90 days
SELECT full_path, mtime FROM fs.entries
WHERE kind = 'file' AND extension = 'log'
AND mtime < now() - INTERVAL '90 days';
-- Broken symlinks (self-join)
SELECT full_path, link_target FROM fs.entries
WHERE kind = 'symlink'
AND link_target NOT IN (SELECT full_path FROM fs.entries);
-- Cross-source: which files changed the same day as an incident?
SELECT f.full_path, f.mtime, i.id AS incident
FROM fs.entries f
JOIN incidents.events i ON DATE(f.mtime) = DATE(i.occurred_at)
WHERE f.parent_dir LIKE '/var/log/%';
Design
New crate dbfy-provider-fs. Streaming walkdir-based, yields
RecordBatches as the tree is traversed — never materialises the
whole tree. The entire surface is a single entries table with a
kind column; users GROUP BY kind or filter WHERE kind = 'file'
for the typical case.
type: fs
roots:
- path: "/var/log"
follow_symlinks: false
max_depth: 10 # optional cap
tables:
entries:
columns:
- { name: full_path, type: string }
- { name: parent_dir, type: string }
- { name: name, type: string }
- { name: kind, type: string } # file | dir | symlink | other
- { name: size, type: int64 }
- { name: mtime, type: timestamp }
- { name: atime, type: timestamp }
- { name: ctime, type: timestamp }
- { name: mode, type: int64 } # POSIX octal (e.g. 0o644 = 420)
- { name: uid, type: int64 }
- { name: gid, type: int64 }
- { name: depth, type: int64 } # 0 = root
- { name: extension, type: string } # "log" for "x.log", "" for dirs
- { name: link_target, type: string } # NULL unless kind = 'symlink'
Table-function form too: fs('path')
In addition to the YAML declaration, register a DataFusion UDTF so
ad-hoc queries don't need a config file:
SELECT full_path, size FROM fs('/var/log')
WHERE size > 1024 * 1024 * 100;
Same registration pattern the DuckDB extension already uses for
dbfy_rest() / dbfy_rows_file(). In the standalone dbfy CLI this
makes it usable as a find replacement without writing YAML.
Pushdown
| Predicate |
Pushdown |
parent_dir = '/x' |
restrict walk to that subtree |
parent_dir LIKE '/x/%' |
same — bound the walk |
kind = 'directory' |
evaluated against d_type from readdir (no stat() call needed — 1000× speedup on big trees) |
depth <= N |
bounded BFS |
extension = 'log' |
filter at readdir, before any stat |
LIMIT N |
early break |
size > X, mtime > Y |
post-stat in-process (cheap once we've stat'd) |
Cross-platform notes
- Linux/macOS: full schema works (libc stat).
- Windows: skip
uid/gid/mode-as-POSIX (return 0); keep
the rest (size, mtime, kind, depth, ...). Document the gap.
Effort estimate
| Step |
Effort |
dbfy-provider-fs scaffold + walkdir streaming |
1 d |
| Pushdown for parent_dir / path / kind / depth |
0.5 d |
fs() UDTF registration in DuckDB extension + standalone |
0.5 d |
| 6 integration tests + 1 showcase ("disk-hogs") |
0.5 d |
| Total v1 |
~2.5 d |
Acceptance for v1
- The schema above populated for any directory, recursively
kind = 'directory' pushdown skips stat() calls (verifiable via
strace count in test)
parent_dir = X pushdown bounds the walk (verifiable via
RecordBatch count = subtree-only)
- Broken-symlink self-join query produces a correct result
- 4 integration tests on a
tempfile::tempdir() fixture covering
file/dir/symlink/permissions
- README: a "disk usage in SQL" snippet
- Showcase: top-N hogs, stale logs, broken symlinks
Targeted milestone: v0.5 (smaller and broadly useful — ahead of the
code source in #15).
Why
find+du+stat+xargsis the prehistoric pipeline forfilesystem queries. SQL would express the same things much better
and let you JOIN against other dbfy sources (incidents, audit,
even another filesystem mount).
Examples this enables:
Design
New crate
dbfy-provider-fs. Streamingwalkdir-based, yieldsRecordBatches as the tree is traversed — never materialises the
whole tree. The entire surface is a single
entriestable with akindcolumn; usersGROUP BY kindor filterWHERE kind = 'file'for the typical case.
Table-function form too:
fs('path')In addition to the YAML declaration, register a DataFusion UDTF so
ad-hoc queries don't need a config file:
Same registration pattern the DuckDB extension already uses for
dbfy_rest()/dbfy_rows_file(). In the standalone dbfy CLI thismakes it usable as a
findreplacement without writing YAML.Pushdown
parent_dir = '/x'parent_dir LIKE '/x/%'kind = 'directory'd_typefromreaddir(nostat()call needed — 1000× speedup on big trees)depth <= Nextension = 'log'LIMIT Nsize > X,mtime > YCross-platform notes
uid/gid/mode-as-POSIX (return 0); keepthe rest (size, mtime, kind, depth, ...). Document the gap.
Effort estimate
dbfy-provider-fsscaffold + walkdir streamingfs()UDTF registration in DuckDB extension + standaloneAcceptance for v1
kind = 'directory'pushdown skipsstat()calls (verifiable viastrace count in test)
parent_dir = Xpushdown bounds the walk (verifiable viaRecordBatch count = subtree-only)
tempfile::tempdir()fixture coveringfile/dir/symlink/permissions
Targeted milestone: v0.5 (smaller and broadly useful — ahead of the
code source in #15).