Skip to content

Source: filesystem as a SQL surface (fs()) #16

@frhack

Description

@frhack

Why

find + du + stat + xargs is the prehistoric pipeline for
filesystem queries. SQL would express the same things much better
and let you JOIN against other dbfy sources (incidents, audit,
even another filesystem mount).

Examples this enables:

-- Top 10 biggest files under /var
SELECT full_path, size FROM fs.entries
 WHERE parent_dir LIKE '/var/%' AND kind = 'file'
 ORDER BY size DESC LIMIT 10;

-- Per-directory disk usage (du in SQL, recursively summed)
SELECT parent_dir, sum(size) AS bytes_used, count(*) AS files
  FROM fs.entries
 WHERE kind = 'file'
 GROUP BY parent_dir ORDER BY bytes_used DESC LIMIT 20;

-- Stale logs >90 days
SELECT full_path, mtime FROM fs.entries
 WHERE kind = 'file' AND extension = 'log'
   AND mtime < now() - INTERVAL '90 days';

-- Broken symlinks (self-join)
SELECT full_path, link_target FROM fs.entries
 WHERE kind = 'symlink'
   AND link_target NOT IN (SELECT full_path FROM fs.entries);

-- Cross-source: which files changed the same day as an incident?
SELECT f.full_path, f.mtime, i.id AS incident
  FROM fs.entries f
  JOIN incidents.events i ON DATE(f.mtime) = DATE(i.occurred_at)
 WHERE f.parent_dir LIKE '/var/log/%';

Design

New crate dbfy-provider-fs. Streaming walkdir-based, yields
RecordBatches as the tree is traversed — never materialises the
whole tree. The entire surface is a single entries table with a
kind column; users GROUP BY kind or filter WHERE kind = 'file'
for the typical case.

type: fs
roots:
  - path: "/var/log"
    follow_symlinks: false
    max_depth: 10            # optional cap

tables:
  entries:
    columns:
      - { name: full_path,    type: string }
      - { name: parent_dir,   type: string }
      - { name: name,         type: string }
      - { name: kind,         type: string }       # file | dir | symlink | other
      - { name: size,         type: int64 }
      - { name: mtime,        type: timestamp }
      - { name: atime,        type: timestamp }
      - { name: ctime,        type: timestamp }
      - { name: mode,         type: int64 }        # POSIX octal (e.g. 0o644 = 420)
      - { name: uid,          type: int64 }
      - { name: gid,          type: int64 }
      - { name: depth,        type: int64 }        # 0 = root
      - { name: extension,    type: string }       # "log" for "x.log", "" for dirs
      - { name: link_target,  type: string }       # NULL unless kind = 'symlink'

Table-function form too: fs('path')

In addition to the YAML declaration, register a DataFusion UDTF so
ad-hoc queries don't need a config file:

SELECT full_path, size FROM fs('/var/log')
 WHERE size > 1024 * 1024 * 100;

Same registration pattern the DuckDB extension already uses for
dbfy_rest() / dbfy_rows_file(). In the standalone dbfy CLI this
makes it usable as a find replacement without writing YAML.

Pushdown

Predicate Pushdown
parent_dir = '/x' restrict walk to that subtree
parent_dir LIKE '/x/%' same — bound the walk
kind = 'directory' evaluated against d_type from readdir (no stat() call needed — 1000× speedup on big trees)
depth <= N bounded BFS
extension = 'log' filter at readdir, before any stat
LIMIT N early break
size > X, mtime > Y post-stat in-process (cheap once we've stat'd)

Cross-platform notes

  • Linux/macOS: full schema works (libc stat).
  • Windows: skip uid/gid/mode-as-POSIX (return 0); keep
    the rest (size, mtime, kind, depth, ...). Document the gap.

Effort estimate

Step Effort
dbfy-provider-fs scaffold + walkdir streaming 1 d
Pushdown for parent_dir / path / kind / depth 0.5 d
fs() UDTF registration in DuckDB extension + standalone 0.5 d
6 integration tests + 1 showcase ("disk-hogs") 0.5 d
Total v1 ~2.5 d

Acceptance for v1

  • The schema above populated for any directory, recursively
  • kind = 'directory' pushdown skips stat() calls (verifiable via
    strace count in test)
  • parent_dir = X pushdown bounds the walk (verifiable via
    RecordBatch count = subtree-only)
  • Broken-symlink self-join query produces a correct result
  • 4 integration tests on a tempfile::tempdir() fixture covering
    file/dir/symlink/permissions
  • README: a "disk usage in SQL" snippet
  • Showcase: top-N hogs, stale logs, broken symlinks

Targeted milestone: v0.5 (smaller and broadly useful — ahead of the
code source in #15).

Metadata

Metadata

Assignees

No one assigned

    Labels

    source-requestRequest for a new source kind to be addedtier-1Wow factor — opens a new mark

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions