Source: filesystem as a SQL surface (fs())

## Why

`find` + `du` + `stat` + `xargs` is the prehistoric pipeline for
filesystem queries. SQL would express the same things much better
and let you JOIN against other dbfy sources (incidents, audit,
even another filesystem mount).

Examples this enables:

```sql
-- Top 10 biggest files under /var
SELECT full_path, size FROM fs.entries
 WHERE parent_dir LIKE '/var/%' AND kind = 'file'
 ORDER BY size DESC LIMIT 10;

-- Per-directory disk usage (du in SQL, recursively summed)
SELECT parent_dir, sum(size) AS bytes_used, count(*) AS files
  FROM fs.entries
 WHERE kind = 'file'
 GROUP BY parent_dir ORDER BY bytes_used DESC LIMIT 20;

-- Stale logs >90 days
SELECT full_path, mtime FROM fs.entries
 WHERE kind = 'file' AND extension = 'log'
   AND mtime < now() - INTERVAL '90 days';

-- Broken symlinks (self-join)
SELECT full_path, link_target FROM fs.entries
 WHERE kind = 'symlink'
   AND link_target NOT IN (SELECT full_path FROM fs.entries);

-- Cross-source: which files changed the same day as an incident?
SELECT f.full_path, f.mtime, i.id AS incident
  FROM fs.entries f
  JOIN incidents.events i ON DATE(f.mtime) = DATE(i.occurred_at)
 WHERE f.parent_dir LIKE '/var/log/%';
```

## Design

New crate `dbfy-provider-fs`. **Streaming `walkdir`-based**, yields
RecordBatches as the tree is traversed — never materialises the
whole tree. The entire surface is a single `entries` table with a
`kind` column; users `GROUP BY kind` or filter `WHERE kind = 'file'`
for the typical case.

```yaml
type: fs
roots:
  - path: "/var/log"
    follow_symlinks: false
    max_depth: 10            # optional cap

tables:
  entries:
    columns:
      - { name: full_path,    type: string }
      - { name: parent_dir,   type: string }
      - { name: name,         type: string }
      - { name: kind,         type: string }       # file | dir | symlink | other
      - { name: size,         type: int64 }
      - { name: mtime,        type: timestamp }
      - { name: atime,        type: timestamp }
      - { name: ctime,        type: timestamp }
      - { name: mode,         type: int64 }        # POSIX octal (e.g. 0o644 = 420)
      - { name: uid,          type: int64 }
      - { name: gid,          type: int64 }
      - { name: depth,        type: int64 }        # 0 = root
      - { name: extension,    type: string }       # "log" for "x.log", "" for dirs
      - { name: link_target,  type: string }       # NULL unless kind = 'symlink'
```

## Table-function form too: `fs('path')`

In addition to the YAML declaration, register a DataFusion UDTF so
ad-hoc queries don't need a config file:

```sql
SELECT full_path, size FROM fs('/var/log')
 WHERE size > 1024 * 1024 * 100;
```

Same registration pattern the DuckDB extension already uses for
`dbfy_rest()` / `dbfy_rows_file()`. In the standalone dbfy CLI this
makes it usable as a `find` replacement without writing YAML.

## Pushdown

| Predicate | Pushdown |
|---|---|
| `parent_dir = '/x'` | restrict walk to that subtree |
| `parent_dir LIKE '/x/%'` | same — bound the walk |
| `kind = 'directory'` | evaluated against `d_type` from `readdir` (no `stat()` call needed — 1000× speedup on big trees) |
| `depth <= N` | bounded BFS |
| `extension = 'log'` | filter at readdir, before any stat |
| `LIMIT N` | early break |
| `size > X`, `mtime > Y` | post-stat in-process (cheap once we've stat'd) |

## Cross-platform notes

- **Linux/macOS**: full schema works (libc stat).
- **Windows**: skip `uid`/`gid`/`mode`-as-POSIX (return 0); keep
  the rest (size, mtime, kind, depth, ...). Document the gap.

## Effort estimate

| Step | Effort |
|---|---|
| `dbfy-provider-fs` scaffold + walkdir streaming | 1 d |
| Pushdown for parent_dir / path / kind / depth | 0.5 d |
| `fs()` UDTF registration in DuckDB extension + standalone | 0.5 d |
| 6 integration tests + 1 showcase ("disk-hogs") | 0.5 d |
| **Total v1** | **~2.5 d** |

## Acceptance for v1

- The schema above populated for any directory, recursively
- `kind = 'directory'` pushdown skips `stat()` calls (verifiable via
  strace count in test)
- `parent_dir = X` pushdown bounds the walk (verifiable via
  RecordBatch count = subtree-only)
- Broken-symlink self-join query produces a correct result
- 4 integration tests on a `tempfile::tempdir()` fixture covering
  file/dir/symlink/permissions
- README: a "disk usage in SQL" snippet
- Showcase: top-N hogs, stale logs, broken symlinks

Targeted milestone: **v0.5** (smaller and broadly useful — ahead of the
code source in #15).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Source: filesystem as a SQL surface (fs()) #16

Why

Design

Table-function form too: `fs('path')`

Pushdown

Cross-platform notes

Effort estimate

Acceptance for v1

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Predicate	Pushdown
`parent_dir = '/x'`	restrict walk to that subtree
`parent_dir LIKE '/x/%'`	same — bound the walk
`kind = 'directory'`	evaluated against `d_type` from `readdir` (no `stat()` call needed — 1000× speedup on big trees)
`depth <= N`	bounded BFS
`extension = 'log'`	filter at readdir, before any stat
`LIMIT N`	early break
`size > X`, `mtime > Y`	post-stat in-process (cheap once we've stat'd)

Step	Effort
`dbfy-provider-fs` scaffold + walkdir streaming	1 d
Pushdown for parent_dir / path / kind / depth	0.5 d
`fs()` UDTF registration in DuckDB extension + standalone	0.5 d
6 integration tests + 1 showcase ("disk-hogs")	0.5 d
Total v1	~2.5 d

Source: filesystem as a SQL surface (fs()) #16

Description

Why

Design

Table-function form too: fs('path')

Pushdown

Cross-platform notes

Effort estimate

Acceptance for v1

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Table-function form too: `fs('path')`