
feat: add StorageAdapter plugin system for third-party storage protocols#1432

Open
kushalbakshi wants to merge 5 commits into datajoint:master from kushalbakshi:feature/storage-adapter-plugin

Conversation

@kushalbakshi
Contributor

Summary

Adds a StorageAdapter entry-point plugin system that allows third-party packages to register new storage protocols without modifying DataJoint internals.

  • New StorageAdapter ABC in src/datajoint/storage_adapter.py with 4 extension points: create_filesystem, validate_spec, full_path, get_url
  • Lazy entry-point discovery via the datajoint.storage entry-point group (mirrors the existing codec plugin pattern in codecs.py)
  • Plugin delegation wired into settings.py (get_store_spec) and storage.py (_create_filesystem, _full_path, get_url)
  • Zero changes to built-in protocol code paths (file, s3, gcs, azure)
  • Clear error message when an unknown protocol has no adapter installed
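A minimal sketch of what such an ABC could look like, with default implementations for the non-abstract extension points (illustrative only — the real class lives in src/datajoint/storage_adapter.py and its exact signatures may differ):

```python
from abc import ABC, abstractmethod


class StorageAdapter(ABC):
    # Subclasses override these; the key tuples drive default validation.
    protocol: str = ""
    required_keys: tuple = ("protocol",)
    allowed_keys: tuple = ("protocol",)

    @abstractmethod
    def create_filesystem(self, spec: dict):
        """Return an fsspec-compatible filesystem for this store spec."""

    def validate_spec(self, spec: dict) -> None:
        missing = [k for k in self.required_keys if k not in spec]
        if missing:
            raise ValueError(f"store spec missing keys: {', '.join(missing)}")

    def full_path(self, spec: dict, relpath: str) -> str:
        # Generic default: location/relpath with POSIX separators.
        return f"{spec.get('location', '').rstrip('/')}/{relpath}".lstrip("/")

    def get_url(self, spec: dict, relpath: str) -> str:
        return f"{spec['protocol']}://{self.full_path(spec, relpath)}"
```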

Motivation

DataJoint's external storage layer currently supports four hardcoded protocols. Extending it with custom fsspec-backed storage requires monkey-patching private methods — fragile, non-composable, and breaks on upstream refactors. This PR adds a clean plugin system so third-party packages can register storage protocols via standard Python entry points.

How third-party packages use it

```python
from datajoint.storage_adapter import StorageAdapter

class MyStorageAdapter(StorageAdapter):
    protocol = "myprotocol"
    required_keys = ("protocol", "endpoint", "bucket")
    allowed_keys = required_keys + ("token",)

    def create_filesystem(self, spec):
        return MyFsspecFilesystem(**spec)
```

In their pyproject.toml:

```toml
[project.entry-points."datajoint.storage"]
myprotocol = "my_package:MyStorageAdapter"
```

No import needed — DataJoint auto-discovers the adapter when it encounters the protocol in dj.config.stores.

Test plan

  • 19 new unit tests covering registry, validation defaults, delegation, and error handling
  • All existing unit tests pass unchanged (zero regressions)

🤖 Generated with Claude Code

kushalbakshi and others added 5 commits April 15, 2026 16:35
When a store config uses an unrecognised protocol, settings.py now
queries the adapter registry before raising an error. Registered
plugin adapters are validated and default keys are applied; unknown
protocols surface a clear message directing users to install a plugin
package. Includes new test class TestGetStoreSpecPluginDelegation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three else-branches in StorageBackend (_create_filesystem,
_full_path, get_url) now query the adapter registry before falling
back to the built-in behaviour. Registered plugin adapters are called
for filesystem construction, path composition, and URL generation;
unknown protocols still raise DataJointError. Includes new test class
TestStorageBackendPluginDelegation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move imports to top of file and apply ruff-format to satisfy CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kushalbakshi kushalbakshi force-pushed the feature/storage-adapter-plugin branch from 59c47de to cd93c19 Compare April 15, 2026 21:39
@dimitri-yatsenko
Member

Review — two simpler alternatives to consider

Good work, Kushal. The need is real: third-party packages should be able to register new storage backends without monkey-patching. Before we commit to the ABC approach, I want to explore whether we can achieve the same result with less machinery — or with machinery that also improves the existing code.

Background: what StorageBackend actually does on top of fsspec

Looking at `storage.py`, there are five protocol-specific concerns:

  1. Config validation — different required keys per protocol
  2. Constructor arg mapping — translating DJ config → fsspec constructor params (e.g., S3 endpoint → `client_kwargs`)
  3. Path construction — cloud protocols prepend bucket; file:// uses pathlib
  4. URL generation — file:// needs absolute path resolution; clouds are `protocol://path`
  5. Atomic writes — file:// uses `safe_copy`/`safe_write`; clouds delegate to fsspec

~80% of I/O already delegates to fsspec directly. The protocol-specific code is concentrated in items 1-4, spread across ~26 `if self.protocol == "..."` branches.

fsspec already has its own protocol discovery — `fsspec.filesystem("s3")` auto-discovers S3FileSystem. So the question: what does DataJoint need to add on top of fsspec's own plugin system?


Alternative A — Generic fsspec passthrough

The lightest option: make `StorageBackend` accept any fsspec-supported protocol by default. No new ABC, no entry points, no registry.

For unknown protocols:

  • Pass non-DataJoint config keys directly to `fsspec.filesystem(protocol, **kwargs)`
  • Use generic defaults for path (`location/relpath`) and URL (`protocol://path`)

~30 lines of changes. No new module. Works for any fsspec filesystem immediately. Downside: no config validation or custom path/URL logic for unknown protocols.
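The generic fallbacks could look like this (names and the key set are illustrative assumptions, not the actual storage.py API):

```python
from posixpath import join as pjoin

# Hypothetical set of DataJoint-reserved config keys; everything else
# would be forwarded to fsspec.filesystem(protocol, **kwargs) untouched.
DJ_KEYS = {"protocol", "location", "stage", "subfolding"}


def generic_fs_kwargs(spec: dict) -> dict:
    """Forward every non-DataJoint config key to the fsspec constructor."""
    return {k: v for k, v in spec.items() if k not in DJ_KEYS}


def generic_full_path(spec: dict, relpath: str) -> str:
    """Generic default: location/relpath."""
    return pjoin(spec.get("location", ""), relpath)


def generic_url(spec: dict, relpath: str) -> str:
    """Generic default: protocol://path."""
    return f"{spec['protocol']}://{generic_full_path(spec, relpath)}"
```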


Alternative B — Unified adapter model for ALL protocols (preferred)

Route all backends through adapters — including the existing four. The 26 hardcoded `if self.protocol == "..."` branches become adapter implementations:

```python
class S3Adapter(StorageAdapter):
    protocol = "s3"

    def create_filesystem(self, spec):
        endpoint = spec["endpoint"]
        if not endpoint.startswith(("http://", "https://")):
            endpoint_url = f"{'https' if spec.get('secure') else 'http'}://{endpoint}"
        else:
            endpoint_url = endpoint
        return fsspec.filesystem("s3", client_kwargs={"endpoint_url": endpoint_url}, ...)

    def validate_spec(self, spec):
        missing = [k for k in ("endpoint", "bucket", "access_key", "secret_key") if k not in spec]
        if missing:
            raise DataJointError(f"S3 store missing: {', '.join(missing)}")
```

Built-in adapters (File, S3, GCS, Azure) registered at import time. Third-party adapters discovered via entry points. `StorageBackend` becomes a thin dispatcher — one code path for all protocols.

This eliminates the 26 hardcoded protocol branches, makes third-party backends first-class citizens alongside built-ins, and gives us the right abstraction. The user-facing API (`dj.config.stores`) doesn't change at all — this is purely internal refactoring. We test against the existing integration test suite and release within the 2.2.x series when confident.
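The "thin dispatcher" could be as small as a registry dict plus one lookup (a sketch with illustrative names, not the actual StorageBackend code; built-ins register at import time, entry-point discovery would slot in as a fallback inside `get_adapter`):

```python
class StorageAdapter:
    """Minimal stand-in for the adapter base class."""
    protocol: str = ""

    def create_filesystem(self, spec):
        raise NotImplementedError


_ADAPTERS: dict = {}


def register(adapter_cls):
    """Class decorator: built-in adapters register themselves at import time."""
    _ADAPTERS[adapter_cls.protocol] = adapter_cls()
    return adapter_cls


def get_adapter(protocol: str) -> StorageAdapter:
    """Single dispatch point replacing per-protocol if/else branches."""
    try:
        return _ADAPTERS[protocol]
    except KeyError:
        raise KeyError(
            f"No storage adapter for {protocol!r}; "
            "install a plugin package that provides it."
        ) from None


@register
class FileAdapter(StorageAdapter):
    protocol = "file"

    def create_filesystem(self, spec):
        # Real code would return fsspec's LocalFileSystem here.
        return ("local-fs", spec.get("location"))
```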


Assessment

Your current PR has the right instincts — ABC, entry points, lazy discovery. But it creates a two-tier system: hardcoded paths for built-in protocols, adapter path for everything else. That means maintaining both code paths going forward.

I'd prefer Alternative B: if we're introducing the adapter abstraction, route everything through it. The existing integration tests validate that the built-in adapters behave correctly. Since the user API doesn't change, this is safe to land in 2.2.x with thorough testing.

@kushalbakshi — would you be up for reworking this PR to route the four existing protocols through adapters as well? The `StorageAdapter` ABC and entry-point discovery from your PR are the right foundation; the additional work is extracting the existing protocol-specific logic from `StorageBackend` into adapter subclasses.


@dimitri-yatsenko dimitri-yatsenko left a comment


See my comment above — recommending a unified adapter model that routes all four built-in protocols through the same adapter system, not just new ones.

@kushalbakshi
Contributor Author

Thanks for the detailed review, Dimitri. I agree the two-tier pattern is a real architectural smell and that routing all backends through a unified interface is the right end state. Happy to own that migration — but I want to push back on bundling it into this PR, for three reasons.

1. Alt A (pure fsspec passthrough) isn't a substitute for the adapter layer. fsspec's entry-point discovery (fsspec.specs group) does work, but StorageBackend adds value on top: config-key conventions (access_key/endpoint → fsspec's key/endpoint_url), path/URL construction semantics, and the atomic-write guarantees from safe_write/safe_copy for the file backend. Pure passthrough regresses on all three. Flagging explicitly so it's on record.

2. The StorageAdapter interface as drafted eliminates only about a third of the protocol branches — the remaining ones are the architecturally hard ones. The ABC (both the one in this PR and the one in your sketch — protocol, create_filesystem, validate_spec, with full_path/get_url added here) collapses the if self.protocol == "..." dispatch in _create_filesystem (storage.py:334–371), _full_path (373–405), get_url (407–452), and _validate_spec (314–324) — 12 of the ~30 protocol-conditional branches in storage.py. The other ~19 live inside the I/O methods themselves and are not addressable via the current interface:

| Method | Line | What branches |
| --- | --- | --- |
| put_file | 470 | safe_copy (file) vs fsspec.put_file (clouds) |
| put_buffer | 518 | safe_write (file) vs fsspec.pipe_file (clouds) |
| get_file | 497 | pathlib (file) vs fsspec.get_file (clouds) |
| get_buffer | 549 | open() (file) vs fsspec.cat_file (clouds) |
| remove | 604 | Path.unlink (file) vs fsspec.rm (clouds) |
| open | 651 | pre-write mkdir (file only) |
| put_folder | 697 | shutil.copytree vs fsspec.put recursive |
| remove_folder | 737 | shutil.rmtree vs fsspec.rm recursive |
| copy_from_url | 796 | mkdir + copy vs fsspec.pipe |
| copy_folder_from_url | 847 | mkdir vs no-op |

These branches aren't incidental. safe_write and safe_copy (utils.py:131, 153) write to a .saving/.copying temp file and then rename — an atomicity contract (documented in their docstrings) that fsspec.LocalFileSystem.put_file / pipe_file don't provide by default. The shutil-based recursive operations likewise have semantics that don't map 1:1 onto fsspec.put / fsspec.rm. These semantics are load-bearing for how DataJoint users rely on external storage — partial-write corruption or surprising recursive-op behavior would be serious regressions for existing pipelines. There are multiple reasonable ways to unify the interface, each with materially different risk profiles, and that design discussion belongs in a dedicated issue rather than in this PR's review cycle.
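The write-to-temp-then-rename pattern described above can be sketched as follows (illustrative only — the real safe_write lives in utils.py and its signature may differ):

```python
import os
from pathlib import Path


def safe_write(path: Path, data: bytes) -> None:
    """Write data to path atomically: stage into a '.saving' sibling,
    then rename into place. os.replace is atomic on POSIX within one
    filesystem, so readers see either the old file or the complete new
    one -- never a partial write."""
    tmp = path.with_name(path.name + ".saving")
    tmp.write_bytes(data)
    os.replace(tmp, path)
```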

3. Unit test coverage on the built-in protocols is thin. The only dedicated unit test touching StorageBackend per-protocol logic is test_settings.py::test_get_store_spec_file_protocol; the rest sits in integration tests against live backends. Refactoring all four built-ins without a per-protocol unit-test harness in place is a real regression risk for existing users — especially for the atomic-write and recursive-op semantics called out in point 2, which integration tests don't typically stress.

Proposal. Land this PR as-is — it's strictly additive (else: extensions in _create_filesystem, _full_path, get_url, and get_store_spec; zero change to any of the four built-in code paths; no regression surface) — and open a tracked follow-up issue for the unified-adapter migration with a concrete target date. That follow-up would take the design questions around the atomic-write and recursive-op semantics seriously, add dedicated per-protocol unit tests before any refactor, and migrate one protocol at a time (file → s3 → gcs → azure) in small reviewable PRs.

I'm happy to own that workstream. Given the scope, I'd want to sequence it against other commitments rather than execute it immediately inside this change — doing it now at the cost of higher-priority items isn't the right trade-off, and bundling it here would force the design decisions under the wrong time pressure. If you're good with the staging, I'll open the follow-up issue today.
