
feat: add StorageAdapter plugin system for third-party storage protocols#1432

Open
kushalbakshi wants to merge 5 commits into datajoint:master from kushalbakshi:feature/storage-adapter-plugin

Conversation

@kushalbakshi
Contributor

Summary

Adds a StorageAdapter entry-point plugin system that allows third-party packages to register new storage protocols without modifying DataJoint internals.

  • New StorageAdapter ABC in src/datajoint/storage_adapter.py with 4 extension points: create_filesystem, validate_spec, full_path, get_url
  • Lazy entry-point discovery via the datajoint.storage entry-point group (mirrors the existing codec plugin pattern in codecs.py)
  • Plugin delegation wired into settings.py (get_store_spec) and storage.py (_create_filesystem, _full_path, get_url)
  • Zero changes to built-in protocol code paths (file, s3, gcs, azure)
  • Clear error message when an unknown protocol has no adapter installed
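A minimal sketch of what such an ABC could look like, with default implementations for the non-abstract extension points (illustrative only — the real class lives in src/datajoint/storage_adapter.py and its exact signatures may differ):

```python
from abc import ABC, abstractmethod


class StorageAdapter(ABC):
    # Subclasses override these; the key tuples drive default validation.
    protocol: str = ""
    required_keys: tuple = ("protocol",)
    allowed_keys: tuple = ("protocol",)

    @abstractmethod
    def create_filesystem(self, spec: dict):
        """Return an fsspec-compatible filesystem for this store spec."""

    def validate_spec(self, spec: dict) -> None:
        missing = [k for k in self.required_keys if k not in spec]
        if missing:
            raise ValueError(f"store spec missing keys: {', '.join(missing)}")

    def full_path(self, spec: dict, relpath: str) -> str:
        # Generic default: location/relpath with POSIX separators.
        return f"{spec.get('location', '').rstrip('/')}/{relpath}".lstrip("/")

    def get_url(self, spec: dict, relpath: str) -> str:
        return f"{spec['protocol']}://{self.full_path(spec, relpath)}"
```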

Motivation

DataJoint's external storage layer currently supports four hardcoded protocols. Extending it with custom fsspec-backed storage requires monkey-patching private methods — fragile, non-composable, and breaks on upstream refactors. This PR adds a clean plugin system so third-party packages can register storage protocols via standard Python entry points.

How third-party packages use it

```python
from datajoint.storage_adapter import StorageAdapter

class MyStorageAdapter(StorageAdapter):
    protocol = "myprotocol"
    required_keys = ("protocol", "endpoint", "bucket")
    allowed_keys = required_keys + ("token",)

    def create_filesystem(self, spec):
        return MyFsspecFilesystem(**spec)
```

In their pyproject.toml:

```toml
[project.entry-points."datajoint.storage"]
myprotocol = "my_package:MyStorageAdapter"
```

No import needed — DataJoint auto-discovers the adapter when it encounters the protocol in dj.config.stores.

Test plan

  • 19 new unit tests covering registry, validation defaults, delegation, and error handling
  • All existing unit tests pass unchanged (zero regressions)

🤖 Generated with Claude Code

kushalbakshi and others added 5 commits April 15, 2026 16:35
When a store config uses an unrecognised protocol, settings.py now
queries the adapter registry before raising an error. Registered
plugin adapters are validated and default keys are applied; unknown
protocols surface a clear message directing users to install a plugin
package. Includes new test class TestGetStoreSpecPluginDelegation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three else-branches in StorageBackend (_create_filesystem,
_full_path, get_url) now query the adapter registry before falling
back to the built-in behaviour. Registered plugin adapters are called
for filesystem construction, path composition, and URL generation;
unknown protocols still raise DataJointError. Includes new test class
TestStorageBackendPluginDelegation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move imports to top of file and apply ruff-format to satisfy CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kushalbakshi kushalbakshi force-pushed the feature/storage-adapter-plugin branch from 59c47de to cd93c19 Compare April 15, 2026 21:39
@dimitri-yatsenko
Member

Review — two simpler alternatives to consider

Good work, Kushal. The need is real: third-party packages should be able to register new storage backends without monkey-patching. Before we commit to the ABC approach, I want to explore whether we can achieve the same result with less machinery — or with machinery that also improves the existing code.

Background: what StorageBackend actually does on top of fsspec

Looking at `storage.py`, there are five protocol-specific concerns:

  1. Config validation — different required keys per protocol
  2. Constructor arg mapping — translating DJ config → fsspec constructor params (e.g., S3 endpoint → `client_kwargs`)
  3. Path construction — cloud protocols prepend bucket; file:// uses pathlib
  4. URL generation — file:// needs absolute path resolution; clouds are `protocol://path`
  5. Atomic writes — file:// uses `safe_copy`/`safe_write`; clouds delegate to fsspec

~80% of I/O already delegates to fsspec directly. The protocol-specific code is concentrated in items 1-4, spread across ~26 `if self.protocol == "..."` branches.

fsspec already has its own protocol discovery — `fsspec.filesystem("s3")` auto-discovers S3FileSystem. So the question: what does DataJoint need to add on top of fsspec's own plugin system?


Alternative A — Generic fsspec passthrough

The lightest option: make `StorageBackend` accept any fsspec-supported protocol by default. No new ABC, no entry points, no registry.

For unknown protocols:

  • Pass non-DataJoint config keys directly to `fsspec.filesystem(protocol, **kwargs)`
  • Use generic defaults for path (`location/relpath`) and URL (`protocol://path`)

~30 lines of changes. No new module. Works for any fsspec filesystem immediately. Downside: no config validation or custom path/URL logic for unknown protocols.
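The generic fallbacks could look like this (names and the key set are illustrative assumptions, not the actual storage.py API):

```python
from posixpath import join as pjoin

# Hypothetical set of DataJoint-reserved config keys; everything else
# would be forwarded to fsspec.filesystem(protocol, **kwargs) untouched.
DJ_KEYS = {"protocol", "location", "stage", "subfolding"}


def generic_fs_kwargs(spec: dict) -> dict:
    """Forward every non-DataJoint config key to the fsspec constructor."""
    return {k: v for k, v in spec.items() if k not in DJ_KEYS}


def generic_full_path(spec: dict, relpath: str) -> str:
    """Generic default: location/relpath."""
    return pjoin(spec.get("location", ""), relpath)


def generic_url(spec: dict, relpath: str) -> str:
    """Generic default: protocol://path."""
    return f"{spec['protocol']}://{generic_full_path(spec, relpath)}"
```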


Alternative B — Unified adapter model for ALL protocols (preferred)

Route all backends through adapters — including the existing four. The 26 hardcoded `if self.protocol == "..."` branches become adapter implementations:

```python
class S3Adapter(StorageAdapter):
    protocol = "s3"

    def create_filesystem(self, spec):
        endpoint = spec["endpoint"]
        if not endpoint.startswith(("http://", "https://")):
            endpoint_url = f"{'https' if spec.get('secure') else 'http'}://{endpoint}"
        else:
            endpoint_url = endpoint
        return fsspec.filesystem("s3", client_kwargs={"endpoint_url": endpoint_url}, ...)

    def validate_spec(self, spec):
        missing = [k for k in ("endpoint", "bucket", "access_key", "secret_key") if k not in spec]
        if missing:
            raise DataJointError(f"S3 store missing: {', '.join(missing)}")
```

Built-in adapters (File, S3, GCS, Azure) registered at import time. Third-party adapters discovered via entry points. `StorageBackend` becomes a thin dispatcher — one code path for all protocols.

This eliminates the 26 hardcoded protocol branches, makes third-party backends first-class citizens alongside built-ins, and gives us the right abstraction. The user-facing API (`dj.config.stores`) doesn't change at all — this is purely internal refactoring. We test against the existing integration test suite and release within the 2.2.x series when confident.
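The "thin dispatcher" could be as small as a registry dict plus one lookup (a sketch with illustrative names, not the actual StorageBackend code; built-ins register at import time, entry-point discovery would slot in as a fallback inside `get_adapter`):

```python
class StorageAdapter:
    """Minimal stand-in for the adapter base class."""
    protocol: str = ""

    def create_filesystem(self, spec):
        raise NotImplementedError


_ADAPTERS: dict = {}


def register(adapter_cls):
    """Class decorator: built-in adapters register themselves at import time."""
    _ADAPTERS[adapter_cls.protocol] = adapter_cls()
    return adapter_cls


def get_adapter(protocol: str) -> StorageAdapter:
    """Single dispatch point replacing per-protocol if/else branches."""
    try:
        return _ADAPTERS[protocol]
    except KeyError:
        raise KeyError(
            f"No storage adapter for {protocol!r}; "
            "install a plugin package that provides it."
        ) from None


@register
class FileAdapter(StorageAdapter):
    protocol = "file"

    def create_filesystem(self, spec):
        # Real code would return fsspec's LocalFileSystem here.
        return ("local-fs", spec.get("location"))
```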


Assessment

Your current PR has the right instincts — ABC, entry points, lazy discovery. But it creates a two-tier system: hardcoded paths for built-in protocols, adapter path for everything else. That means maintaining both code paths going forward.

I'd prefer Alternative B: if we're introducing the adapter abstraction, route everything through it. The existing integration tests validate that the built-in adapters behave correctly. Since the user API doesn't change, this is safe to land in 2.2.x with thorough testing.

@kushalbakshi — would you be up for reworking this PR to route the four existing protocols through adapters as well? The `StorageAdapter` ABC and entry-point discovery from your PR are the right foundation; the additional work is extracting the existing protocol-specific logic from `StorageBackend` into adapter subclasses.


@dimitri-yatsenko dimitri-yatsenko left a comment


See my comment above — recommending a unified adapter model that routes all four built-in protocols through the same adapter system, not just new ones.

@kushalbakshi
Contributor Author

Thanks for the detailed review, Dimitri. I agree the two-tier pattern is a real architectural smell and that routing all backends through a unified interface is the right end state. Happy to own that migration — but I want to push back on bundling it into this PR, for three reasons.

1. Alt A (pure fsspec passthrough) isn't a substitute for the adapter layer. fsspec's entry-point discovery (fsspec.specs group) does work, but StorageBackend adds value on top: config-key conventions (access_key/endpoint → fsspec's key/endpoint_url), path/URL construction semantics, and the atomic-write guarantees from safe_write/safe_copy for the file backend. Pure passthrough regresses on all three. Flagging explicitly so it's on record.

2. The StorageAdapter interface as drafted eliminates only about a third of the protocol branches — the remaining ones are the architecturally hard ones. The ABC (both the one in this PR and the one in your sketch — protocol, create_filesystem, validate_spec, with full_path/get_url added here) collapses the if self.protocol == "..." dispatch in _create_filesystem (storage.py:334–371), _full_path (373–405), get_url (407–452), and _validate_spec (314–324) — 12 of the ~30 protocol-conditional branches in storage.py. The other ~19 live inside the I/O methods themselves and are not addressable via the current interface:

| Method | Line | What branches |
| --- | --- | --- |
| put_file | 470 | safe_copy (file) vs fsspec.put_file (clouds) |
| put_buffer | 518 | safe_write (file) vs fsspec.pipe_file (clouds) |
| get_file | 497 | pathlib (file) vs fsspec.get_file (clouds) |
| get_buffer | 549 | open() (file) vs fsspec.cat_file (clouds) |
| remove | 604 | Path.unlink (file) vs fsspec.rm (clouds) |
| open | 651 | pre-write mkdir (file only) |
| put_folder | 697 | shutil.copytree vs fsspec.put recursive |
| remove_folder | 737 | shutil.rmtree vs fsspec.rm recursive |
| copy_from_url | 796 | mkdir + copy vs fsspec.pipe |
| copy_folder_from_url | 847 | mkdir vs no-op |

These branches aren't incidental. safe_write and safe_copy (utils.py:131, 153) write to a .saving/.copying temp file and then rename — an atomicity contract (documented in their docstrings) that fsspec.LocalFileSystem.put_file / pipe_file don't provide by default. The shutil-based recursive operations likewise have semantics that don't map 1:1 onto fsspec.put / fsspec.rm. These semantics are load-bearing for how DataJoint users rely on external storage — partial-write corruption or surprising recursive-op behavior would be serious regressions for existing pipelines. There are multiple reasonable ways to unify the interface, each with materially different risk profiles, and that design discussion belongs in a dedicated issue rather than in this PR's review cycle.
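The write-to-temp-then-rename pattern described above can be sketched as follows (illustrative only — the real safe_write lives in utils.py and its signature may differ):

```python
import os
from pathlib import Path


def safe_write(path: Path, data: bytes) -> None:
    """Write data to path atomically: stage into a '.saving' sibling,
    then rename into place. os.replace is atomic on POSIX within one
    filesystem, so readers see either the old file or the complete new
    one -- never a partial write."""
    tmp = path.with_name(path.name + ".saving")
    tmp.write_bytes(data)
    os.replace(tmp, path)
```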

3. Unit test coverage on the built-in protocols is thin. The only dedicated unit test touching StorageBackend per-protocol logic is test_settings.py::test_get_store_spec_file_protocol; the rest sits in integration tests against live backends. Refactoring all four built-ins without a per-protocol unit-test harness in place is a real regression risk for existing users — especially for the atomic-write and recursive-op semantics called out in point 2, which integration tests don't typically stress.

Proposal. Land this PR as-is — it's strictly additive (else: extensions in _create_filesystem, _full_path, get_url, and get_store_spec; zero change to any of the four built-in code paths; no regression surface) — and open a tracked follow-up issue for the unified-adapter migration with a concrete target date. That follow-up would take the design questions around the atomic-write and recursive-op semantics seriously, add dedicated per-protocol unit tests before any refactor, and migrate one protocol at a time (file → s3 → gcs → azure) in small reviewable PRs.

I'm happy to own that workstream. Given the scope, I'd want to sequence it against other commitments rather than execute it immediately inside this change — doing it now at the cost of higher-priority items isn't the right trade-off, and bundling it here would force the design decisions under the wrong time pressure. If you're good with the staging, I'll open the follow-up issue today.
