Description
While following the README instructions to set up the project locally, I encountered an issue where the application crashes on startup if the embeddings cannot be downloaded.
The failure appears to originate from two related problems:
- The embeddings cannot currently be downloaded from the public S3 bucket.
- When embeddings are missing, the application crashes with an opaque error instead of providing a helpful message.
Steps to Reproduce
- Clone the repository.
- Install dependencies:
- Configure environment variables (.env) as described in the README.
- Attempt to list available embeddings:
./bin/embeddings_manager ls-remote
This returns a boto3 error:
403 Forbidden / AccessDenied
from the S3 bucket download.reactome.org.
- Attempt to install embeddings:
./bin/embeddings_manager install openai/text-embedding-3-large/reactome/Release91
This fails with the same error.
- Start the application:
Both containers exit with the following traceback:
AttributeError: 'NoneType' object has no attribute 'glob'
Observed Behavior
- embeddings_manager ls-remote fails with a boto3 403 AccessDenied.
- embeddings_manager install fails during the S3 download.
- When the application starts without embeddings present, it crashes during initialization with:
AttributeError: 'NoneType' object has no attribute 'glob'
The error message does not indicate that the underlying issue is missing embeddings.
Expected Behavior
Ideally one of the following would occur:
- If the embeddings cannot be downloaded from S3, embeddings_manager should produce a clear error explaining the access failure.
Example:
ERROR: Unable to access embedding archive from S3 (403 AccessDenied).
Please verify bucket permissions or install embeddings manually.
- If embeddings are missing when the application starts, the server should fail with a clear message such as:
ERROR: No embeddings configured for 'reactome'.
Run 'bin/embeddings_manager install <model>/<db>/<version>' to install embeddings.
Alternatively, the affected profile could be disabled while allowing the server to start with reduced functionality.
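To make the first expectation concrete, here is a minimal sketch of how the S3 failure could be turned into the message above. The function name `describe_s3_failure` is hypothetical and not taken from the codebase. It assumes the exception carries a `response` mapping shaped like the one on `botocore.exceptions.ClientError`, and it falls back to a generic message for anything else.

```python
from typing import Any, Mapping


def describe_s3_failure(exc: Exception, bucket: str) -> str:
    """Turn a botocore-style client error into a readable one-line message.

    Hypothetical helper: assumes `exc.response` looks like
    botocore.exceptions.ClientError.response. Other exceptions get a
    generic message built from str(exc).
    """
    response: Mapping[str, Any] = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code", "Unknown")
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

    # HEAD-style requests report the bare status ("403") as the error code,
    # so check both spellings as well as the HTTP status itself.
    if code in ("AccessDenied", "403") or status == 403:
        return (
            f"ERROR: Unable to access embedding archive from S3 bucket "
            f"'{bucket}' (403 AccessDenied). "
            "Please verify bucket permissions or install embeddings manually."
        )
    return f"ERROR: S3 request to '{bucket}' failed ({code}): {exc}"
```

In embeddings_manager this would presumably sit in an `except botocore.exceptions.ClientError as exc:` block around the listing and download calls. I have not checked where those calls live, so that placement is a guess.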
Likely Cause
From debugging the startup process, it appears the crash occurs because:
- EmbeddingEnvironment.get_dir() returns None when embeddings are not configured.
- That value eventually propagates into the retriever initialization.
- A later call to directory.glob() assumes the directory exists and triggers the AttributeError.
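The chain above suggests a guard before the first use of the directory. Below is a minimal sketch of one, not the project's actual code: `MissingEmbeddingsError` and `require_embeddings_dir` are hypothetical names, and I am assuming the value returned by get_dir() is either None or a `pathlib.Path`.

```python
from pathlib import Path
from typing import Optional


class MissingEmbeddingsError(RuntimeError):
    """Raised when a profile has no usable embeddings directory (hypothetical)."""


def require_embeddings_dir(directory: Optional[Path], profile: str) -> Path:
    """Return `directory` if it exists, otherwise raise a descriptive error.

    Intended to wrap the (possibly None) result of get_dir() before any
    code calls directory.glob() on it.
    """
    if directory is None or not directory.is_dir():
        raise MissingEmbeddingsError(
            f"No embeddings configured for '{profile}'. "
            "Run 'bin/embeddings_manager install <model>/<db>/<version>' "
            "to install embeddings."
        )
    return directory
```

With a check like this at retriever initialization, the startup failure would name the missing embeddings instead of surfacing as an AttributeError several calls later.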
Impact
This currently blocks new contributors from running the application locally using the documented workflow:
ls-remote → install → docker compose up
Since the embeddings cannot be downloaded and the startup error does not explain the root cause, diagnosing the issue requires tracing through several internal modules.
Possible Improvements
Some potential improvements that might make this easier for users:
- Add clearer error handling in embeddings_manager for S3 access failures.
- Validate that embeddings exist during application startup and fail with a descriptive message.
- Optionally allow the application to start while disabling profiles that require missing embeddings.
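For the last option, a rough sketch of what starting with reduced functionality might look like. The function `usable_profiles` and the shape of its input (a mapping from profile name to an optional embeddings directory) are assumptions for illustration. The real configuration structure may differ.

```python
import logging
from pathlib import Path
from typing import Dict, List, Optional

logger = logging.getLogger(__name__)


def usable_profiles(dirs: Dict[str, Optional[Path]]) -> List[str]:
    """Return the profiles whose embeddings directory exists; warn about the rest.

    `dirs` maps a profile name to its embeddings directory, or to None
    when nothing is configured (hypothetical input shape).
    """
    enabled: List[str] = []
    for profile, directory in dirs.items():
        if directory is not None and directory.is_dir():
            enabled.append(profile)
        else:
            # Skip the profile instead of crashing the whole server.
            logger.warning(
                "Disabling profile '%s': no embeddings installed.", profile
            )
    return enabled
```

The server could then register only the returned profiles, and fail with a descriptive error if the list comes back empty.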
Happy to help implement a fix if this approach makes sense.