feat(cwl): integration of CWL job submission and execution into DiracX #877

Draft
ryuwd wants to merge 39 commits into DIRACGrid:main from ryuwd:feat/cwl-job-submission
@ryuwd ryuwd commented Apr 2, 2026

End-to-end CWL job submission and execution for DiracX, from CLI to worker
node and back. This PR introduces the full subsystem: there is no prior CWL
path in diracx, so the components below are all new code unless noted
otherwise.

Follows the plan in #858. Goes with DIRACGrid/DIRAC#8506.

CLI (diracx-cli)

| Component | Description | Tested in cert | Further development |
| --- | --- | --- | --- |
| `dirac job submit cwl <workflow> [inputs...]` | New command. Submits CWL jobs with local file sandbox upload, LFN references, and parametric `--range` expansion. Workflow input defaults are merged into job dicts before sandbox grouping so `default: { class: File, path: ... }` entries are uploaded as expected. | Yes | None for this PR. |
| `dirac job submit cmd -- <command>` | New command. Quick submission with auto-generated CWL (captures stdout/stderr to log files). | Yes | Needs further development in a follow-up PR. |
| `dirac job search` | Renamed and reworked in this PR. Searches jobs with conditions and rich table output. | Yes | None. |
| `dirac job sandbox list\|peek\|get <job_id>` | New commands. Explore and retrieve output sandbox files. `peek` pages through `$PAGER` with `LESS=-R` so ANSI colours render. Output sandboxes only. | Yes (output sandboxes) | Extend to input sandboxes in a follow-up PR (issue: TBD). |
| Submission pipeline | New module. CWL/YAML parsing, input validation, sandbox scanning/grouping/upload, confirmation prompt, range expansion. Tolerates complex CWL input types in `parse_cli_args`. | Yes | None. |
| `dirac-cwl-runner` | New entry point. Custom cwltool executor with DIRAC-aware FsAccess and PathMapper for LFN/SB resolution via replica maps. The PathMapper overrides cwltool's `mapper()` so that `SB:` keys carrying `#fragment` identifiers are looked up by their full key rather than fragment-stripped, which cwltool's default does for plain HTTP-style URIs. Name follows the CWL-community `<framework>-cwl-runner` convention (`toil-cwl-runner`, `arvados-cwl-runner`, `cwl-tes`). | Yes | None. |
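
The `dirac job submit cmd` row describes auto-generated CWL that captures stdout/stderr to log files. As a rough illustration only, a wrapper of that shape can be built from standard CWL v1.2 fields. The helper name `wrap_command_as_cwl`, the output ids, and the log file names below are hypothetical and are not taken from this PR.

```python
from typing import Any


def wrap_command_as_cwl(command: list[str]) -> dict[str, Any]:
    """Build a minimal CWL CommandLineTool document for a bare command.

    Hypothetical helper: the generator shipped in this PR may differ.
    """
    if not command:
        raise ValueError("command must contain at least the executable")
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": list(command),
        "inputs": {},
        # `stdout`/`stderr` name the capture files; the `stdout` and `stderr`
        # output types are CWL shortcuts that collect those files as outputs.
        "stdout": "stdout.log",  # assumed file name
        "stderr": "stderr.log",  # assumed file name
        "outputs": {
            "stdout_log": {"type": "stdout"},
            "stderr_log": {"type": "stderr"},
        },
    }


doc = wrap_command_as_cwl(["echo", "hello"])
```

Serialising `doc` to YAML would give a document that a CWL runner can execute directly, which is the property the quick-submission path relies on.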

Worker node (diracx-api)

| Component | Description | Tested in cert | Further development |
| --- | --- | --- | --- |
| JobWrapper | New module. Full CWL job lifecycle: pre-process (sandbox download, LFN resolution, replica map building), async subprocess execution with live stderr streaming, post-process (output sandbox upload, output data registration). The worker writes the fetched CWL YAML verbatim to disk rather than round-tripping through `cwl_utils.save()`. | Yes (LHCb Simulation job) | None. |
| Input Data Resolution (LFN resolution via DataManager) | New. Builds the replica map by querying DataManager for input LFNs before staging. | No | Implementation incomplete. To be completed and exercised in a follow-up PR (issue: TBD). |
| `post_process` semantics | New. Returns bool and runs on both success and failure paths. Cwltool exiting non-zero is treated as a normal control-flow outcome, so partial outputs from `pickValue: all_non_null` reach the user even on permanentFail. `JobMinorStatus.APP_ERRORS` is set on failure. | Yes | Revisit the status mapping (Major/Minor/Application) for the various cwltool exit conditions; current mapping is a first cut (issue: TBD). |
| JobMonitor | New. Heartbeat loop covering prmon metrics, peek content (last N cwltool stderr lines), Kill command handling, and stall detection via a rolling CPU/wall ratio window. Sends one final heartbeat on exit. | Yes (peek and heartbeats confirmed) | Decide what to do with the prmon output (where it goes, retention, surfacing to the user). Stall window/threshold and `PEEK_LINES` are hard-coded; should move to CS config (issue: TBD). |
| prmon streaming | New. `PrmonFifoReader` reads metrics from a named pipe (1 s sampling, `--fast-memmon`) instead of polling the on-disk TSV. `OnlineCompressor` ports the HSF/prmon interpolation-drop algorithm to pure Python so the time-series can be compressed in-process without pandas (DIRACOS2 has no pandas). | To be checked | Document FIFO failure-mode contract (what happens if prmon dies mid-job). |
| ApplicationStatus reporting | New. Cwltool lifecycle transitions (`[job echo-tool] completed success`, `[workflow ] starting step greet`) streamed as ApplicationStatus with rate-limited commits. | Yes | None. |
| Sandbox URI scheme | New. `SB:<se>\|<s3_path>#<filename>`. The `#fragment` identifies the file inside the tar archive; the `SB:` reference is preserved end-to-end and resolved to a presigned URL at extraction time. | Yes (input and output sandboxes) | Verify that sandboxes are still registered correctly for legacy DIRAC consumers (issue: TBD). |
| JobReport | New. Accumulates job status updates in a timestamped dict and flushes them to the server via `set_job_statuses` on commit. Heartbeats are sent through synchronously by `send_heartbeat`, returning any pending server commands (e.g. Kill) to the caller. Uses generated-client `HeartbeatData` / `JobCommand`. | Yes | None. |
| Subprocess environment | CVMFS node and the `job_path` are prepended to `PATH` so cwltool's JS evaluation finds Node and sandbox-staged binaries resolve. | Yes | Remove these `PATH` hacks before merge. |
| StoreOutputData command | Stub. Wired in as a PostProcessCommand on jobs with `output_data` hints, calls `DataManager.putAndRegister` over the configured SE list, and raises on total failure. None of the production-grade behaviour (Adler32 checksumming, GUID extraction, local-SE preference, retry/backoff, RMS failover Request, progress reporting) is implemented; all are marked TODO. | No | Out of scope. Full implementation and cert testing in a follow-up PR (issue: TBD). |
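
The JobMonitor row mentions stall detection via a rolling CPU/wall ratio window. The sketch below shows one way such a window could work; it is not the JobMonitor code. The class name `StallDetector`, the window size, and the threshold are all assumptions made for illustration (the PR notes the real values are currently hard-coded).

```python
from collections import deque


class StallDetector:
    """Flag a job as stalled when its CPU/wall ratio stays low over a window.

    Hypothetical sketch: `window` and `min_ratio` are illustrative values.
    """

    def __init__(self, window: int = 5, min_ratio: float = 0.05) -> None:
        self.min_ratio = min_ratio
        # Each entry is a cumulative (cpu_seconds, wall_seconds) sample;
        # window + 1 samples span `window` sampling intervals.
        self.samples: deque[tuple[float, float]] = deque(maxlen=window + 1)

    def add_sample(self, cpu_seconds: float, wall_seconds: float) -> None:
        self.samples.append((cpu_seconds, wall_seconds))

    def is_stalled(self) -> bool:
        # Never decide before a full window of samples is available.
        if len(self.samples) < self.samples.maxlen:
            return False
        cpu_start, wall_start = self.samples[0]
        cpu_end, wall_end = self.samples[-1]
        wall_delta = wall_end - wall_start
        if wall_delta <= 0:
            return False
        return (cpu_end - cpu_start) / wall_delta < self.min_ratio


detector = StallDetector(window=3, min_ratio=0.05)
for minute in range(4):
    # CPU time stays flat while wall-clock time advances: a stalled job.
    detector.add_sample(cpu_seconds=10.0, wall_seconds=60.0 * minute)
stalled = detector.is_stalled()
```

Using cumulative counters and differencing the oldest and newest samples keeps the check cheap and makes it insensitive to a single noisy heartbeat.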

Server (diracx-logic, diracx-routers, diracx-core)

| Component | Description | Tested in cert | Further development |
| --- | --- | --- | --- |
| CWL-to-JDL translation | New. `cwl_to_jdl()` extracts `dirac:Job` hints from CWL and emits a JDL string for the legacy matcher. Derives JobName, JobGroup, JobType, Priority, LogLevel, Site, CPUTime, MaxWallTime, Min/MaxNumberOfProcessors, processor tags (MultiProcessor, NProcessors, GPU), I/O sandboxes, InputData, OutputData, OutputPath, OutputSE. A `legacy_jdl` field on JobHint is merged last as a user override. | Yes | None. |
| Auto stdout/stderr collection | New. CWL `stdout:` / `stderr:` fields are added to OutputSandbox automatically. | Yes | None. |
| InputSandbox `#fragment` stripping | New. JDL InputSandbox carries bare `SB:` references for server ownership checks; the full URI with `#filename` is preserved in CWL inputs for worker extraction. | Yes | None. |
| Range expansion | New. Server-side parametric job expansion from `--range`. | Yes | None. |
| Models | New. JobHint, IOSource, OutputDataEntry, and the pre/post-process command framework. ReplicaMap is preexisting and was extended in this PR to accept `SB:` keys (validation passes them through with the prefix; `LFN:` keys still have the prefix stripped). | Yes | None. |
| CWL requirement validation | New. `validate_requirements()` checks every CWL Requirement and Hint against a whitelist. Pass-through (no matcher impact): InlineJavascriptRequirement, SchemaDefRequirement, InitialWorkDirRequirement, EnvVarRequirement, ShellCommandRequirement, LoadListingRequirement, InplaceUpdateRequirement, WorkReuse, NetworkAccess, SubworkflowFeatureRequirement, ScatterFeatureRequirement, MultipleInputFeatureRequirement, StepInputExpressionRequirement. Rejected: DockerRequirement, MPIRequirement, SoftwareRequirement. Unknown requirements raise. | Yes | None. |
| SoftwareDistModule default | Changed from `"LocalSoftwareDist"` to `""` to fix a Pilot error. | Yes | Confirm with @chaen before merge that this is the intended default. |
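
The whitelist behaviour described for CWL requirement validation can be sketched as follows. This is not the PR's `validate_requirements()`: the function name `check_requirement_classes` and the `extra_allowed` parameter are hypothetical, and requirement classes that do affect matching (not enumerated in the table) are deliberately left to the caller.

```python
# Class names copied from the table above.
PASS_THROUGH = frozenset({
    "InlineJavascriptRequirement", "SchemaDefRequirement",
    "InitialWorkDirRequirement", "EnvVarRequirement",
    "ShellCommandRequirement", "LoadListingRequirement",
    "InplaceUpdateRequirement", "WorkReuse", "NetworkAccess",
    "SubworkflowFeatureRequirement", "ScatterFeatureRequirement",
    "MultipleInputFeatureRequirement", "StepInputExpressionRequirement",
})
REJECTED = frozenset({"DockerRequirement", "MPIRequirement", "SoftwareRequirement"})


def check_requirement_classes(classes, extra_allowed=frozenset()):
    """Raise ValueError for rejected or unknown requirement class names.

    `extra_allowed` is an assumed hook for classes handled elsewhere
    (for example, ones that influence matching).
    """
    for name in classes:
        if name in REJECTED:
            raise ValueError(f"Unsupported CWL requirement: {name}")
        if name not in PASS_THROUGH and name not in extra_allowed:
            raise ValueError(f"Unknown CWL requirement: {name}")


# Accepted silently: both classes are on the pass-through list.
check_requirement_classes(["InlineJavascriptRequirement", "WorkReuse"])
```

Failing closed on unknown names means a newly introduced CWL requirement cannot slip through unreviewed.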

Client (diracx-client)

| Component | Description | Tested in cert | Further development |
| --- | --- | --- | --- |
| Generated extensions | New endpoints in the generated client for CWL submission, workflow retrieval, and sandbox operations. | Yes | None. |

Key design decisions

| Decision | Rationale | Further development |
| --- | --- | --- |
| CWL-native client surface | No JDL on the client side. CWL is the job description format; JDL is an internal implementation detail of the legacy DIRAC matcher. | None. |
| cwltool passthrough for status | ApplicationStatus shows verbatim cwltool lifecycle lines rather than a custom translation layer. | None. |
| Replica map as JSON | JSON file passed to the cwltool executor mapping LFN/SB paths to local files. Decouples CWL execution from DIRAC data management. | None. |
| `location` vs `path` per CWL v1.2 | URI schemes (`LFN:`, `SB:`) live in `File.location`; `path` is reserved for local filesystem paths set after staging. Validation rejects scheme URIs in `path`. Readers check `location` before `path` on cwl_utils File objects. | None. |
| Singularity/apptainer for InlineJavascriptRequirement | The cwl_utils JS sandbox runs Node inside Singularity/apptainer rather than directly, matching the typical grid worker capability. | Switch to quickjs as the JS evaluator in a follow-up PR. Avoids the Node package size and other complications of carrying a full Node runtime on the worker (issue: TBD). |
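
The `location` vs `path` rule lends itself to a small validation check. The sketch below assumes plain-dict CWL File entries and a hypothetical function name `validate_file_entry`; the PR's actual validators may be structured differently.

```python
REMOTE_SCHEMES = ("LFN:", "SB:")


def validate_file_entry(entry: dict) -> None:
    """Reject LFN:/SB: URIs placed in a CWL File's `path` field.

    Hypothetical helper operating on plain dicts, for illustration only.
    """
    if entry.get("class") != "File":
        return
    path = entry.get("path")
    if isinstance(path, str) and path.startswith(REMOTE_SCHEMES):
        raise ValueError(
            f"{path!r} is a URI and belongs in 'location', not 'path'"
        )


# Valid: the URI scheme lives in `location`.
validate_file_entry({"class": "File", "location": "LFN:/lhcb/user/example.root"})
# Valid: `path` holds a local filesystem path.
validate_file_entry({"class": "File", "path": "inputs/config.yaml"})
```

The LFN and local path used here are made-up example values.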

Compatibility caveats

| Issue | Resolution | Tested in cert | Further development |
| --- | --- | --- | --- |
| `make_path_mapper` override under mypyc-compiled cwltool | Cwltool ships mypyc-compiled wheels on Linux x86_64. Inside the compiled `CommandLineTool.job()`, a `@staticmethod` override of `make_path_mapper` is bypassed because the call resolves as a direct C call rather than through Python's MRO. The PR ships two complementary fixes: (1) the override is declared as an instance method so dispatch goes through the descriptor protocol and survives the compiled call site; (2) `_mypyc_compat.py` installs a `sys.meta_path` finder, invoked at executor package import, that forces .py-source loading for `cwltool.command_line_tool` when both source and compiled modules are present. The finder skips itself for pure-Python cwltool installs and raises at import time if only the compiled extension is present, so a wheel that strips the .py source can never silently disable the override. The active mode is logged and printed to stderr at install. | Yes | Evaluate whether the `_mypyc_compat.py` hack is necessary anymore. |
| prmon stderr leakage into stdout | `dirac-cwl-runner` strips leading non-JSON noise from cwltool's stdout when prmon emits warnings before the JSON document. | | Workaround only. Proper fix requires an upstream PR to prmon to keep warnings off stdout. |
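
The prmon workaround amounts to skipping leading noise until a JSON document parses. A minimal sketch of that idea, using only the standard library, is shown below. The function name `extract_json_document` is hypothetical and the runner's real stripping logic may differ.

```python
import json


def extract_json_document(stdout: str):
    """Return the first JSON object that parses, skipping leading noise.

    Hypothetical helper: tries each '{' in turn until one decodes.
    """
    decoder = json.JSONDecoder()
    start = stdout.find("{")
    while start != -1:
        try:
            # raw_decode tolerates trailing text after the document.
            obj, _consumed = decoder.raw_decode(stdout[start:])
            return obj
        except json.JSONDecodeError:
            start = stdout.find("{", start + 1)
    raise ValueError("no JSON object found in stdout")


noisy = 'prmon: warning {unparsable\n{"out": {"class": "File", "basename": "a.txt"}}\n'
outputs = extract_json_document(noisy)
```

One limitation of this approach: if the noise itself happened to contain a complete JSON object, that object would be returned instead, which is one reason an upstream fix is preferable.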

Test coverage

20 new test files across the affected packages.

| Package | Test files |
| --- | --- |
| diracx-api | test_job_monitor.py, test_job_report.py, test_job_wrapper.py, test_job_wrapper_integration.py, test_job_wrapper_sandbox.py, test_prmon_compress.py, test_prmon_reader.py |
| diracx-cli | test_cwl_submit.py, test_executor.py, test_executor_integration.py, test_fs_access.py, test_no_cwltool_import.py, test_submission_confirm.py, test_submission_inputs.py, test_submission_integration.py, test_submission_pipeline.py, test_submission_sandbox.py, test_submit_simple.py |
| diracx-core | test_replica_map.py (extended for `SB:` keys) |
| diracx-db | test_workflow_db.py |
| diracx-logic | test_cwl_submission.py |

Cert testing on diracx-cert.app.cern.ch covers LHCb Simulation jobs end-to-end. See per-row "Tested in cert" columns above for component-level cert coverage.

Status

Under certification testing on diracx-cert.app.cern.ch. The follow-ups
listed above are intentionally scoped out of this PR.

cc @aldbr

read-the-docs-community Bot commented Apr 2, 2026

@ryuwd ryuwd force-pushed the feat/cwl-job-submission branch 2 times, most recently from 850b0a5 to 32a33a2 Compare April 8, 2026 13:56
@ryuwd ryuwd changed the title from "feat(cwl): add CWL workflow submission endpoint and DB storage model" to "feat(cwl): integration of CWL job submission and execution into DiracX" Apr 10, 2026
Comment thread diracx-api/src/diracx/api/job_report.py
Comment thread diracx-api/src/diracx/api/job_wrapper.py Outdated
Comment thread diracx-cli/src/diracx/cli/executor/fs_access.py
 Services: ServicesConfig = ServicesConfig()
 """Configuration for various DIRAC services."""
-SoftwareDistModule: str = "LocalSoftwareDist"
+SoftwareDistModule: str = ""
This was causing errors in the Pilot.

To be checked with @chaen.

Comment on lines +75 to +81
# TODO: Compute Adler32 checksum before upload
# TODO: Extract POOL/ROOT GUID if applicable
# TODO: Prefer local SEs (getSEsForSite) before remote ones
# TODO: Implement retry with exponential backoff on transient failures
# TODO: On complete failure, create a failover Request (RMS)
# for async recovery instead of raising immediately
# TODO: Report upload progress via job status updates
@ryuwd ryuwd Apr 10, 2026
This command is still untested in cert. StoreOutputData still needs fuller implementation, discussion, and testing.

@ryuwd ryuwd force-pushed the feat/cwl-job-submission branch from 75e0b03 to f9133d2 Compare April 14, 2026 14:47
@ryuwd ryuwd force-pushed the feat/cwl-job-submission branch from dd8abd1 to 8a45d3c Compare April 16, 2026 12:21
@ryuwd ryuwd force-pushed the feat/cwl-job-submission branch 2 times, most recently from ab58da3 to 27d0557 Compare May 6, 2026 15:17
ryuwd added 20 commits May 7, 2026 14:50
Per the CWL v1.2 spec, location is an IRI that identifies a file
resource (supports custom URI schemes like LFN: and SB:), while path
is a local filesystem path set after staging. Previously both URI
schemes and local paths were placed in path, which breaks when
cwltool normalises inputs via file_uri().

Readers now check location before path, writers place URI schemes
in location, and validation rejects LFN:/SB: in the path field
on both the client and server side.
Verify that DiracPathMapper produces correct target values (what
cwltool assigns to the File path field at runtime) for different
PFN types: file:// to local path, https:// and root:// passed
through as URLs, and SB: resolved via replica map.
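
To make the `SB:` resolution concrete, here is a sketch of parsing the `SB:<se>|<s3_path>#<filename>` form from the PR description and looking it up in a replica map by its full key. The names `SandboxRef` and `parse_sb_uri`, the SE name, and the paths are hypothetical; the replica map here is a plain dict standing in for the real model.

```python
from typing import NamedTuple


class SandboxRef(NamedTuple):
    se: str
    s3_path: str
    filename: str


def parse_sb_uri(uri: str) -> SandboxRef:
    """Split an SB: URI into SE, archive path, and member file name.

    Assumes neither the SE name nor the archive path contains '|' or '#'.
    """
    if not uri.startswith("SB:"):
        raise ValueError(f"not a sandbox URI: {uri!r}")
    body, hash_sep, filename = uri[len("SB:"):].partition("#")
    se, bar_sep, s3_path = body.partition("|")
    if not (hash_sep and bar_sep and se and s3_path and filename):
        raise ValueError(f"malformed sandbox URI: {uri!r}")
    return SandboxRef(se, s3_path, filename)


# Two files from the same archive differ only by fragment, so the map
# must be keyed on the full URI, fragment included.
archive = "SB:SandboxSE|/S3/bucket/sandbox.tar.bz2"  # made-up example
replica_map = {
    f"{archive}#input.txt": "/scratch/job/input.txt",
    f"{archive}#config.yaml": "/scratch/job/config.yaml",
}
ref = parse_sb_uri(f"{archive}#config.yaml")
local_path = replica_map[f"{archive}#config.yaml"]
```

If the fragment were stripped before lookup, both entries would collapse onto the same key, which is the failure the full-key lookup avoids.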
load_inputfile() converts input dicts into cwl_utils File objects
where location="SB:..." and path=None. The extract methods only
checked .path on objects, silently dropping SB: and LFN: references
stored in .location. This caused empty replica maps and sandbox
download failures on the worker.
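
The bug described above comes down to which attribute a reader consults first. The sketch below uses a small dataclass, `FileStub`, as a stand-in for a cwl_utils File object (it is not the cwl_utils class), and `collect_remote_refs` is a hypothetical name for the kind of extract method the commit fixes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileStub:
    """Stand-in for a cwl_utils File object (illustration only)."""
    location: Optional[str] = None
    path: Optional[str] = None


def file_reference(f: FileStub) -> Optional[str]:
    # Prefer `location`; fall back to `path` only when it is unset.
    return f.location if f.location else f.path


def collect_remote_refs(files, schemes=("LFN:", "SB:")):
    """Return the LFN:/SB: references found on the given File objects."""
    refs = []
    for f in files:
        ref = file_reference(f)
        if ref and ref.startswith(schemes):
            refs.append(ref)
    return refs


files = [
    FileStub(location="SB:SandboxSE|/S3/bucket/sandbox.tar.bz2#in.txt"),
    FileStub(path="/tmp/local.txt"),
    FileStub(location="LFN:/lhcb/user/example.root"),
]
remote = collect_remote_refs(files)
```

A reader that looked only at `path` would return an empty list for this input, which matches the empty replica maps reported in the commit message.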
…cwltool

cwltool ships mypyc-compiled wheels on Linux x86_64. Inside the compiled
CommandLineTool.job(), `self.make_path_mapper(...)` is resolved as a direct
C call when the override is a `@staticmethod` — Python's MRO is bypassed
and DiracCommandLineTool's PathMapper override silently never runs.

Two changes restore the override:

* tool.py: drop @staticmethod and add `self`. Instance-method dispatch
  goes through the descriptor protocol and survives the compiled call
  site. Toil's CWL runner uses the same pattern (see toil's
  src/toil/cwl/cwltoil.py:1110).

* _mypyc_compat.py: rewrite as three explicit branches — pure-Python
  cwltool (skip), compiled+source (install meta-path finder forcing .py
  load for cwltool.command_line_tool), compiled-only (raise; the override
  cannot be restored without a .py source). The finder also raises on
  missing .py instead of silently falling back to .so. The active mode
  is logged and announced on stderr at install time so production
  behaviour is observable.

Tests: test_executor.py updates the make_path_mapper unit tests for the
new instance-method signature; test_no_cwltool_import.py asserts the
runtime invariant we care about (cwltool.command_line_tool is loaded
from a .py source after the executor is imported) rather than the
implementation detail of which finder is registered.
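
For readers unfamiliar with meta-path finders, the following sketch shows the general mechanism of forcing a module to load from its `.py` source. It is not the PR's `_mypyc_compat.py`: the class name `SourceOnlyFinder` is hypothetical, the three-branch detection described above is omitted, and the demo uses a throwaway module rather than cwltool.

```python
import importlib
import importlib.machinery
import importlib.util
import sys
import tempfile
from pathlib import Path


class SourceOnlyFinder:
    """Meta-path finder that loads one named module from .py source only."""

    def __init__(self, target: str) -> None:
        self.target = target

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self.target:
            return None  # Defer to the normal import machinery.
        leaf = fullname.rpartition(".")[2]
        # For submodules `path` is the parent package's __path__.
        for entry in (path if path is not None else sys.path):
            candidate = Path(entry) / f"{leaf}.py"
            if candidate.is_file():
                loader = importlib.machinery.SourceFileLoader(
                    fullname, str(candidate)
                )
                return importlib.util.spec_from_file_location(
                    fullname, str(candidate), loader=loader
                )
        # Fail loudly instead of falling back to a compiled extension.
        raise ImportError(f"no .py source found for {fullname}")


# Demo with a throwaway module; a real install would target the compiled one.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_source_mod.py").write_text("VALUE = 42\n")
    finder = SourceOnlyFinder("demo_source_mod")
    sys.path.insert(0, tmp)
    sys.meta_path.insert(0, finder)
    try:
        module = importlib.import_module("demo_source_mod")
        loaded_from = module.__file__
    finally:
        sys.meta_path.remove(finder)
        sys.path.remove(tmp)
```

Placing the finder at the front of `sys.meta_path` is what lets it win over the default path-based finder, which would otherwise be free to pick an extension module of the same name.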
The standalone dirac-cwl package also exports a `dirac-cwl-run` console
script (dirac_cwl.job.executor.__main__:cli). Once both packages coexist
in the same environment (e.g. lb-dirac during the migration window), the
two entry points collide — pip silently lets the second installer
overwrite the first, and which `dirac-cwl-run` actually runs becomes
non-deterministic across resolvers.

Rename the diracx version to `dirac-cwl-runner`, which:

* avoids the collision (different name from dirac-cwl's `-run`);
* matches the CWL-community convention of `<framework>-cwl-runner`
  (toil-cwl-runner, arvados-cwl-runner, cwl-tes…);
* aligns with the `dirac` CLI verb prefix used elsewhere in diracx.

Updates: pyproject entry-point declaration, JobWrapper.run_job
subprocess invocation, logger names across the executor modules
(so log filters track the binary), and the tests/docs that reference
the old name. lbaplocal can opt in via its existing `--executor` flag
today; the default flips once dirac-cwl is removed from lb-dirac.
dirac:Job.input_sandbox sources are resolved at submission time by
looking up each source name in the supplied inputs dict. cwltool
applies workflow-level input defaults later, on the worker — too late
for sandbox upload. Without an inputs.yml or CLI override, defaults
like 'default: { class: File, path: ... }' silently fail to upload.

Merge the workflow's declared defaults into each job dict between
CLI/inputs.yml processing and sandbox grouping so a sandbox source
referencing a defaulted File input gets uploaded as expected.
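
The merge step can be pictured as below. This sketch assumes the workflow's `inputs` are in CWL's map form (input id to spec) and uses a hypothetical function name, `merge_input_defaults`; the list form of `inputs` and the PR's actual data structures are not covered.

```python
import copy


def merge_input_defaults(workflow_inputs: dict, job_inputs: dict) -> dict:
    """Fill in declared defaults for inputs the job did not supply.

    Hypothetical helper: explicit job values always win over defaults.
    """
    merged = dict(job_inputs)
    for name, spec in workflow_inputs.items():
        if name in merged:
            continue
        if isinstance(spec, dict) and "default" in spec:
            # Deep-copy so per-job edits never mutate the workflow document.
            merged[name] = copy.deepcopy(spec["default"])
    return merged


workflow_inputs = {
    "config": {"type": "File", "default": {"class": "File", "path": "config.yaml"}},
    "events": {"type": "int", "default": 10},
    "seed": "int",  # shorthand type, no default
}
job = merge_input_defaults(workflow_inputs, {"events": 500})
```

After the merge, `job["config"]` is an ordinary File entry, so the sandbox scanner sees it exactly as it would a file passed on the command line.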
Worker writes the fetched CWL YAML verbatim to disk and points
dirac-cwl-runner at it. Drops JobModel/JobInputModel/BaseJobModel
indirection and the cwl_utils.save() that re-serialised inline
sub-workflow ids with a 'run/' prefix, breaking URI fragment
resolution downstream.
Removes the RuntimeError raise — using exceptions for an expected
control-flow path (cwltool exited non-zero) muddied the outer
exception handler and produced misleading 'Failed to execute
workflow' tracebacks for normal user-job failures.

post_process now always logs cwltool's output JSON, parses what it
can, attempts the output-sandbox upload (so partial outputs from
pickValue=all_non_null reach the user even on permanentFail), and
returns whether its own infrastructure ran cleanly. run_job decides
DONE vs FAILED from cwltool's exit code separately.
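
The control flow described in this commit message can be summarised in a short sketch. Everything here is illustrative: the function signatures, the `upload_sandbox` callback, and the `Outcome` enum are assumptions, and how an infrastructure failure feeds into the final status is not stated in the message, so the sketch simply returns both values.

```python
import json
from enum import Enum


class Outcome(Enum):
    DONE = "Done"
    FAILED = "Failed"


def post_process(output_json: str, upload_sandbox) -> bool:
    """Return True if post-processing itself ran cleanly.

    Runs on both success and failure paths and never raises for a
    non-zero workflow exit; that is decided separately by the caller.
    """
    clean = True
    try:
        outputs = json.loads(output_json)
    except json.JSONDecodeError:
        outputs, clean = {}, False
    try:
        # Always attempt the upload so partial outputs still reach the user.
        upload_sandbox(outputs)
    except OSError:
        clean = False
    return clean


def run_job(exit_code: int, output_json: str, upload_sandbox):
    infra_ok = post_process(output_json, upload_sandbox)
    # Status comes from the workflow exit code alone in this sketch.
    status = Outcome.DONE if exit_code == 0 else Outcome.FAILED
    return status, infra_ok


uploaded = []
status, infra_ok = run_job(1, '{"log": null}', uploaded.append)
```

Here a failing workflow still has its partial outputs uploaded, and the caller can tell a user-job failure apart from a problem in the wrapper itself.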
@ryuwd ryuwd force-pushed the feat/cwl-job-submission branch from 79b4f36 to fa5e403 Compare May 7, 2026 12:51
@ryuwd ryuwd requested a review from aldbr May 7, 2026 15:03