feat(ribocode): pre-build pyfasta indexes + prefix-scoped outputs #11685
Two related changes carried in nf-core/riboseq#174 and split out of the bundled PR #11684.

ribocode/prepare: pre-build the pyfasta `.gdx`/`.flat` indexes for `annotation/transcripts_sequence.fa` immediately after `prepare_transcripts`, using the same `key_fn` RiboCode applies internally (split on first space, otherwise split on `|`). The stub touches the two new sidecars.

Why: downstream RiboCode steps open the FASTA with pyfasta, which lazily writes `.gdx`/`.flat` next to the input on first read. Under Fusion staging those writes land back at the upstream task's S3 prefix and silently corrupt the staged copy on retries. Building the indexes inside the producing task fixes it.

ribocode/ribocode: switch the `orf_txt` and `orf_txt_collapsed` output globs from `*.txt` / `*_collapsed.txt` to `${prefix}.txt` / `${prefix}_collapsed.txt` so multi-instance publication is unambiguous (`*.txt` previously matched both files into the same emit). The `prefix` binding is promoted out of `def` in both `script:` and `stub:` so it resolves at the output-glob stage; the Nextflow 26 strict parser rejects re-declaring the same local with `def` across both blocks. The existing stub assertion at `process.out.orf_txt[0][1][0]` is corrected to the new single-file shape (`process.out.orf_txt[0][1]`).

Source: nf-core/riboseq#174

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
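The key-function rule quoted in the commit message ("split on first space, otherwise split on `|`") can be read as the sketch below. This is one interpretation of that sentence, not RiboCode's actual `get_chrom` code; the function name `first_token_key` and the example headers are hypothetical.

```python
def first_token_key(header: str) -> str:
    """Hypothetical FASTA key function (an assumption, not RiboCode's code).

    Keeps the text before the first space; if the header contains no space,
    keeps the text before the first '|' instead.
    """
    if " " in header:
        return header.split(" ", 1)[0]
    return header.split("|", 1)[0]


# Made-up headers in two common styles.
space_style = "ENST00000456328.2 gene=DDX11L1"
pipe_style = "ENST00000456328.2|ENSG00000223972.5|DDX11L1"

print(first_token_key(space_style))
print(first_token_key(pipe_style))
```

Whatever the exact rule is, the pre-built index is only reusable if it was keyed the same way the later reader keys it, which is why the commit insists on matching RiboCode's own function.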
The lazy pyfasta sidecar write isn't Fusion-specific - it's a Nextflow symlink-staging concern that affects any backend (writes leak back to the producer task's work dir via the staged-input symlink). Rewording the inline comment to match. No code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
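The leak described in this commit can be reproduced without Nextflow or pyfasta. The sketch below assumes the `annotation` directory is staged into the consumer's work dir as a directory symlink; the directory names `producer_work` and `consumer_work` and the helper `demo_sidecar_leak` are hypothetical, and a plain file write stands in for pyfasta's lazy index creation. It needs a platform where unprivileged symlinks are available.

```python
import os
import tempfile
from pathlib import Path


def demo_sidecar_leak() -> tuple:
    """Write a sidecar next to a symlink-staged FASTA and report where it lands.

    Returns (leaked, same_dir): whether the sidecar shows up in the producer's
    directory, and whether its resolved parent is the producer's directory.
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        # Producer task: owns the real annotation directory and FASTA.
        producer = root / "producer_work" / "annotation"
        producer.mkdir(parents=True)
        (producer / "transcripts_sequence.fa").write_text(">tx1\nACGT\n")

        # Consumer task: sees the directory only through a staging symlink.
        consumer = root / "consumer_work"
        consumer.mkdir()
        os.symlink(producer, consumer / "annotation", target_is_directory=True)

        # Stand-in for a lazy index write: a file created "next to" the FASTA
        # as seen from the consumer's side.
        staged_fasta = consumer / "annotation" / "transcripts_sequence.fa"
        sidecar = Path(str(staged_fasta) + ".gdx")
        sidecar.write_text("index placeholder")

        leaked = (producer / "transcripts_sequence.fa.gdx").exists()
        same_dir = sidecar.resolve().parent == producer.resolve()
        return leaked, same_dir


leaked, same_dir = demo_sidecar_leak()
print("sidecar visible in producer dir:", leaked)
print("sidecar resolves into producer dir:", same_dir)
```

Because every consumer resolves to the same real directory, several consumers started in parallel would all target the same sidecar path, which is the race the pre-build avoids.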
…build

Replace the inline 8-line python heredoc (which replicated RiboCode's `get_chrom` key_fn verbatim) with a single `python -c` line that imports and instantiates `RiboCode.prepare_transcripts.GenomeSeq` directly. The class constructor itself runs `Fasta(filename, key_fn=get_chrom)` with the same key function, so we drop the replication while producing byte-identical `.gdx`/`.flat` sidecars (md5-verified on the realistic FASTA format `prepare_transcripts` emits). No snapshot change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
-o annotation \\
$args
# Pre-build the pyfasta .gdx/.flat sidecars by instantiating RiboCode's own GenomeSeq -
Could this be an additional flag in `prepare_transcripts` in the future? It might be worth filing an issue on their GitHub repo.
Thanks! It's kind of something that comes out of the workflow use case specifically, but I can at least flag it with them.
@jonasscheid good shout - opened xryanglab/RiboCode#70 upstream proposing the eager-build patch. If they take it we can drop the inline pre-build entirely. I'll add a link in the module's inline comment so future maintainers can find it.
Update: opened the upstream PR too - xryanglab/RiboCode#71. If they accept it, the next ribocode container build picks up the eager index and we can drop the inline pre-build from this module entirely. For now this PR remains the workaround.
Great stuff 👍🏼
Two related changes carried in nf-core/riboseq#174.

ribocode/prepare

After `prepare_transcripts` runs, pre-build the pyfasta `.gdx`/`.flat` sidecars for `annotation/transcripts_sequence.fa` by instantiating `RiboCode.prepare_transcripts.GenomeSeq` directly.

Why: RiboCode's downstream `detectORF.py` opens `transcripts_sequence.fa` via pyfasta, which lazily writes `.gdx`/`.flat` next to the FASTA on first read. Under Nextflow's default symlink staging those writes go through the symlink and land in the upstream `RIBOCODE_PREPARE` task's work dir; parallel consumers then race on the same sidecar paths, with last-writer-wins behaviour on shared/network storage. Building the sidecars in the producing task makes them part of the published annotation directory and removes the lazy-write path entirely - same pattern as `samtools/faidx` shipping `.fai` alongside its FASTA.

ribocode/ribocode

Switch `orf_txt` and `orf_txt_collapsed` from `*.txt` / `*_collapsed.txt` to `${prefix}.txt` / `${prefix}_collapsed.txt`. The previous `*.txt` glob matched both files into the same emit. The `prefix` binding is promoted out of `def` in both `script:` and `stub:` so it resolves at the output-glob stage; the Nextflow 26 strict parser rejects re-declaring the same local across the two blocks. Existing stub assertion adjusted from `process.out.orf_txt[0][1][0]` to `process.out.orf_txt[0][1]`.

Snapshot deltas

ribocode/prepare: gains two new file md5s (`.flat`, `.gdx`); existing md5s unchanged.
ribocode/ribocode: drops the duplicate `test_collapsed.txt` entry the old `*.txt` glob double-counted.

Source: nf-core/riboseq#174. Supersedes the ribocode half of #11684.
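The glob overlap behind the ribocode/ribocode change can be illustrated with Python's `fnmatch`, which agrees with Nextflow's output globbing for these simple patterns (an assumption for illustration; Nextflow's matcher is not `fnmatch`). The prefix `test` and the file name `test.txt` are assumed from the `test_collapsed.txt` snapshot entry, and `match_outputs` is a hypothetical helper.

```python
from fnmatch import fnmatch


def match_outputs(files, pattern):
    """Return the sorted file names that match a glob pattern."""
    return sorted(name for name in files if fnmatch(name, pattern))


# Assumed task outputs for a run with prefix "test".
files = ["test.txt", "test_collapsed.txt"]
prefix = "test"

# Old pattern: the wildcard also swallows the collapsed file.
old_orf_txt = match_outputs(files, "*.txt")

# New patterns: each emit resolves to exactly one file.
new_orf_txt = match_outputs(files, f"{prefix}.txt")
new_orf_txt_collapsed = match_outputs(files, f"{prefix}_collapsed.txt")

print(old_orf_txt)            # ['test.txt', 'test_collapsed.txt']
print(new_orf_txt)            # ['test.txt']
print(new_orf_txt_collapsed)  # ['test_collapsed.txt']
```

With one match per emit, the output is a single path rather than a list of paths, which is consistent with the stub assertion losing its trailing `[0]`.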