feat(ribocode,ribotish): pyfasta indexes, prefix-scoped outputs, optional ribotish -a#11684

Closed
pinin4fjords wants to merge 1 commit into master from ribocode-ribotish-bundled-fixes

@pinin4fjords
Member

Bundles three in-place module changes carried in nf-core/riboseq#174. Each is self-contained and addresses a different pain point we hit running RiboCode / Ribo-TISH at scale.

ribocode/prepare

Pre-build the pyfasta .gdx/.flat indexes for annotation/transcripts_sequence.fa immediately after prepare_transcripts, using the same key_fn RiboCode applies internally (split on first space, otherwise split on |). The stub also touches the two new sidecars.
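
As a rough illustration of the key function described above, the sketch below reduces a FASTA header to the text before the first space or, when the header has no space, the text before the first `|`. The name `header_key` and the sample headers are hypothetical, and this is only one reading of the rule as worded here, not a copy of RiboCode's own code.

```python
def header_key(header: str) -> str:
    """Reduce a FASTA header to a record key.

    Assumed rule (one reading of "split on first space, otherwise split
    on |"): keep the text before the first space; if the header has no
    space, keep the text before the first '|'.
    """
    if " " in header:
        return header.split(" ", 1)[0]
    return header.split("|", 1)[0]


# Hypothetical headers, for illustration only.
print(header_key("ENST00000000001 gene=ABC"))      # ENST00000000001
print(header_key("ENST00000000001|ENSG0001|ABC"))  # ENST00000000001
```

A function of this shape would be passed as `key_fn` when the FASTA is opened, so that the pre-built index uses the same record names the downstream reader expects.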

Why: downstream RiboCode steps open the FASTA with pyfasta, which lazily writes .gdx/.flat next to the input on first read. Under Fusion staging those writes land back at the upstream task's S3 prefix and silently corrupt the staged copy on retries. Building the indexes inside the producing task fixes it.
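
The write-back can be reproduced locally with plain symlinks. The sketch below assumes the upstream `annotation/` directory is staged into the consuming task as a symlink; the `producer` and `consumer` directory names are hypothetical, and the write is a stand-in for a lazy index build. Fusion and S3 are not modelled.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one work directory per task.
    producer = os.path.join(root, "producer")
    consumer = os.path.join(root, "consumer")
    os.makedirs(os.path.join(producer, "annotation"))
    os.makedirs(consumer)

    fasta = os.path.join(producer, "annotation", "transcripts_sequence.fa")
    with open(fasta, "w") as fh:
        fh.write(">tx1\nACGT\n")

    # Stage the producer's output directory into the consumer by symlink.
    staged_dir = os.path.join(consumer, "annotation")
    os.symlink(os.path.join(producer, "annotation"), staged_dir)

    # Stand-in for a lazy index write next to the staged FASTA.
    staged_fasta = os.path.join(staged_dir, "transcripts_sequence.fa")
    with open(staged_fasta + ".flat", "w") as fh:
        fh.write("ACGT")

    # The sidecar was written through the symlink into the producer's dir.
    leaked = os.path.exists(fasta + ".flat")
    print(leaked)  # True
```

If the sidecars already exist when the producing task finishes, the consumer has nothing left to write through the staged path.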

ribocode/ribocode

Switch the orf_txt and orf_txt_collapsed output globs from *.txt / *_collapsed.txt to ${prefix}.txt / ${prefix}_collapsed.txt so multi-instance publication is unambiguous (*.txt previously matched both the all-ORFs and collapsed files into the same emit). The prefix binding is promoted out of def in both script: and stub: so it resolves at the output-glob stage; the Nextflow 26 strict parser rejects re-declaring the same local with def across the two blocks.

The existing stub assertion that indexed process.out.orf_txt[0][1][0] is corrected to the new single-file shape (process.out.orf_txt[0][1]).
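
The overlap between the old globs can be checked with ordinary glob matching. This is a minimal sketch in Python rather than Nextflow, using `fnmatch` as a stand-in for output-glob resolution; the file list and the `test` prefix are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical work-dir contents for a task whose prefix is "test".
files = ["test.txt", "test_collapsed.txt", "test.pdf"]
prefix = "test"


def matches(pattern):
    """Return the files selected by a glob-style pattern, sorted."""
    return sorted(f for f in files if fnmatch(f, pattern))


old_orf_txt = matches("*.txt")
new_orf_txt = matches(f"{prefix}.txt")
new_collapsed = matches(f"{prefix}_collapsed.txt")

print(old_orf_txt)    # ['test.txt', 'test_collapsed.txt']
print(new_orf_txt)    # ['test.txt']
print(new_collapsed)  # ['test_collapsed.txt']
```

With two matches the old `orf_txt` emit carried a list of files, which is why the stub assertion indexed `[0][1][0]`; with exactly one match it carries a single path, hence `[0][1]`.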

ribotish/predict

Breaking signature change. The third input tuple gains an optional fourth element, reference_gtf, plumbed through to ribotish predict as -a <gtf> when populated:

tuple val(meta3), path(fasta), path(gtf), path(reference_gtf, stageAs: 'secondary.gtf')

Callers must supply a fourth element on every emit. Pass [] for the no-op case (no secondary annotation). The existing test cases in this PR are migrated that way; positive-coverage tests for the populated path will land in a follow-up.
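
The conditional plumbing amounts to appending `-a <gtf>` only when the fourth element is populated. The sketch below shows that logic in Python rather than in the module's Groovy; `predict_args` and `base` are hypothetical names, and all other `ribotish predict` options are deliberately left out.

```python
def predict_args(base_args, reference_gtf=None):
    """Append `-a <gtf>` only when a secondary annotation is supplied.

    `base_args` stands in for whatever arguments the module already
    passes; an empty list or None for `reference_gtf` mirrors the `[]`
    no-op case.
    """
    args = list(base_args)
    if reference_gtf:
        args += ["-a", str(reference_gtf)]
    return args


base = ["ribotish", "predict"]  # other options omitted in this sketch
print(predict_args(base, []))
# ['ribotish', 'predict']
print(predict_args(base, "secondary.gtf"))
# ['ribotish', 'predict', '-a', 'secondary.gtf']
```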

Why: Ribo-TISH's -a argument is the documented hook for layering a secondary annotation (e.g. MANE/RefSeq) on top of the primary GTF, and we want to expose it from the module without a second optional input tuple.

Test plan

All three modules pass under Docker on a c5.9xlarge VM with nf-core 4.0.2 / nextflow 26.04.1 / nf-test 0.9.5:

nf-core modules test --profile docker ribocode/prepare
nf-core modules test --profile docker ribocode/ribocode
nf-core modules test --profile docker ribotish/predict

Snapshot deltas:

  • ribocode/prepare: non-stub snapshot gains the two new file md5s (transcripts_sequence.fa.flat, transcripts_sequence.fa.gdx); existing files' md5s unchanged.
  • ribocode/ribocode: orf_outputs snapshot drops the duplicate test_collapsed.txt entry that the old *.txt glob had pulled into orf_txt; everything else unchanged.
  • ribotish/predict: no snapshot change (the [] migration is a no-op at runtime).

Source: nf-core/riboseq#174

…onal ribotish -a

Bundles three in-place module changes carried in nf-core/riboseq#174.

ribocode/prepare: pre-build the pyfasta .gdx/.flat indexes for
annotation/transcripts_sequence.fa using the same key_fn RiboCode applies
internally (split on first space, else split on '|'). Downstream RiboCode
tasks otherwise lazily build those sidecars inside the staged input
directory, which fails under Fusion staging because writes leak back to
the upstream task's S3 prefix.

ribocode/ribocode: scope the orf_txt and orf_txt_collapsed output globs to
${prefix}.txt and ${prefix}_collapsed.txt rather than *.txt/*_collapsed.txt
so multi-instance publication is unambiguous. The prefix binding is
promoted out of `def` in both the script and stub blocks so it resolves at
the output-glob stage (Nextflow 26 strict parser rejects redeclaration of
the same name across script/stub if either uses `def`). The existing
stub-test assertion that indexed orf_txt[0][1][0] is adjusted to the new
single-file shape.

ribotish/predict: extend the fasta/gtf input tuple with an optional fourth
path, reference_gtf, plumbed to ribotish predict as `-a <gtf>` when
populated. BREAKING signature change for callers: every emitter must
supply a fourth element in the third tuple (use `[]` for the no-op case).

Source: nf-core/riboseq#174

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pinin4fjords
Member Author

Superseded by the per-module splits:

Closing this bundled draft. Branch ribocode-ribotish-bundled-fixes will be pruned once the splits land.

@pinin4fjords pinin4fjords deleted the ribocode-ribotish-bundled-fixes branch May 18, 2026 15:48
maxulysse pushed a commit to maxulysse/nf-core_modules that referenced this pull request May 19, 2026
…-core#11685)

* feat(ribocode): pre-build pyfasta indexes + prefix-scoped outputs

Two related changes carried in nf-core/riboseq#174 and split out of the
bundled PR nf-core#11684.

ribocode/prepare: pre-build the pyfasta `.gdx`/`.flat` indexes for
`annotation/transcripts_sequence.fa` immediately after `prepare_transcripts`,
using the same `key_fn` RiboCode applies internally (split on first space,
otherwise split on `|`). Stub touches the two new sidecars.

Why: downstream RiboCode steps open the FASTA with pyfasta, which lazily
writes `.gdx`/`.flat` next to the input on first read. Under Fusion staging
those writes land back at the upstream task's S3 prefix and silently
corrupt the staged copy on retries. Building the indexes inside the
producing task fixes it.

ribocode/ribocode: switch the `orf_txt` and `orf_txt_collapsed` output
globs from `*.txt` / `*_collapsed.txt` to `${prefix}.txt` /
`${prefix}_collapsed.txt` so multi-instance publication is unambiguous
(`*.txt` previously matched both files into the same emit). The `prefix`
binding is promoted out of `def` in both `script:` and `stub:` so it
resolves at the output-glob stage; the Nextflow 26 strict parser rejects
re-declaring the same local with `def` across both blocks. The existing
stub assertion at `process.out.orf_txt[0][1][0]` is corrected to the new
single-file shape (`process.out.orf_txt[0][1]`).

Source: nf-core/riboseq#174

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(ribocode/prepare): reframe pyfasta pre-build comment

The lazy pyfasta sidecar write isn't Fusion-specific - it's a Nextflow
symlink-staging concern that affects any backend (writes leak back to
the producer task's work dir via the staged-input symlink).

Rewording the inline comment to match. No code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(ribocode/prepare): use RiboCode's GenomeSeq for pyfasta pre-build

Replace the inline 8-line python heredoc (which replicated RiboCode's
`get_chrom` key_fn verbatim) with a single `python -c` line that imports
and instantiates `RiboCode.prepare_transcripts.GenomeSeq` directly. The
class constructor itself runs `Fasta(filename, key_fn=get_chrom)` with
the same key function, so we drop the replication while producing
byte-identical .gdx/.flat sidecars (md5-verified on the realistic FASTA
format prepare_transcripts emits).

No snapshot change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maxulysse pushed a commit to maxulysse/nf-core_modules that referenced this pull request May 19, 2026
…s + 0.2.8 (nf-core#11686)

* feat(ribotish/predict): add optional secondary reference GTF for -a

Carried in nf-core/riboseq#174 and split out of the bundled PR
nf-core#11684.

**Breaking signature change.** The third input tuple gains an optional
fourth element, `reference_gtf`, plumbed through to `ribotish predict`
as `-a <gtf>` when populated:

    tuple val(meta3), path(fasta), path(gtf), path(reference_gtf, stageAs: 'secondary.gtf')

Callers must supply a fourth element on every emit. Pass `[]` for the
no-op case (no secondary annotation). The existing test cases in this
PR are migrated that way; positive-coverage tests for the populated
path will land in a follow-up.

Why: Ribo-TISH's `-a` argument is the documented hook for layering a
secondary annotation (e.g. MANE/RefSeq) on top of the primary GTF, and
we want to expose it from the module without a second optional input
tuple.

Source: nf-core/riboseq#174

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(ribotish/predict): optional reference_gtf in its own tuple + topics versions + bump 0.2.8

Three coupled cleanups in response to the lint feedback on PR nf-core#11686:

1. Move the new `reference_gtf` input out of the existing fasta/gtf tuple
   and into its own optional input tuple (meta7) - the convention this
   module already uses for `bam_ti`, `candidate_orfs`, `para_ribo`, and
   `para_ti`. The existing `(meta3, fasta, gtf)` signature is preserved,
   so callers no longer need to grow that tuple; they wire a separate
   `Channel.of([[], []])` (or a populated channel) into the new slot.

2. Migrate version reporting from the legacy `versions.yml` heredoc to the
   new topic-based emission (`tuple val("${task.process}"), val('ribotish'),
   eval('...'), topic: versions, emit: versions_ribotish`). The
   `versions.yml` heredoc is removed from both `script:` and `stub:`.
   `meta.yml` regenerated by `nf-core modules lint --fix` to add the
   `topics:` block and reshape the `versions_ribotish` output entry.

3. Bump ribotish from 0.2.7 to 0.2.8 (bioconda; build hash unchanged).

Test snapshot regenerated under `--update`: versions snapshot key renamed
from `versions_*` to `versions_ribotish_*`, version string updated to
`0.2.8`. Prediction-table assertions unchanged - 0.2.8 is a patch release.

Source: nf-core/riboseq#174

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ribotish/predict): consolidate to one unnamed snapshot per test

Per SPPearce's review comment on nf-core#11686: each test should
have a single anonymous snapshot() call rather than multiple named ones.

Non-stub tests roll `transprofile` + the topic-versions findAll into one
snapshot; the existing `predictions` / `all` contains() row checks are
kept as separate assertions (they pin specific known-good output rows
and aren't redundant with the snapshot).

Stub tests roll `predictions` + `all` + `transprofile` + versions into
one snapshot.

Versions are referenced via the canonical
`process.out.findAll { key, val -> key.startsWith('versions') }`
pattern (653 modules in nf-core/modules use it vs 53 with explicit
`process.out.versions_<tool>`).
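
For readers less familiar with Groovy, the filter above selects every named output whose key starts with `versions`. A rough Python analogue follows; the `process_out` dictionary and its values are hypothetical stand-ins, not real test output.

```python
# Hypothetical stand-in for the named outputs of a process under test.
process_out = {
    "predictions": ["example_pred.txt"],
    "versions_ribotish": [["EXAMPLE_PROCESS", "ribotish", "0.2.8"]],
}

# Keep only the entries whose key starts with "versions".
versions = {k: v for k, v in process_out.items() if k.startswith("versions")}
print(sorted(versions))  # ['versions_ribotish']
```

Because the match is on the key prefix, the same snapshot expression keeps working whether the output is named `versions` or `versions_<tool>`.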

Snapshot keys are now the test names directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>