Add CRAM support #69

Open
DaGaMs wants to merge 4 commits into Bioconductor:devel from DaGaMs:devel

Conversation

DaGaMs commented May 6, 2026

This PR adds basic support for CRAM files. Newer versions of htslib handle CRAM and BAM files (almost) identically, so the changes come down to two things: an optional reference= parameter on BamFile, which is set on the underlying htsFile struct, and several adaptations to index handling, because CRAM indices differ somewhat from BAM indices.

One limitation is that mate-pair handling with asMates is not yet supported, because we can't read within the bgzf block in the same way as for BAM.

I added a test CRAM file and its reference, along with unit tests that check the new code paths work as expected.

DaGaMs mentioned this pull request May 6, 2026

Contributor

vjcitn commented May 6, 2026

Looking at this now.

Contributor

vjcitn commented May 6, 2026

So far it is looking good. I would ask you to bump the version number and add yourself with the aut role in Authors@R in DESCRIPTION. Some information on the provenance of the test CRAM resources would be good, even if trivial. User-visible documentation should indicate the new capability. You were clear that chunked sequential reading is not supported. GPT-5.2 told me that:

  • A CRAM file is organized into containers, each holding slices.
  • CRAI index entries essentially point to container start offsets (plus slice information and alignment span).

So you can support chunked reading by:

  • Mapping the “next chunk to read” to one or more containers/slices,
  • Seeking to the container start byte offset using htslib’s CRAM seek support,
  • Decoding records until you have yielded yieldSize records, then stopping,
  • Storing a resumable “cursor” state.
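
The steps in the list above can be sketched with a toy model. This is only an illustration of the resumable-cursor idea, not htslib or Rsamtools code: `ToyContainerFile`, `Cursor`, and `read_chunk` are hypothetical names, containers are plain Python lists keyed by a made-up byte offset, and "seek and decode" is simulated by a dictionary lookup.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cursor:
    """Resumable position: container number, and records already yielded from it."""
    container: int = 0
    skip: int = 0


class ToyContainerFile:
    """Toy stand-in for a CRAM file: maps a container start offset to its records."""

    def __init__(self, containers: Dict[int, List[str]]):
        self._containers = containers
        # Index analogue: container start offsets in file order.
        self.index: List[int] = sorted(containers)

    def decode(self, offset: int) -> List[str]:
        # Stand-in for "seek to the container start, then decode it".
        return self._containers[offset]


def read_chunk(f: ToyContainerFile, cursor: Cursor,
               yield_size: int) -> Tuple[List[str], Cursor]:
    """Return up to yield_size records starting at cursor, plus the new cursor."""
    out: List[str] = []
    ci, skip = cursor.container, cursor.skip
    while ci < len(f.index) and len(out) < yield_size:
        # Simplification: the whole container is decoded again on resume,
        # and records already yielded are skipped.
        records = f.decode(f.index[ci])
        take = records[skip: skip + (yield_size - len(out))]
        out.extend(take)
        skip += len(take)
        if skip >= len(records):
            # Container exhausted: move to the start of the next one.
            ci, skip = ci + 1, 0
    return out, Cursor(ci, skip)


if __name__ == "__main__":
    toy = ToyContainerFile({0: ["r1", "r2", "r3"],
                            100: ["r4", "r5"],
                            250: ["r6", "r7", "r8", "r9"]})
    cur = Cursor()
    while True:
        chunk, cur = read_chunk(toy, cur, yield_size=4)
        if not chunk:
            break
        print(chunk, cur)
```

The cursor here is just a pair (container number, records consumed), which is the minimum needed to resume when a chunk boundary falls inside a container. A real implementation would also have to decide how to handle slices and whether to cache a partially consumed container rather than decode it twice.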

GitHub Copilot offered to do some refactoring that would accommodate the yieldSize feature. Is this important? Would you want to embark on this aspect of upgrading Rsamtools?

Author

DaGaMs commented May 6, 2026

I updated the PR with those annotation changes and some basic documentation. As for the provenance of the test data: it's completely synthetic. I generated 3 random sequences, created simulated 2x150 bp reads from them, and aligned those reads to the synthetic reference with minimap2.
