
Add opt-in planned parser for VCF FORMAT fields #2005

Open
jhl-oai wants to merge 39 commits into samtools:develop from jhl-oai:feature/vcf-parsing-speedup

Conversation


@jhl-oai jhl-oai commented May 8, 2026

Summary

TL;DR: A fast-path VCF FORMAT parser for well-defined FORMAT fields yields large speedups for GT-heavy VCF parsing: on real 1000G chr22 GT-only input, test_view improves from 25.08s to 10.27s (2.44x), and downstream bcftools view -Ob and bcftools query improve by 2-2.56x user time across --threads 0/2/4. On more general mixed-FORMAT cohort inputs the benefit is smaller but consistent: about 1.07-1.14x on GATK-output-like conversion/filter and reordered likelihood workloads, with unsupported shapes falling back to parity.

The default behavior is unchanged: the existing generic parser is used unless the caller explicitly sets:

HTS_VCF_FORMAT_PLAN=1

The current implementation takes the following approach: header-owned plan cache, tag/type-based compilation, conservative row execution, and whole-column fallback to the generic parser whenever a supported invariant is not met.

This is not intended to merge as-is with the current review controls. Before any merge, I would expect to either remove or replace the temporary environment gate/statistics surface and remove the interim review note from docs/.

This is an updated version of #2001. Compared to the earlier draft, this version removes the width-specific integer-vector micro-specializations and uses a single generic integer-vector row path instead. The executor now has only these row kinds:

GT2, INT1, INTVEC, FLOAT1, FLOATN, STR

The planned parser is optional and conservative. Unsupported schemas, dirty headers, malformed rows, unsupported GT shapes, unsafe widths, and unprofitable string/float-heavy layouts fall back for the whole FORMAT column.

Review Guide

An implementation note is in docs/FORMAT_PLAN_OVERVIEW.md; it should be removed from tracking prior to merge. To summarize its contents, the implementation in vcf.c consists of:

  • opt-in environment gating;
  • temporary diagnostic counters under HTS_VCF_FORMAT_PLAN_STATS=1;
  • a private bcf_hdr_aux_t FORMAT plan cache invalidated after bcf_hdr_sync();
  • a compiler that uses header tag metadata rather than exact whole-FORMAT string kernels;
  • row-local width resolution for fixed, allele-dependent, and bounded measured widths;
  • selected-sample handling that scans original sample columns and emits retained samples densely;
  • whole-column rollback/fallback before invoking the generic parser.

No public header or exported API is added for this planner state.

Correctness

The hard requirement is byte-identical BCF output compared with the generic parser. The planned path must not keep partial FORMAT output after detecting a fallback condition.

Focused coverage includes disabled/unknown environment behavior, selected samples, malformed FORMAT tokens, malformed numeric fields, phased missing GT forms such as .|., cache invalidation after header metadata changes, and fallback after partial planned parsing.

Current Performance Summary

These numbers are for commit bd643182c8fa722abbc0cb89860263a90bb97020.

test/test_view -b -l 0, mean of five baseline/planned repetitions, all 50 BCF comparisons ok:

Times in this table are user time, reported as mean +/- standard error.

| Input | Baseline | Planned | Speedup | Planner hits / fallback | Notes |
| --- | --- | --- | --- | --- | --- |
| 1000G chr22 full GT | 26.174 +/- 0.127 s | 10.750 +/- 0.136 s | 2.43x | 1,103,547 / 0 | Largest win; sample-rich GT-only rows stay fully planned. |
| CCDG 10k | 2.640 +/- 0.003 s | 2.356 +/- 0.012 s | 1.12x | 9,861 / 139 | Real cohort likelihood-style rows; long string rows fall back record-locally. |
| Large CCDG-like likelihood | 4.208 +/- 0.024 s | 3.824 +/- 0.014 s | 1.10x | 20,000 / 0 | Synthetic cohort likelihood rows with CCDG-like field mix. |
| Large reordered likelihood | 3.022 +/- 0.005 s | 2.532 +/- 0.005 s | 1.19x | 20,000 / 0 | Shows tag-level composition handles reordered FORMAT layouts. |
| Large multiallelic likelihood | 3.238 +/- 0.007 s | 2.860 +/- 0.004 s | 1.13x | 16,000 / 0 | Exercises allele-count-dependent vector widths. |
| Large float/string negative | 3.026 +/- 0.014 s | 3.022 +/- 0.022 s | 1.00x | 0 / 16,000 | Shows no regression when generic fallback is triggered. |
| Large phase-width variation | 2.680 +/- 0.004 s | 2.536 +/- 0.005 s | 1.06x | 12,000 / 0 | Exercises row-local width handling. |
| Mixed row-local likelihood | 2.238 +/- 0.005 s | 1.952 +/- 0.004 s | 1.15x | 12,000 / 0 | Exercises mixed row-local shapes while preserving planned execution. |
| GT-first reordered | 1.778 +/- 0.002 s | 1.488 +/- 0.002 s | 1.19x | 12,000 / 0 | Reordered layout with GT first. |
| Two-string float negative | 2.344 +/- 0.005 s | 2.340 +/- 0.005 s | 1.00x | 0 / 12,000 | Shows no regression when generic fallback is triggered. |

Downstream bcftools was rebuilt against this branch:

bcftools 1.23.1-63-gd9375776
Using htslib 1.23.1-58-gbd643182

Repeated streamed bcftools runs use cksum, alternate baseline/planned order by repetition, and compare baseline-vs-planned checksums.

Selected bcftools user-time results:

Each cell is baseline -> planned (speedup). --threads applies to the bcftools view command shapes only; bcftools query rows are unthreaded and therefore use threads=0.

| Dataset | Command | User t0 | User t2 | User t4 | Real t0 | Real t2 | Real t4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CCDG 10k | view_bcf | 2.716 -> 2.412 (1.13x) | 2.806 -> 2.480 (1.13x) | 2.776 -> 2.480 (1.12x) | 4.382 -> 4.032 (1.09x) | 3.160 -> 2.810 (1.12x) | 3.056 -> 2.852 (1.07x) |
| CCDG 10k | view_keep2_bcf | 2.588 -> 2.296 (1.13x) | 2.670 -> 2.376 (1.12x) | 2.662 -> 2.376 (1.12x) | 2.682 -> 2.414 (1.11x) | 2.362 -> 2.060 (1.15x) | 2.340 -> 2.046 (1.14x) |
| CCDG 10k | view_sites_bcf | 2.600 -> 2.298 (1.13x) | 2.644 -> 2.340 (1.13x) | 2.634 -> 2.338 (1.13x) | 2.718 -> 2.402 (1.13x) | 2.312 -> 2.008 (1.15x) | 2.310 -> 2.020 (1.14x) |
| CCDG 10k | query_gt_keep2 | 0.668 -> 0.422 (1.58x) | n/a | n/a | 0.716 -> 0.462 (1.55x) | n/a | n/a |
| CCDG 10k | query_gt_all | 2.942 -> 2.634 (1.12x) | n/a | n/a | 3.056 -> 2.748 (1.11x) | n/a | n/a |
| CCDG 10k | filter_gt_keep2 | 2.872 -> 2.572 (1.12x) | 2.988 -> 2.654 (1.13x) | 2.952 -> 2.662 (1.11x) | 2.994 -> 2.668 (1.12x) | 2.690 -> 2.330 (1.15x) | 2.632 -> 2.330 (1.13x) |
| 1000G chr22 full GT | view_bcf | 26.946 -> 10.316 (2.61x) | 27.088 -> 10.662 (2.54x) | 26.886 -> 10.846 (2.48x) | 33.056 -> 15.922 (2.08x) | 24.924 -> 11.700 (2.13x) | 24.674 -> 11.708 (2.11x) |
| 1000G chr22 full GT | view_keep2_bcf | 26.680 -> 10.260 (2.60x) | 27.016 -> 10.768 (2.51x) | 27.034 -> 10.662 (2.54x) | 27.076 -> 10.522 (2.57x) | 25.132 -> 8.850 (2.84x) | 25.130 -> 8.734 (2.88x) |
| 1000G chr22 full GT | view_sites_bcf | 25.894 -> 9.704 (2.67x) | 26.346 -> 10.244 (2.57x) | 26.410 -> 10.150 (2.60x) | 26.262 -> 9.984 (2.63x) | 24.432 -> 8.310 (2.94x) | 24.418 -> 8.242 (2.96x) |
| 1000G chr22 full GT | query_gt_keep2 | 5.728 -> 2.888 (1.98x) | n/a | n/a | 5.836 -> 2.998 (1.95x) | n/a | n/a |
| 1000G chr22 full GT | query_gt_all | 57.230 -> 40.938 (1.40x) | n/a | n/a | 58.820 -> 42.396 (1.39x) | n/a | n/a |
| 1000G chr22 full GT | filter_gt_keep2 | 48.782 -> 32.420 (1.50x) | 49.516 -> 33.116 (1.50x) | 51.412 -> 35.188 (1.46x) | 49.376 -> 33.010 (1.50x) | 48.132 -> 31.410 (1.53x) | 49.984 -> 33.958 (1.47x) |

codex added 8 commits April 30, 2026 17:25

  • Collapse the width-specific integer vector row kinds and parsers into a single INTVEC executor path, preserving whole-column fallback behavior for malformed or over-width fields.
  • Add planner docs covering the current control surface, cache and fallback contract, focused tests, and the full-corpus simplification benchmark delta.
@jhl-oai
Author

jhl-oai commented May 8, 2026

@jkbonfield, thanks again for the feedback on the earlier PR; I agree with your concern that the previous implementation looked over-specialized. I've reworked the branch to simplify it: in particular, the width-specific integer-vector micro-specializations have been removed, and the executor now uses a much smaller set of generic row kinds:

GT2, INT1, INTVEC, FLOAT1, FLOATN, STR

So the planner is still tag/type driven, but it is no longer trying to carry separate code paths for lots of narrow integer-vector widths.

The current draft keeps the same conservative approach: opt-in only, byte-identical BCF output vs the generic parser, and whole-FORMAT-column fallback if a row does not meet the supported invariants.

I also agree with your broader point about a better long-term VCF API, and thanks for linking your previous work on it. A lazy/tokenized/query-oriented VCF representation seems like the more principled direction if we want to avoid forcing full parse and type conversion of every record.

That said, it feels like a larger, potentially breaking API-level project, whereas this PR is trying to stay within the existing API/ABI and recover a fairly large amount of performance for common cohort FORMAT workloads now. The current numbers are still around 2.4x for test_view on 1000G GT-only chr22 and about 2.5-2.6x user-time for some downstream bcftools view paths, with smaller 1.1x-ish wins on mixed likelihood-style rows.

I’d be interested in whether the simplified version looks closer to something you'd consider a maintainable internal optimization.

@daviesrob
Member

Although it's not merged yet, you might want to take a look at our provisional contributing guide in #2003, and especially the AI policy, which is taken mostly from the Linux kernel one.

There are some interesting ideas here, but, apart from the GT-only case, the speed-up obtained seems rather modest compared to the complexity added. I suspect that some simpler, more targeted, changes might be able to give a similar result. A couple of avenues that may be worth exploring could be:

  • Profiling the existing parser suggests a lot of time is spent in a tight loop looking for various field separators. Optimising that could bring significant benefits. It's possible that a lot of your speed-up comes from bypassing that bit of code.
  • Keeping the specialised GT-only parser, as files like that certainly do exist in the wild. It should be fairly easy to work out when to trigger that case (only GT, and fewer than 10 ALT alleles). It could assume the same ploidy all the way through and then fall back to a slower method if that turns out not to be the case.

Both of those would likely be good targets for SIMD code.

If you want to try a more radical solution, it might also be worth trying to replace the existing parser with one that pivots the data in text form before converting it to binary. That might turn out to be quicker than the rather complicated solution we have now. We may or may not want to use the results depending on how well it worked, but it would be interesting to see how much effort would be needed in order to conduct the experiment.

@jkbonfield
Contributor

Given this is a big change to files that only have multi-sample GT records and a minimal speed change to everything else, do you or any other readers have any feeling for how frequent this scenario is?

Modern callers seem to output a plethora of format fields, or at the very least a genotype confidence value. I'm aware some of the original 1000 Genomes data was pure GT, but it feels like an outlier. If there is evidence of GT-only still being a viable and common format then we may consider it worth optimising, but personally I'm not convinced currently that this is a bottleneck for most users.

What was your use case that triggered this PR?

@jhl-oai
Author

jhl-oai commented May 11, 2026

@daviesrob Thanks for the pointer on the contribution guidelines; I wasn't aware of this. Re: the suggestions, I did try SIMD acceleration for the separator parsing but did not observe a substantial speedup; I'll have to dig up the details of that experiment. The pivot idea is interesting as well, and I will experiment with it.

@jkbonfield The motivating example was phasing/imputation, where the panels are often extremely large but contain only (phased) GTs; in the context of single-sample imputation/phasing, reference-panel parsing then accounts for the majority of runtime for tools such as SHAPEIT when using, e.g., the UKB. Specialized formats do exist for SHAPEIT and similar software, but it would be useful to have this fast path for generic VCF as well. I agree that this isn't so applicable to VCFs in the more general variant-calling context, like the outputs of joint calling.

More generally, the VCF API approach you suggested in #2001 (comment) does seem like the better long-term approach given the reasons you outlined. This might be something I can take a look at; I would imagine that in practice we would want to surface it as an opt-in additional API at least as a first pass.

@jkbonfield
Contributor

Thanks for the use case example. Food for thought.

I agree that any major API rewrite should be an additional thing sitting alongside the existing "legacy" API. My long-term strategy, which I never really got around to, was to add it directly to bcftools initially and start rewriting tools in the new API to see how it performed and to get some real experience with what's needed. That permits the API to change easily while under development. Once stable, it can then become part of htslib proper.
