
Add opt-in planned parser for VCF FORMAT fields #2005

Open
jhl-oai wants to merge 39 commits into samtools:develop from jhl-oai:feature/vcf-parsing-speedup

Conversation


@jhl-oai jhl-oai commented May 8, 2026

Summary

TL;DR: A fast-path VCF FORMAT parser for well-defined FORMAT fields yields large speedups for GT-heavy VCF parsing: on real 1000G chr22 GT-only input, test_view improves from 25.08s to 10.27s (2.44x), and downstream bcftools view -Ob and bcftools query improve by 2-2.56x user time across --threads 0/2/4. On more general mixed-FORMAT cohort inputs the benefit is smaller but consistent: about 1.07-1.14x on GATK-output-like conversion/filter and reordered likelihood workloads, with unsupported shapes falling back to parity.

The default behavior is unchanged: the existing generic parser is used unless the caller explicitly sets:

HTS_VCF_FORMAT_PLAN=1

The current implementation takes the following approach: header-owned plan cache, tag/type-based compilation, conservative row execution, and whole-column fallback to the generic parser whenever a supported invariant is not met.

This is not intended to merge as-is with the current review controls. Before any merge, I would expect to either remove or replace the temporary environment gate/statistics surface and remove the interim review note from docs/.

This is an updated version of #2001. Compared to the earlier draft, this version removes the width-specific integer-vector micro-specializations and uses a single generic integer-vector row path instead. The executor now has only these row kinds:

GT2, INT1, INTVEC, FLOAT1, FLOATN, STR

The planned parser is optional and conservative. Unsupported schemas, dirty headers, malformed rows, unsupported GT shapes, unsafe widths, and unprofitable string/float-heavy layouts fall back for the whole FORMAT column.

Review Guide

An implementation note is in docs/FORMAT_PLAN_OVERVIEW.md; it should be removed from tracking prior to merge. To summarize its contents, the implementation in vcf.c consists of:

  • opt-in environment gating;
  • temporary diagnostic counters under HTS_VCF_FORMAT_PLAN_STATS=1;
  • a private bcf_hdr_aux_t FORMAT plan cache invalidated after bcf_hdr_sync();
  • a compiler that uses header tag metadata rather than exact whole-FORMAT string kernels;
  • row-local width resolution for fixed, allele-dependent, and bounded measured widths;
  • selected-sample handling that scans original sample columns and emits retained samples densely;
  • whole-column rollback/fallback before invoking the generic parser.

No public header or exported API is added for this planner state.

Correctness

The hard requirement is byte-identical BCF output compared with the generic parser. The planned path must not keep partial FORMAT output after detecting a fallback condition.

Focused coverage includes disabled/unknown environment behavior, selected samples, malformed FORMAT tokens, malformed numeric fields, phased missing GT forms such as .|., cache invalidation after header metadata changes, and fallback after partial planned parsing.

Current Performance Summary

These numbers are for commit bd643182c8fa722abbc0cb89860263a90bb97020.

test/test_view -b -l 0, mean of five baseline/planned repetitions, all 50 BCF comparisons ok:

Times in this table are user time, reported as mean +/- standard error.

| Input | Baseline | Planned | Speedup | Planner hits / fallback | Notes |
| --- | --- | --- | --- | --- | --- |
| 1000G chr22 full GT | 26.174 +/- 0.127 s | 10.750 +/- 0.136 s | 2.43x | 1,103,547 / 0 | Largest win; sample-rich GT-only rows stay fully planned. |
| CCDG 10k | 2.640 +/- 0.003 s | 2.356 +/- 0.012 s | 1.12x | 9,861 / 139 | Real cohort likelihood-style rows; long string rows fall back record-locally. |
| Large CCDG-like likelihood | 4.208 +/- 0.024 s | 3.824 +/- 0.014 s | 1.10x | 20,000 / 0 | Synthetic cohort likelihood rows with CCDG-like field mix. |
| Large reordered likelihood | 3.022 +/- 0.005 s | 2.532 +/- 0.005 s | 1.19x | 20,000 / 0 | Shows tag-level composition handles reordered FORMAT layouts. |
| Large multiallelic likelihood | 3.238 +/- 0.007 s | 2.860 +/- 0.004 s | 1.13x | 16,000 / 0 | Exercises allele-count-dependent vector widths. |
| Large float/string negative | 3.026 +/- 0.014 s | 3.022 +/- 0.022 s | 1.00x | 0 / 16,000 | Shows no regression when generic fallback is triggered. |
| Large phase-width variation | 2.680 +/- 0.004 s | 2.536 +/- 0.005 s | 1.06x | 12,000 / 0 | Exercises row-local width handling. |
| Mixed row-local likelihood | 2.238 +/- 0.005 s | 1.952 +/- 0.004 s | 1.15x | 12,000 / 0 | Exercises mixed row-local shapes while preserving planned execution. |
| GT-first reordered | 1.778 +/- 0.002 s | 1.488 +/- 0.002 s | 1.19x | 12,000 / 0 | Reordered layout with GT first. |
| Two-string float negative | 2.344 +/- 0.005 s | 2.340 +/- 0.005 s | 1.00x | 0 / 12,000 | Shows no regression when generic fallback is triggered. |

Downstream bcftools was rebuilt against this branch:

bcftools 1.23.1-63-gd9375776
Using htslib 1.23.1-58-gbd643182

Repeated streamed bcftools runs use cksum, alternate baseline/planned order by repetition, and compare baseline-vs-planned checksums.

Selected bcftools user-time results:

Each cell is baseline -> planned (speedup). --threads applies to the bcftools view command shapes only; bcftools query rows are unthreaded and therefore use threads=0.

| Dataset | Command | User t0 | User t2 | User t4 | Real t0 | Real t2 | Real t4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CCDG 10k | view_bcf | 2.716 -> 2.412 (1.13x) | 2.806 -> 2.480 (1.13x) | 2.776 -> 2.480 (1.12x) | 4.382 -> 4.032 (1.09x) | 3.160 -> 2.810 (1.12x) | 3.056 -> 2.852 (1.07x) |
| CCDG 10k | view_keep2_bcf | 2.588 -> 2.296 (1.13x) | 2.670 -> 2.376 (1.12x) | 2.662 -> 2.376 (1.12x) | 2.682 -> 2.414 (1.11x) | 2.362 -> 2.060 (1.15x) | 2.340 -> 2.046 (1.14x) |
| CCDG 10k | view_sites_bcf | 2.600 -> 2.298 (1.13x) | 2.644 -> 2.340 (1.13x) | 2.634 -> 2.338 (1.13x) | 2.718 -> 2.402 (1.13x) | 2.312 -> 2.008 (1.15x) | 2.310 -> 2.020 (1.14x) |
| CCDG 10k | query_gt_keep2 | 0.668 -> 0.422 (1.58x) | n/a | n/a | 0.716 -> 0.462 (1.55x) | n/a | n/a |
| CCDG 10k | query_gt_all | 2.942 -> 2.634 (1.12x) | n/a | n/a | 3.056 -> 2.748 (1.11x) | n/a | n/a |
| CCDG 10k | filter_gt_keep2 | 2.872 -> 2.572 (1.12x) | 2.988 -> 2.654 (1.13x) | 2.952 -> 2.662 (1.11x) | 2.994 -> 2.668 (1.12x) | 2.690 -> 2.330 (1.15x) | 2.632 -> 2.330 (1.13x) |
| 1000G chr22 full GT | view_bcf | 26.946 -> 10.316 (2.61x) | 27.088 -> 10.662 (2.54x) | 26.886 -> 10.846 (2.48x) | 33.056 -> 15.922 (2.08x) | 24.924 -> 11.700 (2.13x) | 24.674 -> 11.708 (2.11x) |
| 1000G chr22 full GT | view_keep2_bcf | 26.680 -> 10.260 (2.60x) | 27.016 -> 10.768 (2.51x) | 27.034 -> 10.662 (2.54x) | 27.076 -> 10.522 (2.57x) | 25.132 -> 8.850 (2.84x) | 25.130 -> 8.734 (2.88x) |
| 1000G chr22 full GT | view_sites_bcf | 25.894 -> 9.704 (2.67x) | 26.346 -> 10.244 (2.57x) | 26.410 -> 10.150 (2.60x) | 26.262 -> 9.984 (2.63x) | 24.432 -> 8.310 (2.94x) | 24.418 -> 8.242 (2.96x) |
| 1000G chr22 full GT | query_gt_keep2 | 5.728 -> 2.888 (1.98x) | n/a | n/a | 5.836 -> 2.998 (1.95x) | n/a | n/a |
| 1000G chr22 full GT | query_gt_all | 57.230 -> 40.938 (1.40x) | n/a | n/a | 58.820 -> 42.396 (1.39x) | n/a | n/a |
| 1000G chr22 full GT | filter_gt_keep2 | 48.782 -> 32.420 (1.50x) | 49.516 -> 33.116 (1.50x) | 51.412 -> 35.188 (1.46x) | 49.376 -> 33.010 (1.50x) | 48.132 -> 31.410 (1.53x) | 49.984 -> 33.958 (1.47x) |

codex added 8 commits April 30, 2026 17:25

  • Collapse the width-specific integer vector row kinds and parsers into a single INTVEC executor path, preserving whole-column fallback behavior for malformed or over-width fields.
  • Add planner docs covering the current control surface, cache and fallback contract, focused tests, and the full-corpus simplification benchmark delta.
@jhl-oai
Author

jhl-oai commented May 8, 2026

@jkbonfield, thanks again for the feedback on the earlier PR; I agree with your concern that the previous implementation looked over-specialized. I've reworked the branch to simplify it: in particular, the width-specific integer-vector micro-specializations have been removed, and the executor now uses a much smaller set of generic row kinds:

GT2, INT1, INTVEC, FLOAT1, FLOATN, STR

So the planner is still tag/type driven, but it is no longer trying to carry separate code paths for lots of narrow integer-vector widths.

The current draft keeps the same conservative approach: opt-in only, byte-identical BCF output vs the generic parser, and whole-FORMAT-column fallback if a row does not meet the supported invariants.

I also agree with your broader point about a better long-term VCF API, and thanks for linking your previous work on it. A lazy/tokenized/query-oriented VCF representation seems like the more principled direction if we want to avoid forcing full parse and type conversion of every record.

That said, it feels like a larger, potentially breaking API-level project, whereas this PR is trying to stay within the existing API/ABI and recover a fairly large amount of performance for common cohort FORMAT workloads now. The current numbers are still around 2.4x for test_view on 1000G GT-only chr22 and about 2.5-2.6x user-time for some downstream bcftools view paths, with smaller 1.1x-ish wins on mixed likelihood-style rows.

I’d be interested in whether the simplified version looks closer to something you'd consider a maintainable internal optimization.

@daviesrob
Member

Although it's not merged yet, you might want to take a look at our provisional contributing guide in #2003, and especially the AI policy, which is taken mostly from the Linux kernel one.

There are some interesting ideas here, but, apart from the GT-only case, the speed-up obtained seems rather modest compared to the complexity added. I suspect that some simpler, more targeted, changes might be able to give a similar result. A couple of avenues that may be worth exploring could be:

  • Profiling the existing parser suggests a lot of time is spent in a tight loop looking for various field separators. Optimising that could bring significant benefits. It's possible that a lot of your speed-up comes from bypassing that bit of code.
  • Keeping the specialised GT-only parser, as files like that certainly do exist in the wild. It should be fairly easy to work out when to trigger that case (only GT, and fewer than 10 ALT alleles). It could assume the same ploidy all the way through and then fall back to a slower method if that turns out not to be the case.

Both of those would likely be good targets for SIMD code.

If you want to try a more radical solution, it might also be worth trying to replace the existing parser with one that pivots the data in text form before converting it to binary. That might turn out to be quicker than the rather complicated solution we have now. We may or may not want to use the results depending on how well it worked, but it would be interesting to see how much effort would be needed in order to conduct the experiment.

@jkbonfield
Contributor

Given this is a big change to files that only have multi-sample GT records and a minimal speed change to everything else, do you or any other readers have any feeling for how frequent this scenario is?

Modern callers seem to output a plethora of format fields, or at the very least a genotype confidence value. I'm aware some of the original 1000 Genomes data was pure GT, but it feels like an outlier. If there is evidence of GT-only still being a viable and common format then we may consider it worth optimising, but personally I'm not convinced currently that this is a bottleneck for most users.

What was your use case that triggered this PR?

@jhl-oai
Author

jhl-oai commented May 11, 2026

@daviesrob Thanks for the pointer on the contribution guidelines; I wasn't aware of this. Re: the suggestions, I did try SIMD acceleration for the separator parsing but did not observe a substantial speedup; I'll have to dig up the details of that experiment. The pivot idea is interesting as well, and I will experiment with it.

@jkbonfield The motivating example was phasing/imputation, where the panels are often extremely large but contain only (phased) GTs; in the context of single-sample imputation/phasing, reference-panel parsing then accounts for the majority of runtime for tools such as SHAPEIT when using, e.g., the UKB. Specialized formats do exist for SHAPEIT and similar software, but it would be useful to have this fast path for generic VCF as well. I agree that this isn't so applicable to VCFs in the more general variant-calling context, like the outputs of joint calling.

More generally, the VCF API approach you suggested in #2001 (comment) does seem like the better long-term approach given the reasons you outlined. This might be something I can take a look at; I would imagine that in practice we would want to surface it as an opt-in additional API at least as a first pass.

@jkbonfield
Contributor

Thanks for the use case example. Food for thought.

I agree that any major API rewrite should be an additional thing sitting alongside the existing "legacy" API. My long-term strategy, which I never really got around to, was to add it directly to bcftools initially and start rewriting tools in the new API to see how it performed and to get some real experience with what's needed. That permits the API to change easily while under development. Once stable, it can then become part of htslib proper.
