Add opt-in planned parser for VCF FORMAT fields #2005
Conversation
Collapse the width-specific integer vector row kinds and parsers into a single INTVEC executor path, preserving whole-column fallback behavior for malformed or over-width fields. Add planner docs covering the current control surface, cache and fallback contract, focused tests, and the full-corpus simplification benchmark delta.
---
@jkbonfield, thanks again for the feedback on the earlier PR; I agree with your concern that the previous implementation looked over-specialized. I've reworked the branch to simplify it: the width-specific integer-vector micro-specializations have been removed, and the executor now uses a much smaller set of generic row kinds. The planner is still tag/type driven, but it no longer carries separate code paths for many narrow integer-vector widths. The current draft keeps the same conservative approach: opt-in only, byte-identical BCF output versus the generic parser, and whole-FORMAT-column fallback if a row does not meet the supported invariants. I also agree with your broader point about a better long-term VCF API, and thanks for linking your previous work on it. A lazy/tokenized/query-oriented VCF representation seems like the more principled direction if we want to avoid forcing a full parse and type conversion of every record. That said, it feels like a larger, potentially breaking API-level project, whereas this PR tries to stay within the existing API/ABI and recover a fairly large amount of performance for common cohort FORMAT workloads now. The current numbers are still around 2.4x for the GT-heavy case. I'd be interested in whether the simplified version looks closer to something you'd consider a maintainable internal optimization.
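To make the collapsed design concrete, here is a rough sketch of what the reduced row-kind set could look like. INTVEC is the single generic integer-vector path named in the PR description; the other identifiers and the classifier are hypothetical illustrations, not the branch's actual code.

```c
#include <string.h>

/* Hypothetical sketch of the collapsed row-kind set. INTVEC is the one
 * generic integer-vector path from the PR description; the other names
 * are illustrative, not the branch's actual identifiers. */
enum row_kind {
    ROW_GT,       /* genotype fast path */
    ROW_INTVEC,   /* single generic integer-vector path, any width */
    ROW_FALLBACK  /* anything unsupported: whole-column generic parse */
};

/* Toy classifier: tag/type driven, conservative by default. */
static enum row_kind classify(const char *tag, char type) {
    if (strcmp(tag, "GT") == 0) return ROW_GT;
    if (type == 'i') return ROW_INTVEC;  /* integer-typed FORMAT field */
    return ROW_FALLBACK;                 /* floats, strings, unknown */
}
```

The point of the sketch is that width never appears in the kind: any integer-vector field routes through one executor path, and everything not provably supported falls back whole-column.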
---
Although it's not merged yet, you might want to take a look at our provisional contributing guide in #2003, and especially the AI policy, which is taken mostly from the Linux kernel one. There are some interesting ideas here, but, apart from the GT-only case, the speed-up obtained seems rather modest compared to the complexity added. I suspect that some simpler, more targeted, changes might be able to give a similar result. A couple of avenues that may be worth exploring could be:
Both of those would likely be good targets for SIMD code. If you want to try a more radical solution, it might also be worth replacing the existing parser with one that pivots the data in text form before converting it to binary. That might turn out to be quicker than the rather complicated solution we have now. We may or may not want to use the results depending on how well it worked, but it would be interesting to see how much effort would be needed to conduct the experiment.
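To make the pivot suggestion concrete, here is a toy sketch (entirely hypothetical; not from the PR or htslib) of pivoting the text of a record's sample columns so that each FORMAT sub-field's tokens are gathered contiguously before any binary conversion:

```c
#include <string.h>

/* Toy "pivot first" sketch: given the tab-separated sample columns of
 * one record (a writable copy), collect every sample's token for each
 * FORMAT sub-field before converting anything to binary. All names and
 * limits are illustrative only. */
#define MAX_FIELDS  8
#define MAX_SAMPLES 16

/* pivot[field][sample] points into the (modified) samples buffer. */
static int pivot_samples(char *samples, int nfields,
                         char *pivot[MAX_FIELDS][MAX_SAMPLES]) {
    int ns = 0;
    for (char *col = strtok(samples, "\t"); col && ns < MAX_SAMPLES;
         col = strtok(NULL, "\t"), ns++) {
        int f = 0;
        char *tok = col;
        for (char *p = col; f < nfields; p++) {
            if (*p == ':' || *p == '\0') {
                char end = *p;
                *p = '\0';
                pivot[f++][ns] = tok;   /* record this sub-field token */
                tok = p + 1;
                if (end == '\0') break; /* column exhausted */
            }
        }
        while (f < nfields) pivot[f++][ns] = NULL; /* missing trailing fields */
    }
    return ns; /* number of sample columns seen */
}
```

After the pivot, each sub-field could be converted in one contiguous pass (e.g. all GT tokens, then all DP tokens), which is the property the suggested experiment would measure.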
---
Given this is a big change to files that only have multi-sample GT records and a minimal speed change to everything else, do you or any other readers have any feeling for how frequent this scenario is? Modern callers seem to output a plethora of FORMAT fields, or at the very least a genotype confidence value. I'm aware some of the original 1000 Genomes data was pure GT, but it feels like an outlier. If there is evidence that GT-only is still a viable and common format then we may consider it worth optimising, but personally I'm not currently convinced that this is a bottleneck for most users. What was your use case that triggered this PR?
---
@daviesrob Thanks for the pointer to the contribution guidelines; I wasn't aware of them. Re: the suggestions, I did try SIMD acceleration for the separator parsing but did not observe a substantial speedup; I'll have to dig up the details of that experiment. The pivot idea is interesting as well, and I will experiment with it. @jkbonfield The motivating example was phasing/imputation, where the reference panels are often extremely large but contain only (phased) GTs; in the single-sample imputation/phasing setting, the majority of runtime for e.g. SHAPEIT against a panel like the UKB ends up being reference-panel parsing. Specialized formats do exist for SHAPEIT and similar software, but it would be useful to have this fast path for generic VCF as well. I agree that this isn't as applicable to VCFs in the more general variant-calling context, such as the outputs of joint calling. More generally, the VCF API approach you suggested in #2001 (comment) does seem like the better long-term approach for the reasons you outlined. This might be something I can take a look at; I would imagine that in practice we would want to surface it as an opt-in additional API, at least as a first pass.
---
Thanks for the use-case example; food for thought. I agree that any major API rewrite should be an additional thing sitting alongside the existing "legacy" API. My long-term strategy, which I never really got around to, was to add it directly to bcftools initially and start rewriting tools in the new API, to see how it performed and to get some real experience with what's needed. That permits the API to change easily while under development. Once stable, it can then become part of htslib proper.
Summary
TL;DR: A fast-path VCF FORMAT parser for well-defined FORMAT fields yields large speedups for GT-heavy VCF parsing: on real 1000G chr22 GT-only input, `test_view` improves from 25.08s to 10.27s (2.44x), and downstream `bcftools view -Ob` and `bcftools query` improve by 2-2.56x user time across `--threads 0/2/4`. On more general mixed FORMAT cohort inputs the benefit is smaller but consistent: about 1.07-1.14x on GATK-output-like conversion/filter and reordered likelihood workloads, with unsupported shapes falling back to parity.

The default behavior is unchanged: the existing generic parser is used unless the caller explicitly opts in.
The current implementation takes the following approach: header-owned plan cache, tag/type-based compilation, conservative row execution, and whole-column fallback to the generic parser whenever a supported invariant is not met.
This is not intended to merge as-is with the current review controls. Before any merge, I would expect to either remove or replace the temporary environment gate/statistics surface and remove the interim review note from `docs/`.

This is an updated version of #2001. Compared to the earlier draft, this version removes the width-specific integer-vector micro-specializations and uses one generic integer-vector row path instead, reducing the executor to a small set of generic row kinds.
The planned parser is optional and conservative. Unsupported schemas, dirty headers, malformed rows, unsupported GT shapes, unsafe widths, and unprofitable string/float-heavy layouts fall back for the whole FORMAT column.
Review Guide
An implementation note is in `docs/FORMAT_PLAN_OVERVIEW.md`, which should be untracked prior to merge. It contains an overview of the implementation. To summarize, the implementation in `vcf.c` consists of:

- an optional statistics surface gated by `HTS_VCF_FORMAT_PLAN_STATS=1`;
- a `bcf_hdr_aux_t` FORMAT plan cache invalidated after `bcf_hdr_sync()`.

No public header or exported API is added for this planner state.
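One common way to realize an "invalidated after `bcf_hdr_sync()`" contract is a generation counter on the header. The sketch below uses assumed field and function names, not the actual `bcf_hdr_aux_t` layout in the branch:

```c
/* Sketch of generation-based plan-cache invalidation. Field names are
 * assumed stand-ins for the header-owned aux state, not htslib's real
 * bcf_hdr_aux_t members. */
typedef struct {
    unsigned hdr_gen;    /* bumped whenever header metadata changes */
    unsigned plan_gen;   /* generation the cached plan was built for */
    int      plan_valid; /* nonzero once a plan has been compiled */
} hdr_plan_cache;

/* Stand-in for the invalidation side effect of bcf_hdr_sync(). */
static void hdr_sync(hdr_plan_cache *c) { c->hdr_gen++; }

/* A cached plan is usable only if its generation still matches. */
static int plan_usable(const hdr_plan_cache *c) {
    return c->plan_valid && c->plan_gen == c->hdr_gen;
}

/* Compile (or recompile) the tag/type-driven plan for this header. */
static void build_plan(hdr_plan_cache *c) {
    /* ... tag/type compilation would happen here ... */
    c->plan_gen = c->hdr_gen;
    c->plan_valid = 1;
}
```

With this shape, a header edit never leaves a stale plan in use: the next lookup sees the generation mismatch and recompiles.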
Correctness
The hard requirement is byte-identical BCF output compared with the generic parser. The planned path must not keep partial FORMAT output after detecting a fallback condition.
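A minimal sketch of that contract (all names assumed, with toy parsers standing in for the real ones): record the output position before attempting the planned path, and on failure rewind to it before the generic whole-column re-parse.

```c
#include <string.h>

/* Toy output buffer and parsers illustrating the rollback contract. */
typedef struct { char buf[256]; size_t len; } outbuf;

/* Planned path: only accepts pure-digit columns, and deliberately
 * leaves partial output behind on failure to exercise the rollback. */
static int planned_parse(outbuf *out, const char *col) {
    size_t start = out->len;
    for (const char *p = col; *p; p++) {
        if (*p < '0' || *p > '9') {
            out->len = start + 1;    /* simulate partial planned output */
            return -1;               /* unsupported shape detected */
        }
        out->buf[out->len++] = *p;
    }
    return 0;
}

/* Generic path: always succeeds (stand-in for the existing parser). */
static int generic_parse(outbuf *out, const char *col) {
    size_t n = strlen(col);
    memcpy(out->buf + out->len, col, n);
    out->len += n;
    return 0;
}

/* The contract: any partial planned output is discarded before the
 * whole-column generic fallback, so the result is byte-identical to
 * a generic-only parse. */
static int parse_format_column(outbuf *out, const char *col) {
    size_t mark = out->len;              /* remember output position */
    if (planned_parse(out, col) == 0)
        return 0;                        /* fast path succeeded */
    out->len = mark;                     /* drop partial planned output */
    return generic_parse(out, col);      /* whole-column fallback */
}
```

The rewind-to-mark step is what makes the byte-identical requirement testable: after a fallback, the buffer must contain exactly what the generic parser alone would have produced.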
Focused coverage includes disabled/unknown environment behavior, selected samples, malformed FORMAT tokens, malformed numeric fields, phased missing GT forms such as `.|.`, cache invalidation after header metadata changes, and fallback after partial planned parsing.

Current Performance Summary
These numbers are for commit `bd643182c8fa722abbc0cb89860263a90bb97020`.

`test/test_view -b -l 0`, mean of five baseline/planned repetitions, all 50 BCF comparisons `ok`:

Times in this table are user time, reported as mean +/- standard error.
Downstream bcftools was rebuilt against this branch:
Repeated streamed bcftools runs use `cksum`, alternate baseline/planned order by repetition, and compare baseline-vs-planned checksums.

Selected bcftools user-time results:
Each cell is `baseline -> planned (speedup)`. `--threads` applies to the `bcftools view` command shapes only; `bcftools query` rows are unthreaded and therefore use `threads=0`.

Workloads: `view_bcf`, `view_keep2_bcf`, `view_sites_bcf`, `query_gt_keep2`, `query_gt_all`, `filter_gt_keep2`
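The alternating checksum comparison described above can be sketched as a toy harness. `run_pipeline` is a placeholder for the real rebuilt-bcftools command; the real harness would toggle the planner opt-in between the two modes, which this sketch does not model.

```shell
# Toy harness for the alternating baseline/planned checksum comparison.
run_pipeline() { printf 'identical stream\n'; }   # placeholder for bcftools

: > baseline.cksums
: > planned.cksums
for rep in 1 2 3 4 5; do
  # Alternate which mode runs first on each repetition.
  if [ $((rep % 2)) -eq 1 ]; then
    order="baseline planned"
  else
    order="planned baseline"
  fi
  for mode in $order; do
    run_pipeline | cksum >> "$mode.cksums"
  done
done
# Byte-identical output means the union of all checksums is a single line.
sort -u baseline.cksums planned.cksums
```

Alternating the run order per repetition keeps cache-warming effects from systematically favoring one mode in the timing comparison.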