Polyploid HLA Typing

End-to-end pipeline for chimeric (k=4) HLA typing of allo-HSCT and solid-organ transplant samples from short-read FASTQs. Outputs 4 haplotype sequences per gene tagged R(ecipient) / D(onor).

Installation & environment: INSTALL.md

1. Files

File	Role
`polyphase_v2.sh`	driver — run this
`hla_polyphase_assemble.py`	baseline typing engine (whatshap polyphase + IMGT scoring)
`reassign_gt_chimeric.py`	χ-aware GT correction before phasing
`estimate_chi_pooled.py`	pooled-continuous χ_R estimator
`iterative_remap_em.py`	EM refinement (Salmon-style read remap)
`rescue_gene_binned_reads.py`	conservative read-bin rescue v2 before per-gene alignment
`apply_class2_joint_rescue.py`	guarded class-II post-aggregate rescue
`em_refine_gate.py`	per-gene EM override gate logic
`aggregate_calls.py`	merges per-gene `calls.tsv` into one summary table
`evaluate_calls.py`	compares `<SAMPLE>.final_calls.tsv` with `truth_typing.tsv` at 2-field and G group resolution
`exon_typing_from_haps.py`	exon-level G group fallback/diagnostic for high-mask genes
`build_resource_indexes.sh`	rebuilds HLA resource indexes when files are missing or a custom resource set is used
`gene.spechla.bed`	per-gene typing region on bundled `hla.ref.extend.fa`
`resources/spechla/`	bundled SpecHLA-derived helper scripts and HLA reference files
`environment.yml`	conda environment spec
`octopus_to_imgt.py`, `caller_free_4hap.py`	rejected alternatives, kept for reference

2. Install

See INSTALL.md. Short version:

conda env create -f environment.yml
conda activate polyploid-hla
# HLA reference resources needed by the pipeline are bundled under
# resources/spechla. Set SPECHLA=/path/to/custom/resources only
# when intentionally overriding them.

# Optional: repair/rebuild bundled resource indexes.
bash build_resource_indexes.sh

3. Quick start

From the repository root:

FQ1=/path/sample.R1.fq.gz \
FQ2=/path/sample.R2.fq.gz \
SAMPLE=mySample \
RECIPIENT_MAJOR=0 \
bash polyphase_v2.sh

Final result — one file, six primary genes plus optional DRB345:

column -t asm_v2/mySample/mySample.final_calls.tsv

Example:

sample      gene      R1            R2            D1            D2            source
mySample    HLA-A     A*01:01:01:01 A*23:01:01:01 A*01:01:01:01 A*29:02:01:02 em-refined
mySample    HLA-B     B*08:01:01:01 B*44:03:01:01 B*08:01:01:01 B*45:01:01:01 em-refined
...

The file keeps both high-resolution calls and conservative report calls:

*_full: full allele chosen by the pipeline.
*_2field: allele collapsed to 2-field, useful when truth is low resolution.
*_g_group: allele converted through SpecHLA hla_nom_g.txt.
*_report: equals *_full by default; automatically downgraded to 2-field when a gene has high masked sequence fraction.
*_fraction: modelled haplotype proportion for the reported R1/R2/D1/D2 call. In the standard 2+2 chimerism model, R haplotypes are chi_R / 2 each and D haplotypes are (1 - chi_R) / 2 each; EM/direct modes use the fitted gene-specific chi when available.
*_read_fraction: allele-family read support fraction from EM read assignment (tf_counts.tsv). This is not forced to R1/R2 or D1/D2 being 1:1 and can differ between two alleles from the same person.
*_read_count: EM-assigned effective read weight for the same 2-field allele family.
*_copy_fraction_fit: fitted copy-level fraction for each R1/R2/D1/D2 slot. These four values sum to 1 when read support is available. The fit minimizes allele-family read-support error under x >= 0 and sum(x)=1, using the sample/gene chi as a weak regularizer when duplicated alleles make the copy split underdetermined.
copy_fit_error, copy_identifiability, copy_chi_r, and allele_support_fraction_sum: diagnostics for the copy-fraction fit. boundary_zero means the constrained fit placed one slot at zero because it lacked independent support; it is not a filled missing value.
mean_mask_fraction, report_level, warning: explain why a gene was reported at full vs. 2-field resolution.

The pipeline also writes a concise companion file, <SAMPLE>.final_calls.compact.tsv, with only the sample, gene, four reported alleles, four fitted copy fractions, and fit diagnostics. The original <SAMPLE>.final_calls.tsv remains the detailed result file.

When DRB345_TYPING=1 (default), the pipeline also appends an HLA-DRB345 row. This is not a seventh ordinary locus: it is a DRB1-linked add-on for the DRB3/DRB4/DRB5 genes. The add-on extracts read pairs with competitive DB support for DRB3/4/5, EM-remaps them to a combined DRB345 allele set, and uses the final DRB1 haplotypes to decide whether each R1/R2/D1/D2 haplotype should carry DRB3, DRB4, DRB5, or no DRB345 gene. It does not change the six primary gene calls. DRB345 DB-read extraction accepts bowtie2-style alignment scores, where the best score is often 0 and imperfect hits are negative. If the DRB1 row is high-mask / low-confidence, the add-on switches to an evidence-first mode: long locus-unique k-mers decide which DRB3/4/5 loci have credible support, then EM allele fractions are fit to the R/R/D/D dose model without hard DRB1 linkage. This lets DRB345 remain callable when DRB1 itself is unreliable.

The per-gene FASTAs (hap{1..4}.fa) and raw calls.tsv are still kept under asm_v2/<SAMPLE>/<gene_lc>/<HLA-X>/ for inspection.

4. Required inputs

Var	Meaning
`FQ1`, `FQ2`	paired short-read FASTQs (gz ok)
`SAMPLE`	sample id (used for output dirs)
`RECIPIENT_MAJOR`	`0` = donor major (post-HSCT blood, default); `1` = recipient major (solid-organ / pre-transplant)

Optional environment / database overrides:

Var	Default	Meaning
`SPECHLA`	`resources/spechla`	bundled HLA resource root; override only for a custom database
`PYBIN`	first `python` on PATH	python binary
`WHATSHAP`	first `whatshap` on PATH	whatshap binary
`FREEBAYES`	first `freebayes` on PATH	freebayes binary; use 1.3.6 if newer builds abort
`THREADS`	`8`	threads for bowtie2, BWA, whatshap, EM remap, and samtools helper steps
`SAMTOOLS_THREADS`	`$THREADS`	threads for samtools view/sort/index/merge
`WORK_DIR`	parent of this repository	base for output dirs
`OUT_ROOT`	`${WORK_DIR}/spechla_out`	per-sample alignments + VCFs
`ASM_ROOT`	`${WORK_DIR}/asm_v2`	typing outputs
`EXON_TYPING`	`1`	also write exon-level fallback diagnostics (`<SAMPLE>.exon_calls.tsv`)
`BOWTIE2_MODE`	`very-sensitive`	bowtie2 preset for IMGT competitive mapping; use `sensitive` for faster exploratory runs
`BOWTIE2_K`	`30`	max alignments reported per read pair during IMGT competitive mapping
`ASSEMBLE_ALIGNER`	`parasail`	base-level scorer; `mappy` is faster but less exact
`ASSEMBLE_PREFILTER_TOP`	`200`	mappy prefilter size before parasail scoring; smaller is faster
`EM_REFINE_PER_GENE_CHI`	`0`	experimental; fixed pooled/global χ is the recommended default
`EM_REFINE_RECIPIENT_MINOR_RESCUE`	`1`	recover low-frequency recipient-only alleles when donor-major EM fitting collapses R/D to the donor-like pair
`READ_BIN_RESCUE`	`1`	run truth-free read-bin rescue before per-gene alignment
`READ_BIN_RESCUE_GENES`	`HLA-DPB1`	default target for conservative rescue; broaden only for validation
`READ_BIN_RESCUE_RETENTION_GATE`	`1`	require abnormal full-vs-retained read support loss before rewriting a bin
`REUSE_BINNING_ROOT`	empty	seed deduped FASTQs, DB BAM, per-gene FASTQs, and `header.sam` from a prior run
`REUSE_BINNING_CLEAN_DOWNSTREAM`	`0`	remove downstream outputs after seeding cache when intentionally recomputing calls

The options above cover the recommended user-facing settings.

The default read-bin rescue is deliberately conservative: DPB1-only, paired-mate evidence, and retention-gated. Broad all-gene rescue addresses a real binning-loss failure mode but can increase phasing cost and should be run only as an explicit validation mode.

If indexes are missing after copying or replacing the resource directory, run:

bash build_resource_indexes.sh --resources "${SPECHLA:-resources/spechla}"

For exploratory reruns where speed matters more than final reporting, a useful starting point is:

THREADS=16 BOWTIE2_MODE=sensitive BOWTIE2_K=15 ASSEMBLE_PREFILTER_TOP=100 \
bash polyphase_v2.sh

5. Tuning per sample

The pipeline auto-estimates χ from the data. Defaults work for χ_R in [0.05, 0.50]. For boundary cases:

Situation	Override
χ_R < 0.03 (very deep chimerism)	`FB_MIN_AF=0.005 FB_MIN_AC=2`
χ_R > 0.50 (recipient majority)	`RECIPIENT_MAJOR=1`
Coverage < 50×	`MASK_MIN_DEPTH=3`
Coverage > 500×	`MASK_MIN_DEPTH=10`

For most samples, keep the defaults and only adjust the overrides above when the sample falls into one of those boundary cases.

6. Outputs

asm_v2/<SAMPLE>/
    <SAMPLE>.final_calls.tsv          ★ FINAL aggregated result (one row per gene)
    <SAMPLE>.exon_calls.tsv           exon-level G group diagnostic for high-mask genes
    <gene_lc>/<HLA-X>/
        calls.tsv                     per-gene final 4-hap call (R/D-tagged)
        calls.baseline.tsv            baseline before EM refinement (if overridden)
        hap{1..4}.fa                  per-haplotype masked FASTA

spechla_out/<SAMPLE>/                 intermediate alignments + variants
    <SAMPLE>.merge.bam, .freebayes.vcf.gz, .pooled_continuous.vcf.gz, ...
    <SAMPLE>.chimerism.txt            χ from AD-cluster estimator
    <SAMPLE>.chi_pooled.txt           χ from pooled-continuous (per gene)
    read_bin_rescue_manifest.tsv      rescue counts, retention gate, final status
    class2_joint_rescue_manifest.tsv  guarded class-II rescue audit trail
    em_refine/<gene>.{calls,summary,iterative}.tsv
    drb345/                          DRB3/4/5 linked add-on typing outputs

<SAMPLE>.final_calls.tsv columns: sample | gene | R1_full | R2_full | D1_full | D2_full | R1_2field | ... | R1_g_group | ... | R1_report | ... | R1_fraction | R2_fraction | D1_fraction | D2_fraction | R1_read_fraction | ... | R1_read_count | ... | R1_copy_fraction_fit | ... | copy_fit_error | copy_identifiability | copy_chi_r | allele_support_fraction_sum | source | mean_mask_fraction | report_level | warning.
<SAMPLE>.final_calls.compact.tsv columns: sample | gene | R1_allele | R1_copy_fraction | R2_allele | R2_copy_fraction | D1_allele | D1_copy_fraction | D2_allele | D2_copy_fraction | copy_identifiability | copy_fit_error.
Per-gene calls.tsv columns: global_hap | assignment(R/D) | allele | hap_fraction | allele_read_fraction | allele_read_count | em_weight for EM-refined calls, or ... | hap_fraction | total_assembly_score for baseline assembly calls. DRB345 add-on calls are stored under asm_v2/<SAMPLE>/hla-drb345/HLA-DRB345/calls.tsv with an extra drb1_linked_locus column.

If truth is available, evaluate with:

python evaluate_calls.py \
  --truth truth/truth_typing.tsv \
  --calls asm_v2/mySample/mySample.final_calls.tsv

Evaluation reports only 2field and g_group accuracy. It intentionally does not score 3-field because many truth entries are 2-field or G group resolution. For G group scoring, truth alleles that cannot be uniquely mapped through hla_nom_g.txt remain at 2-field resolution instead of being treated as false mismatches.

7. Re-running idempotently

SKIP_DONE=1 (default) skips steps whose output already exists. To force a re-run from a specific step, delete its output and re-invoke the driver:

rm spechla_out/mySample/mySample.freebayes.vcf.gz   # re-do variant call
rm -r asm_v2/mySample                               # re-do typing
bash polyphase_v2.sh

For expensive real-data replays, reuse the dedup / competitive DB map / per-gene binning outputs and recompute only downstream steps:

REUSE_BINNING_ROOT=/path/to/previous/spechla_out \
REUSE_BINNING_CLEAN_DOWNSTREAM=1 \
SKIP_DONE=1 \
bash polyphase_v2.sh

When reporting accuracy, regenerate evaluation from the current final_calls.tsv; stale *.eval.txt files can describe an older call set.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Polyploid HLA Typing

1. Files

2. Install

3. Quick start

4. Required inputs

5. Tuning per sample

6. Outputs

7. Re-running idempotently

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
resources/spechla		resources/spechla
.gitignore		.gitignore
INSTALL.md		INSTALL.md
README.md		README.md
aggregate_calls.py		aggregate_calls.py
apply_class2_joint_rescue.py		apply_class2_joint_rescue.py
apply_direct_constrained_gate.py		apply_direct_constrained_gate.py
build_resource_indexes.sh		build_resource_indexes.sh
caller_free_4hap.py		caller_free_4hap.py
direct_quartet_likelihood_batch.py		direct_quartet_likelihood_batch.py
em_refine_gate.py		em_refine_gate.py
environment.yml		environment.yml
estimate_chi_pooled.py		estimate_chi_pooled.py
evaluate_calls.py		evaluate_calls.py
exon_typing_from_haps.py		exon_typing_from_haps.py
gene.spechla.bed		gene.spechla.bed
hla_polyphase_assemble.py		hla_polyphase_assemble.py
iterative_remap_em.py		iterative_remap_em.py
octopus_to_imgt.py		octopus_to_imgt.py
polyphase_v2.sh		polyphase_v2.sh
reassign_gt_chimeric.py		reassign_gt_chimeric.py
rescue_dpb1_family_reads.py		rescue_dpb1_family_reads.py
rescue_gene_binned_reads.py		rescue_gene_binned_reads.py
type_drb345.py		type_drb345.py

Folders and files

Latest commit

History

Repository files navigation

Polyploid HLA Typing

1. Files

2. Install

3. Quick start

4. Required inputs

5. Tuning per sample

6. Outputs

7. Re-running idempotently

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages