Fix smck bug#2525
Conversation
In msp_smc_k_common_ancestor_event, gsl_ran_flat occasionally draws random_mass == 0 (probability ~1/2^32 per CA event). With random_mass of zero, fenwick_find returns the first non-zero hull and remaining_mass = cum_sum - 0 lands at exactly x_hull->count. The walk loop with predicate `remaining_mass >= 0` then over-iterates by one match, runs off the head of the AVL tree, and trips the tsk_bug_assert at line 7694. Clamp remaining_mass strictly below x_hull->count so the loop terminates after exactly `count` matches. The fix is local; it does not change the RNG draw count or pattern, so seeded simulations that did not previously trip the bug produce identical trajectories. Add a regression test (test_smc_k_bug_repro) that replays the user-reported failure (50 diploid samples, seq_length 2e7, recomb 1e-8, demography change at t=30, seed 3940783591). Pre-fix the test aborts after ~74821 events; post-fix it completes in about a second.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2525 +/- ##
=======================================
Coverage 93.83% 93.83%
=======================================
Files 20 20
Lines 12104 12106 +2
Branches 2242 2243 +1
=======================================
+ Hits 11358 11360 +2
Misses 567 567
Partials 179 179
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
323edf3 to
d50953d
Compare
The prior test_smc_k_bug_repro replayed the user-reported failing
simulation (50 diploid samples, seq_length 2e7, ~75k events) to hit
the bug after about one second. Replace it with
test_smc_k_zero_random_mass, which uses a custom GSL rng_type whose
get_double callback returns successive doubles from a fixed array.
With the script {0.5, 0.0} and a 2-sample SMCk simulation, the first
common-ancestor event has gsl_ran_flat return 0 deterministically,
making remaining_mass == x_hull->count exactly — the same trigger the
clamp at lib/msprime.c handles. Pre-clamp the test aborts at line
7694; post-clamp it completes in about a millisecond.
The scripted RNG type is defined as static helpers in test_ancestry.c
next to the test that uses it.
Add chained-event SMCk clamp test
test_smc_k_zero_random_mass triggers the clamp at lib/msprime.c on
the very first CA event of a fresh 2-sample simulation, only walking
the loop once. Add a sibling test_smc_k_zero_random_mass_chained that
runs one CA event first (coalescing one pair of 4 lineages, freeing
and reusing hull slots, rearranging AVL trees) and then drives
gsl_ran_flat to 0 on the second CA event. Pre-fix the second event
aborts at line 7694; post-fix both complete in about a millisecond.
The test covers the clamp working from a non-trivial post-event
state. We initially hoped to also drive the second event to a hull
with count > 1 via the fenwick zero-skip path, but fenwick_find
always stops at the first non-zero slot and any non-zero predecessor
blocks the route to a higher-count hull. The honest scope (clamp in
chained events, count == 1) is documented in the test comment.
Update CHANGELOG
d50953d to
577cd77
Compare
|
The change here seems quite straightforward, so I think we can issue a bugfix release once this is merged. |
|
That's quite surprising -- if the proper distribution is uniform on [0, x), why is 0 a problem? |
|
@molpopgen I think it has to do with the behavior of I say merge it if you think it's good - it'd take me an hour or so to get my head around it to verify the solution, so I'm happy to go with what you say, but lmk if you'd like me to do that, @jeromekelleher. |
petrelharp
left a comment
There was a problem hiding this comment.
LGTM (modulo the caveat in the main thread)
Closes #2523
This was a bit more subtle than the fix in #2524 implied (which would probably have led to biases). The issue is the rng producing a value of exactly zero.