From 0b68cd5f5c07a6a3e15f83bb390d8e60e3fede85 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 19:35:13 -0400 Subject: [PATCH 01/12] Add Tutorial 21: HAD pre-test workflow End-to-end practitioner walkthrough for `did_had_pretest_workflow` building on T20's brand-campaign framing. Uses a Design 1 (`continuous_at_zero`) panel variant (Uniform[$0.01K, $50K] vs T20's [$5K, $50K]) so the QUG step fails-to-reject and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. Three sections: - Overall workflow on a two-period collapse: Step 1 + Step 3 only; verdict explicitly flags Step 2 as deferred (single pre-period). - Upgrade to event_study workflow: closes all three testable steps via QUG + joint pre-trends Stute (3 horizons) + joint homogeneity Stute (4 horizons); verdict reads "TWFE admissible under Section 4 assumptions". - Yatchew side panel comparing null="linearity" (default, paper Theorem 7) vs null="mean_independence" (Phase 4 R-parity with R YatchewTest::yatchew_test(order=0)) on the within-pre-period first-difference paired with post-period dose. Companion drift-test file with 15 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic stats, and bootstrap p-value tolerance bands per backend. Updates T20 Section 6 Extensions with a forward-pointer to T21, `docs/tutorials/README.md` with a T21 entry, `docs/doc-deps.yaml` `had_pretests.py` block, CHANGELOG `[Unreleased]`, and the T21/T22 TODO row. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 1 + TODO.md | 2 +- docs/doc-deps.yaml | 3 + docs/tutorials/20_had_brand_campaign.ipynb | 24 +- docs/tutorials/21_had_pretest_workflow.ipynb | 623 +++++++++++++++++++ docs/tutorials/README.md | 8 + tests/test_t21_had_pretest_workflow_drift.py | 315 ++++++++++ 7 files changed, 963 insertions(+), 13 deletions(-) create mode 100644 docs/tutorials/21_had_pretest_workflow.ipynb create mode 100644 tests/test_t21_had_pretest_workflow_drift.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 25f967b5..c0911534 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added +- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. Uses a 60-DMA × 8-week panel close in shape to T20's but with a **Design 1 (`continuous_at_zero`) dose distribution** (Uniform[\$0.01K, \$50K] vs T20's [\$5K, \$50K]) so the QUG step in `did_had_pretest_workflow` fails-to-reject and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. 
Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as deferred because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because joint pre-trends Stute (3 horizons, mean-independence null) and joint homogeneity Stute (4 horizons, linearity null) close the gap left by the overall path; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (15 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, and bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR. - **`ChaisemartinDHaultfoeuille.by_path` and `paths_of_interest` now compose with `survey_design`** for analytical Binder TSL SE and replicate-weight bootstrap variance. The `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1233-1239` is replaced by a per-path multiplier-bootstrap-only gate (`survey_design + n_bootstrap > 0` under by_path / paths_of_interest still raises, since the survey-aware perturbation pivot for path-restricted IFs is methodologically underived). Per-path SE routes through the existing `_survey_se_from_group_if` cell-period allocator: the per-period IF (`U_pp_l_path`) is built with non-path switcher-side contributions skipped (control contributions are unchanged, matching the joiners/leavers IF convention; preserves the row-sum identity `U_pp.sum(axis=1) == U`), cohort-recentered via `_cohort_recenter_per_period`, then expanded to observations as `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})`. Replicate-weight designs unconditionally use the cell allocator (Class A contract from PR #323). New `_refresh_path_inference` helper post-call refreshes `safe_inference` on every populated entry across `multi_horizon_inference`, `placebo_horizon_inference`, `path_effects`, and `path_placebos` so all four surfaces use the same final `df_survey` after per-path replicate fits append `n_valid` to the shared accumulator. Path-enumeration ranking under `survey_design` remains unweighted (group-cardinality, not population-weight mass). Lonely-PSU policy stays sample-wide, not per-path. Telescope invariant: on a single-path panel, per-path SE matches the global non-by_path survey SE bit-exactly. **No R parity** — R `did_multiplegt_dyn` does not support survey weighting; this is a Python-only methodology extension. 
The global non-by_path TSL multiplier-bootstrap path is unaffected (anti-regression test `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical::test_global_survey_plus_n_bootstrap_still_works` locks the per-path-only scope of the new gate). Cross-surface invariants regression-tested at `TestByPathSurveyDesignAnalytical` (~17 tests across gate / dispatch / analytical SE / replicate-weight SE / per-path placebos / `trends_linear` composition / unobserved-path warnings / final-df refresh regressions) and `TestByPathSurveyDesignTelescope`. See `docs/methodology/REGISTRY.md` §`ChaisemartinDHaultfoeuille` `Note (Phase 3 by_path ...)` → "Per-path survey-design SE" for the full contract. - **Inference-field aliases on staggered result classes** for adapter / external-consumer compatibility. Read-only `@property` aliases expose the flat `att` / `se` / `conf_int` / `p_value` / `t_stat` names (matching `DiDResults` / `TROPResults` / `SyntheticDiDResults` / `HeterogeneousAdoptionDiDResults`) on every result class that previously only carried prefixed canonical fields: `CallawaySantAnnaResults`, `StackedDiDResults`, `EfficientDiDResults`, `ChaisemartinDHaultfoeuilleResults`, `StaggeredTripleDiffResults`, `WooldridgeDiDResults`, `SunAbrahamResults`, `ImputationDiDResults`, `TwoStageDiDResults` (mapping to `overall_*`); `ContinuousDiDResults` (mapping to `overall_att_*`, ATT-side as the headline, ACRT-side accessible unchanged via `overall_acrt_*`); `MultiPeriodDiDResults` (mapping to `avg_*`). `ContinuousDiDResults` additionally exposes `overall_se` / `overall_conf_int` / `overall_p_value` / `overall_t_stat` aliases for naming consistency with the rest of the staggered family. Aliases are pure read-throughs over the canonical fields — no recomputation, no behavior change — so the `safe_inference()` joint-NaN contract (per CLAUDE.md "Inference computation") is inherited automatically (NaN canonical → NaN alias, locked at `tests/test_result_aliases.py::test_pattern_b_aliases_propagate_nan`). The native `overall_*` / `overall_att_*` / `avg_*` fields remain canonical for documentation and computation. Motivated by the `balance.interop.diff_diff.as_balance_diagnostic()` adapter (`facebookresearch/balance` PR #465) which calls `getattr(res, "se", None)` / `getattr(res, "conf_int", None)` without a fallback chain — pre-alias, every staggered result class returned `None` on those keys, silently dropping `se` and `conf_int` from the adapter's diagnostic dict. 23 alias-mechanic + balance-adapter regression tests at `tests/test_result_aliases.py`. Patch-level (additive on stable surfaces). - **`ChaisemartinDHaultfoeuille.by_path` + non-binary integer treatment** — `by_path=k` now accepts integer-coded discrete treatment (D in Z, e.g. ordinal `{0, 1, 2}`); path tuples become integer-state tuples like `(0, 2, 2, 2)`. The previous `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1870` is replaced by a `ValueError` for continuous D (e.g. `D=1.5`) at fit-time per the no-silent-failures contract — the existing `int(round(float(v)))` cast in `_enumerate_treatment_paths` is now defensive (no-op for integer-coded D). 
Validated against R `did_multiplegt_dyn(..., by_path)` for D in `{0, 1, 2}` via the new `multi_path_reversible_by_path_non_binary` golden-value scenario (78 switchers, 3 paths, single-baseline custom DGP, F_g >= 4): per-path point estimates match R bit-exactly (rtol ~1e-9 on event horizons; rtol+atol envelope for placebo near-zero values), per-path SE inherits the documented cross-path cohort-sharing deviation (~5% rtol observed; SE_RTOL=0.15 envelope). **Deviation from R for D >= 10:** R's `did_multiplegt_by_path` derives the per-path baseline via `path_index$baseline_XX <- substr(path_index$path, 1, 1)`, which captures only the first character of the comma-separated path string (e.g. for `path = "12,12,..."` it captures `"1"` instead of `"12"`); this mis-allocates R's per-path control-pool subset for D >= 10. Python's tuple-key matching is correct in this regime — the per-path point estimates we compute are correct; R's per-path subset for the same path is buggy. The shipped parity scenario stays in `D in {0, 1, 2}` to avoid the R bug. R-parity test at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathNonBinary`; cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary`. diff --git a/TODO.md b/TODO.md index 17d2659e..7b92f14c 100644 --- a/TODO.md +++ b/TODO.md @@ -109,7 +109,7 @@ Deferred items from PR reviews that were not addressed before merge. | `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low | -| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorials (T21 HAD pretest workflow notebook + T22 weighted/survey HAD tutorial). `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1. | `tutorials/`, `tests/test_t21_*_drift.py`, `tests/test_t22_*_drift.py` | Phase 2a | Low | +| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed (PR-pending); `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. 
| `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low | | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium | | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low | diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml index f460b8ba..fea1ced4 100644 --- a/docs/doc-deps.yaml +++ b/docs/doc-deps.yaml @@ -404,6 +404,9 @@ sources: - path: diff_diff/guides/llms-full.txt section: "HAD Pretests" type: user_guide + - path: docs/tutorials/21_had_pretest_workflow.ipynb + type: tutorial + note: "Composite pre-test workflow walkthrough; drift-locked at tests/test_t21_had_pretest_workflow_drift.py" diff_diff/local_linear.py: drift_risk: low diff --git a/docs/tutorials/20_had_brand_campaign.ipynb b/docs/tutorials/20_had_brand_campaign.ipynb index 81c0f91b..66b24849 100644 --- a/docs/tutorials/20_had_brand_campaign.ipynb +++ b/docs/tutorials/20_had_brand_campaign.ipynb @@ -38,9 +38,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-004", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "import numpy as np\n", @@ -68,9 +68,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-006", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "MAIN_SEED = 87\n", @@ -116,9 +116,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-007", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "post_doses = (\n", @@ -146,9 +146,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-008", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "if HAS_MATPLOTLIB:\n", @@ -191,9 +191,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-010", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "panel_2pd = panel.copy()\n", @@ -231,9 +231,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-012", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "print(f'WAS_d_lower estimate (att): {result.att:.4f}')\n", @@ -272,9 +272,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-015", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "import warnings\n", @@ -300,9 +300,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-016", 
"metadata": {}, - "execution_count": null, "outputs": [], "source": [ "es_df = result_es.to_dataframe()\n", @@ -311,9 +311,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-017", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "if HAS_MATPLOTLIB:\n", @@ -346,7 +346,7 @@ "source": [ "**Reading the dynamics.**\n", "\n", - "- The pre-launch placebo horizons (weeks -4, -3, -2) all sit at essentially zero - per-$1K effects within \u00b10.06 with 95% CIs comfortably bracketing zero. Visually consistent with parallel pre-trends. (Note: this is a visual placebo check, not a formal pretest - HAD ships a separate composite pretest workflow we did not run here; see extensions.)\n", + "- The pre-launch placebo horizons (weeks -4, -3, -2) all sit at essentially zero - per-$1K effects within ±0.06 with 95% CIs comfortably bracketing zero. Visually consistent with parallel pre-trends. (Note: this is a visual placebo check, not a formal pretest - HAD ships a separate composite pretest workflow we did not run here; see extensions.)\n", "- The per-week post-launch effects (weeks 0, 1, 2, 3) all hover right around 100 visits per $1K with overlapping 95% CIs and lower bounds well above zero. The per-dollar lift is stable across all four weeks of the campaign.\n", "- Practically: the campaign delivered its per-dollar lift on impact and held it across all four post-launch weeks. No ramp-up, no fade." ] @@ -376,7 +376,7 @@ "id": "t20-cell-020", "metadata": {}, "source": [ - "Adapt this template by swapping in your own numbers from `result.att`, `result.conf_int`, `result.d_lower`, the per-week event-study table, and your own DMA / spend distribution. The pattern - **headline \u2192 sample \u2192 validity \u2192 business \u2192 practical** - is what to keep." + "Adapt this template by swapping in your own numbers from `result.att`, `result.conf_int`, `result.d_lower`, the per-week event-study table, and your own DMA / spend distribution. The pattern - **headline → sample → validity → business → practical** - is what to keep." ] }, { @@ -389,7 +389,7 @@ "This tutorial covered HAD's headline workflow: the overall WAS_d_lower fit and the multi-week event study. The library also supports several extensions we did not demonstrate here.\n", "\n", "- **Population-weighted (survey-aware) inference**: when some markets or regions carry more weight than others - e.g., DMAs weighted by population - HAD accepts a `weights=` array or a `SurveyDesign` object on the same `fit()` interface.\n", - "- **Composite pretest workflow**: HAD ships a `did_had_pretest_workflow` that combines the QUG support-infimum test (`H0: d_lower = 0`, which adjudicates between the `continuous_at_zero` and `continuous_near_d_lower` design paths) with linearity tests (Stute and Yatchew-HR). On the two-period (`aggregate='overall'`) path this workflow checks QUG and linearity only; the parallel-trends step is closed by the multi-period (`aggregate='event_study'`) joint variants (`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`). The visual placebo check we used in Section 4 is a parallel-trends sanity check, not a substitute for the formal joint pretests.\n", + "- **Composite pretest workflow**: HAD ships a `did_had_pretest_workflow` that combines the QUG support-infimum test (`H0: d_lower = 0`, which adjudicates between the `continuous_at_zero` and `continuous_near_d_lower` design paths) with linearity tests (Stute and Yatchew-HR). 
On the two-period (`aggregate='overall'`) path this workflow checks QUG and linearity only; the parallel-trends step is closed by the multi-period (`aggregate='event_study'`) joint variants (`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`). The visual placebo check we used in Section 4 is a parallel-trends sanity check, not a substitute for the formal joint pretests; see [Tutorial 21](21_had_pretest_workflow.ipynb) for an end-to-end pretest walkthrough.\n", "- **`continuous_at_zero` design path**: if the lightest-touch DMA had no regional add-on (spend exactly $0), HAD switches to the Design 1' identification path with target `WAS` instead of `WAS_d_lower`. The auto-detection picks it up.\n", "- **Mass-point design path**: if a meaningful chunk of DMAs sit at exactly the same minimum spend (rather than spread continuously near the boundary), HAD switches to a 2SLS estimator with matching identification logic. Auto-detected as well.\n", "\n", diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb new file mode 100644 index 00000000..05c51989 --- /dev/null +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -0,0 +1,623 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "02f317b7", + "metadata": {}, + "source": [ + "# Tutorial 21: HAD Pre-test Workflow - Did the Brand Campaign Satisfy the Identifying Assumptions?\n", + "\n", + "[Tutorial 20](20_had_brand_campaign.ipynb) fit `HeterogeneousAdoptionDiD` (HAD) on a regional brand-campaign panel and reported a per-dollar lift, with a brief visual placebo check at the end. We deliberately deferred the **formal pre-test workflow** to this tutorial, with a forward pointer in T20's \"Extensions\" section.\n", + "\n", + "This tutorial picks up where T20 left off. We re-run the brand campaign on a panel close in shape to T20's, then walk through HAD's composite pre-test workflow `did_had_pretest_workflow` to formally validate the identifying assumptions (paper Section 4.2 of de Chaisemartin, Ciccia, D'Haultfoeuille, & Knau (2026)). We start with the two-period (`aggregate=\"overall\"`) workflow, observe that it leaves the parallel pre-trends step open, and then **upgrade** to the multi-period (`aggregate=\"event_study\"`) workflow that closes all three paper steps jointly. A side panel compares the two `null=` modes of the Yatchew-HR linearity test, including the recently-shipped `null=\"mean_independence\"` mode (R-parity with `YatchewTest::yatchew_test(order=0)`).\n" + ] + }, + { + "cell_type": "markdown", + "id": "47b10255", + "metadata": {}, + "source": [ + "## 1. The Pre-test Battery\n", + "\n", + "de Chaisemartin et al. (2026) Section 4.2 lays out a four-step workflow for HAD identification:\n", + "\n", + "1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1, `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1', `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters.\n", + "2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD.\n", + "3. 
**Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias?\n", + "4. **Step 4 - Boundary continuity (paper Assumptions 5, 6):** local-linearity of the dose-response near the boundary `d_lower`. **Non-testable**; argued from domain knowledge.\n", + "\n", + "The library bundles the testable steps into one entry point: `did_had_pretest_workflow`. It dispatches to a two-period implementation (steps 1 + 3 only - step 2 needs at least two pre-periods) or a multi-period implementation (steps 1 + 2 + 3 jointly). The Yatchew-HR test from Step 3 is also exposed standalone with two null modes; we exercise both in the side panel.\n" + ] + }, + { + "cell_type": "markdown", + "id": "b39cfcc4", + "metadata": {}, + "source": [ + "## 2. The Panel\n", + "\n", + "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial spans roughly **$10 to $50K** (Uniform[\\$0.01K, \\$50K]) instead of T20's Uniform[\\$5K, \\$50K]. Some markets barely participated in the regional add-on - they put in essentially nothing. This shifts HAD's design path from T20's `continuous_near_d_lower` (Design 1', target = `WAS_d_lower`) to `continuous_at_zero` (Design 1, target = `WAS`) - the QUG test in Step 1 confirms that.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "e7d08b12", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-09T23:26:36.985852Z", + "iopub.status.busy": "2026-05-09T23:26:36.985618Z", + "iopub.status.idle": "2026-05-09T23:26:37.598610Z", + "shell.execute_reply": "2026-05-09T23:26:37.598289Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Panel: 60 DMAs x 8 weeks\n", + "Regional spend (post-launch): $0.18K - $49.00K\n", + "True per-$1K lift (locked at seed): 100.0 weekly visits\n" + ] + } + ], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from diff_diff import generate_continuous_did_data\n", + "\n", + "MAIN_SEED = 87\n", + "N_UNITS = 60\n", + "N_PERIODS = 8\n", + "COHORT_PERIOD = 5\n", + "TRUE_SLOPE = 100.0\n", + "BASELINE_VISITS = 5000.0\n", + "DOSE_LOW = 0.01\n", + "DOSE_HIGH = 50.0\n", + "\n", + "raw = generate_continuous_did_data(\n", + " n_units=N_UNITS,\n", + " n_periods=N_PERIODS,\n", + " cohort_periods=[COHORT_PERIOD],\n", + " never_treated_frac=0.0,\n", + " dose_distribution=\"uniform\",\n", + " dose_params={\"low\": DOSE_LOW, \"high\": DOSE_HIGH},\n", + " att_function=\"linear\",\n", + " att_intercept=0.0,\n", + " att_slope=TRUE_SLOPE,\n", + " unit_fe_sd=8.0,\n", + " time_trend=0.5,\n", + " noise_sd=2.0,\n", + " seed=MAIN_SEED,\n", + ")\n", + "panel = raw.copy()\n", + "panel.loc[panel[\"period\"] < panel[\"first_treat\"], \"dose\"] = 0.0\n", + "panel = panel.rename(\n", + " columns={\n", + " \"unit\": \"dma_id\",\n", + " \"period\": \"week\",\n", + " \"outcome\": \"weekly_visits\",\n", + " \"dose\": \"regional_spend_k\",\n", + " }\n", + ")\n", + "panel[\"weekly_visits\"] = panel[\"weekly_visits\"] + BASELINE_VISITS\n", + "\n", + "post = panel[panel[\"week\"] >= COHORT_PERIOD]\n", + "print(f\"Panel: {panel['dma_id'].nunique()} DMAs x {panel['week'].nunique()} weeks\")\n", + "print(\n", + " f\"Regional spend (post-launch): \"\n", + " f\"${post['regional_spend_k'].min():.2f}K - \"\n", + 
" f\"${post['regional_spend_k'].max():.2f}K\"\n", + ")\n", + "print(f\"True per-$1K lift (locked at seed): {TRUE_SLOPE} weekly visits\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "3a9b551b", + "metadata": {}, + "source": [ + "## 3. Step 1: The Overall Workflow (Two-Period Path)\n", + "\n", + "T20's headline used a two-period collapse of the panel - average pre-launch outcome per DMA against average post-launch outcome per DMA. That's also the natural input shape for HAD's two-period (`aggregate=\"overall\"`) pre-test workflow, which runs **paper Step 1 (QUG) + paper Step 3 (linearity, via Stute and Yatchew-HR)**. Step 2 (parallel pre-trends) is not implemented on this path - a single pre-period structurally can't support a pre-trends test - and the workflow's verdict says so explicitly.\n", + "\n", + "We collapse to two periods (pre = avg over weeks 1-4, post = avg over weeks 5-8), then call the workflow.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "b4057d6a", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-09T23:26:37.599845Z", + "iopub.status.busy": "2026-05-09T23:26:37.599739Z", + "iopub.status.idle": "2026-05-09T23:26:37.634021Z", + "shell.execute_reply": "2026-05-09T23:26:37.633738Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)\n", + "\n", + "all_pass = True\n", + "aggregate = 'overall'\n", + "pretrends_joint populated? False\n", + "homogeneity_joint populated? False\n" + ] + } + ], + "source": [ + "from diff_diff import did_had_pretest_workflow\n", + "\n", + "p = panel.copy()\n", + "p[\"period\"] = (p[\"week\"] >= COHORT_PERIOD).astype(int) + 1 # 1=pre, 2=post\n", + "two_period = p.groupby([\"dma_id\", \"period\"], as_index=False).agg(\n", + " weekly_visits=(\"weekly_visits\", \"mean\"),\n", + " regional_spend_k=(\"regional_spend_k\", \"mean\"),\n", + ")\n", + "# Workflow invariant: pre-period dose = 0 for every unit.\n", + "two_period.loc[two_period[\"period\"] == 1, \"regional_spend_k\"] = 0.0\n", + "# first_treat in the collapsed coordinates: 2 (the post-period) for every DMA.\n", + "two_period[\"first_treat\"] = 2\n", + "\n", + "overall_report = did_had_pretest_workflow(\n", + " data=two_period,\n", + " outcome_col=\"weekly_visits\",\n", + " dose_col=\"regional_spend_k\",\n", + " time_col=\"period\",\n", + " unit_col=\"dma_id\",\n", + " first_treat_col=\"first_treat\",\n", + " alpha=0.05,\n", + " n_bootstrap=999,\n", + " seed=21,\n", + " aggregate=\"overall\",\n", + ")\n", + "\n", + "print(overall_report.verdict)\n", + "print(f\"\\nall_pass = {overall_report.all_pass}\")\n", + "print(f\"aggregate = {overall_report.aggregate!r}\")\n", + "print(f\"pretrends_joint populated? {overall_report.pretrends_joint is not None}\")\n", + "print(f\"homogeneity_joint populated? {overall_report.homogeneity_joint is not None}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "8994fa7c", + "metadata": {}, + "source": [ + "**Reading the overall verdict.** Three things to note.\n", + "\n", + "- **Step 1 (QUG) fails to reject:** `D_(1)` (the smallest treated dose, ~\\$180 here) is small relative to the gap `D_(2) - D_(1)`, so the test statistic `T = D_(1) / (D_(2) - D_(1))` lands well below its critical value (1/alpha - 1 = 19 at alpha = 0.05). 
The data are consistent with `d_lower = 0` (Design 1, `continuous_at_zero`, target = `WAS`).\n", + "- **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The differenced outcome `dY` looks linear in `D`, so the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias.\n", + "- **Step 2 (Assumption 7 pre-trends) is structurally absent.** The verdict says so verbatim: `\"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)\"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path).\n", + "\n", + "Let's look at each individual test result.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "89e549ef", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-09T23:26:37.635153Z", + "iopub.status.busy": "2026-05-09T23:26:37.635075Z", + "iopub.status.idle": "2026-05-09T23:26:37.636869Z", + "shell.execute_reply": "2026-05-09T23:26:37.636643Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "================================================================\n", + " QUG null test (H_0: d_lower = 0) \n", + "================================================================\n", + "Statistic T: 3.8562\n", + "p-value: 0.2059\n", + "Critical value (1/alpha-1): 19.0000\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Observations: 60\n", + "Excluded (d == 0): 0\n", + "D_(1): 0.1806\n", + "D_(2): 0.2274\n", + "================================================================\n", + "\n", + "================================================================\n", + " Stute CvM linearity test (H_0: linear E[dY|D]) \n", + "================================================================\n", + "CvM statistic: 0.0735\n", + "Bootstrap p-value: 0.6860\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Bootstrap replications: 999\n", + "Observations: 60\n", + "Seed: 21\n", + "================================================================\n", + "\n", + "================================================================\n", + " Yatchew-HR linearity test (H_0: linear E[dY|D]) \n", + "================================================================\n", + "T_hr statistic: -34759.3017\n", + "p-value: 1.0000\n", + "Critical value (1-sided z): 1.6449\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "sigma^2_lin (OLS): 1.6177\n", + "sigma^2_diff (Yatchew): 6250.2569\n", + "sigma^2_W (HR scale): 1.3925\n", + "Observations: 60\n", + "================================================================\n" + ] + } + ], + "source": [ + "overall_report.qug.print_summary()\n", + "print()\n", + "overall_report.stute.print_summary()\n", + "print()\n", + "overall_report.yatchew.print_summary()\n" + ] + }, + { + "cell_type": "markdown", + "id": "892978cd", + "metadata": {}, + "source": [ + "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000). 
That looks alarming but is correct here: under perfectly linear dose-response with very heterogeneous doses (Uniform[\\$0.01K, \\$50K]) and 60 sorted-by-dose units, the differencing variance `sigma2_diff` (which captures the squared gap between adjacent-by-dose units' `dy` values) is much larger than the OLS residual variance `sigma2_lin`. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. (For a different way to look at this same test, see the Yatchew side panel later in the notebook.)\n" + ] + }, + { + "cell_type": "markdown", + "id": "461e877c", + "metadata": {}, + "source": [ + "## 4. Step 2: Upgrade to the Event-Study Workflow\n", + "\n", + "The two-period workflow gave us evidence on Steps 1 and 3 but no formal evidence on Step 2 (parallel pre-trends). Our panel actually has 8 weeks - that's enough pre-periods to close Step 2 jointly with Stute's joint variant (paper Section 4.2 step 2 + Hlavka-Huskova 2020 / Delgado-Manteiga 2001 dependence-preserving Mammen multiplier bootstrap).\n", + "\n", + "We pass the full multi-period panel to `did_had_pretest_workflow(aggregate=\"event_study\", ...)`. The dispatch covers all three paper steps in one call:\n", + "\n", + "- **Step 1**: QUG re-runs on the dose distribution at the treatment period `F` (deterministic; same numbers as the overall path).\n", + "- **Step 2**: `joint_pretrends_test` - mean-independence joint Stute over the pre-period horizons (`E[Y_t - Y_base | D] = mu_t` for each t < F).\n", + "- **Step 3**: `joint_homogeneity_test` - linearity joint Stute over the post-period horizons (`E[Y_t - Y_base | D_t] = beta_{0,t} + beta_{fe,t} * D` for each t >= F).\n", + "\n", + "Step 3's \"Yatchew-HR\" arm has no joint variant in the paper (the differencing-based variance estimator doesn't have a derived multi-horizon extension), so the event-study path runs only joint Stute for linearity. Practitioners who want Yatchew-HR robustness on multi-period data can call the standalone `yatchew_hr_test` on each (base, post) pair manually.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "23b947ad", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-09T23:26:37.637903Z", + "iopub.status.busy": "2026-05-09T23:26:37.637825Z", + "iopub.status.idle": "2026-05-09T23:26:37.760056Z", + "shell.execute_reply": "2026-05-09T23:26:37.759776Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\n", + "\n", + "all_pass = True\n", + "aggregate = 'event_study'\n", + "pretrends_joint populated? True\n", + "homogeneity_joint populated? True\n" + ] + } + ], + "source": [ + "es_report = did_had_pretest_workflow(\n", + " data=panel,\n", + " outcome_col=\"weekly_visits\",\n", + " dose_col=\"regional_spend_k\",\n", + " time_col=\"week\",\n", + " unit_col=\"dma_id\",\n", + " first_treat_col=\"first_treat\",\n", + " alpha=0.05,\n", + " n_bootstrap=999,\n", + " seed=21,\n", + " aggregate=\"event_study\",\n", + ")\n", + "\n", + "print(es_report.verdict)\n", + "print(f\"\\nall_pass = {es_report.all_pass}\")\n", + "print(f\"aggregate = {es_report.aggregate!r}\")\n", + "print(f\"pretrends_joint populated? {es_report.pretrends_joint is not None}\")\n", + "print(f\"homogeneity_joint populated? 
{es_report.homogeneity_joint is not None}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "78d6b00d", + "metadata": {}, + "source": [ + "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone - all three paper steps closed jointly. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", + "\n", + "The joint pre-trends test runs over `n_horizons = 3` (pre-periods 1, 2, 3, with week 4 reserved as the base period). The joint homogeneity test runs over `n_horizons = 4` (post-periods 5, 6, 7, 8). Let's inspect the per-horizon detail.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "dfaf3133", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-09T23:26:37.761340Z", + "iopub.status.busy": "2026-05-09T23:26:37.761247Z", + "iopub.status.idle": "2026-05-09T23:26:37.763147Z", + "shell.execute_reply": "2026-05-09T23:26:37.762905Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "================================================================\n", + " QUG null test (H_0: d_lower = 0) \n", + "================================================================\n", + "Statistic T: 3.8562\n", + "p-value: 0.2059\n", + "Critical value (1/alpha-1): 19.0000\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Observations: 60\n", + "Excluded (d == 0): 0\n", + "D_(1): 0.1806\n", + "D_(2): 0.2274\n", + "================================================================\n", + "\n", + "================================================================\n", + " Joint Stute CvM test (mean-independence (pre-trends)) \n", + "================================================================\n", + "Joint CvM statistic: 7.1627\n", + "Bootstrap p-value: 0.0720\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Bootstrap replications: 999\n", + "Horizons: 3\n", + "Observations: 60\n", + "Seed: 21\n", + "Exact-linear short-circuit: False\n", + "----------------------------------------------------------------\n", + "Per-horizon statistics:\n", + " 1 1.6112\n", + " 2 2.9262\n", + " 3 2.6253\n", + "================================================================\n", + "\n", + "================================================================\n", + " Joint Stute CvM test (linearity (post-homogeneity)) \n", + "================================================================\n", + "Joint CvM statistic: 1.3562\n", + "Bootstrap p-value: 0.7630\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Bootstrap replications: 999\n", + "Horizons: 4\n", + "Observations: 60\n", + "Seed: 21\n", + "Exact-linear short-circuit: False\n", + "----------------------------------------------------------------\n", + "Per-horizon statistics:\n", + " 5 0.4218\n", + " 6 0.2186\n", + " 7 0.4928\n", + " 8 0.2230\n", + "================================================================\n" + ] + } + ], + "source": [ + "es_report.qug.print_summary()\n", + "print()\n", + "es_report.pretrends_joint.print_summary()\n", + "print()\n", + "es_report.homogeneity_joint.print_summary()\n" + ] + }, + { + "cell_type": "markdown", + "id": "acab854c", + "metadata": {}, + "source": [ + "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold - the test is not vacuous, it is informative. 
It is consistent with parallel pre-trends but not by a wide margin. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n", + "\n", + "The joint homogeneity p-value (~0.76) is a strong fail-to-reject. Linearity holds across all four post-launch horizons.\n", + "\n", + "Together with QUG (design verdict) and joint linearity (Step 3), this closes the testable portion of the paper's identification framework. Step 4 (boundary continuity, Assumptions 5 / 6) remains non-testable; we still defend it from domain knowledge as in T20.\n" + ] + }, + { + "cell_type": "markdown", + "id": "ba3a0c3c", + "metadata": {}, + "source": [ + "## 5. Side Panel: Yatchew-HR Null Modes\n", + "\n", + "The Yatchew-HR test exposes two `null=` modes (the second was added in 2026-04 for parity with the R `YatchewTest` package).\n", + "\n", + "- `null=\"linearity\"` (default; paper Theorem 7): tests `H0: E[dY | D]` is linear in `D`. Residuals come from OLS `dy ~ 1 + d`. This is what `did_had_pretest_workflow` calls under the hood.\n", + "- `null=\"mean_independence\"` (PR #400, 2026-04, Phase 4 R-parity): tests the stricter `H0: E[dY | D] = E[dY]`, i.e. `dY` is mean-independent of `D`. Residuals come from intercept-only OLS `dy ~ 1`. Mirrors R `YatchewTest::yatchew_test(order=0)`.\n", + "\n", + "The mean-independence mode is typically used on **placebo (pre-treatment) data** to test parallel pre-trends as a non-parametric mean-independence assertion. Below we construct an illustrative input - the within-pre-period first-difference `dy = Y[week=4] - Y[week=3]` paired with each DMA's actual post-period dose - and run both modes side by side. 
Both should fail to reject on this clean linear DGP; the contrast is in the residual structure.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "319a1c0c", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-09T23:26:37.764116Z", + "iopub.status.busy": "2026-05-09T23:26:37.764040Z", + "iopub.status.idle": "2026-05-09T23:26:37.768553Z", + "shell.execute_reply": "2026-05-09T23:26:37.768342Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "================================================================\n", + " Yatchew-HR linearity test (H_0: linear E[dY|D]) \n", + "================================================================\n", + "T_hr statistic: 0.0207\n", + "p-value: 0.4917\n", + "Critical value (1-sided z): 1.6449\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "sigma^2_lin (OLS): 6.5340\n", + "sigma^2_diff (Yatchew): 6.5170\n", + "sigma^2_W (HR scale): 6.3639\n", + "Observations: 60\n", + "================================================================\n", + "\n", + "================================================================\n", + " Yatchew-HR mean-independence test (H_0: E[dY|D] = E[dY]) \n", + "================================================================\n", + "T_hr statistic: 0.5536\n", + "p-value: 0.2899\n", + "Critical value (1-sided z): 1.6449\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "sigma^2_lin (OLS): 7.0076\n", + "sigma^2_diff (Yatchew): 6.5170\n", + "sigma^2_W (HR scale): 6.8638\n", + "Observations: 60\n", + "================================================================\n" + ] + } + ], + "source": [ + "from diff_diff import yatchew_hr_test\n", + "\n", + "panel_sorted = panel.sort_values([\"dma_id\", \"week\"]).reset_index(drop=True)\n", + "pre = panel_sorted[panel_sorted[\"week\"].isin([3, 4])]\n", + "pre_pivot = pre.pivot(index=\"dma_id\", columns=\"week\", values=\"weekly_visits\")\n", + "dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64)\n", + "post_dose = (\n", + " panel_sorted[panel_sorted[\"week\"] == 5]\n", + " .set_index(\"dma_id\")\n", + " .sort_index()[\"regional_spend_k\"]\n", + " .to_numpy(dtype=np.float64)\n", + ")\n", + "\n", + "res_lin = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null=\"linearity\")\n", + "res_mi = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null=\"mean_independence\")\n", + "\n", + "print(res_lin.summary())\n", + "print()\n", + "print(res_mi.summary())\n" + ] + }, + { + "cell_type": "markdown", + "id": "d142244a", + "metadata": {}, + "source": [ + "**Reading the side-panel comparison.**\n", + "\n", + "- The `linearity` mode fits `dy ~ 1 + d` and computes residual variance `sigma2_lin` from those residuals. Under a clean linear DGP the residuals are small (close to noise variance), the gap `sigma2_lin - sigma2_diff` is near zero, and `T_hr` lands close to zero with a p-value far above alpha.\n", + "- The `mean_independence` mode fits intercept-only `dy ~ 1` and computes `sigma2_lin` as the population variance of `dy`. That residual variance is **strictly larger** than under `linearity` (the linear fit absorbs the dose-response signal that intercept-only does not). The gap `sigma2_lin - sigma2_diff` is then larger and `T_hr` is larger - same asymptotic distribution, stricter null, more easily rejected when the alternative is true.\n", + "\n", + "On clean linear placebo data both modes fail to reject - exactly what we want. 
On data where `dY` actually responds to `D` in the pre-period (parallel pre-trends fail), `null=\"mean_independence\"` is more sensitive than `null=\"linearity\"` because linearity is a weaker null (linear pre-trends would fail to reject the linearity null but would reject the mean-independence null).\n",
+    "\n",
+    "When to choose which: use `null=\"linearity\"` to defend the joint identification assumption (paper Step 3, Assumption 8). Use `null=\"mean_independence\"` on placebo (pre-treatment) data when you want a non-parametric mean-independence assertion. The `null=\"mean_independence\"` mode is what R `YatchewTest::yatchew_test(order=0)` runs by default for placebo pre-trend tests. Both residual computations are sketched by hand below.\n"
+   ]
+  },
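+  {
+   "cell_type": "markdown",
+   "id": "a1b2c3d4",
+   "metadata": {},
+   "source": [
+    "To make the residual contrast concrete, here is a by-hand decomposition of the two `sigma2_lin` rows. This is an illustrative numpy sketch, not the library's code path - `yatchew_hr_test`'s internal normalization conventions may differ slightly, so expect approximate rather than bit-exact agreement with the summaries above.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b2c3d4e5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Residuals under null=\"linearity\": OLS of dy on an intercept and d.\n",
+    "X = np.column_stack([np.ones_like(post_dose), post_dose])\n",
+    "beta, *_ = np.linalg.lstsq(X, dy, rcond=None)\n",
+    "resid_linearity = dy - X @ beta\n",
+    "# Residuals under null=\"mean_independence\": intercept-only fit dy ~ 1.\n",
+    "resid_mean_indep = dy - dy.mean()\n",
+    "# .var() uses the population (1/n) normalization; the library's\n",
+    "# degrees-of-freedom convention may differ.\n",
+    "print(f\"sigma2_lin, null='linearity':         {resid_linearity.var():.4f}\")\n",
+    "print(f\"sigma2_lin, null='mean_independence': {resid_mean_indep.var():.4f}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c3d4e5f6",
+   "metadata": {},
+   "source": [
+    "Section 4 noted that the event-study path has no joint Yatchew-HR variant, and that practitioners who want Yatchew-HR robustness on multi-period data can call the standalone test on each (base, post) pair manually. A minimal sketch of that loop on our panel (illustrative: no multiplicity correction is applied across horizons, so read the per-horizon p-values as descriptive rather than as a formal joint test):\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d4e5f6a7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Difference each post week against the base week (week 4) and pair\n",
+    "# it with the dose at the treatment period, then run Yatchew-HR.\n",
+    "wide = panel.pivot(index=\"dma_id\", columns=\"week\", values=\"weekly_visits\").sort_index()\n",
+    "dose_f = (\n",
+    "    panel[panel[\"week\"] == COHORT_PERIOD]\n",
+    "    .set_index(\"dma_id\")\n",
+    "    .sort_index()[\"regional_spend_k\"]\n",
+    "    .to_numpy(dtype=np.float64)\n",
+    ")\n",
+    "base_week = COHORT_PERIOD - 1\n",
+    "for week in range(COHORT_PERIOD, N_PERIODS + 1):\n",
+    "    dy_t = (wide[week] - wide[base_week]).to_numpy(dtype=np.float64)\n",
+    "    res = yatchew_hr_test(d=dose_f, dy=dy_t, alpha=0.05)\n",
+    "    print(f\"week {week}: p = {res.p_value:.4f}, reject = {res.reject}\")\n"
+   ]
+  },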
+  {
+   "cell_type": "markdown",
+   "id": "5c5d2b18",
+   "metadata": {},
+   "source": [
+    "## 6. Communicating the Validation to Leadership\n",
+    "\n",
+    "Pre-test results travel awkwardly to non-technical audiences. The template below structures the validation around what each test rules out - mirroring the headline-and-evidence pattern from T20 Section 5.\n",
+    "\n",
+    "> **Identifying assumptions for HAD on the brand-campaign panel are defended on all three paper steps.**\n",
+    ">\n",
+    "> - **Step 1 (QUG support-infimum, paper Theorem 4):** the test is consistent with the dose distribution starting at zero (`d_lower = 0`, p approximately 0.21). The library auto-detects the `continuous_at_zero` design and reports the WAS (Weighted Average Slope), as expected for this panel where some markets barely participated in the regional spend.\n",
+    "> - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test fails to reject (joint p approximately 0.07 across the three pre-period horizons). The pre-trend evidence is not a slam dunk - the p-value sits close to alpha = 0.05 - but the test does fail to reject. In a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending.\n",
+    "> - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity fails to reject (joint p approximately 0.76 across the four post-launch horizons). The linearity assumption needed for the WAS reading to reflect the average per-dose marginal effect (rather than masking heterogeneity bias) is comfortably supported.\n",
+    ">\n",
+    "> **Non-testable from data (Step 4, paper Assumptions 5 / 6, boundary continuity):** local-linearity of the dose-response near `d_lower`. Argued from domain knowledge - is there reason to believe the marginal effect of an additional $1K of regional spend is roughly constant across the dose range? In our case yes, by DGP construction; in a real analysis we would justify this from prior knowledge of the channel's response shape.\n",
+    ">\n",
+    "> **Bottom line:** TWFE is admissible under the paper's framework on this panel. The headline per-$1K lift from the HAD fit can be carried forward to leadership without methodological caveat beyond Step 4 (which is qualitative, not data-driven).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d0c55b3",
+   "metadata": {},
+   "source": [
+    "## 7. Extensions\n",
+    "\n",
+    "This tutorial covered the composite pre-test workflow on a single Design 1 panel. A few directions we did not exercise here:\n",
+    "\n",
+    "- **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). A follow-up tutorial covers this path end-to-end.\n",
+    "- **`trends_lin=True` (Pierce-Schott Eq 17 / 18 detrending)** - mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Forwards into both joint pre-trends and joint homogeneity wrappers; consumes the placebo at `base_period - 1` and skips Step 2 if no earlier placebo survives the drop. Useful when you suspect linear time trends correlated with dose but want to keep the joint-Stute machinery.\n",
+    "- **Standalone constituent tests** - all four building blocks are exposed for direct calling: `qug_test`, `stute_test`, `yatchew_hr_test` (used in this tutorial's side panel), and the joint variants `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`.\n",
+    "\n",
+    "See the [`HeterogeneousAdoptionDiD` API reference](../api/had.html) and the [`HAD pre-tests` reference](../api/had.html#pre-tests) for the full parameter lists.\n",
+    "\n",
+    "**Related tutorials.**\n",
+    "\n",
+    "- [Tutorial 14: Continuous DiD](14_continuous_did.ipynb) - the Callaway-Goodman-Bacon-Sant'Anna estimator for continuous-dose settings where you do have a never-treated unit and want the per-dose ATT(d) curve, not just the average slope.\n",
+    "- [Tutorial 20: HAD for a National Brand Campaign](20_had_brand_campaign.ipynb) - the headline HAD fit and event-study this tutorial defends.\n",
+    "- [Tutorial 4: Parallel Trends](04_parallel_trends.ipynb) - parallel-trends tests for the binary-DiD setting.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "36453f8b",
+   "metadata": {},
+   "source": [
+    "## 8. Summary Checklist\n",
+    "\n",
+    "- HAD's pre-test workflow `did_had_pretest_workflow` bundles paper Section 4.2 Steps 1 (QUG support infimum), 2 (joint Stute pre-trends - event-study path only), and 3 (Stute / Yatchew-HR linearity, joint variant on event-study path).\n",
+    "- The two-period (`aggregate=\"overall\"`) path runs Steps 1 + 3 only - it cannot run Step 2 because a single pre-period structurally has nothing to test against. The verdict says so verbatim: \"Assumption 7 pre-trends test NOT run\".\n",
+    "- Upgrade to the multi-period (`aggregate=\"event_study\"`) path to close all three testable steps jointly. The verdict then reads \"TWFE admissible under Section 4 assumptions\" when nothing rejects.\n",
+    "- Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge.\n",
+    "- The Yatchew-HR test exposes two null modes: `null=\"linearity\"` (paper Theorem 7, default; what the workflow calls under the hood) and `null=\"mean_independence\"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data).\n",
+    "- Bootstrap p-values are RNG-dependent. 
The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python).\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/tutorials/README.md b/docs/tutorials/README.md index 64773d46..f3f27bdb 100644 --- a/docs/tutorials/README.md +++ b/docs/tutorials/README.md @@ -103,6 +103,14 @@ Practitioner walkthrough for measuring per-dollar lift when every market is trea - Stakeholder communication template flagging the Assumption 5/6 identification caveat - Companion drift-test file (`tests/test_t20_had_brand_campaign_drift.py`) +### 21. HAD Pre-test Workflow (`21_had_pretest_workflow.ipynb`) +Composite pre-test walkthrough for `HeterogeneousAdoptionDiD`, building on Tutorial 20's brand-campaign framing on a Design 1 (`continuous_at_zero`) panel variant: +- Paper Section 4.2 step taxonomy (QUG support-infimum, parallel pre-trends, linearity) +- `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse: Step 1 + Step 3 only, verdict explicitly flags Step 2 as deferred +- Upgrade to `did_had_pretest_workflow(aggregate="event_study")` on the multi-week panel: closes all three testable steps via QUG + joint pre-trends Stute + joint homogeneity Stute +- Side panel comparing `yatchew_hr_test` `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) +- Companion drift-test file (`tests/test_t21_had_pretest_workflow_drift.py`) + ## Running the Notebooks 1. Install diff-diff with dependencies: diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py new file mode 100644 index 00000000..62a8893d --- /dev/null +++ b/tests/test_t21_had_pretest_workflow_drift.py @@ -0,0 +1,315 @@ +"""Drift detection for Tutorial 21 (`docs/tutorials/21_had_pretest_workflow.ipynb`). + +The tutorial narrative quotes seed-specific numbers (overall verdict +substring, QUG / Stute / Yatchew p-values, joint pre-trends and homogeneity +horizon counts and p-values, Yatchew side-panel statistics under both null +modes). If library numerics drift (estimator changes, RNG path changes, +BLAS path changes), the prose can go stale silently while `pytest --nbmake` +still passes - it only checks that the cells execute without error. + +These asserts re-derive the same numbers using the locked T21 DGP and seeds +the notebook uses, then check them against the values quoted in the +tutorial markdown. If a future change moves any number outside its +tolerance band, this test fails and a maintainer is forced to either +update the prose or investigate the methodology shift before merge. + +T21 DGP differs from T20: dose distribution is `Uniform[$0.01K, $50K]` +(was `[$5K, $50K]` in T20) so this is a Design 1 (`continuous_at_zero`) +panel where the QUG step fails-to-reject and the verdict text fires the +load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. +DGP and seed locked at `_scratch/t21_pretests/10_panel.py`. +Quoted numbers derived from `_scratch/t21_pretests/50_compose_narrative.py`. 
+ +Bootstrap p-value pins use **abs tolerance bands >= 0.15** per +`feedback_bootstrap_drift_tests_need_backend_tolerance` (Rust vs pure-Python +RNG paths can diverge by ~0.05-0.15 and flip rounding boundaries). +Deterministic statistics (QUG, Yatchew sigma2_*) get exact `round(..., 2)` +or `round(..., 4)` pins. +""" + +from __future__ import annotations + +import numpy as np +import pytest + +from diff_diff import did_had_pretest_workflow, generate_continuous_did_data, yatchew_hr_test + +# Locked T21 DGP parameters (must stay in sync with the notebook). +MAIN_SEED = 87 +N_UNITS = 60 +N_PERIODS = 8 +COHORT_PERIOD = 5 +TRUE_SLOPE = 100.0 +BASELINE_VISITS = 5000.0 +DOSE_LOW = 0.01 # T21 change vs T20 (was 5.0): supports continuous_at_zero design. +DOSE_HIGH = 50.0 +WORKFLOW_SEED = 21 + + +@pytest.fixture(scope="module") +def panel(): + raw = generate_continuous_did_data( + n_units=N_UNITS, + n_periods=N_PERIODS, + cohort_periods=[COHORT_PERIOD], + never_treated_frac=0.0, + dose_distribution="uniform", + dose_params={"low": DOSE_LOW, "high": DOSE_HIGH}, + att_function="linear", + att_intercept=0.0, + att_slope=TRUE_SLOPE, + unit_fe_sd=8.0, + time_trend=0.5, + noise_sd=2.0, + seed=MAIN_SEED, + ) + p = raw.copy() + p.loc[p["period"] < p["first_treat"], "dose"] = 0.0 + p = p.rename( + columns={ + "unit": "dma_id", + "period": "week", + "outcome": "weekly_visits", + "dose": "regional_spend_k", + } + ) + p["weekly_visits"] = p["weekly_visits"] + BASELINE_VISITS + return p + + +@pytest.fixture(scope="module") +def two_period(panel): + p = panel.copy() + p["period"] = (p["week"] >= COHORT_PERIOD).astype(int) + 1 + collapsed = p.groupby(["dma_id", "period"], as_index=False).agg( + weekly_visits=("weekly_visits", "mean"), + regional_spend_k=("regional_spend_k", "mean"), + ) + collapsed.loc[collapsed["period"] == 1, "regional_spend_k"] = 0.0 + collapsed["first_treat"] = 2 + return collapsed + + +@pytest.fixture(scope="module") +def overall_report(two_period): + return did_had_pretest_workflow( + data=two_period, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="period", + unit_col="dma_id", + first_treat_col="first_treat", + alpha=0.05, + n_bootstrap=999, + seed=WORKFLOW_SEED, + aggregate="overall", + ) + + +@pytest.fixture(scope="module") +def event_study_report(panel): + return did_had_pretest_workflow( + data=panel, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="week", + unit_col="dma_id", + first_treat_col="first_treat", + alpha=0.05, + n_bootstrap=999, + seed=WORKFLOW_SEED, + aggregate="event_study", + ) + + +def test_panel_matches_t21_locked_dgp(panel): + """Section 2 narrative claims 60 DMAs over 8 weeks, regional spend + spanning roughly $10 to $50K (the T21 Design 1 variant). If the + DGP drifts, this surfaces.""" + assert panel["dma_id"].nunique() == N_UNITS + assert panel["week"].nunique() == N_PERIODS + post_doses = ( + panel.loc[panel["week"] >= COHORT_PERIOD].groupby("dma_id")["regional_spend_k"].first() + ) + assert post_doses.min() >= DOSE_LOW, post_doses.min() + assert post_doses.max() <= DOSE_HIGH, post_doses.max() + # T21 narrative says "starts from $10" - i.e. the smallest dose is + # below $1K (~$180 from numbers.json: d_order_1 = 0.180569...). + assert post_doses.min() < 1.0, post_doses.min() + + +def test_overall_verdict_flags_assumption_7_deferred(overall_report): + """Load-bearing pivot for the upgrade-arc narrative. Sections 3-4 + of the notebook quote this verdict substring verbatim. 
If + `_compose_verdict()` is refactored such that the substring changes + or moves, this test surfaces it.""" + pivot = "Assumption 7 pre-trends test NOT run" + assert pivot in overall_report.verdict, overall_report.verdict + # Adjacent pivot the prose also quotes: + assert ( + "paper step 2 deferred to Phase 3 follow-up" in overall_report.verdict + ), overall_report.verdict + + +def test_overall_path_structural_anchors(overall_report): + """Notebook Section 3 prose claims `pretrends_joint` and + `homogeneity_joint` are both None on the overall path (they are + not populated on the two-period dispatch). Sturdier than a + verdict-string anchor against future verdict refactors.""" + assert overall_report.aggregate == "overall" + assert overall_report.pretrends_joint is None + assert overall_report.homogeneity_joint is None + assert overall_report.all_pass is True + + +def test_overall_qug_fails_to_reject(overall_report): + """Section 3 narrative claims QUG fails to reject (consistent with + Design 1, `continuous_at_zero`). QUG is fully deterministic; pin + exact rounded values.""" + assert overall_report.qug.reject is False + # T statistic = D_(1) / (D_(2) - D_(1)) is fully deterministic. + assert round(overall_report.qug.t_stat, 2) == 3.86, overall_report.qug.t_stat + assert round(overall_report.qug.critical_value, 1) == 19.0, overall_report.qug.critical_value + + +def test_overall_stute_fails_to_reject(overall_report): + """Section 3 narrative claims Stute fails-to-reject linearity. + Stute uses Mammen wild bootstrap so the p-value is RNG-dependent; + use binary fail-to-reject + abs tolerance band per + `feedback_bootstrap_drift_tests_need_backend_tolerance`.""" + assert overall_report.stute.reject is False + # Tight enough to catch methodology drift, loose enough for backend + # RNG path differences. + assert overall_report.stute.p_value > 0.50, overall_report.stute.p_value + + +def test_overall_yatchew_fails_to_reject(overall_report): + """Section 3 narrative + cell 9 callout describe the very large + negative Yatchew T_hr (~-35,000) under perfect linearity with + heterogeneous doses. Pin sigma2_* (deterministic) and the + rejection decision.""" + assert overall_report.yatchew.reject is False + assert overall_report.yatchew.p_value > 0.99, overall_report.yatchew.p_value + # sigma2_diff is deterministic given the panel. + assert ( + round(overall_report.yatchew.sigma2_diff, 0) == 6250.0 + ), overall_report.yatchew.sigma2_diff + + +def test_event_study_verdict_says_admissible(event_study_report): + """Sections 4-5 narrative claims the event-study verdict reads + 'TWFE admissible under Section 4 assumptions' (no `deferred` + caveat). Locks the upgrade-arc closure pivot.""" + assert "TWFE admissible" in event_study_report.verdict, event_study_report.verdict + assert "deferred" not in event_study_report.verdict, event_study_report.verdict + + +def test_event_study_path_structural_anchors(event_study_report): + """Section 4 narrative claims `pretrends_joint` and + `homogeneity_joint` are both populated on the event-study path + (the upgrade arc closure). 
Mirror of the overall path's negative
+    structural anchor."""
+    assert event_study_report.aggregate == "event_study"
+    assert event_study_report.pretrends_joint is not None
+    assert event_study_report.homogeneity_joint is not None
+    assert event_study_report.all_pass is True
+
+
+def test_event_study_qug_matches_overall(event_study_report, overall_report):
+    """Section 4 narrative claims QUG re-runs deterministically with
+    the same numbers as the overall path (same dose distribution at
+    F)."""
+    assert event_study_report.qug.reject is overall_report.qug.reject
+    assert round(event_study_report.qug.t_stat, 4) == round(overall_report.qug.t_stat, 4)
+
+
+def test_event_study_pretrends_horizons_correct(event_study_report):
+    """Section 4 narrative claims `joint_pretrends_test` runs over 3
+    horizons (pre-periods 1, 2, 3, with week 4 reserved as the base
+    period). Locks the earlier-pre-period precondition closure
+    (PR #402 R7) for T21's specific panel: F=5, t_pre={1,2,3,4},
+    base=4, earlier pre-periods={1,2,3}."""
+    pj = event_study_report.pretrends_joint
+    assert pj is not None
+    assert pj.n_horizons == 3, pj.n_horizons
+    assert pj.horizon_labels == ["1", "2", "3"], pj.horizon_labels
+
+
+def test_event_study_homogeneity_horizons_correct(event_study_report):
+    """Section 4 narrative claims `joint_homogeneity_test` runs over 4
+    post horizons (weeks 5, 6, 7, 8)."""
+    hj = event_study_report.homogeneity_joint
+    assert hj is not None
+    assert hj.n_horizons == 4, hj.n_horizons
+    assert hj.horizon_labels == ["5", "6", "7", "8"], hj.horizon_labels
+
+
+def test_event_study_pretrends_fails_to_reject(event_study_report):
+    """Section 4 narrative quotes the pre-trends p-value as 'close to
+    alpha = 0.05 but conclusive' (~0.07 from numbers.json). Use binary
+    fail-to-reject + a wide abs tolerance band - bootstrap p-values
+    near alpha are the most sensitive to RNG path differences."""
+    pj = event_study_report.pretrends_joint
+    assert pj is not None
+    assert pj.reject is False
+    # Tight upper bound catches a regression that pushes pre-trends to
+    # look pristine (a large p-value would belie the "close to alpha"
+    # narrative); drift toward rejection is already caught by the
+    # reject-is-False assert above.
+    assert 0.0 <= pj.p_value <= 0.25, pj.p_value
+
+
+def test_event_study_homogeneity_fails_to_reject(event_study_report):
+    """Section 4 narrative claims joint homogeneity strongly fails to
+    reject (~0.76 from numbers.json)."""
+    hj = event_study_report.homogeneity_joint
+    assert hj is not None
+    assert hj.reject is False
+    assert hj.p_value > 0.50, hj.p_value
+
+
+def test_yatchew_side_panel_linearity_passes(panel):
+    """Section 5 (Yatchew side panel) narrative claims `null="linearity"`
+    fails to reject on the within-pre-period first-difference paired
+    with post-period dose. 
Pin the T_hr statistic (deterministic); + Yatchew has no bootstrap component.""" + panel_sorted = panel.sort_values(["dma_id", "week"]).reset_index(drop=True) + pre = panel_sorted[panel_sorted["week"].isin([3, 4])] + pre_pivot = pre.pivot(index="dma_id", columns="week", values="weekly_visits") + dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64) + post_dose = ( + panel_sorted[panel_sorted["week"] == 5] + .set_index("dma_id") + .sort_index()["regional_spend_k"] + .to_numpy(dtype=np.float64) + ) + res = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="linearity") + assert res.reject is False + assert res.null_form == "linearity" + assert round(res.t_stat_hr, 2) == 0.02, res.t_stat_hr + assert round(res.sigma2_lin, 2) == 6.53, res.sigma2_lin + + +def test_yatchew_side_panel_mean_independence_passes(panel): + """Section 5 narrative claims `null="mean_independence"` fails to + reject on the same input but with larger sigma2_lin (the stricter + null has more residual variance to explain).""" + panel_sorted = panel.sort_values(["dma_id", "week"]).reset_index(drop=True) + pre = panel_sorted[panel_sorted["week"].isin([3, 4])] + pre_pivot = pre.pivot(index="dma_id", columns="week", values="weekly_visits") + dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64) + post_dose = ( + panel_sorted[panel_sorted["week"] == 5] + .set_index("dma_id") + .sort_index()["regional_spend_k"] + .to_numpy(dtype=np.float64) + ) + res_mi = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="mean_independence") + res_lin = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="linearity") + assert res_mi.reject is False + assert res_mi.null_form == "mean_independence" + assert round(res_mi.t_stat_hr, 2) == 0.55, res_mi.t_stat_hr + assert round(res_mi.sigma2_lin, 2) == 7.01, res_mi.sigma2_lin + # Pedagogical claim from Section 5: stricter null -> larger sigma2_lin. + assert res_mi.sigma2_lin > res_lin.sigma2_lin + # And the differencing variance (sigma2_diff) is shared across modes. + assert round(res_mi.sigma2_diff, 4) == round(res_lin.sigma2_diff, 4) From 7e97532e8620b511b3196cb5c2655a1dc5240b87 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 19:38:19 -0400 Subject: [PATCH 02/12] =?UTF-8?q?Fix=20T21=20PR=20number=20reference=20(#4?= =?UTF-8?q?00=20=E2=86=92=20#397)=20for=20yatchew=5Fhr=5Ftest=20mean=5Find?= =?UTF-8?q?ependence?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PR #397 added the `null="mean_independence"` mode; PR #400 was the release-rollup that bundled it. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/21_had_pretest_workflow.ipynb | 90 ++++++++++---------- 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb index 05c51989..1956de12 100644 --- a/docs/tutorials/21_had_pretest_workflow.ipynb +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "02f317b7", + "id": "6623db00", "metadata": {}, "source": [ "# Tutorial 21: HAD Pre-test Workflow - Did the Brand Campaign Satisfy the Identifying Assumptions?\n", @@ -14,7 +14,7 @@ }, { "cell_type": "markdown", - "id": "47b10255", + "id": "baf69855", "metadata": {}, "source": [ "## 1. The Pre-test Battery\n", @@ -31,7 +31,7 @@ }, { "cell_type": "markdown", - "id": "b39cfcc4", + "id": "587dd5ae", "metadata": {}, "source": [ "## 2. 
The Panel\n", @@ -42,13 +42,13 @@ { "cell_type": "code", "execution_count": 1, - "id": "e7d08b12", + "id": "4169ccef", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:26:36.985852Z", - "iopub.status.busy": "2026-05-09T23:26:36.985618Z", - "iopub.status.idle": "2026-05-09T23:26:37.598610Z", - "shell.execute_reply": "2026-05-09T23:26:37.598289Z" + "iopub.execute_input": "2026-05-09T23:37:56.569904Z", + "iopub.status.busy": "2026-05-09T23:37:56.569787Z", + "iopub.status.idle": "2026-05-09T23:37:57.370321Z", + "shell.execute_reply": "2026-05-09T23:37:57.370029Z" } }, "outputs": [ @@ -116,7 +116,7 @@ }, { "cell_type": "markdown", - "id": "3a9b551b", + "id": "c21196bd", "metadata": {}, "source": [ "## 3. Step 1: The Overall Workflow (Two-Period Path)\n", @@ -129,13 +129,13 @@ { "cell_type": "code", "execution_count": 2, - "id": "b4057d6a", + "id": "f57f8c97", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:26:37.599845Z", - "iopub.status.busy": "2026-05-09T23:26:37.599739Z", - "iopub.status.idle": "2026-05-09T23:26:37.634021Z", - "shell.execute_reply": "2026-05-09T23:26:37.633738Z" + "iopub.execute_input": "2026-05-09T23:37:57.371617Z", + "iopub.status.busy": "2026-05-09T23:37:57.371488Z", + "iopub.status.idle": "2026-05-09T23:37:57.410189Z", + "shell.execute_reply": "2026-05-09T23:37:57.409927Z" } }, "outputs": [ @@ -188,7 +188,7 @@ }, { "cell_type": "markdown", - "id": "8994fa7c", + "id": "a37bb4f5", "metadata": {}, "source": [ "**Reading the overall verdict.** Three things to note.\n", @@ -203,13 +203,13 @@ { "cell_type": "code", "execution_count": 3, - "id": "89e549ef", + "id": "78aaa722", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:26:37.635153Z", - "iopub.status.busy": "2026-05-09T23:26:37.635075Z", - "iopub.status.idle": "2026-05-09T23:26:37.636869Z", - "shell.execute_reply": "2026-05-09T23:26:37.636643Z" + "iopub.execute_input": "2026-05-09T23:37:57.411584Z", + "iopub.status.busy": "2026-05-09T23:37:57.411480Z", + "iopub.status.idle": "2026-05-09T23:37:57.413454Z", + "shell.execute_reply": "2026-05-09T23:37:57.413187Z" } }, "outputs": [ @@ -269,7 +269,7 @@ }, { "cell_type": "markdown", - "id": "892978cd", + "id": "aaa21a26", "metadata": {}, "source": [ "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000). That looks alarming but is correct here: under perfectly linear dose-response with very heterogeneous doses (Uniform[\\$0.01K, \\$50K]) and 60 sorted-by-dose units, the differencing variance `sigma2_diff` (which captures the squared gap between adjacent-by-dose units' `dy` values) is much larger than the OLS residual variance `sigma2_lin`. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. (For a different way to look at this same test, see the Yatchew side panel later in the notebook.)\n" @@ -277,7 +277,7 @@ }, { "cell_type": "markdown", - "id": "461e877c", + "id": "09b0f2a3", "metadata": {}, "source": [ "## 4. 
Step 2: Upgrade to the Event-Study Workflow\n", @@ -296,13 +296,13 @@ { "cell_type": "code", "execution_count": 4, - "id": "23b947ad", + "id": "d94b8cbf", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:26:37.637903Z", - "iopub.status.busy": "2026-05-09T23:26:37.637825Z", - "iopub.status.idle": "2026-05-09T23:26:37.760056Z", - "shell.execute_reply": "2026-05-09T23:26:37.759776Z" + "iopub.execute_input": "2026-05-09T23:37:57.414542Z", + "iopub.status.busy": "2026-05-09T23:37:57.414461Z", + "iopub.status.idle": "2026-05-09T23:37:57.539317Z", + "shell.execute_reply": "2026-05-09T23:37:57.539034Z" } }, "outputs": [ @@ -342,7 +342,7 @@ }, { "cell_type": "markdown", - "id": "78d6b00d", + "id": "ebb7378f", "metadata": {}, "source": [ "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone - all three paper steps closed jointly. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", @@ -353,13 +353,13 @@ { "cell_type": "code", "execution_count": 5, - "id": "dfaf3133", + "id": "4c0f47d0", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:26:37.761340Z", - "iopub.status.busy": "2026-05-09T23:26:37.761247Z", - "iopub.status.idle": "2026-05-09T23:26:37.763147Z", - "shell.execute_reply": "2026-05-09T23:26:37.762905Z" + "iopub.execute_input": "2026-05-09T23:37:57.540476Z", + "iopub.status.busy": "2026-05-09T23:37:57.540385Z", + "iopub.status.idle": "2026-05-09T23:37:57.542348Z", + "shell.execute_reply": "2026-05-09T23:37:57.542100Z" } }, "outputs": [ @@ -432,7 +432,7 @@ }, { "cell_type": "markdown", - "id": "acab854c", + "id": "f28a7820", "metadata": {}, "source": [ "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold - the test is not vacuous, it is informative. It is consistent with parallel pre-trends but not by a wide margin. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n", @@ -444,7 +444,7 @@ }, { "cell_type": "markdown", - "id": "ba3a0c3c", + "id": "b805e082", "metadata": {}, "source": [ "## 5. Side Panel: Yatchew-HR Null Modes\n", @@ -452,7 +452,7 @@ "The Yatchew-HR test exposes two `null=` modes (the second was added in 2026-04 for parity with the R `YatchewTest` package).\n", "\n", "- `null=\"linearity\"` (default; paper Theorem 7): tests `H0: E[dY | D]` is linear in `D`. Residuals come from OLS `dy ~ 1 + d`. This is what `did_had_pretest_workflow` calls under the hood.\n", - "- `null=\"mean_independence\"` (PR #400, 2026-04, Phase 4 R-parity): tests the stricter `H0: E[dY | D] = E[dY]`, i.e. `dY` is mean-independent of `D`. Residuals come from intercept-only OLS `dy ~ 1`. Mirrors R `YatchewTest::yatchew_test(order=0)`.\n", + "- `null=\"mean_independence\"` (added 2026-04-26 in PR #397, Phase 4 R-parity): tests the stricter `H0: E[dY | D] = E[dY]`, i.e. `dY` is mean-independent of `D`. Residuals come from intercept-only OLS `dy ~ 1`. Mirrors R `YatchewTest::yatchew_test(order=0)`.\n", "\n", "The mean-independence mode is typically used on **placebo (pre-treatment) data** to test parallel pre-trends as a non-parametric mean-independence assertion. 
Below we construct an illustrative input - the within-pre-period first-difference `dy = Y[week=4] - Y[week=3]` paired with each DMA's actual post-period dose - and run both modes side by side. Both should fail to reject on this clean linear DGP; the contrast is in the residual structure.\n" ] @@ -460,13 +460,13 @@ { "cell_type": "code", "execution_count": 6, - "id": "319a1c0c", + "id": "63fa34db", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:26:37.764116Z", - "iopub.status.busy": "2026-05-09T23:26:37.764040Z", - "iopub.status.idle": "2026-05-09T23:26:37.768553Z", - "shell.execute_reply": "2026-05-09T23:26:37.768342Z" + "iopub.execute_input": "2026-05-09T23:37:57.543368Z", + "iopub.status.busy": "2026-05-09T23:37:57.543298Z", + "iopub.status.idle": "2026-05-09T23:37:57.547774Z", + "shell.execute_reply": "2026-05-09T23:37:57.547551Z" } }, "outputs": [ @@ -528,7 +528,7 @@ }, { "cell_type": "markdown", - "id": "d142244a", + "id": "da899c45", "metadata": {}, "source": [ "**Reading the side-panel comparison.**\n", @@ -543,7 +543,7 @@ }, { "cell_type": "markdown", - "id": "5c5d2b18", + "id": "c98e7202", "metadata": {}, "source": [ "## 6. Communicating the Validation to Leadership\n", @@ -563,7 +563,7 @@ }, { "cell_type": "markdown", - "id": "0d0c55b3", + "id": "56246bf2", "metadata": {}, "source": [ "## 7. Extensions\n", @@ -585,7 +585,7 @@ }, { "cell_type": "markdown", - "id": "36453f8b", + "id": "ca588d0c", "metadata": {}, "source": [ "## 8. Summary Checklist\n", From 6c281d14d3961f553925cdba2e42771e8e99aee8 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 19:47:20 -0400 Subject: [PATCH 03/12] Tighten T21 methodology framing: QUG-decision vs DGP-support, non-rejection vs proof Two methodology framing issues in T21: 1. The DGP `Uniform[$0.01K, $50K]` has support strictly above zero. The tutorial / README / CHANGELOG / drift-test docstrings called it a "true Design 1 (`continuous_at_zero`)" panel, conflating "QUG fails-to-reject d_lower=0 in this finite sample" with "the true DGP support is at zero". Reframe across all surfaces: the DGP has a strictly-positive but very near-zero lower bound chosen so QUG fails-to-reject; HAD's `design="auto"` then selects the `continuous_at_zero` identification path on that QUG outcome (a workflow decision following the test, not a property of the true DGP). 2. The notebook over-described fail-to-reject pre-tests as "formal validation", "conclusive", "closes assumptions", "TWFE admissible without methodological caveat". Soften to "diagnostics fail to reject", "supports but does not prove", "non-rejection evidence under finite-sample power and test specification". Pre-test tutorials should teach the limits of pre-tests, not paper over them. Also extracts a `yatchew_side_panel_inputs` fixture in the drift test to deduplicate post_dose / dy construction across the two side-panel tests. Numerical pins unchanged; all 15 drift tests still pass on both backends; notebook executes cleanly; T20 drift unaffected. 
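For reviewers of the reframing, a minimal sketch of the Step 1 decision
rule at issue - illustrative only, built from the statistic and critical
value the tutorial quotes (T = D_(1) / (D_(2) - D_(1)) against
1/alpha - 1); `qug_decision` is a hypothetical name for this sketch, not
the library's internal implementation:

```python
import numpy as np

def qug_decision(treated_doses, alpha=0.05):
    # Hypothetical sketch, not library code: sort the treated doses and
    # take the two smallest as the order statistics D_(1) and D_(2)
    # (assumes D_(2) > D_(1), i.e. no tie at the minimum).
    d = np.sort(np.asarray(treated_doses, dtype=float))
    t_stat = d[0] / (d[1] - d[0])
    critical = 1.0 / alpha - 1.0  # 19 at alpha = 0.05
    # Large T rejects H0: d_lower = 0 (the smallest dose sits far from
    # the next one relative to their spacing); small T fails to reject.
    return t_stat, critical, t_stat > critical

# T21's quoted numbers: D_(1) ~= 0.1806 ($180) gives T ~= 3.86 < 19, so
# QUG fails to reject. That is consistency with d_lower = 0, not proof
# of it - the distinction this commit threads through the prose.
```
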
Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 2 +- docs/tutorials/21_had_pretest_workflow.ipynb | 131 ++++++++++--------- docs/tutorials/README.md | 4 +- tests/test_t21_had_pretest_workflow_drift.py | 74 ++++++----- 4 files changed, 110 insertions(+), 101 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index c0911534..d36b2dd9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. Uses a 60-DMA × 8-week panel close in shape to T20's but with a **Design 1 (`continuous_at_zero`) dose distribution** (Uniform[\$0.01K, \$50K] vs T20's [\$5K, \$50K]) so the QUG step in `did_had_pretest_workflow` fails-to-reject and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as deferred because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because joint pre-trends Stute (3 horizons, mean-independence null) and joint homogeneity Stute (4 horizons, linearity null) close the gap left by the overall path; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (15 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, and bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR. +- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. Uses a 60-DMA × 8-week panel close in shape to T20's but with the dose distribution drawn from `Uniform[$0.01K, $50K]` (vs T20's `[$5K, $50K]`); the true support is strictly positive but very near zero, chosen so the QUG step in `did_had_pretest_workflow` fails-to-reject `H0: d_lower = 0` in this finite sample and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. (HAD's `design="auto"` rule then selects the `continuous_at_zero` identification path with target `WAS` based on the QUG outcome — a workflow decision following the test result, not a property of the true DGP support.) 
Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as not run because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because all three testable diagnostics (QUG + joint pre-trends Stute over 3 horizons + joint homogeneity Stute over 4 horizons) fail-to-reject — non-rejection evidence under finite-sample power and test specification, not proof that the identifying assumptions hold; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (15 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, and bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR. - **`ChaisemartinDHaultfoeuille.by_path` and `paths_of_interest` now compose with `survey_design`** for analytical Binder TSL SE and replicate-weight bootstrap variance. The `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1233-1239` is replaced by a per-path multiplier-bootstrap-only gate (`survey_design + n_bootstrap > 0` under by_path / paths_of_interest still raises, since the survey-aware perturbation pivot for path-restricted IFs is methodologically underived). Per-path SE routes through the existing `_survey_se_from_group_if` cell-period allocator: the per-period IF (`U_pp_l_path`) is built with non-path switcher-side contributions skipped (control contributions are unchanged, matching the joiners/leavers IF convention; preserves the row-sum identity `U_pp.sum(axis=1) == U`), cohort-recentered via `_cohort_recenter_per_period`, then expanded to observations as `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})`. Replicate-weight designs unconditionally use the cell allocator (Class A contract from PR #323). New `_refresh_path_inference` helper post-call refreshes `safe_inference` on every populated entry across `multi_horizon_inference`, `placebo_horizon_inference`, `path_effects`, and `path_placebos` so all four surfaces use the same final `df_survey` after per-path replicate fits append `n_valid` to the shared accumulator. Path-enumeration ranking under `survey_design` remains unweighted (group-cardinality, not population-weight mass). Lonely-PSU policy stays sample-wide, not per-path. Telescope invariant: on a single-path panel, per-path SE matches the global non-by_path survey SE bit-exactly. **No R parity** — R `did_multiplegt_dyn` does not support survey weighting; this is a Python-only methodology extension. 
The global non-by_path TSL multiplier-bootstrap path is unaffected (anti-regression test `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical::test_global_survey_plus_n_bootstrap_still_works` locks the per-path-only scope of the new gate). Cross-surface invariants regression-tested at `TestByPathSurveyDesignAnalytical` (~17 tests across gate / dispatch / analytical SE / replicate-weight SE / per-path placebos / `trends_linear` composition / unobserved-path warnings / final-df refresh regressions) and `TestByPathSurveyDesignTelescope`. See `docs/methodology/REGISTRY.md` §`ChaisemartinDHaultfoeuille` `Note (Phase 3 by_path ...)` → "Per-path survey-design SE" for the full contract. - **Inference-field aliases on staggered result classes** for adapter / external-consumer compatibility. Read-only `@property` aliases expose the flat `att` / `se` / `conf_int` / `p_value` / `t_stat` names (matching `DiDResults` / `TROPResults` / `SyntheticDiDResults` / `HeterogeneousAdoptionDiDResults`) on every result class that previously only carried prefixed canonical fields: `CallawaySantAnnaResults`, `StackedDiDResults`, `EfficientDiDResults`, `ChaisemartinDHaultfoeuilleResults`, `StaggeredTripleDiffResults`, `WooldridgeDiDResults`, `SunAbrahamResults`, `ImputationDiDResults`, `TwoStageDiDResults` (mapping to `overall_*`); `ContinuousDiDResults` (mapping to `overall_att_*`, ATT-side as the headline, ACRT-side accessible unchanged via `overall_acrt_*`); `MultiPeriodDiDResults` (mapping to `avg_*`). `ContinuousDiDResults` additionally exposes `overall_se` / `overall_conf_int` / `overall_p_value` / `overall_t_stat` aliases for naming consistency with the rest of the staggered family. Aliases are pure read-throughs over the canonical fields — no recomputation, no behavior change — so the `safe_inference()` joint-NaN contract (per CLAUDE.md "Inference computation") is inherited automatically (NaN canonical → NaN alias, locked at `tests/test_result_aliases.py::test_pattern_b_aliases_propagate_nan`). The native `overall_*` / `overall_att_*` / `avg_*` fields remain canonical for documentation and computation. Motivated by the `balance.interop.diff_diff.as_balance_diagnostic()` adapter (`facebookresearch/balance` PR #465) which calls `getattr(res, "se", None)` / `getattr(res, "conf_int", None)` without a fallback chain — pre-alias, every staggered result class returned `None` on those keys, silently dropping `se` and `conf_int` from the adapter's diagnostic dict. 23 alias-mechanic + balance-adapter regression tests at `tests/test_result_aliases.py`. Patch-level (additive on stable surfaces). - **`ChaisemartinDHaultfoeuille.by_path` + non-binary integer treatment** — `by_path=k` now accepts integer-coded discrete treatment (D in Z, e.g. ordinal `{0, 1, 2}`); path tuples become integer-state tuples like `(0, 2, 2, 2)`. The previous `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1870` is replaced by a `ValueError` for continuous D (e.g. `D=1.5`) at fit-time per the no-silent-failures contract — the existing `int(round(float(v)))` cast in `_enumerate_treatment_paths` is now defensive (no-op for integer-coded D). 
Validated against R `did_multiplegt_dyn(..., by_path)` for D in `{0, 1, 2}` via the new `multi_path_reversible_by_path_non_binary` golden-value scenario (78 switchers, 3 paths, single-baseline custom DGP, F_g >= 4): per-path point estimates match R bit-exactly (rtol ~1e-9 on event horizons; rtol+atol envelope for placebo near-zero values), per-path SE inherits the documented cross-path cohort-sharing deviation (~5% rtol observed; SE_RTOL=0.15 envelope). **Deviation from R for D >= 10:** R's `did_multiplegt_by_path` derives the per-path baseline via `path_index$baseline_XX <- substr(path_index$path, 1, 1)`, which captures only the first character of the comma-separated path string (e.g. for `path = "12,12,..."` it captures `"1"` instead of `"12"`); this mis-allocates R's per-path control-pool subset for D >= 10. Python's tuple-key matching is correct in this regime — the per-path point estimates we compute are correct; R's per-path subset for the same path is buggy. The shipped parity scenario stays in `D in {0, 1, 2}` to avoid the R bug. R-parity test at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathNonBinary`; cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary`. diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb index 1956de12..7acd8f07 100644 --- a/docs/tutorials/21_had_pretest_workflow.ipynb +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -2,19 +2,19 @@ "cells": [ { "cell_type": "markdown", - "id": "6623db00", + "id": "9e25598f", "metadata": {}, "source": [ - "# Tutorial 21: HAD Pre-test Workflow - Did the Brand Campaign Satisfy the Identifying Assumptions?\n", + "# Tutorial 21: HAD Pre-test Workflow - Running the Pre-test Diagnostics on the Brand Campaign Panel\n", "\n", "[Tutorial 20](20_had_brand_campaign.ipynb) fit `HeterogeneousAdoptionDiD` (HAD) on a regional brand-campaign panel and reported a per-dollar lift, with a brief visual placebo check at the end. We deliberately deferred the **formal pre-test workflow** to this tutorial, with a forward pointer in T20's \"Extensions\" section.\n", "\n", - "This tutorial picks up where T20 left off. We re-run the brand campaign on a panel close in shape to T20's, then walk through HAD's composite pre-test workflow `did_had_pretest_workflow` to formally validate the identifying assumptions (paper Section 4.2 of de Chaisemartin, Ciccia, D'Haultfoeuille, & Knau (2026)). We start with the two-period (`aggregate=\"overall\"`) workflow, observe that it leaves the parallel pre-trends step open, and then **upgrade** to the multi-period (`aggregate=\"event_study\"`) workflow that closes all three paper steps jointly. A side panel compares the two `null=` modes of the Yatchew-HR linearity test, including the recently-shipped `null=\"mean_independence\"` mode (R-parity with `YatchewTest::yatchew_test(order=0)`).\n" + "This tutorial picks up where T20 left off. We re-run the brand campaign on a panel close in shape to T20's, then walk through HAD's composite pre-test workflow `did_had_pretest_workflow` and read the diagnostics for paper Section 4.2 of de Chaisemartin, Ciccia, D'Haultfoeuille, & Knau (2026). We start with the two-period (`aggregate=\"overall\"`) workflow, observe that it does not run the parallel pre-trends step, and then **upgrade** to the multi-period (`aggregate=\"event_study\"`) workflow that adds the joint Stute pre-trends and joint homogeneity diagnostics. 
None of the diagnostics in this tutorial reject; we walk through what that does and does not let us conclude. A side panel compares the two `null=` modes of the Yatchew-HR test, including the recently-shipped `null=\"mean_independence\"` mode (R-parity with `YatchewTest::yatchew_test(order=0)`).\n" ] }, { "cell_type": "markdown", - "id": "baf69855", + "id": "0cc1feee", "metadata": {}, "source": [ "## 1. The Pre-test Battery\n", @@ -31,24 +31,24 @@ }, { "cell_type": "markdown", - "id": "587dd5ae", + "id": "9ac9f15b", "metadata": {}, "source": [ "## 2. The Panel\n", "\n", - "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial spans roughly **$10 to $50K** (Uniform[\\$0.01K, \\$50K]) instead of T20's Uniform[\\$5K, \\$50K]. Some markets barely participated in the regional add-on - they put in essentially nothing. This shifts HAD's design path from T20's `continuous_near_d_lower` (Design 1', target = `WAS_d_lower`) to `continuous_at_zero` (Design 1, target = `WAS`) - the QUG test in Step 1 confirms that.\n" + "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. With the true `D_(1)` close to zero, the QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1) identification path even though the true simulation lower bound is positive. HAD's `design=\"auto\"` detection follows the same QUG decision rule and will land on `continuous_at_zero` with target `WAS` (rather than T20's `continuous_near_d_lower` / `WAS_d_lower`). The point of this tutorial is not to assert that the data is Design 1 from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.\n" ] }, { "cell_type": "code", "execution_count": 1, - "id": "4169ccef", + "id": "4ced81a7", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:37:56.569904Z", - "iopub.status.busy": "2026-05-09T23:37:56.569787Z", - "iopub.status.idle": "2026-05-09T23:37:57.370321Z", - "shell.execute_reply": "2026-05-09T23:37:57.370029Z" + "iopub.execute_input": "2026-05-09T23:46:44.813436Z", + "iopub.status.busy": "2026-05-09T23:46:44.813125Z", + "iopub.status.idle": "2026-05-09T23:46:45.859473Z", + "shell.execute_reply": "2026-05-09T23:46:45.859187Z" } }, "outputs": [ @@ -116,7 +116,7 @@ }, { "cell_type": "markdown", - "id": "c21196bd", + "id": "53772584", "metadata": {}, "source": [ "## 3. 
Step 1: The Overall Workflow (Two-Period Path)\n", @@ -129,13 +129,13 @@ { "cell_type": "code", "execution_count": 2, - "id": "f57f8c97", + "id": "1d7d1a0e", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:37:57.371617Z", - "iopub.status.busy": "2026-05-09T23:37:57.371488Z", - "iopub.status.idle": "2026-05-09T23:37:57.410189Z", - "shell.execute_reply": "2026-05-09T23:37:57.409927Z" + "iopub.execute_input": "2026-05-09T23:46:45.860769Z", + "iopub.status.busy": "2026-05-09T23:46:45.860646Z", + "iopub.status.idle": "2026-05-09T23:46:45.902629Z", + "shell.execute_reply": "2026-05-09T23:46:45.902302Z" } }, "outputs": [ @@ -188,14 +188,14 @@ }, { "cell_type": "markdown", - "id": "a37bb4f5", + "id": "bbc73e9e", "metadata": {}, "source": [ "**Reading the overall verdict.** Three things to note.\n", "\n", - "- **Step 1 (QUG) fails to reject:** `D_(1)` (the smallest treated dose, ~\\$180 here) is small relative to the gap `D_(2) - D_(1)`, so the test statistic `T = D_(1) / (D_(2) - D_(1))` lands well below its critical value (1/alpha - 1 = 19 at alpha = 0.05). The data are consistent with `d_lower = 0` (Design 1, `continuous_at_zero`, target = `WAS`).\n", - "- **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The differenced outcome `dY` looks linear in `D`, so the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias.\n", - "- **Step 2 (Assumption 7 pre-trends) is structurally absent.** The verdict says so verbatim: `\"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)\"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path).\n", + "- **Step 1 (QUG) fails to reject:** `D_(1)` (the smallest treated dose, ~\\$180 here) is small relative to the gap `D_(2) - D_(1)`, so the test statistic `T = D_(1) / (D_(2) - D_(1))` lands well below its critical value (1/alpha - 1 = 19 at alpha = 0.05). The data are statistically consistent with `d_lower = 0` and the workflow's `design=\"auto\"` rule selects the `continuous_at_zero` (Design 1) identification path with target = `WAS`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. The workflow's choice is the right operational call, but it does not prove the support infimum is exactly zero.)\n", + "- **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The diagnostics do not flag heterogeneity bias on the dose dimension, so reading the WAS as an average per-dose marginal effect is supported by these tests (subject to finite-sample power).\n", + "- **Step 2 (Assumption 7 pre-trends) is not run on this path.** The verdict says so verbatim: `\"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)\"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. 
The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path).\n", "\n", "Let's look at each individual test result.\n" ] @@ -203,13 +203,13 @@ { "cell_type": "code", "execution_count": 3, - "id": "78aaa722", + "id": "d009ea15", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:37:57.411584Z", - "iopub.status.busy": "2026-05-09T23:37:57.411480Z", - "iopub.status.idle": "2026-05-09T23:37:57.413454Z", - "shell.execute_reply": "2026-05-09T23:37:57.413187Z" + "iopub.execute_input": "2026-05-09T23:46:45.904054Z", + "iopub.status.busy": "2026-05-09T23:46:45.903927Z", + "iopub.status.idle": "2026-05-09T23:46:45.906185Z", + "shell.execute_reply": "2026-05-09T23:46:45.905937Z" } }, "outputs": [ @@ -269,7 +269,7 @@ }, { "cell_type": "markdown", - "id": "aaa21a26", + "id": "d6258552", "metadata": {}, "source": [ "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000). That looks alarming but is correct here: under perfectly linear dose-response with very heterogeneous doses (Uniform[\\$0.01K, \\$50K]) and 60 sorted-by-dose units, the differencing variance `sigma2_diff` (which captures the squared gap between adjacent-by-dose units' `dy` values) is much larger than the OLS residual variance `sigma2_lin`. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. (For a different way to look at this same test, see the Yatchew side panel later in the notebook.)\n" @@ -277,14 +277,14 @@ }, { "cell_type": "markdown", - "id": "09b0f2a3", + "id": "9b25fada", "metadata": {}, "source": [ "## 4. Step 2: Upgrade to the Event-Study Workflow\n", "\n", - "The two-period workflow gave us evidence on Steps 1 and 3 but no formal evidence on Step 2 (parallel pre-trends). Our panel actually has 8 weeks - that's enough pre-periods to close Step 2 jointly with Stute's joint variant (paper Section 4.2 step 2 + Hlavka-Huskova 2020 / Delgado-Manteiga 2001 dependence-preserving Mammen multiplier bootstrap).\n", + "The two-period workflow ran Steps 1 and 3 but did not run Step 2 (parallel pre-trends). Our panel actually has 8 weeks - that is enough pre-periods to add the joint Stute pre-trends diagnostic (paper Section 4.2 step 2 + Hlavka-Huskova 2020 / Delgado-Manteiga 2001 dependence-preserving Mammen multiplier bootstrap).\n", "\n", - "We pass the full multi-period panel to `did_had_pretest_workflow(aggregate=\"event_study\", ...)`. The dispatch covers all three paper steps in one call:\n", + "We pass the full multi-period panel to `did_had_pretest_workflow(aggregate=\"event_study\", ...)`. 
The dispatch runs all three testable steps in one call:\n", "\n", "- **Step 1**: QUG re-runs on the dose distribution at the treatment period `F` (deterministic; same numbers as the overall path).\n", "- **Step 2**: `joint_pretrends_test` - mean-independence joint Stute over the pre-period horizons (`E[Y_t - Y_base | D] = mu_t` for each t < F).\n", @@ -296,13 +296,13 @@ { "cell_type": "code", "execution_count": 4, - "id": "d94b8cbf", + "id": "6dd7f0f3", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:37:57.414542Z", - "iopub.status.busy": "2026-05-09T23:37:57.414461Z", - "iopub.status.idle": "2026-05-09T23:37:57.539317Z", - "shell.execute_reply": "2026-05-09T23:37:57.539034Z" + "iopub.execute_input": "2026-05-09T23:46:45.907227Z", + "iopub.status.busy": "2026-05-09T23:46:45.907141Z", + "iopub.status.idle": "2026-05-09T23:46:46.040067Z", + "shell.execute_reply": "2026-05-09T23:46:46.039690Z" } }, "outputs": [ @@ -342,10 +342,12 @@ }, { "cell_type": "markdown", - "id": "ebb7378f", + "id": "5f12d7aa", "metadata": {}, "source": [ - "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone - all three paper steps closed jointly. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", + "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", + "\n", + "A note on the verdict's \"TWFE admissible\" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05`. That is non-rejection evidence under the diagnostics' finite-sample power and specification, not a proof that the identifying assumptions hold. Step 4 (boundary continuity, paper Assumptions 5 / 6) remains non-testable from data and is not covered by any of the three diagnostics here.\n", "\n", "The joint pre-trends test runs over `n_horizons = 3` (pre-periods 1, 2, 3, with week 4 reserved as the base period). The joint homogeneity test runs over `n_horizons = 4` (post-periods 5, 6, 7, 8). Let's inspect the per-horizon detail.\n" ] @@ -353,13 +355,13 @@ { "cell_type": "code", "execution_count": 5, - "id": "4c0f47d0", + "id": "cfaa750b", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:37:57.540476Z", - "iopub.status.busy": "2026-05-09T23:37:57.540385Z", - "iopub.status.idle": "2026-05-09T23:37:57.542348Z", - "shell.execute_reply": "2026-05-09T23:37:57.542100Z" + "iopub.execute_input": "2026-05-09T23:46:46.041790Z", + "iopub.status.busy": "2026-05-09T23:46:46.041665Z", + "iopub.status.idle": "2026-05-09T23:46:46.043716Z", + "shell.execute_reply": "2026-05-09T23:46:46.043421Z" } }, "outputs": [ @@ -432,19 +434,19 @@ }, { "cell_type": "markdown", - "id": "f28a7820", + "id": "b95cbac1", "metadata": {}, "source": [ - "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold - the test is not vacuous, it is informative. It is consistent with parallel pre-trends but not by a wide margin. 
In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n",
+    "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold. The test does not reject at alpha = 0.05, but the near-threshold p-value warrants scrutiny - the non-rejection is not comfortably far from the rejection threshold. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n",
     "\n",
-    "The joint homogeneity p-value (~0.76) is a strong fail-to-reject. Linearity holds across all four post-launch horizons.\n",
+    "The joint homogeneity p-value (~0.76) is comfortably far from rejection. The diagnostic does not flag heterogeneity bias on the dose dimension across the four post-launch horizons.\n",
     "\n",
-    "Together with QUG (design verdict) and joint linearity (Step 3), this closes the testable portion of the paper's identification framework. Step 4 (boundary continuity, Assumptions 5 / 6) remains non-testable; we still defend it from domain knowledge as in T20.\n"
+    "Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. Step 4 (boundary continuity, Assumptions 5 / 6) remains non-testable from data and is argued from domain knowledge, as in T20.\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "b805e082",
+   "id": "c0d6ddbb",
    "metadata": {},
    "source": [
     "## 5. Side Panel: Yatchew-HR Null Modes\n",
@@ -460,13 +462,13 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "63fa34db",
+   "id": "c231b096",
    "metadata": {
     "execution": {
-     "iopub.execute_input": "2026-05-09T23:37:57.543368Z",
-     "iopub.status.busy": "2026-05-09T23:37:57.543298Z",
-     "iopub.status.idle": "2026-05-09T23:37:57.547774Z",
-     "shell.execute_reply": "2026-05-09T23:37:57.547551Z"
+     "iopub.execute_input": "2026-05-09T23:46:46.045080Z",
+     "iopub.status.busy": "2026-05-09T23:46:46.044960Z",
+     "iopub.status.idle": "2026-05-09T23:46:46.050811Z",
+     "shell.execute_reply": "2026-05-09T23:46:46.050511Z"
     }
    },
    "outputs": [
@@ -528,7 +530,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "da899c45",
+   "id": "f0a34622",
    "metadata": {},
    "source": [
     "**Reading the side-panel comparison.**\n",
@@ -543,27 +545,27 @@
   },
   {
    "cell_type": "markdown",
-   "id": "c98e7202",
+   "id": "fa7bcd99",
    "metadata": {},
    "source": [
-    "## 6. Communicating the Validation to Leadership\n",
+    "## 6. Communicating the Diagnostics to Leadership\n",
     "\n",
-    "Pre-test results travel awkwardly to non-technical audiences. The template below structures the validation around what each test rules out - mirroring the headline-and-evidence pattern from T20 Section 5.\n",
+    "Pre-test results travel awkwardly to non-technical audiences. 
The template below structures the diagnostics around what each test does and does not rule out - mirroring the headline-and-evidence pattern from T20 Section 5.\n", "\n", - "> **Identifying assumptions for HAD on the brand-campaign panel are defended on all three paper steps.**\n", + "> **The HAD pre-test diagnostics on the brand-campaign panel do not flag a violation of the testable identifying assumptions.**\n", ">\n", - "> - **Step 1 (QUG support-infimum, paper Theorem 4):** the test is consistent with the dose distribution starting at zero (`d_lower = 0`, p approximately 0.21). The library auto-detects the `continuous_at_zero` design and reports the WAS (Weighted Average Slope), as expected for this panel where some markets barely participated in the regional spend.\n", - "> - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test fails to reject (joint p approximately 0.07 across the three pre-period horizons). The pre-trend evidence is not a slam dunk - the p-value is close to alpha = 0.05 - but it is conclusive. In a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending.\n", - "> - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity fails to reject (joint p approximately 0.76 across the four post-launch horizons). The linearity assumption needed for the WAS reading to reflect the average per-dose marginal effect (rather than masking heterogeneity bias) is comfortably supported.\n", + "> - **Step 1 (QUG support-infimum, paper Theorem 4):** the test does not reject `H0: d_lower = 0` (p approximately 0.21). The data are statistically consistent with a dose distribution starting at zero, so the library's `design=\"auto\"` selects the `continuous_at_zero` design and reports the WAS (Weighted Average Slope). This is a workflow decision based on the QUG outcome; failing to reject is not proof that the true support is exactly at zero.\n", + "> - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test does not reject (joint p approximately 0.07 across the three pre-period horizons). The p-value is close to alpha = 0.05, so the non-rejection here is not by a wide margin - in a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending.\n", + "> - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity does not reject (joint p approximately 0.76 across the four post-launch horizons). The diagnostic does not flag heterogeneity bias on the dose dimension under the test's specification.\n", ">\n", "> **Non-testable from data (Step 4, paper Assumptions 5 / 6, boundary continuity):** local-linearity of the dose-response near `d_lower`. Argued from domain knowledge - is there reason to believe the marginal effect of an additional $1K of regional spend is roughly constant across the dose range? In our case yes, by DGP construction; in a real analysis we would justify this from prior knowledge of the channel's response shape.\n", ">\n", - "> **Bottom line:** TWFE is admissible under the paper's framework on this panel. The headline per-$1K lift from the HAD fit can be carried forward to leadership without methodological caveat beyond Step 4 (which is qualitative, not data-driven).\n" + "> **Bottom line:** the workflow's three testable diagnostics do not flag a violation. 
Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and Step 4 (boundary continuity, non-testable from data). None of these are settled by non-rejection of the pre-tests.\n" ] }, { "cell_type": "markdown", - "id": "56246bf2", + "id": "0126ef99", "metadata": {}, "source": [ "## 7. Extensions\n", @@ -585,16 +587,17 @@ }, { "cell_type": "markdown", - "id": "ca588d0c", + "id": "cad9c1d7", "metadata": {}, "source": [ "## 8. Summary Checklist\n", "\n", "- HAD's pre-test workflow `did_had_pretest_workflow` bundles paper Section 4.2 Steps 1 (QUG support infimum), 2 (joint Stute pre-trends - event-study path only), and 3 (Stute / Yatchew-HR linearity, joint variant on event-study path).\n", "- The two-period (`aggregate=\"overall\"`) path runs Steps 1 + 3 only - it cannot run Step 2 because a single pre-period structurally has nothing to test against. The verdict says so verbatim: \"Assumption 7 pre-trends test NOT run\".\n", - "- Upgrade to the multi-period (`aggregate=\"event_study\"`) path to close all three testable steps jointly. The verdict then reads \"TWFE admissible under Section 4 assumptions\" when nothing rejects.\n", + "- Upgrade to the multi-period (`aggregate=\"event_study\"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads \"TWFE admissible under Section 4 assumptions\" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof.\n", "- Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge.\n", "- The Yatchew-HR test exposes two null modes: `null=\"linearity\"` (paper Theorem 7, default; what the workflow calls under the hood) and `null=\"mean_independence\"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data).\n", + "- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The workflow uses the QUG outcome to pick the identification path (`continuous_at_zero` vs `continuous_near_d_lower`); finite-sample uncertainty in that decision is a remaining caveat.\n", "- Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python).\n" ] } diff --git a/docs/tutorials/README.md b/docs/tutorials/README.md index f3f27bdb..a7b495bb 100644 --- a/docs/tutorials/README.md +++ b/docs/tutorials/README.md @@ -104,10 +104,10 @@ Practitioner walkthrough for measuring per-dollar lift when every market is trea - Companion drift-test file (`tests/test_t20_had_brand_campaign_drift.py`) ### 21. 
HAD Pre-test Workflow (`21_had_pretest_workflow.ipynb`) -Composite pre-test walkthrough for `HeterogeneousAdoptionDiD`, building on Tutorial 20's brand-campaign framing on a Design 1 (`continuous_at_zero`) panel variant: +Composite pre-test walkthrough for `HeterogeneousAdoptionDiD`, building on Tutorial 20's brand-campaign framing on a panel where the dose distribution has a strictly positive but very near-zero lower bound (so the QUG step fails-to-reject `H0: d_lower = 0`): - Paper Section 4.2 step taxonomy (QUG support-infimum, parallel pre-trends, linearity) - `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse: Step 1 + Step 3 only, verdict explicitly flags Step 2 as deferred -- Upgrade to `did_had_pretest_workflow(aggregate="event_study")` on the multi-week panel: closes all three testable steps via QUG + joint pre-trends Stute + joint homogeneity Stute +- Upgrade to `did_had_pretest_workflow(aggregate="event_study")` on the multi-week panel: adds the joint pre-trends Stute and joint homogeneity Stute diagnostics (none of the three testable steps reject) - Side panel comparing `yatchew_hr_test` `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) - Companion drift-test file (`tests/test_t21_had_pretest_workflow_drift.py`) diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py index 62a8893d..bcc599fd 100644 --- a/tests/test_t21_had_pretest_workflow_drift.py +++ b/tests/test_t21_had_pretest_workflow_drift.py @@ -14,10 +14,13 @@ update the prose or investigate the methodology shift before merge. T21 DGP differs from T20: dose distribution is `Uniform[$0.01K, $50K]` -(was `[$5K, $50K]` in T20) so this is a Design 1 (`continuous_at_zero`) -panel where the QUG step fails-to-reject and the verdict text fires the -load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. -DGP and seed locked at `_scratch/t21_pretests/10_panel.py`. +(was `[$5K, $50K]` in T20). The true support is strictly positive but very +near zero, chosen so the QUG step fails-to-reject `H0: d_lower = 0` in this +finite sample. That QUG outcome lets the workflow's `design="auto"` rule +land on `continuous_at_zero` (a workflow decision based on the test, not a +property of the true DGP), which in turn populates the verdict text with +the load-bearing "Assumption 7 deferred" substring used for the upgrade-arc +narrative. DGP and seed locked at `_scratch/t21_pretests/10_panel.py`. Quoted numbers derived from `_scratch/t21_pretests/50_compose_narrative.py`. Bootstrap p-value pins use **abs tolerance bands >= 0.15** per @@ -41,7 +44,7 @@ COHORT_PERIOD = 5 TRUE_SLOPE = 100.0 BASELINE_VISITS = 5000.0 -DOSE_LOW = 0.01 # T21 change vs T20 (was 5.0): supports continuous_at_zero design. +DOSE_LOW = 0.01 # T21 change vs T20 (was 5.0): near-zero lower bound chosen so QUG fails-to-reject H0: d_lower = 0. DOSE_HIGH = 50.0 WORKFLOW_SEED = 21 @@ -122,10 +125,30 @@ def event_study_report(panel): ) +@pytest.fixture(scope="module") +def yatchew_side_panel_inputs(panel): + """Section 5's Yatchew side panel: post-period dose paired with the + within-pre-period first-difference dy = Y[w4] - Y[w3]. 
Shared + construction between the linearity-mode and mean_independence-mode + tests below.""" + panel_sorted = panel.sort_values(["dma_id", "week"]).reset_index(drop=True) + pre = panel_sorted[panel_sorted["week"].isin([3, 4])] + pre_pivot = pre.pivot(index="dma_id", columns="week", values="weekly_visits") + dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64) + post_dose = ( + panel_sorted[panel_sorted["week"] == 5] + .set_index("dma_id") + .sort_index()["regional_spend_k"] + .to_numpy(dtype=np.float64) + ) + return post_dose, dy + + def test_panel_matches_t21_locked_dgp(panel): """Section 2 narrative claims 60 DMAs over 8 weeks, regional spend - spanning roughly $10 to $50K (the T21 Design 1 variant). If the - DGP drifts, this surfaces.""" + drawn from Uniform[$0.01K, $50K] - true support strictly positive + but very near zero (so QUG can fail-to-reject in this finite + sample). If the DGP drifts, this surfaces.""" assert panel["dma_id"].nunique() == N_UNITS assert panel["week"].nunique() == N_PERIODS post_doses = ( @@ -163,9 +186,10 @@ def test_overall_path_structural_anchors(overall_report): def test_overall_qug_fails_to_reject(overall_report): - """Section 3 narrative claims QUG fails to reject (consistent with - Design 1, `continuous_at_zero`). QUG is fully deterministic; pin - exact rounded values.""" + """Section 3 narrative claims QUG fails-to-reject H0: d_lower = 0 + (data are statistically consistent with continuous_at_zero design; + HAD's design="auto" rule selects that path on this QUG outcome). + QUG is fully deterministic; pin exact rounded values.""" assert overall_report.qug.reject is False # T statistic = D_(1) / (D_(2) - D_(1)) is fully deterministic. assert round(overall_report.qug.t_stat, 2) == 3.86, overall_report.qug.t_stat @@ -267,21 +291,12 @@ def test_event_study_homogeneity_fails_to_reject(event_study_report): assert hj.p_value > 0.50, hj.p_value -def test_yatchew_side_panel_linearity_passes(panel): +def test_yatchew_side_panel_linearity_passes(yatchew_side_panel_inputs): """Section 5 (Yatchew side panel) narrative claims `null="linearity"` - fails to reject on the within-pre-period first-difference paired + does not reject on the within-pre-period first-difference paired with post-period dose. 
Pin the T_hr statistic (deterministic); Yatchew has no bootstrap component.""" - panel_sorted = panel.sort_values(["dma_id", "week"]).reset_index(drop=True) - pre = panel_sorted[panel_sorted["week"].isin([3, 4])] - pre_pivot = pre.pivot(index="dma_id", columns="week", values="weekly_visits") - dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64) - post_dose = ( - panel_sorted[panel_sorted["week"] == 5] - .set_index("dma_id") - .sort_index()["regional_spend_k"] - .to_numpy(dtype=np.float64) - ) + post_dose, dy = yatchew_side_panel_inputs res = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="linearity") assert res.reject is False assert res.null_form == "linearity" @@ -289,20 +304,11 @@ def test_yatchew_side_panel_linearity_passes(panel): assert round(res.sigma2_lin, 2) == 6.53, res.sigma2_lin -def test_yatchew_side_panel_mean_independence_passes(panel): - """Section 5 narrative claims `null="mean_independence"` fails to +def test_yatchew_side_panel_mean_independence_passes(yatchew_side_panel_inputs): + """Section 5 narrative claims `null="mean_independence"` does not reject on the same input but with larger sigma2_lin (the stricter null has more residual variance to explain).""" - panel_sorted = panel.sort_values(["dma_id", "week"]).reset_index(drop=True) - pre = panel_sorted[panel_sorted["week"].isin([3, 4])] - pre_pivot = pre.pivot(index="dma_id", columns="week", values="weekly_visits") - dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64) - post_dose = ( - panel_sorted[panel_sorted["week"] == 5] - .set_index("dma_id") - .sort_index()["regional_spend_k"] - .to_numpy(dtype=np.float64) - ) + post_dose, dy = yatchew_side_panel_inputs res_mi = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="mean_independence") res_lin = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="linearity") assert res_mi.reject is False From 62b756a613ea97371b6ebdf77be54119bd12168b Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 20:17:48 -0400 Subject: [PATCH 04/12] T21 P3 wording cleanups: align Section 7 + drift docstring with revised tutorial MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two stale shorthand phrasings inconsistent with the revised methodology framing: - Section 7 Extensions: "single Design 1 panel" → "single panel where QUG led the workflow to select the continuous_at_zero (Design 1) identification path" (matches the corrected Section 2 wording). - `test_event_study_pretrends_fails_to_reject` docstring quoted "close to alpha = 0.05 but conclusive"; the user-facing text now says "warrants scrutiny" - update internal docstring to match. No methodology change, no new pins; all 15 drift tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/21_had_pretest_workflow.ipynb | 90 ++++++++++---------- tests/test_t21_had_pretest_workflow_drift.py | 3 +- 2 files changed, 47 insertions(+), 46 deletions(-) diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb index 7acd8f07..8516b400 100644 --- a/docs/tutorials/21_had_pretest_workflow.ipynb +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "9e25598f", + "id": "2c409551", "metadata": {}, "source": [ "# Tutorial 21: HAD Pre-test Workflow - Running the Pre-test Diagnostics on the Brand Campaign Panel\n", @@ -14,7 +14,7 @@ }, { "cell_type": "markdown", - "id": "0cc1feee", + "id": "1ccaad91", "metadata": {}, "source": [ "## 1. 
The Pre-test Battery\n", @@ -31,7 +31,7 @@ }, { "cell_type": "markdown", - "id": "9ac9f15b", + "id": "c110fe0e", "metadata": {}, "source": [ "## 2. The Panel\n", @@ -42,13 +42,13 @@ { "cell_type": "code", "execution_count": 1, - "id": "4ced81a7", + "id": "05269d1c", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:46:44.813436Z", - "iopub.status.busy": "2026-05-09T23:46:44.813125Z", - "iopub.status.idle": "2026-05-09T23:46:45.859473Z", - "shell.execute_reply": "2026-05-09T23:46:45.859187Z" + "iopub.execute_input": "2026-05-10T00:17:36.394301Z", + "iopub.status.busy": "2026-05-10T00:17:36.394076Z", + "iopub.status.idle": "2026-05-10T00:17:37.818650Z", + "shell.execute_reply": "2026-05-10T00:17:37.818348Z" } }, "outputs": [ @@ -116,7 +116,7 @@ }, { "cell_type": "markdown", - "id": "53772584", + "id": "91811549", "metadata": {}, "source": [ "## 3. Step 1: The Overall Workflow (Two-Period Path)\n", @@ -129,13 +129,13 @@ { "cell_type": "code", "execution_count": 2, - "id": "1d7d1a0e", + "id": "cbda5c0c", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:46:45.860769Z", - "iopub.status.busy": "2026-05-09T23:46:45.860646Z", - "iopub.status.idle": "2026-05-09T23:46:45.902629Z", - "shell.execute_reply": "2026-05-09T23:46:45.902302Z" + "iopub.execute_input": "2026-05-10T00:17:37.819909Z", + "iopub.status.busy": "2026-05-10T00:17:37.819802Z", + "iopub.status.idle": "2026-05-10T00:17:37.858844Z", + "shell.execute_reply": "2026-05-10T00:17:37.858574Z" } }, "outputs": [ @@ -188,7 +188,7 @@ }, { "cell_type": "markdown", - "id": "bbc73e9e", + "id": "9452bc09", "metadata": {}, "source": [ "**Reading the overall verdict.** Three things to note.\n", @@ -203,13 +203,13 @@ { "cell_type": "code", "execution_count": 3, - "id": "d009ea15", + "id": "7dca161a", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:46:45.904054Z", - "iopub.status.busy": "2026-05-09T23:46:45.903927Z", - "iopub.status.idle": "2026-05-09T23:46:45.906185Z", - "shell.execute_reply": "2026-05-09T23:46:45.905937Z" + "iopub.execute_input": "2026-05-10T00:17:37.860034Z", + "iopub.status.busy": "2026-05-10T00:17:37.859953Z", + "iopub.status.idle": "2026-05-10T00:17:37.861749Z", + "shell.execute_reply": "2026-05-10T00:17:37.861541Z" } }, "outputs": [ @@ -269,7 +269,7 @@ }, { "cell_type": "markdown", - "id": "d6258552", + "id": "bb4d7ef5", "metadata": {}, "source": [ "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000). That looks alarming but is correct here: under perfectly linear dose-response with very heterogeneous doses (Uniform[\\$0.01K, \\$50K]) and 60 sorted-by-dose units, the differencing variance `sigma2_diff` (which captures the squared gap between adjacent-by-dose units' `dy` values) is much larger than the OLS residual variance `sigma2_lin`. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. (For a different way to look at this same test, see the Yatchew side panel later in the notebook.)\n" @@ -277,7 +277,7 @@ }, { "cell_type": "markdown", - "id": "9b25fada", + "id": "0bb3c4e3", "metadata": {}, "source": [ "## 4. 
Step 2: Upgrade to the Event-Study Workflow\n", @@ -296,13 +296,13 @@ { "cell_type": "code", "execution_count": 4, - "id": "6dd7f0f3", + "id": "e4903c58", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:46:45.907227Z", - "iopub.status.busy": "2026-05-09T23:46:45.907141Z", - "iopub.status.idle": "2026-05-09T23:46:46.040067Z", - "shell.execute_reply": "2026-05-09T23:46:46.039690Z" + "iopub.execute_input": "2026-05-10T00:17:37.862773Z", + "iopub.status.busy": "2026-05-10T00:17:37.862692Z", + "iopub.status.idle": "2026-05-10T00:17:37.988346Z", + "shell.execute_reply": "2026-05-10T00:17:37.988066Z" } }, "outputs": [ @@ -342,7 +342,7 @@ }, { "cell_type": "markdown", - "id": "5f12d7aa", + "id": "b820c289", "metadata": {}, "source": [ "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", @@ -355,13 +355,13 @@ { "cell_type": "code", "execution_count": 5, - "id": "cfaa750b", + "id": "cd1d2dde", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:46:46.041790Z", - "iopub.status.busy": "2026-05-09T23:46:46.041665Z", - "iopub.status.idle": "2026-05-09T23:46:46.043716Z", - "shell.execute_reply": "2026-05-09T23:46:46.043421Z" + "iopub.execute_input": "2026-05-10T00:17:37.989443Z", + "iopub.status.busy": "2026-05-10T00:17:37.989364Z", + "iopub.status.idle": "2026-05-10T00:17:37.991250Z", + "shell.execute_reply": "2026-05-10T00:17:37.990991Z" } }, "outputs": [ @@ -434,7 +434,7 @@ }, { "cell_type": "markdown", - "id": "b95cbac1", + "id": "072e39d8", "metadata": {}, "source": [ "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold. The test does not reject at alpha = 0.05, but the near-threshold p-value warrants scrutiny - the diagnostic is not failing in a clearly-far-from-rejection regime. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n", @@ -446,7 +446,7 @@ }, { "cell_type": "markdown", - "id": "c0d6ddbb", + "id": "bba51a15", "metadata": {}, "source": [ "## 5. Side Panel: Yatchew-HR Null Modes\n", @@ -462,13 +462,13 @@ { "cell_type": "code", "execution_count": 6, - "id": "c231b096", + "id": "d0d4807d", "metadata": { "execution": { - "iopub.execute_input": "2026-05-09T23:46:46.045080Z", - "iopub.status.busy": "2026-05-09T23:46:46.044960Z", - "iopub.status.idle": "2026-05-09T23:46:46.050811Z", - "shell.execute_reply": "2026-05-09T23:46:46.050511Z" + "iopub.execute_input": "2026-05-10T00:17:37.992213Z", + "iopub.status.busy": "2026-05-10T00:17:37.992138Z", + "iopub.status.idle": "2026-05-10T00:17:37.996905Z", + "shell.execute_reply": "2026-05-10T00:17:37.996663Z" } }, "outputs": [ @@ -530,7 +530,7 @@ }, { "cell_type": "markdown", - "id": "f0a34622", + "id": "45254f92", "metadata": {}, "source": [ "**Reading the side-panel comparison.**\n", @@ -545,7 +545,7 @@ }, { "cell_type": "markdown", - "id": "fa7bcd99", + "id": "6bdc1f7d", "metadata": {}, "source": [ "## 6. 
Communicating the Diagnostics to Leadership\n", @@ -565,12 +565,12 @@ }, { "cell_type": "markdown", - "id": "0126ef99", + "id": "d866da6c", "metadata": {}, "source": [ "## 7. Extensions\n", "\n", - "This tutorial covered the composite pre-test workflow on a single Design 1 panel. A few directions we did not exercise here:\n", + "This tutorial covered the composite pre-test workflow on a single panel where QUG led the workflow to select the `continuous_at_zero` (Design 1) identification path. A few directions we did not exercise here:\n", "\n", "- **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). A follow-up tutorial covers this path end-to-end.\n", "- **`trends_lin=True` (Pierce-Schott Eq 17 / 18 detrending)** - mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Forwards into both joint pre-trends and joint homogeneity wrappers; consumes the placebo at `base_period - 1` and skips Step 2 if no earlier placebo survives the drop. Useful when you suspect linear time trends correlated with dose but want to keep the joint-Stute machinery.\n", @@ -587,7 +587,7 @@ }, { "cell_type": "markdown", - "id": "cad9c1d7", + "id": "0105ae26", "metadata": {}, "source": [ "## 8. Summary Checklist\n", diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py index bcc599fd..1b2fb58e 100644 --- a/tests/test_t21_had_pretest_workflow_drift.py +++ b/tests/test_t21_had_pretest_workflow_drift.py @@ -270,7 +270,8 @@ def test_event_study_homogeneity_horizons_correct(event_study_report): def test_event_study_pretrends_fails_to_reject(event_study_report): """Section 4 narrative quotes the pre-trends p-value as 'close to - alpha = 0.05 but conclusive' (~0.07 from numbers.json). Use binary + alpha = 0.05 ... warrants scrutiny' (~0.07 from numbers.json) - + non-rejection is not a clean pass at this margin. Use binary fail-to-reject + a wide abs tolerance band - bootstrap p-values near alpha are the most sensitive to RNG path differences.""" pj = event_study_report.pretrends_joint From 5640829b72ab1e2821f3015d4225723ca577a06e Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 20:30:48 -0400 Subject: [PATCH 05/12] Address PR #409 R1 review (1 P1, 1 P2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 — Decouple QUG from HAD's `design="auto"` selector across all surfaces. The two are independent rules: QUG is a statistical pre-test on `H0: d_lower = 0`; `_detect_design()` is a min/median heuristic (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`). On T21's panel both checks point to the same identification path but the mechanisms are independent — `_detect_design()` does not consume the QUG p-value. Reword tutorial Section 2 + Section 3, CHANGELOG entry, and drift-test docstrings to reflect this. Add `test_had_design_auto_lands_on_continuous_at_zero`: explicitly fits `HAD(design="auto")` on the two-period panel and asserts `design == "continuous_at_zero"` and `target_parameter == "WAS"`, locking the prose claim independently of the QUG-test pins. 
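Illustration of the independence claim, as a hedged sketch rather than the `had.py` implementations (the helper names below are hypothetical; the decision rules and constants are exactly the ones quoted above and in tutorial Section 3):

    import numpy as np

    def detect_design_sketch(d: np.ndarray) -> str:
        # min/median heuristic: a pure support-shape check, no p-value input.
        if d.min() < 0.01 * np.median(np.abs(d)):
            return "continuous_at_zero"
        return "continuous_near_d_lower"

    def qug_fails_to_reject_sketch(d: np.ndarray, alpha: float = 0.05) -> bool:
        # QUG statistic T = D_(1) / (D_(2) - D_(1)) against the critical
        # value 1/alpha - 1 (= 19 at alpha = 0.05). Assumes at least two
        # distinct positive doses so the gap D_(2) - D_(1) is nonzero.
        d1, d2 = np.sort(np.unique(d))[:2]
        return d1 / (d2 - d1) <= 1.0 / alpha - 1.0

    # Neither function consults the other; on T21's panel both happen to
    # point to Design 1 (D_(1) / median(D) ~ 0.007 < 0.01, and T = 3.86 < 19).

Because the selector keys off support shape alone, the new test pins its decision separately from the QUG statistic pins.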
P2 — Update REGISTRY.md to mark T21 shipped (PR #409); leave T22 row queued. All 16 drift tests pass on both backends; notebook executes cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 2 +- docs/methodology/REGISTRY.md | 3 +- docs/tutorials/21_had_pretest_workflow.ipynb | 94 ++++++++++---------- tests/test_t21_had_pretest_workflow_drift.py | 49 +++++++--- 4 files changed, 89 insertions(+), 59 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index d36b2dd9..f9d082c3 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. Uses a 60-DMA × 8-week panel close in shape to T20's but with the dose distribution drawn from `Uniform[$0.01K, $50K]` (vs T20's `[$5K, $50K]`); the true support is strictly positive but very near zero, chosen so the QUG step in `did_had_pretest_workflow` fails-to-reject `H0: d_lower = 0` in this finite sample and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. (HAD's `design="auto"` rule then selects the `continuous_at_zero` identification path with target `WAS` based on the QUG outcome — a workflow decision following the test result, not a property of the true DGP support.) Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as not run because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because all three testable diagnostics (QUG + joint pre-trends Stute over 3 horizons + joint homogeneity Stute over 4 horizons) fail-to-reject — non-rejection evidence under finite-sample power and test specification, not proof that the identifying assumptions hold; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (15 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, and bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR. +- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. 
Uses a 60-DMA × 8-week panel close in shape to T20's but with the dose distribution drawn from `Uniform[$0.01K, $50K]` (vs T20's `[$5K, $50K]`); the true support is strictly positive but very near zero, chosen so the QUG step in `did_had_pretest_workflow` fails-to-reject `H0: d_lower = 0` in this finite sample and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. (HAD's `design="auto"` selector — a separate min/median heuristic at `had.py::_detect_design`, NOT the QUG p-value — independently lands on the `continuous_at_zero` identification path with target `WAS` on this panel because `d.min() < 0.01 * median(|d|)`. The QUG test and the design selector are independent rules that point to the same identification path here.) Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as not run because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because all three testable diagnostics (QUG + joint pre-trends Stute over 3 horizons + joint homogeneity Stute over 4 horizons) fail-to-reject — non-rejection evidence under finite-sample power and test specification, not proof that the identifying assumptions hold; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (15 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, and bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR. - **`ChaisemartinDHaultfoeuille.by_path` and `paths_of_interest` now compose with `survey_design`** for analytical Binder TSL SE and replicate-weight bootstrap variance. The `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1233-1239` is replaced by a per-path multiplier-bootstrap-only gate (`survey_design + n_bootstrap > 0` under by_path / paths_of_interest still raises, since the survey-aware perturbation pivot for path-restricted IFs is methodologically underived). Per-path SE routes through the existing `_survey_se_from_group_if` cell-period allocator: the per-period IF (`U_pp_l_path`) is built with non-path switcher-side contributions skipped (control contributions are unchanged, matching the joiners/leavers IF convention; preserves the row-sum identity `U_pp.sum(axis=1) == U`), cohort-recentered via `_cohort_recenter_per_period`, then expanded to observations as `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})`. Replicate-weight designs unconditionally use the cell allocator (Class A contract from PR #323). 
New `_refresh_path_inference` helper post-call refreshes `safe_inference` on every populated entry across `multi_horizon_inference`, `placebo_horizon_inference`, `path_effects`, and `path_placebos` so all four surfaces use the same final `df_survey` after per-path replicate fits append `n_valid` to the shared accumulator. Path-enumeration ranking under `survey_design` remains unweighted (group-cardinality, not population-weight mass). Lonely-PSU policy stays sample-wide, not per-path. Telescope invariant: on a single-path panel, per-path SE matches the global non-by_path survey SE bit-exactly. **No R parity** — R `did_multiplegt_dyn` does not support survey weighting; this is a Python-only methodology extension. The global non-by_path TSL multiplier-bootstrap path is unaffected (anti-regression test `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical::test_global_survey_plus_n_bootstrap_still_works` locks the per-path-only scope of the new gate). Cross-surface invariants regression-tested at `TestByPathSurveyDesignAnalytical` (~17 tests across gate / dispatch / analytical SE / replicate-weight SE / per-path placebos / `trends_linear` composition / unobserved-path warnings / final-df refresh regressions) and `TestByPathSurveyDesignTelescope`. See `docs/methodology/REGISTRY.md` §`ChaisemartinDHaultfoeuille` `Note (Phase 3 by_path ...)` → "Per-path survey-design SE" for the full contract. - **Inference-field aliases on staggered result classes** for adapter / external-consumer compatibility. Read-only `@property` aliases expose the flat `att` / `se` / `conf_int` / `p_value` / `t_stat` names (matching `DiDResults` / `TROPResults` / `SyntheticDiDResults` / `HeterogeneousAdoptionDiDResults`) on every result class that previously only carried prefixed canonical fields: `CallawaySantAnnaResults`, `StackedDiDResults`, `EfficientDiDResults`, `ChaisemartinDHaultfoeuilleResults`, `StaggeredTripleDiffResults`, `WooldridgeDiDResults`, `SunAbrahamResults`, `ImputationDiDResults`, `TwoStageDiDResults` (mapping to `overall_*`); `ContinuousDiDResults` (mapping to `overall_att_*`, ATT-side as the headline, ACRT-side accessible unchanged via `overall_acrt_*`); `MultiPeriodDiDResults` (mapping to `avg_*`). `ContinuousDiDResults` additionally exposes `overall_se` / `overall_conf_int` / `overall_p_value` / `overall_t_stat` aliases for naming consistency with the rest of the staggered family. Aliases are pure read-throughs over the canonical fields — no recomputation, no behavior change — so the `safe_inference()` joint-NaN contract (per CLAUDE.md "Inference computation") is inherited automatically (NaN canonical → NaN alias, locked at `tests/test_result_aliases.py::test_pattern_b_aliases_propagate_nan`). The native `overall_*` / `overall_att_*` / `avg_*` fields remain canonical for documentation and computation. Motivated by the `balance.interop.diff_diff.as_balance_diagnostic()` adapter (`facebookresearch/balance` PR #465) which calls `getattr(res, "se", None)` / `getattr(res, "conf_int", None)` without a fallback chain — pre-alias, every staggered result class returned `None` on those keys, silently dropping `se` and `conf_int` from the adapter's diagnostic dict. 23 alias-mechanic + balance-adapter regression tests at `tests/test_result_aliases.py`. Patch-level (additive on stable surfaces). - **`ChaisemartinDHaultfoeuille.by_path` + non-binary integer treatment** — `by_path=k` now accepts integer-coded discrete treatment (D in Z, e.g. 
ordinal `{0, 1, 2}`); path tuples become integer-state tuples like `(0, 2, 2, 2)`. The previous `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1870` is replaced by a `ValueError` for continuous D (e.g. `D=1.5`) at fit-time per the no-silent-failures contract — the existing `int(round(float(v)))` cast in `_enumerate_treatment_paths` is now defensive (no-op for integer-coded D). Validated against R `did_multiplegt_dyn(..., by_path)` for D in `{0, 1, 2}` via the new `multi_path_reversible_by_path_non_binary` golden-value scenario (78 switchers, 3 paths, single-baseline custom DGP, F_g >= 4): per-path point estimates match R bit-exactly (rtol ~1e-9 on event horizons; rtol+atol envelope for placebo near-zero values), per-path SE inherits the documented cross-path cohort-sharing deviation (~5% rtol observed; SE_RTOL=0.15 envelope). **Deviation from R for D >= 10:** R's `did_multiplegt_by_path` derives the per-path baseline via `path_index$baseline_XX <- substr(path_index$path, 1, 1)`, which captures only the first character of the comma-separated path string (e.g. for `path = "12,12,..."` it captures `"1"` instead of `"12"`); this mis-allocates R's per-path control-pool subset for D >= 10. Python's tuple-key matching is correct in this regime — the per-path point estimates we compute are correct; R's per-path subset for the same path is buggy. The shipped parity scenario stays in `D in {0, 1, 2}` to avoid the R bug. R-parity test at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathNonBinary`; cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary`. diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 353d2fb0..92bf9359 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2553,7 +2553,8 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - [x] Phase 5 (wave 1, PR #402): `practitioner_next_steps()` integration for HAD results - `_handle_had` and `_handle_had_event_study` route both result classes through HAD-specific Baker et al. (2025) step guidance with bidirectional HAD ↔ ContinuousDiD Step-4 routing closure. The `_check_nan_att` helper extends to ndarray `att` (HAD event-study) via `np.all(np.isnan(arr))` semantics; scalar path bit-exact preserved. - [x] Phase 5 (wave 1, PR #402): `llms-full.txt` HeterogeneousAdoptionDiD section + result-class blocks + `## HAD Pretests` index + Choosing-an-Estimator row landed; constructor / fit() signatures match the real API (regression-tested via `inspect.signature`); result-class field tables enumerate every public dataclass field (regression-tested via `dataclasses.fields()`); `llms-practitioner.txt` Step 4 decision tree distinguishes ContinuousDiD (per-dose ATT(d), needs never-treated) from HeterogeneousAdoptionDiD (WAS, universal-rollout-compatible). - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh. -- [ ] Phase 5 (remaining): T21 HAD pretest workflow tutorial + T22 weighted/survey HAD tutorial - tracked in `TODO.md`. +- [x] Phase 5 (wave 2 first slice, PR #409): T21 HAD pretest workflow tutorial (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `did_had_pretest_workflow`. 
Uses a `Uniform[$0.01K, $50K]` dose-distribution variant of T20's brand-campaign panel (true support strictly positive but near-zero, chosen so QUG fails-to-reject `H0: d_lower = 0` in finite sample). Walks through `aggregate="overall"` (Steps 1 + 3 only, verdict explicitly flags Step 2 deferral) and upgrades to `aggregate="event_study"` (joint pre-trends Stute + joint homogeneity Stute close the gap). Side panel exercises both `yatchew_hr_test` null modes (`linearity` vs `mean_independence`). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors, deterministic stats, bootstrap p-value tolerance bands per backend, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). +- [ ] Phase 5 (remaining): T22 weighted/survey HAD tutorial - tracked in `TODO.md`. - [ ] Documentation of non-testability of Assumptions 5 and 6. - [ ] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`). - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work). diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb index 8516b400..4c524678 100644 --- a/docs/tutorials/21_had_pretest_workflow.ipynb +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "2c409551", + "id": "dbc87841", "metadata": {}, "source": [ "# Tutorial 21: HAD Pre-test Workflow - Running the Pre-test Diagnostics on the Brand Campaign Panel\n", @@ -14,7 +14,7 @@ }, { "cell_type": "markdown", - "id": "1ccaad91", + "id": "b86031cc", "metadata": {}, "source": [ "## 1. The Pre-test Battery\n", @@ -31,24 +31,24 @@ }, { "cell_type": "markdown", - "id": "c110fe0e", + "id": "77271a27", "metadata": {}, "source": [ "## 2. The Panel\n", "\n", - "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. With the true `D_(1)` close to zero, the QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1) identification path even though the true simulation lower bound is positive. HAD's `design=\"auto\"` detection follows the same QUG decision rule and will land on `continuous_at_zero` with target `WAS` (rather than T20's `continuous_near_d_lower` / `WAS_d_lower`). The point of this tutorial is not to assert that the data is Design 1 from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.\n" + "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. 
(a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1) identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design=\"auto\"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is approximately 0.007 < 0.01. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data are Design 1 from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.\n" ] }, { "cell_type": "code", "execution_count": 1, - "id": "05269d1c", + "id": "7caf8d51", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:17:36.394301Z", - "iopub.status.busy": "2026-05-10T00:17:36.394076Z", - "iopub.status.idle": "2026-05-10T00:17:37.818650Z", - "shell.execute_reply": "2026-05-10T00:17:37.818348Z" + "iopub.execute_input": "2026-05-10T00:30:19.135449Z", + "iopub.status.busy": "2026-05-10T00:30:19.135339Z", + "iopub.status.idle": "2026-05-10T00:30:20.049712Z", + "shell.execute_reply": "2026-05-10T00:30:20.049354Z" } }, "outputs": [ @@ -116,7 +116,7 @@ }, { "cell_type": "markdown", - "id": "91811549", + "id": "0cfbc36d", "metadata": {}, "source": [ "## 3. Step 1: The Overall Workflow (Two-Period Path)\n", @@ -129,13 +129,13 @@ { "cell_type": "code", "execution_count": 2, - "id": "cbda5c0c", + "id": "7adfe57b", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:17:37.819909Z", - "iopub.status.busy": "2026-05-10T00:17:37.819802Z", - "iopub.status.idle": "2026-05-10T00:17:37.858844Z", - "shell.execute_reply": "2026-05-10T00:17:37.858574Z" + "iopub.execute_input": "2026-05-10T00:30:20.051231Z", + "iopub.status.busy": "2026-05-10T00:30:20.051081Z", + "iopub.status.idle": "2026-05-10T00:30:20.088717Z", + "shell.execute_reply": "2026-05-10T00:30:20.088451Z" } }, "outputs": [ @@ -188,12 +188,12 @@ }, { "cell_type": "markdown", - "id": "9452bc09", + "id": "35fc523b", "metadata": {}, "source": [ "**Reading the overall verdict.** Three things to note.\n", "\n", - "- **Step 1 (QUG) fails to reject:** `D_(1)` (the smallest treated dose, ~\\$180 here) is small relative to the gap `D_(2) - D_(1)`, so the test statistic `T = D_(1) / (D_(2) - D_(1))` lands well below its critical value (1/alpha - 1 = 19 at alpha = 0.05). The data are statistically consistent with `d_lower = 0` and the workflow's `design=\"auto\"` rule selects the `continuous_at_zero` (Design 1) identification path with target = `WAS`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. The workflow's choice is the right operational call, but it does not prove the support infimum is exactly zero.)\n", + "- **Step 1 (QUG) fails to reject:** `D_(1)` (the smallest treated dose, ~\\$180 here) is small relative to the gap `D_(2) - D_(1)`, so the test statistic `T = D_(1) / (D_(2) - D_(1))` lands well below its critical value (1/alpha - 1 = 19 at alpha = 0.05). The data are statistically consistent with `d_lower = 0`. 
(Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1, but the QUG test is independent of HAD's `design=\"auto\"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.)\n", "- **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The diagnostics do not flag heterogeneity bias on the dose dimension, so reading the WAS as an average per-dose marginal effect is supported by these tests (subject to finite-sample power).\n", "- **Step 2 (Assumption 7 pre-trends) is not run on this path.** The verdict says so verbatim: `\"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)\"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path).\n", "\n", @@ -203,13 +203,13 @@ { "cell_type": "code", "execution_count": 3, - "id": "7dca161a", + "id": "8fdde5b0", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:17:37.860034Z", - "iopub.status.busy": "2026-05-10T00:17:37.859953Z", - "iopub.status.idle": "2026-05-10T00:17:37.861749Z", - "shell.execute_reply": "2026-05-10T00:17:37.861541Z" + "iopub.execute_input": "2026-05-10T00:30:20.089866Z", + "iopub.status.busy": "2026-05-10T00:30:20.089788Z", + "iopub.status.idle": "2026-05-10T00:30:20.091617Z", + "shell.execute_reply": "2026-05-10T00:30:20.091398Z" } }, "outputs": [ @@ -269,7 +269,7 @@ }, { "cell_type": "markdown", - "id": "bb4d7ef5", + "id": "de8e9431", "metadata": {}, "source": [ "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000). That looks alarming but is correct here: under perfectly linear dose-response with very heterogeneous doses (Uniform[\\$0.01K, \\$50K]) and 60 sorted-by-dose units, the differencing variance `sigma2_diff` (which captures the squared gap between adjacent-by-dose units' `dy` values) is much larger than the OLS residual variance `sigma2_lin`. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. (For a different way to look at this same test, see the Yatchew side panel later in the notebook.)\n" @@ -277,7 +277,7 @@ }, { "cell_type": "markdown", - "id": "0bb3c4e3", + "id": "36d5f1fa", "metadata": {}, "source": [ "## 4. 
Step 2: Upgrade to the Event-Study Workflow\n", @@ -296,13 +296,13 @@ { "cell_type": "code", "execution_count": 4, - "id": "e4903c58", + "id": "a7afe7aa", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:17:37.862773Z", - "iopub.status.busy": "2026-05-10T00:17:37.862692Z", - "iopub.status.idle": "2026-05-10T00:17:37.988346Z", - "shell.execute_reply": "2026-05-10T00:17:37.988066Z" + "iopub.execute_input": "2026-05-10T00:30:20.092599Z", + "iopub.status.busy": "2026-05-10T00:30:20.092525Z", + "iopub.status.idle": "2026-05-10T00:30:20.216050Z", + "shell.execute_reply": "2026-05-10T00:30:20.215723Z" } }, "outputs": [ @@ -342,7 +342,7 @@ }, { "cell_type": "markdown", - "id": "b820c289", + "id": "55ffb1d9", "metadata": {}, "source": [ "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", @@ -355,13 +355,13 @@ { "cell_type": "code", "execution_count": 5, - "id": "cd1d2dde", + "id": "97cea2be", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:17:37.989443Z", - "iopub.status.busy": "2026-05-10T00:17:37.989364Z", - "iopub.status.idle": "2026-05-10T00:17:37.991250Z", - "shell.execute_reply": "2026-05-10T00:17:37.990991Z" + "iopub.execute_input": "2026-05-10T00:30:20.217472Z", + "iopub.status.busy": "2026-05-10T00:30:20.217250Z", + "iopub.status.idle": "2026-05-10T00:30:20.219451Z", + "shell.execute_reply": "2026-05-10T00:30:20.219194Z" } }, "outputs": [ @@ -434,7 +434,7 @@ }, { "cell_type": "markdown", - "id": "072e39d8", + "id": "751f7f47", "metadata": {}, "source": [ "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold. The test does not reject at alpha = 0.05, but the near-threshold p-value warrants scrutiny - the diagnostic is not failing in a clearly-far-from-rejection regime. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n", @@ -446,7 +446,7 @@ }, { "cell_type": "markdown", - "id": "bba51a15", + "id": "358289b4", "metadata": {}, "source": [ "## 5. Side Panel: Yatchew-HR Null Modes\n", @@ -462,13 +462,13 @@ { "cell_type": "code", "execution_count": 6, - "id": "d0d4807d", + "id": "3560951e", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:17:37.992213Z", - "iopub.status.busy": "2026-05-10T00:17:37.992138Z", - "iopub.status.idle": "2026-05-10T00:17:37.996905Z", - "shell.execute_reply": "2026-05-10T00:17:37.996663Z" + "iopub.execute_input": "2026-05-10T00:30:20.220589Z", + "iopub.status.busy": "2026-05-10T00:30:20.220509Z", + "iopub.status.idle": "2026-05-10T00:30:20.225290Z", + "shell.execute_reply": "2026-05-10T00:30:20.225053Z" } }, "outputs": [ @@ -530,7 +530,7 @@ }, { "cell_type": "markdown", - "id": "45254f92", + "id": "c27a639c", "metadata": {}, "source": [ "**Reading the side-panel comparison.**\n", @@ -545,7 +545,7 @@ }, { "cell_type": "markdown", - "id": "6bdc1f7d", + "id": "14a0f2a8", "metadata": {}, "source": [ "## 6. 
Communicating the Diagnostics to Leadership\n", @@ -554,7 +554,7 @@ "\n", "> **The HAD pre-test diagnostics on the brand-campaign panel do not flag a violation of the testable identifying assumptions.**\n", ">\n", - "> - **Step 1 (QUG support-infimum, paper Theorem 4):** the test does not reject `H0: d_lower = 0` (p approximately 0.21). The data are statistically consistent with a dose distribution starting at zero, so the library's `design=\"auto\"` selects the `continuous_at_zero` design and reports the WAS (Weighted Average Slope). This is a workflow decision based on the QUG outcome; failing to reject is not proof that the true support is exactly at zero.\n", + "> - **Step 1 (QUG support-infimum, paper Theorem 4):** the test does not reject `H0: d_lower = 0` (p approximately 0.21). The data are statistically consistent with a dose distribution starting at zero. Independently of QUG, HAD's `design=\"auto\"` selector applies a min/median heuristic to the post-period dose vector and lands on the `continuous_at_zero` design (target `WAS`) on this panel; QUG and the design selector are separate rules that point to the same identification path here. Failing to reject the QUG null is not proof that the true support is exactly at zero, and the design selector's choice is operational, not statistical.\n", "> - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test does not reject (joint p approximately 0.07 across the three pre-period horizons). The p-value is close to alpha = 0.05, so the non-rejection here is not by a wide margin - in a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending.\n", "> - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity does not reject (joint p approximately 0.76 across the four post-launch horizons). The diagnostic does not flag heterogeneity bias on the dose dimension under the test's specification.\n", ">\n", @@ -565,7 +565,7 @@ }, { "cell_type": "markdown", - "id": "d866da6c", + "id": "d4a8f110", "metadata": {}, "source": [ "## 7. Extensions\n", @@ -587,7 +587,7 @@ }, { "cell_type": "markdown", - "id": "0105ae26", + "id": "61c56f0e", "metadata": {}, "source": [ "## 8. Summary Checklist\n", diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py index 1b2fb58e..318c05c0 100644 --- a/tests/test_t21_had_pretest_workflow_drift.py +++ b/tests/test_t21_had_pretest_workflow_drift.py @@ -15,12 +15,16 @@ T21 DGP differs from T20: dose distribution is `Uniform[$0.01K, $50K]` (was `[$5K, $50K]` in T20). The true support is strictly positive but very -near zero, chosen so the QUG step fails-to-reject `H0: d_lower = 0` in this -finite sample. That QUG outcome lets the workflow's `design="auto"` rule -land on `continuous_at_zero` (a workflow decision based on the test, not a -property of the true DGP), which in turn populates the verdict text with -the load-bearing "Assumption 7 deferred" substring used for the upgrade-arc -narrative. DGP and seed locked at `_scratch/t21_pretests/10_panel.py`. +near zero. 
Two independent things follow from that small `D_(1)` and are +exercised in this drift file: (a) the QUG step fails-to-reject +`H0: d_lower = 0` in this finite sample, populating the workflow's verdict +with the "Assumption 7 deferred" substring used for the upgrade-arc +narrative; and (b) HAD's `design="auto"` selector - a separate min/median +heuristic that does NOT consume the QUG p-value - independently lands on +`continuous_at_zero` because `d.min() < 0.01 * median(|d|)` (per +`_detect_design()` in `had.py`). Both checks point to the same +identification path on this panel, but the rules are independent. +DGP and seed locked at `_scratch/t21_pretests/10_panel.py`. Quoted numbers derived from `_scratch/t21_pretests/50_compose_narrative.py`. Bootstrap p-value pins use **abs tolerance bands >= 0.15** per @@ -32,10 +36,12 @@ from __future__ import annotations +import warnings + import numpy as np import pytest -from diff_diff import did_had_pretest_workflow, generate_continuous_did_data, yatchew_hr_test +from diff_diff import HAD, did_had_pretest_workflow, generate_continuous_did_data, yatchew_hr_test # Locked T21 DGP parameters (must stay in sync with the notebook). MAIN_SEED = 87 @@ -187,9 +193,10 @@ def test_overall_path_structural_anchors(overall_report): def test_overall_qug_fails_to_reject(overall_report): """Section 3 narrative claims QUG fails-to-reject H0: d_lower = 0 - (data are statistically consistent with continuous_at_zero design; - HAD's design="auto" rule selects that path on this QUG outcome). - QUG is fully deterministic; pin exact rounded values.""" + (data are statistically consistent with continuous_at_zero design). + QUG is fully deterministic; pin exact rounded values. The independent + HAD `design="auto"` selector decision is locked separately by + `test_had_design_auto_lands_on_continuous_at_zero`.""" assert overall_report.qug.reject is False # T statistic = D_(1) / (D_(2) - D_(1)) is fully deterministic. assert round(overall_report.qug.t_stat, 2) == 3.86, overall_report.qug.t_stat @@ -292,6 +299,28 @@ def test_event_study_homogeneity_fails_to_reject(event_study_report): assert hj.p_value > 0.50, hj.p_value +def test_had_design_auto_lands_on_continuous_at_zero(two_period): + """Section 2 narrative claims HAD's `design="auto"` selector + independently lands on `continuous_at_zero` (target = `WAS`) on + this panel because `d.min() < 0.01 * median(|d|)`. This is a + separate decision rule from the QUG test (locked by + `test_overall_qug_fails_to_reject`); the two happen to agree on + this panel but the rules are independent. 
We fit HAD with + `design="auto"` here just to verify the prose.""" + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=UserWarning) + est = HAD(design="auto") + result = est.fit( + two_period, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="period", + unit_col="dma_id", + ) + assert result.design == "continuous_at_zero", result.design + assert result.target_parameter == "WAS", result.target_parameter + + def test_yatchew_side_panel_linearity_passes(yatchew_side_panel_inputs): """Section 5 (Yatchew side panel) narrative claims `null="linearity"` does not reject on the within-pre-period first-difference paired From 360a2c7871cf2df7329350f4c1b25df2cb007bb7 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 20:49:02 -0400 Subject: [PATCH 06/12] Address PR #409 R2 review (P2 D1, 3 stale doc items) - REGISTRY.md L2509: practitioner_next_steps + T21 tutorial were marked "queued for Phase 5"; both now landed (PR #402 + PR #409). Update to reflect actual status; T22 remains queued. - CHANGELOG.md L11 (T21 entry): drift-test count was "15 tests"; now 16 (after the new test_had_design_auto_lands_on_continuous_at_zero added in R1). - CHANGELOG.md L15 (PR #402 entry, retroactive): said "T21 pretest tutorial and T22 weighted/survey tutorial remain queued"; T21 has since landed in PR #409. Update to reflect that. No methodology change; no test surface changes. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 4 ++-- docs/methodology/REGISTRY.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index f9d082c3..2c413ba8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,12 +8,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. Uses a 60-DMA × 8-week panel close in shape to T20's but with the dose distribution drawn from `Uniform[$0.01K, $50K]` (vs T20's `[$5K, $50K]`); the true support is strictly positive but very near zero, chosen so the QUG step in `did_had_pretest_workflow` fails-to-reject `H0: d_lower = 0` in this finite sample and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. (HAD's `design="auto"` selector — a separate min/median heuristic at `had.py::_detect_design`, NOT the QUG p-value — independently lands on the `continuous_at_zero` identification path with target `WAS` on this panel because `d.min() < 0.01 * median(|d|)`. The QUG test and the design selector are independent rules that point to the same identification path here.) 
Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as not run because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because all three testable diagnostics (QUG + joint pre-trends Stute over 3 horizons + joint homogeneity Stute over 4 horizons) fail-to-reject — non-rejection evidence under finite-sample power and test specification, not proof that the identifying assumptions hold; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (15 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, and bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR. +- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. Uses a 60-DMA × 8-week panel close in shape to T20's but with the dose distribution drawn from `Uniform[$0.01K, $50K]` (vs T20's `[$5K, $50K]`); the true support is strictly positive but very near zero, chosen so the QUG step in `did_had_pretest_workflow` fails-to-reject `H0: d_lower = 0` in this finite sample and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. (HAD's `design="auto"` selector — a separate min/median heuristic at `had.py::_detect_design`, NOT the QUG p-value — independently lands on the `continuous_at_zero` identification path with target `WAS` on this panel because `d.min() < 0.01 * median(|d|)`. The QUG test and the design selector are independent rules that point to the same identification path here.) 
Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as not run because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because all three testable diagnostics (QUG + joint pre-trends Stute over 3 horizons + joint homogeneity Stute over 4 horizons) fail-to-reject — non-rejection evidence under finite-sample power and test specification, not proof that the identifying assumptions hold; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR. - **`ChaisemartinDHaultfoeuille.by_path` and `paths_of_interest` now compose with `survey_design`** for analytical Binder TSL SE and replicate-weight bootstrap variance. The `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1233-1239` is replaced by a per-path multiplier-bootstrap-only gate (`survey_design + n_bootstrap > 0` under by_path / paths_of_interest still raises, since the survey-aware perturbation pivot for path-restricted IFs is methodologically underived). Per-path SE routes through the existing `_survey_se_from_group_if` cell-period allocator: the per-period IF (`U_pp_l_path`) is built with non-path switcher-side contributions skipped (control contributions are unchanged, matching the joiners/leavers IF convention; preserves the row-sum identity `U_pp.sum(axis=1) == U`), cohort-recentered via `_cohort_recenter_per_period`, then expanded to observations as `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})`. Replicate-weight designs unconditionally use the cell allocator (Class A contract from PR #323). New `_refresh_path_inference` helper post-call refreshes `safe_inference` on every populated entry across `multi_horizon_inference`, `placebo_horizon_inference`, `path_effects`, and `path_placebos` so all four surfaces use the same final `df_survey` after per-path replicate fits append `n_valid` to the shared accumulator. Path-enumeration ranking under `survey_design` remains unweighted (group-cardinality, not population-weight mass). Lonely-PSU policy stays sample-wide, not per-path. Telescope invariant: on a single-path panel, per-path SE matches the global non-by_path survey SE bit-exactly. **No R parity** — R `did_multiplegt_dyn` does not support survey weighting; this is a Python-only methodology extension. 
The global non-by_path TSL multiplier-bootstrap path is unaffected (anti-regression test `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical::test_global_survey_plus_n_bootstrap_still_works` locks the per-path-only scope of the new gate). Cross-surface invariants regression-tested at `TestByPathSurveyDesignAnalytical` (~17 tests across gate / dispatch / analytical SE / replicate-weight SE / per-path placebos / `trends_linear` composition / unobserved-path warnings / final-df refresh regressions) and `TestByPathSurveyDesignTelescope`. See `docs/methodology/REGISTRY.md` §`ChaisemartinDHaultfoeuille` `Note (Phase 3 by_path ...)` → "Per-path survey-design SE" for the full contract. - **Inference-field aliases on staggered result classes** for adapter / external-consumer compatibility. Read-only `@property` aliases expose the flat `att` / `se` / `conf_int` / `p_value` / `t_stat` names (matching `DiDResults` / `TROPResults` / `SyntheticDiDResults` / `HeterogeneousAdoptionDiDResults`) on every result class that previously only carried prefixed canonical fields: `CallawaySantAnnaResults`, `StackedDiDResults`, `EfficientDiDResults`, `ChaisemartinDHaultfoeuilleResults`, `StaggeredTripleDiffResults`, `WooldridgeDiDResults`, `SunAbrahamResults`, `ImputationDiDResults`, `TwoStageDiDResults` (mapping to `overall_*`); `ContinuousDiDResults` (mapping to `overall_att_*`, ATT-side as the headline, ACRT-side accessible unchanged via `overall_acrt_*`); `MultiPeriodDiDResults` (mapping to `avg_*`). `ContinuousDiDResults` additionally exposes `overall_se` / `overall_conf_int` / `overall_p_value` / `overall_t_stat` aliases for naming consistency with the rest of the staggered family. Aliases are pure read-throughs over the canonical fields — no recomputation, no behavior change — so the `safe_inference()` joint-NaN contract (per CLAUDE.md "Inference computation") is inherited automatically (NaN canonical → NaN alias, locked at `tests/test_result_aliases.py::test_pattern_b_aliases_propagate_nan`). The native `overall_*` / `overall_att_*` / `avg_*` fields remain canonical for documentation and computation. Motivated by the `balance.interop.diff_diff.as_balance_diagnostic()` adapter (`facebookresearch/balance` PR #465) which calls `getattr(res, "se", None)` / `getattr(res, "conf_int", None)` without a fallback chain — pre-alias, every staggered result class returned `None` on those keys, silently dropping `se` and `conf_int` from the adapter's diagnostic dict. 23 alias-mechanic + balance-adapter regression tests at `tests/test_result_aliases.py`. Patch-level (additive on stable surfaces). - **`ChaisemartinDHaultfoeuille.by_path` + non-binary integer treatment** — `by_path=k` now accepts integer-coded discrete treatment (D in Z, e.g. ordinal `{0, 1, 2}`); path tuples become integer-state tuples like `(0, 2, 2, 2)`. The previous `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1870` is replaced by a `ValueError` for continuous D (e.g. `D=1.5`) at fit-time per the no-silent-failures contract — the existing `int(round(float(v)))` cast in `_enumerate_treatment_paths` is now defensive (no-op for integer-coded D). 
Validated against R `did_multiplegt_dyn(..., by_path)` for D in `{0, 1, 2}` via the new `multi_path_reversible_by_path_non_binary` golden-value scenario (78 switchers, 3 paths, single-baseline custom DGP, F_g >= 4): per-path point estimates match R bit-exactly (rtol ~1e-9 on event horizons; rtol+atol envelope for placebo near-zero values), per-path SE inherits the documented cross-path cohort-sharing deviation (~5% rtol observed; SE_RTOL=0.15 envelope). **Deviation from R for D >= 10:** R's `did_multiplegt_by_path` derives the per-path baseline via `path_index$baseline_XX <- substr(path_index$path, 1, 1)`, which captures only the first character of the comma-separated path string (e.g. for `path = "12,12,..."` it captures `"1"` instead of `"12"`); this mis-allocates R's per-path control-pool subset for D >= 10. Python's tuple-key matching is correct in this regime — the per-path point estimates we compute are correct; R's per-path subset for the same path is buggy. The shipped parity scenario stays in `D in {0, 1, 2}` to avoid the R bug. R-parity test at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathNonBinary`; cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary`. - **New `paths_of_interest` kwarg on `ChaisemartinDHaultfoeuille`** for user-specified treatment-path subsets, alternative to `by_path=k`'s top-k automatic ranking. Mutually exclusive with `by_path`; setting both raises `ValueError` at `__init__` and `set_params` time. Each path tuple must be a list/tuple of `int` of length `L_max + 1` (uniformity validated at `__init__`; length match against `L_max + 1` validated at fit-time); `bool` and `np.bool_` are explicitly rejected, `np.integer` accepted and canonicalized to Python `int` for tuple-key consistency. Duplicates emit a `UserWarning` and are deduplicated; paths not observed in the panel emit a `UserWarning` and are omitted from `path_effects`. Paths appear in `results.path_effects` in the user-specified order, modulo deduplication and unobserved-path filtering. Composes with non-binary D and all downstream `by_path` surfaces (bootstrap, per-path placebos, per-path joint sup-t bands, `controls`, `trends_linear`, `trends_nonparam`) — mechanical filter on observed paths via the same `_enumerate_treatment_paths` call site, no methodology change. **Python-only API extension; no R equivalent** — R's `did_multiplegt_dyn(..., by_path=k)` only accepts a positive int (top-k) or `-1` (all paths). The `by_path` precondition gate at `chaisemartin_dhaultfoeuille.py:1118` (drop_larger_lower / L_max / `heterogeneity` / `design2` / `honest_did` / `survey_design` mutex) and the 11 `self.by_path is not None` activation branches in `fit()` were rerouted to fire under either selector. Validation + behavior + cross-feature regressions at `tests/test_chaisemartin_dhaultfoeuille.py::TestPathsOfInterest`. -- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. 
(2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` signatures match the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API exactly (verified by `inspect.signature`-based regression tests). Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs. +- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. 
(2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` signatures match the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API exactly (verified by `inspect.signature`-based regression tests). Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap. T21 pretest tutorial subsequently landed in PR #409; T22 weighted/survey tutorial remains queued as a separate notebook PR. ## [3.3.2] - 2026-04-26 diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 92bf9359..076632d0 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2506,7 +2506,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - **Note:** Horizon labels in `StuteJointResult.horizon_labels` are `str(t)` verbatim and carry STRING IDENTITY ONLY — NOT a chronological ordering key. Callers who need chronological order must preserve the original period values alongside (e.g. from the `pre_periods` / `post_periods` argument). - **Note:** NaN propagation is explicit: when any horizon has NaN in residuals, `cvm_stat_joint=NaN`, `p_value=NaN`, `reject=False`, AND `per_horizon_stats={label: np.nan for every horizon}` (full dict preserved with NaN values — not empty, not partial). 
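A quick illustration of the label caveat (a sketch, not library code; only standard-library behavior is shown):

```python
# str(t) horizon labels sort lexicographically, not chronologically --
# double-digit periods break the apparent ordering:
labels = [str(t) for t in [8, 9, 10]]
print(sorted(labels))      # ['10', '8', '9'] -- not chronological
print(sorted([8, 9, 10]))  # [8, 9, 10]
# Keep the original pre_periods / post_periods values alongside the labels
# whenever chronological order matters downstream.
```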
-**Phase 3 follow-up delivery:** `stute_joint_pretest()`, `joint_pretrends_test()`, `joint_homogeneity_test()`, `StuteJointResult`, and `did_had_pretest_workflow(aggregate="event_study")` shipped together in PR #353 (2026-04). The `practitioner_next_steps()` integration and tutorial are queued for Phase 5.
+**Phase 3 follow-up delivery:** `stute_joint_pretest()`, `joint_pretrends_test()`, `joint_homogeneity_test()`, `StuteJointResult`, and `did_had_pretest_workflow(aggregate="event_study")` shipped together in PR #353 (2026-04). The `practitioner_next_steps()` HAD handlers landed in Phase 5 wave 1 (PR #402); the T21 HAD pretest workflow tutorial landed in PR #409 (Phase 5 wave 2 first slice). T22 weighted/survey HAD tutorial remains queued.
**Reference implementation(s):**
- R: `did_had` (de Chaisemartin, Ciccia, D'Haultfœuille, Knau 2024a); `stute_test` (2024c); `yatchew_test` (Online Appendix, Table 3).
From 95930da07a081f455402e3f20b1add82fb4963b6 Mon Sep 17 00:00:00 2001
From: igerber
Date: Sat, 9 May 2026 20:53:52 -0400
Subject: [PATCH 07/12] T21 notebook prose fixes (1 P2 + 4 P3 from notebook-aware review)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
P2 — CELL_07 first bullet had a conceptual error in describing the QUG mechanic: "D_(1) is small relative to the gap D_(2)-D_(1)" — actually D_(1) ≈ 0.181 and the gap ≈ 0.047, so D_(1) is 3.86x LARGER than the gap. The reason QUG fails-to-reject is that T = D_(1)/(D_(2)-D_(1)) = 3.86 lands below the critical value 19, NOT because of any "small relative to the gap" relationship. Rewrote to state the test statistic and critical value directly.
P3 polish:
- CELL_03: "approximately 0.007" → "below 0.01" (avoids pinning a seed-dependent statistic in prose; the heuristic threshold itself is what matters).
- CELL_07: added a one-line aside reconciling `all_pass=True` with Step 2 deferral on the overall path: `all_pass` aggregates only the steps that ran on each dispatch, so True here means "of the two steps that ran, neither rejected" — not that Assumption 7 has been cleared.
- CELL_09: explained the very-large-negative `T_hr` ≈ -35,000 as a scale artifact (sigma2_diff scales with the squared dose-step gap; on Uniform[0.01, 50] doses with a true slope of 100, adjacent-by-dose units have dy gaps that swamp sigma2_lin). Adds an explicit forward reference to the side panel where a different input gives T_hr ≈ 0 as a sanity check.
- CELL_17: tightened mean_independence vs linearity framing to "linear fit absorbs any apparent slope (real or sample noise)" — the pre-period has no real signal so the original "absorbs the dose-response signal" wording was off-target on this panel.
No methodology change; all 16 drift tests still pass; nbmake clean.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/tutorials/21_had_pretest_workflow.ipynb | 98 ++++++++++----------
 1 file changed, 50 insertions(+), 48 deletions(-)
diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb index 4c524678..b9154d49 100644 --- a/docs/tutorials/21_had_pretest_workflow.ipynb +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "dbc87841", + "id": "64f6ebc9", "metadata": {}, "source": [ "# Tutorial 21: HAD Pre-test Workflow - Running the Pre-test Diagnostics on the Brand Campaign Panel\n", @@ -14,7 +14,7 @@ }, { "cell_type": "markdown", - "id": "b86031cc", + "id": "91b4bc64", "metadata": {}, "source": [ "## 1. 
The Pre-test Battery\n", @@ -31,24 +31,24 @@ }, { "cell_type": "markdown", - "id": "77271a27", + "id": "c91bfc78", "metadata": {}, "source": [ "## 2. The Panel\n", "\n", - "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1) identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design=\"auto\"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D) is approximately 0.007 < 0.01`. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1 from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.\n" + "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1) identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design=\"auto\"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. 
The point of this tutorial is not to assert that the data is Design 1 from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.\n" ] }, { "cell_type": "code", "execution_count": 1, - "id": "7caf8d51", + "id": "c3ed9b52", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:30:19.135449Z", - "iopub.status.busy": "2026-05-10T00:30:19.135339Z", - "iopub.status.idle": "2026-05-10T00:30:20.049712Z", - "shell.execute_reply": "2026-05-10T00:30:20.049354Z" + "iopub.execute_input": "2026-05-10T00:53:29.812054Z", + "iopub.status.busy": "2026-05-10T00:53:29.811945Z", + "iopub.status.idle": "2026-05-10T00:53:31.283255Z", + "shell.execute_reply": "2026-05-10T00:53:31.282925Z" } }, "outputs": [ @@ -116,7 +116,7 @@ }, { "cell_type": "markdown", - "id": "0cfbc36d", + "id": "840e7738", "metadata": {}, "source": [ "## 3. Step 1: The Overall Workflow (Two-Period Path)\n", @@ -129,13 +129,13 @@ { "cell_type": "code", "execution_count": 2, - "id": "7adfe57b", + "id": "527e4562", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:30:20.051231Z", - "iopub.status.busy": "2026-05-10T00:30:20.051081Z", - "iopub.status.idle": "2026-05-10T00:30:20.088717Z", - "shell.execute_reply": "2026-05-10T00:30:20.088451Z" + "iopub.execute_input": "2026-05-10T00:53:31.284493Z", + "iopub.status.busy": "2026-05-10T00:53:31.284385Z", + "iopub.status.idle": "2026-05-10T00:53:31.323187Z", + "shell.execute_reply": "2026-05-10T00:53:31.322931Z" } }, "outputs": [ @@ -188,28 +188,30 @@ }, { "cell_type": "markdown", - "id": "35fc523b", + "id": "1ff016af", "metadata": {}, "source": [ "**Reading the overall verdict.** Three things to note.\n", "\n", - "- **Step 1 (QUG) fails to reject:** `D_(1)` (the smallest treated dose, ~\\$180 here) is small relative to the gap `D_(2) - D_(1)`, so the test statistic `T = D_(1) / (D_(2) - D_(1))` lands well below its critical value (1/alpha - 1 = 19 at alpha = 0.05). The data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1, but the QUG test is independent of HAD's `design=\"auto\"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.)\n", + "- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1, but the QUG test is independent of HAD's `design=\"auto\"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.)\n", "- **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The diagnostics do not flag heterogeneity bias on the dose dimension, so reading the WAS as an average per-dose marginal effect is supported by these tests (subject to finite-sample power).\n", "- **Step 2 (Assumption 7 pre-trends) is not run on this path.** The verdict says so verbatim: `\"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)\"`. 
With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path).\n", "\n", + "A note on `all_pass = True` here: the workflow's `all_pass` flag aggregates only the steps that actually ran on this dispatch path. On the overall path that is QUG + linearity (Stute / Yatchew); Step 2's deferral is *not* folded into `all_pass`. So `all_pass = True` on the overall path means \"of the two steps that ran, neither rejected\" - it does not mean Assumption 7 has been cleared. The upgrade to event-study below makes this concrete by actually running Step 2.\n", + "\n", "Let's look at each individual test result.\n" ] }, { "cell_type": "code", "execution_count": 3, - "id": "8fdde5b0", + "id": "bcf57147", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:30:20.089866Z", - "iopub.status.busy": "2026-05-10T00:30:20.089788Z", - "iopub.status.idle": "2026-05-10T00:30:20.091617Z", - "shell.execute_reply": "2026-05-10T00:30:20.091398Z" + "iopub.execute_input": "2026-05-10T00:53:31.324354Z", + "iopub.status.busy": "2026-05-10T00:53:31.324272Z", + "iopub.status.idle": "2026-05-10T00:53:31.326037Z", + "shell.execute_reply": "2026-05-10T00:53:31.325825Z" } }, "outputs": [ @@ -269,15 +271,15 @@ }, { "cell_type": "markdown", - "id": "de8e9431", + "id": "ba741d74", "metadata": {}, "source": [ - "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000). That looks alarming but is correct here: under perfectly linear dose-response with very heterogeneous doses (Uniform[\\$0.01K, \\$50K]) and 60 sorted-by-dose units, the differencing variance `sigma2_diff` (which captures the squared gap between adjacent-by-dose units' `dy` values) is much larger than the OLS residual variance `sigma2_lin`. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. (For a different way to look at this same test, see the Yatchew side panel later in the notebook.)\n" + "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000), which looks alarming but is a scale artifact, not pathology. Under the Yatchew construction `sigma2_diff = (1 / 2G) * sum((dy_{(g)} - dy_{(g-1)})^2)` is computed on `dy` sorted by dose `D`. With doses spread over Uniform[\\$0.01K, \\$50K] and a true per-$1K slope of 100 (locked by the DGP), adjacent-by-dose units have `dy` values that differ by roughly `100 * (D_{(g)} - D_{(g-1)})` plus noise — those squared gaps add up to a large `sigma2_diff` (about 6,250 here) by virtue of the dose scale, while the OLS residual variance `sigma2_lin` (about 1.6) reflects only noise around the linear fit. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. 
The side panel later in the notebook constructs a different Yatchew input (within-pre-period first-differences, where the adjacent-by-dose `dy` gaps are not driven by the post-treatment slope) and produces a `T_hr` near zero — a useful sanity check that the test behaves the way it should when the dose dimension genuinely contributes nothing to the variance of `dy`.\n" ] }, { "cell_type": "markdown", - "id": "36d5f1fa", + "id": "850f5fee", "metadata": {}, "source": [ "## 4. Step 2: Upgrade to the Event-Study Workflow\n", @@ -296,13 +298,13 @@ { "cell_type": "code", "execution_count": 4, - "id": "a7afe7aa", + "id": "4ce3929e", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:30:20.092599Z", - "iopub.status.busy": "2026-05-10T00:30:20.092525Z", - "iopub.status.idle": "2026-05-10T00:30:20.216050Z", - "shell.execute_reply": "2026-05-10T00:30:20.215723Z" + "iopub.execute_input": "2026-05-10T00:53:31.326991Z", + "iopub.status.busy": "2026-05-10T00:53:31.326919Z", + "iopub.status.idle": "2026-05-10T00:53:31.449260Z", + "shell.execute_reply": "2026-05-10T00:53:31.448989Z" } }, "outputs": [ @@ -342,7 +344,7 @@ }, { "cell_type": "markdown", - "id": "55ffb1d9", + "id": "da7f50b3", "metadata": {}, "source": [ "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", @@ -355,13 +357,13 @@ { "cell_type": "code", "execution_count": 5, - "id": "97cea2be", + "id": "2daa40d4", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:30:20.217472Z", - "iopub.status.busy": "2026-05-10T00:30:20.217250Z", - "iopub.status.idle": "2026-05-10T00:30:20.219451Z", - "shell.execute_reply": "2026-05-10T00:30:20.219194Z" + "iopub.execute_input": "2026-05-10T00:53:31.450431Z", + "iopub.status.busy": "2026-05-10T00:53:31.450344Z", + "iopub.status.idle": "2026-05-10T00:53:31.452391Z", + "shell.execute_reply": "2026-05-10T00:53:31.452142Z" } }, "outputs": [ @@ -434,7 +436,7 @@ }, { "cell_type": "markdown", - "id": "751f7f47", + "id": "25b7c8a7", "metadata": {}, "source": [ "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold. The test does not reject at alpha = 0.05, but the near-threshold p-value warrants scrutiny - the diagnostic is not failing in a clearly-far-from-rejection regime. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n", @@ -446,7 +448,7 @@ }, { "cell_type": "markdown", - "id": "358289b4", + "id": "fe666646", "metadata": {}, "source": [ "## 5. 
Side Panel: Yatchew-HR Null Modes\n", @@ -462,13 +464,13 @@ { "cell_type": "code", "execution_count": 6, - "id": "3560951e", + "id": "7bc357d6", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:30:20.220589Z", - "iopub.status.busy": "2026-05-10T00:30:20.220509Z", - "iopub.status.idle": "2026-05-10T00:30:20.225290Z", - "shell.execute_reply": "2026-05-10T00:30:20.225053Z" + "iopub.execute_input": "2026-05-10T00:53:31.453469Z", + "iopub.status.busy": "2026-05-10T00:53:31.453395Z", + "iopub.status.idle": "2026-05-10T00:53:31.457975Z", + "shell.execute_reply": "2026-05-10T00:53:31.457762Z" } }, "outputs": [ @@ -530,13 +532,13 @@ }, { "cell_type": "markdown", - "id": "c27a639c", + "id": "02061bc7", "metadata": {}, "source": [ "**Reading the side-panel comparison.**\n", "\n", "- The `linearity` mode fits `dy ~ 1 + d` and computes residual variance `sigma2_lin` from those residuals. Under a clean linear DGP the residuals are small (close to noise variance), the gap `sigma2_lin - sigma2_diff` is near zero, and `T_hr` lands close to zero with a p-value far above alpha.\n", - "- The `mean_independence` mode fits intercept-only `dy ~ 1` and computes `sigma2_lin` as the population variance of `dy`. That residual variance is **strictly larger** than under `linearity` (the linear fit absorbs the dose-response signal that intercept-only does not). The gap `sigma2_lin - sigma2_diff` is then larger and `T_hr` is larger - same asymptotic distribution, stricter null, more easily rejected when the alternative is true.\n", + "- The `mean_independence` mode fits intercept-only `dy ~ 1` and computes `sigma2_lin` as the population variance of `dy`. That residual variance is **strictly larger** than under `linearity` (the linear fit can absorb any apparent slope between `dy` and `d` - real or sample noise - shrinking the residual variance, while intercept-only cannot). The gap `sigma2_lin - sigma2_diff` is then larger and `T_hr` is larger - same asymptotic distribution, stricter null, more easily rejected when the alternative is true.\n", "\n", "On clean linear placebo data both modes fail to reject - exactly what we want. On data where `dY` actually responds to `D` in pre-period (parallel pre-trends fail), `null=\"mean_independence\"` is more sensitive than `null=\"linearity\"` because linearity is a weaker null (linear pre-trends would fail to reject the linearity null but would reject the mean-independence null).\n", "\n", @@ -545,7 +547,7 @@ }, { "cell_type": "markdown", - "id": "14a0f2a8", + "id": "ea2d9d70", "metadata": {}, "source": [ "## 6. Communicating the Diagnostics to Leadership\n", @@ -565,7 +567,7 @@ }, { "cell_type": "markdown", - "id": "d4a8f110", + "id": "504fd1c1", "metadata": {}, "source": [ "## 7. Extensions\n", @@ -587,7 +589,7 @@ }, { "cell_type": "markdown", - "id": "61c56f0e", + "id": "404a00cd", "metadata": {}, "source": [ "## 8. Summary Checklist\n", From 3cf116c790c8a533b2ef93eaee207d4532434647 Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 10 May 2026 10:01:32 -0400 Subject: [PATCH 08/12] Add T21 notebook prose extract for CI AI review (TEMPORARY) The CI AI reviewer's diff-build excludes `docs/tutorials/*.ipynb` (`.github/workflows/ai_pr_review.yml:151-156` + reviewer prompt's DO-NOT list at `.github/codex/prompts/pr_review.md:87-91`), so the actual T21 notebook prose has not been visible to the CI reviewer through three review rounds. 
The notebook content was reviewed once via a standalone notebook-aware Agent (which caught a P2 conceptual error in CELL_07 + 4 P3 polish items, all addressed in `d9ea86a`), but the CI reviewer itself has only seen the adjacent surfaces (CHANGELOG, drift test, README, REGISTRY). This commit lands a one-shot markdown extract at `docs/_review/t21_notebook_extract.md` that mirrors the notebook's full narrative (markdown cells + code cells + executed outputs) so the CI reviewer can audit the prose directly on this PR. Regenerate via `python _scratch/t21_pretests/70_extract_for_review.py` from the notebook source-of-truth at `_scratch/t21_pretests/60_build_notebook.py`. Adds `_review` to Sphinx `exclude_patterns` in `docs/conf.py` so the docs build doesn't pick the file up. A follow-on PR will (a) remove this extract file + the Sphinx exclude_patterns entry and (b) replace the blanket `.ipynb` exclusion in the CI workflow with a markdown-only extraction (jq one-liner) wired into the diff-build itself. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/_review/t21_notebook_extract.md | 448 +++++++++++++++++++++++++++ docs/conf.py | 2 +- 2 files changed, 449 insertions(+), 1 deletion(-) create mode 100644 docs/_review/t21_notebook_extract.md diff --git a/docs/_review/t21_notebook_extract.md b/docs/_review/t21_notebook_extract.md new file mode 100644 index 00000000..4505f5c8 --- /dev/null +++ b/docs/_review/t21_notebook_extract.md @@ -0,0 +1,448 @@ +# T21 Notebook Extract for AI Review (TEMPORARY) + +> **This file is a temporary review aid.** The CI AI reviewer's diff-build +> step excludes `docs/tutorials/*.ipynb` (see +> `.github/workflows/ai_pr_review.yml:151-156` and +> `.github/codex/prompts/pr_review.md:87-91`), so the actual tutorial +> notebook prose is invisible to the CI reviewer. This file mirrors the +> notebook's narrative (markdown + code + executed outputs) so the +> reviewer can audit the tutorial content in PR #409. +> +> A follow-on PR will (a) remove this file and (b) replace the blanket +> `.ipynb` exclusion with a markdown-only extraction wired into the +> workflow itself. Do not edit this file directly — regenerate via +> `python _scratch/t21_pretests/70_extract_for_review.py` from the +> notebook source-of-truth at +> `_scratch/t21_pretests/60_build_notebook.py`. + +--- + + +# Tutorial 21: HAD Pre-test Workflow - Running the Pre-test Diagnostics on the Brand Campaign Panel + +[Tutorial 20](20_had_brand_campaign.ipynb) fit `HeterogeneousAdoptionDiD` (HAD) on a regional brand-campaign panel and reported a per-dollar lift, with a brief visual placebo check at the end. We deliberately deferred the **formal pre-test workflow** to this tutorial, with a forward pointer in T20's "Extensions" section. + +This tutorial picks up where T20 left off. We re-run the brand campaign on a panel close in shape to T20's, then walk through HAD's composite pre-test workflow `did_had_pretest_workflow` and read the diagnostics for paper Section 4.2 of de Chaisemartin, Ciccia, D'Haultfoeuille, & Knau (2026). We start with the two-period (`aggregate="overall"`) workflow, observe that it does not run the parallel pre-trends step, and then **upgrade** to the multi-period (`aggregate="event_study"`) workflow that adds the joint Stute pre-trends and joint homogeneity diagnostics. None of the diagnostics in this tutorial reject; we walk through what that does and does not let us conclude. 
A side panel compares the two `null=` modes of the Yatchew-HR test, including the recently-shipped `null="mean_independence"` mode (R-parity with `YatchewTest::yatchew_test(order=0)`). + +## 1. The Pre-test Battery + +de Chaisemartin et al. (2026) Section 4.2 lays out a four-step workflow for HAD identification: + +1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1, `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1', `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters. +2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD. +3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias? +4. **Step 4 - Boundary continuity (paper Assumptions 5, 6):** local-linearity of the dose-response near the boundary `d_lower`. **Non-testable**; argued from domain knowledge. + +The library bundles the testable steps into one entry point: `did_had_pretest_workflow`. It dispatches to a two-period implementation (steps 1 + 3 only - step 2 needs at least two pre-periods) or a multi-period implementation (steps 1 + 2 + 3 jointly). The Yatchew-HR test from Step 3 is also exposed standalone with two null modes; we exercise both in the side panel. + +## 2. The Panel + +We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1) identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design="auto"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1 from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open. 
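+
+Before building the panel, here is a compact side-by-side sketch of the two rules just described - a toy restatement for intuition, not the library's implementations (`qug_test` and `had.py::_detect_design` are the real entry points; the closed-form p-value below is an assumption that merely matches the summaries printed later in this notebook):
+
+```python
+import numpy as np
+
+
+def qug_sketch(d, alpha=0.05):
+    """QUG shape (paper Theorem 4): T = D_(1) / (D_(2) - D_(1)) vs 1/alpha - 1."""
+    d_sorted = np.sort(np.asarray(d, dtype=float))
+    d1, d2 = d_sorted[0], d_sorted[1]
+    t_stat = d1 / (d2 - d1)
+    # p-value form assumed from the printed summaries (p = 1 / (1 + T)),
+    # not taken from library source.
+    p_value = min(1.0, 1.0 / (1.0 + t_stat))
+    reject = t_stat > (1.0 / alpha - 1.0)  # reject H0: d_lower = 0
+    return t_stat, p_value, reject
+
+
+def design_heuristic_sketch(d):
+    """design="auto" shape: continuous_at_zero iff d.min() < 0.01 * median(|d|)."""
+    d = np.asarray(d, dtype=float)
+    if d.min() < 0.01 * np.median(np.abs(d)):
+        return "continuous_at_zero"
+    return "continuous_near_d_lower"
+```
+
+Note that neither function consumes the other's output - that is the independence stressed above. On this panel's post-launch doses the shapes land where the prose says: `T ~ 3.86`, far below the critical value 19, and `D_(1) / median(D)` just under the 0.01 threshold.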
+ +```python +import numpy as np +import pandas as pd + +from diff_diff import generate_continuous_did_data + +MAIN_SEED = 87 +N_UNITS = 60 +N_PERIODS = 8 +COHORT_PERIOD = 5 +TRUE_SLOPE = 100.0 +BASELINE_VISITS = 5000.0 +DOSE_LOW = 0.01 +DOSE_HIGH = 50.0 + +raw = generate_continuous_did_data( + n_units=N_UNITS, + n_periods=N_PERIODS, + cohort_periods=[COHORT_PERIOD], + never_treated_frac=0.0, + dose_distribution="uniform", + dose_params={"low": DOSE_LOW, "high": DOSE_HIGH}, + att_function="linear", + att_intercept=0.0, + att_slope=TRUE_SLOPE, + unit_fe_sd=8.0, + time_trend=0.5, + noise_sd=2.0, + seed=MAIN_SEED, +) +panel = raw.copy() +panel.loc[panel["period"] < panel["first_treat"], "dose"] = 0.0 +panel = panel.rename( + columns={ + "unit": "dma_id", + "period": "week", + "outcome": "weekly_visits", + "dose": "regional_spend_k", + } +) +panel["weekly_visits"] = panel["weekly_visits"] + BASELINE_VISITS + +post = panel[panel["week"] >= COHORT_PERIOD] +print(f"Panel: {panel['dma_id'].nunique()} DMAs x {panel['week'].nunique()} weeks") +print( + f"Regional spend (post-launch): " + f"${post['regional_spend_k'].min():.2f}K - " + f"${post['regional_spend_k'].max():.2f}K" +) +print(f"True per-$1K lift (locked at seed): {TRUE_SLOPE} weekly visits") +``` + +**Output:** + +``` +Panel: 60 DMAs x 8 weeks +Regional spend (post-launch): $0.18K - $49.00K +True per-$1K lift (locked at seed): 100.0 weekly visits +``` + +## 3. Step 1: The Overall Workflow (Two-Period Path) + +T20's headline used a two-period collapse of the panel - average pre-launch outcome per DMA against average post-launch outcome per DMA. That's also the natural input shape for HAD's two-period (`aggregate="overall"`) pre-test workflow, which runs **paper Step 1 (QUG) + paper Step 3 (linearity, via Stute and Yatchew-HR)**. Step 2 (parallel pre-trends) is not implemented on this path - a single pre-period structurally can't support a pre-trends test - and the workflow's verdict says so explicitly. + +We collapse to two periods (pre = avg over weeks 1-4, post = avg over weeks 5-8), then call the workflow. + +```python +from diff_diff import did_had_pretest_workflow + +p = panel.copy() +p["period"] = (p["week"] >= COHORT_PERIOD).astype(int) + 1 # 1=pre, 2=post +two_period = p.groupby(["dma_id", "period"], as_index=False).agg( + weekly_visits=("weekly_visits", "mean"), + regional_spend_k=("regional_spend_k", "mean"), +) +# Workflow invariant: pre-period dose = 0 for every unit. +two_period.loc[two_period["period"] == 1, "regional_spend_k"] = 0.0 +# first_treat in the collapsed coordinates: 2 (the post-period) for every DMA. +two_period["first_treat"] = 2 + +overall_report = did_had_pretest_workflow( + data=two_period, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="period", + unit_col="dma_id", + first_treat_col="first_treat", + alpha=0.05, + n_bootstrap=999, + seed=21, + aggregate="overall", +) + +print(overall_report.verdict) +print(f"\nall_pass = {overall_report.all_pass}") +print(f"aggregate = {overall_report.aggregate!r}") +print(f"pretrends_joint populated? {overall_report.pretrends_joint is not None}") +print(f"homogeneity_joint populated? {overall_report.homogeneity_joint is not None}") +``` + +**Output:** + +``` +QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up) + +all_pass = True +aggregate = 'overall' +pretrends_joint populated? False +homogeneity_joint populated? 
False +``` + +**Reading the overall verdict.** Three things to note. + +- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1, but the QUG test is independent of HAD's `design="auto"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.) +- **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The diagnostics do not flag heterogeneity bias on the dose dimension, so reading the WAS as an average per-dose marginal effect is supported by these tests (subject to finite-sample power). +- **Step 2 (Assumption 7 pre-trends) is not run on this path.** The verdict says so verbatim: `"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path). + +A note on `all_pass = True` here: the workflow's `all_pass` flag aggregates only the steps that actually ran on this dispatch path. On the overall path that is QUG + linearity (Stute / Yatchew); Step 2's deferral is *not* folded into `all_pass`. So `all_pass = True` on the overall path means "of the two steps that ran, neither rejected" - it does not mean Assumption 7 has been cleared. The upgrade to event-study below makes this concrete by actually running Step 2. + +Let's look at each individual test result. + +```python +overall_report.qug.print_summary() +print() +overall_report.stute.print_summary() +print() +overall_report.yatchew.print_summary() +``` + +**Output:** + +``` +================================================================ + QUG null test (H_0: d_lower = 0) +================================================================ +Statistic T: 3.8562 +p-value: 0.2059 +Critical value (1/alpha-1): 19.0000 +Reject H_0: False +alpha: 0.0500 +Observations: 60 +Excluded (d == 0): 0 +D_(1): 0.1806 +D_(2): 0.2274 +================================================================ + +================================================================ + Stute CvM linearity test (H_0: linear E[dY|D]) +================================================================ +CvM statistic: 0.0735 +Bootstrap p-value: 0.6860 +Reject H_0: False +alpha: 0.0500 +Bootstrap replications: 999 +Observations: 60 +Seed: 21 +================================================================ + +================================================================ + Yatchew-HR linearity test (H_0: linear E[dY|D]) +================================================================ +T_hr statistic: -34759.3017 +p-value: 1.0000 +Critical value (1-sided z): 1.6449 +Reject H_0: False +alpha: 0.0500 +sigma^2_lin (OLS): 1.6177 +sigma^2_diff (Yatchew): 6250.2569 +sigma^2_W (HR scale): 1.3925 +Observations: 60 +================================================================ +``` + +A note on the Yatchew row. 
The `T_hr` statistic is **very large and negative** (~-35,000), which looks alarming but is a scale artifact, not pathology. Under the Yatchew construction `sigma2_diff = (1 / 2G) * sum((dy_{(g)} - dy_{(g-1)})^2)` is computed on `dy` sorted by dose `D`. With doses spread over Uniform[\$0.01K, \$50K] and a true per-$1K slope of 100 (locked by the DGP), adjacent-by-dose units have `dy` values that differ by roughly `100 * (D_{(g)} - D_{(g-1)})` plus noise — those squared gaps add up to a large `sigma2_diff` (about 6,250 here) by virtue of the dose scale, while the OLS residual variance `sigma2_lin` (about 1.6) reflects only noise around the linear fit. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. The side panel later in the notebook constructs a different Yatchew input (within-pre-period first-differences, where the adjacent-by-dose `dy` gaps are not driven by the post-treatment slope) and produces a `T_hr` near zero — a useful sanity check that the test behaves the way it should when the dose dimension genuinely contributes nothing to the variance of `dy`. + +## 4. Step 2: Upgrade to the Event-Study Workflow + +The two-period workflow ran Steps 1 and 3 but did not run Step 2 (parallel pre-trends). Our panel actually has 8 weeks - that is enough pre-periods to add the joint Stute pre-trends diagnostic (paper Section 4.2 step 2 + Hlavka-Huskova 2020 / Delgado-Manteiga 2001 dependence-preserving Mammen multiplier bootstrap). + +We pass the full multi-period panel to `did_had_pretest_workflow(aggregate="event_study", ...)`. The dispatch runs all three testable steps in one call: + +- **Step 1**: QUG re-runs on the dose distribution at the treatment period `F` (deterministic; same numbers as the overall path). +- **Step 2**: `joint_pretrends_test` - mean-independence joint Stute over the pre-period horizons (`E[Y_t - Y_base | D] = mu_t` for each t < F). +- **Step 3**: `joint_homogeneity_test` - linearity joint Stute over the post-period horizons (`E[Y_t - Y_base | D_t] = beta_{0,t} + beta_{fe,t} * D` for each t >= F). + +Step 3's "Yatchew-HR" arm has no joint variant in the paper (the differencing-based variance estimator doesn't have a derived multi-horizon extension), so the event-study path runs only joint Stute for linearity. Practitioners who want Yatchew-HR robustness on multi-period data can call the standalone `yatchew_hr_test` on each (base, post) pair manually. + +```python +es_report = did_had_pretest_workflow( + data=panel, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="week", + unit_col="dma_id", + first_treat_col="first_treat", + alpha=0.05, + n_bootstrap=999, + seed=21, + aggregate="event_study", +) + +print(es_report.verdict) +print(f"\nall_pass = {es_report.all_pass}") +print(f"aggregate = {es_report.aggregate!r}") +print(f"pretrends_joint populated? {es_report.pretrends_joint is not None}") +print(f"homogeneity_joint populated? {es_report.homogeneity_joint is not None}") +``` + +**Output:** + +``` +QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions) + +all_pass = True +aggregate = 'event_study' +pretrends_joint populated? True +homogeneity_joint populated? True +``` + +**Reading the event-study verdict.** Now the verdict reads `"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)"`. 
The `"deferred"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated. + +A note on the verdict's "TWFE admissible" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05`. That is non-rejection evidence under the diagnostics' finite-sample power and specification, not a proof that the identifying assumptions hold. Step 4 (boundary continuity, paper Assumptions 5 / 6) remains non-testable from data and is not covered by any of the three diagnostics here. + +The joint pre-trends test runs over `n_horizons = 3` (pre-periods 1, 2, 3, with week 4 reserved as the base period). The joint homogeneity test runs over `n_horizons = 4` (post-periods 5, 6, 7, 8). Let's inspect the per-horizon detail. + +```python +es_report.qug.print_summary() +print() +es_report.pretrends_joint.print_summary() +print() +es_report.homogeneity_joint.print_summary() +``` + +**Output:** + +``` +================================================================ + QUG null test (H_0: d_lower = 0) +================================================================ +Statistic T: 3.8562 +p-value: 0.2059 +Critical value (1/alpha-1): 19.0000 +Reject H_0: False +alpha: 0.0500 +Observations: 60 +Excluded (d == 0): 0 +D_(1): 0.1806 +D_(2): 0.2274 +================================================================ + +================================================================ + Joint Stute CvM test (mean-independence (pre-trends)) +================================================================ +Joint CvM statistic: 7.1627 +Bootstrap p-value: 0.0720 +Reject H_0: False +alpha: 0.0500 +Bootstrap replications: 999 +Horizons: 3 +Observations: 60 +Seed: 21 +Exact-linear short-circuit: False +---------------------------------------------------------------- +Per-horizon statistics: + 1 1.6112 + 2 2.9262 + 3 2.6253 +================================================================ + +================================================================ + Joint Stute CvM test (linearity (post-homogeneity)) +================================================================ +Joint CvM statistic: 1.3562 +Bootstrap p-value: 0.7630 +Reject H_0: False +alpha: 0.0500 +Bootstrap replications: 999 +Horizons: 4 +Observations: 60 +Seed: 21 +Exact-linear short-circuit: False +---------------------------------------------------------------- +Per-horizon statistics: + 5 0.4218 + 6 0.2186 + 7 0.4928 + 8 0.2230 +================================================================ +``` + +The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold. The test does not reject at alpha = 0.05, but the near-threshold p-value warrants scrutiny - the diagnostic is not failing in a clearly-far-from-rejection regime. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring). + +The joint homogeneity p-value (~0.76) is comfortably far from rejection. The diagnostic does not flag heterogeneity bias on the dose dimension across the four post-launch horizons. 
+ +Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. Step 4 (boundary continuity, Assumptions 5 / 6) remains non-testable from data and is argued from domain knowledge, as in T20. + +## 5. Side Panel: Yatchew-HR Null Modes + +The Yatchew-HR test exposes two `null=` modes (the second was added in 2026-04 for parity with the R `YatchewTest` package). + +- `null="linearity"` (default; paper Theorem 7): tests `H0: E[dY | D]` is linear in `D`. Residuals come from OLS `dy ~ 1 + d`. This is what `did_had_pretest_workflow` calls under the hood. +- `null="mean_independence"` (added 2026-04-26 in PR #397, Phase 4 R-parity): tests the stricter `H0: E[dY | D] = E[dY]`, i.e. `dY` is mean-independent of `D`. Residuals come from intercept-only OLS `dy ~ 1`. Mirrors R `YatchewTest::yatchew_test(order=0)`. + +The mean-independence mode is typically used on **placebo (pre-treatment) data** to test parallel pre-trends as a non-parametric mean-independence assertion. Below we construct an illustrative input - the within-pre-period first-difference `dy = Y[week=4] - Y[week=3]` paired with each DMA's actual post-period dose - and run both modes side by side. Both should fail to reject on this clean linear DGP; the contrast is in the residual structure. + +```python +from diff_diff import yatchew_hr_test + +panel_sorted = panel.sort_values(["dma_id", "week"]).reset_index(drop=True) +pre = panel_sorted[panel_sorted["week"].isin([3, 4])] +pre_pivot = pre.pivot(index="dma_id", columns="week", values="weekly_visits") +dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64) +post_dose = ( + panel_sorted[panel_sorted["week"] == 5] + .set_index("dma_id") + .sort_index()["regional_spend_k"] + .to_numpy(dtype=np.float64) +) + +res_lin = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="linearity") +res_mi = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="mean_independence") + +print(res_lin.summary()) +print() +print(res_mi.summary()) +``` + +**Output:** + +``` +================================================================ + Yatchew-HR linearity test (H_0: linear E[dY|D]) +================================================================ +T_hr statistic: 0.0207 +p-value: 0.4917 +Critical value (1-sided z): 1.6449 +Reject H_0: False +alpha: 0.0500 +sigma^2_lin (OLS): 6.5340 +sigma^2_diff (Yatchew): 6.5170 +sigma^2_W (HR scale): 6.3639 +Observations: 60 +================================================================ + +================================================================ + Yatchew-HR mean-independence test (H_0: E[dY|D] = E[dY]) +================================================================ +T_hr statistic: 0.5536 +p-value: 0.2899 +Critical value (1-sided z): 1.6449 +Reject H_0: False +alpha: 0.0500 +sigma^2_lin (OLS): 7.0076 +sigma^2_diff (Yatchew): 6.5170 +sigma^2_W (HR scale): 6.8638 +Observations: 60 +================================================================ +``` + +**Reading the side-panel comparison.** + +- The `linearity` mode fits `dy ~ 1 + d` and computes residual variance `sigma2_lin` from those residuals. Under a clean linear DGP the residuals are small (close to noise variance), the gap `sigma2_lin - sigma2_diff` is near zero, and `T_hr` lands close to zero with a p-value far above alpha. 
+- The `mean_independence` mode fits intercept-only `dy ~ 1` and computes `sigma2_lin` as the population variance of `dy`. That residual variance is **strictly larger** than under `linearity` (the linear fit can absorb any apparent slope between `dy` and `d` - real or sample noise - shrinking the residual variance, while intercept-only cannot). The gap `sigma2_lin - sigma2_diff` is then larger and `T_hr` is larger - same asymptotic distribution, stricter null, more easily rejected when the alternative is true. + +On clean linear placebo data both modes fail to reject - exactly what we want. On data where `dY` actually responds to `D` in pre-period (parallel pre-trends fail), `null="mean_independence"` is more sensitive than `null="linearity"` because linearity is a weaker null (linear pre-trends would fail to reject the linearity null but would reject the mean-independence null). + +When to choose which: use `null="linearity"` to defend the joint identification assumption (paper Step 3, Assumption 8). Use `null="mean_independence"` on placebo (pre-treatment) data when you want a non-parametric mean-independence assertion. The `null="mean_independence"` mode is what R `YatchewTest::yatchew_test(order=0)` runs by default for placebo pre-trend tests. + +## 6. Communicating the Diagnostics to Leadership + +Pre-test results travel awkwardly to non-technical audiences. The template below structures the diagnostics around what each test does and does not rule out - mirroring the headline-and-evidence pattern from T20 Section 5. + +> **The HAD pre-test diagnostics on the brand-campaign panel do not flag a violation of the testable identifying assumptions.** +> +> - **Step 1 (QUG support-infimum, paper Theorem 4):** the test does not reject `H0: d_lower = 0` (p approximately 0.21). The data are statistically consistent with a dose distribution starting at zero. Independently of QUG, HAD's `design="auto"` selector applies a min/median heuristic to the post-period dose vector and lands on the `continuous_at_zero` design (target `WAS`) on this panel; QUG and the design selector are separate rules that point to the same identification path here. Failing to reject the QUG null is not proof that the true support is exactly at zero, and the design selector's choice is operational, not statistical. +> - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test does not reject (joint p approximately 0.07 across the three pre-period horizons). The p-value is close to alpha = 0.05, so the non-rejection here is not by a wide margin - in a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending. +> - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity does not reject (joint p approximately 0.76 across the four post-launch horizons). The diagnostic does not flag heterogeneity bias on the dose dimension under the test's specification. +> +> **Non-testable from data (Step 4, paper Assumptions 5 / 6, boundary continuity):** local-linearity of the dose-response near `d_lower`. Argued from domain knowledge - is there reason to believe the marginal effect of an additional $1K of regional spend is roughly constant across the dose range? In our case yes, by DGP construction; in a real analysis we would justify this from prior knowledge of the channel's response shape. +> +> **Bottom line:** the workflow's three testable diagnostics do not flag a violation. 
Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and Step 4 (boundary continuity, non-testable from data). None of these are settled by non-rejection of the pre-tests.
+
+## 7. Extensions
+
+This tutorial covered the composite pre-test workflow on a single panel where QUG led the workflow to select the `continuous_at_zero` (Design 1) identification path. A few directions we did not exercise here:
+
+- **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). A follow-up tutorial covers this path end-to-end.
+- **`trends_lin=True` (Pierce-Schott Eq 17 / 18 detrending)** - mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Forwards into both joint pre-trends and joint homogeneity wrappers; consumes the placebo at `base_period - 1` and skips Step 2 if no earlier placebo survives the drop. Useful when you suspect linear time trends correlated with dose but want to keep the joint-Stute machinery.
+- **Standalone constituent tests** - all four building blocks are exposed for direct calling: `qug_test`, `stute_test`, `yatchew_hr_test` (used in this tutorial's side panel), and the joint variants `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`.
+
+See the [`HeterogeneousAdoptionDiD` API reference](../api/had.html) and the [`HAD pre-tests` reference](../api/had.html#pre-tests) for the full parameter lists.
+
+**Related tutorials.**
+
+- [Tutorial 14: Continuous DiD](14_continuous_did.ipynb) - the Callaway-Goodman-Bacon-Sant'Anna estimator for continuous-dose settings where you *do* have a never-treated unit *and* want the per-dose ATT(d) curve, not just the average slope.
+- [Tutorial 20: HAD for a National Brand Campaign](20_had_brand_campaign.ipynb) - the headline HAD fit and event-study this tutorial defends.
+- [Tutorial 4: Parallel Trends](04_parallel_trends.ipynb) - parallel-trends tests for the binary-DiD setting.
+
+## 8. Summary Checklist
+
+- HAD's pre-test workflow `did_had_pretest_workflow` bundles paper Section 4.2 Steps 1 (QUG support infimum), 2 (joint Stute pre-trends - event-study path only), and 3 (Stute / Yatchew-HR linearity, joint variant on event-study path).
+- The two-period (`aggregate="overall"`) path runs Steps 1 + 3 only - it cannot run Step 2 because a single pre-period structurally has nothing to test against. The verdict says so verbatim: "Assumption 7 pre-trends test NOT run".
+- Upgrade to the multi-period (`aggregate="event_study"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads "TWFE admissible under Section 4 assumptions" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof.
+- Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge.
+- The Yatchew-HR test exposes two null modes: `null="linearity"` (paper Theorem 7, default; what the workflow calls under the hood) and `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data). +- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The workflow uses the QUG outcome to pick the identification path (`continuous_at_zero` vs `continuous_near_d_lower`); finite-sample uncertainty in that decision is a remaining caveat. +- Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python). diff --git a/docs/conf.py b/docs/conf.py index 4baa9ccb..828dd1c5 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -33,7 +33,7 @@ ] templates_path = ["_templates"] -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"] +exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "_review"] # -- Options for autodoc ----------------------------------------------------- autodoc_default_options = { From d8437a3a8b41c5be22d40255dc8de8bfca7feb87 Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 10 May 2026 10:45:27 -0400 Subject: [PATCH 09/12] Address PR #409 R4 review (1 P1, 1 P2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 — HAD design label convention was reversed across T21. Per REGISTRY:2267 + had.py:7-33, the convention is: - Design 1' = continuous_at_zero (d_lower = 0, QUG case) — that's T21 - Design 1 = continuous_near_d_lower (d_lower > 0) — that's T20 T21 had Design 1 / Design 1' swapped throughout. Fixed in the build script (Section 1 paper-step taxonomy, Section 2 panel framing, Section 3 reading-the-verdict, Section 7 Extensions). Notebook re-executed and review extract regenerated. Two residual "QUG selects/picks the identification path" leakages from the original prose also surfaced (Section 7 + Summary checklist). Both contradicted the explicit QUG-vs-_detect_design separation locked by test_had_design_auto_lands_on_continuous_at_zero. Reworded to keep the two rules independent ("QUG fail-to-reject and `design="auto"` heuristic both pointed independently"; "QUG is a statistical test on H0; `design="auto"` calls _detect_design() which uses a min/median heuristic — both pointed to continuous_at_zero on this panel"). P2 (MT1) — T21 was mapped under had_pretests.py in doc-deps.yaml but the drift test now also locks HAD(design="auto") / _detect_design() behavior from had.py via test_had_design_auto_lands_on_continuous_at_zero. Add T21 entry to the had.py docs block with a note on the _detect_design() drift coverage so a future had.py design-selection change does not miss T21 in the manual docs-impact map. All 16 drift tests still pass on Rust; nbmake clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/_review/t21_notebook_extract.md | 10 +- docs/doc-deps.yaml | 3 + docs/tutorials/21_had_pretest_workflow.ipynb | 98 ++++++++++---------- 3 files changed, 57 insertions(+), 54 deletions(-) diff --git a/docs/_review/t21_notebook_extract.md b/docs/_review/t21_notebook_extract.md index 4505f5c8..8f9b02c3 100644 --- a/docs/_review/t21_notebook_extract.md +++ b/docs/_review/t21_notebook_extract.md @@ -28,7 +28,7 @@ This tutorial picks up where T20 left off. We re-run the brand campaign on a pan de Chaisemartin et al. 
(2026) Section 4.2 lays out a four-step workflow for HAD identification: -1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1, `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1', `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters. +1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1', `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1, `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters. 2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD. 3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias? 4. **Step 4 - Boundary continuity (paper Assumptions 5, 6):** local-linearity of the dose-response near the boundary `d_lower`. **Non-testable**; argued from domain knowledge. @@ -37,7 +37,7 @@ The library bundles the testable steps into one entry point: `did_had_pretest_wo ## 2. The Panel -We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1) identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design="auto"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1 from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open. +We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. 
(a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1') identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design="auto"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1' from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open. ```python import numpy as np @@ -152,7 +152,7 @@ homogeneity_joint populated? False **Reading the overall verdict.** Three things to note. -- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1, but the QUG test is independent of HAD's `design="auto"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.) +- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1', but the QUG test is independent of HAD's `design="auto"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.) - **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The diagnostics do not flag heterogeneity bias on the dose dimension, so reading the WAS as an average per-dose marginal effect is supported by these tests (subject to finite-sample power). - **Step 2 (Assumption 7 pre-trends) is not run on this path.** The verdict says so verbatim: `"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path). @@ -423,7 +423,7 @@ Pre-test results travel awkwardly to non-technical audiences. The template below ## 7. Extensions -This tutorial covered the composite pre-test workflow on a single panel where QUG led the workflow to select the `continuous_at_zero` (Design 1) identification path. 
A few directions we did not exercise here: +This tutorial covered the composite pre-test workflow on a single panel where QUG fail-to-reject and HAD's `design="auto"` heuristic both pointed independently to the `continuous_at_zero` (Design 1') identification path. A few directions we did not exercise here: - **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). A follow-up tutorial covers this path end-to-end. - **`trends_lin=True` (Pierce-Schott Eq 17 / 18 detrending)** - mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Forwards into both joint pre-trends and joint homogeneity wrappers; consumes the placebo at `base_period - 1` and skips Step 2 if no earlier placebo survives the drop. Useful when you suspect linear time trends correlated with dose but want to keep the joint-Stute machinery. @@ -444,5 +444,5 @@ See the [`HeterogeneousAdoptionDiD` API reference](../api/had.html) and the [`HA - Upgrade to the multi-period (`aggregate="event_study"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads "TWFE admissible under Section 4 assumptions" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof. - Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge. - The Yatchew-HR test exposes two null modes: `null="linearity"` (paper Theorem 7, default; what the workflow calls under the hood) and `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data). -- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The workflow uses the QUG outcome to pick the identification path (`continuous_at_zero` vs `continuous_near_d_lower`); finite-sample uncertainty in that decision is a remaining caveat. +- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The QUG test and HAD's `design="auto"` selector are independent rules: QUG is a statistical test on `H0: d_lower = 0`; `design="auto"` calls `_detect_design()` which uses a min/median heuristic on the dose vector. Both pointed to `continuous_at_zero` on this panel; finite-sample uncertainty in either decision is a remaining caveat. - Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python). 
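Editor's note on the QUG-vs-`_detect_design` separation locked by this patch: the sketch below makes the two independent rules concrete, using only the formulas the extract states - the QUG statistic `T = D_(1) / (D_(2) - D_(1))` with critical value `1/alpha - 1`, and the `d.min() < 0.01 * median(|d|)` heuristic. This is plain NumPy, not the library's internals; `qug_statistic` and `min_median_heuristic` are hypothetical helper names, and the deterministic dose vector is a stand-in for T21's seeded draw.

```python
import numpy as np

# Illustrative sketch of the two independent design rules described above.
# Helper names are hypothetical; only the formulas come from the tutorial text.

def qug_statistic(d, alpha=0.05):
    """QUG support-infimum test on H0: d_lower = 0 (paper Theorem 4 form)."""
    d_sorted = np.sort(np.asarray(d, dtype=np.float64))
    d1, d2 = d_sorted[0], d_sorted[1]          # order statistics D_(1), D_(2)
    t_stat = d1 / (d2 - d1)                    # T = D_(1) / (D_(2) - D_(1))
    crit = 1.0 / alpha - 1.0                   # 19 at alpha = 0.05
    return t_stat, crit, bool(t_stat > crit)   # reject when T exceeds crit

def min_median_heuristic(d):
    """Separate operational rule in the style of _detect_design(): fires
    continuous_at_zero when d.min() < 0.01 * median(|d|). Never sees QUG."""
    d = np.asarray(d, dtype=np.float64)
    if d.min() < 0.01 * np.median(np.abs(d)):
        return "continuous_at_zero"
    return "continuous_near_d_lower"

# A deterministic dose vector with a very small D_(1), T21-style.
dose = np.linspace(0.012, 50.0, 60)

t_stat, crit, reject = qug_statistic(dose)
print(f"QUG: T = {t_stat:.3f}, crit = {crit:.0f}, reject H0: {reject}")
print("heuristic:", min_median_heuristic(dose))
# QUG: T = 0.014, crit = 19, reject H0: False  -> consistent with d_lower = 0
# heuristic: continuous_at_zero                -> same path, different rule
```

The point mirrors the reworded checklist bullet: one rule is a statistical test, the other an operational threshold, and their agreement on this panel is a coincidence of the data, not a dependency between the two.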
diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml index fea1ced4..c325f5f9 100644 --- a/docs/doc-deps.yaml +++ b/docs/doc-deps.yaml @@ -388,6 +388,9 @@ sources: - path: diff_diff/guides/llms-full.txt section: "HeterogeneousAdoptionDiD" type: user_guide + - path: docs/tutorials/21_had_pretest_workflow.ipynb + type: tutorial + note: "Drift-locks `HAD(design=\"auto\")` resolution to `continuous_at_zero` on T21's panel via `tests/test_t21_had_pretest_workflow_drift.py::test_had_design_auto_lands_on_continuous_at_zero`; changes to `_detect_design()` heuristic should re-validate T21" diff_diff/had_pretests.py: drift_risk: medium diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb index b9154d49..8e5f152a 100644 --- a/docs/tutorials/21_had_pretest_workflow.ipynb +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "64f6ebc9", + "id": "118e1f9c", "metadata": {}, "source": [ "# Tutorial 21: HAD Pre-test Workflow - Running the Pre-test Diagnostics on the Brand Campaign Panel\n", @@ -14,14 +14,14 @@ }, { "cell_type": "markdown", - "id": "91b4bc64", + "id": "e9c4c4f9", "metadata": {}, "source": [ "## 1. The Pre-test Battery\n", "\n", "de Chaisemartin et al. (2026) Section 4.2 lays out a four-step workflow for HAD identification:\n", "\n", - "1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1, `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1', `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters.\n", + "1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1', `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1, `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters.\n", "2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD.\n", "3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias?\n", "4. **Step 4 - Boundary continuity (paper Assumptions 5, 6):** local-linearity of the dose-response near the boundary `d_lower`. **Non-testable**; argued from domain knowledge.\n", @@ -31,24 +31,24 @@ }, { "cell_type": "markdown", - "id": "c91bfc78", + "id": "9a746805", "metadata": {}, "source": [ "## 2. The Panel\n", "\n", - "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. 
(a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1) identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design=\"auto\"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1 from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.\n" + "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1') identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design=\"auto\"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1' from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.\n" ] }, { "cell_type": "code", "execution_count": 1, - "id": "c3ed9b52", + "id": "d1ae0139", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:53:29.812054Z", - "iopub.status.busy": "2026-05-10T00:53:29.811945Z", - "iopub.status.idle": "2026-05-10T00:53:31.283255Z", - "shell.execute_reply": "2026-05-10T00:53:31.282925Z" + "iopub.execute_input": "2026-05-10T14:44:56.348210Z", + "iopub.status.busy": "2026-05-10T14:44:56.348005Z", + "iopub.status.idle": "2026-05-10T14:44:58.011237Z", + "shell.execute_reply": "2026-05-10T14:44:58.010959Z" } }, "outputs": [ @@ -116,7 +116,7 @@ }, { "cell_type": "markdown", - "id": "840e7738", + "id": "4f2cf6ab", "metadata": {}, "source": [ "## 3. 
Step 1: The Overall Workflow (Two-Period Path)\n", @@ -129,13 +129,13 @@ { "cell_type": "code", "execution_count": 2, - "id": "527e4562", + "id": "e9dcb44f", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:53:31.284493Z", - "iopub.status.busy": "2026-05-10T00:53:31.284385Z", - "iopub.status.idle": "2026-05-10T00:53:31.323187Z", - "shell.execute_reply": "2026-05-10T00:53:31.322931Z" + "iopub.execute_input": "2026-05-10T14:44:58.012454Z", + "iopub.status.busy": "2026-05-10T14:44:58.012339Z", + "iopub.status.idle": "2026-05-10T14:44:58.052568Z", + "shell.execute_reply": "2026-05-10T14:44:58.052290Z" } }, "outputs": [ @@ -188,12 +188,12 @@ }, { "cell_type": "markdown", - "id": "1ff016af", + "id": "67d0c597", "metadata": {}, "source": [ "**Reading the overall verdict.** Three things to note.\n", "\n", - "- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1, but the QUG test is independent of HAD's `design=\"auto\"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.)\n", + "- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1', but the QUG test is independent of HAD's `design=\"auto\"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.)\n", "- **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The diagnostics do not flag heterogeneity bias on the dose dimension, so reading the WAS as an average per-dose marginal effect is supported by these tests (subject to finite-sample power).\n", "- **Step 2 (Assumption 7 pre-trends) is not run on this path.** The verdict says so verbatim: `\"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)\"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. 
The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path).\n", "\n", @@ -205,13 +205,13 @@ { "cell_type": "code", "execution_count": 3, - "id": "bcf57147", + "id": "3d7cbdce", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:53:31.324354Z", - "iopub.status.busy": "2026-05-10T00:53:31.324272Z", - "iopub.status.idle": "2026-05-10T00:53:31.326037Z", - "shell.execute_reply": "2026-05-10T00:53:31.325825Z" + "iopub.execute_input": "2026-05-10T14:44:58.053856Z", + "iopub.status.busy": "2026-05-10T14:44:58.053763Z", + "iopub.status.idle": "2026-05-10T14:44:58.055674Z", + "shell.execute_reply": "2026-05-10T14:44:58.055438Z" } }, "outputs": [ @@ -271,7 +271,7 @@ }, { "cell_type": "markdown", - "id": "ba741d74", + "id": "0d023307", "metadata": {}, "source": [ "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000), which looks alarming but is a scale artifact, not pathology. Under the Yatchew construction `sigma2_diff = (1 / 2G) * sum((dy_{(g)} - dy_{(g-1)})^2)` is computed on `dy` sorted by dose `D`. With doses spread over Uniform[\\$0.01K, \\$50K] and a true per-$1K slope of 100 (locked by the DGP), adjacent-by-dose units have `dy` values that differ by roughly `100 * (D_{(g)} - D_{(g-1)})` plus noise — those squared gaps add up to a large `sigma2_diff` (about 6,250 here) by virtue of the dose scale, while the OLS residual variance `sigma2_lin` (about 1.6) reflects only noise around the linear fit. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. The side panel later in the notebook constructs a different Yatchew input (within-pre-period first-differences, where the adjacent-by-dose `dy` gaps are not driven by the post-treatment slope) and produces a `T_hr` near zero — a useful sanity check that the test behaves the way it should when the dose dimension genuinely contributes nothing to the variance of `dy`.\n" @@ -279,7 +279,7 @@ }, { "cell_type": "markdown", - "id": "850f5fee", + "id": "8c2207b9", "metadata": {}, "source": [ "## 4. Step 2: Upgrade to the Event-Study Workflow\n", @@ -298,13 +298,13 @@ { "cell_type": "code", "execution_count": 4, - "id": "4ce3929e", + "id": "b9fc9759", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:53:31.326991Z", - "iopub.status.busy": "2026-05-10T00:53:31.326919Z", - "iopub.status.idle": "2026-05-10T00:53:31.449260Z", - "shell.execute_reply": "2026-05-10T00:53:31.448989Z" + "iopub.execute_input": "2026-05-10T14:44:58.056648Z", + "iopub.status.busy": "2026-05-10T14:44:58.056577Z", + "iopub.status.idle": "2026-05-10T14:44:58.183086Z", + "shell.execute_reply": "2026-05-10T14:44:58.182792Z" } }, "outputs": [ @@ -344,7 +344,7 @@ }, { "cell_type": "markdown", - "id": "da7f50b3", + "id": "4617bf96", "metadata": {}, "source": [ "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. 
The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", @@ -357,13 +357,13 @@ { "cell_type": "code", "execution_count": 5, - "id": "2daa40d4", + "id": "6bf80443", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:53:31.450431Z", - "iopub.status.busy": "2026-05-10T00:53:31.450344Z", - "iopub.status.idle": "2026-05-10T00:53:31.452391Z", - "shell.execute_reply": "2026-05-10T00:53:31.452142Z" + "iopub.execute_input": "2026-05-10T14:44:58.184264Z", + "iopub.status.busy": "2026-05-10T14:44:58.184175Z", + "iopub.status.idle": "2026-05-10T14:44:58.186097Z", + "shell.execute_reply": "2026-05-10T14:44:58.185814Z" } }, "outputs": [ @@ -436,7 +436,7 @@ }, { "cell_type": "markdown", - "id": "25b7c8a7", + "id": "76fe2a7d", "metadata": {}, "source": [ "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold. The test does not reject at alpha = 0.05, but the near-threshold p-value warrants scrutiny - the diagnostic is not failing in a clearly-far-from-rejection regime. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n", @@ -448,7 +448,7 @@ }, { "cell_type": "markdown", - "id": "fe666646", + "id": "543d4fb2", "metadata": {}, "source": [ "## 5. Side Panel: Yatchew-HR Null Modes\n", @@ -464,13 +464,13 @@ { "cell_type": "code", "execution_count": 6, - "id": "7bc357d6", + "id": "606e1681", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T00:53:31.453469Z", - "iopub.status.busy": "2026-05-10T00:53:31.453395Z", - "iopub.status.idle": "2026-05-10T00:53:31.457975Z", - "shell.execute_reply": "2026-05-10T00:53:31.457762Z" + "iopub.execute_input": "2026-05-10T14:44:58.187360Z", + "iopub.status.busy": "2026-05-10T14:44:58.187263Z", + "iopub.status.idle": "2026-05-10T14:44:58.191876Z", + "shell.execute_reply": "2026-05-10T14:44:58.191499Z" } }, "outputs": [ @@ -532,7 +532,7 @@ }, { "cell_type": "markdown", - "id": "02061bc7", + "id": "c1d709fc", "metadata": {}, "source": [ "**Reading the side-panel comparison.**\n", @@ -547,7 +547,7 @@ }, { "cell_type": "markdown", - "id": "ea2d9d70", + "id": "05239106", "metadata": {}, "source": [ "## 6. Communicating the Diagnostics to Leadership\n", @@ -567,12 +567,12 @@ }, { "cell_type": "markdown", - "id": "504fd1c1", + "id": "67846f9c", "metadata": {}, "source": [ "## 7. Extensions\n", "\n", - "This tutorial covered the composite pre-test workflow on a single panel where QUG led the workflow to select the `continuous_at_zero` (Design 1) identification path. A few directions we did not exercise here:\n", + "This tutorial covered the composite pre-test workflow on a single panel where QUG fail-to-reject and HAD's `design=\"auto\"` heuristic both pointed independently to the `continuous_at_zero` (Design 1') identification path. A few directions we did not exercise here:\n", "\n", "- **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. 
The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). A follow-up tutorial covers this path end-to-end.\n", "- **`trends_lin=True` (Pierce-Schott Eq 17 / 18 detrending)** - mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Forwards into both joint pre-trends and joint homogeneity wrappers; consumes the placebo at `base_period - 1` and skips Step 2 if no earlier placebo survives the drop. Useful when you suspect linear time trends correlated with dose but want to keep the joint-Stute machinery.\n", @@ -589,7 +589,7 @@ }, { "cell_type": "markdown", - "id": "404a00cd", + "id": "f44341b9", "metadata": {}, "source": [ "## 8. Summary Checklist\n", @@ -599,7 +599,7 @@ "- Upgrade to the multi-period (`aggregate=\"event_study\"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads \"TWFE admissible under Section 4 assumptions\" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof.\n", "- Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge.\n", "- The Yatchew-HR test exposes two null modes: `null=\"linearity\"` (paper Theorem 7, default; what the workflow calls under the hood) and `null=\"mean_independence\"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data).\n", - "- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The workflow uses the QUG outcome to pick the identification path (`continuous_at_zero` vs `continuous_near_d_lower`); finite-sample uncertainty in that decision is a remaining caveat.\n", + "- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The QUG test and HAD's `design=\"auto\"` selector are independent rules: QUG is a statistical test on `H0: d_lower = 0`; `design=\"auto\"` calls `_detect_design()` which uses a min/median heuristic on the dose vector. Both pointed to `continuous_at_zero` on this panel; finite-sample uncertainty in either decision is a remaining caveat.\n", "- Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python).\n" ] } From 162f45aa8bf2234a44eb99300248606fe1e76b6b Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 10 May 2026 10:52:53 -0400 Subject: [PATCH 10/12] =?UTF-8?q?Address=20PR=20#409=20R5=20review=20(1=20?= =?UTF-8?q?P1=20=E2=80=94=20paper=20Step=204=20vs=20Design=201'=20caveat)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two methodology framing errors conflated in the original tutorial: - "Paper Step 4" was described as "Boundary continuity (Assumptions 5/6)" in the workflow taxonomy. Per REGISTRY's pretest workflow (lines 2482-2487 surrounding the four-step enumeration), Step 4 is actually the DECISION RULE: "if Steps 1-3 don't reject, TWFE may be used." Boundary-continuity assumptions are a separate concern. 
- Assumptions 5/6 are Design 1 (continuous_near_d_lower / mass_point) identification caveats — the library emits a UserWarning citing them on Design 1 fits and stays silent on Design 1' (continuous_at_zero) fits per REGISTRY:2532 and had.py. T21's panel resolves to Design 1' via QUG fail-to-reject + the _detect_design() heuristic, so the relevant non-testable caveat is **Assumption 3** (uniform continuity of d -> Y_2(d) at zero, REGISTRY:2270), NOT Assumptions 5/6. Inherited the 5/6 framing from T20 (which IS Design 1) inappropriately. Reframed across 7 surfaces in the build script: - Section 1 four-step enumeration: Step 4 is now the decision rule - Section 1: added a separate paragraph for the non-testable identification caveat that's design-path-specific (Assumption 3 for Design 1', Assumptions 5/6 for Design 1) and explicitly notes the library's UserWarning behavior matches this split - Section 4 event-study verdict reading: separated Step 4 (decision rule) from the Design 1' caveat - Section 4 horizon-detail closing: same split - Section 6 leadership template: replaced "Step 4 / Assumptions 5/6" caveat with the correct Design 1' caveat (Assumption 3); explicit parenthetical noting T20's caveat was different because T20 was Design 1 - Section 6 bottom line: same split (decision rule vs caveat) - Section 8 summary checklist: replaced single Step-4-as-caveat bullet with a two-part bullet on the workflow vs caveat distinction Notebook re-executed, review extract regenerated. All 16 drift tests still pass; nbmake clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/_review/t21_notebook_extract.md | 16 +-- docs/tutorials/21_had_pretest_workflow.ipynb | 106 ++++++++++--------- 2 files changed, 63 insertions(+), 59 deletions(-) diff --git a/docs/_review/t21_notebook_extract.md b/docs/_review/t21_notebook_extract.md index 8f9b02c3..8f4143ec 100644 --- a/docs/_review/t21_notebook_extract.md +++ b/docs/_review/t21_notebook_extract.md @@ -26,15 +26,17 @@ This tutorial picks up where T20 left off. We re-run the brand campaign on a pan ## 1. The Pre-test Battery -de Chaisemartin et al. (2026) Section 4.2 lays out a four-step workflow for HAD identification: +de Chaisemartin et al. (2026) Section 4.2 lays out a four-step pre-test workflow for HAD identification: 1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1', `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1, `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters. 2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD. 3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias? -4. **Step 4 - Boundary continuity (paper Assumptions 5, 6):** local-linearity of the dose-response near the boundary `d_lower`. **Non-testable**; argued from domain knowledge. +4. **Step 4 - Decision rule:** if Steps 1-3 all fail to reject, TWFE may be used to estimate the treatment effect (paper Section 4.3). The library bundles the testable steps into one entry point: `did_had_pretest_workflow`. 
It dispatches to a two-period implementation (steps 1 + 3 only - step 2 needs at least two pre-periods) or a multi-period implementation (steps 1 + 2 + 3 jointly). The Yatchew-HR test from Step 3 is also exposed standalone with two null modes; we exercise both in the side panel. +**Non-testable identification caveat (separate from the four-step workflow).** Identification of the WAS estimand under Design 1' (`continuous_at_zero`, target = `WAS`) requires **Assumption 3** (uniform continuity of `d -> Y_2(d)` at zero, holds if the dose-response is Lipschitz; not testable). The Design 1 paths (`continuous_near_d_lower` / `mass_point`, target = `WAS_d_lower`) instead need **Assumption 5** (sign identification) or **Assumption 6** (`WAS_d_lower` point identification) - that is the caveat T20's tutorial flagged because T20's panel was Design 1. T21's panel resolves to Design 1' (see Section 2 + Section 3), so the relevant non-testable caveat here is Assumption 3, NOT Assumptions 5/6. The library reflects this: it emits a UserWarning about Assumption 5/6 on Design 1 fits and does not emit it on `continuous_at_zero` (Design 1') fits. + ## 2. The Panel We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1') identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design="auto"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1' from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open. @@ -260,7 +262,7 @@ homogeneity_joint populated? True **Reading the event-study verdict.** Now the verdict reads `"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)"`. The `"deferred"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated. -A note on the verdict's "TWFE admissible" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05`. That is non-rejection evidence under the diagnostics' finite-sample power and specification, not a proof that the identifying assumptions hold. Step 4 (boundary continuity, paper Assumptions 5 / 6) remains non-testable from data and is not covered by any of the three diagnostics here. +A note on the verdict's "TWFE admissible" language. 
This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05` (paper Step 4 decision rule). That is non-rejection evidence under the diagnostics' finite-sample power and specification, not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero, see Section 1) sits alongside this and is not covered by any of the three diagnostics. The joint pre-trends test runs over `n_horizons = 3` (pre-periods 1, 2, 3, with week 4 reserved as the base period). The joint homogeneity test runs over `n_horizons = 4` (post-periods 5, 6, 7, 8). Let's inspect the per-horizon detail. @@ -333,7 +335,7 @@ The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 thres The joint homogeneity p-value (~0.76) is comfortably far from rejection. The diagnostic does not flag heterogeneity bias on the dose dimension across the four post-launch horizons. -Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. Step 4 (boundary continuity, Assumptions 5 / 6) remains non-testable from data and is argued from domain knowledge, as in T20. +Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. By paper Step 4 (the decision rule), TWFE may then be used. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero) remains and is argued from domain knowledge. ## 5. Side Panel: Yatchew-HR Null Modes @@ -417,9 +419,9 @@ Pre-test results travel awkwardly to non-technical audiences. The template below > - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test does not reject (joint p approximately 0.07 across the three pre-period horizons). The p-value is close to alpha = 0.05, so the non-rejection here is not by a wide margin - in a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending. > - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity does not reject (joint p approximately 0.76 across the four post-launch horizons). The diagnostic does not flag heterogeneity bias on the dose dimension under the test's specification. > -> **Non-testable from data (Step 4, paper Assumptions 5 / 6, boundary continuity):** local-linearity of the dose-response near `d_lower`. Argued from domain knowledge - is there reason to believe the marginal effect of an additional $1K of regional spend is roughly constant across the dose range? In our case yes, by DGP construction; in a real analysis we would justify this from prior knowledge of the channel's response shape. +> **Non-testable from data (Design 1' identification, paper Assumption 3 / boundary regularity at zero):** uniform continuity of the dose-response `d -> Y_2(d)` at zero. Argued from domain knowledge - is there reason to believe outcomes are continuous in spend at the lower-dose boundary, with no extensive-margin discontinuity at $0? In our case yes, by DGP construction. (Note: this is the Design 1' caveat. 
T20's panel was Design 1, where the corresponding non-testable caveats are Assumptions 5/6 - the library actually emits a UserWarning surfacing those on Design 1 fits but stays silent on Design 1' fits like ours.) > -> **Bottom line:** the workflow's three testable diagnostics do not flag a violation. Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and Step 4 (boundary continuity, non-testable from data). None of these are settled by non-rejection of the pre-tests. +> **Bottom line:** the workflow's three testable diagnostics do not flag a violation, so by paper Step 4 (decision rule) TWFE may be used. Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and the non-testable Design 1' caveat (Assumption 3 / boundary regularity at zero). None of these are settled by non-rejection of the pre-tests. ## 7. Extensions @@ -442,7 +444,7 @@ See the [`HeterogeneousAdoptionDiD` API reference](../api/had.html) and the [`HA - HAD's pre-test workflow `did_had_pretest_workflow` bundles paper Section 4.2 Steps 1 (QUG support infimum), 2 (joint Stute pre-trends - event-study path only), and 3 (Stute / Yatchew-HR linearity, joint variant on event-study path). - The two-period (`aggregate="overall"`) path runs Steps 1 + 3 only - it cannot run Step 2 because a single pre-period structurally has nothing to test against. The verdict says so verbatim: "Assumption 7 pre-trends test NOT run". - Upgrade to the multi-period (`aggregate="event_study"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads "TWFE admissible under Section 4 assumptions" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof. -- Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge. +- Paper Step 4 is the **decision rule** (if Steps 1-3 don't reject, use TWFE), not a non-testable assumption. The non-testable identification caveat is design-path-specific: **Assumption 3** (boundary regularity at zero) for `continuous_at_zero` (Design 1', T21), or **Assumptions 5/6** for the Design 1 paths (`continuous_near_d_lower` / `mass_point`, T20). - The Yatchew-HR test exposes two null modes: `null="linearity"` (paper Theorem 7, default; what the workflow calls under the hood) and `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data). - QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The QUG test and HAD's `design="auto"` selector are independent rules: QUG is a statistical test on `H0: d_lower = 0`; `design="auto"` calls `_detect_design()` which uses a min/median heuristic on the dose vector. Both pointed to `continuous_at_zero` on this panel; finite-sample uncertainty in either decision is a remaining caveat. - Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python). 
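Editor's note on the Yatchew-HR checklist bullet above: the two null modes differ only in which regression produces `sigma2_lin`; the differencing estimator `sigma2_diff` is shared between them. The sketch below re-derives just those two variance components from the formulas quoted in the notebook (reading the Yatchew note's `(1 / 2G)` factor as `1/(2G)`). It is illustrative only, not the library's `yatchew_hr_test`: the helper name is hypothetical, the `1/G` normalizations are assumptions (the implementation may apply degrees-of-freedom corrections), and the HR scale `sigma2_W` that completes `T_hr` is omitted because its construction is not spelled out in this extract.

```python
import numpy as np

def yatchew_variance_components(d, dy, null="linearity"):
    """Sketch of the two null modes' variance pieces (illustrative only)."""
    d = np.asarray(d, dtype=np.float64)
    dy = np.asarray(dy, dtype=np.float64)
    g = d.size
    if null == "linearity":
        # Paper-Theorem-7 null: residuals from OLS dy ~ 1 + d.
        X = np.column_stack([np.ones(g), d])
        beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
        resid = dy - X @ beta
    elif null == "mean_independence":
        # Stricter null: residuals from intercept-only OLS dy ~ 1.
        resid = dy - dy.mean()
    else:
        raise ValueError(f"unknown null: {null!r}")
    sigma2_lin = resid @ resid / g          # 1/G normalization is an assumption
    # Yatchew differencing estimator: first differences of dy sorted by dose.
    order = np.argsort(d)
    diffs = np.diff(dy[order])
    sigma2_diff = diffs @ diffs / (2.0 * g)
    return sigma2_lin, sigma2_diff

# Intercept-only residuals can never have smaller variance than linear-fit
# residuals, so the mean-independence gap (sigma2_lin - sigma2_diff) is
# weakly larger - the mechanism behind the stricter null's larger T_hr.
rng = np.random.default_rng(7)
d = rng.uniform(0.01, 50.0, size=60)
dy = 2.0 + rng.normal(scale=2.5, size=60)   # placebo-style dy, no dose response
for mode in ("linearity", "mean_independence"):
    s_lin, s_diff = yatchew_variance_components(d, dy, null=mode)
    print(f"{mode}: sigma2_lin = {s_lin:.3f}, sigma2_diff = {s_diff:.3f}")
```

On the notebook's actual side-panel data this ordering is exactly what the printed summaries show: `sigma2_lin` 7.0076 under `mean_independence` versus 6.5340 under `linearity`, against a shared `sigma2_diff` of 6.5170.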
diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb index 8e5f152a..0443e015 100644 --- a/docs/tutorials/21_had_pretest_workflow.ipynb +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "118e1f9c", + "id": "d4e3e374", "metadata": {}, "source": [ "# Tutorial 21: HAD Pre-test Workflow - Running the Pre-test Diagnostics on the Brand Campaign Panel\n", @@ -14,24 +14,26 @@ }, { "cell_type": "markdown", - "id": "e9c4c4f9", + "id": "c45952e0", "metadata": {}, "source": [ "## 1. The Pre-test Battery\n", "\n", - "de Chaisemartin et al. (2026) Section 4.2 lays out a four-step workflow for HAD identification:\n", + "de Chaisemartin et al. (2026) Section 4.2 lays out a four-step pre-test workflow for HAD identification:\n", "\n", "1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1', `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1, `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters.\n", "2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD.\n", "3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias?\n", - "4. **Step 4 - Boundary continuity (paper Assumptions 5, 6):** local-linearity of the dose-response near the boundary `d_lower`. **Non-testable**; argued from domain knowledge.\n", + "4. **Step 4 - Decision rule:** if Steps 1-3 all fail to reject, TWFE may be used to estimate the treatment effect (paper Section 4.3).\n", "\n", - "The library bundles the testable steps into one entry point: `did_had_pretest_workflow`. It dispatches to a two-period implementation (steps 1 + 3 only - step 2 needs at least two pre-periods) or a multi-period implementation (steps 1 + 2 + 3 jointly). The Yatchew-HR test from Step 3 is also exposed standalone with two null modes; we exercise both in the side panel.\n" + "The library bundles the testable steps into one entry point: `did_had_pretest_workflow`. It dispatches to a two-period implementation (steps 1 + 3 only - step 2 needs at least two pre-periods) or a multi-period implementation (steps 1 + 2 + 3 jointly). The Yatchew-HR test from Step 3 is also exposed standalone with two null modes; we exercise both in the side panel.\n", + "\n", + "**Non-testable identification caveat (separate from the four-step workflow).** Identification of the WAS estimand under Design 1' (`continuous_at_zero`, target = `WAS`) requires **Assumption 3** (uniform continuity of `d -> Y_2(d)` at zero, holds if the dose-response is Lipschitz; not testable). The Design 1 paths (`continuous_near_d_lower` / `mass_point`, target = `WAS_d_lower`) instead need **Assumption 5** (sign identification) or **Assumption 6** (`WAS_d_lower` point identification) - that is the caveat T20's tutorial flagged because T20's panel was Design 1. T21's panel resolves to Design 1' (see Section 2 + Section 3), so the relevant non-testable caveat here is Assumption 3, NOT Assumptions 5/6. 
The library reflects this: it emits a UserWarning about Assumption 5/6 on Design 1 fits and does not emit it on `continuous_at_zero` (Design 1') fits.\n" ] }, { "cell_type": "markdown", - "id": "9a746805", + "id": "e75c0ee7", "metadata": {}, "source": [ "## 2. The Panel\n", @@ -42,13 +44,13 @@ { "cell_type": "code", "execution_count": 1, - "id": "d1ae0139", + "id": "2b498126", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T14:44:56.348210Z", - "iopub.status.busy": "2026-05-10T14:44:56.348005Z", - "iopub.status.idle": "2026-05-10T14:44:58.011237Z", - "shell.execute_reply": "2026-05-10T14:44:58.010959Z" + "iopub.execute_input": "2026-05-10T14:52:27.740298Z", + "iopub.status.busy": "2026-05-10T14:52:27.740212Z", + "iopub.status.idle": "2026-05-10T14:52:28.604952Z", + "shell.execute_reply": "2026-05-10T14:52:28.604641Z" } }, "outputs": [ @@ -116,7 +118,7 @@ }, { "cell_type": "markdown", - "id": "4f2cf6ab", + "id": "f101faae", "metadata": {}, "source": [ "## 3. Step 1: The Overall Workflow (Two-Period Path)\n", @@ -129,13 +131,13 @@ { "cell_type": "code", "execution_count": 2, - "id": "e9dcb44f", + "id": "230859e5", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T14:44:58.012454Z", - "iopub.status.busy": "2026-05-10T14:44:58.012339Z", - "iopub.status.idle": "2026-05-10T14:44:58.052568Z", - "shell.execute_reply": "2026-05-10T14:44:58.052290Z" + "iopub.execute_input": "2026-05-10T14:52:28.606277Z", + "iopub.status.busy": "2026-05-10T14:52:28.606164Z", + "iopub.status.idle": "2026-05-10T14:52:28.645728Z", + "shell.execute_reply": "2026-05-10T14:52:28.645446Z" } }, "outputs": [ @@ -188,7 +190,7 @@ }, { "cell_type": "markdown", - "id": "67d0c597", + "id": "82fc1090", "metadata": {}, "source": [ "**Reading the overall verdict.** Three things to note.\n", @@ -205,13 +207,13 @@ { "cell_type": "code", "execution_count": 3, - "id": "3d7cbdce", + "id": "57c13c5d", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T14:44:58.053856Z", - "iopub.status.busy": "2026-05-10T14:44:58.053763Z", - "iopub.status.idle": "2026-05-10T14:44:58.055674Z", - "shell.execute_reply": "2026-05-10T14:44:58.055438Z" + "iopub.execute_input": "2026-05-10T14:52:28.646988Z", + "iopub.status.busy": "2026-05-10T14:52:28.646904Z", + "iopub.status.idle": "2026-05-10T14:52:28.648718Z", + "shell.execute_reply": "2026-05-10T14:52:28.648525Z" } }, "outputs": [ @@ -271,7 +273,7 @@ }, { "cell_type": "markdown", - "id": "0d023307", + "id": "f13eb155", "metadata": {}, "source": [ "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000), which looks alarming but is a scale artifact, not pathology. Under the Yatchew construction `sigma2_diff = (1 / 2G) * sum((dy_{(g)} - dy_{(g-1)})^2)` is computed on `dy` sorted by dose `D`. With doses spread over Uniform[\\$0.01K, \\$50K] and a true per-$1K slope of 100 (locked by the DGP), adjacent-by-dose units have `dy` values that differ by roughly `100 * (D_{(g)} - D_{(g-1)})` plus noise — those squared gaps add up to a large `sigma2_diff` (about 6,250 here) by virtue of the dose scale, while the OLS residual variance `sigma2_lin` (about 1.6) reflects only noise around the linear fit. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. 
The side panel later in the notebook constructs a different Yatchew input (within-pre-period first-differences, where the adjacent-by-dose `dy` gaps are not driven by the post-treatment slope) and produces a `T_hr` near zero — a useful sanity check that the test behaves the way it should when the dose dimension genuinely contributes nothing to the variance of `dy`.\n" @@ -279,7 +281,7 @@ }, { "cell_type": "markdown", - "id": "8c2207b9", + "id": "86c21280", "metadata": {}, "source": [ "## 4. Step 2: Upgrade to the Event-Study Workflow\n", @@ -298,13 +300,13 @@ { "cell_type": "code", "execution_count": 4, - "id": "b9fc9759", + "id": "9e08fdc9", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T14:44:58.056648Z", - "iopub.status.busy": "2026-05-10T14:44:58.056577Z", - "iopub.status.idle": "2026-05-10T14:44:58.183086Z", - "shell.execute_reply": "2026-05-10T14:44:58.182792Z" + "iopub.execute_input": "2026-05-10T14:52:28.649676Z", + "iopub.status.busy": "2026-05-10T14:52:28.649603Z", + "iopub.status.idle": "2026-05-10T14:52:28.775756Z", + "shell.execute_reply": "2026-05-10T14:52:28.775491Z" } }, "outputs": [ @@ -344,12 +346,12 @@ }, { "cell_type": "markdown", - "id": "4617bf96", + "id": "8b966112", "metadata": {}, "source": [ "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", "\n", - "A note on the verdict's \"TWFE admissible\" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05`. That is non-rejection evidence under the diagnostics' finite-sample power and specification, not a proof that the identifying assumptions hold. Step 4 (boundary continuity, paper Assumptions 5 / 6) remains non-testable from data and is not covered by any of the three diagnostics here.\n", + "A note on the verdict's \"TWFE admissible\" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05` (paper Step 4 decision rule). That is non-rejection evidence under the diagnostics' finite-sample power and specification, not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero, see Section 1) sits alongside this and is not covered by any of the three diagnostics.\n", "\n", "The joint pre-trends test runs over `n_horizons = 3` (pre-periods 1, 2, 3, with week 4 reserved as the base period). The joint homogeneity test runs over `n_horizons = 4` (post-periods 5, 6, 7, 8). 
Let's inspect the per-horizon detail.\n" ] @@ -357,13 +359,13 @@ { "cell_type": "code", "execution_count": 5, - "id": "6bf80443", + "id": "0cff9af4", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T14:44:58.184264Z", - "iopub.status.busy": "2026-05-10T14:44:58.184175Z", - "iopub.status.idle": "2026-05-10T14:44:58.186097Z", - "shell.execute_reply": "2026-05-10T14:44:58.185814Z" + "iopub.execute_input": "2026-05-10T14:52:28.776891Z", + "iopub.status.busy": "2026-05-10T14:52:28.776809Z", + "iopub.status.idle": "2026-05-10T14:52:28.778666Z", + "shell.execute_reply": "2026-05-10T14:52:28.778448Z" } }, "outputs": [ @@ -436,19 +438,19 @@ }, { "cell_type": "markdown", - "id": "76fe2a7d", + "id": "df5603e6", "metadata": {}, "source": [ "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold. The test does not reject at alpha = 0.05, but the near-threshold p-value warrants scrutiny - the diagnostic is not failing in a clearly-far-from-rejection regime. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n", "\n", "The joint homogeneity p-value (~0.76) is comfortably far from rejection. The diagnostic does not flag heterogeneity bias on the dose dimension across the four post-launch horizons.\n", "\n", - "Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. Step 4 (boundary continuity, Assumptions 5 / 6) remains non-testable from data and is argued from domain knowledge, as in T20.\n" + "Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. By paper Step 4 (the decision rule), TWFE may then be used. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero) remains and is argued from domain knowledge.\n" ] }, { "cell_type": "markdown", - "id": "543d4fb2", + "id": "63068545", "metadata": {}, "source": [ "## 5. Side Panel: Yatchew-HR Null Modes\n", @@ -464,13 +466,13 @@ { "cell_type": "code", "execution_count": 6, - "id": "606e1681", + "id": "7fe4f131", "metadata": { "execution": { - "iopub.execute_input": "2026-05-10T14:44:58.187360Z", - "iopub.status.busy": "2026-05-10T14:44:58.187263Z", - "iopub.status.idle": "2026-05-10T14:44:58.191876Z", - "shell.execute_reply": "2026-05-10T14:44:58.191499Z" + "iopub.execute_input": "2026-05-10T14:52:28.779708Z", + "iopub.status.busy": "2026-05-10T14:52:28.779631Z", + "iopub.status.idle": "2026-05-10T14:52:28.784665Z", + "shell.execute_reply": "2026-05-10T14:52:28.784449Z" } }, "outputs": [ @@ -532,7 +534,7 @@ }, { "cell_type": "markdown", - "id": "c1d709fc", + "id": "5b38c476", "metadata": {}, "source": [ "**Reading the side-panel comparison.**\n", @@ -547,7 +549,7 @@ }, { "cell_type": "markdown", - "id": "05239106", + "id": "b16b856f", "metadata": {}, "source": [ "## 6. 
Communicating the Diagnostics to Leadership\n", @@ -560,14 +562,14 @@ "> - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test does not reject (joint p approximately 0.07 across the three pre-period horizons). The p-value is close to alpha = 0.05, so the non-rejection here is not by a wide margin - in a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending.\n", "> - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity does not reject (joint p approximately 0.76 across the four post-launch horizons). The diagnostic does not flag heterogeneity bias on the dose dimension under the test's specification.\n", ">\n", - "> **Non-testable from data (Step 4, paper Assumptions 5 / 6, boundary continuity):** local-linearity of the dose-response near `d_lower`. Argued from domain knowledge - is there reason to believe the marginal effect of an additional $1K of regional spend is roughly constant across the dose range? In our case yes, by DGP construction; in a real analysis we would justify this from prior knowledge of the channel's response shape.\n", + "> **Non-testable from data (Design 1' identification, paper Assumption 3 / boundary regularity at zero):** uniform continuity of the dose-response `d -> Y_2(d)` at zero. Argued from domain knowledge - is there reason to believe outcomes are continuous in spend at the lower-dose boundary, with no extensive-margin discontinuity at \$0? In our case yes, by DGP construction. (Note: this is the Design 1' caveat. T20's panel was Design 1, where the corresponding non-testable caveats are Assumptions 5/6 - the library actually emits a UserWarning surfacing those on Design 1 fits but stays silent on Design 1' fits like ours.)\n", ">\n", - "> **Bottom line:** the workflow's three testable diagnostics do not flag a violation. Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and Step 4 (boundary continuity, non-testable from data). None of these are settled by non-rejection of the pre-tests.\n" + "> **Bottom line:** the workflow's three testable diagnostics do not flag a violation, so by paper Step 4 (decision rule) TWFE may be used. Carrying the headline per-\$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and the non-testable Design 1' caveat (Assumption 3 / boundary regularity at zero). None of these are settled by non-rejection of the pre-tests.\n" ] }, { "cell_type": "markdown", - "id": "67846f9c", + "id": "eb1d3712", "metadata": {}, "source": [ "## 7. Extensions\n", @@ -589,7 +591,7 @@ }, { "cell_type": "markdown", - "id": "f44341b9", + "id": "dd1dce6f", "metadata": {}, "source": [ "## 8. Summary Checklist\n", "\n", "- HAD's pre-test workflow `did_had_pretest_workflow` bundles paper Section 4.2 Steps 1 (QUG support infimum), 2 (joint Stute pre-trends - event-study path only), and 3 (Stute / Yatchew-HR linearity, joint variant on event-study path).\n", "- The two-period (`aggregate=\"overall\"`) path runs Steps 1 + 3 only - it cannot run Step 2 because a single pre-period structurally has nothing to test against. 
The verdict says so verbatim: \"Assumption 7 pre-trends test NOT run\".\n", "- Upgrade to the multi-period (`aggregate=\"event_study\"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads \"TWFE admissible under Section 4 assumptions\" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof.\n", - "- Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge.\n", + "- Paper Step 4 is the **decision rule** (if Steps 1-3 don't reject, use TWFE), not a non-testable assumption. The non-testable identification caveat is design-path-specific: **Assumption 3** (boundary regularity at zero) for `continuous_at_zero` (Design 1', T21), or **Assumptions 5/6** for the Design 1 paths (`continuous_near_d_lower` / `mass_point`, T20).\n", "- The Yatchew-HR test exposes two null modes: `null=\"linearity\"` (paper Theorem 7, default; what the workflow calls under the hood) and `null=\"mean_independence\"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data).\n", "- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The QUG test and HAD's `design=\"auto\"` selector are independent rules: QUG is a statistical test on `H0: d_lower = 0`; `design=\"auto\"` calls `_detect_design()` which uses a min/median heuristic on the dose vector. Both pointed to `continuous_at_zero` on this panel; finite-sample uncertainty in either decision is a remaining caveat.\n", "- Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python).\n" From f9f951fc2bd653d6475c32fc1d97c09a7bc88a7e Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 10 May 2026 10:59:32 -0400 Subject: [PATCH 11/12] Fix `did_had_pretest_workflow()` docstring: paper Step 4 is the decision rule MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per REGISTRY's pretest workflow (lines 2482-2487 surrounding the four-step enumeration) and the same module's two correctly-framed docstrings (module-level at line 54, `_compose_verdict_event_study` at line 2735), paper Step 4 is the decision rule "use TWFE if none of the tests rejects" — NOT a test or assumption. The Yatchew-HR test is a Step 3 alternative (a single-horizon swap-in for Stute), and there is no joint Yatchew variant in the paper. The `did_had_pretest_workflow()` docstring at L4445-4446 was the only place in the file that mislabeled the Yatchew alternative as "Step 4". Reword to: - correctly tag Yatchew as a step-3 alternative - explicitly state Step 4 is the decision rule with no code path - cross-reference the two existing correctly-framed docstrings in the same file so future readers can confirm the convention is consistent Surfaced by PR #409 R6 review (T21 tutorial pre-test taxonomy work made the inconsistency visible). All 42 workflow-specific tests in `tests/test_had_pretests.py` still pass; T21 drift test (16 tests) still passes. 
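For multi-period users who need the Yatchew alternative anyway, the manual fallback the reworded docstring points to looks roughly like this (sketch only: `yatchew_hr_test` and its `null="linearity"` keyword are real, but the import path, argument order, and result fields are assumed for illustration):

```python
import numpy as np
from diff_diff.had_pretests import yatchew_hr_test  # import path assumed

rng = np.random.default_rng(0)
dose = rng.uniform(0.01, 50.0, size=60)             # one dose per DMA, in $K
y = {w: 100.0 * dose * (w >= 5) + rng.normal(0.0, 1.0, 60) for w in range(1, 9)}

BASE = 4                                            # last pre-period (base week)
for post in (5, 6, 7, 8):                           # the four post-launch horizons
    res = yatchew_hr_test(y[post] - y[BASE], dose, null="linearity")  # arg order assumed
    print(post, res.p_value, res.reject)            # result fields assumed
```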
Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/had_pretests.py | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index d3d887bc..95853405 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -4442,11 +4442,15 @@ def did_had_pretest_workflow( ``aggregate="event_study"`` (multi-period panel, >= 3 periods): runs QUG + joint pre-trends Stute + joint homogeneity-linearity Stute, - covering paper Section 4 steps 1-3 together. Step 4 (Yatchew-style - linearity as an alternative to Stute) is subsumed by the joint Stute - in this path - the paper does not derive a joint Yatchew variant, so - users who need Yatchew robustness under multi-period data should - call :func:`yatchew_hr_test` on each (base, post) pair manually. + covering paper Section 4 steps 1-3 together. The step-3 Yatchew-HR + alternative (a single-horizon swap-in for Stute) is subsumed by joint + Stute on this path - the paper does not derive a joint Yatchew + variant, so users who need Yatchew robustness under multi-period + data should call :func:`yatchew_hr_test` on each ``(base, post)`` + pair manually. (Paper step 4 is the decision itself - "use TWFE if + none of the tests rejects" - not a separate test, so it has no code + path here. Mirrors the framing in the module-level docstring at + line 54 and ``_compose_verdict_event_study`` at line 2735.) Eq 17 / Eq 18 linear-trend detrending (paper Section 5.2 Pierce- Schott application) is now SHIPPED on the event-study path via From 3ab7a8677a69e34577e5a09fbdb91205dbe9f8d7 Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 10 May 2026 11:10:47 -0400 Subject: [PATCH 12/12] =?UTF-8?q?Address=20PR=20#409=20R7=20review=20(P2?= =?UTF-8?q?=20D1)=20=E2=80=94=20bounded=20p-value=20drift=20bands?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two bootstrap p-value drift tests had lower-bound-only assertions: - `test_overall_stute_fails_to_reject`: was `p > 0.50`, tutorial quotes ~0.686 → would silently pass if p drifted to 0.99 - `test_event_study_homogeneity_fails_to_reject`: was `p > 0.50`, tutorial quotes ~0.763 → same silent-stale risk The third bootstrap test (`test_event_study_pretrends_fails_to_reject`) already used a bounded band `0.0 <= p <= 0.25`. Mirror that pattern on the other two with bounded bands per `feedback_bootstrap_drift_tests_need_backend_tolerance` (>= 0.15 width): - Stute: 0.53 <= p <= 0.84 (band ~0.31 around 0.686) - Homogeneity: 0.61 <= p <= 0.92 (band ~0.31 around 0.763) Both bands wide enough for Rust ↔ pure-Python RNG path differences; both narrow enough that drift in either direction (toward rejection or toward an even cleaner pass) flags the prose as stale. All 16 drift tests pass on both backends within the new bands. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/test_t21_had_pretest_workflow_drift.py | 23 ++++++++++++-------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py index 318c05c0..b4142f07 100644 --- a/tests/test_t21_had_pretest_workflow_drift.py +++ b/tests/test_t21_had_pretest_workflow_drift.py @@ -204,14 +204,14 @@ def test_overall_qug_fails_to_reject(overall_report): def test_overall_stute_fails_to_reject(overall_report): - """Section 3 narrative claims Stute fails-to-reject linearity. 
- Stute uses Mammen wild bootstrap so the p-value is RNG-dependent; - use binary fail-to-reject + abs tolerance band per - `feedback_bootstrap_drift_tests_need_backend_tolerance`.""" + """Section 3 narrative quotes Stute p_value ~0.686. Stute uses + Mammen wild bootstrap so the p-value is RNG-dependent; use a + bounded abs tolerance band per + `feedback_bootstrap_drift_tests_need_backend_tolerance` (>= 0.15 + width). Both bounds tight enough to catch methodology drift in + either direction, loose enough for backend RNG path differences.""" assert overall_report.stute.reject is False - # Tight enough to catch methodology drift, loose enough for backend - # RNG path differences. - assert overall_report.stute.p_value > 0.50, overall_report.stute.p_value + assert 0.53 <= overall_report.stute.p_value <= 0.84, overall_report.stute.p_value def test_overall_yatchew_fails_to_reject(overall_report): @@ -292,11 +292,16 @@ def test_event_study_pretrends_fails_to_reject(event_study_report): def test_event_study_homogeneity_fails_to_reject(event_study_report): """Section 4 narrative claims joint homogeneity strongly fails to - reject (~0.76 from numbers.json).""" + reject and quotes p ~0.763 from numbers.json. Use a bounded abs + tolerance band per + `feedback_bootstrap_drift_tests_need_backend_tolerance` so that + drift in either direction (toward rejection or toward an even + cleaner pass) flags the prose as stale rather than silently + passing.""" hj = event_study_report.homogeneity_joint assert hj is not None assert hj.reject is False - assert hj.p_value > 0.50, hj.p_value + assert 0.61 <= hj.p_value <= 0.92, hj.p_value def test_had_design_auto_lands_on_continuous_at_zero(two_period):