diff --git a/CHANGELOG.md b/CHANGELOG.md index 25f967b5..2c413ba8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,11 +8,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added +- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. Uses a 60-DMA × 8-week panel close in shape to T20's but with the dose distribution drawn from `Uniform[$0.01K, $50K]` (vs T20's `[$5K, $50K]`); the true support is strictly positive but very near zero, chosen so the QUG step in `did_had_pretest_workflow` fails-to-reject `H0: d_lower = 0` in this finite sample and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. (HAD's `design="auto"` selector — a separate min/median heuristic at `had.py::_detect_design`, NOT the QUG p-value — independently lands on the `continuous_at_zero` identification path with target `WAS` on this panel because `d.min() < 0.01 * median(|d|)`. The QUG test and the design selector are independent rules that point to the same identification path here.) Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as not run because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because all three testable diagnostics (QUG + joint pre-trends Stute over 3 horizons + joint homogeneity Stute over 4 horizons) fail-to-reject — non-rejection evidence under finite-sample power and test specification, not proof that the identifying assumptions hold; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR. - **`ChaisemartinDHaultfoeuille.by_path` and `paths_of_interest` now compose with `survey_design`** for analytical Binder TSL SE and replicate-weight bootstrap variance. The `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1233-1239` is replaced by a per-path multiplier-bootstrap-only gate (`survey_design + n_bootstrap > 0` under by_path / paths_of_interest still raises, since the survey-aware perturbation pivot for path-restricted IFs is methodologically underived). 
Per-path SE routes through the existing `_survey_se_from_group_if` cell-period allocator: the per-period IF (`U_pp_l_path`) is built with non-path switcher-side contributions skipped (control contributions are unchanged, matching the joiners/leavers IF convention; preserves the row-sum identity `U_pp.sum(axis=1) == U`), cohort-recentered via `_cohort_recenter_per_period`, then expanded to observations as `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})`. Replicate-weight designs unconditionally use the cell allocator (Class A contract from PR #323). New `_refresh_path_inference` helper post-call refreshes `safe_inference` on every populated entry across `multi_horizon_inference`, `placebo_horizon_inference`, `path_effects`, and `path_placebos` so all four surfaces use the same final `df_survey` after per-path replicate fits append `n_valid` to the shared accumulator. Path-enumeration ranking under `survey_design` remains unweighted (group-cardinality, not population-weight mass). Lonely-PSU policy stays sample-wide, not per-path. Telescope invariant: on a single-path panel, per-path SE matches the global non-by_path survey SE bit-exactly. **No R parity** — R `did_multiplegt_dyn` does not support survey weighting; this is a Python-only methodology extension. The global non-by_path TSL multiplier-bootstrap path is unaffected (anti-regression test `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical::test_global_survey_plus_n_bootstrap_still_works` locks the per-path-only scope of the new gate). Cross-surface invariants regression-tested at `TestByPathSurveyDesignAnalytical` (~17 tests across gate / dispatch / analytical SE / replicate-weight SE / per-path placebos / `trends_linear` composition / unobserved-path warnings / final-df refresh regressions) and `TestByPathSurveyDesignTelescope`. See `docs/methodology/REGISTRY.md` §`ChaisemartinDHaultfoeuille` `Note (Phase 3 by_path ...)` → "Per-path survey-design SE" for the full contract. - **Inference-field aliases on staggered result classes** for adapter / external-consumer compatibility. Read-only `@property` aliases expose the flat `att` / `se` / `conf_int` / `p_value` / `t_stat` names (matching `DiDResults` / `TROPResults` / `SyntheticDiDResults` / `HeterogeneousAdoptionDiDResults`) on every result class that previously only carried prefixed canonical fields: `CallawaySantAnnaResults`, `StackedDiDResults`, `EfficientDiDResults`, `ChaisemartinDHaultfoeuilleResults`, `StaggeredTripleDiffResults`, `WooldridgeDiDResults`, `SunAbrahamResults`, `ImputationDiDResults`, `TwoStageDiDResults` (mapping to `overall_*`); `ContinuousDiDResults` (mapping to `overall_att_*`, ATT-side as the headline, ACRT-side accessible unchanged via `overall_acrt_*`); `MultiPeriodDiDResults` (mapping to `avg_*`). `ContinuousDiDResults` additionally exposes `overall_se` / `overall_conf_int` / `overall_p_value` / `overall_t_stat` aliases for naming consistency with the rest of the staggered family. Aliases are pure read-throughs over the canonical fields — no recomputation, no behavior change — so the `safe_inference()` joint-NaN contract (per CLAUDE.md "Inference computation") is inherited automatically (NaN canonical → NaN alias, locked at `tests/test_result_aliases.py::test_pattern_b_aliases_propagate_nan`). The native `overall_*` / `overall_att_*` / `avg_*` fields remain canonical for documentation and computation. 
Motivated by the `balance.interop.diff_diff.as_balance_diagnostic()` adapter (`facebookresearch/balance` PR #465) which calls `getattr(res, "se", None)` / `getattr(res, "conf_int", None)` without a fallback chain — pre-alias, every staggered result class returned `None` on those keys, silently dropping `se` and `conf_int` from the adapter's diagnostic dict. 23 alias-mechanic + balance-adapter regression tests at `tests/test_result_aliases.py`. Patch-level (additive on stable surfaces). - **`ChaisemartinDHaultfoeuille.by_path` + non-binary integer treatment** — `by_path=k` now accepts integer-coded discrete treatment (D in Z, e.g. ordinal `{0, 1, 2}`); path tuples become integer-state tuples like `(0, 2, 2, 2)`. The previous `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1870` is replaced by a `ValueError` for continuous D (e.g. `D=1.5`) at fit-time per the no-silent-failures contract — the existing `int(round(float(v)))` cast in `_enumerate_treatment_paths` is now defensive (no-op for integer-coded D). Validated against R `did_multiplegt_dyn(..., by_path)` for D in `{0, 1, 2}` via the new `multi_path_reversible_by_path_non_binary` golden-value scenario (78 switchers, 3 paths, single-baseline custom DGP, F_g >= 4): per-path point estimates match R bit-exactly (rtol ~1e-9 on event horizons; rtol+atol envelope for placebo near-zero values), per-path SE inherits the documented cross-path cohort-sharing deviation (~5% rtol observed; SE_RTOL=0.15 envelope). **Deviation from R for D >= 10:** R's `did_multiplegt_by_path` derives the per-path baseline via `path_index$baseline_XX <- substr(path_index$path, 1, 1)`, which captures only the first character of the comma-separated path string (e.g. for `path = "12,12,..."` it captures `"1"` instead of `"12"`); this mis-allocates R's per-path control-pool subset for D >= 10. Python's tuple-key matching is correct in this regime — the per-path point estimates we compute are correct; R's per-path subset for the same path is buggy. The shipped parity scenario stays in `D in {0, 1, 2}` to avoid the R bug. R-parity test at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathNonBinary`; cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary`. - **New `paths_of_interest` kwarg on `ChaisemartinDHaultfoeuille`** for user-specified treatment-path subsets, alternative to `by_path=k`'s top-k automatic ranking. Mutually exclusive with `by_path`; setting both raises `ValueError` at `__init__` and `set_params` time. Each path tuple must be a list/tuple of `int` of length `L_max + 1` (uniformity validated at `__init__`; length match against `L_max + 1` validated at fit-time); `bool` and `np.bool_` are explicitly rejected, `np.integer` accepted and canonicalized to Python `int` for tuple-key consistency. Duplicates emit a `UserWarning` and are deduplicated; paths not observed in the panel emit a `UserWarning` and are omitted from `path_effects`. Paths appear in `results.path_effects` in the user-specified order, modulo deduplication and unobserved-path filtering. Composes with non-binary D and all downstream `by_path` surfaces (bootstrap, per-path placebos, per-path joint sup-t bands, `controls`, `trends_linear`, `trends_nonparam`) — mechanical filter on observed paths via the same `_enumerate_treatment_paths` call site, no methodology change. 
**Python-only API extension; no R equivalent** — R's `did_multiplegt_dyn(..., by_path=k)` only accepts a positive int (top-k) or `-1` (all paths). The `by_path` precondition gate at `chaisemartin_dhaultfoeuille.py:1118` (drop_larger_lower / L_max / `heterogeneity` / `design2` / `honest_did` / `survey_design` mutex) and the 11 `self.by_path is not None` activation branches in `fit()` were rerouted to fire under either selector. Validation + behavior + cross-feature regressions at `tests/test_chaisemartin_dhaultfoeuille.py::TestPathsOfInterest`. -- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. (2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` signatures match the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API exactly (verified by `inspect.signature`-based regression tests). Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs. 
+- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. (2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` signatures match the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API exactly (verified by `inspect.signature`-based regression tests). Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap. T21 pretest tutorial subsequently landed in PR #409; T22 weighted/survey tutorial remains queued as a separate notebook PR. ## [3.3.2] - 2026-04-26 diff --git a/TODO.md b/TODO.md index 17d2659e..7b92f14c 100644 --- a/TODO.md +++ b/TODO.md @@ -109,7 +109,7 @@ Deferred items from PR reviews that were not addressed before merge. | `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. 
| `tests/test_had_pretests.py` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low | -| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorials (T21 HAD pretest workflow notebook + T22 weighted/survey HAD tutorial). `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1. | `tutorials/`, `tests/test_t21_*_drift.py`, `tests/test_t22_*_drift.py` | Phase 2a | Low | +| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed (PR-pending); `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low | | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium | | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. 
| `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low | diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index d3d887bc..95853405 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -4442,11 +4442,15 @@ def did_had_pretest_workflow( ``aggregate="event_study"`` (multi-period panel, >= 3 periods): runs QUG + joint pre-trends Stute + joint homogeneity-linearity Stute, - covering paper Section 4 steps 1-3 together. Step 4 (Yatchew-style - linearity as an alternative to Stute) is subsumed by the joint Stute - in this path - the paper does not derive a joint Yatchew variant, so - users who need Yatchew robustness under multi-period data should - call :func:`yatchew_hr_test` on each (base, post) pair manually. + covering paper Section 4 steps 1-3 together. The step-3 Yatchew-HR + alternative (a single-horizon swap-in for Stute) is subsumed by joint + Stute on this path - the paper does not derive a joint Yatchew + variant, so users who need Yatchew robustness under multi-period + data should call :func:`yatchew_hr_test` on each ``(base, post)`` + pair manually. (Paper step 4 is the decision itself - "use TWFE if + none of the tests rejects" - not a separate test, so it has no code + path here. Mirrors the framing in the module-level docstring at + line 54 and ``_compose_verdict_event_study`` at line 2735.) Eq 17 / Eq 18 linear-trend detrending (paper Section 5.2 Pierce- Schott application) is now SHIPPED on the event-study path via diff --git a/docs/_review/t21_notebook_extract.md b/docs/_review/t21_notebook_extract.md new file mode 100644 index 00000000..8f4143ec --- /dev/null +++ b/docs/_review/t21_notebook_extract.md @@ -0,0 +1,450 @@ +# T21 Notebook Extract for AI Review (TEMPORARY) + +> **This file is a temporary review aid.** The CI AI reviewer's diff-build +> step excludes `docs/tutorials/*.ipynb` (see +> `.github/workflows/ai_pr_review.yml:151-156` and +> `.github/codex/prompts/pr_review.md:87-91`), so the actual tutorial +> notebook prose is invisible to the CI reviewer. This file mirrors the +> notebook's narrative (markdown + code + executed outputs) so the +> reviewer can audit the tutorial content in PR #409. +> +> A follow-on PR will (a) remove this file and (b) replace the blanket +> `.ipynb` exclusion with a markdown-only extraction wired into the +> workflow itself. Do not edit this file directly — regenerate via +> `python _scratch/t21_pretests/70_extract_for_review.py` from the +> notebook source-of-truth at +> `_scratch/t21_pretests/60_build_notebook.py`. + +--- + + +# Tutorial 21: HAD Pre-test Workflow - Running the Pre-test Diagnostics on the Brand Campaign Panel + +[Tutorial 20](20_had_brand_campaign.ipynb) fit `HeterogeneousAdoptionDiD` (HAD) on a regional brand-campaign panel and reported a per-dollar lift, with a brief visual placebo check at the end. We deliberately deferred the **formal pre-test workflow** to this tutorial, with a forward pointer in T20's "Extensions" section. + +This tutorial picks up where T20 left off. We re-run the brand campaign on a panel close in shape to T20's, then walk through HAD's composite pre-test workflow `did_had_pretest_workflow` and read the diagnostics for paper Section 4.2 of de Chaisemartin, Ciccia, D'Haultfoeuille, & Knau (2026). 
We start with the two-period (`aggregate="overall"`) workflow, observe that it does not run the parallel pre-trends step, and then **upgrade** to the multi-period (`aggregate="event_study"`) workflow that adds the joint Stute pre-trends and joint homogeneity diagnostics. None of the diagnostics in this tutorial reject; we walk through what that does and does not let us conclude. A side panel compares the two `null=` modes of the Yatchew-HR test, including the recently-shipped `null="mean_independence"` mode (R-parity with `YatchewTest::yatchew_test(order=0)`). + +## 1. The Pre-test Battery + +de Chaisemartin et al. (2026) Section 4.2 lays out a four-step pre-test workflow for HAD identification: + +1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1', `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1, `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters. +2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD. +3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias? +4. **Step 4 - Decision rule:** if Steps 1-3 all fail to reject, TWFE may be used to estimate the treatment effect (paper Section 4.3). + +The library bundles the testable steps into one entry point: `did_had_pretest_workflow`. It dispatches to a two-period implementation (steps 1 + 3 only - step 2 needs at least two pre-periods) or a multi-period implementation (steps 1 + 2 + 3 jointly). The Yatchew-HR test from Step 3 is also exposed standalone with two null modes; we exercise both in the side panel. + +**Non-testable identification caveat (separate from the four-step workflow).** Identification of the WAS estimand under Design 1' (`continuous_at_zero`, target = `WAS`) requires **Assumption 3** (uniform continuity of `d -> Y_2(d)` at zero, holds if the dose-response is Lipschitz; not testable). The Design 1 paths (`continuous_near_d_lower` / `mass_point`, target = `WAS_d_lower`) instead need **Assumption 5** (sign identification) or **Assumption 6** (`WAS_d_lower` point identification) - that is the caveat T20's tutorial flagged because T20's panel was Design 1. T21's panel resolves to Design 1' (see Section 2 + Section 3), so the relevant non-testable caveat here is Assumption 3, NOT Assumptions 5/6. The library reflects this: it emits a UserWarning about Assumption 5/6 on Design 1 fits and does not emit it on `continuous_at_zero` (Design 1') fits. + +## 2. The Panel + +We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. 
(a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1') identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design="auto"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1' from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open. + +```python +import numpy as np +import pandas as pd + +from diff_diff import generate_continuous_did_data + +MAIN_SEED = 87 +N_UNITS = 60 +N_PERIODS = 8 +COHORT_PERIOD = 5 +TRUE_SLOPE = 100.0 +BASELINE_VISITS = 5000.0 +DOSE_LOW = 0.01 +DOSE_HIGH = 50.0 + +raw = generate_continuous_did_data( + n_units=N_UNITS, + n_periods=N_PERIODS, + cohort_periods=[COHORT_PERIOD], + never_treated_frac=0.0, + dose_distribution="uniform", + dose_params={"low": DOSE_LOW, "high": DOSE_HIGH}, + att_function="linear", + att_intercept=0.0, + att_slope=TRUE_SLOPE, + unit_fe_sd=8.0, + time_trend=0.5, + noise_sd=2.0, + seed=MAIN_SEED, +) +panel = raw.copy() +panel.loc[panel["period"] < panel["first_treat"], "dose"] = 0.0 +panel = panel.rename( + columns={ + "unit": "dma_id", + "period": "week", + "outcome": "weekly_visits", + "dose": "regional_spend_k", + } +) +panel["weekly_visits"] = panel["weekly_visits"] + BASELINE_VISITS + +post = panel[panel["week"] >= COHORT_PERIOD] +print(f"Panel: {panel['dma_id'].nunique()} DMAs x {panel['week'].nunique()} weeks") +print( + f"Regional spend (post-launch): " + f"${post['regional_spend_k'].min():.2f}K - " + f"${post['regional_spend_k'].max():.2f}K" +) +print(f"True per-$1K lift (locked at seed): {TRUE_SLOPE} weekly visits") +``` + +**Output:** + +``` +Panel: 60 DMAs x 8 weeks +Regional spend (post-launch): $0.18K - $49.00K +True per-$1K lift (locked at seed): 100.0 weekly visits +``` + +## 3. Step 1: The Overall Workflow (Two-Period Path) + +T20's headline used a two-period collapse of the panel - average pre-launch outcome per DMA against average post-launch outcome per DMA. That's also the natural input shape for HAD's two-period (`aggregate="overall"`) pre-test workflow, which runs **paper Step 1 (QUG) + paper Step 3 (linearity, via Stute and Yatchew-HR)**. Step 2 (parallel pre-trends) is not implemented on this path - a single pre-period structurally can't support a pre-trends test - and the workflow's verdict says so explicitly. + +We collapse to two periods (pre = avg over weeks 1-4, post = avg over weeks 5-8), then call the workflow. + +```python +from diff_diff import did_had_pretest_workflow + +p = panel.copy() +p["period"] = (p["week"] >= COHORT_PERIOD).astype(int) + 1 # 1=pre, 2=post +two_period = p.groupby(["dma_id", "period"], as_index=False).agg( + weekly_visits=("weekly_visits", "mean"), + regional_spend_k=("regional_spend_k", "mean"), +) +# Workflow invariant: pre-period dose = 0 for every unit. +two_period.loc[two_period["period"] == 1, "regional_spend_k"] = 0.0 +# first_treat in the collapsed coordinates: 2 (the post-period) for every DMA. 
+two_period["first_treat"] = 2 + +overall_report = did_had_pretest_workflow( + data=two_period, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="period", + unit_col="dma_id", + first_treat_col="first_treat", + alpha=0.05, + n_bootstrap=999, + seed=21, + aggregate="overall", +) + +print(overall_report.verdict) +print(f"\nall_pass = {overall_report.all_pass}") +print(f"aggregate = {overall_report.aggregate!r}") +print(f"pretrends_joint populated? {overall_report.pretrends_joint is not None}") +print(f"homogeneity_joint populated? {overall_report.homogeneity_joint is not None}") +``` + +**Output:** + +``` +QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up) + +all_pass = True +aggregate = 'overall' +pretrends_joint populated? False +homogeneity_joint populated? False +``` + +**Reading the overall verdict.** Three things to note. + +- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1', but the QUG test is independent of HAD's `design="auto"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.) +- **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The diagnostics do not flag heterogeneity bias on the dose dimension, so reading the WAS as an average per-dose marginal effect is supported by these tests (subject to finite-sample power). +- **Step 2 (Assumption 7 pre-trends) is not run on this path.** The verdict says so verbatim: `"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path). + +A note on `all_pass = True` here: the workflow's `all_pass` flag aggregates only the steps that actually ran on this dispatch path. On the overall path that is QUG + linearity (Stute / Yatchew); Step 2's deferral is *not* folded into `all_pass`. So `all_pass = True` on the overall path means "of the two steps that ran, neither rejected" - it does not mean Assumption 7 has been cleared. The upgrade to event-study below makes this concrete by actually running Step 2. + +Let's look at each individual test result. 
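Before printing the library's own summaries, here is a back-of-the-envelope check of the two rules just discussed — the QUG decision rule `T = D_(1) / (D_(2) - D_(1))` against `1/alpha - 1`, and the `design="auto"` min/median heuristic. This is an illustrative hand computation, not the library's code path: `qug_test` additionally handles `d == 0` exclusions and reports a p-value, and the real selector lives in `had.py::_detect_design`.

```python
# Hand computation of the QUG decision rule and the design="auto" heuristic
# described above (illustrative only; not the library's implementation).
post_dose = np.sort(
    panel.loc[panel["week"] == COHORT_PERIOD, "regional_spend_k"].to_numpy(dtype=np.float64)
)
d1, d2 = post_dose[0], post_dose[1]  # two smallest doses at the launch week
alpha = 0.05
T = d1 / (d2 - d1)  # QUG statistic (paper Theorem 4)
print(f"QUG: T = {T:.4f} vs critical value 1/alpha - 1 = {1 / alpha - 1:.0f}; reject = {T > 1 / alpha - 1}")

ratio = post_dose.min() / np.median(np.abs(post_dose))
print(f"design='auto' heuristic: d_min / median(|d|) = {ratio:.4f}; continuous_at_zero = {ratio < 0.01}")
```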
+ +```python +overall_report.qug.print_summary() +print() +overall_report.stute.print_summary() +print() +overall_report.yatchew.print_summary() +``` + +**Output:** + +``` +================================================================ + QUG null test (H_0: d_lower = 0) +================================================================ +Statistic T: 3.8562 +p-value: 0.2059 +Critical value (1/alpha-1): 19.0000 +Reject H_0: False +alpha: 0.0500 +Observations: 60 +Excluded (d == 0): 0 +D_(1): 0.1806 +D_(2): 0.2274 +================================================================ + +================================================================ + Stute CvM linearity test (H_0: linear E[dY|D]) +================================================================ +CvM statistic: 0.0735 +Bootstrap p-value: 0.6860 +Reject H_0: False +alpha: 0.0500 +Bootstrap replications: 999 +Observations: 60 +Seed: 21 +================================================================ + +================================================================ + Yatchew-HR linearity test (H_0: linear E[dY|D]) +================================================================ +T_hr statistic: -34759.3017 +p-value: 1.0000 +Critical value (1-sided z): 1.6449 +Reject H_0: False +alpha: 0.0500 +sigma^2_lin (OLS): 1.6177 +sigma^2_diff (Yatchew): 6250.2569 +sigma^2_W (HR scale): 1.3925 +Observations: 60 +================================================================ +``` + +A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000), which looks alarming but is a scale artifact, not pathology. Under the Yatchew construction `sigma2_diff = (1 / 2G) * sum((dy_{(g)} - dy_{(g-1)})^2)` is computed on `dy` sorted by dose `D`. With doses spread over Uniform[\$0.01K, \$50K] and a true per-$1K slope of 100 (locked by the DGP), adjacent-by-dose units have `dy` values that differ by roughly `100 * (D_{(g)} - D_{(g-1)})` plus noise — those squared gaps add up to a large `sigma2_diff` (about 6,250 here) by virtue of the dose scale, while the OLS residual variance `sigma2_lin` (about 1.6) reflects only noise around the linear fit. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. The side panel later in the notebook constructs a different Yatchew input (within-pre-period first-differences, where the adjacent-by-dose `dy` gaps are not driven by the post-treatment slope) and produces a `T_hr` near zero — a useful sanity check that the test behaves the way it should when the dose dimension genuinely contributes nothing to the variance of `dy`. + +## 4. Step 2: Upgrade to the Event-Study Workflow + +The two-period workflow ran Steps 1 and 3 but did not run Step 2 (parallel pre-trends). Our panel actually has 8 weeks - that is enough pre-periods to add the joint Stute pre-trends diagnostic (paper Section 4.2 step 2 + Hlavka-Huskova 2020 / Delgado-Manteiga 2001 dependence-preserving Mammen multiplier bootstrap). + +We pass the full multi-period panel to `did_had_pretest_workflow(aggregate="event_study", ...)`. The dispatch runs all three testable steps in one call: + +- **Step 1**: QUG re-runs on the dose distribution at the treatment period `F` (deterministic; same numbers as the overall path). +- **Step 2**: `joint_pretrends_test` - mean-independence joint Stute over the pre-period horizons (`E[Y_t - Y_base | D] = mu_t` for each t < F). 
+- **Step 3**: `joint_homogeneity_test` - linearity joint Stute over the post-period horizons (`E[Y_t - Y_base | D_t] = beta_{0,t} + beta_{fe,t} * D` for each t >= F). + +Step 3's "Yatchew-HR" arm has no joint variant in the paper (the differencing-based variance estimator doesn't have a derived multi-horizon extension), so the event-study path runs only joint Stute for linearity. Practitioners who want Yatchew-HR robustness on multi-period data can call the standalone `yatchew_hr_test` on each (base, post) pair manually. + +```python +es_report = did_had_pretest_workflow( + data=panel, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="week", + unit_col="dma_id", + first_treat_col="first_treat", + alpha=0.05, + n_bootstrap=999, + seed=21, + aggregate="event_study", +) + +print(es_report.verdict) +print(f"\nall_pass = {es_report.all_pass}") +print(f"aggregate = {es_report.aggregate!r}") +print(f"pretrends_joint populated? {es_report.pretrends_joint is not None}") +print(f"homogeneity_joint populated? {es_report.homogeneity_joint is not None}") +``` + +**Output:** + +``` +QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions) + +all_pass = True +aggregate = 'event_study' +pretrends_joint populated? True +homogeneity_joint populated? True +``` + +**Reading the event-study verdict.** Now the verdict reads `"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)"`. The `"deferred"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated. + +A note on the verdict's "TWFE admissible" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05` (paper Step 4 decision rule). That is non-rejection evidence under the diagnostics' finite-sample power and specification, not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero, see Section 1) sits alongside this and is not covered by any of the three diagnostics. + +The joint pre-trends test runs over `n_horizons = 3` (pre-periods 1, 2, 3, with week 4 reserved as the base period). The joint homogeneity test runs over `n_horizons = 4` (post-periods 5, 6, 7, 8). Let's inspect the per-horizon detail. 
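One aside before the per-horizon output: the event-study path has no Yatchew arm, but the manual per-pair fallback suggested above is a short loop. The sketch below is illustrative only — it assumes week 4 as the base period (matching the joint tests) and pairs each post week's first difference with the dose at the launch week; the workflow itself runs only joint Stute on this path.

```python
# Illustrative sketch: standalone Yatchew-HR on each (base, post) pair, as
# suggested above for multi-period data. Not part of did_had_pretest_workflow.
from diff_diff import yatchew_hr_test

wide = panel.pivot(index="dma_id", columns="week", values="weekly_visits").sort_index()
dose_at_launch = (
    panel[panel["week"] == COHORT_PERIOD]
    .set_index("dma_id")
    .sort_index()["regional_spend_k"]
    .to_numpy(dtype=np.float64)
)
base_week = COHORT_PERIOD - 1  # week 4, the base period used by the joint tests
for post_week in range(COHORT_PERIOD, N_PERIODS + 1):  # post weeks 5..8
    dy_h = (wide[post_week] - wide[base_week]).to_numpy(dtype=np.float64)
    res_h = yatchew_hr_test(d=dose_at_launch, dy=dy_h, alpha=0.05, null="linearity")
    print(f"(base=week {base_week}, post=week {post_week})")
    print(res_h.summary())
    print()
```

Back to the workflow report's own per-horizon detail: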
+ +```python +es_report.qug.print_summary() +print() +es_report.pretrends_joint.print_summary() +print() +es_report.homogeneity_joint.print_summary() +``` + +**Output:** + +``` +================================================================ + QUG null test (H_0: d_lower = 0) +================================================================ +Statistic T: 3.8562 +p-value: 0.2059 +Critical value (1/alpha-1): 19.0000 +Reject H_0: False +alpha: 0.0500 +Observations: 60 +Excluded (d == 0): 0 +D_(1): 0.1806 +D_(2): 0.2274 +================================================================ + +================================================================ + Joint Stute CvM test (mean-independence (pre-trends)) +================================================================ +Joint CvM statistic: 7.1627 +Bootstrap p-value: 0.0720 +Reject H_0: False +alpha: 0.0500 +Bootstrap replications: 999 +Horizons: 3 +Observations: 60 +Seed: 21 +Exact-linear short-circuit: False +---------------------------------------------------------------- +Per-horizon statistics: + 1 1.6112 + 2 2.9262 + 3 2.6253 +================================================================ + +================================================================ + Joint Stute CvM test (linearity (post-homogeneity)) +================================================================ +Joint CvM statistic: 1.3562 +Bootstrap p-value: 0.7630 +Reject H_0: False +alpha: 0.0500 +Bootstrap replications: 999 +Horizons: 4 +Observations: 60 +Seed: 21 +Exact-linear short-circuit: False +---------------------------------------------------------------- +Per-horizon statistics: + 5 0.4218 + 6 0.2186 + 7 0.4928 + 8 0.2230 +================================================================ +``` + +The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold. The test does not reject at alpha = 0.05, but the near-threshold p-value warrants scrutiny - the diagnostic is not failing in a clearly-far-from-rejection regime. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring). + +The joint homogeneity p-value (~0.76) is comfortably far from rejection. The diagnostic does not flag heterogeneity bias on the dose dimension across the four post-launch horizons. + +Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. By paper Step 4 (the decision rule), TWFE may then be used. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero) remains and is argued from domain knowledge. + +## 5. Side Panel: Yatchew-HR Null Modes + +The Yatchew-HR test exposes two `null=` modes (the second was added in 2026-04 for parity with the R `YatchewTest` package). + +- `null="linearity"` (default; paper Theorem 7): tests `H0: E[dY | D]` is linear in `D`. Residuals come from OLS `dy ~ 1 + d`. This is what `did_had_pretest_workflow` calls under the hood. +- `null="mean_independence"` (added 2026-04-26 in PR #397, Phase 4 R-parity): tests the stricter `H0: E[dY | D] = E[dY]`, i.e. `dY` is mean-independent of `D`. Residuals come from intercept-only OLS `dy ~ 1`. 
Mirrors R `YatchewTest::yatchew_test(order=0)`. + +The mean-independence mode is typically used on **placebo (pre-treatment) data** to test parallel pre-trends as a non-parametric mean-independence assertion. Below we construct an illustrative input - the within-pre-period first-difference `dy = Y[week=4] - Y[week=3]` paired with each DMA's actual post-period dose - and run both modes side by side. Both should fail to reject on this clean linear DGP; the contrast is in the residual structure. + +```python +from diff_diff import yatchew_hr_test + +panel_sorted = panel.sort_values(["dma_id", "week"]).reset_index(drop=True) +pre = panel_sorted[panel_sorted["week"].isin([3, 4])] +pre_pivot = pre.pivot(index="dma_id", columns="week", values="weekly_visits") +dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64) +post_dose = ( + panel_sorted[panel_sorted["week"] == 5] + .set_index("dma_id") + .sort_index()["regional_spend_k"] + .to_numpy(dtype=np.float64) +) + +res_lin = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="linearity") +res_mi = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="mean_independence") + +print(res_lin.summary()) +print() +print(res_mi.summary()) +``` + +**Output:** + +``` +================================================================ + Yatchew-HR linearity test (H_0: linear E[dY|D]) +================================================================ +T_hr statistic: 0.0207 +p-value: 0.4917 +Critical value (1-sided z): 1.6449 +Reject H_0: False +alpha: 0.0500 +sigma^2_lin (OLS): 6.5340 +sigma^2_diff (Yatchew): 6.5170 +sigma^2_W (HR scale): 6.3639 +Observations: 60 +================================================================ + +================================================================ + Yatchew-HR mean-independence test (H_0: E[dY|D] = E[dY]) +================================================================ +T_hr statistic: 0.5536 +p-value: 0.2899 +Critical value (1-sided z): 1.6449 +Reject H_0: False +alpha: 0.0500 +sigma^2_lin (OLS): 7.0076 +sigma^2_diff (Yatchew): 6.5170 +sigma^2_W (HR scale): 6.8638 +Observations: 60 +================================================================ +``` + +**Reading the side-panel comparison.** + +- The `linearity` mode fits `dy ~ 1 + d` and computes residual variance `sigma2_lin` from those residuals. Under a clean linear DGP the residuals are small (close to noise variance), the gap `sigma2_lin - sigma2_diff` is near zero, and `T_hr` lands close to zero with a p-value far above alpha. +- The `mean_independence` mode fits intercept-only `dy ~ 1` and computes `sigma2_lin` as the population variance of `dy`. That residual variance is **strictly larger** than under `linearity` (the linear fit can absorb any apparent slope between `dy` and `d` - real or sample noise - shrinking the residual variance, while intercept-only cannot). The gap `sigma2_lin - sigma2_diff` is then larger and `T_hr` is larger - same asymptotic distribution, stricter null, more easily rejected when the alternative is true. + +On clean linear placebo data both modes fail to reject - exactly what we want. On data where `dY` actually responds to `D` in pre-period (parallel pre-trends fail), `null="mean_independence"` is more sensitive than `null="linearity"` because linearity is a weaker null (linear pre-trends would fail to reject the linearity null but would reject the mean-independence null). + +When to choose which: use `null="linearity"` to defend the joint identification assumption (paper Step 3, Assumption 8). 
Use `null="mean_independence"` on placebo (pre-treatment) data when you want a non-parametric mean-independence assertion. The `null="mean_independence"` mode is what R `YatchewTest::yatchew_test(order=0)` runs by default for placebo pre-trend tests. + +## 6. Communicating the Diagnostics to Leadership + +Pre-test results travel awkwardly to non-technical audiences. The template below structures the diagnostics around what each test does and does not rule out - mirroring the headline-and-evidence pattern from T20 Section 5. + +> **The HAD pre-test diagnostics on the brand-campaign panel do not flag a violation of the testable identifying assumptions.** +> +> - **Step 1 (QUG support-infimum, paper Theorem 4):** the test does not reject `H0: d_lower = 0` (p approximately 0.21). The data are statistically consistent with a dose distribution starting at zero. Independently of QUG, HAD's `design="auto"` selector applies a min/median heuristic to the post-period dose vector and lands on the `continuous_at_zero` design (target `WAS`) on this panel; QUG and the design selector are separate rules that point to the same identification path here. Failing to reject the QUG null is not proof that the true support is exactly at zero, and the design selector's choice is operational, not statistical. +> - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test does not reject (joint p approximately 0.07 across the three pre-period horizons). The p-value is close to alpha = 0.05, so the non-rejection here is not by a wide margin - in a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending. +> - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity does not reject (joint p approximately 0.76 across the four post-launch horizons). The diagnostic does not flag heterogeneity bias on the dose dimension under the test's specification. +> +> **Non-testable from data (Design 1' identification, paper Assumption 3 / boundary regularity at zero):** uniform continuity of the dose-response `d -> Y_2(d)` at zero. Argued from domain knowledge - is there reason to believe outcomes are continuous in spend at the lower-dose boundary, with no extensive-margin discontinuity at $0? In our case yes, by DGP construction. (Note: this is the Design 1' caveat. T20's panel was Design 1, where the corresponding non-testable caveats are Assumptions 5/6 - the library actually emits a UserWarning surfacing those on Design 1 fits but stays silent on Design 1' fits like ours.) +> +> **Bottom line:** the workflow's three testable diagnostics do not flag a violation, so by paper Step 4 (decision rule) TWFE may be used. Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and the non-testable Design 1' caveat (Assumption 3 / boundary regularity at zero). None of these are settled by non-rejection of the pre-tests. + +## 7. Extensions + +This tutorial covered the composite pre-test workflow on a single panel where QUG fail-to-reject and HAD's `design="auto"` heuristic both pointed independently to the `continuous_at_zero` (Design 1') identification path. A few directions we did not exercise here: + +- **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. 
The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). A follow-up tutorial covers this path end-to-end.
- **`trends_lin=True` (Pierce-Schott Eq 17 / 18 detrending)** - mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Forwards into both joint pre-trends and joint homogeneity wrappers; consumes the placebo at `base_period - 1` and skips Step 2 if no earlier placebo survives the drop. Useful when you suspect linear time trends correlated with dose but want to keep the joint-Stute machinery.
- **Standalone constituent tests** - all four building blocks are exposed for direct calling: `qug_test`, `stute_test`, `yatchew_hr_test` (used in this tutorial's side panel), and the joint-Stute core `stute_joint_pretest` together with its `joint_pretrends_test` / `joint_homogeneity_test` wrappers.

See the [`HeterogeneousAdoptionDiD` API reference](../api/had.html) and the [`HAD pre-tests` reference](../api/had.html#pre-tests) for the full parameter lists.

**Related tutorials.**

- [Tutorial 14: Continuous DiD](14_continuous_did.ipynb) - the Callaway-Goodman-Bacon-Sant'Anna estimator for continuous-dose settings where you do have a never-treated unit and want the per-dose ATT(d) curve, not just the average slope.
- [Tutorial 20: HAD for a National Brand Campaign](20_had_brand_campaign.ipynb) - the headline HAD fit and event-study this tutorial defends.
- [Tutorial 4: Parallel Trends](04_parallel_trends.ipynb) - parallel-trends tests for the binary-DiD setting.

## 8. Summary Checklist

- HAD's pre-test workflow `did_had_pretest_workflow` bundles paper Section 4.2 Steps 1 (QUG support infimum), 2 (joint Stute pre-trends - event-study path only), and 3 (Stute / Yatchew-HR linearity, joint variant on event-study path).
- The two-period (`aggregate="overall"`) path runs Steps 1 + 3 only - it cannot run Step 2 because a single pre-period structurally has nothing to test against. The verdict says so verbatim: "Assumption 7 pre-trends test NOT run".
- Upgrade to the multi-period (`aggregate="event_study"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads "TWFE admissible under Section 4 assumptions" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof.
- Paper Step 4 is the **decision rule** (if Steps 1-3 don't reject, use TWFE), not a non-testable assumption. The non-testable identification caveat is design-path-specific: **Assumption 3** (boundary regularity at zero) for `continuous_at_zero` (Design 1', T21), or **Assumptions 5/6** for the Design 1 paths (`continuous_near_d_lower` / `mass_point`, T20).
- The Yatchew-HR test exposes two null modes: `null="linearity"` (paper Theorem 7, default; what the workflow calls under the hood) and `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data).
- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The QUG test and HAD's `design="auto"` selector are independent rules: QUG is a statistical test on `H0: d_lower = 0`; `design="auto"` calls `_detect_design()` which uses a min/median heuristic on the dose vector.
Both pointed to `continuous_at_zero` on this panel; finite-sample uncertainty in either decision is a remaining caveat. +- Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python). diff --git a/docs/conf.py b/docs/conf.py index 4baa9ccb..828dd1c5 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -33,7 +33,7 @@ ] templates_path = ["_templates"] -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"] +exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "_review"] # -- Options for autodoc ----------------------------------------------------- autodoc_default_options = { diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml index f460b8ba..c325f5f9 100644 --- a/docs/doc-deps.yaml +++ b/docs/doc-deps.yaml @@ -388,6 +388,9 @@ sources: - path: diff_diff/guides/llms-full.txt section: "HeterogeneousAdoptionDiD" type: user_guide + - path: docs/tutorials/21_had_pretest_workflow.ipynb + type: tutorial + note: "Drift-locks `HAD(design=\"auto\")` resolution to `continuous_at_zero` on T21's panel via `tests/test_t21_had_pretest_workflow_drift.py::test_had_design_auto_lands_on_continuous_at_zero`; changes to `_detect_design()` heuristic should re-validate T21" diff_diff/had_pretests.py: drift_risk: medium @@ -404,6 +407,9 @@ sources: - path: diff_diff/guides/llms-full.txt section: "HAD Pretests" type: user_guide + - path: docs/tutorials/21_had_pretest_workflow.ipynb + type: tutorial + note: "Composite pre-test workflow walkthrough; drift-locked at tests/test_t21_had_pretest_workflow_drift.py" diff_diff/local_linear.py: drift_risk: low diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 353d2fb0..076632d0 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2506,7 +2506,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - **Note:** Horizon labels in `StuteJointResult.horizon_labels` are `str(t)` verbatim and carry STRING IDENTITY ONLY — NOT a chronological ordering key. Callers who need chronological order must preserve the original period values alongside (e.g. from the `pre_periods` / `post_periods` argument). - **Note:** NaN propagation is explicit: when any horizon has NaN in residuals, `cvm_stat_joint=NaN`, `p_value=NaN`, `reject=False`, AND `per_horizon_stats={label: np.nan for every horizon}` (full dict preserved with NaN values — not empty, not partial). -**Phase 3 follow-up delivery:** `stute_joint_pretest()`, `joint_pretrends_test()`, `joint_homogeneity_test()`, `StuteJointResult`, and `did_had_pretest_workflow(aggregate="event_study")` shipped together in PR #353 (2026-04). The `practitioner_next_steps()` integration and tutorial are queued for Phase 5. +**Phase 3 follow-up delivery:** `stute_joint_pretest()`, `joint_pretrends_test()`, `joint_homogeneity_test()`, `StuteJointResult`, and `did_had_pretest_workflow(aggregate="event_study")` shipped together in PR #353 (2026-04). The `practitioner_next_steps()` HAD handlers landed in Phase 5 wave 1 (PR #402); the T21 HAD pretest workflow tutorial landed in PR #409 (Phase 5 wave 2 first slice). T22 weighted/survey HAD tutorial remains queued. **Reference implementation(s):** - R: `did_had` (de Chaisemartin, Ciccia, D'Haultfœuille, Knau 2024a); `stute_test` (2024c); `yatchew_test` (Online Appendix, Table 3). 
@@ -2553,7 +2553,8 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - [x] Phase 5 (wave 1, PR #402): `practitioner_next_steps()` integration for HAD results - `_handle_had` and `_handle_had_event_study` route both result classes through HAD-specific Baker et al. (2025) step guidance with bidirectional HAD ↔ ContinuousDiD Step-4 routing closure. The `_check_nan_att` helper extends to ndarray `att` (HAD event-study) via `np.all(np.isnan(arr))` semantics; scalar path bit-exact preserved. - [x] Phase 5 (wave 1, PR #402): `llms-full.txt` HeterogeneousAdoptionDiD section + result-class blocks + `## HAD Pretests` index + Choosing-an-Estimator row landed; constructor / fit() signatures match the real API (regression-tested via `inspect.signature`); result-class field tables enumerate every public dataclass field (regression-tested via `dataclasses.fields()`); `llms-practitioner.txt` Step 4 decision tree distinguishes ContinuousDiD (per-dose ATT(d), needs never-treated) from HeterogeneousAdoptionDiD (WAS, universal-rollout-compatible). - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh. -- [ ] Phase 5 (remaining): T21 HAD pretest workflow tutorial + T22 weighted/survey HAD tutorial - tracked in `TODO.md`. +- [x] Phase 5 (wave 2 first slice, PR #409): T21 HAD pretest workflow tutorial (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `did_had_pretest_workflow`. Uses a `Uniform[$0.01K, $50K]` dose-distribution variant of T20's brand-campaign panel (true support strictly positive but near-zero, chosen so QUG fails-to-reject `H0: d_lower = 0` in finite sample). Walks through `aggregate="overall"` (Steps 1 + 3 only, verdict explicitly flags Step 2 deferral) and upgrades to `aggregate="event_study"` (joint pre-trends Stute + joint homogeneity Stute close the gap). Side panel exercises both `yatchew_hr_test` null modes (`linearity` vs `mean_independence`). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors, deterministic stats, bootstrap p-value tolerance bands per backend, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). +- [ ] Phase 5 (remaining): T22 weighted/survey HAD tutorial - tracked in `TODO.md`. - [ ] Documentation of non-testability of Assumptions 5 and 6. - [ ] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`). - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work). 
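For orientation, the two dispatch calls that the wave-2 checklist entry above walks through look like this: a minimal sketch only, using the T21 column names (`two_period` and `panel` are assumed to be built exactly as in the tutorial's Section 2-3 cells).

```python
from diff_diff import did_had_pretest_workflow

common = dict(
    outcome_col="weekly_visits",
    dose_col="regional_spend_k",
    unit_col="dma_id",
    first_treat_col="first_treat",
    alpha=0.05,
    n_bootstrap=999,
    seed=21,
)

# Two-period collapse: Steps 1 + 3 only; the verdict flags Step 2 as not run.
overall = did_had_pretest_workflow(data=two_period, time_col="period",
                                    aggregate="overall", **common)
assert overall.pretrends_joint is None and overall.homogeneity_joint is None

# Full multi-week panel: joint pre-trends and joint homogeneity now run too.
es = did_had_pretest_workflow(data=panel, time_col="week",
                              aggregate="event_study", **common)
assert es.pretrends_joint is not None and es.homogeneity_joint is not None
```

Both verdict pivots and these structural fields are pinned by the T21 drift file, so any change to the dispatch or verdict text surfaces against the tutorial.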
diff --git a/docs/tutorials/20_had_brand_campaign.ipynb b/docs/tutorials/20_had_brand_campaign.ipynb index 81c0f91b..66b24849 100644 --- a/docs/tutorials/20_had_brand_campaign.ipynb +++ b/docs/tutorials/20_had_brand_campaign.ipynb @@ -38,9 +38,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-004", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "import numpy as np\n", @@ -68,9 +68,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-006", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "MAIN_SEED = 87\n", @@ -116,9 +116,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-007", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "post_doses = (\n", @@ -146,9 +146,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-008", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "if HAS_MATPLOTLIB:\n", @@ -191,9 +191,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-010", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "panel_2pd = panel.copy()\n", @@ -231,9 +231,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-012", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "print(f'WAS_d_lower estimate (att): {result.att:.4f}')\n", @@ -272,9 +272,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-015", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "import warnings\n", @@ -300,9 +300,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-016", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "es_df = result_es.to_dataframe()\n", @@ -311,9 +311,9 @@ }, { "cell_type": "code", + "execution_count": null, "id": "t20-cell-017", "metadata": {}, - "execution_count": null, "outputs": [], "source": [ "if HAS_MATPLOTLIB:\n", @@ -346,7 +346,7 @@ "source": [ "**Reading the dynamics.**\n", "\n", - "- The pre-launch placebo horizons (weeks -4, -3, -2) all sit at essentially zero - per-$1K effects within \u00b10.06 with 95% CIs comfortably bracketing zero. Visually consistent with parallel pre-trends. (Note: this is a visual placebo check, not a formal pretest - HAD ships a separate composite pretest workflow we did not run here; see extensions.)\n", + "- The pre-launch placebo horizons (weeks -4, -3, -2) all sit at essentially zero - per-$1K effects within ±0.06 with 95% CIs comfortably bracketing zero. Visually consistent with parallel pre-trends. (Note: this is a visual placebo check, not a formal pretest - HAD ships a separate composite pretest workflow we did not run here; see extensions.)\n", "- The per-week post-launch effects (weeks 0, 1, 2, 3) all hover right around 100 visits per $1K with overlapping 95% CIs and lower bounds well above zero. The per-dollar lift is stable across all four weeks of the campaign.\n", "- Practically: the campaign delivered its per-dollar lift on impact and held it across all four post-launch weeks. No ramp-up, no fade." ] @@ -376,7 +376,7 @@ "id": "t20-cell-020", "metadata": {}, "source": [ - "Adapt this template by swapping in your own numbers from `result.att`, `result.conf_int`, `result.d_lower`, the per-week event-study table, and your own DMA / spend distribution. The pattern - **headline \u2192 sample \u2192 validity \u2192 business \u2192 practical** - is what to keep." 
+ "Adapt this template by swapping in your own numbers from `result.att`, `result.conf_int`, `result.d_lower`, the per-week event-study table, and your own DMA / spend distribution. The pattern - **headline → sample → validity → business → practical** - is what to keep." ] }, { @@ -389,7 +389,7 @@ "This tutorial covered HAD's headline workflow: the overall WAS_d_lower fit and the multi-week event study. The library also supports several extensions we did not demonstrate here.\n", "\n", "- **Population-weighted (survey-aware) inference**: when some markets or regions carry more weight than others - e.g., DMAs weighted by population - HAD accepts a `weights=` array or a `SurveyDesign` object on the same `fit()` interface.\n", - "- **Composite pretest workflow**: HAD ships a `did_had_pretest_workflow` that combines the QUG support-infimum test (`H0: d_lower = 0`, which adjudicates between the `continuous_at_zero` and `continuous_near_d_lower` design paths) with linearity tests (Stute and Yatchew-HR). On the two-period (`aggregate='overall'`) path this workflow checks QUG and linearity only; the parallel-trends step is closed by the multi-period (`aggregate='event_study'`) joint variants (`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`). The visual placebo check we used in Section 4 is a parallel-trends sanity check, not a substitute for the formal joint pretests.\n", + "- **Composite pretest workflow**: HAD ships a `did_had_pretest_workflow` that combines the QUG support-infimum test (`H0: d_lower = 0`, which adjudicates between the `continuous_at_zero` and `continuous_near_d_lower` design paths) with linearity tests (Stute and Yatchew-HR). On the two-period (`aggregate='overall'`) path this workflow checks QUG and linearity only; the parallel-trends step is closed by the multi-period (`aggregate='event_study'`) joint variants (`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`). The visual placebo check we used in Section 4 is a parallel-trends sanity check, not a substitute for the formal joint pretests; see [Tutorial 21](21_had_pretest_workflow.ipynb) for an end-to-end pretest walkthrough.\n", "- **`continuous_at_zero` design path**: if the lightest-touch DMA had no regional add-on (spend exactly $0), HAD switches to the Design 1' identification path with target `WAS` instead of `WAS_d_lower`. The auto-detection picks it up.\n", "- **Mass-point design path**: if a meaningful chunk of DMAs sit at exactly the same minimum spend (rather than spread continuously near the boundary), HAD switches to a 2SLS estimator with matching identification logic. Auto-detected as well.\n", "\n", diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb new file mode 100644 index 00000000..0443e015 --- /dev/null +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -0,0 +1,630 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d4e3e374", + "metadata": {}, + "source": [ + "# Tutorial 21: HAD Pre-test Workflow - Running the Pre-test Diagnostics on the Brand Campaign Panel\n", + "\n", + "[Tutorial 20](20_had_brand_campaign.ipynb) fit `HeterogeneousAdoptionDiD` (HAD) on a regional brand-campaign panel and reported a per-dollar lift, with a brief visual placebo check at the end. We deliberately deferred the **formal pre-test workflow** to this tutorial, with a forward pointer in T20's \"Extensions\" section.\n", + "\n", + "This tutorial picks up where T20 left off. 
We re-run the brand campaign on a panel close in shape to T20's, then walk through HAD's composite pre-test workflow `did_had_pretest_workflow` and read the diagnostics for paper Section 4.2 of de Chaisemartin, Ciccia, D'Haultfoeuille, & Knau (2026). We start with the two-period (`aggregate=\"overall\"`) workflow, observe that it does not run the parallel pre-trends step, and then **upgrade** to the multi-period (`aggregate=\"event_study\"`) workflow that adds the joint Stute pre-trends and joint homogeneity diagnostics. None of the diagnostics in this tutorial reject; we walk through what that does and does not let us conclude. A side panel compares the two `null=` modes of the Yatchew-HR test, including the recently-shipped `null=\"mean_independence\"` mode (R-parity with `YatchewTest::yatchew_test(order=0)`).\n" + ] + }, + { + "cell_type": "markdown", + "id": "c45952e0", + "metadata": {}, + "source": [ + "## 1. The Pre-test Battery\n", + "\n", + "de Chaisemartin et al. (2026) Section 4.2 lays out a four-step pre-test workflow for HAD identification:\n", + "\n", + "1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1', `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1, `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters.\n", + "2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD.\n", + "3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias?\n", + "4. **Step 4 - Decision rule:** if Steps 1-3 all fail to reject, TWFE may be used to estimate the treatment effect (paper Section 4.3).\n", + "\n", + "The library bundles the testable steps into one entry point: `did_had_pretest_workflow`. It dispatches to a two-period implementation (steps 1 + 3 only - step 2 needs at least two pre-periods) or a multi-period implementation (steps 1 + 2 + 3 jointly). The Yatchew-HR test from Step 3 is also exposed standalone with two null modes; we exercise both in the side panel.\n", + "\n", + "**Non-testable identification caveat (separate from the four-step workflow).** Identification of the WAS estimand under Design 1' (`continuous_at_zero`, target = `WAS`) requires **Assumption 3** (uniform continuity of `d -> Y_2(d)` at zero, holds if the dose-response is Lipschitz; not testable). The Design 1 paths (`continuous_near_d_lower` / `mass_point`, target = `WAS_d_lower`) instead need **Assumption 5** (sign identification) or **Assumption 6** (`WAS_d_lower` point identification) - that is the caveat T20's tutorial flagged because T20's panel was Design 1. T21's panel resolves to Design 1' (see Section 2 + Section 3), so the relevant non-testable caveat here is Assumption 3, NOT Assumptions 5/6. The library reflects this: it emits a UserWarning about Assumption 5/6 on Design 1 fits and does not emit it on `continuous_at_zero` (Design 1') fits.\n" + ] + }, + { + "cell_type": "markdown", + "id": "e75c0ee7", + "metadata": {}, + "source": [ + "## 2. 
The Panel\n", + "\n", + "We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1') identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design=\"auto\"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1' from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "2b498126", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-10T14:52:27.740298Z", + "iopub.status.busy": "2026-05-10T14:52:27.740212Z", + "iopub.status.idle": "2026-05-10T14:52:28.604952Z", + "shell.execute_reply": "2026-05-10T14:52:28.604641Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Panel: 60 DMAs x 8 weeks\n", + "Regional spend (post-launch): $0.18K - $49.00K\n", + "True per-$1K lift (locked at seed): 100.0 weekly visits\n" + ] + } + ], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from diff_diff import generate_continuous_did_data\n", + "\n", + "MAIN_SEED = 87\n", + "N_UNITS = 60\n", + "N_PERIODS = 8\n", + "COHORT_PERIOD = 5\n", + "TRUE_SLOPE = 100.0\n", + "BASELINE_VISITS = 5000.0\n", + "DOSE_LOW = 0.01\n", + "DOSE_HIGH = 50.0\n", + "\n", + "raw = generate_continuous_did_data(\n", + " n_units=N_UNITS,\n", + " n_periods=N_PERIODS,\n", + " cohort_periods=[COHORT_PERIOD],\n", + " never_treated_frac=0.0,\n", + " dose_distribution=\"uniform\",\n", + " dose_params={\"low\": DOSE_LOW, \"high\": DOSE_HIGH},\n", + " att_function=\"linear\",\n", + " att_intercept=0.0,\n", + " att_slope=TRUE_SLOPE,\n", + " unit_fe_sd=8.0,\n", + " time_trend=0.5,\n", + " noise_sd=2.0,\n", + " seed=MAIN_SEED,\n", + ")\n", + "panel = raw.copy()\n", + "panel.loc[panel[\"period\"] < panel[\"first_treat\"], \"dose\"] = 0.0\n", + "panel = panel.rename(\n", + " columns={\n", + " \"unit\": \"dma_id\",\n", + " \"period\": \"week\",\n", + " \"outcome\": \"weekly_visits\",\n", + " \"dose\": \"regional_spend_k\",\n", + " }\n", + ")\n", + "panel[\"weekly_visits\"] = panel[\"weekly_visits\"] + BASELINE_VISITS\n", + "\n", + "post = panel[panel[\"week\"] >= COHORT_PERIOD]\n", + "print(f\"Panel: {panel['dma_id'].nunique()} DMAs x {panel['week'].nunique()} weeks\")\n", + "print(\n", + " f\"Regional spend (post-launch): \"\n", + " f\"${post['regional_spend_k'].min():.2f}K - \"\n", + " f\"${post['regional_spend_k'].max():.2f}K\"\n", + ")\n", + "print(f\"True 
per-$1K lift (locked at seed): {TRUE_SLOPE} weekly visits\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "f101faae", + "metadata": {}, + "source": [ + "## 3. Step 1: The Overall Workflow (Two-Period Path)\n", + "\n", + "T20's headline used a two-period collapse of the panel - average pre-launch outcome per DMA against average post-launch outcome per DMA. That's also the natural input shape for HAD's two-period (`aggregate=\"overall\"`) pre-test workflow, which runs **paper Step 1 (QUG) + paper Step 3 (linearity, via Stute and Yatchew-HR)**. Step 2 (parallel pre-trends) is not implemented on this path - a single pre-period structurally can't support a pre-trends test - and the workflow's verdict says so explicitly.\n", + "\n", + "We collapse to two periods (pre = avg over weeks 1-4, post = avg over weeks 5-8), then call the workflow.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "230859e5", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-10T14:52:28.606277Z", + "iopub.status.busy": "2026-05-10T14:52:28.606164Z", + "iopub.status.idle": "2026-05-10T14:52:28.645728Z", + "shell.execute_reply": "2026-05-10T14:52:28.645446Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)\n", + "\n", + "all_pass = True\n", + "aggregate = 'overall'\n", + "pretrends_joint populated? False\n", + "homogeneity_joint populated? False\n" + ] + } + ], + "source": [ + "from diff_diff import did_had_pretest_workflow\n", + "\n", + "p = panel.copy()\n", + "p[\"period\"] = (p[\"week\"] >= COHORT_PERIOD).astype(int) + 1 # 1=pre, 2=post\n", + "two_period = p.groupby([\"dma_id\", \"period\"], as_index=False).agg(\n", + " weekly_visits=(\"weekly_visits\", \"mean\"),\n", + " regional_spend_k=(\"regional_spend_k\", \"mean\"),\n", + ")\n", + "# Workflow invariant: pre-period dose = 0 for every unit.\n", + "two_period.loc[two_period[\"period\"] == 1, \"regional_spend_k\"] = 0.0\n", + "# first_treat in the collapsed coordinates: 2 (the post-period) for every DMA.\n", + "two_period[\"first_treat\"] = 2\n", + "\n", + "overall_report = did_had_pretest_workflow(\n", + " data=two_period,\n", + " outcome_col=\"weekly_visits\",\n", + " dose_col=\"regional_spend_k\",\n", + " time_col=\"period\",\n", + " unit_col=\"dma_id\",\n", + " first_treat_col=\"first_treat\",\n", + " alpha=0.05,\n", + " n_bootstrap=999,\n", + " seed=21,\n", + " aggregate=\"overall\",\n", + ")\n", + "\n", + "print(overall_report.verdict)\n", + "print(f\"\\nall_pass = {overall_report.all_pass}\")\n", + "print(f\"aggregate = {overall_report.aggregate!r}\")\n", + "print(f\"pretrends_joint populated? {overall_report.pretrends_joint is not None}\")\n", + "print(f\"homogeneity_joint populated? {overall_report.homogeneity_joint is not None}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "82fc1090", + "metadata": {}, + "source": [ + "**Reading the overall verdict.** Three things to note.\n", + "\n", + "- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. 
QUG's outcome supports interpreting the data as Design 1', but the QUG test is independent of HAD's `design=\"auto\"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.)\n", + "- **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The diagnostics do not flag heterogeneity bias on the dose dimension, so reading the WAS as an average per-dose marginal effect is supported by these tests (subject to finite-sample power).\n", + "- **Step 2 (Assumption 7 pre-trends) is not run on this path.** The verdict says so verbatim: `\"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)\"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path).\n", + "\n", + "A note on `all_pass = True` here: the workflow's `all_pass` flag aggregates only the steps that actually ran on this dispatch path. On the overall path that is QUG + linearity (Stute / Yatchew); Step 2's deferral is *not* folded into `all_pass`. So `all_pass = True` on the overall path means \"of the two steps that ran, neither rejected\" - it does not mean Assumption 7 has been cleared. The upgrade to event-study below makes this concrete by actually running Step 2.\n", + "\n", + "Let's look at each individual test result.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "57c13c5d", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-10T14:52:28.646988Z", + "iopub.status.busy": "2026-05-10T14:52:28.646904Z", + "iopub.status.idle": "2026-05-10T14:52:28.648718Z", + "shell.execute_reply": "2026-05-10T14:52:28.648525Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "================================================================\n", + " QUG null test (H_0: d_lower = 0) \n", + "================================================================\n", + "Statistic T: 3.8562\n", + "p-value: 0.2059\n", + "Critical value (1/alpha-1): 19.0000\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Observations: 60\n", + "Excluded (d == 0): 0\n", + "D_(1): 0.1806\n", + "D_(2): 0.2274\n", + "================================================================\n", + "\n", + "================================================================\n", + " Stute CvM linearity test (H_0: linear E[dY|D]) \n", + "================================================================\n", + "CvM statistic: 0.0735\n", + "Bootstrap p-value: 0.6860\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Bootstrap replications: 999\n", + "Observations: 60\n", + "Seed: 21\n", + "================================================================\n", + "\n", + "================================================================\n", + " Yatchew-HR linearity test (H_0: linear E[dY|D]) \n", + "================================================================\n", + "T_hr statistic: -34759.3017\n", + "p-value: 1.0000\n", + "Critical value (1-sided z): 1.6449\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "sigma^2_lin (OLS): 1.6177\n", + "sigma^2_diff (Yatchew): 6250.2569\n", + "sigma^2_W (HR scale): 1.3925\n", + "Observations: 60\n", + 
"================================================================\n" + ] + } + ], + "source": [ + "overall_report.qug.print_summary()\n", + "print()\n", + "overall_report.stute.print_summary()\n", + "print()\n", + "overall_report.yatchew.print_summary()\n" + ] + }, + { + "cell_type": "markdown", + "id": "f13eb155", + "metadata": {}, + "source": [ + "A note on the Yatchew row. The `T_hr` statistic is **very large and negative** (~-35,000), which looks alarming but is a scale artifact, not pathology. Under the Yatchew construction `sigma2_diff = (1 / 2G) * sum((dy_{(g)} - dy_{(g-1)})^2)` is computed on `dy` sorted by dose `D`. With doses spread over Uniform[\\$0.01K, \\$50K] and a true per-$1K slope of 100 (locked by the DGP), adjacent-by-dose units have `dy` values that differ by roughly `100 * (D_{(g)} - D_{(g-1)})` plus noise — those squared gaps add up to a large `sigma2_diff` (about 6,250 here) by virtue of the dose scale, while the OLS residual variance `sigma2_lin` (about 1.6) reflects only noise around the linear fit. The formula `T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W` then goes massively negative, p-value rounds to 1.0, and we comfortably fail to reject linearity. The side panel later in the notebook constructs a different Yatchew input (within-pre-period first-differences, where the adjacent-by-dose `dy` gaps are not driven by the post-treatment slope) and produces a `T_hr` near zero — a useful sanity check that the test behaves the way it should when the dose dimension genuinely contributes nothing to the variance of `dy`.\n" + ] + }, + { + "cell_type": "markdown", + "id": "86c21280", + "metadata": {}, + "source": [ + "## 4. Step 2: Upgrade to the Event-Study Workflow\n", + "\n", + "The two-period workflow ran Steps 1 and 3 but did not run Step 2 (parallel pre-trends). Our panel actually has 8 weeks - that is enough pre-periods to add the joint Stute pre-trends diagnostic (paper Section 4.2 step 2 + Hlavka-Huskova 2020 / Delgado-Manteiga 2001 dependence-preserving Mammen multiplier bootstrap).\n", + "\n", + "We pass the full multi-period panel to `did_had_pretest_workflow(aggregate=\"event_study\", ...)`. The dispatch runs all three testable steps in one call:\n", + "\n", + "- **Step 1**: QUG re-runs on the dose distribution at the treatment period `F` (deterministic; same numbers as the overall path).\n", + "- **Step 2**: `joint_pretrends_test` - mean-independence joint Stute over the pre-period horizons (`E[Y_t - Y_base | D] = mu_t` for each t < F).\n", + "- **Step 3**: `joint_homogeneity_test` - linearity joint Stute over the post-period horizons (`E[Y_t - Y_base | D_t] = beta_{0,t} + beta_{fe,t} * D` for each t >= F).\n", + "\n", + "Step 3's \"Yatchew-HR\" arm has no joint variant in the paper (the differencing-based variance estimator doesn't have a derived multi-horizon extension), so the event-study path runs only joint Stute for linearity. 
Practitioners who want Yatchew-HR robustness on multi-period data can call the standalone `yatchew_hr_test` on each (base, post) pair manually.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "9e08fdc9", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-10T14:52:28.649676Z", + "iopub.status.busy": "2026-05-10T14:52:28.649603Z", + "iopub.status.idle": "2026-05-10T14:52:28.775756Z", + "shell.execute_reply": "2026-05-10T14:52:28.775491Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\n", + "\n", + "all_pass = True\n", + "aggregate = 'event_study'\n", + "pretrends_joint populated? True\n", + "homogeneity_joint populated? True\n" + ] + } + ], + "source": [ + "es_report = did_had_pretest_workflow(\n", + " data=panel,\n", + " outcome_col=\"weekly_visits\",\n", + " dose_col=\"regional_spend_k\",\n", + " time_col=\"week\",\n", + " unit_col=\"dma_id\",\n", + " first_treat_col=\"first_treat\",\n", + " alpha=0.05,\n", + " n_bootstrap=999,\n", + " seed=21,\n", + " aggregate=\"event_study\",\n", + ")\n", + "\n", + "print(es_report.verdict)\n", + "print(f\"\\nall_pass = {es_report.all_pass}\")\n", + "print(f\"aggregate = {es_report.aggregate!r}\")\n", + "print(f\"pretrends_joint populated? {es_report.pretrends_joint is not None}\")\n", + "print(f\"homogeneity_joint populated? {es_report.homogeneity_joint is not None}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "8b966112", + "metadata": {}, + "source": [ + "**Reading the event-study verdict.** Now the verdict reads `\"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)\"`. The `\"deferred\"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.\n", + "\n", + "A note on the verdict's \"TWFE admissible\" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05` (paper Step 4 decision rule). That is non-rejection evidence under the diagnostics' finite-sample power and specification, not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero, see Section 1) sits alongside this and is not covered by any of the three diagnostics.\n", + "\n", + "The joint pre-trends test runs over `n_horizons = 3` (pre-periods 1, 2, 3, with week 4 reserved as the base period). The joint homogeneity test runs over `n_horizons = 4` (post-periods 5, 6, 7, 8). 
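As noted at the top of this section, the Yatchew-HR arm has no joint variant, but it can still be run by hand on each (base, post) pair. A minimal sketch of that manual fallback, assuming the `panel` frame and `COHORT_PERIOD` constant from Section 2 and the `yatchew_hr_test` signature used in the Section 5 side panel (the workflow does not run this for you):

```python
from diff_diff import yatchew_hr_test

# One row per DMA and week; base period is week 4, post weeks are 5-8.
wide = panel.pivot(index="dma_id", columns="week", values="weekly_visits")
dose = (
    panel.loc[panel["week"] == COHORT_PERIOD]
    .set_index("dma_id")["regional_spend_k"]
    .reindex(wide.index)
    .to_numpy(dtype=float)
)

for post_week in (5, 6, 7, 8):
    dy = (wide[post_week] - wide[4]).to_numpy(dtype=float)
    res = yatchew_hr_test(d=dose, dy=dy, alpha=0.05, null="linearity")
    print(f"week {post_week}: T_hr={res.t_stat_hr:.2f}, p={res.p_value:.3f}, reject={res.reject}")
```

That manual loop is optional; the workflow's own per-horizon detail lives on the joint Stute objects.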
Let's inspect the per-horizon detail.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "0cff9af4", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-10T14:52:28.776891Z", + "iopub.status.busy": "2026-05-10T14:52:28.776809Z", + "iopub.status.idle": "2026-05-10T14:52:28.778666Z", + "shell.execute_reply": "2026-05-10T14:52:28.778448Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "================================================================\n", + " QUG null test (H_0: d_lower = 0) \n", + "================================================================\n", + "Statistic T: 3.8562\n", + "p-value: 0.2059\n", + "Critical value (1/alpha-1): 19.0000\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Observations: 60\n", + "Excluded (d == 0): 0\n", + "D_(1): 0.1806\n", + "D_(2): 0.2274\n", + "================================================================\n", + "\n", + "================================================================\n", + " Joint Stute CvM test (mean-independence (pre-trends)) \n", + "================================================================\n", + "Joint CvM statistic: 7.1627\n", + "Bootstrap p-value: 0.0720\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Bootstrap replications: 999\n", + "Horizons: 3\n", + "Observations: 60\n", + "Seed: 21\n", + "Exact-linear short-circuit: False\n", + "----------------------------------------------------------------\n", + "Per-horizon statistics:\n", + " 1 1.6112\n", + " 2 2.9262\n", + " 3 2.6253\n", + "================================================================\n", + "\n", + "================================================================\n", + " Joint Stute CvM test (linearity (post-homogeneity)) \n", + "================================================================\n", + "Joint CvM statistic: 1.3562\n", + "Bootstrap p-value: 0.7630\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "Bootstrap replications: 999\n", + "Horizons: 4\n", + "Observations: 60\n", + "Seed: 21\n", + "Exact-linear short-circuit: False\n", + "----------------------------------------------------------------\n", + "Per-horizon statistics:\n", + " 5 0.4218\n", + " 6 0.2186\n", + " 7 0.4928\n", + " 8 0.2230\n", + "================================================================\n" + ] + } + ], + "source": [ + "es_report.qug.print_summary()\n", + "print()\n", + "es_report.pretrends_joint.print_summary()\n", + "print()\n", + "es_report.homogeneity_joint.print_summary()\n" + ] + }, + { + "cell_type": "markdown", + "id": "df5603e6", + "metadata": {}, + "source": [ + "The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 threshold. The test does not reject at alpha = 0.05, but the near-threshold p-value warrants scrutiny - the diagnostic is not failing in a clearly-far-from-rejection regime. In a real analysis this would warrant a closer look at the per-horizon CvM contributions (visible in `per_horizon_stats`) and possibly a Pierce-Schott-style linear-trend detrending via `trends_lin=True` (an extension we do not demonstrate here; see `did_had_pretest_workflow`'s docstring).\n", + "\n", + "The joint homogeneity p-value (~0.76) is comfortably far from rejection. The diagnostic does not flag heterogeneity bias on the dose dimension across the four post-launch horizons.\n", + "\n", + "Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. 
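Before leaning on the decision rule, the near-threshold pre-trends result is worth a programmatic second look. A minimal sketch, assuming the `es_report` object from the cell above; `per_horizon_stats` is part of the joint-Stute result contract, and the `trends_lin=True` re-run only exercises the keyword described in the Extensions section, without interpreting its output here:

```python
from diff_diff import did_had_pretest_workflow

# Per-horizon CvM contributions behind the joint p ~ 0.07.
pj = es_report.pretrends_joint
for label, stat in pj.per_horizon_stats.items():
    print(f"pre-period {label}: CvM contribution {stat:.4f}")

# Optional follow-up when the pre-trends p-value sits this close to alpha:
# re-run the event-study workflow with Pierce-Schott linear detrending.
detrended = did_had_pretest_workflow(
    data=panel,
    outcome_col="weekly_visits",
    dose_col="regional_spend_k",
    time_col="week",
    unit_col="dma_id",
    first_treat_col="first_treat",
    alpha=0.05,
    n_bootstrap=999,
    seed=21,
    aggregate="event_study",
    trends_lin=True,
)
print(detrended.verdict)
```

With that detail in hand, the decision-rule reading is unchanged.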
By paper Step 4 (the decision rule), TWFE may then be used. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero) remains and is argued from domain knowledge.\n" + ] + }, + { + "cell_type": "markdown", + "id": "63068545", + "metadata": {}, + "source": [ + "## 5. Side Panel: Yatchew-HR Null Modes\n", + "\n", + "The Yatchew-HR test exposes two `null=` modes (the second was added in 2026-04 for parity with the R `YatchewTest` package).\n", + "\n", + "- `null=\"linearity\"` (default; paper Theorem 7): tests `H0: E[dY | D]` is linear in `D`. Residuals come from OLS `dy ~ 1 + d`. This is what `did_had_pretest_workflow` calls under the hood.\n", + "- `null=\"mean_independence\"` (added 2026-04-26 in PR #397, Phase 4 R-parity): tests the stricter `H0: E[dY | D] = E[dY]`, i.e. `dY` is mean-independent of `D`. Residuals come from intercept-only OLS `dy ~ 1`. Mirrors R `YatchewTest::yatchew_test(order=0)`.\n", + "\n", + "The mean-independence mode is typically used on **placebo (pre-treatment) data** to test parallel pre-trends as a non-parametric mean-independence assertion. Below we construct an illustrative input - the within-pre-period first-difference `dy = Y[week=4] - Y[week=3]` paired with each DMA's actual post-period dose - and run both modes side by side. Both should fail to reject on this clean linear DGP; the contrast is in the residual structure.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "7fe4f131", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-10T14:52:28.779708Z", + "iopub.status.busy": "2026-05-10T14:52:28.779631Z", + "iopub.status.idle": "2026-05-10T14:52:28.784665Z", + "shell.execute_reply": "2026-05-10T14:52:28.784449Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "================================================================\n", + " Yatchew-HR linearity test (H_0: linear E[dY|D]) \n", + "================================================================\n", + "T_hr statistic: 0.0207\n", + "p-value: 0.4917\n", + "Critical value (1-sided z): 1.6449\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "sigma^2_lin (OLS): 6.5340\n", + "sigma^2_diff (Yatchew): 6.5170\n", + "sigma^2_W (HR scale): 6.3639\n", + "Observations: 60\n", + "================================================================\n", + "\n", + "================================================================\n", + " Yatchew-HR mean-independence test (H_0: E[dY|D] = E[dY]) \n", + "================================================================\n", + "T_hr statistic: 0.5536\n", + "p-value: 0.2899\n", + "Critical value (1-sided z): 1.6449\n", + "Reject H_0: False\n", + "alpha: 0.0500\n", + "sigma^2_lin (OLS): 7.0076\n", + "sigma^2_diff (Yatchew): 6.5170\n", + "sigma^2_W (HR scale): 6.8638\n", + "Observations: 60\n", + "================================================================\n" + ] + } + ], + "source": [ + "from diff_diff import yatchew_hr_test\n", + "\n", + "panel_sorted = panel.sort_values([\"dma_id\", \"week\"]).reset_index(drop=True)\n", + "pre = panel_sorted[panel_sorted[\"week\"].isin([3, 4])]\n", + "pre_pivot = pre.pivot(index=\"dma_id\", columns=\"week\", values=\"weekly_visits\")\n", + "dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64)\n", + "post_dose = (\n", + " panel_sorted[panel_sorted[\"week\"] == 5]\n", + " .set_index(\"dma_id\")\n", + " 
.sort_index()[\"regional_spend_k\"]\n", + " .to_numpy(dtype=np.float64)\n", + ")\n", + "\n", + "res_lin = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null=\"linearity\")\n", + "res_mi = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null=\"mean_independence\")\n", + "\n", + "print(res_lin.summary())\n", + "print()\n", + "print(res_mi.summary())\n" + ] + }, + { + "cell_type": "markdown", + "id": "5b38c476", + "metadata": {}, + "source": [ + "**Reading the side-panel comparison.**\n", + "\n", + "- The `linearity` mode fits `dy ~ 1 + d` and computes residual variance `sigma2_lin` from those residuals. Under a clean linear DGP the residuals are small (close to noise variance), the gap `sigma2_lin - sigma2_diff` is near zero, and `T_hr` lands close to zero with a p-value far above alpha.\n", + "- The `mean_independence` mode fits intercept-only `dy ~ 1` and computes `sigma2_lin` as the population variance of `dy`. That residual variance is **strictly larger** than under `linearity` (the linear fit can absorb any apparent slope between `dy` and `d` - real or sample noise - shrinking the residual variance, while intercept-only cannot). The gap `sigma2_lin - sigma2_diff` is then larger and `T_hr` is larger - same asymptotic distribution, stricter null, more easily rejected when the alternative is true.\n", + "\n", + "On clean linear placebo data both modes fail to reject - exactly what we want. On data where `dY` actually responds to `D` in pre-period (parallel pre-trends fail), `null=\"mean_independence\"` is more sensitive than `null=\"linearity\"` because linearity is a weaker null (linear pre-trends would fail to reject the linearity null but would reject the mean-independence null).\n", + "\n", + "When to choose which: use `null=\"linearity\"` to defend the joint identification assumption (paper Step 3, Assumption 8). Use `null=\"mean_independence\"` on placebo (pre-treatment) data when you want a non-parametric mean-independence assertion. The `null=\"mean_independence\"` mode is what R `YatchewTest::yatchew_test(order=0)` runs by default for placebo pre-trend tests.\n" + ] + }, + { + "cell_type": "markdown", + "id": "b16b856f", + "metadata": {}, + "source": [ + "## 6. Communicating the Diagnostics to Leadership\n", + "\n", + "Pre-test results travel awkwardly to non-technical audiences. The template below structures the diagnostics around what each test does and does not rule out - mirroring the headline-and-evidence pattern from T20 Section 5.\n", + "\n", + "> **The HAD pre-test diagnostics on the brand-campaign panel do not flag a violation of the testable identifying assumptions.**\n", + ">\n", + "> - **Step 1 (QUG support-infimum, paper Theorem 4):** the test does not reject `H0: d_lower = 0` (p approximately 0.21). The data are statistically consistent with a dose distribution starting at zero. Independently of QUG, HAD's `design=\"auto\"` selector applies a min/median heuristic to the post-period dose vector and lands on the `continuous_at_zero` design (target `WAS`) on this panel; QUG and the design selector are separate rules that point to the same identification path here. Failing to reject the QUG null is not proof that the true support is exactly at zero, and the design selector's choice is operational, not statistical.\n", + "> - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test does not reject (joint p approximately 0.07 across the three pre-period horizons). 
The p-value is close to alpha = 0.05, so the non-rejection here is not by a wide margin - in a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending.\n", + "> - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity does not reject (joint p approximately 0.76 across the four post-launch horizons). The diagnostic does not flag heterogeneity bias on the dose dimension under the test's specification.\n", + ">\n", + "> **Non-testable from data (Design 1' identification, paper Assumption 3 / boundary regularity at zero):** uniform continuity of the dose-response `d -> Y_2(d)` at zero. Argued from domain knowledge - is there reason to believe outcomes are continuous in spend at the lower-dose boundary, with no extensive-margin discontinuity at $0? In our case yes, by DGP construction. (Note: this is the Design 1' caveat. T20's panel was Design 1, where the corresponding non-testable caveats are Assumptions 5/6 - the library actually emits a UserWarning surfacing those on Design 1 fits but stays silent on Design 1' fits like ours.)\n", + ">\n", + "> **Bottom line:** the workflow's three testable diagnostics do not flag a violation, so by paper Step 4 (decision rule) TWFE may be used. Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and the non-testable Design 1' caveat (Assumption 3 / boundary regularity at zero). None of these are settled by non-rejection of the pre-tests.\n" + ] + }, + { + "cell_type": "markdown", + "id": "eb1d3712", + "metadata": {}, + "source": [ + "## 7. Extensions\n", + "\n", + "This tutorial covered the composite pre-test workflow on a single panel where QUG fail-to-reject and HAD's `design=\"auto\"` heuristic both pointed independently to the `continuous_at_zero` (Design 1') identification path. A few directions we did not exercise here:\n", + "\n", + "- **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). A follow-up tutorial covers this path end-to-end.\n", + "- **`trends_lin=True` (Pierce-Schott Eq 17 / 18 detrending)** - mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Forwards into both joint pre-trends and joint homogeneity wrappers; consumes the placebo at `base_period - 1` and skips Step 2 if no earlier placebo survives the drop. 
Useful when you suspect linear time trends correlated with dose but want to keep the joint-Stute machinery.\n", + "- **Standalone constituent tests** - all four building blocks are exposed for direct calling: `qug_test`, `stute_test`, `yatchew_hr_test` (used in this tutorial's side panel), and the joint variants `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`.\n", + "\n", + "See the [`HeterogeneousAdoptionDiD` API reference](../api/had.html) and the [`HAD pre-tests` reference](../api/had.html#pre-tests) for the full parameter lists.\n", + "\n", + "**Related tutorials.**\n", + "\n", + "- [Tutorial 14: Continuous DiD](14_continuous_did.ipynb) - the Callaway-Goodman-Bacon-Sant'Anna estimator for continuous-dose settings WHERE you do have a never-treated unit AND want the per-dose ATT(d) curve, not just the average slope.\n", + "- [Tutorial 20: HAD for a National Brand Campaign](20_had_brand_campaign.ipynb) - the headline HAD fit and event-study this tutorial defends.\n", + "- [Tutorial 4: Parallel Trends](04_parallel_trends.ipynb) - parallel-trends tests for the binary-DiD setting.\n" + ] + }, + { + "cell_type": "markdown", + "id": "dd1dce6f", + "metadata": {}, + "source": [ + "## 8. Summary Checklist\n", + "\n", + "- HAD's pre-test workflow `did_had_pretest_workflow` bundles paper Section 4.2 Steps 1 (QUG support infimum), 2 (joint Stute pre-trends - event-study path only), and 3 (Stute / Yatchew-HR linearity, joint variant on event-study path).\n", + "- The two-period (`aggregate=\"overall\"`) path runs Steps 1 + 3 only - it cannot run Step 2 because a single pre-period structurally has nothing to test against. The verdict says so verbatim: \"Assumption 7 pre-trends test NOT run\".\n", + "- Upgrade to the multi-period (`aggregate=\"event_study\"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads \"TWFE admissible under Section 4 assumptions\" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof.\n", + "- Paper Step 4 is the **decision rule** (if Steps 1-3 don't reject, use TWFE), not a non-testable assumption. The non-testable identification caveat is design-path-specific: **Assumption 3** (boundary regularity at zero) for `continuous_at_zero` (Design 1', T21), or **Assumptions 5/6** for the Design 1 paths (`continuous_near_d_lower` / `mass_point`, T20).\n", + "- The Yatchew-HR test exposes two null modes: `null=\"linearity\"` (paper Theorem 7, default; what the workflow calls under the hood) and `null=\"mean_independence\"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data).\n", + "- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The QUG test and HAD's `design=\"auto\"` selector are independent rules: QUG is a statistical test on `H0: d_lower = 0`; `design=\"auto\"` calls `_detect_design()` which uses a min/median heuristic on the dose vector. Both pointed to `continuous_at_zero` on this panel; finite-sample uncertainty in either decision is a remaining caveat.\n", + "- Bootstrap p-values are RNG-dependent. 
The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python).\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/tutorials/README.md b/docs/tutorials/README.md index 64773d46..a7b495bb 100644 --- a/docs/tutorials/README.md +++ b/docs/tutorials/README.md @@ -103,6 +103,14 @@ Practitioner walkthrough for measuring per-dollar lift when every market is trea - Stakeholder communication template flagging the Assumption 5/6 identification caveat - Companion drift-test file (`tests/test_t20_had_brand_campaign_drift.py`) +### 21. HAD Pre-test Workflow (`21_had_pretest_workflow.ipynb`) +Composite pre-test walkthrough for `HeterogeneousAdoptionDiD`, building on Tutorial 20's brand-campaign framing on a panel where the dose distribution has a strictly positive but very near-zero lower bound (so the QUG step fails-to-reject `H0: d_lower = 0`): +- Paper Section 4.2 step taxonomy (QUG support-infimum, parallel pre-trends, linearity) +- `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse: Step 1 + Step 3 only, verdict explicitly flags Step 2 as deferred +- Upgrade to `did_had_pretest_workflow(aggregate="event_study")` on the multi-week panel: adds the joint pre-trends Stute and joint homogeneity Stute diagnostics (none of the three testable steps reject) +- Side panel comparing `yatchew_hr_test` `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) +- Companion drift-test file (`tests/test_t21_had_pretest_workflow_drift.py`) + ## Running the Notebooks 1. Install diff-diff with dependencies: diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py new file mode 100644 index 00000000..b4142f07 --- /dev/null +++ b/tests/test_t21_had_pretest_workflow_drift.py @@ -0,0 +1,356 @@ +"""Drift detection for Tutorial 21 (`docs/tutorials/21_had_pretest_workflow.ipynb`). + +The tutorial narrative quotes seed-specific numbers (overall verdict +substring, QUG / Stute / Yatchew p-values, joint pre-trends and homogeneity +horizon counts and p-values, Yatchew side-panel statistics under both null +modes). If library numerics drift (estimator changes, RNG path changes, +BLAS path changes), the prose can go stale silently while `pytest --nbmake` +still passes - it only checks that the cells execute without error. + +These asserts re-derive the same numbers using the locked T21 DGP and seeds +the notebook uses, then check them against the values quoted in the +tutorial markdown. If a future change moves any number outside its +tolerance band, this test fails and a maintainer is forced to either +update the prose or investigate the methodology shift before merge. + +T21 DGP differs from T20: dose distribution is `Uniform[$0.01K, $50K]` +(was `[$5K, $50K]` in T20). The true support is strictly positive but very +near zero. 
Two independent things follow from that small `D_(1)` and are +exercised in this drift file: (a) the QUG step fails-to-reject +`H0: d_lower = 0` in this finite sample, populating the workflow's verdict +with the "Assumption 7 deferred" substring used for the upgrade-arc +narrative; and (b) HAD's `design="auto"` selector - a separate min/median +heuristic that does NOT consume the QUG p-value - independently lands on +`continuous_at_zero` because `d.min() < 0.01 * median(|d|)` (per +`_detect_design()` in `had.py`). Both checks point to the same +identification path on this panel, but the rules are independent. +DGP and seed locked at `_scratch/t21_pretests/10_panel.py`. +Quoted numbers derived from `_scratch/t21_pretests/50_compose_narrative.py`. + +Bootstrap p-value pins use **abs tolerance bands >= 0.15** per +`feedback_bootstrap_drift_tests_need_backend_tolerance` (Rust vs pure-Python +RNG paths can diverge by ~0.05-0.15 and flip rounding boundaries). +Deterministic statistics (QUG, Yatchew sigma2_*) get exact `round(..., 2)` +or `round(..., 4)` pins. +""" + +from __future__ import annotations + +import warnings + +import numpy as np +import pytest + +from diff_diff import HAD, did_had_pretest_workflow, generate_continuous_did_data, yatchew_hr_test + +# Locked T21 DGP parameters (must stay in sync with the notebook). +MAIN_SEED = 87 +N_UNITS = 60 +N_PERIODS = 8 +COHORT_PERIOD = 5 +TRUE_SLOPE = 100.0 +BASELINE_VISITS = 5000.0 +DOSE_LOW = 0.01 # T21 change vs T20 (was 5.0): near-zero lower bound chosen so QUG fails-to-reject H0: d_lower = 0. +DOSE_HIGH = 50.0 +WORKFLOW_SEED = 21 + + +@pytest.fixture(scope="module") +def panel(): + raw = generate_continuous_did_data( + n_units=N_UNITS, + n_periods=N_PERIODS, + cohort_periods=[COHORT_PERIOD], + never_treated_frac=0.0, + dose_distribution="uniform", + dose_params={"low": DOSE_LOW, "high": DOSE_HIGH}, + att_function="linear", + att_intercept=0.0, + att_slope=TRUE_SLOPE, + unit_fe_sd=8.0, + time_trend=0.5, + noise_sd=2.0, + seed=MAIN_SEED, + ) + p = raw.copy() + p.loc[p["period"] < p["first_treat"], "dose"] = 0.0 + p = p.rename( + columns={ + "unit": "dma_id", + "period": "week", + "outcome": "weekly_visits", + "dose": "regional_spend_k", + } + ) + p["weekly_visits"] = p["weekly_visits"] + BASELINE_VISITS + return p + + +@pytest.fixture(scope="module") +def two_period(panel): + p = panel.copy() + p["period"] = (p["week"] >= COHORT_PERIOD).astype(int) + 1 + collapsed = p.groupby(["dma_id", "period"], as_index=False).agg( + weekly_visits=("weekly_visits", "mean"), + regional_spend_k=("regional_spend_k", "mean"), + ) + collapsed.loc[collapsed["period"] == 1, "regional_spend_k"] = 0.0 + collapsed["first_treat"] = 2 + return collapsed + + +@pytest.fixture(scope="module") +def overall_report(two_period): + return did_had_pretest_workflow( + data=two_period, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="period", + unit_col="dma_id", + first_treat_col="first_treat", + alpha=0.05, + n_bootstrap=999, + seed=WORKFLOW_SEED, + aggregate="overall", + ) + + +@pytest.fixture(scope="module") +def event_study_report(panel): + return did_had_pretest_workflow( + data=panel, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="week", + unit_col="dma_id", + first_treat_col="first_treat", + alpha=0.05, + n_bootstrap=999, + seed=WORKFLOW_SEED, + aggregate="event_study", + ) + + +@pytest.fixture(scope="module") +def yatchew_side_panel_inputs(panel): + """Section 5's Yatchew side panel: post-period dose paired with the + 
within-pre-period first-difference dy = Y[w4] - Y[w3]. Shared + construction between the linearity-mode and mean_independence-mode + tests below.""" + panel_sorted = panel.sort_values(["dma_id", "week"]).reset_index(drop=True) + pre = panel_sorted[panel_sorted["week"].isin([3, 4])] + pre_pivot = pre.pivot(index="dma_id", columns="week", values="weekly_visits") + dy = (pre_pivot[4] - pre_pivot[3]).to_numpy(dtype=np.float64) + post_dose = ( + panel_sorted[panel_sorted["week"] == 5] + .set_index("dma_id") + .sort_index()["regional_spend_k"] + .to_numpy(dtype=np.float64) + ) + return post_dose, dy + + +def test_panel_matches_t21_locked_dgp(panel): + """Section 2 narrative claims 60 DMAs over 8 weeks, regional spend + drawn from Uniform[$0.01K, $50K] - true support strictly positive + but very near zero (so QUG can fail-to-reject in this finite + sample). If the DGP drifts, this surfaces.""" + assert panel["dma_id"].nunique() == N_UNITS + assert panel["week"].nunique() == N_PERIODS + post_doses = ( + panel.loc[panel["week"] >= COHORT_PERIOD].groupby("dma_id")["regional_spend_k"].first() + ) + assert post_doses.min() >= DOSE_LOW, post_doses.min() + assert post_doses.max() <= DOSE_HIGH, post_doses.max() + # T21 narrative says "starts from $10" - i.e. the smallest dose is + # below $1K (~$180 from numbers.json: d_order_1 = 0.180569...). + assert post_doses.min() < 1.0, post_doses.min() + + +def test_overall_verdict_flags_assumption_7_deferred(overall_report): + """Load-bearing pivot for the upgrade-arc narrative. Sections 3-4 + of the notebook quote this verdict substring verbatim. If + `_compose_verdict()` is refactored such that the substring changes + or moves, this test surfaces it.""" + pivot = "Assumption 7 pre-trends test NOT run" + assert pivot in overall_report.verdict, overall_report.verdict + # Adjacent pivot the prose also quotes: + assert ( + "paper step 2 deferred to Phase 3 follow-up" in overall_report.verdict + ), overall_report.verdict + + +def test_overall_path_structural_anchors(overall_report): + """Notebook Section 3 prose claims `pretrends_joint` and + `homogeneity_joint` are both None on the overall path (they are + not populated on the two-period dispatch). Sturdier than a + verdict-string anchor against future verdict refactors.""" + assert overall_report.aggregate == "overall" + assert overall_report.pretrends_joint is None + assert overall_report.homogeneity_joint is None + assert overall_report.all_pass is True + + +def test_overall_qug_fails_to_reject(overall_report): + """Section 3 narrative claims QUG fails-to-reject H0: d_lower = 0 + (data are statistically consistent with continuous_at_zero design). + QUG is fully deterministic; pin exact rounded values. The independent + HAD `design="auto"` selector decision is locked separately by + `test_had_design_auto_lands_on_continuous_at_zero`.""" + assert overall_report.qug.reject is False + # T statistic = D_(1) / (D_(2) - D_(1)) is fully deterministic. + assert round(overall_report.qug.t_stat, 2) == 3.86, overall_report.qug.t_stat + assert round(overall_report.qug.critical_value, 1) == 19.0, overall_report.qug.critical_value + + +def test_overall_stute_fails_to_reject(overall_report): + """Section 3 narrative quotes Stute p_value ~0.686. Stute uses + Mammen wild bootstrap so the p-value is RNG-dependent; use a + bounded abs tolerance band per + `feedback_bootstrap_drift_tests_need_backend_tolerance` (>= 0.15 + width). 
Both bounds tight enough to catch methodology drift in + either direction, loose enough for backend RNG path differences.""" + assert overall_report.stute.reject is False + assert 0.53 <= overall_report.stute.p_value <= 0.84, overall_report.stute.p_value + + +def test_overall_yatchew_fails_to_reject(overall_report): + """Section 3 narrative + cell 9 callout describe the very large + negative Yatchew T_hr (~-35,000) under perfect linearity with + heterogeneous doses. Pin sigma2_* (deterministic) and the + rejection decision.""" + assert overall_report.yatchew.reject is False + assert overall_report.yatchew.p_value > 0.99, overall_report.yatchew.p_value + # sigma2_diff is deterministic given the panel. + assert ( + round(overall_report.yatchew.sigma2_diff, 0) == 6250.0 + ), overall_report.yatchew.sigma2_diff + + +def test_event_study_verdict_says_admissible(event_study_report): + """Sections 4-5 narrative claims the event-study verdict reads + 'TWFE admissible under Section 4 assumptions' (no `deferred` + caveat). Locks the upgrade-arc closure pivot.""" + assert "TWFE admissible" in event_study_report.verdict, event_study_report.verdict + assert "deferred" not in event_study_report.verdict, event_study_report.verdict + + +def test_event_study_path_structural_anchors(event_study_report): + """Section 4 narrative claims `pretrends_joint` and + `homogeneity_joint` are both populated on the event-study path + (the upgrade arc closure). Mirror of the overall path's negative + structural anchor.""" + assert event_study_report.aggregate == "event_study" + assert event_study_report.pretrends_joint is not None + assert event_study_report.homogeneity_joint is not None + assert event_study_report.all_pass is True + + +def test_event_study_qug_matches_overall(event_study_report, overall_report): + """Section 4 narrative claims QUG re-runs deterministically with + the same numbers as the overall path (same dose distribution at + F).""" + assert event_study_report.qug.reject is overall_report.qug.reject + assert round(event_study_report.qug.t_stat, 4) == round(overall_report.qug.t_stat, 4) + + +def test_event_study_pretrends_horizons_correct(event_study_report): + """Section 4 narrative claims `joint_pretrends_test` runs over 3 + horizons (pre-periods 1, 2, 3, with week 4 reserved as the base + period). Locks the earlier-pre-period precondition closure + (PR #402 R7) for T21's specific panel: F=5, t_pre={1,2,3,4}, + base=4, earlier pre-periods={1,2,3}.""" + pj = event_study_report.pretrends_joint + assert pj is not None + assert pj.n_horizons == 3, pj.n_horizons + assert pj.horizon_labels == ["1", "2", "3"], pj.horizon_labels + + +def test_event_study_homogeneity_horizons_correct(event_study_report): + """Section 4 narrative claims `joint_homogeneity_test` runs over 4 + post horizons (weeks 5, 6, 7, 8).""" + hj = event_study_report.homogeneity_joint + assert hj is not None + assert hj.n_horizons == 4, hj.n_horizons + assert hj.horizon_labels == ["5", "6", "7", "8"], hj.horizon_labels + + +def test_event_study_pretrends_fails_to_reject(event_study_report): + """Section 4 narrative quotes the pre-trends p-value as 'close to + alpha = 0.05 ... warrants scrutiny' (~0.07 from numbers.json) - + non-rejection is not a clean pass at this margin. 
Use binary + fail-to-reject + a wide abs tolerance band - bootstrap p-values + near alpha are the most sensitive to RNG path differences.""" + pj = event_study_report.pretrends_joint + assert pj is not None + assert pj.reject is False + # Tight upper bound to catch a regression that pushes pre-trends to look + # pristine (which would belie the "close to alpha" narrative); drift toward + # rejection is already caught by the `reject is False` assert above. + assert 0.0 <= pj.p_value <= 0.25, pj.p_value + + + def test_event_study_homogeneity_fails_to_reject(event_study_report): + """Section 4 narrative claims joint homogeneity strongly fails to + reject and quotes p ~0.763 from numbers.json. Use a bounded abs + tolerance band per + `feedback_bootstrap_drift_tests_need_backend_tolerance` so that + drift in either direction (toward rejection or toward an even + cleaner pass) flags the prose as stale rather than silently + passing.""" + hj = event_study_report.homogeneity_joint + assert hj is not None + assert hj.reject is False + assert 0.61 <= hj.p_value <= 0.92, hj.p_value + + + def test_had_design_auto_lands_on_continuous_at_zero(two_period): + """Section 2 narrative claims HAD's `design="auto"` selector + independently lands on `continuous_at_zero` (target = `WAS`) on + this panel because `d.min() < 0.01 * median(|d|)`. This is a + separate decision rule from the QUG test (locked by + `test_overall_qug_fails_to_reject`); the two happen to agree on + this panel but the rules are independent. We fit HAD with + `design="auto"` here just to verify the prose.""" + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=UserWarning) + est = HAD(design="auto") + result = est.fit( + two_period, + outcome_col="weekly_visits", + dose_col="regional_spend_k", + time_col="period", + unit_col="dma_id", + ) + assert result.design == "continuous_at_zero", result.design + assert result.target_parameter == "WAS", result.target_parameter + + + def test_yatchew_side_panel_linearity_passes(yatchew_side_panel_inputs): + """Section 5 (Yatchew side panel) narrative claims `null="linearity"` + does not reject on the within-pre-period first-difference paired + with post-period dose. Pin the T_hr statistic (deterministic); + Yatchew has no bootstrap component.""" + post_dose, dy = yatchew_side_panel_inputs + res = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="linearity") + assert res.reject is False + assert res.null_form == "linearity" + assert round(res.t_stat_hr, 2) == 0.02, res.t_stat_hr + assert round(res.sigma2_lin, 2) == 6.53, res.sigma2_lin + + + def test_yatchew_side_panel_mean_independence_passes(yatchew_side_panel_inputs): + """Section 5 narrative claims `null="mean_independence"` does not + reject on the same input but with larger sigma2_lin (the stricter + null has more residual variance to explain).""" + post_dose, dy = yatchew_side_panel_inputs + res_mi = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="mean_independence") + res_lin = yatchew_hr_test(d=post_dose, dy=dy, alpha=0.05, null="linearity") + assert res_mi.reject is False + assert res_mi.null_form == "mean_independence" + assert round(res_mi.t_stat_hr, 2) == 0.55, res_mi.t_stat_hr + assert round(res_mi.sigma2_lin, 2) == 7.01, res_mi.sigma2_lin + # Pedagogical claim from Section 5: stricter null -> larger sigma2_lin. + assert res_mi.sigma2_lin > res_lin.sigma2_lin + # And the differencing variance (sigma2_diff) is shared across modes. + assert round(res_mi.sigma2_diff, 4) == round(res_lin.sigma2_diff, 4)
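One possible follow-up on the tolerance-band policy quoted in this file's module docstring: if more notebooks adopt the ">= 0.15 wide" absolute-band convention, the hand-written bounds could be centralized. A hypothetical sketch only (`p_value_band` and `assert_p_in_band` are illustrative names, not part of this PR), with a half-width chosen so two of the bands above (0.53-0.84 around ~0.686, 0.61-0.92 around ~0.763) fall out of the same rule:

```python
def p_value_band(center: float, half_width: float = 0.155) -> tuple[float, float]:
    """Absolute tolerance band around a quoted bootstrap p-value, clipped to [0, 1]."""
    return max(0.0, center - half_width), min(1.0, center + half_width)


def assert_p_in_band(p: float, center: float, half_width: float = 0.155) -> None:
    lo, hi = p_value_band(center, half_width)
    assert lo <= p <= hi, f"p={p:.4f} outside [{lo:.3f}, {hi:.3f}] around {center}"


# e.g. two of the bootstrap pins above, expressed against the quoted centers:
# assert_p_in_band(overall_report.stute.p_value, 0.686)   # ~ the 0.53-0.84 band
# assert_p_in_band(event_study_report.homogeneity_joint.p_value, 0.763)  # ~ 0.61-0.92
```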