diff --git a/CHANGELOG.md b/CHANGELOG.md index 663a6b60..f7eca04f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,11 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [Unreleased] + +### Added +- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. (2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). 
Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` signatures match the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API exactly (verified by `inspect.signature`-based regression tests). Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs. + ## [3.3.2] - 2026-04-26 ### Added diff --git a/TODO.md b/TODO.md index 73b0251d..17d2659e 100644 --- a/TODO.md +++ b/TODO.md @@ -109,7 +109,7 @@ Deferred items from PR reviews that were not addressed before merge. | `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). 
Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low | -| `HeterogeneousAdoptionDiD` Phase 5: `practitioner_next_steps()` integration, tutorial notebook, and `llms-full.txt` HeterogeneousAdoptionDiD section (preserving UTF-8 fingerprint). README catalog + bundled `llms.txt` entry + `docs/api/had.rst` + `docs/references.rst` citation landed in PR #372 docs refresh. | `diff_diff/practitioner.py`, `tutorials/`, `diff_diff/guides/llms-full.txt` | Phase 2a | Low | +| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorials (T21 HAD pretest workflow notebook + T22 weighted/survey HAD tutorial). `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1. | `tutorials/`, `tests/test_t21_*_drift.py`, `tests/test_t22_*_drift.py` | Phase 2a | Low | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. 
| `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low | | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium | | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low | diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index e4f161c3..7e66f9b3 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -590,6 +590,75 @@ results = est.fit(data, outcome='outcome', unit='unit', time='period', results.print_summary() ``` +### HeterogeneousAdoptionDiD + +Heterogeneous Adoption DiD estimator (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). 
Targets a Weighted Average Slope (WAS) at the dose support boundary on **Heterogeneous Adoption Designs** — designs where treatment varies in dose intensity across units, so the comparison comes from dose variation rather than from an untreated holdout. The estimator does NOT require dropping never-treated units: a small share of never-treated units is fully compatible (paper edge case — Garrett et al. 2020 retained 12 untreated counties out of 2,954), and on staggered event-study panels never-treated units are explicitly retained as the untreated-group comparison (paper Appendix B.2). Uses a bias-corrected local-linear estimator at the dose support boundary on continuous-dose designs (Design 1' / Design 1) and a 2SLS Wald-IV estimator on the mass-point design. + +```python +HeterogeneousAdoptionDiD( + design: str = "auto", # "auto" / "continuous_at_zero" / "continuous_near_d_lower" / "mass_point" + d_lower: float | None = None, # Support infimum; auto-detected when None + kernel: str = "epanechnikov", # Local-linear kernel + alpha: float = 0.05, + vcov_type: str | None = None, # Mass-point only: "classical" (default) or "hc1" + robust: bool = False, # Mass-point only: HC1 robust SE shorthand + cluster: str | None = None, # Mass-point only: cluster column for CR1 cluster-robust SE + n_bootstrap: int = 999, # Multiplier-bootstrap iterations for sup-t bands (event-study + weighted) + seed: int | None = None, +) +``` + +**Alias:** `HAD` + +**fit() parameters:** + +```python +had.fit( + data: pd.DataFrame, + outcome_col: str, + dose_col: str, + time_col: str, + unit_col: str, + first_treat_col: str | None = None, # Required on staggered panels (last-cohort auto-filter trigger) + aggregate: str = "overall", # "overall" (single scalar WAS) or "event_study" (per-horizon WAS) + survey: SurveyDesign | None = None, # DEPRECATED alias of survey_design= + weights: np.ndarray | None = None, # DEPRECATED pweight shortcut alias + cband: bool = True, # Simultaneous (sup-t) confidence bands on weighted 
event-study fits + *, + survey_design: SurveyDesign | None = None, # Canonical survey-design kwarg (weights, strata, PSU, FPC) + trends_lin: bool = False, # Eq 17 linear-trend detrending (event-study; mutually exclusive with survey_design) +) -> HeterogeneousAdoptionDiDResults | HeterogeneousAdoptionDiDEventStudyResults +``` + +**Usage:** + +```python +from diff_diff import HeterogeneousAdoptionDiD, did_had_pretest_workflow + +# Vet the testable identifying assumptions first: +report = did_had_pretest_workflow( + data, outcome_col='y', unit_col='unit', time_col='t', + dose_col='d', first_treat_col='first_treat') +print(report.summary()) + +# Single-period scalar WAS (aggregate="overall" default): +est = HeterogeneousAdoptionDiD() +results = est.fit(data, outcome_col='y', unit_col='unit', + time_col='t', dose_col='d', + first_treat_col='first_treat') +print(results.summary()) + +# Multi-period per-horizon WAS: +es = est.fit(data, outcome_col='y', unit_col='unit', + time_col='t', dose_col='d', + first_treat_col='first_treat', + aggregate='event_study') +``` + +**Staggered panels.** On multi-cohort panels with `aggregate="event_study"`, `fit()` auto-filters to the last treatment cohort plus never-treated units (paper Appendix B.2) and emits a `UserWarning` naming kept/dropped counts. The estimand is then a **last-cohort-only WAS**, not a multi-cohort average. For full multi-cohort staggered support, see `ChaisemartinDHaultfoeuille`. + +**Mass-point + survey constraint.** When fitting `design="mass_point"` with `survey_design=` (or the deprecated `survey=` alias), `vcov_type="hc1"` (or `robust=True`) is required: the survey path composes the standard error via Binder-TSL on the HC1-scale influence function, so the default classical sandwich path raises `NotImplementedError`. Passing `vcov_type="hc1"` is a safe default on weighted survey examples since `vcov_type` is unused on the continuous designs (CCT-2014 robust SE is the only formula there). 
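As intuition for the mass-point Wald-IV path described above, here is a minimal numpy sketch of the estimand only — a hypothetical helper, NOT the `diff_diff` implementation, which runs this as 2SLS with HC1/CR1 standard errors and the β-scale rescaling. With `Z = 1{D > d_lower}`, the slope is the gap in first-differenced outcomes divided by the dose gap between the two groups, so any common shock differences out:

```python
import numpy as np

def wald_iv_was(dy: np.ndarray, d: np.ndarray, d_lower: float) -> float:
    """Wald-IV slope sketch: (E[dY|Z=1] - E[dY|Z=0]) / (E[D|Z=1] - E[D|Z=0]).

    Hypothetical illustration of the mass-point estimand; no inference.
    """
    z = d > d_lower  # instrument: strictly above the dose mass point
    if not z.any() or z.all():
        raise ValueError("need units both at d_lower and strictly above it")
    return float((dy[z].mean() - dy[~z].mean()) / (d[z].mean() - d[~z].mean()))

# Two dose mass points {1.0, 3.0}; per-dose slope 2.0 plus a common shock 0.5.
d = np.array([1.0, 1.0, 3.0, 3.0])
dy = 0.5 + 2.0 * d  # first-differenced outcomes Y_2 - Y_1
print(wald_iv_was(dy, d, d_lower=1.0))  # the 0.5 shock cancels -> 2.0
```

The common-shock term (0.5 here) cancels in the numerator because both groups experience it, which is the sense in which the `Z = 0` group at the mass point serves as the comparison.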
+ ### StackedDiD Stacked DiD estimator (Wing, Freedman & Hollingsworth 2024). Addresses TWFE bias with corrective Q-weights. @@ -1157,6 +1226,76 @@ Each event study effect dict contains: `effect`, `se`, `t_stat`, `p_value`, `con **Methods:** `summary()`, `print_summary()`, `to_dataframe()` +### HeterogeneousAdoptionDiDResults + +Single-period results container for `HeterogeneousAdoptionDiD`. The table below enumerates every public dataclass field; a regression test in `tests/test_guides.py` (`test_llms_full_had_results_class_field_lists_match_real_dataclass`) compares this list against the real `dataclasses.fields()` of the result class. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `att` | `float` | Point estimate of the WAS parameter on the β-scale | +| `se` | `float` | Standard error on the β-scale | +| `t_stat` | `float` | T-statistic | +| `p_value` | `float` | P-value | +| `conf_int` | `tuple[float, float]` | Confidence interval | +| `alpha` | `float` | CI level used at fit time | +| `design` | `str` | Resolved design: `"continuous_at_zero"`, `"continuous_near_d_lower"`, or `"mass_point"` | +| `target_parameter` | `str` | `"WAS"` (Design 1') or `"WAS_d_lower"` (Design 1 / mass-point) | +| `d_lower` | `float` | Support infimum (`0.0` on Design 1', `min(d)` otherwise) | +| `dose_mean` | `float` | `D_bar = (1/G) * sum(D_{g,2})` | +| `n_obs` | `int` | Units contributing to estimation | +| `n_treated` | `int` | Units with `D > d_lower` | +| `n_control` | `int` | Units at or below `d_lower` | +| `n_mass_point` | `int | None` | Mass-point design only: units exactly at `d_lower`; `None` on continuous designs | +| `n_above_d_lower` | `int | None` | Mass-point design only: units strictly above `d_lower`; `None` on continuous designs | +| `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` | +| `vcov_type` | `str | None` | Mass-point only: `"classical"`, `"hc1"`, or `"cr1"` | +| `cluster_name` | `str | None` | 
Cluster column name when CR1 cluster-robust SE is requested; `None` otherwise | +| `survey_metadata` | `SurveyMetadata | None` | Repo-standard survey metadata when `survey_design=` / `weights=` is supplied | +| `bandwidth_diagnostics` | `BandwidthResult | None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` | +| `bias_corrected_fit` | `BiasCorrectedFit | None` | Phase 1c bias-corrected local-linear fit object (continuous designs); `None` on `mass_point` | +| `variance_formula` | `str | None` | HAD-specific SE label on weighted fits, populated on BOTH continuous and mass-point designs: `"pweight"` (continuous, CCT 2014 weighted-robust on the `weights=` shortcut), `"survey_binder_tsl"` (continuous, Binder 1983 TSL on the `survey_design=` path), `"pweight_2sls"` (mass-point, weighted 2SLS HC1 / CR1 sandwich on the `weights=` shortcut), or `"survey_binder_tsl_2sls"` (mass-point, Binder 1983 TSL on the `survey_design=` path). `None` on unweighted fits | +| `effective_dose_mean` | `float | None` | Weighted denominator used by the β̂-scale rescaling, populated on weighted fits across all designs: weighted `mean(d)` (`continuous_at_zero`), weighted `mean(d − d_lower)` (`continuous_near_d_lower`), or weighted Wald-IV dose gap `mean(d | Z=1, w) − mean(d | Z=0, w)` (`mass_point`). `None` on unweighted fits | + +**Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()` + +### HeterogeneousAdoptionDiDEventStudyResults + +Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `aggregate="event_study"`. The anchor horizon `e = -1` is excluded by construction. The table below enumerates every public dataclass field; a regression test (`test_llms_full_had_results_class_field_lists_match_real_dataclass`) compares this list against the real `dataclasses.fields()`. 
+ +| Attribute | Type | Description | +|-----------|------|-------------| +| `event_times` | `np.ndarray` | Integer event-time labels `e = t - F`, sorted ascending | +| `att` | `np.ndarray` | Per-horizon WAS point estimates | +| `se` | `np.ndarray` | Per-horizon standard errors | +| `t_stat` | `np.ndarray` | Per-horizon t-statistics | +| `p_value` | `np.ndarray` | Per-horizon p-values | +| `conf_int_low` | `np.ndarray` | Pointwise CI lower bounds | +| `conf_int_high` | `np.ndarray` | Pointwise CI upper bounds | +| `n_obs_per_horizon` | `np.ndarray` | Per-horizon contributing-unit counts | +| `alpha` | `float` | CI level used at fit time | +| `design` | `str` | Shared across horizons (paper Appendix B.2 invariant) | +| `target_parameter` | `str` | Same convention as the single-period result | +| `d_lower` | `float` | Support infimum, shared across horizons | +| `dose_mean` | `float` | `D_bar` on the fit sample | +| `F` | `object` | First-treatment period label | +| `n_units` | `int` | Unique units contributing to the fit (post last-cohort filter) | +| `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` | +| `vcov_type` | `str | None` | Mass-point only: `"classical"`, `"hc1"`, or `"cr1"`; `None` on continuous designs | +| `cluster_name` | `str | None` | Cluster column name when CR1 is requested; `None` otherwise | +| `survey_metadata` | `SurveyMetadata | None` | Populated on weighted fits | +| `bandwidth_diagnostics` | `list[BandwidthResult | None] | None` | Per-horizon MSE-DPI selector output (continuous designs); `None` on `mass_point`; entries can be `None` on degenerate horizons | +| `bias_corrected_fit` | `list[BiasCorrectedFit | None] | None` | Per-horizon Phase 1c bias-corrected local-linear fit objects; `None` on `mass_point`; entries can be `None` on degenerate horizons | +| `filter_info` | `dict | None` | Staggered last-cohort auto-filter metadata (`F_last`, `n_kept`, `n_dropped`, `dropped_cohorts`); `None` when no filter 
applied | +| `variance_formula` | `str | None` | Per-horizon variance family label | +| `effective_dose_mean` | `float | None` | Weighted denominator | +| `cband_low` | `np.ndarray | None` | Simultaneous (sup-t) band lower bounds; `None` on unweighted fits or when `cband=False` | +| `cband_high` | `np.ndarray | None` | Simultaneous (sup-t) band upper bounds | +| `cband_crit_value` | `float | None` | Sup-t critical value used for the simultaneous band | +| `cband_method` | `str | None` | `"multiplier_bootstrap"` when populated | +| `cband_n_bootstrap` | `int | None` | Bootstrap iterations used for the band | + +**Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()` + ### TROPResults | Attribute | Type | Description | @@ -1265,6 +1404,43 @@ did = DifferenceInDifferences(inference="wild_bootstrap", n_bootstrap=999, results = did.fit(data, outcome='y', treatment='treated', time='post') ``` +## HAD Pretests + +Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal. The workflow follows paper Section 4.2's three-step battery: **step 1** is the QUG support-infimum test (decides whether Design 1' or Design 1 applies); **step 2** is the Assumption 7 pre-trends test (joint Stute on the event-study path; explicitly NOT covered on the overall path because a single-pre-period panel cannot support the joint variant); **step 3** is the Assumption 8 linearity test (`stute_test` or `yatchew_hr_test`). On the default `aggregate="overall"` path the workflow runs steps 1 + 3 only and the returned `verdict` flags the Assumption 7 gap; pass `aggregate="event_study"` on a multi-period panel **with at least one earlier placebo pre-period beyond the base `F-1`** to close that gap. 
With only the base `F-1` pre-period available (minimal 3-period event-study, or 4-period under `trends_lin=True` where the consumed `F-2` placebo is dropped), the workflow still sets `pretrends_joint=None`, `all_pass=False`, and appends `joint pre-trends skipped (no earlier pre-period)` to the verdict — step 2 stays uncovered. + +```python +from diff_diff import ( + did_had_pretest_workflow, + qug_test, stute_test, yatchew_hr_test, + stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, +) + +# Composite workflow: +# aggregate="overall" -> steps 1 + 3 (QUG + Assumption 8 linearity) +# step 2 (Assumption 7 pre-trends) NOT covered; +# verdict explicitly flags this gap. +# aggregate="event_study" -> steps 1 + 2 + 3 (QUG + joint Stute pre-trends + +# joint homogeneity-linearity Stute) on multi-period panels. +report = did_had_pretest_workflow( + data, outcome_col='y', unit_col='unit', time_col='t', + dose_col='d', first_treat_col='first_treat', + aggregate='overall', + survey_design=None) # SurveyDesign for survey-aware pretests (Phase 4.5 C) +print(report.summary()) +print(report.all_pass, report.verdict) +``` + +Individual tests: + +- `qug_test(d)` — paper Theorem 4 support-infimum test (`H_0: d_lower = 0`; the QUG decides whether Design 1' or Design 1 applies in step 1 of the workflow). Extreme order statistics, Exp(1)/Exp(1) limit law. The QUG itself does NOT test Assumption 5 (which is the Design 1 sign-identification condition and is not testable via pre-trends per registry). **Permanently rejects** non-`None` `survey_design=` / `weights=` (`NotImplementedError`) per Phase 4.5 C0 deferral — extreme-value functionals are not smooth in the empirical CDF, so standard survey machinery does not yield a calibrated test. +- `stute_test(d, dy)` — Assumption 8 linearity of `E[ΔY|D]` (paper Section 4.2 step 3) via Stute Cramér-von Mises functional with Mammen wild bootstrap. Survey-aware via PSU-level Mammen multiplier bootstrap. 
+- `yatchew_hr_test(d, dy, *, null="linearity")` — Assumption 8 linearity of `E[ΔY|D]` (alternative test for step 3) via Yatchew (1997) heteroskedasticity-robust variance-ratio test. The `null="mean_independence"` mode (R `YatchewTest::yatchew_test(order=0)`) is also exposed for placebo-style mean-independence testing. Survey-aware via closed-form weighted variance components (no bootstrap). +- `stute_joint_pretest(residuals_dict, d)` — joint Cramér-von Mises across K horizons with shared-η Mammen wild bootstrap (Delgado-Manteiga 2001 / Hlávka-Hušková 2020). Residuals-in core; the two data-in wrappers below construct residuals for the two paper-spelled nulls. +- `joint_pretrends_test(...)` — Assumption 7 joint pre-trends on K pre-periods (paper Section 4.2 step 2 closure on the event-study path). +- `joint_homogeneity_test(...)` — joint linearity-and-homogeneity on K post-periods (event-study step 3 alternative). + +The QUG-under-survey deferral is permanent; the linearity-family pretests support `survey_design=` (pweight, PSU, FPC) per Phase 4.5 C. Stratified designs and replicate-weight designs are deferred to follow-up PRs. + ## Honest DiD Sensitivity Analysis Rambachan & Roth (2023) robust inference allowing bounded parallel trends violations. 
@@ -1734,7 +1910,8 @@ DIFF_DIFF_BACKEND=rust pytest # Force Rust (fail if unavailable) | Staggered treatment timing | `CallawaySantAnna`, `ImputationDiD`, or `SunAbraham` | | Few treated units / synthetic control | `SyntheticDiD` | | Interactive fixed effects / factor confounding | `TROP` | -| Continuous treatment intensity | `ContinuousDiD` | +| Continuous treatment intensity, per-dose ATT(d) / ACRT(d) (requires never-treated controls) | `ContinuousDiD` | +| Continuous treatment intensity, WAS at dose support boundary (compatible with universal rollout or small never-treated share) | `HeterogeneousAdoptionDiD` | | Two-criterion treatment, simultaneous (2x2x2 DDD) | `TripleDifference` | | Two-criterion treatment, staggered timing + eligibility | `StaggeredTripleDifference` | | Nonlinear outcome (binary/count) with staggered timing | `WooldridgeDiD` | diff --git a/diff_diff/guides/llms-practitioner.txt b/diff_diff/guides/llms-practitioner.txt index acb0adaa..c853c2b7 100644 --- a/diff_diff/guides/llms-practitioner.txt +++ b/diff_diff/guides/llms-practitioner.txt @@ -158,7 +158,14 @@ Is this a triple-difference (DDD) design? (Two criteria: e.g., policy + eligibil |-- YES, staggered timing: StaggeredTripleDifference (SDDD) | Is treatment continuous (doses/intensities)? -|-- YES: ContinuousDiD (CDiD) +|-- YES, panel has never-treated units (some units with first_treat == 0, +| i.e. dose == 0 throughout): ContinuousDiD (CDiD) for per-dose +| ATT(d) / ACRT(d) dose-response curves +|-- YES, no never-treated units (universal rollout — every unit treated +| at some positive dose): HeterogeneousAdoptionDiD (HAD) for +| Weighted Average Slope (WAS) at the dose support boundary. +| HAD is also compatible with a small never-treated share if +| the WAS estimand is what you want. | Is treatment adoption staggered (multiple cohorts, different timing)? |-- YES: Do NOT use plain TWFE. 
Use one of: diff --git a/diff_diff/had.py b/diff_diff/had.py index 2819cd09..6a717bcb 100644 --- a/diff_diff/had.py +++ b/diff_diff/had.py @@ -345,25 +345,31 @@ class HeterogeneousAdoptionDiDResults: # Phase 4.5 weighted-path extras (optional so unweighted fits stay unchanged) variance_formula: Optional[str] = None - """HAD-specific label for the SE formula on the weighted continuous - path: ``"pweight"`` (weighted-robust CCT 2014) under ``weights=``, - ``"survey_binder_tsl"`` (Binder 1983 TSL with PSU/strata/FPC) under - ``survey=SurveyDesign(...)``, ``None`` on unweighted or mass-point - fits. Orthogonal to ``survey_metadata`` which is the repo-standard - :class:`diff_diff.survey.SurveyMetadata` shared with downstream - report/diagnostic consumers (no HAD-specific leakage).""" + """HAD-specific label for the SE formula on weighted fits, populated + on BOTH continuous and mass-point designs (Phase 4.5 A / B): + ``"pweight"`` (continuous, weighted-robust CCT 2014 under the + ``weights=`` shortcut), ``"survey_binder_tsl"`` (continuous, Binder + 1983 TSL with PSU/strata/FPC under ``survey_design=SurveyDesign(...)``), + ``"pweight_2sls"`` (mass-point, weighted 2SLS HC1/CR1 sandwich + under the ``weights=`` shortcut), or ``"survey_binder_tsl_2sls"`` + (mass-point, Binder 1983 TSL under ``survey_design=``). ``None`` on + unweighted fits. Orthogonal to ``survey_metadata`` which is the + repo-standard :class:`diff_diff.survey.SurveyMetadata` shared with + downstream report/diagnostic consumers (no HAD-specific leakage).""" effective_dose_mean: Optional[float] = None - """Weighted denominator used by the beta-scale rescaling on the - continuous path: ``sum(w_g · D_g) / sum(w_g)`` for - ``continuous_at_zero`` or ``sum(w_g · (D_g - d_lower)) / sum(w_g)`` - for ``continuous_near_d_lower``. Reduces bit-exactly to - ``dose_mean`` / ``mean(D - d_lower)`` when weights are uniform or - absent. 
``None`` when ``fit()`` was called without - ``survey=`` / ``weights=`` (use ``dose_mean`` there). Exists because - ``dose_mean`` is the raw sample mean of the dose column; under - weighted fits the estimator's actual denominator is the weighted - mean, and users reconstructing the β-scale value by hand need the - weighted one.""" + """Weighted denominator used by the beta-scale rescaling, populated + on weighted fits across all designs: ``sum(w_g · D_g) / sum(w_g)`` + on ``continuous_at_zero``, ``sum(w_g · (D_g - d_lower)) / sum(w_g)`` + on ``continuous_near_d_lower``, and the weighted Wald-IV dose gap + ``mean(D | Z=1, w) - mean(D | Z=0, w)`` on ``mass_point`` (where + ``Z = 1{D > d_lower}``). On the continuous designs reduces + bit-exactly to ``dose_mean`` / ``mean(D - d_lower)`` when weights + are uniform or absent. ``None`` when ``fit()`` was called without + ``survey_design=`` / ``survey=`` / ``weights=`` (use ``dose_mean`` + there). Exists because ``dose_mean`` is the raw sample mean of the + dose column; under weighted fits the estimator's actual denominator + is the weighted form above, and users reconstructing the β-scale + value by hand need the weighted one.""" def __repr__(self) -> str: base = ( @@ -477,9 +483,20 @@ def to_dict(self) -> Dict[str, Any]: ``design_effect`` / ``sum_weights`` / ``weight_range`` + ``n_strata`` / ``n_psu`` / ``df_survey`` (latter three ``None`` on the ``weights=`` shortcut). - - ``variance_formula``: ``"pweight"`` or ``"survey_binder_tsl"``. + - ``variance_formula``: HAD-specific SE label, populated on BOTH + continuous and mass-point designs (Phase 4.5 A / B): + ``"pweight"`` (continuous, weighted-robust CCT 2014 under + ``weights=``), ``"survey_binder_tsl"`` (continuous, Binder + 1983 TSL under ``survey_design=``), ``"pweight_2sls"`` + (mass-point, weighted 2SLS HC1/CR1 sandwich under ``weights=``), + or ``"survey_binder_tsl_2sls"`` (mass-point, Binder 1983 TSL + under ``survey_design=``). 
See the field docstring above for + the full contract. - ``effective_dose_mean``: weighted denominator used by the - beta-scale rescaling.""" + beta-scale rescaling - weighted ``mean(D)`` on + ``continuous_at_zero``, weighted ``mean(D - d_lower)`` on + ``continuous_near_d_lower``, or the weighted Wald-IV dose gap + ``mean(D | Z=1, w) - mean(D | Z=0, w)`` on ``mass_point``.""" return { "att": self.att, "se": self.se, diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index cd1d4235..1d3b8e73 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -41,8 +41,11 @@ "ContinuousDiDResults": "ContinuousDiD", "TripleDifferenceResults": "TripleDifference (DDD)", "BaconDecompositionResults": "BaconDecomposition", + "HeterogeneousAdoptionDiDResults": "HeterogeneousAdoptionDiD (HAD)", + "HeterogeneousAdoptionDiDEventStudyResults": "HeterogeneousAdoptionDiD (Event Study)", } + # --------------------------------------------------------------------------- # Public API # --------------------------------------------------------------------------- @@ -83,9 +86,7 @@ def practitioner_next_steps( completed = set(completed_steps or []) unknown = completed - STEPS if unknown: - raise ValueError( - f"Unknown step names: {unknown}. Valid names: {sorted(STEPS)}" - ) + raise ValueError(f"Unknown step names: {unknown}. Valid names: {sorted(STEPS)}") # Estimation is always complete if we have a results object completed.add("estimation") @@ -543,10 +544,7 @@ def _handle_synthetic(results: Any): "ATTs — departures signal that something is being picked " "up pre-treatment, weakening the causal interpretation." ), - code=( - "placebo_df = results.in_time_placebo()\n" - "print(placebo_df)" - ), + code=("placebo_df = results.in_time_placebo()\n" "print(placebo_df)"), priority="medium", step_name="sensitivity", ), @@ -589,10 +587,7 @@ def _handle_synthetic(results: Any): "data. 
Show whether the ATT moves materially across a " "grid of values to gauge robustness to this choice." ), - code=( - "sens_df = results.sensitivity_to_zeta_omega()\n" - "print(sens_df)" - ), + code=("sens_df = results.sensitivity_to_zeta_omega()\n" "print(sens_df)"), priority="low", step_name="sensitivity", ), @@ -731,6 +726,28 @@ def _handle_continuous(results: Any): ), step_name="parallel_trends", ), + _step( + baker_step=4, + label="Switch to HeterogeneousAdoptionDiD if no untreated units", + why=( + "ContinuousDiD's identification assumes a never-treated " + "comparison group exists (units with dose = 0). When every " + "unit is treated at some positive dose level — a universal " + "rollout where treatment varies in intensity, not status — " + "use HeterogeneousAdoptionDiD instead. HAD identifies a " + "Weighted Average Slope (WAS) at the dose support boundary " + "by leveraging dose variation across units." + ), + code=( + "# If your panel has no units with first_treat == 0, switch:\n" + "from diff_diff import HeterogeneousAdoptionDiD\n" + "had = HeterogeneousAdoptionDiD()\n" + "had_results = had.fit(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d', first_treat_col='first_treat')" + ), + step_name="estimator_selection", + ), _step( baker_step=7, label="Plot dose-response curve", @@ -739,10 +756,7 @@ def _handle_continuous(results: Any): "level. The dose-response curve reveals the functional form " "of the treatment-dose relationship." ), - code=( - "from diff_diff import plot_dose_response\n" - "plot_dose_response(results)" - ), + code=("from diff_diff import plot_dose_response\n" "plot_dose_response(results)"), step_name="heterogeneity", ), _step( @@ -830,6 +844,318 @@ def _handle_bacon(results: Any): return steps, warnings +def _handle_had(results: Any): + """HeterogeneousAdoptionDiD single-period guidance. + + Five Baker et al. steps (3, 4, 6, 7, 8). 
HAD's design has no
+    untreated unit - the comparison comes from dose variation across
+    units, not from an untreated holdout. Treatment varies in intensity,
+    not in status.
+    """
+    steps = [
+        _step(
+            baker_step=3,
+            label="Run the HAD pretest battery",
+            why=(
+                "On a two-period unweighted panel, did_had_pretest_workflow "
+                "runs paper Section 4.2 step 1 (QUG support-infimum test - "
+                "decides Design 1' vs Design 1) and step 3 (Stute / "
+                "Yatchew-HR Assumption 8 linearity tests). Step 2 "
+                "(Assumption 7 pre-trends) is NOT covered on the overall "
+                "path - a single pre-period cannot support the joint "
+                "Stute variant - and the returned verdict explicitly "
+                "flags that gap. To close step 2, refit on a multi-period "
+                "panel with aggregate='event_study' AND verify the panel "
+                "has at least one earlier placebo pre-period beyond F-1; "
+                "if only the base pre-period F-1 is available, the "
+                "workflow still sets pretrends_joint=None, all_pass=False, "
+                "and a 'joint pre-trends skipped (no earlier pre-period)' "
+                "verdict suffix - in that case step 2 stays uncovered "
+                "even on the event-study path. On survey-weighted "
+                "fits (survey_design= / survey= / weights=) the workflow "
+                "skips QUG with a UserWarning (permanent Phase 4.5 C0 "
+                "deferral - extreme order statistics are not smooth "
+                "functionals of the empirical CDF) and returns a "
+                "linearity-conditional verdict only - so step 1 coverage "
+                "is unweighted-only and the reported verdict on weighted "
+                "fits is conditional on QUG holding by assumption. "
+                "Assumptions 3 / 5 / 6 (uniform continuity at the "
+                "boundary, Design 1 sign / WAS_d_lower identification) "
+                "are NOT testable via pre-trends - the workflow vets only "
+                "what can be vetted."
+ ), + code=( + "from diff_diff import did_had_pretest_workflow\n" + "report = did_had_pretest_workflow(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d',\n" + " first_treat_col='first_treat')\n" + "print(report.summary())\n" + "# verdict explicitly flags the Assumption 7 gap on the\n" + "# overall path; aggregate='event_study' on a multi-period\n" + "# panel adds joint Stute pre-trends + joint homogeneity-linearity.\n" + "# Passing survey_design= / weights= skips QUG (Phase 4.5 C0)\n" + "# and returns a linearity-conditional verdict only." + ), + step_name="parallel_trends", + ), + _step( + baker_step=4, + label="Confirm WAS is the target estimand (vs ATT(d) for ContinuousDiD)", + why=( + "HAD targets WAS (Weighted Average Slope) at the dose " + "support boundary. If you specifically want per-dose " + "ATT(d) / ACRT(d) dose-response curves AND your panel " + "has never-treated controls (units with first_treat == 0), " + "ContinuousDiD is the alternative — different estimand, " + "and ContinuousDiD's identification requires never-treated " + "controls. HAD itself remains valid even with a small " + "share of never-treated units (paper compatibility; see " + "REGISTRY § HeterogeneousAdoptionDiD edge cases — " + "Garrett et al. 2020 retained 12 untreated counties out " + "of 2,954). The choice is about estimand, not about " + "whether untreated units exist." 
+ ), + code=( + "# HAD reports WAS at the dose support boundary.\n" + "# If you instead want per-dose ATT(d)/ACRT(d) dose-response\n" + "# curves AND the panel has never-treated controls:\n" + "from diff_diff import ContinuousDiD\n" + "cdid = ContinuousDiD()\n" + "cdid_results = cdid.fit(\n" + " data, outcome='y', unit='unit', time='t',\n" + " first_treat='first_treat', dose='d',\n" + " aggregate='dose')" + ), + step_name="estimator_selection", + ), + _step( + baker_step=6, + label="Inspect bandwidth diagnostics (continuous designs)", + why=( + "Continuous-dose designs (continuous_at_zero / " + "continuous_near_d_lower) use an MSE-DPI bandwidth selector " + "for the bias-corrected local-linear estimator. Bandwidth " + "choice affects WAS - verify the selector landed on a " + "viable bandwidth (not boundary-clipped or near-degenerate). " + "results.bandwidth_diagnostics is None on the mass_point " + "design (parametric, no bandwidth)." + ), + code=( + "# Inspect the auto-selected bandwidths:\n" + "results.bandwidth_diagnostics # None on mass_point" + ), + priority="medium", + step_name="sensitivity", + ), + _step( + baker_step=7, + label="Re-fit with aggregate='event_study' for per-horizon WAS", + why=( + "On multi-period panels, the event-study aggregate returns " + "per-event-time WAS estimates instead of a single scalar. " + "Reveals whether dose response grows, decays, or stabilizes " + "across post-treatment horizons. Pre-period placebos serve " + "as a parallel-trends sanity check." 
+ ), + code=( + "from diff_diff import HeterogeneousAdoptionDiD\n" + "est = HeterogeneousAdoptionDiD()\n" + "es = est.fit(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d',\n" + " first_treat_col='first_treat',\n" + " aggregate='event_study')" + ), + priority="medium", + step_name="heterogeneity", + ), + _step( + baker_step=8, + label="Verify design auto-detection with explicit design=", + why=( + "design='auto' picks one of {continuous_at_zero, " + "continuous_near_d_lower, mass_point} from the dose " + "support. Re-fit with an explicit design= to verify the " + "auto-detection matched your panel structure - WAS vs " + "WAS_d_lower target parameters, and the bias-corrected " + "local-linear vs 2SLS estimation paths, differ in " + "interpretation." + ), + code=( + "# Refit with each candidate design and compare:\n" + "from diff_diff import HeterogeneousAdoptionDiD\n" + "for d in ['continuous_at_zero', 'continuous_near_d_lower',\n" + " 'mass_point']:\n" + " try:\n" + " alt = HeterogeneousAdoptionDiD(design=d).fit(...)\n" + " print(d, alt.att, alt.target_parameter)\n" + " except Exception as e:\n" + " print(d, 'not applicable:', e)" + ), + priority="medium", + step_name="robustness", + ), + ] + warnings = _check_nan_att(results) + return steps, warnings + + +def _handle_had_event_study(results: Any): + """HeterogeneousAdoptionDiD event-study guidance. + + Five Baker et al. steps (3, 4, 6, 7, 8). Same framing convention as + _handle_had: "no untreated unit", dose variation, treatment varies + in intensity not status. + """ + steps = [ + _step( + baker_step=3, + label="Run the HAD pretest battery (event-study mode)", + why=( + "On multi-period unweighted panels, did_had_pretest_workflow " + "with aggregate='event_study' runs QUG plus joint Stute " + "pre-trends plus joint homogeneity-linearity Stute. 
The " + "joint Stute pre-trends variant closes the paper Section " + "4.2 step-2 gap ONLY IF the panel carries at least one " + "earlier placebo pre-period beyond the base F-1. With " + "only the base F-1 pre-period present (e.g. a minimal " + "valid 3-period event-study fit, or a 4-period fit under " + "trends_lin=True where the consumed F-2 placebo gets " + "dropped), pretrends_joint=None, all_pass=False, and the " + "verdict carries 'joint pre-trends skipped (no earlier " + "pre-period)' - step 2 stays uncovered. On survey-weighted fits (survey_design= / survey= / " + "weights=) the workflow skips QUG with a UserWarning " + "(permanent Phase 4.5 C0 deferral) and returns a " + "linearity-conditional verdict only - so step 1 coverage " + "is unweighted-only on the event-study path too, and the " + "weighted verdict is conditional on QUG holding by " + "assumption. The joint Stute pre-trends and joint " + "homogeneity-linearity tests themselves remain available " + "under survey weighting via PSU-level Mammen multiplier " + "bootstrap." + ), + code=( + "from diff_diff import did_had_pretest_workflow\n" + "report = did_had_pretest_workflow(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d',\n" + " first_treat_col='first_treat',\n" + " aggregate='event_study')\n" + "print(report.summary())" + ), + step_name="parallel_trends", + ), + _step( + baker_step=4, + label="Confirm WAS is the target estimand (vs ATT(d) for ContinuousDiD)", + why=( + "HAD targets per-event-time WAS at the dose support " + "boundary. If you instead want per-dose ATT(d) / ACRT(d) " + "dose-response curves AND your panel has never-treated " + "controls, ContinuousDiD(aggregate='eventstudy') is the " + "alternative — different estimand, requires never-treated. 
" + "HAD itself remains valid even with a small share of " + "never-treated units (paper compatibility); on staggered " + "panels HAD's last-cohort filter explicitly RETAINS " + "never-treated units as the untreated-group comparison " + "(paper Appendix B.2). The choice is about estimand." + ), + code=( + "# HAD reports per-event-time WAS at the dose boundary.\n" + "# If you instead want per-dose ATT(d)/ACRT(d) event-study\n" + "# curves AND the panel has never-treated controls:\n" + "from diff_diff import ContinuousDiD\n" + "cdid = ContinuousDiD()\n" + "cdid_es = cdid.fit(\n" + " data, outcome='y', unit='unit', time='t',\n" + " first_treat='first_treat', dose='d',\n" + " aggregate='eventstudy')" + ), + step_name="estimator_selection", + ), + _step( + baker_step=6, + label="Use simultaneous (sup-t) confidence bands when reading multiple horizons", + why=( + "Pointwise CIs over-reject when you read multiple horizons " + "as a joint pattern. On weighted fits (survey_design= or " + "weights=), fit(cband=True) constructs simultaneous (sup-t) " + "bands across horizons via multiplier bootstrap. " + "results.cband_low / results.cband_high give the band " + "endpoints; results.cband_crit_value reports the sup-t " + "critical value used." + ), + code=( + "from diff_diff import HeterogeneousAdoptionDiD, SurveyDesign\n" + "# Construct your survey design (adapt to your data):\n" + "sd = SurveyDesign(weights='weight_col')\n" + "# vcov_type='hc1' is REQUIRED on the mass-point design under\n" + "# survey_design= (the default classical sandwich raises\n" + "# NotImplementedError on the survey path because the\n" + "# Binder-TSL composition consumes the HC1-scale IF -\n" + "# see had.py:3495-3507). 
On the continuous designs the\n" + "# vcov_type kwarg is unused (CCT-2014 robust SE is the\n" + "# only formula), so passing vcov_type='hc1' is a no-op\n" + "# there and a safe default for the survey-aware example.\n" + "est = HeterogeneousAdoptionDiD(\n" + " n_bootstrap=999, seed=42, vcov_type='hc1')\n" + "es = est.fit(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d',\n" + " first_treat_col='first_treat',\n" + " aggregate='event_study',\n" + " survey_design=sd, cband=True)\n" + "es.cband_low, es.cband_high # simultaneous band endpoints" + ), + priority="medium", + step_name="sensitivity", + ), + _step( + baker_step=7, + label="Inspect per-horizon WAS arrays + pre-period placebos", + why=( + "Per-horizon WAS reveals adoption-effect dynamics. " + "Pre-period placebo horizons (event_times <= -2) should be " + "near zero - large pre-period coefficients flag a " + "parallel-trends or anticipation problem. The anchor " + "horizon e = -1 is excluded by construction." + ), + code=( + "import numpy as np\n" + "es.event_times, es.att, es.se # per-horizon arrays\n" + "# Pre-period placebos (should be near zero):\n" + "pre_mask = es.event_times <= -2\n" + "es.att[pre_mask], es.se[pre_mask]" + ), + step_name="heterogeneity", + ), + _step( + baker_step=8, + label="Report the last-cohort-only WAS framing on staggered panels", + why=( + "On staggered panels (multiple treatment cohorts), fit() " + "auto-filters to the last treatment cohort plus " + "never-treated units and emits a UserWarning naming " + "kept/dropped counts (paper Appendix B.2). The resulting " + "estimand is a last-cohort-only WAS, NOT a multi-cohort " + "average - report it as such, and consider " + "ChaisemartinDHaultfoeuille for full staggered support." 
+ ), + code=( + "# Inspect the kept/dropped cohort counts in the\n" + "# UserWarning emitted at fit time.\n" + "# For full multi-cohort support, see:\n" + "# from diff_diff import ChaisemartinDHaultfoeuille" + ), + priority="medium", + step_name="robustness", + ), + ] + warnings = _check_nan_att(results) + return steps, warnings + + def _handle_generic(results: Any): """Fallback for unknown result types.""" steps = [ @@ -880,6 +1206,8 @@ def _handle_generic(results: Any): "ContinuousDiDResults": _handle_continuous, "TripleDifferenceResults": _handle_triple, "BaconDecompositionResults": _handle_bacon, + "HeterogeneousAdoptionDiDResults": _handle_had, + "HeterogeneousAdoptionDiDEventStudyResults": _handle_had_event_study, } @@ -887,19 +1215,45 @@ def _handle_generic(results: Any): # Internal helpers # --------------------------------------------------------------------------- def _check_nan_att(results: Any) -> List[str]: - """Return warnings if ATT is NaN.""" + """Return warnings if ATT is NaN. + + Scalar path executes byte-identically to the pre-Phase-5 helper for + backcompat with the existing 12 untouched handlers. The ndarray + branch is reached only when ``float(att)`` raises TypeError on a + numpy array (HAD's event-study ``att`` field) and fires only when + every horizon is NaN - partial-NaN arrays are legitimate event-study + output (single-cluster collapse, degenerate horizon-specific design) + and would over-fire if flagged. Falls through to ``_handle_generic`` + too: any future estimator returning ndarray ``att`` without a + dedicated handler gets the same all-NaN warning shape. 
+ """ # Check .att (DiDResults), .overall_att (staggered), .avg_att (MultiPeriod) att = getattr(results, "att", None) if att is None: att = getattr(results, "overall_att", None) if att is None: att = getattr(results, "avg_att", None) - if att is not None: + if att is None: + return [] + try: + scalar = float(att) + except (TypeError, ValueError): + # Ndarray path (HAD event-study, future ndarray-att estimators). + # Use np.all (not np.any): partial-NaN arrays are legitimate. try: - att = float(att) + import numpy as np + + arr = np.asarray(att, dtype=float) except (TypeError, ValueError): return [] - if att is not None and math.isnan(att): + if arr.size and bool(np.all(np.isnan(arr))): + return [ + "All per-horizon estimates are NaN — check data " + "preparation and model specification before proceeding " + "with diagnostics." + ] + return [] + if math.isnan(scalar): return [ "Estimation produced NaN ATT — check data preparation and " "model specification before proceeding with diagnostics." 
@@ -907,9 +1261,7 @@ def _check_nan_att(results: Any) -> List[str]: return [] -def _filter_steps( - steps: List[Dict[str, Any]], completed: Set[str] -) -> List[Dict[str, Any]]: +def _filter_steps(steps: List[Dict[str, Any]], completed: Set[str]) -> List[Dict[str, Any]]: """Remove steps whose _step_name is in the completed set.""" filtered = [] for s in steps: @@ -938,8 +1290,9 @@ def _print_output(output: Dict[str, Any]) -> None: for step in output["next_steps"]: priority = step.get("priority", "high") marker = "*" if priority == "high" else "-" - print(f"\n {marker} [{priority.upper()}] Step {step['baker_step']}: " - f"{step['label']}") + print( + f"\n {marker} [{priority.upper()}] Step {step['baker_step']}: " f"{step['label']}" + ) print(f" Why: {step['why']}") if step.get("code"): for line in step["code"].split("\n"): diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml index 7c7aefca..e335c646 100644 --- a/docs/doc-deps.yaml +++ b/docs/doc-deps.yaml @@ -385,9 +385,9 @@ sources: section: "Universal Rollout (No Untreated Markets)" type: user_guide note: "Tip cross-link to T20 in the no-untreated section" - # Note: llms-full.txt does not yet have a HeterogeneousAdoptionDiD section - # (deferred to TODO.md Phase 5 follow-up); the dependency mapping will be - # added when that section lands. 
+ - path: diff_diff/guides/llms-full.txt + section: "HeterogeneousAdoptionDiD" + type: user_guide diff_diff/had_pretests.py: drift_risk: medium @@ -401,6 +401,9 @@ sources: - path: diff_diff/guides/llms.txt section: "Estimators" type: user_guide + - path: diff_diff/guides/llms-full.txt + section: "HAD Pretests" + type: user_guide diff_diff/local_linear.py: drift_risk: low @@ -799,6 +802,10 @@ sources: docs: - path: diff_diff/guides/llms-practitioner.txt type: user_guide + - path: diff_diff/guides/llms-full.txt + section: "Practitioner Workflow" + type: user_guide + note: "HAD handlers (_handle_had / _handle_had_event_study) emit did_had_pretest_workflow + bandwidth_diagnostics references; symmetric Step-4 routing in _handle_continuous" # ── Visualization (visualization group) ──────────────────────────── diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 5e5b6824..a59949e6 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2548,9 +2548,10 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - [x] Phase 3: `did_had_pretest_workflow()` composite helper. Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three IMPLEMENTED tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string. Because Phase 3 ships steps 1 + 3 of the paper's four-step workflow but **not** step 2 (Assumption 7 pre-trends test via Equation 18), the fail-to-reject verdict explicitly flags the Assumption 7 gap rather than claiming unconditional TWFE safety: `"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. 
Verdict priority follows the paper's one-way rule (TWFE admissible only if NO test rejects): **conclusive rejections are the primary verdict and are NEVER hidden by inconclusive status** — any unresolved-step note is appended via `"; additional steps unresolved: ..."` rather than replacing the rejection. The pure `"inconclusive - QUG NaN"` / `"inconclusive - both Stute and Yatchew linearity tests NaN"` forms only fire when NO conclusive test rejects AND a required step is unresolved. The partial-workflow fail-to-reject verdict may carry a `"(Yatchew NaN - skipped)"` (or Stute) suffix when one linearity test is NaN but the other is conclusive (step 3 resolved via the paper's "Stute OR Yatchew" wording). Bundled rejection-reason strings name each failed assumption in the conclusive-rejection case. `all_pass` is `True` iff QUG is conclusive AND at least one of Stute/Yatchew is conclusive AND no conclusive test rejects. **Non-negative-dose contract**: all three raw linearity helpers (`qug_test`, `stute_test`, `yatchew_hr_test`) raise a front-door `ValueError` on any `d < 0`, mirroring the `_validate_had_panel` guard (paper Section 2 HAD support restriction). Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up. - [ ] Phase 4: Pierce-Schott (2016) replication harness reproduces Figure 2 values. - [ ] Phase 4: Full DGP 1/2/3 coverage-rate reproduction from Table 1. -- [ ] Phase 5: `practitioner_next_steps()` integration for HAD results. +- [x] Phase 5 (wave 1, PR #402): `practitioner_next_steps()` integration for HAD results - `_handle_had` and `_handle_had_event_study` route both result classes through HAD-specific Baker et al. (2025) step guidance with bidirectional HAD ↔ ContinuousDiD Step-4 routing closure. The `_check_nan_att` helper extends to ndarray `att` (HAD event-study) via `np.all(np.isnan(arr))` semantics; scalar path bit-exact preserved. 
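The `np.all(np.isnan(arr))` semantics in the bullet above are easiest to see standalone. A minimal sketch of the described behavior (illustrative only; `check_nan_att_sketch` is a hypothetical stand-in, not the shipped `_check_nan_att`):

```python
import math

import numpy as np


def check_nan_att_sketch(att):
    """Warn only when the estimate is entirely NaN (all-NaN, not any-NaN)."""
    try:
        scalar = float(att)  # scalar path: plain float or size-1 array
    except (TypeError, ValueError):
        arr = np.asarray(att, dtype=float)
        # np.all, not np.any: a partial-NaN per-horizon array is
        # legitimate event-study output and must not fire the warning.
        if arr.size and bool(np.all(np.isnan(arr))):
            return ["All per-horizon estimates are NaN"]
        return []
    if math.isnan(scalar):
        return ["Estimation produced NaN ATT"]
    return []


check_nan_att_sketch(np.array([np.nan, np.nan]))  # fires: every horizon NaN
check_nan_att_sketch(np.array([np.nan, 0.3]))  # silent: partial NaN is legitimate
```

The scalar path never reaches the ndarray branch: `float(att)` only raises `TypeError` for arrays with more than one element, so plain-float `att` fields keep their existing behavior.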
+- [x] Phase 5 (wave 1, PR #402): `llms-full.txt` HeterogeneousAdoptionDiD section + result-class blocks + `## HAD Pretests` index + Choosing-an-Estimator row landed; constructor / fit() signatures match the real API (regression-tested via `inspect.signature`); result-class field tables enumerate every public dataclass field (regression-tested via `dataclasses.fields()`); `llms-practitioner.txt` Step 4 decision tree distinguishes ContinuousDiD (per-dose ATT(d), needs never-treated) from HeterogeneousAdoptionDiD (WAS, universal-rollout-compatible). - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh. -- [ ] Phase 5 (remaining): Tutorial notebook + `llms-full.txt` HeterogeneousAdoptionDiD section (preserving the UTF-8 fingerprint). +- [ ] Phase 5 (remaining): T21 HAD pretest workflow tutorial + T22 weighted/survey HAD tutorial - tracked in `TODO.md`. - [ ] Documentation of non-testability of Assumptions 5 and 6. - [ ] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`). - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work). diff --git a/tests/test_guides.py b/tests/test_guides.py index e51f6920..bb704a6a 100644 --- a/tests/test_guides.py +++ b/tests/test_guides.py @@ -272,3 +272,370 @@ def test_module_docstring_mentions_helper(): import diff_diff assert "get_llm_guide" in diff_diff.__doc__ + + +# --------------------------------------------------------------------------- +# llms-full.txt — HeterogeneousAdoptionDiD coverage (Phase 5) +# --------------------------------------------------------------------------- +class TestLLMsFullHADCoverage: + """Lock the HAD section additions to llms-full.txt against deletion + or framing drift. 
Phase 5 surfaces the agent-facing API contract for + HeterogeneousAdoptionDiD on the bundled-in-wheel guide.""" + + def test_llms_full_has_had_section(self): + text = get_llm_guide("full") + assert "### HeterogeneousAdoptionDiD" in text + + def test_llms_full_had_results_classes(self): + text = get_llm_guide("full") + assert "### HeterogeneousAdoptionDiDResults" in text + assert "### HeterogeneousAdoptionDiDEventStudyResults" in text + + def test_llms_full_had_pretests_section(self): + text = get_llm_guide("full") + assert "## HAD Pretests" in text + for fn in ( + "did_had_pretest_workflow", + "qug_test", + "stute_test", + "yatchew_hr_test", + "stute_joint_pretest", + "joint_pretrends_test", + "joint_homogeneity_test", + ): + assert fn in text, f"HAD Pretests section missing reference to {fn}" + + def test_llms_full_had_choosing_row(self): + text = get_llm_guide("full") + # The Choosing-an-Estimator table must list HAD with a row that + # accurately reflects the contract: HAD targets WAS at the dose + # support boundary and is compatible with universal-rollout + # panels (and panels with a small never-treated share — paper + # edge case at REGISTRY § HeterogeneousAdoptionDiD edge cases). + idx = text.index("## Choosing an Estimator") + choosing = text[idx:] + assert "HeterogeneousAdoptionDiD" in choosing + # Row must mention WAS as the estimand differentiator (not a + # blanket "if untreated → not HAD" rule which would be wrong + # per registry). + assert "WAS" in choosing + + def test_llms_full_had_section_methodology_compatible_with_untreated(self): + # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD edge + # cases (line ~2403): "Authors do NOT require untreated units + # to be dropped" and (line ~2408) the staggered event-study path + # explicitly RETAINS never-treated units. The HAD section must + # NOT carry framing that says HAD is incompatible with + # never-treated / untreated units. 
+ text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end].lower() + # Negative assertions on framing that contradicts the registry. + assert "no comparison group" not in had_text + assert "missing comparison" not in had_text + forbidden_phrases = ( + "no never-treated units", + "requires no untreated", + "drop untreated", + "must not contain untreated", + "not compatible with untreated", + ) + for phrase in forbidden_phrases: + assert phrase not in had_text, ( + f"HAD section must not carry the phrase {phrase!r}: " + f"per REGISTRY § HeterogeneousAdoptionDiD edge cases, " + f"HAD is compatible with a small share of never-treated " + f"units and explicitly retains them on staggered " + f"event-study panels (Appendix B.2)." + ) + + def test_llms_full_had_constructor_signature_matches_real_api(self): + # Documented constructor parameter list must align with the + # actual HeterogeneousAdoptionDiD.__init__ signature. Catches + # the failure mode where the guide invents kwargs that don't + # exist (h, b, rcond) or omits real ones (d_lower, kernel, + # vcov_type, robust, cluster). + import inspect + + from diff_diff import HeterogeneousAdoptionDiD + + sig_params = set(inspect.signature(HeterogeneousAdoptionDiD.__init__).parameters) + sig_params.discard("self") + text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end] + block_start = had_text.index("HeterogeneousAdoptionDiD(") + # Multi-line signature ends with "\n)" — close-paren on its own + # line. Searching for ")" alone would hit close-parens inside + # parameter comments (e.g. "(default)"). 
+ block_end = had_text.index("\n)", block_start) + ctor_block = had_text[block_start:block_end] + for param in sig_params: + assert f"{param}:" in ctor_block or f"{param} " in ctor_block, ( + f"Constructor block in the HAD guide section is missing " + f"the real public parameter {param!r}. The guide must " + f"document the actual HeterogeneousAdoptionDiD.__init__ " + f"signature." + ) + + def test_llms_full_had_fit_signature_matches_real_api(self): + # Documented fit() parameter list must align with the actual + # HeterogeneousAdoptionDiD.fit signature. + import inspect + + from diff_diff import HeterogeneousAdoptionDiD + + sig_params = set(inspect.signature(HeterogeneousAdoptionDiD.fit).parameters) + sig_params.discard("self") + text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end] + block_start = had_text.index("had.fit(") + block_end = had_text.index(") -> ", block_start) + fit_block = had_text[block_start:block_end] + for param in sig_params: + assert f"{param}:" in fit_block or f"{param} " in fit_block, ( + f"fit() block in the HAD guide section is missing the " + f"real public parameter {param!r}. The guide must " + f"document the actual HeterogeneousAdoptionDiD.fit " + f"signature." + ) + + def test_llms_full_paper_citation(self): + # Lead-author "D'Haultfœuille" appears in the HAD section. + # Naturally preserves the UTF-8 'œ' fingerprint asserted by + # test_utf8_encoding_preserved without a synthetic mark. 
+ text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end] + assert "D'Haultfœuille" in had_text + + def test_llms_full_had_results_class_field_lists_match_real_dataclass(self): + # Every public dataclass field on HeterogeneousAdoptionDiDResults + # and HeterogeneousAdoptionDiDEventStudyResults must appear in the + # documented field table. Catches the failure mode where new + # result fields land but the guide isn't updated, so agents + # treating llms-full.txt as the authoritative surface miss + # available diagnostics / metadata. + import dataclasses + + from diff_diff import ( + HeterogeneousAdoptionDiDEventStudyResults, + HeterogeneousAdoptionDiDResults, + ) + + text = get_llm_guide("full") + + # Single-period result class + sp_start = text.index("### HeterogeneousAdoptionDiDResults") + sp_end = text.index("### HeterogeneousAdoptionDiDEventStudyResults", sp_start) + sp_block = text[sp_start:sp_end] + for field in dataclasses.fields(HeterogeneousAdoptionDiDResults): + assert f"`{field.name}`" in sp_block, ( + f"HeterogeneousAdoptionDiDResults guide block is missing " + f"the public dataclass field {field.name!r}. The table " + f"must enumerate every field so agents see all available " + f"diagnostics / metadata." + ) + + # Event-study result class + es_start = text.index("### HeterogeneousAdoptionDiDEventStudyResults") + es_end = text.index("### TROPResults", es_start) + es_block = text[es_start:es_end] + for field in dataclasses.fields(HeterogeneousAdoptionDiDEventStudyResults): + assert f"`{field.name}`" in es_block, ( + f"HeterogeneousAdoptionDiDEventStudyResults guide block " + f"is missing the public dataclass field {field.name!r}." 
+ ) + + def test_llms_full_had_section_documents_mass_point_survey_vcov_requirement(self): + # Per had.py:3495-3507 the mass-point design rejects the default + # classical vcov family on the survey_design= path + # (NotImplementedError). The HAD section must surface this + # requirement so an agent reading llms-full.txt and writing a + # weighted mass-point fit knows to pass vcov_type='hc1' + # explicitly. Without this caveat the documented fit() example + # can fail at fit time on a mass-point panel. + text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end] + # Must mention the mass-point + survey vcov requirement. + # Accept either explicit "vcov_type" mention near "mass" wording + # or the explicit "hc1" / "robust=True" pairing with mass-point. + lower = had_text.lower() + assert "vcov_type" in lower and ("mass-point" in lower or "mass_point" in lower), ( + "HAD section must document the mass-point + survey vcov " + "requirement: passing vcov_type='hc1' (or robust=True) is " + "required on design='mass_point' under survey_design= " + "(per had.py:3495-3507). Without this caveat the documented " + "weighted fit example can raise NotImplementedError." + ) + + def test_llms_full_had_variance_formula_describes_all_designs(self): + # Per diff_diff/had.py:3585-3629, weighted mass-point fits populate + # variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} and + # weighted continuous fits in {"pweight", "survey_binder_tsl"}. The + # documented description must cover ALL four labels (not just the + # two continuous ones) so agents reading the guide on a weighted + # mass-point fit do not misread the available inference metadata. 
+ text = get_llm_guide("full") + sp_start = text.index("### HeterogeneousAdoptionDiDResults") + sp_end = text.index("### HeterogeneousAdoptionDiDEventStudyResults", sp_start) + sp_block = text[sp_start:sp_end] + # Find the variance_formula row in the table. + for line in sp_block.splitlines(): + if line.startswith("| `variance_formula`"): + for label in ( + "pweight", + "survey_binder_tsl", + "pweight_2sls", + "survey_binder_tsl_2sls", + ): + assert label in line, ( + f"variance_formula row must enumerate the {label!r} " + f"label - weighted mass-point fits populate " + f"pweight_2sls / survey_binder_tsl_2sls per " + f"had.py:3585-3629. Line: {line!r}" + ) + break + else: + pytest.fail("variance_formula row not found in HAD results table") + # effective_dose_mean: must mention mass-point Wald-IV dose gap. + for line in sp_block.splitlines(): + if line.startswith("| `effective_dose_mean`"): + assert "mass_point" in line or "Wald-IV" in line or "mass-point" in line, ( + f"effective_dose_mean row must mention mass-point " + f"semantics - weighted mass-point fits populate the " + f"weighted Wald-IV dose gap per had.py:3642-3660. " + f"Line: {line!r}" + ) + break + else: + pytest.fail("effective_dose_mean row not found in HAD results table") + + def test_llms_practitioner_step_4_distinguishes_had_from_continuous(self): + # The official practitioner workflow guide (returned by + # get_llm_guide("practitioner")) routes continuous treatments. It + # must distinguish ContinuousDiD (per-dose ATT(d), requires + # never-treated controls) from HeterogeneousAdoptionDiD (WAS at + # dose boundary, compatible with universal rollout). Pre-PR the + # decision tree routed ALL continuous-intensity designs to + # ContinuousDiD - which is wrong for no-untreated panels. + text = get_llm_guide("practitioner") + # Locate the Step 4 decision tree. + s4_start = text.index("## Step 4: Choose Estimation Method") + # Step 5 is the next section header; cap the slice there. 
+ s5_start = text.index("## Step ", s4_start + 1) + s4_block = text[s4_start:s5_start] + # Both HAD and ContinuousDiD must appear in the continuous branch. + assert "HeterogeneousAdoptionDiD" in s4_block, ( + "practitioner guide Step 4 decision tree must mention " + "HeterogeneousAdoptionDiD as the alternative to ContinuousDiD " + "on no-untreated / universal-rollout panels." + ) + assert "ContinuousDiD" in s4_block + # Universal-rollout / no-untreated framing should be present so + # readers know which branch routes where. + assert "never-treated" in s4_block.lower() or "untreated" in s4_block.lower(), ( + "practitioner guide Step 4 must describe the never-treated / " + "universal-rollout distinction that drives the HAD vs " + "ContinuousDiD routing." + ) + + def test_llms_full_had_pretests_documents_earlier_pre_period_precondition(self): + # Same precondition as the practitioner test: per + # docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD + # § "Assumption 7 / step 2 closure" + had_pretests.py:4738-4756 + + # 2769, aggregate="event_study" closes step 2 ONLY IF the + # panel carries at least one earlier placebo pre-period beyond + # the base F-1. The HAD Pretests section in llms-full.txt must + # document this precondition so agents do not assume any + # multi-period event-study fit closes step 2. + text = get_llm_guide("full") + pretests_start = text.index("## HAD Pretests") + pretests_end = text.index("## Honest DiD", pretests_start) + pretests_block = text[pretests_start:pretests_end] + lower = pretests_block.lower() + assert "earlier" in lower and ("pre-period" in lower or "placebo" in lower), ( + "HAD Pretests section must document the 'earlier pre-period' " + "precondition for step-2 closure on the event-study path." + ) + assert "skipped" in lower or "pretrends_joint=none" in lower, ( + "HAD Pretests section must surface the " + "'joint pre-trends skipped' / pretrends_joint=None fallback " + "when no earlier pre-period exists." 
+ ) + + def test_llms_full_had_pretests_assumption_labels_correct(self): + # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD + # § "Assumptions / Theorems / Estimators": + # - Assumption 5 = Design 1 sign identification (NOT testable) + # - Assumption 6 = Design 1 WAS_d_lower identification (NOT testable) + # - Assumption 7 = pre-trends (paper Section 4.2 step 2) + # - Assumption 8 = linearity (paper Section 4.2 step 3) + # The HAD Pretests section must NOT mislabel these: + # - qug_test is the support-infimum test (H0: d_lower = 0), + # NOT "Assumption 5" (which is non-testable per registry). + # - stute_test is Assumption 8 (linearity), NOT Assumption 7. + text = get_llm_guide("full") + pretests_start = text.index("## HAD Pretests") + pretests_end = text.index("## Honest DiD", pretests_start) + pretests_block = text[pretests_start:pretests_end] + # qug_test bullet: must positively label QUG as a support-infimum + # test, NOT as a positive "Assumption 5 support condition" claim + # (a negative disclaimer "does NOT test Assumption 5" is fine). + forbidden_qug_positive_claims = ( + "Assumption 5 support condition", + "QUG (Assumption 5", + "qug_test`) — Assumption 5", + "qug_test(d)` — Assumption 5", + ) + # stute_test bullet: must positively label as Assumption 8 + # linearity, NOT as Assumption 7 mean-independence. + forbidden_stute_positive_claims = ( + "stute_test(d, dy)` — Assumption 7", + "Stute (Assumption 7", + "Assumption 7 mean-independence", + ) + for line in pretests_block.splitlines(): + if line.startswith("- `qug_test"): + # Positive claim of what QUG IS: + assert ( + "support-infimum" in line + or "support infimum" in line + or "Theorem 4" in line + or "H_0: d_lower" in line + ), ( + f"qug_test bullet must positively label QUG as the " + f"support-infimum / Theorem-4 test. 
Line: {line!r}" + ) + for phrase in forbidden_qug_positive_claims: + assert phrase not in line, ( + f"qug_test bullet must not positively claim QUG " + f"is an 'Assumption 5' test ({phrase!r}). QUG " + f"tests H_0: d_lower = 0; Assumption 5 is the " + f"Design 1 sign-identification condition (NOT " + f"testable per registry). A negative disclaimer " + f"that QUG does NOT test Assumption 5 is fine. " + f"Line: {line!r}" + ) + if line.startswith("- `stute_test"): + # Positive claim of what Stute IS: + assert "Assumption 8" in line or "linearity" in line.lower(), ( + f"stute_test bullet must positively label as " + f"Assumption 8 / linearity test. Line: {line!r}" + ) + for phrase in forbidden_stute_positive_claims: + assert phrase not in line, ( + f"stute_test bullet must not positively claim " + f"Stute is an Assumption 7 mean-independence " + f"test ({phrase!r}). stute_test is Assumption 8 " + f"linearity (paper Section 4.2 step 3); " + f"Assumption 7 is pre-trends (step 2, only " + f"covered on the event-study path). 
Line: {line!r}" + ) diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index 2b2b62c0..0db8d02e 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -1,11 +1,14 @@ """Tests for the practitioner guidance module.""" +import numpy as np import pytest from diff_diff import ( BaconDecomposition, CallawaySantAnna, DifferenceInDifferences, + HeterogeneousAdoptionDiDEventStudyResults, + HeterogeneousAdoptionDiDResults, MultiPeriodDiD, generate_did_data, generate_staggered_data, @@ -32,9 +35,7 @@ def did_data(): @pytest.fixture(scope="session") def staggered_data(): - return generate_staggered_data( - n_units=60, n_periods=8, treatment_effect=2.0, seed=42 - ) + return generate_staggered_data(n_units=60, n_periods=8, treatment_effect=2.0, seed=42) @pytest.fixture(scope="session") @@ -46,9 +47,7 @@ def did_results(did_data): @pytest.fixture(scope="session") def multi_period_results(did_data): es = MultiPeriodDiD() - return es.fit( - did_data, outcome="outcome", unit="unit", time="period", treatment="treated" - ) + return es.fit(did_data, outcome="outcome", unit="unit", time="period", treatment="treated") @pytest.fixture(scope="session") @@ -171,6 +170,43 @@ def mock_stacked_results(): return r +@pytest.fixture +def mock_had_results(): + r = HeterogeneousAdoptionDiDResults.__new__(HeterogeneousAdoptionDiDResults) + r.att = 0.5 + return r + + +@pytest.fixture +def mock_had_event_study_results(): + r = HeterogeneousAdoptionDiDEventStudyResults.__new__(HeterogeneousAdoptionDiDEventStudyResults) + # 5 horizons: e in {-3, -2, 0, 1, 2} + r.att = np.array([0.01, -0.02, 0.30, 0.45, 0.50]) + r.event_times = np.array([-3, -2, 0, 1, 2]) + return r + + +@pytest.fixture +def mock_had_results_nan_att(): + r = HeterogeneousAdoptionDiDResults.__new__(HeterogeneousAdoptionDiDResults) + r.att = float("nan") + return r + + +@pytest.fixture +def mock_had_event_study_results_all_nan(): + r = 
HeterogeneousAdoptionDiDEventStudyResults.__new__(HeterogeneousAdoptionDiDEventStudyResults) + r.att = np.full(5, np.nan) + return r + + +@pytest.fixture +def mock_had_event_study_results_partial_nan(): + r = HeterogeneousAdoptionDiDEventStudyResults.__new__(HeterogeneousAdoptionDiDEventStudyResults) + r.att = np.array([0.5, np.nan, 0.3]) + return r + + # --------------------------------------------------------------------------- # Tests: return schema # --------------------------------------------------------------------------- @@ -345,16 +381,12 @@ def test_filter_sensitivity(self, cs_results): assert len(filtered["next_steps"]) < len(full["next_steps"]) def test_filter_all_steps(self, cs_results): - output = practitioner_next_steps( - cs_results, completed_steps=list(STEPS), verbose=False - ) + output = practitioner_next_steps(cs_results, completed_steps=list(STEPS), verbose=False) assert len(output["next_steps"]) == 0 def test_invalid_step_name_raises(self, did_results): with pytest.raises(ValueError, match="Unknown step names"): - practitioner_next_steps( - did_results, completed_steps=["invalid_step"], verbose=False - ) + practitioner_next_steps(did_results, completed_steps=["invalid_step"], verbose=False) # --------------------------------------------------------------------------- @@ -439,7 +471,8 @@ def test_hausman_pretest_in_guidance(self, mock_efficient_results): def test_hausman_snippet_uses_classmethod(self, mock_efficient_results): output = practitioner_next_steps(mock_efficient_results, verbose=False) hausman_steps = [ - s for s in output["next_steps"] + s + for s in output["next_steps"] if "hausman" in s["label"].lower() or "Hausman" in s["label"] ] assert len(hausman_steps) > 0 @@ -458,3 +491,448 @@ class FakeResults: output = practitioner_next_steps(FakeResults(), verbose=False) assert len(output["next_steps"]) > 0 assert output["estimator"] == "FakeResults" + + +# --------------------------------------------------------------------------- +# 
Tests: HeterogeneousAdoptionDiD (HAD) handler dispatch +# --------------------------------------------------------------------------- +class TestHADDispatch: + def test_had_results_dispatch(self, mock_had_results): + output = practitioner_next_steps(mock_had_results, verbose=False) + assert len(output["next_steps"]) > 0 + assert output["estimator"] == "HeterogeneousAdoptionDiD (HAD)" + + def test_had_event_study_dispatch(self, mock_had_event_study_results): + output = practitioner_next_steps(mock_had_event_study_results, verbose=False) + assert len(output["next_steps"]) > 0 + assert output["estimator"] == "HeterogeneousAdoptionDiD (Event Study)" + + def test_had_pretest_workflow_referenced(self, mock_had_results): + output = practitioner_next_steps(mock_had_results, verbose=False) + all_code = " ".join(s.get("code", "") for s in output["next_steps"]) + assert "did_had_pretest_workflow" in all_code + + def test_had_event_study_pretest_workflow_referenced(self, mock_had_event_study_results): + output = practitioner_next_steps(mock_had_event_study_results, verbose=False) + all_code = " ".join(s.get("code", "") for s in output["next_steps"]) + assert "did_had_pretest_workflow" in all_code + assert "aggregate='event_study'" in all_code + + def test_had_bandwidth_diagnostics_referenced(self, mock_had_results): + output = practitioner_next_steps(mock_had_results, verbose=False) + all_text = " ".join( + (s.get("code", "") + " " + s.get("why", "")) for s in output["next_steps"] + ) + assert "bandwidth_diagnostics" in all_text + + def test_had_event_study_simultaneous_bands_referenced(self, mock_had_event_study_results): + output = practitioner_next_steps(mock_had_event_study_results, verbose=False) + all_text = " ".join( + (s.get("code", "") + " " + s.get("why", "")) for s in output["next_steps"] + ) + assert "cband" in all_text + # Either "sup-t" wording or "simultaneous" wording is acceptable. 
+ assert ("sup-t" in all_text) or ("simultaneous" in all_text) + + def test_had_no_comparison_group_framing(self, mock_had_results, mock_had_event_study_results): + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + all_text = " ".join( + (s.get("code", "") + " " + s.get("why", "") + " " + s.get("label", "")) + for s in output["next_steps"] + ) + all_text += " ".join(output["warnings"]) + assert "no comparison group" not in all_text.lower() + assert "missing comparison" not in all_text.lower() + + def test_had_nan_warning_scalar(self, mock_had_results_nan_att): + output = practitioner_next_steps(mock_had_results_nan_att, verbose=False) + warnings = " ".join(output["warnings"]) + assert "NaN" in warnings or "nan" in warnings.lower() + + def test_had_event_study_nan_warning_array(self, mock_had_event_study_results_all_nan): + output = practitioner_next_steps(mock_had_event_study_results_all_nan, verbose=False) + warnings = " ".join(output["warnings"]) + assert "per-horizon" in warnings or "All" in warnings + + def test_had_partial_nan_array_no_warning(self, mock_had_event_study_results_partial_nan): + # Partial-NaN arrays are legitimate event-study output (some + # horizons may collapse on degenerate-design grounds while others + # remain finite). The all-NaN warning must NOT fire here. + output = practitioner_next_steps(mock_had_event_study_results_partial_nan, verbose=False) + # No "per-horizon" or "All ... NaN" warning string should appear. + warnings = " ".join(output["warnings"]) + assert "per-horizon" not in warnings + assert "All " not in warnings + + def test_had_step_4_estimator_selection_present( + self, mock_had_results, mock_had_event_study_results + ): + # Step-4 must surface the WAS-vs-ATT(d) estimand difference (not + # a blanket "if untreated → not HAD" rule which would contradict + # REGISTRY § HeterogeneousAdoptionDiD edge cases lines ~2403/2408). 
+ for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + step_4_steps = [s for s in output["next_steps"] if s["baker_step"] == 4] + assert len(step_4_steps) >= 1 + all_text = " ".join( + (s.get("code", "") + " " + s.get("why", "") + " " + s.get("label", "")) + for s in step_4_steps + ) + # Routing nudge must name ContinuousDiD as the estimand + # alternative; framing must center on WAS vs ATT(d) (the + # actual estimand differentiator), NOT on whether untreated + # units exist. + assert "ContinuousDiD" in all_text + assert "WAS" in all_text + assert "ATT(d)" in all_text + + def test_had_step_4_does_not_misframe_untreated_unit_routing( + self, mock_had_results, mock_had_event_study_results + ): + # Per REGISTRY: HAD is compatible with a small share of + # never-treated units (paper edge case), and on staggered + # event-study panels never-treated units are explicitly RETAINED + # (Appendix B.2 / had.py:1325). The Step-4 routing must NOT + # carry the wrong "if untreated → not HAD" framing. + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + step_4_steps = [s for s in output["next_steps"] if s["baker_step"] == 4] + all_text = " ".join( + (s.get("code", "") + " " + s.get("why", "") + " " + s.get("label", "")) + for s in step_4_steps + ).lower() + forbidden_phrases = ( + "switch away from had", + "had's was divisor under-weights", + "drop untreated", + "must drop never-treated", + ) + for phrase in forbidden_phrases: + assert phrase not in all_text, ( + f"HAD Step-4 must not carry the phrase {phrase!r}: " + f"per REGISTRY § HeterogeneousAdoptionDiD edge cases, " + f"HAD is compatible with a small share of never-treated " + f"units and explicitly retains them on staggered " + f"event-study panels." 
+ ) + + def test_handle_continuous_step_4_routes_to_had(self, mock_continuous_results): + # Symmetric pair: ContinuousDiD users with no untreated units + # should be routed to HeterogeneousAdoptionDiD. + output = practitioner_next_steps(mock_continuous_results, verbose=False) + step_4_steps = [s for s in output["next_steps"] if s["baker_step"] == 4] + assert len(step_4_steps) >= 1 + all_text = " ".join((s.get("code", "") + " " + s.get("why", "")) for s in step_4_steps) + assert "HeterogeneousAdoptionDiD" in all_text + + def test_handle_generic_ndarray_att_triggers_warning(self): + # Cross-handler regression: a future estimator that returns + # ndarray att and falls through to _handle_generic must produce + # the same all-NaN warning as the dedicated HAD event-study path. + class FutureNdarrayAttResults: + att: np.ndarray + + r = FutureNdarrayAttResults() + r.att = np.full(3, np.nan) + output = practitioner_next_steps(r, verbose=False) + warnings = " ".join(output["warnings"]) + assert "per-horizon" in warnings or "All" in warnings + + def test_had_handlers_string_only_no_attribute_reads( + self, mock_had_results, mock_had_event_study_results + ): + # Stability invariant #7: handlers are STRING-ONLY at runtime. + # The fixtures construct results with ONLY .att (and event_times + # on the event-study fixture); confirm no AttributeError is + # raised when the handlers run. Protects against a future + # refactor that starts reading result attributes inside a + # handler and silently breaks the minimal-fixture contract. + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + assert isinstance(output, dict) + assert "next_steps" in output + + def test_had_handler_snippets_are_valid_python_syntax( + self, mock_had_results, mock_had_event_study_results + ): + # Snippet smoke test: every code block emitted by the HAD + # handlers must parse as valid Python. 
Catches the failure mode + # where snippets embed placeholder syntax that does not parse + # (e.g. an angle-bracket placeholder like `survey_design=<design>` + # or unbalanced quotes that break copy/paste). Note that + # ast.parse only catches SyntaxError: undefined names (e.g. + # `survey_design=design` with no `design` in scope) and + # attribute typos still parse cleanly and are not caught here. + import ast + + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + for step in output["next_steps"]: + code = step.get("code", "") + if not code.strip(): + continue + try: + ast.parse(code) + except SyntaxError as e: + pytest.fail( + f"Step {step['baker_step']} ({step['label']!r}) " + f"emits a code snippet that does not parse as " + f"valid Python: {e}\n\nSnippet:\n{code}" + ) + + def test_handle_continuous_step_4_snippet_is_valid_python(self, mock_continuous_results): + # Same syntax check on the symmetric Step-4 in _handle_continuous. + import ast + + output = practitioner_next_steps(mock_continuous_results, verbose=False) + step_4_steps = [s for s in output["next_steps"] if s["baker_step"] == 4] + for step in step_4_steps: + code = step.get("code", "") + if code.strip(): + ast.parse(code)  # raises SyntaxError on failure + + def test_had_event_study_sup_t_snippet_uses_hc1_for_mass_point_survey_compatibility( + self, mock_had_event_study_results + ): + # Per had.py:3495-3507 the mass-point design rejects the + # default classical vcov family on the survey_design= path + # (NotImplementedError). The Step-6 sup-t snippet shows a + # generic weighted event-study fit; if it uses the default + # vcov_type a copy/paste on a mass-point panel raises at + # fit time. Snippet must either use vcov_type='hc1' / + # robust=True OR explicitly note the requirement so agents + # can adapt. + output = practitioner_next_steps(mock_had_event_study_results, verbose=False) + step_6_steps = [s for s in output["next_steps"] if s["baker_step"] == 6] + assert len(step_6_steps) >= 1 + # Find the sup-t / cband step (sensitivity step). 
+ sup_t = next( + (s for s in step_6_steps if "cband" in s.get("code", "")), + None, + ) + assert sup_t is not None, "sup-t / cband step not found at baker_step=6" + snippet = sup_t.get("code", "") + # Either the snippet itself uses vcov_type='hc1' / robust=True + # OR it documents the requirement inline (so agents adapting + # the snippet on a mass-point panel know to add it). + ok = ( + "vcov_type='hc1'" in snippet + or 'vcov_type="hc1"' in snippet + or "robust=True" in snippet + or ("mass-point" in snippet and "vcov_type" in snippet) + or ("mass_point" in snippet and "vcov_type" in snippet) + ) + assert ok, ( + "Sup-t / cband snippet must either use vcov_type='hc1' / " + "robust=True or surface the mass-point + survey vcov " + "requirement inline. Per had.py:3495-3507 the default " + "classical sandwich raises NotImplementedError on the " + "mass-point + survey path; the example as written would " + "fail at fit time on a mass-point panel." + ) + + def test_had_results_to_dict_docstring_matches_weighted_mass_point_contract(self): + # Parallel to the dataclass-field-docstring regression below: + # PR #402 R8 P3 caught that HeterogeneousAdoptionDiDResults.to_dict() + # docstring still described variance_formula as continuous-only + # / "pweight" or "survey_binder_tsl", contradicting the field + # docstrings (fixed in R5) and llms-full.txt (fixed in R3). + # Lock the to_dict() docstring against drift back. + from diff_diff.had import HeterogeneousAdoptionDiDResults + + doc = HeterogeneousAdoptionDiDResults.to_dict.__doc__ or "" + for label in ( + "pweight", + "survey_binder_tsl", + "pweight_2sls", + "survey_binder_tsl_2sls", + ): + assert label in doc, ( + f"HeterogeneousAdoptionDiDResults.to_dict() docstring " + f"must enumerate the {label!r} variance_formula label - " + f"weighted mass-point fits populate pweight_2sls / " + f"survey_binder_tsl_2sls per had.py:3585-3629. 
The " + f"to_dict() docstring is a public source-of-truth " + f"surface and must match the field docstrings + " + f"llms-full.txt HAD section." + ) + # effective_dose_mean: must mention mass-point Wald-IV semantics. + assert "mass_point" in doc or "mass-point" in doc, ( + "HeterogeneousAdoptionDiDResults.to_dict() docstring must " + "describe the mass-point effective_dose_mean semantics; " + "weighted mass-point fits populate it as the weighted " + "Wald-IV dose gap per had.py:3642-3660." + ) + assert "Wald-IV" in doc or "Z=1" in doc, ( + "HeterogeneousAdoptionDiDResults.to_dict() docstring must " + "describe the weighted Wald-IV dose gap semantics for " + "mass-point fits." + ) + + def test_had_results_dataclass_docstrings_match_weighted_mass_point_contract(self): + # PR #402 R3 fixed the llms-full.txt field descriptions to + # acknowledge that weighted mass-point fits populate + # variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} + # and effective_dose_mean as the weighted Wald-IV dose gap (per + # had.py:3585-3660). PR #402 R5 P3 caught that the dataclass + # field docstrings still said those fields were continuous-only + # / None on mass-point - leaving two source-of-truth surfaces + # disagreeing about the same public result object. Lock the + # dataclass docstrings against drift back to the continuous-only + # framing. + import inspect + + from diff_diff.had import HeterogeneousAdoptionDiDResults + + # Python discards the bare string literal that follows a field + # assignment: it is not attached to __dataclass_fields__, field + # metadata, or any runtime __doc__. So the only reliable check + # is textual: read the class source via inspect.getsource() and + # assert on the field-docstring blocks we care about. + src = inspect.getsource(HeterogeneousAdoptionDiDResults) + # variance_formula docstring must enumerate all 4 labels. 
+ assert "pweight_2sls" in src, ( + "HeterogeneousAdoptionDiDResults.variance_formula docstring " + "must mention `pweight_2sls` (weighted mass-point HC1/CR1 " + "label per had.py:3585-3629). Otherwise the dataclass " + "docstring contradicts llms-full.txt and the actual " + "implementation." + ) + assert "survey_binder_tsl_2sls" in src, ( + "HeterogeneousAdoptionDiDResults.variance_formula docstring " + "must mention `survey_binder_tsl_2sls` (weighted mass-point " + "Binder-TSL label)." + ) + # effective_dose_mean docstring must mention mass-point Wald-IV. + assert "mass_point" in src or "mass-point" in src, ( + "HeterogeneousAdoptionDiDResults.effective_dose_mean " + "docstring must mention mass-point semantics; weighted " + "mass-point fits populate it as the weighted Wald-IV dose " + "gap per had.py:3642-3660." + ) + assert "Wald-IV" in src or "Z=1" in src, ( + "HeterogeneousAdoptionDiDResults.effective_dose_mean " + "docstring must describe the weighted Wald-IV dose gap " + "semantics (or the underlying Z=1/Z=0 subgroup-mean form) " + "for mass-point fits." + ) + + def test_had_step_3_documents_earlier_pre_period_precondition_for_step_2( + self, mock_had_results, mock_had_event_study_results + ): + # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD + # § "Assumption 7 / step 2 closure" + had_pretests.py:4738-4756 + + # 2769: aggregate="event_study" closes step 2 ONLY IF the panel + # carries at least one earlier placebo pre-period beyond the + # base F-1. With only F-1 available the workflow sets + # pretrends_joint=None, all_pass=False, and the verdict carries + # 'joint pre-trends skipped (no earlier pre-period)'. Both HAD + # handler variants must surface this precondition - otherwise + # agents reading the guidance can think any multi-period + # event-study fit closes step 2 when it does not. 
+ for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + step_3_steps = [s for s in output["next_steps"] if s["baker_step"] == 3] + assert len(step_3_steps) == 1 + text = (step_3_steps[0].get("why", "") + " " + step_3_steps[0].get("code", "")).lower() + # Must mention "earlier" pre-period / placebo precondition. + assert "earlier" in text and ("pre-period" in text or "placebo" in text), ( + "Step-3 text must mention the 'earlier pre-period' " + "precondition for closing Assumption 7 / step 2 on the " + "event-study path. With only the base F-1 pre-period " + "the workflow returns pretrends_joint=None and the " + "verdict carries 'joint pre-trends skipped (no earlier " + "pre-period)' - step 2 stays uncovered." + ) + # Must mention the skip-fallback verdict so agents know + # what to expect when the precondition fails. + assert "skipped" in text or "pretrends_joint=none" in text, ( + "Step-3 text must surface the 'joint pre-trends skipped' " + "/ pretrends_joint=None fallback when no earlier " + "pre-period exists - otherwise agents cannot tell " + "whether step 2 was actually covered on a minimal " + "event-study fit." + ) + + def test_had_step_3_flags_qug_under_survey_deferral( + self, mock_had_results, mock_had_event_study_results + ): + # Per diff_diff/had_pretests.py:4488-4495 + REGISTRY § "QUG Null + # Test" Note (Phase 4.5 C0): when survey_design= / survey= / + # weights= is supplied, did_had_pretest_workflow skips the QUG + # step with a UserWarning and returns a linearity-conditional + # verdict only. Both HAD handler variants must surface this + # caveat so agents do not assume step 1 / Design 1' vs Design 1 + # was checked on weighted fits when the library deliberately + # cannot check it there. 
+ for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + step_3_steps = [s for s in output["next_steps"] if s["baker_step"] == 3] + assert len(step_3_steps) == 1 + text = (step_3_steps[0].get("why", "") + " " + step_3_steps[0].get("code", "")).lower() + # Must mention that survey-weighted fits skip QUG. + assert "skip" in text and "qug" in text, ( + "Step-3 text must explicitly say survey-weighted fits " + "skip QUG (Phase 4.5 C0 deferral). Without this caveat " + "agents may assume step 1 / Design 1' vs Design 1 was " + "checked on weighted fits when the library deliberately " + "does not check it there." + ) + # Must mention "linearity-conditional" verdict OR equivalent + # framing so agents know the weighted verdict is conditional + # on QUG holding by assumption. + assert ( + "linearity-conditional" in text + or "linearity conditional" in text + or "qug holding by assumption" in text + ), ( + "Step-3 text must describe the weighted verdict as " + "linearity-conditional / conditional on QUG holding by " + "assumption." + ) + + def test_had_step_3_pretest_assumption_labels_correct(self, mock_had_results): + # Per docs/methodology/REGISTRY.md and diff_diff/had_pretests.py + # docstrings: + # - did_had_pretest_workflow(aggregate="overall") covers paper + # Section 4.2 steps 1 + 3 ONLY; step 2 (Assumption 7 + # pre-trends) is explicitly NOT covered on the overall path. + # - qug_test = support-infimum test (H0: d_lower = 0), + # NOT "Assumption 5" (Design 1 sign identification, which is + # not testable per registry). + # - stute_test = Assumption 8 linearity, NOT Assumption 7 + # mean-independence. + # The single-period Step-3 guidance must not mislabel these. 
+ output = practitioner_next_steps(mock_had_results, verbose=False) + step_3_steps = [s for s in output["next_steps"] if s["baker_step"] == 3] + assert len(step_3_steps) == 1 + why = step_3_steps[0].get("why", "") + # Must NOT call QUG an "Assumption 5" test. + assert "QUG (Assumption 5" not in why, ( + "Step-3 why-text must not call QUG an 'Assumption 5' test - " + "QUG tests H_0: d_lower = 0 (paper Theorem 4); Assumption 5 " + "is the Design 1 sign-identification condition and is NOT " + "testable per registry." + ) + # Must NOT claim Stute is Assumption 7 mean-independence. + forbidden = ( + "Stute (Assumption 7", + "Stute / Yatchew-HR Assumption 7", + "Assumption 7 mean-independence", + ) + for phrase in forbidden: + assert phrase not in why, ( + f"Step-3 why-text must not carry the phrase {phrase!r} - " + f"stute_test / yatchew_hr_test are Assumption 8 linearity " + f"tests (paper Section 4.2 step 3); Assumption 7 (pre-trends) " + f"is paper step 2 and is NOT covered on the overall workflow " + f"path - the workflow's verdict explicitly flags that gap." + ) + # Must positively acknowledge the Assumption 7 / step 2 gap on + # the overall path (not silently imply it's covered). + assert "Assumption 7" in why or "step 2" in why, ( + "Step-3 why-text must explicitly mention Assumption 7 / step 2 " + "to acknowledge the gap on the overall workflow path - " + "agents reading the guidance must not assume the workflow " + "covers what it does not cover." + )
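
For reviewers skimming the NaN-dispatch tests above, the contract they lock in can be stated in isolation. The sketch below is a hypothetical stand-in, not the library's code (the real logic lives in `diff_diff/practitioner.py::_check_nan_att`, whose name and signature are not reproduced here); it only illustrates the documented semantics: a scalar NaN `att` warns, an ndarray `att` warns only when every per-horizon entry is NaN, and partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) never fire.

```python
import math

import numpy as np


def check_nan_att(att) -> bool:
    """Hypothetical sketch of the warn decision: True means 'warn about NaN att'.

    Mirrors the documented np.all(np.isnan(arr)) semantics so that
    partial-NaN event-study arrays do not over-fire the warning.
    """
    if isinstance(att, np.ndarray):
        # Array path: warn only if the array is non-empty and ALL-NaN.
        return bool(att.size) and bool(np.all(np.isnan(att)))
    # Scalar path: warn on a plain float NaN.
    return isinstance(att, float) and math.isnan(att)


print(check_nan_att(float("nan")))                  # scalar NaN -> True (warn)
print(check_nan_att(np.full(5, np.nan)))            # all-NaN array -> True (warn)
print(check_nan_att(np.array([0.5, np.nan, 0.3])))  # partial NaN -> False (no warn)
```

The key design choice the tests pin down is the all-vs-any distinction: using `np.any(np.isnan(arr))` instead would flag legitimate partial-NaN event-study output, which is exactly the over-firing mode `test_had_partial_nan_array_no_warning` guards against.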