
Add GNN calorimeter clustering (split-module design) #1823

Draft
zwl0331 wants to merge 2 commits into Mu2e:main from zwl0331:gnn-clustering

Contributor

zwl0331 commented May 7, 2026

Summary

Adds a Graph Neural Network calorimeter clustering algorithm that runs
alongside the existing seed+BFS CaloClusterMaker. Both clustering
chains run, both emit a CaloClusterCollection; downstream consumers
select via (module_label, instance_name). The GNN output ships under
instance name "GNN" so existing BFS-reading analyses are untouched.

Status: draft. The build links onnxruntime via the u092 muse
manifest (Sophie's hook into Andy's local install) and tracks
Mu2e/ArtAnalysis#4, which is also a draft pending the central muse
onnxruntime package. Once that lands, the rebase-and-flip is small:

  • drop the u092 qualifier in favour of the standard one,
  • mirror whatever final SConscript dependency name ArtAnalysis#4
    settles on (likely unchanged at 'onnxruntime'),
  • gh pr ready.

The C++ source itself is independent of which qualifier provides
onnxruntime.

Design (split-module)

Following the design meeting with Sophie Middleton + Andrew Edmonds
(2026-04-29), the algorithm is split into two art::EDProducers
joined by a transient data product:

event -> ... -> CaloHitMaker -> CaloHitCollection ---+--> CaloClusterMaker        (existing, BFS, untouched)
                                                     |        -> CaloClusterCollection [instance: ""]
                                                     |
                                                     +--> CaloHitGraphMaker        (NEW, step 1)
                                                              -> CaloHitGraphCollection (transient)
                                                                  |
                                                                  v
                                                          CaloClusterMakerGNN     (NEW, steps 2 + 3)
                                                              -> CaloClusterCollection [instance: "GNN"]

CaloClusterMakerGNN is model-agnostic: an A/B comparison job can
declare the same C++ class twice to run SimpleEdgeNet (SEN) and
CaloClusterNet (CCN) side by side, each instance with its own FHiCL
tauEdge and expectedModelVersion. Production declares a single
instance with the CCN artifact.

Metadata-props deployment contract

The C++ session loader asserts the loaded .onnx's
metadata_props map matches FHiCL expectations at job start, so a
silently-out-of-sync retraining is caught before the first event:

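| Key           | Example                              | FHiCL parameter      |
|---------------|--------------------------------------|----------------------|
| model_version | calo-cluster-net-v2-stage1           | expectedModelVersion |
| node_features | log_e,t,x,y,r,e_rel                  | expectedNodeFeatures |
| edge_features | dx,dy,d,dt,dlog_e,asym_e,logsum_e,dr | expectedEdgeFeatures |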

Mismatches abort the job loudly.
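
For readers who want to see the shape of this check, here is a minimal Python sketch of the comparison logic. It is an illustration only, not the C++ session loader in this PR: the function name `check_model_contract` and the plain-dict representation of `metadata_props` are assumptions made for the example, and the expected values are copied from the rows above.

```python
def check_model_contract(metadata_props, expected):
    """Raise RuntimeError if any expected key is missing or differs.

    metadata_props: dict of key -> string read from the model file
                    (hypothetical representation for this sketch).
    expected:       dict of key -> string taken from configuration.
    """
    problems = []
    for key, want in expected.items():
        got = metadata_props.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, model has {got!r}")
    if problems:
        # Fail before any event is processed, listing every mismatch at once.
        raise RuntimeError("ONNX metadata_props mismatch: " + "; ".join(problems))


# Values copied from the rows above; in the real job they come from FHiCL.
expected = {
    "model_version": "calo-cluster-net-v2-stage1",
    "node_features": "log_e,t,x,y,r,e_rel",
    "edge_features": "dx,dy,d,dt,dlog_e,asym_e,logsum_e,dr",
}
```

Comparing the feature lists as whole strings means a reordered feature list counts as a mismatch, which is the point: tensor column order is part of the contract.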

What this adds

Two commits, reviewer-friendly split:

Add GNN calorimeter clustering (split-module design)

Production code, FHiCL wiring, and trained model artifacts.

| Path | Role |
|------|------|
| RecoDataProducts/inc/CaloHitGraph.hh | Transient per-disk data product (flat tensors + art::Ptr<CaloHit> back-references) |
| CaloCluster/{inc,src}/GnnGraphBuilder.{hh,cc} | Step-1 helper: per-disk hit collection, 6 node + 8 edge features, brute-force pairwise radius graph at r_max=210 mm (faithful to the Python scipy.spatial.cKDTree behaviour at N <= ~65 hits/disk), time filter, kNN fallback, degree cap, z-score normalisation from JSON sidecar |
| CaloCluster/src/CaloHitGraphMaker_module.cc | Step-1 EDProducer |
| CaloCluster/{inc,src}/GnnClusterAssembler.{hh,cc} | Step-3 helper (CCN+BFS10 recipe): sigmoid + symmetrise → threshold → BFS-from-highest-energy with bfsExpandCut=10 MeV ExpandCut → minHits / minEnergyMeV cleanup → contiguous relabel |
| CaloCluster/src/CaloClusterMakerGNN_module.cc | Step-2+3 EDProducer. ONNX session in member-init order; metadata_props assertion against FHiCL at job start; zero-copy tensor views over the graph payload at inference time; CaloCluster construction via the existing ClusterUtils::cog3Vector |
| CaloCluster/data/calo_cluster_net_v2_stage1.{onnx,norm.json} | Production CCN artifact (2.6 MB + ~1 KB) |
| CaloCluster/data/simple_edge_net_v2.onnx | SEN artifact (0.84 MB) for A/B studies |
| CaloCluster/fcl/prolog.fcl | New CaloClusterGNN block defining the two producers with frozen recipe defaults; bundled Reco sequence for one-line inclusion |
| CaloCluster/fcl/from_mcs-gnn-prod.fcl | Production-style standalone FCL: runs the GNN chain on MCS art-format input, writes both BFS and GNN CaloClusterCollections |
| CaloCluster/src/SConscript | Adds 'onnxruntime' to plugin deps |
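
The Step-1 graph construction listed for GnnGraphBuilder can be sketched in a few lines of Python. This is a hedged illustration, not the C++ helper: the function name `build_edges` and the `(x, y, t)` hit tuples are made up for the example. The cut values (r_max=210 mm, |dt| <= 25 ns, k_min=3, k_max=20) are the ones quoted in the commit message, but the exact fallback and cap semantics are assumptions and are marked as such in the comments.

```python
import math


def build_edges(hits, r_max=210.0, dt_max=25.0, k_min=3, k_max=20):
    """hits: list of (x_mm, y_mm, t_ns). Returns directed (src, dst) edges."""
    n = len(hits)
    edges = []
    for i in range(n):
        xi, yi, ti = hits[i]
        # Brute-force distances from hit i to every other hit, nearest first
        # (ties broken by hit index so the result is deterministic).
        others = sorted(
            (math.hypot(hits[j][0] - xi, hits[j][1] - yi), j)
            for j in range(n) if j != i
        )
        # Radius cut plus time filter.
        nbrs = [j for d, j in others
                if d <= r_max and abs(hits[j][2] - ti) <= dt_max]
        # ASSUMPTION: the kNN fallback ignores both cuts and simply takes
        # the k_min nearest hits when too few neighbours survive.
        if len(nbrs) < k_min:
            nbrs = [j for _, j in others[:k_min]]
        # ASSUMPTION: the degree cap keeps the k_max nearest neighbours
        # of each source node.
        nbrs = nbrs[:k_max]
        edges.extend((i, j) for j in nbrs)
    return edges
```

The O(N^2) loop is affordable at the quoted occupancy of roughly 65 hits per disk or fewer, and it avoids any dependence on tree-traversal order when matching a reference implementation.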

Add GNN clustering parity tests

| Path | Role |
|------|------|
| CaloCluster/src/testGnnClusterAssembler_main.cc | Standalone (make_bin) executable: loads JSON parity payload from the training repo, replays GnnClusterAssembler, asserts byte-identical labels against Python (Stage 2) |
| CaloCluster/src/CaloHitGraphParityDump_module.cc | art::EDAnalyzer: dumps per-event-disk CaloHits + GNN cluster labels to a flat TTree for end-to-end Python comparison (Stage 3) |
| CaloCluster/fcl/from_mcs-gnn-test.fcl | Drives the parity-dump analyzer over MCS art input |
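
To make clear what the parity tests replay, here is a rough Python sketch of the Step-3 recipe (sigmoid + symmetrise, threshold, BFS from the highest-energy hit with an expand cut, cleanup, contiguous relabel). It is not the GnnClusterAssembler code or the training-repo reference. The function names, the use of -1 for dropped hits, symmetrising by averaging the two directions, and letting any unassigned hit act as a seed are all assumptions for illustration.

```python
import math
from collections import deque


def sigmoid(z):
    # Numerically stable for large |z|.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


def assemble(energies, edges, logits, tau_edge, expand_cut=10.0,
             min_hits=1, min_energy=0.0):
    """Return one cluster label per hit; -1 marks hits dropped in cleanup."""
    n = len(energies)
    # Collect the directed probabilities of each undirected pair.
    pair_probs = {}
    for (i, j), z in zip(edges, logits):
        pair_probs.setdefault((min(i, j), max(i, j)), []).append(sigmoid(z))
    # ASSUMPTION: symmetrise by averaging; keep pairs passing the threshold.
    adj = {i: [] for i in range(n)}
    for (i, j), ps in pair_probs.items():
        if sum(ps) / len(ps) >= tau_edge:
            adj[i].append(j)
            adj[j].append(i)

    assigned = [False] * n
    clusters = []
    # ASSUMPTION: seeds are unassigned hits in descending energy order.
    for seed in sorted(range(n), key=lambda k: (-energies[k], k)):
        if assigned[seed]:
            continue
        assigned[seed] = True
        members, queue = [seed], deque([seed])
        while queue:
            cur = queue.popleft()
            # Hits below the expand cut join the cluster but recruit nobody.
            if energies[cur] < expand_cut:
                continue
            for nb in adj[cur]:
                if not assigned[nb]:
                    assigned[nb] = True
                    members.append(nb)
                    queue.append(nb)
        clusters.append(members)

    # Cleanup, then relabel the surviving clusters 0, 1, 2, ...
    labels = [-1] * n
    next_id = 0
    for members in clusters:
        if len(members) >= min_hits and sum(energies[m] for m in members) >= min_energy:
            for m in members:
                labels[m] = next_id
            next_id += 1
    return labels
```

Because every step is deterministic given the same inputs and tie-breaking, a recipe of this shape can be compared label-for-label across two implementations, which is what the Stage 2 and Stage 3 checks rely on.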

Headline result

276,688 events / 481,543 disk-graphs on the MDC2025 mixed-pileup test
set, calo-entrant truth, E_reco >= 50 MeV downstream cut (clusters
that actually enter track finding):

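| Metric                | BFS   | CCN+BFS10 | Change |
|-----------------------|-------|-----------|--------|
| Mean abs(dE) / MeV    | 0.839 | 0.616     | -27%   |
| 95th-pct abs(dE) / MeV| 3.520 | 2.338     | -34%   |
| Mean centroid dr / mm | 1.589 | 1.292     | -19%   |
| 95th-pct dr / mm      | 3.606 | 2.294     | -36%   |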

Signal region (95-110 MeV, 47,279 clusters): mean abs(dE) drops
from 0.368 to 0.210 MeV (-43%), mean dr from 0.559 to 0.460 mm (-18%).
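
The Change figures are plain relative differences against the BFS baseline, rounded to the nearest percent. A small Python check, using only the numbers quoted above (the helper name `pct_change` is just for this example):

```python
def pct_change(baseline, new):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# (BFS, CCN+BFS10) pairs copied from the results above.
rows = {
    "Mean abs(dE) / MeV": (0.839, 0.616),
    "95th-pct abs(dE) / MeV": (3.520, 2.338),
    "Mean centroid dr / mm": (1.589, 1.292),
    "95th-pct dr / mm": (3.606, 2.294),
    "Signal-region mean abs(dE) / MeV": (0.368, 0.210),
    "Signal-region mean dr / mm": (0.559, 0.460),
}

for name, (bfs, gnn) in rows.items():
    print(f"{name}: {pct_change(bfs, gnn):+.0f}%")
```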

Parity validation

Both stages already pass on real data:

  • Stage 2 (assembler-only, 100 packed val graphs / 1,147 nodes /
    2,768 edges):
    100/100 byte-exact.
  • Stage 3 (full mu2e art job on real MCS input via
    from_mcs-gnn-test.fcl + Python comparison harness, 50 events / 100
    disk-graphs / 8,502 hits):
    100/100 byte-exact.

Stage 3 implicitly covers Stage 1 (graph-maker parity): any divergence
in graph construction would propagate to mismatched cluster labels.
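
The pass criterion in both stages is exact agreement of per-hit cluster labels, graph by graph. A minimal Python sketch of such a comparison follows. The function name `compare_labels` and the list-of-lists input are assumptions for illustration; this is not the test executable or the training-repo comparison script.

```python
def compare_labels(reference, candidate):
    """Count disagreements between two sets of per-graph label lists.

    reference, candidate: lists with one list of integer labels per graph.
    Returns (mismatch_graphs, mismatch_nodes).
    """
    if len(reference) != len(candidate):
        raise ValueError("different number of graphs")
    bad_graphs = 0
    bad_nodes = 0
    for ref, got in zip(reference, candidate):
        if len(ref) != len(got):
            raise ValueError("graphs with different node counts")
        diff = sum(1 for a, b in zip(ref, got) if a != b)
        bad_nodes += diff
        bad_graphs += 1 if diff else 0
    return bad_graphs, bad_nodes


def passed(reference, candidate):
    # Parity holds only if no graph and no node disagrees.
    return compare_labels(reference, candidate) == (0, 0)
```

Note that an exact comparison of this kind also requires both sides to number their clusters in the same order, not merely to group the same hits together.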

Reproduce locally:

muse setup -q u092      # switch to the central muse onnxruntime qualifier once it lands
build/al9-prof-e29-u092/Offline/bin/testGnnClusterAssembler \
    /path/to/calo_cluster_net_v2_stage1.parity.json
# -> [PASS] all 100 graphs match Python cluster_labels byte-exactly

Coordinated PRs

| Repo | PR | Status | Relationship |
|------|----|--------|--------------|
| Mu2e/EventNtuple | #366 | open | Adds calomcsim.ancestorSimIds. Used by the training-time truth-labelling step. This PR does not depend on EventNtuple#366 landing: the C++ Offline code only consumes CaloHit collections, not the new EventNtuple branch. |
| Mu2e/ArtAnalysis | #4 | draft | TrackQuality ONNX integration. Sets the onnxruntime build-dep pattern this PR mirrors. This PR's draft status tracks ArtAnalysis#4; both are blocked on the central muse onnxruntime package. |
| Mu2e/MLTrain | #7 | open | CaloClusterGNN/ subdirectory containing the full training pipeline that produces the .onnx artifacts shipped in CaloCluster/data/ here. |

Try it

cd <muse-work-dir>
muse setup -q u092       # while u092 is the right qualifier
muse build -j32

# Standalone parity test
build/al9-prof-e29-u092/Offline/bin/testGnnClusterAssembler \
    <path-to-parity.json>

# Production-style smoke test (5 events from an MCS art file)
mu2e -c Offline/CaloCluster/fcl/from_mcs-gnn-prod.fcl \
     -s <path-to-mcs.art> -n 5 -o /tmp/mcs.gnn.art

# Output art file carries both BFS and GNN clusters
build/al9-prof-e29-u092/Offline/bin/artProductSizes /tmp/mcs.gnn.art \
    | grep CaloClusters

Acknowledgement

Implementation, refactoring, and documentation drafting were assisted
by Anthropic's Claude (Claude Code). Scientific decisions, training
campaign, validation results, and the v1->v2 truth-definition design
are my own work; Claude was used as a coding assistant.

Test plan

  • Rebuild against the central muse onnxruntime package once it
    lands (drop u092, mirror final SConscript dep name from
    ArtAnalysis#4).
  • Re-run testGnnClusterAssembler on the parity payload — expect
    [PASS] all 100 graphs match.
  • Re-run from_mcs-gnn-test.fcl end-to-end on a small MCS art
    file + the training-repo Python comparison — expect 100/100.
  • Add the GNN producers to a real production reco sequence in
    Production/JobConfig/... (separate follow-up PR).

zwl0331 added 2 commits May 7, 2026 12:47

Add GNN calorimeter clustering (split-module design)

Adds a Graph-Neural-Network-based calorimeter clustering algorithm
that runs alongside the existing seed+BFS CaloClusterMaker -- both
producers run, both emit a CaloClusterCollection, downstream consumers
select via (module_label, instance_name). The GNN output ships under
instance name "GNN" so existing BFS-reading analyses are untouched.

Following the design-meeting outcome (with Andy Edmonds + Sophie
Middleton, 2026-04-29), the algorithm is split into two art modules
joined by a transient data product:

  CaloHitMaker -> CaloHitCollection
                 |
                 +-- CaloClusterMaker             (existing, BFS, untouched)
                 |       -> CaloClusterCollection ("")
                 |
                 +-- CaloHitGraphMaker            (NEW, step 1)
                         -> CaloHitGraphCollection (transient)
                            |
                            v
                        CaloClusterMakerGNN       (NEW, steps 2 + 3)
                            -> CaloClusterCollection ("GNN")

New files:

* RecoDataProducts/inc/CaloHitGraph.hh
    Per-disk graph data product. Carries the three normalised tensors
    the ONNX model expects (x, edge_index, edge_attr, all flat) plus
    per-node art::Ptr<CaloHit> back-references. Transient -- not
    registered for ROOT serialisation.

* CaloCluster/{inc,src}/GnnGraphBuilder.{hh,cc}
    Step-1 helper. Per-disk hit collection from the CaloHitCollection;
    six node features (log E, t, x, y, r, e_rel) and eight edge
    features computed from the Calorimeter geometry service; radius
    graph at r_max=210 mm via brute-force pairwise distance loop
    (faithful to scipy.spatial.cKDTree at N <= ~65 hits/disk); time
    filter |dt| <= 25 ns; kNN fallback at k_min=3; per-source-node
    degree cap at k_max=20; z-score normalisation using train-split
    statistics loaded from a JSON sidecar.

* CaloCluster/src/CaloHitGraphMaker_module.cc
    The step-1 EDProducer. Consumes CaloHitCollection, partitions by
    disk, runs GnnGraphBuilder once per disk, emits the
    CaloHitGraphCollection. Norm sidecar resolved by
    ConfigFileLookupPolicy.

* CaloCluster/{inc,src}/GnnClusterAssembler.{hh,cc}
    Step-3 helper (CCN+BFS10 recipe): sigmoid + symmetrise directed
    edge logits; threshold at tau_edge; BFS traversal seeded from
    highest-energy hits, with the bfs_expand_cut=10 MeV ExpandCut
    rule (hits below cut join the cluster but do not recruit further
    neighbours -- mirrors Offline ClusterFinder semantics); cleanup
    by min_hits and min_energy_mev; relabel to contiguous IDs.

* CaloCluster/src/CaloClusterMakerGNN_module.cc
    The step-2 + step-3 EDProducer. Loads the ONNX session in the
    constructor (via Ort::Env / Ort::SessionOptions / Ort::Session
    in member-declaration order, which is RAII-safe). Asserts the
    loaded model's metadata_props (model_version, node_features,
    edge_features) against FHiCL expectations -- silent tensor-layout
    drift after a retraining is caught loudly. produce() runs ONNX
    inference per disk (zero-copy tensor views over the CaloHitGraph
    payload), invokes GnnClusterAssembler, then builds CaloClusters
    via the existing ClusterUtils linear cog3Vector helper.

    Class is model-agnostic: production declares one instance with
    the CCN .onnx; A/B comparison jobs declare a second instance with
    sen.onnx and a different tau_edge.

* CaloCluster/data/calo_cluster_net_v2_stage1.onnx (2.6 MB)
* CaloCluster/data/calo_cluster_net_v2_stage1.norm.json (~1 KB)
* CaloCluster/data/simple_edge_net_v2.onnx (0.84 MB)
    Trained model artifacts. Resolved at runtime by
    ConfigFileLookupPolicy. The .onnx files carry the
    metadata_props deployment contract; the .json carries the
    train-split z-score normalisation statistics.

* CaloCluster/fcl/prolog.fcl
    New CaloClusterGNN block defining caloHitGraphMakerGNN +
    caloClusterMakerGNN with the frozen CCN+BFS10 recipe defaults.
    Production FCLs include the bundled sequence:
        physics.<reco-path> : [ ..., @sequence::CaloClusterGNN.Reco ]

* CaloCluster/fcl/from_mcs-gnn-prod.fcl
    Production-style standalone FCL that runs the GNN chain on MCS
    art-format input and writes both BFS and GNN CaloClusterCollections
    to the output art file.

* CaloCluster/src/SConscript
    Adds 'onnxruntime' to the plugins dependency list. Picks up the
    central muse onnxruntime install via the u092 qualifier.

Build dependency: the central muse `onnxruntime` package; activate
with `muse setup -q u092` (the qualifier providing the central
onnxruntime hook, mirroring the pattern in Mu2e/ArtAnalysis#4).

Training repo: see Mu2e/MLTrain CaloClusterGNN/.
Training-data branch dependency: Mu2e/EventNtuple#366 adds
calomcsim.ancestorSimIds, used by the truth-labelling step in
training only -- this Offline-side code does not depend on the
EventNtuple PR landing.

Headline test-set numbers (276,688 events / 481,543 disk-graphs,
calo-entrant truth, E_reco >= 50 MeV downstream cut):

| Metric                | BFS   | CCN+BFS10 | Change |
|-----------------------|-------|-----------|--------|
| Mean abs(dE) / MeV    | 0.839 | 0.616     | -27%   |
| 95th-pct abs(dE) / MeV| 3.520 | 2.338     | -34%   |
| Mean centroid dr / mm | 1.589 | 1.292     | -19%   |
| 95th-pct dr / mm      | 3.606 | 2.294     | -36%   |

Add GNN clustering parity tests

Two test artefacts plus their FHiCL wiring, exercising the C++
implementation against the Python pipeline that trained the model.

* CaloCluster/src/testGnnClusterAssembler_main.cc
    Standalone executable (built via helper.make_bin) that loads a
    JSON parity payload produced by the training repo's
    scripts/dump_parity_payloads.py, replays GnnClusterAssembler on
    each disk-graph, and asserts byte-identical cluster_labels
    against the Python reference. Stage-2 of the parity gate.

    Expected output:
        graphs:           N
        mismatch graphs:  0
        mismatch nodes:   0
        [PASS] all N graphs match Python cluster_labels byte-exactly

* CaloCluster/src/CaloHitGraphParityDump_module.cc
    art::EDAnalyzer that consumes a CaloClusterCollection emitted by
    CaloClusterMakerGNN plus the source CaloHitCollection and writes
    a flat TTree (per event-disk: crystalIDs, time, eDep, GNN
    cluster labels). Used by the training-repo Python script
    scripts/compare_parity_dump.py to replay the same hits through
    the Python pipeline and assert byte-exact agreement on cluster
    labels end-to-end. Stage-3 of the parity gate.

* CaloCluster/fcl/from_mcs-gnn-test.fcl
    Drives the parity-dump analyzer over MCS art-format input. Also
    serves as a minimal smoke-test for the C++ pipeline; outputs a
    parity_dump.root TTree.

Both stages already pass on real data:
* Stage-2 (assembler-only on packed val graphs): 100/100 disk-graphs
  byte-exact (1,147 nodes, 2,768 edges).
* Stage-3 (full mu2e art job on MCS art files, via from_mcs-gnn-test.fcl
  + the training-repo Python comparison): 100/100 disk-graphs
  byte-exact (8,502 hits over 50 events).
@FNALbuild
Collaborator

Hi @zwl0331,
You have proposed changes to files in these packages:

  • RecoDataProducts
  • CaloCluster

which require these tests: build.

@Mu2e/fnalbuild-users, @Mu2e/write have access to CI actions on main.

⌛ The following tests have been triggered for a58c90e: build (Build queue - API unavailable)

About FNALbuild. Code review on Mu2e/Offline.

@FNALbuild
Collaborator

☔ The build is failing at a58c90e.

scons: *** [build/al9-prof-e29-p094/Offline/tmp/CaloCluster/src/CaloClusterMakerGNN_module.os] Error 1
scons: *** [build/al9-prof-e29-p094/Offline/lib/libmu2e_CaloCluster_CaloClusterFast_module.so] Error 1
scons: *** [build/al9-prof-e29-p094/Offline/lib/libmu2e_CaloCluster_CaloHitGraphMaker_module.so] Error 1
scons: *** [build/al9-prof-e29-p094/Offline/lib/libmu2e_CaloCluster_CaloClusterMaker_module.so] Error 1
scons: *** [build/al9-prof-e29-p094/Offline/lib/libmu2e_CaloCluster_CaloHitGraphParityDump_module.so] Error 1
scons: *** [build/al9-prof-e29-p094/Offline/lib/libmu2e_CaloCluster_CaloProtoClusterMaker_module.so] Error 1
scons: *** [build/al9-prof-e29-p094/Offline/lib/libmu2e_CaloCluster_CaloTrigger_module.so] Error 1
| Test | Result | Details |
|------|--------|---------|
| test with | | Command did not list any other PRs to include |
| merge | | Merged a58c90e at d2340b7 |
| build (prof) | | Log file. |
| ceSimReco | 〰️ | Log file. |
| g4test_03MT | 〰️ | Log file. |
| transportOnly | 〰️ | Log file. |
| POT | 〰️ | Log file. |
| g4study | 〰️ | Log file. |
| cosmicSimReco | 〰️ | Log file. |
| cosmicOffSpill | 〰️ | Log file. |
| ceSteps | 〰️ | Log file. |
| ceDigi | 〰️ | Log file. |
| muDauSteps | 〰️ | Log file. |
| ceMix | 〰️ | Log file. |
| rootOverlaps | 〰️ | Log file. |
| g4surfaceCheck | 〰️ | Log file. |
| trigger | | Log file. |
| check_cmake | 〰️ | Log file. |
| FIXME, TODO | | TODO (0) FIXME (0) in 9 files |
| clang-tidy | ➡️ | 7 errors 19 warnings |
| whitespace check | | no whitespace errors found |

N.B. These results were obtained from a build of this Pull Request at a58c90e after being merged into the base branch at d2340b7.

For more information, please check the job page here.
Build artifacts are deleted after 5 days. If this is not desired, select Keep this build forever on the job page.

@rlcee
Collaborator

rlcee commented May 7, 2026

Offline/CaloCluster/src/CaloClusterMakerGNN_module.cc:46:10: fatal error: onnxruntime/core/session/onnxruntime_cxx_api.h: No such file or directory
   46 | #include "onnxruntime/core/session/onnxruntime_cxx_api.h"

ORT is not available yet. Andy is testing the prototype.
