Add GNN calorimeter clustering (split-module design)#1823
Adds a Graph-Neural-Network-based calorimeter clustering algorithm
that runs alongside the existing seed+BFS CaloClusterMaker -- both
producers run, both emit a CaloClusterCollection, downstream consumers
select via (module_label, instance_name). The GNN output ships under
instance name "GNN" so existing BFS-reading analyses are untouched.
Following the design-meeting outcome (with Andy Edmonds + Sophie
Middleton, 2026-04-29), the algorithm is split into two art modules
joined by a transient data product:
    CaloHitMaker -> CaloHitCollection
          |
          +-- CaloClusterMaker (existing, BFS, untouched)
          |     -> CaloClusterCollection ("")
          |
          +-- CaloHitGraphMaker (NEW, step 1)
                -> CaloHitGraphCollection (transient)
                      |
                      v
                CaloClusterMakerGNN (NEW, steps 2 + 3)
                      -> CaloClusterCollection ("GNN")
New files:
* RecoDataProducts/inc/CaloHitGraph.hh
Per-disk graph data product. Carries the three normalised tensors
the ONNX model expects (x, edge_index, edge_attr, all flat) plus
per-node art::Ptr<CaloHit> back-references. Transient -- not
registered for ROOT serialisation.
* CaloCluster/{inc,src}/GnnGraphBuilder.{hh,cc}
Step-1 helper. Per-disk hit collection from the CaloHitCollection;
six node features (log E, t, x, y, r, e_rel) and eight edge
features computed from the Calorimeter geometry service; radius
graph at r_max=210 mm via brute-force pairwise distance loop
(faithful to scipy.spatial.cKDTree at N <= ~65 hits/disk); time
filter |dt| <= 25 ns; kNN fallback at k_min=3; per-source-node
degree cap at k_max=20; z-score normalisation using train-split
statistics loaded from a JSON sidecar.
* CaloCluster/src/CaloHitGraphMaker_module.cc
The step-1 EDProducer. Consumes CaloHitCollection, partitions by
disk, runs GnnGraphBuilder once per disk, emits the
CaloHitGraphCollection. Norm sidecar resolved by
ConfigFileLookupPolicy.
* CaloCluster/{inc,src}/GnnClusterAssembler.{hh,cc}
Step-3 helper (CCN+BFS10 recipe): sigmoid + symmetrise directed
edge logits; threshold at tau_edge; BFS traversal seeded from
highest-energy hits, with the bfs_expand_cut=10 MeV ExpandCut
rule (hits below cut join the cluster but do not recruit further
neighbours -- mirrors Offline ClusterFinder semantics); cleanup
by min_hits and min_energy_mev; relabel to contiguous IDs.
* CaloCluster/src/CaloClusterMakerGNN_module.cc
The step-2 + step-3 EDProducer. Loads the ONNX session in the
constructor (via Ort::Env / Ort::SessionOptions / Ort::Session
in member-declaration order, which is RAII-safe). Asserts the
loaded model's metadata_props (model_version, node_features,
edge_features) against FHiCL expectations -- silent tensor-layout
drift after a retraining is caught loudly. produce() runs ONNX
inference per disk (zero-copy tensor views over the CaloHitGraph
payload), invokes GnnClusterAssembler, then builds CaloClusters
via the existing ClusterUtils linear cog3Vector helper.
Class is model-agnostic: production declares one instance with
the CCN .onnx; A/B comparison jobs declare a second instance with
sen.onnx and a different tau_edge.
* CaloCluster/data/calo_cluster_net_v2_stage1.onnx (2.6 MB)
* CaloCluster/data/calo_cluster_net_v2_stage1.norm.json (~1 KB)
* CaloCluster/data/simple_edge_net_v2.onnx (0.84 MB)
Trained model artifacts. Resolved at runtime by
ConfigFileLookupPolicy. The .onnx files carry the
metadata_props deployment contract; the .json carries the
train-split z-score normalisation statistics.
* CaloCluster/fcl/prolog.fcl
New CaloClusterGNN block defining caloHitGraphMakerGNN +
caloClusterMakerGNN with the frozen CCN+BFS10 recipe defaults.
Production FCLs include the bundled sequence:
physics.<reco-path> : [ ..., @sequence::CaloClusterGNN.Reco ]
* CaloCluster/fcl/from_mcs-gnn-prod.fcl
Production-style standalone FCL that runs the GNN chain on MCS
art-format input and writes both BFS and GNN CaloClusterCollections
to the output art file.
* CaloCluster/src/SConscript
Adds 'onnxruntime' to the plugins dependency list. Picks up the
central muse onnxruntime install via the u092 qualifier.
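The step-3 recipe described for GnnClusterAssembler above (sigmoid, symmetrise, threshold, energy-seeded BFS with the ExpandCut rule, cleanup, relabel) can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `AssemblerParams`, `assembleClusters`, the default `tauEdge` value, the choice of averaging the two directed probabilities when symmetrising, and applying the ExpandCut to seeds as well are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical parameter bundle; names follow the prose, not the real class.
struct AssemblerParams {
  double tauEdge = 0.5;        // assumed value, not stated in this PR
  double expandCutMeV = 10.0;  // bfs_expand_cut from the description
  std::size_t minHits = 1;     // assumed default
  double minEnergyMeV = 0.0;   // assumed default
};

// Returns one label per node: 0..K-1 for kept clusters, -1 for dropped hits.
std::vector<int> assembleClusters(const std::vector<double>& energyMeV,
                                  const std::vector<std::pair<int, int>>& edges,
                                  const std::vector<double>& logits,
                                  const AssemblerParams& p) {
  assert(edges.size() == logits.size());
  const std::size_t n = energyMeV.size();

  // 1) Sigmoid, then symmetrise by averaging (i,j) and (j,i) (assumption).
  std::map<std::pair<int, int>, std::pair<double, int>> acc;  // (sum, count)
  for (std::size_t e = 0; e < edges.size(); ++e) {
    const auto [a, b] = edges[e];
    const double prob = 1.0 / (1.0 + std::exp(-logits[e]));
    auto& slot = acc[{std::min(a, b), std::max(a, b)}];
    slot.first += prob;
    slot.second += 1;
  }

  // 2) Threshold at tauEdge to get an undirected adjacency list.
  std::vector<std::vector<int>> adj(n);
  for (const auto& [key, val] : acc) {
    if (val.first / val.second >= p.tauEdge) {
      adj[key.first].push_back(key.second);
      adj[key.second].push_back(key.first);
    }
  }

  // 3) BFS seeded from the highest-energy unassigned hit.
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(),
                   [&](int a, int b) { return energyMeV[a] > energyMeV[b]; });
  std::vector<int> label(n, -1);
  int nRaw = 0;
  for (int seed : order) {
    if (label[seed] != -1) continue;
    std::queue<int> q;
    q.push(seed);
    label[seed] = nRaw;
    while (!q.empty()) {
      const int u = q.front();
      q.pop();
      // ExpandCut: a low-energy hit joins the cluster but recruits nobody.
      if (energyMeV[u] < p.expandCutMeV) continue;
      for (int v : adj[u]) {
        if (label[v] == -1) {
          label[v] = nRaw;
          q.push(v);
        }
      }
    }
    ++nRaw;
  }

  // 4) Cleanup by size and energy, then relabel survivors contiguously.
  std::vector<std::size_t> count(nRaw, 0);
  std::vector<double> esum(nRaw, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    count[label[i]] += 1;
    esum[label[i]] += energyMeV[i];
  }
  std::vector<int> remap(nRaw, -1);
  int next = 0;
  for (int c = 0; c < nRaw; ++c) {
    if (count[c] >= p.minHits && esum[c] >= p.minEnergyMeV) remap[c] = next++;
  }
  for (int& l : label) l = remap[l];
  return label;
}
```

In this sketch a 5 MeV hit adjacent to a 50 MeV seed is absorbed into the seed's cluster, but a 20 MeV hit reachable only through that 5 MeV hit starts its own cluster, which is the behaviour the ExpandCut description implies.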
Build dependency: the central muse `onnxruntime` package; activate
with `muse setup -q u092` (the qualifier providing the central
onnxruntime hook, mirroring the pattern in Mu2e/ArtAnalysis#4).
Training repo: see Mu2e/MLTrain CaloClusterGNN/.
Training-data branch dependency: Mu2e/EventNtuple#366 adds
calomcsim.ancestorSimIds, used by the truth-labelling step in
training only -- this Offline-side code does not depend on the
EventNtuple PR landing.
Headline test-set numbers (276,688 events / 481,543 disk-graphs,
calo-entrant truth, E_reco >= 50 MeV downstream cut):
| Metric | BFS | CCN+BFS10 | Change |
|-----------------------|-------|-----------|--------|
| Mean abs(dE) / MeV | 0.839 | 0.616 | -27% |
| 95th-pct abs(dE) / MeV| 3.520 | 2.338 | -34% |
| Mean centroid dr / mm | 1.589 | 1.292 | -19% |
| 95th-pct dr / mm | 3.606 | 2.294 | -36% |
Two test artefacts plus their FHiCL wiring, exercising the C++
implementation against the Python pipeline that trained the model.
* CaloCluster/src/testGnnClusterAssembler_main.cc
Standalone executable (built via helper.make_bin) that loads a
JSON parity payload produced by the training repo's
scripts/dump_parity_payloads.py, replays GnnClusterAssembler on
each disk-graph, and asserts byte-identical cluster_labels
against the Python reference. Stage-2 of the parity gate.
Expected output:
graphs: N
mismatch graphs: 0
mismatch nodes: 0
[PASS] all N graphs match Python cluster_labels byte-exactly
* CaloCluster/src/CaloHitGraphParityDump_module.cc
art::EDAnalyzer that consumes a CaloClusterCollection emitted by
CaloClusterMakerGNN plus the source CaloHitCollection and writes
a flat TTree (per event-disk: crystalIDs, time, eDep, GNN
cluster labels). Used by the training-repo Python script
scripts/compare_parity_dump.py to replay the same hits through
the Python pipeline and assert byte-exact agreement on cluster
labels end-to-end. Stage-3 of the parity gate.
* CaloCluster/fcl/from_mcs-gnn-test.fcl
Drives the parity-dump analyzer over MCS art-format input. Also
serves as a minimal smoke-test for the C++ pipeline; outputs a
parity_dump.root TTree.
Both stages already pass on real data:
* Stage-2 (assembler-only on packed val graphs): 100/100 disk-graphs
byte-exact (1,147 nodes, 2,768 edges).
* Stage-3 (full mu2e art job on MCS art files, via from_mcs-gnn-test.fcl
+ the training-repo Python comparison): 100/100 disk-graphs
byte-exact (8,502 hits over 50 events).
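The bookkeeping behind the parity gate above amounts to comparing per-node labels graph by graph and counting disagreements. A minimal sketch, assuming the labels have already been read into memory: `ParityTally`, `tallyParity` and `formatReport` are hypothetical names, JSON and TTree reading is omitted, and the `[FAIL]` line is invented for the example (only the `[PASS]` format appears in this PR).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

using Labels = std::vector<int>;  // one cluster label per node of a disk-graph

// Hypothetical summary of a parity run.
struct ParityTally {
  std::size_t graphs = 0;
  std::size_t mismatchGraphs = 0;
  std::size_t mismatchNodes = 0;
};

// Compare C++ labels with the Python reference, graph by graph.
ParityTally tallyParity(const std::vector<Labels>& cppLabels,
                        const std::vector<Labels>& pyLabels) {
  assert(cppLabels.size() == pyLabels.size());  // same set of disk-graphs
  ParityTally t;
  t.graphs = cppLabels.size();
  for (std::size_t g = 0; g < cppLabels.size(); ++g) {
    const Labels& a = cppLabels[g];
    const Labels& b = pyLabels[g];
    const std::size_t common = std::min(a.size(), b.size());
    // Nodes present on only one side count as mismatches.
    std::size_t bad = std::max(a.size(), b.size()) - common;
    for (std::size_t i = 0; i < common; ++i) {
      if (a[i] != b[i]) ++bad;
    }
    if (bad > 0) {
      ++t.mismatchGraphs;
      t.mismatchNodes += bad;
    }
  }
  return t;
}

// Render the tally in the layout of the expected output quoted above.
std::string formatReport(const ParityTally& t) {
  std::ostringstream os;
  os << "graphs: " << t.graphs << "\n"
     << "mismatch graphs: " << t.mismatchGraphs << "\n"
     << "mismatch nodes: " << t.mismatchNodes << "\n";
  if (t.mismatchNodes == 0) {
    os << "[PASS] all " << t.graphs
       << " graphs match Python cluster_labels byte-exactly\n";
  } else {
    // Hypothetical failure line; not taken from the real executable.
    os << "[FAIL] " << t.mismatchGraphs
       << " graphs differ from Python cluster_labels\n";
  }
  return os.str();
}
```

Exact integer comparison is appropriate here because the gate is on cluster labels, not on floating-point tensors.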
Collaborator
☔ The build is failing at a58c90e.
N.B. These results were obtained from a build of this Pull Request at a58c90e after being merged into the base branch at d2340b7. For more information, please check the job page.
Collaborator
ORT is not available yet. Andy is testing the prototype.
Summary
Adds a Graph Neural Network calorimeter clustering algorithm that runs
alongside the existing seed+BFS `CaloClusterMaker`. Both clustering
chains run and both emit a `CaloClusterCollection`; downstream consumers
select via `(module_label, instance_name)`. The GNN output ships under
instance name `"GNN"` so existing BFS-reading analyses are untouched.

Status: draft. The build links `onnxruntime` via the `u092` muse
manifest (Sophie's hook into Andy's local install), tracking
Mu2e/ArtAnalysis#4, which is also draft pending the central muse
`onnxruntime` package. Once that lands, the rebase-and-flip is small:
* drop the `u092` qualifier in favour of the standard one,
* mirror the `SConscript` dependency name ArtAnalysis#4 settles on
  (likely unchanged at 'onnxruntime'),
* `gh pr ready`.

The C++ source itself is independent of which qualifier provides
`onnxruntime`.
Design (split-module)
Following the design meeting with Sophie Middleton + Andrew Edmonds
(2026-04-29), the algorithm is split into two `art::EDProducer`s joined
by a transient data product: `CaloHitGraphMaker` (step 1) emits the
`CaloHitGraphCollection`, and `CaloClusterMakerGNN` (steps 2 + 3)
consumes it and emits the `CaloClusterCollection` under instance name
`"GNN"`.

`CaloClusterMakerGNN` is model-agnostic: a single C++ class loaded
twice in an A/B comparison job runs SimpleEdgeNet and CaloClusterNet
side by side, with per-instance FHiCL `tauEdge` and
`expectedModelVersion`. Production declares one instance with the CCN
artifact.
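Step 1 of this split, the graph construction that `CaloHitGraphMaker` delegates to `GnnGraphBuilder`, might look roughly like the sketch below. It covers only the brute-force radius search at `r_max = 210 mm`, the `|dt| <= 25 ns` time filter and the per-source-node degree cap at `k_max = 20`; the kNN fallback, feature computation and z-score normalisation are left out. The `Hit` and `GraphParams` types and the `buildRadiusGraph` name are hypothetical, and keeping the nearest neighbours when the cap applies is an assumption about the tie-breaking rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-hit record: crystal position (mm) and hit time (ns).
struct Hit {
  double x;
  double y;
  double t;
};

// Values quoted in this PR; the struct itself is illustrative.
struct GraphParams {
  double rMaxMm = 210.0;
  double dtMaxNs = 25.0;
  std::size_t kMax = 20;
};

// Directed (source, target) edges for one disk, by brute-force pair loop.
std::vector<std::pair<int, int>> buildRadiusGraph(const std::vector<Hit>& hits,
                                                  const GraphParams& p) {
  assert(p.rMaxMm >= 0.0 && p.dtMaxNs >= 0.0);
  std::vector<std::pair<int, int>> edges;
  const int n = static_cast<int>(hits.size());
  for (int i = 0; i < n; ++i) {
    std::vector<std::pair<double, int>> cand;  // (distance, target index)
    for (int j = 0; j < n; ++j) {
      if (i == j) continue;
      const double d = std::hypot(hits[j].x - hits[i].x, hits[j].y - hits[i].y);
      if (d > p.rMaxMm) continue;                                  // radius cut
      if (std::abs(hits[j].t - hits[i].t) > p.dtMaxNs) continue;   // time filter
      cand.emplace_back(d, j);
    }
    // Degree cap: keep the kMax nearest targets (assumed rule), ties by index.
    std::sort(cand.begin(), cand.end());
    if (cand.size() > p.kMax) cand.resize(p.kMax);
    for (const auto& c : cand) edges.emplace_back(i, c.second);
  }
  return edges;
}
```

The quadratic loop is affordable at the hit multiplicities quoted in this PR (N <= ~65 hits per disk), which is presumably why no spatial index is needed on the C++ side.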
Metadata-props deployment contract
The C++ session loader asserts that the loaded `.onnx`'s
`metadata_props` map matches FHiCL expectations at job start, so a
silently-out-of-sync retraining is caught before the first event:
| metadata_props key | Value | FHiCL parameter |
|--------------------|-------|-----------------|
| `model_version` | `calo-cluster-net-v2-stage1` | `expectedModelVersion` |
| `node_features` | `log_e,t,x,y,r,e_rel` | `expectedNodeFeatures` |
| `edge_features` | `dx,dy,d,dt,dlog_e,asym_e,logsum_e,dr` | `expectedEdgeFeatures` |

Mismatches abort the job loudly.
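The contract check itself reduces to a key-by-key string comparison. The sketch below assumes the model's `metadata_props` have already been read into a `std::map`; how they are pulled out of the ONNX session is deliberately left out. `PropMap` and `checkModelContract` are hypothetical names, and `std::runtime_error` stands in for whatever exception type the module actually throws.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

using PropMap = std::map<std::string, std::string>;

// Throws if any expected key is missing from the model or has a different
// value. Extra keys in the model are ignored (an assumption of this sketch).
void checkModelContract(const PropMap& loaded, const PropMap& expected) {
  assert(!expected.empty());  // an empty expectation would make the check vacuous
  std::ostringstream problems;
  for (const auto& [key, want] : expected) {
    const auto it = loaded.find(key);
    if (it == loaded.end()) {
      problems << "missing key '" << key << "'; ";
    } else if (it->second != want) {
      problems << key << ": expected '" << want << "', model has '"
               << it->second << "'; ";
    }
  }
  const std::string msg = problems.str();
  if (!msg.empty()) {
    throw std::runtime_error("ONNX metadata_props mismatch: " + msg);
  }
}
```

Collecting every mismatch before throwing means a single failed job start reports all drifted keys at once, rather than one per attempt.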
What this adds
Two commits, reviewer-friendly split:
Add GNN calorimeter clustering (split-module design)
Production code, FHiCL wiring, and trained model artifacts.
| File | Purpose |
|------|---------|
| `RecoDataProducts/inc/CaloHitGraph.hh` | Per-disk graph data product (flat normalised tensors plus `art::Ptr<CaloHit>` back-references) |
| `CaloCluster/{inc,src}/GnnGraphBuilder.{hh,cc}` | Step-1 helper: radius graph at `r_max=210 mm` (faithful to the Python `scipy.spatial.cKDTree` behaviour at `N <= ~65 hits/disk`), time filter, kNN fallback, degree cap, z-score normalisation from JSON sidecar |
| `CaloCluster/src/CaloHitGraphMaker_module.cc` | Step-1 EDProducer |
| `CaloCluster/{inc,src}/GnnClusterAssembler.{hh,cc}` | Step-3 helper: sigmoid + symmetrise → threshold → BFS with `bfsExpandCut=10 MeV` ExpandCut → `minHits`/`minEnergyMeV` cleanup → contiguous relabel |
| `CaloCluster/src/CaloClusterMakerGNN_module.cc` | Step-2 + step-3 EDProducer: `metadata_props` assertion against FHiCL at job start; zero-copy tensor views over the graph payload at inference time; `CaloCluster` construction via the existing `ClusterUtils::cog3Vector` |
| `CaloCluster/data/calo_cluster_net_v2_stage1.{onnx,norm.json}` | Trained CCN model artifact and its z-score normalisation statistics |
| `CaloCluster/data/simple_edge_net_v2.onnx` | Trained SimpleEdgeNet model artifact |
| `CaloCluster/fcl/prolog.fcl` | New `CaloClusterGNN` block defining the two producers with frozen recipe defaults; bundled `Reco` sequence for one-line inclusion |
| `CaloCluster/fcl/from_mcs-gnn-prod.fcl` | Production-style standalone FCL that writes both BFS and GNN `CaloClusterCollection`s |
| `CaloCluster/src/SConscript` | Adds `'onnxruntime'` to plugin deps |

Add GNN clustering parity tests
| File | Purpose |
|------|---------|
| `CaloCluster/src/testGnnClusterAssembler_main.cc` | Standalone (`make_bin`) executable: loads JSON parity payload from the training repo, replays `GnnClusterAssembler`, asserts byte-identical labels against Python (Stage 2) |
| `CaloCluster/src/CaloHitGraphParityDump_module.cc` | `art::EDAnalyzer`: dumps per-event-disk `CaloHit`s + GNN cluster labels to a flat TTree for end-to-end Python comparison (Stage 3) |
| `CaloCluster/fcl/from_mcs-gnn-test.fcl` | Drives the parity-dump analyzer over MCS art-format input |

Headline result
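276,688 events / 481,543 disk-graphs on the MDC2025 mixed-pileup test
set, calo-entrant truth, `E_reco >= 50 MeV` downstream cut (clusters
that actually enter track finding):
| Metric | BFS | CCN+BFS10 | Change |
|-----------------------|-------|-----------|--------|
| Mean abs(dE) / MeV | 0.839 | 0.616 | -27% |
| 95th-pct abs(dE) / MeV| 3.520 | 2.338 | -34% |
| Mean centroid dr / mm | 1.589 | 1.292 | -19% |
| 95th-pct dr / mm | 3.606 | 2.294 | -36% |
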
Signal region (95-110 MeV, 47,279 clusters): mean abs(dE) drops
from 0.368 to 0.210 MeV (-43%), mean dr from 0.559 to 0.460 mm (-18%).
Parity validation
Both stages already pass on real data:
* Stage 2 (assembler-only on packed val graphs, 1,147 nodes,
  2,768 edges): 100/100 byte-exact.
* Stage 3 (full `mu2e` art job on real MCS input via
  `from_mcs-gnn-test.fcl` + Python comparison harness, 50 events /
  100 disk-graphs / 8,502 hits): 100/100 byte-exact.

Stage 3 implicitly covers Stage 1 (graph-maker parity): any divergence
in graph construction would propagate to mismatched cluster labels.
Reproduce locally:
Coordinated PRs
| Repository | Relationship to this PR |
|------------|-------------------------|
| Mu2e/EventNtuple | #366 adds `calomcsim.ancestorSimIds`. Used by the training-time truth-labelling step. This PR does not depend on EventNtuple#366 landing: the C++ Offline code only consumes `CaloHit` collections, not the new EventNtuple branch. |
| Mu2e/ArtAnalysis | #4 carries the `onnxruntime` build-dep pattern this PR mirrors. This PR's draft status tracks ArtAnalysis#4; both are blocked on the central muse `onnxruntime` package. |
| Mu2e/MLTrain | `CaloClusterGNN/` subdirectory containing the full training pipeline that produces the `.onnx` artifacts shipped in `CaloCluster/data/` here. |

Try it
Acknowledgement
Implementation, refactoring, and documentation drafting were assisted
by Anthropic's Claude (Claude Code). Scientific decisions, training
campaign, validation results, and the v1->v2 truth-definition design
are my own work; Claude was used as a coding assistant.
Test plan
* Rebase onto the central muse `onnxruntime` package once it lands
  (drop `u092`, mirror final `SConscript` dep name from ArtAnalysis#4).
* Run `testGnnClusterAssembler` on the parity payload; expect
  `[PASS] all 100 graphs match`.
* Run `from_mcs-gnn-test.fcl` end-to-end on a small MCS art file + the
  training-repo Python comparison; expect 100/100.
* `Production/JobConfig/...` (separate follow-up PR).