
[minor] fixes for layerwise calib + MSE #1344

Open

Fridah-nv wants to merge 3 commits into main from fridah/layerwise-mse

Conversation


@Fridah-nv (Contributor) commented Apr 24, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Checkpoint directory naming is now deterministic by hashing only the core algorithm configuration, preventing non-deterministic or extraneous fields from affecting artifact paths.
    • Already-calibrated runs now skip the redundant resume and bootstrap steps, avoiding unnecessary input capture and reprocessing.
    • Replay behavior now reliably clears cached key/value state so replays start fresh and do not carry stale outputs.

@Fridah-nv self-assigned this Apr 24, 2026
@Fridah-nv requested review from a team as code owners on April 24, 2026 23:25

coderabbitai Bot commented Apr 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 44a5bd19-348f-488f-8e54-7f1ac455f0c3

📥 Commits

Reviewing files that changed from the base of the PR and between 62409c8 and 28f7109.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/model_calib.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/quantization/model_calib.py

📝 Walkthrough

resolve_checkpoint_dir now derives the checkpoint subdirectory name from a deterministic hash of only scalar, JSON-stable values in quant_cfg["algorithm"]. layerwise_calibrate skips resume/bootstrap when start_layer == num_layers and consistently clears/overwrites KV state during replay forwarding.

Changes

  • Checkpoint hash derivation (examples/llm_ptq/example_utils.py): compute config_hash from only the scalar, JSON-stable values of quant_cfg["algorithm"] with deterministic key ordering; layerwise_checkpoint_dir and other non-deterministic or non-essential fields are excluded from the checkpoint subdirectory name.
  • Calibration control flow & KV handling (modelopt/torch/quantization/model_calib.py): when start_layer == num_layers, skip the checkpoint-resume wiring and skip bootstrapping layer_inputs (set to None). In the replay-forwarding closure, clear KV state when past_key_values is present: call reset() only if the cache exists and supports it, and always overwrite past_key_values with None to avoid stale state across replays.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 66.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check ❓ Inconclusive: the title '[minor] fixes for layerwise calib + MSE' is too vague and generic; its abbreviations don't convey the specific technical changes. Resolution: expand the title, e.g. 'Fix deterministic config hashing and KV-state handling in layerwise calibration'.

✅ Passed checks (4 passed)
  • Description Check: skipped because CodeRabbit's high-level summary is enabled.
  • Linked Issues check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns: comprehensive security scan found no unsafe torch.load, numpy.load, eval/exec, hardcoded trust_remote_code, or nosec comments in modified Python files.



github-actions Bot commented Apr 24, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1344/

Built to branch gh-pages at 2026-05-01 21:03 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

coderabbitai Bot (Contributor) left a comment


🧹 Nitpick comments (2)
modelopt/torch/quantization/model_calib.py (2)

1658-1669: cache.reset() may be redundant after forcing past_key_values=None; add justification/defensive handling.

In the replay loop you:

  • copy kwargs_input
  • call cache.reset() (if present)
  • then set kwargs_input["past_key_values"] = None

Since the cache object is not passed into the layer after setting to None, reset() is only useful if you’re relying on it for buffer cleanup/releasing resources before dropping the reference. If that’s the intent, a brief comment would help future readers.

If reset() can throw for some cache implementations, consider guarding it (minor hardening) so calibration doesn’t fail unexpectedly on a cache type that exposes reset but can’t safely run it:

Optional hardening diff
                     cache = kwargs_input["past_key_values"]
-                    if cache is not None and hasattr(cache, "reset"):
-                        cache.reset()
+                    if cache is not None and hasattr(cache, "reset"):
+                        try:
+                            cache.reset()
+                        except Exception:
+                            # Best-effort cleanup; we will force past_key_values=None anyway.
+                            pass
                     kwargs_input["past_key_values"] = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1658 - 1669, The
loop clears past_key_values by setting kwargs_input["past_key_values"]=None but
also calls cache.reset() which is either redundant or risky; either remove the
reset() call and add a short comment noting we drop the cache by setting
past_key_values to None, or harden it by wrapping cache.reset() in a try/except
(logging or ignoring exceptions) and add a comment explaining it's defensive
cleanup before dropping the reference; update the replay loop around _inputs,
kwargs_input, cache, and the call site m(*args, **kwargs_input) accordingly so
behavior is explicit and safe.

1646-1653: OK to skip bootstrapping when no layers remain, but consider an early-exit optimization.

With start_layer >= num_layers, you set layer_inputs = None and the per-layer loop becomes a no-op. That’s logically consistent.

Optional: you could short-circuit earlier (after patching/unpatching strategy review) to avoid unnecessary input_getter._patch_all_layers(...) work when there’s nothing to replay/capture. Not required for correctness, but it can reduce overhead for fully-calibrated restarts.
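
A control-flow sketch of the suggested short-circuit; FakeInputGetter and the calibrate() signature are illustrative, following the names used in this review:

# Illustrative control flow for the suggested early exit; not the modelopt code.
class FakeInputGetter:
    def _patch_all_layers(self, layers):
        print(f"patched {len(layers)} layers")

def calibrate(start_layer, num_layers, input_getter, layers):
    if start_layer >= num_layers:
        return  # fully-calibrated restart: nothing to replay or capture
    input_getter._patch_all_layers(layers)
    # ... per-layer replay/capture would run here ...

calibrate(4, 4, FakeInputGetter(), [None] * 4)  # no-op, patching skipped
calibrate(0, 4, FakeInputGetter(), [None] * 4)  # prints "patched 4 layers"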

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1646 - 1653, The
per-layer loop is a no-op when start_layer >= num_layers but you still call
input_getter._patch_all_layers(...) and do bootstrapping work; add an early-exit
to skip bootstrapping and patch/unpatch steps when start_layer >= num_layers to
avoid unnecessary overhead—check the start_layer >= num_layers condition before
calling input_getter._patch_all_layers (and before any bootstrapping/patching
logic) and return or bypass the rest of the replay/capture flow so
layer_inputs/resumed_inputs/forward_loop work is skipped for fully-calibrated
restarts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f912c3b9-748a-4790-92cf-c62b7dc38086

📥 Commits

Reviewing files that changed from the base of the PR and between 7c80d85 and b772ab9.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/model_calib.py


codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.59%. Comparing base (50706d1) to head (28f7109).

Files with missing lines:
  • modelopt/torch/quantization/model_calib.py: patch 57.14%, 3 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1344      +/-   ##
==========================================
+ Coverage   66.36%   75.59%   +9.22%     
==========================================
  Files         471      471              
  Lines       50510    50512       +2     
==========================================
+ Hits        33522    38183    +4661     
+ Misses      16988    12329    -4659     
Flags:
  • unit: 52.84% <57.14%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv force-pushed the fridah/layerwise-mse branch from b772ab9 to e0ab32e on April 27, 2026 16:34

copy-pr-bot Bot commented Apr 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 23011c25-fd14-4b57-b58c-ed28d03dbc57

📥 Commits

Reviewing files that changed from the base of the PR and between b772ab9 and e0ab32e.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/model_calib.py

Comment on lines +865 to +871 in examples/llm_ptq/example_utils.py:

    # Hash only the algorithm dict (scalar fields, fully JSON-serializable and deterministic).
    alg_for_hash = {
        k: v
        for k, v in sorted(algorithm.items())
        if k != "layerwise_checkpoint_dir" and isinstance(v, (str, int, float, bool, type(None)))
    }
    config_hash = hashlib.sha256(json.dumps(alg_for_hash, sort_keys=True).encode()).hexdigest()[:8]

⚠️ Potential issue | 🟠 Major

Avoid silent checkpoint-hash collisions from dropped algorithm fields.

The current hash input excludes all non-scalar values, so different algorithm configs can collapse to the same suffix and accidentally reuse incompatible layerwise checkpoints.
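
To make the risk concrete, here is a small self-contained repro of the filtering scheme quoted above (the layer_filter key is a hypothetical non-scalar field):

import hashlib
import json

def config_hash(algorithm):
    # Same filtering as the snippet above: keep only JSON-scalar fields.
    alg_for_hash = {
        k: v
        for k, v in sorted(algorithm.items())
        if k != "layerwise_checkpoint_dir"
        and isinstance(v, (str, int, float, bool, type(None)))
    }
    return hashlib.sha256(json.dumps(alg_for_hash, sort_keys=True).encode()).hexdigest()[:8]

# Two configs differing only in a non-scalar field collapse to the same suffix,
# because the list is silently dropped from the hash input.
a = {"method": "mse", "layer_filter": ["attn"]}
b = {"method": "mse", "layer_filter": ["attn", "mlp"]}
assert config_hash(a) == config_hash(b)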

💡 Proposed fix
-    # Hash only the algorithm dict (scalar fields, fully JSON-serializable and deterministic).
-    alg_for_hash = {
-        k: v
-        for k, v in sorted(algorithm.items())
-        if k != "layerwise_checkpoint_dir" and isinstance(v, (str, int, float, bool, type(None)))
-    }
-    config_hash = hashlib.sha256(json.dumps(alg_for_hash, sort_keys=True).encode()).hexdigest()[:8]
+    def _stable_jsonable(obj):
+        if isinstance(obj, dict):
+            return {k: _stable_jsonable(v) for k, v in sorted(obj.items())}
+        if isinstance(obj, (list, tuple)):
+            return [_stable_jsonable(v) for v in obj]
+        if isinstance(obj, (str, int, float, bool, type(None))):
+            return obj
+        raise TypeError(
+            f"Unsupported algorithm field type for checkpoint hash: {type(obj).__name__}"
+        )
+
+    alg_for_hash = _stable_jsonable(
+        {k: v for k, v in algorithm.items() if k != "layerwise_checkpoint_dir"}
+    )
+    config_hash = hashlib.sha256(
+        json.dumps(alg_for_hash, sort_keys=True, separators=(",", ":")).encode()
+    ).hexdigest()[:8]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 865 - 871, The hash currently
drops all non-scalar fields from algorithm, risking collisions; instead, build
alg_for_hash from algorithm (still skipping layerwise_checkpoint_dir) but for
non-JSON-scalars include a deterministic placeholder that captures their
identity — e.g., replace each non-scalar value with a small stable descriptor
like its type name plus a short hash of repr(value) (or a deterministic
serialization), then compute config_hash from json.dumps(alg_for_hash,
sort_keys=True). This preserves uniqueness for nested/complex fields while
keeping the hash deterministic and still excluding layerwise_checkpoint_dir.

@Fridah-nv Fridah-nv enabled auto-merge (squash) April 27, 2026 17:04
coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b57a7734-9fee-4cd2-ac2c-ff625b8bb8ed

📥 Commits

Reviewing files that changed from the base of the PR and between e0ab32e and 62409c8.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/model_calib.py

Comment on lines +1651 to +1664 in modelopt/torch/quantization/model_calib.py:

     # When all layers are already done (start_layer == num_layers), skip input setup:
     resumed_inputs = (
         ckpt.setup_resume(transformer_layers) if ckpt and 0 < start_layer < num_layers else None
     )

     try:
-        # Bootstrap: get first layer's inputs (or use resumed inputs).
-        layer_inputs = input_getter.get_first_layer_inputs(
-            start_layer, resumed_inputs, forward_loop
-        )
+        # Skip entirely when all layers are already calibrated (start_layer == num_layers).
+        if start_layer < num_layers:
+            layer_inputs = input_getter.get_first_layer_inputs(
+                start_layer, resumed_inputs, forward_loop
+            )
+        else:
+            layer_inputs = None

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Inspect detect_resume_point completion behavior:"
rg -n -C4 'def detect_resume_point|last \+ 1 >= total|return None' modelopt/torch/quantization/utils/layerwise_calib.py

echo
echo "Inspect _CheckpointState.from_folder start_layer assignment:"
rg -n -C6 'def from_folder|detect_resume_point|start =|return cls\(.*start_layer' modelopt/torch/quantization/utils/layerwise_calib.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5411


🏁 Script executed:

sed -n '1640,1680p' modelopt/torch/quantization/model_calib.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 2211


🏁 Script executed:

# Double-check: is there any other code path that could set start_layer to num_layers?
rg -n 'start_layer.*=.*num_layers|start_layer.*=.*total' modelopt/torch/quantization/

Repository: NVIDIA/Model-Optimizer

Length of output: 340


The start_layer == num_layers fast-path is unreachable due to checkpoint helper behavior.

The code at lines 1651 and 1659 is designed to skip bootstrap when all layers are already calibrated (start_layer == num_layers). However, detect_resume_point() returns None when calibration is complete, and _CheckpointState.from_folder() then sets start_layer = 0 (line 570 of layerwise_calib.py). This means resuming from a complete checkpoint will always have start_layer == 0, causing the bootstrap to execute unnecessarily.

The checkpoint helper should either set start_layer = num_layers when complete, or the code should check for completion via a dedicated flag instead of relying on start_layer == num_layers.
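
A minimal sketch of the suggested fix; the CheckpointState shape and the last_saved_layer argument are illustrative, mirroring the detect_resume_point() logic described above:

from dataclasses import dataclass

@dataclass
class CheckpointState:  # illustrative stand-in for _CheckpointState
    start_layer: int
    completed: bool = False

    @classmethod
    def from_folder(cls, last_saved_layer, num_layers):
        """last_saved_layer: index of the last checkpointed layer, or None if none exist."""
        if last_saved_layer is None:
            return cls(start_layer=0)  # fresh run
        if last_saved_layer + 1 >= num_layers:
            # Surface completion instead of resetting to 0, so callers can skip bootstrap.
            return cls(start_layer=num_layers, completed=True)
        return cls(start_layer=last_saved_layer + 1)  # resume mid-run

assert CheckpointState.from_folder(None, 32) == CheckpointState(0)
assert CheckpointState.from_folder(31, 32) == CheckpointState(32, completed=True)
assert CheckpointState.from_folder(15, 32) == CheckpointState(16)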

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1651 - 1664, The
current fast-path relying on start_layer == num_layers is unreachable because
_CheckpointState.from_folder() resets start_layer to 0 when
detect_resume_point() returns None; fix this by having the checkpoint helper
indicate completion instead of hiding it: update _CheckpointState.from_folder
(and detect_resume_point) to set start_layer = num_layers (or expose a completed
flag like ckpt.completed) when calibration is already complete, and then use
that value/flag here (the start_layer check in model_calib.py that decides
whether to call ckpt.setup_resume and input_getter.get_first_layer_inputs).
Reference symbols to change: detect_resume_point, _CheckpointState.from_folder,
_CheckpointState (add completed or set start_layer=num_layers),
ckpt.setup_resume, start_layer, and input_getter.get_first_layer_inputs.

