
[minor] fixes for layerwise calib + MSE #1344

Open

Fridah-nv wants to merge 3 commits into main from fridah/layerwise-mse

Conversation


@Fridah-nv (Contributor) commented Apr 24, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Checkpoint directory naming is now deterministic by hashing only the core algorithm configuration, preventing non-deterministic or extraneous fields from affecting artifact paths.
    • Already-calibrated runs now skip the redundant resume and bootstrap steps, avoiding unnecessary input capture and reprocessing.
    • Replay behavior now reliably clears cached key/value state so replays start fresh and do not carry stale outputs.

@Fridah-nv self-assigned this Apr 24, 2026
@Fridah-nv requested review from a team as code owners on April 24, 2026 23:25

coderabbitai Bot commented Apr 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 44a5bd19-348f-488f-8e54-7f1ac455f0c3

📥 Commits

Reviewing files that changed from the base of the PR and between 62409c8 and 28f7109.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/model_calib.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/quantization/model_calib.py

📝 Walkthrough

resolve_checkpoint_dir now derives the checkpoint subdirectory name from a deterministic hash of only scalar, JSON-stable values in quant_cfg["algorithm"]. layerwise_calibrate skips resume/bootstrap when start_layer == num_layers and consistently clears/overwrites KV state during replay forwarding.

Changes

  • Checkpoint hash derivation (examples/llm_ptq/example_utils.py): compute config_hash from only the scalar, JSON-stable values of quant_cfg["algorithm"] with deterministic key ordering; layerwise_checkpoint_dir and other non-deterministic or non-essential fields are excluded from the checkpoint subdirectory name.
  • Calibration control flow & KV handling (modelopt/torch/quantization/model_calib.py): when start_layer == num_layers, skip the checkpoint-resume wiring and skip bootstrapping layer_inputs (set to None). In the replay-forwarding closure, clear KV state when past_key_values is present: call reset() only if the cache exists and supports it, and always overwrite past_key_values with None to avoid stale state across replays.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 66.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check ❓ Inconclusive: the title '[minor] fixes for layerwise calib + MSE' is too vague and generic; its abbreviations don't convey the specific technical changes. Resolution: expand the title, e.g. 'Fix deterministic config hashing and KV-state handling in layerwise calibration'.

✅ Passed checks (4 passed)
  • Description Check: skipped because CodeRabbit's high-level summary is enabled.
  • Linked Issues check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns: comprehensive security scan found no unsafe torch.load, numpy.load, eval/exec, hardcoded trust_remote_code, or nosec comments in modified Python files.



github-actions Bot commented Apr 24, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1344/

Built to branch gh-pages at 2026-05-01 21:03 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

coderabbitai Bot (Contributor) left a comment


🧹 Nitpick comments (2)
modelopt/torch/quantization/model_calib.py (2)

1658-1669: cache.reset() may be redundant after forcing past_key_values=None; add justification/defensive handling.

In the replay loop you:

  • copy kwargs_input
  • call cache.reset() (if present)
  • then set kwargs_input["past_key_values"] = None

Since the cache object is not passed into the layer after setting to None, reset() is only useful if you’re relying on it for buffer cleanup/releasing resources before dropping the reference. If that’s the intent, a brief comment would help future readers.

If reset() can throw for some cache implementations, consider guarding it (minor hardening) so calibration doesn’t fail unexpectedly on a cache type that exposes reset but can’t safely run it:

Optional hardening diff
                     cache = kwargs_input["past_key_values"]
-                    if cache is not None and hasattr(cache, "reset"):
-                        cache.reset()
+                    if cache is not None and hasattr(cache, "reset"):
+                        try:
+                            cache.reset()
+                        except Exception:
+                            # Best-effort cleanup; we will force past_key_values=None anyway.
+                            pass
                     kwargs_input["past_key_values"] = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1658 - 1669, The
loop clears past_key_values by setting kwargs_input["past_key_values"]=None but
also calls cache.reset() which is either redundant or risky; either remove the
reset() call and add a short comment noting we drop the cache by setting
past_key_values to None, or harden it by wrapping cache.reset() in a try/except
(logging or ignoring exceptions) and add a comment explaining it's defensive
cleanup before dropping the reference; update the replay loop around _inputs,
kwargs_input, cache, and the call site m(*args, **kwargs_input) accordingly so
behavior is explicit and safe.

1646-1653: OK to skip bootstrapping when no layers remain, but consider an early-exit optimization.

With start_layer >= num_layers, you set layer_inputs = None and the per-layer loop becomes a no-op. That’s logically consistent.

Optional: you could short-circuit earlier (after patching/unpatching strategy review) to avoid unnecessary input_getter._patch_all_layers(...) work when there’s nothing to replay/capture. Not required for correctness, but it can reduce overhead for fully-calibrated restarts.
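
A control-flow sketch of the suggested short-circuit; FakeInputGetter and the calibrate() signature are illustrative, following the names used in this review:

# Illustrative control flow for the suggested early exit; not the modelopt code.
class FakeInputGetter:
    def _patch_all_layers(self, layers):
        print(f"patched {len(layers)} layers")

def calibrate(start_layer, num_layers, input_getter, layers):
    if start_layer >= num_layers:
        return  # fully-calibrated restart: nothing to replay or capture
    input_getter._patch_all_layers(layers)
    # ... per-layer replay/capture would run here ...

calibrate(4, 4, FakeInputGetter(), [None] * 4)  # no-op, patching skipped
calibrate(0, 4, FakeInputGetter(), [None] * 4)  # prints "patched 4 layers"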

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1646 - 1653, The
per-layer loop is a no-op when start_layer >= num_layers but you still call
input_getter._patch_all_layers(...) and do bootstrapping work; add an early-exit
to skip bootstrapping and patch/unpatch steps when start_layer >= num_layers to
avoid unnecessary overhead—check the start_layer >= num_layers condition before
calling input_getter._patch_all_layers (and before any bootstrapping/patching
logic) and return or bypass the rest of the replay/capture flow so
layer_inputs/resumed_inputs/forward_loop work is skipped for fully-calibrated
restarts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f912c3b9-748a-4790-92cf-c62b7dc38086

📥 Commits

Reviewing files that changed from the base of the PR and between 7c80d85 and b772ab9.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/model_calib.py


codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.59%. Comparing base (50706d1) to head (28f7109).

Files with missing lines:
  • modelopt/torch/quantization/model_calib.py: patch 57.14%, 3 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1344      +/-   ##
==========================================
+ Coverage   66.36%   75.59%   +9.22%     
==========================================
  Files         471      471              
  Lines       50510    50512       +2     
==========================================
+ Hits        33522    38183    +4661     
+ Misses      16988    12329    -4659     
Flags:
  • unit: 52.84% <57.14%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv force-pushed the fridah/layerwise-mse branch from b772ab9 to e0ab32e on April 27, 2026 16:34

copy-pr-bot Bot commented Apr 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 23011c25-fd14-4b57-b58c-ed28d03dbc57

📥 Commits

Reviewing files that changed from the base of the PR and between b772ab9 and e0ab32e.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/model_calib.py

Comment on lines +865 to +871 in examples/llm_ptq/example_utils.py:

    # Hash only the algorithm dict (scalar fields, fully JSON-serializable and deterministic).
    alg_for_hash = {
        k: v
        for k, v in sorted(algorithm.items())
        if k != "layerwise_checkpoint_dir" and isinstance(v, (str, int, float, bool, type(None)))
    }
    config_hash = hashlib.sha256(json.dumps(alg_for_hash, sort_keys=True).encode()).hexdigest()[:8]

⚠️ Potential issue | 🟠 Major

Avoid silent checkpoint-hash collisions from dropped algorithm fields.

The current hash input excludes all non-scalar values, so different algorithm configs can collapse to the same suffix and accidentally reuse incompatible layerwise checkpoints.
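
To make the risk concrete, here is a small self-contained repro of the filtering scheme quoted above (the layer_filter key is a hypothetical non-scalar field):

import hashlib
import json

def config_hash(algorithm):
    # Same filtering as the snippet above: keep only JSON-scalar fields.
    alg_for_hash = {
        k: v
        for k, v in sorted(algorithm.items())
        if k != "layerwise_checkpoint_dir"
        and isinstance(v, (str, int, float, bool, type(None)))
    }
    return hashlib.sha256(json.dumps(alg_for_hash, sort_keys=True).encode()).hexdigest()[:8]

# Two configs differing only in a non-scalar field collapse to the same suffix,
# because the list is silently dropped from the hash input.
a = {"method": "mse", "layer_filter": ["attn"]}
b = {"method": "mse", "layer_filter": ["attn", "mlp"]}
assert config_hash(a) == config_hash(b)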

💡 Proposed fix
-    # Hash only the algorithm dict (scalar fields, fully JSON-serializable and deterministic).
-    alg_for_hash = {
-        k: v
-        for k, v in sorted(algorithm.items())
-        if k != "layerwise_checkpoint_dir" and isinstance(v, (str, int, float, bool, type(None)))
-    }
-    config_hash = hashlib.sha256(json.dumps(alg_for_hash, sort_keys=True).encode()).hexdigest()[:8]
+    def _stable_jsonable(obj):
+        if isinstance(obj, dict):
+            return {k: _stable_jsonable(v) for k, v in sorted(obj.items())}
+        if isinstance(obj, (list, tuple)):
+            return [_stable_jsonable(v) for v in obj]
+        if isinstance(obj, (str, int, float, bool, type(None))):
+            return obj
+        raise TypeError(
+            f"Unsupported algorithm field type for checkpoint hash: {type(obj).__name__}"
+        )
+
+    alg_for_hash = _stable_jsonable(
+        {k: v for k, v in algorithm.items() if k != "layerwise_checkpoint_dir"}
+    )
+    config_hash = hashlib.sha256(
+        json.dumps(alg_for_hash, sort_keys=True, separators=(",", ":")).encode()
+    ).hexdigest()[:8]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 865 - 871, The hash currently
drops all non-scalar fields from algorithm, risking collisions; instead, build
alg_for_hash from algorithm (still skipping layerwise_checkpoint_dir) but for
non-JSON-scalars include a deterministic placeholder that captures their
identity — e.g., replace each non-scalar value with a small stable descriptor
like its type name plus a short hash of repr(value) (or a deterministic
serialization), then compute config_hash from json.dumps(alg_for_hash,
sort_keys=True). This preserves uniqueness for nested/complex fields while
keeping the hash deterministic and still excluding layerwise_checkpoint_dir.

@Fridah-nv Fridah-nv enabled auto-merge (squash) April 27, 2026 17:04
coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b57a7734-9fee-4cd2-ac2c-ff625b8bb8ed

📥 Commits

Reviewing files that changed from the base of the PR and between e0ab32e and 62409c8.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/model_calib.py

Comment on lines +1651 to +1664 in modelopt/torch/quantization/model_calib.py:

     # When all layers are already done (start_layer == num_layers), skip input setup:
     resumed_inputs = (
         ckpt.setup_resume(transformer_layers) if ckpt and 0 < start_layer < num_layers else None
     )

     try:
-        # Bootstrap: get first layer's inputs (or use resumed inputs).
-        layer_inputs = input_getter.get_first_layer_inputs(
-            start_layer, resumed_inputs, forward_loop
-        )
+        # Skip entirely when all layers are already calibrated (start_layer == num_layers).
+        if start_layer < num_layers:
+            layer_inputs = input_getter.get_first_layer_inputs(
+                start_layer, resumed_inputs, forward_loop
+            )
+        else:
+            layer_inputs = None

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Inspect detect_resume_point completion behavior:"
rg -n -C4 'def detect_resume_point|last \+ 1 >= total|return None' modelopt/torch/quantization/utils/layerwise_calib.py

echo
echo "Inspect _CheckpointState.from_folder start_layer assignment:"
rg -n -C6 'def from_folder|detect_resume_point|start =|return cls\(.*start_layer' modelopt/torch/quantization/utils/layerwise_calib.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5411


🏁 Script executed:

sed -n '1640,1680p' modelopt/torch/quantization/model_calib.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 2211


🏁 Script executed:

# Double-check: is there any other code path that could set start_layer to num_layers?
rg -n 'start_layer.*=.*num_layers|start_layer.*=.*total' modelopt/torch/quantization/

Repository: NVIDIA/Model-Optimizer

Length of output: 340


The start_layer == num_layers fast-path is unreachable due to checkpoint helper behavior.

The code at lines 1651 and 1659 is designed to skip bootstrap when all layers are already calibrated (start_layer == num_layers). However, detect_resume_point() returns None when calibration is complete, and _CheckpointState.from_folder() then sets start_layer = 0 (line 570 of layerwise_calib.py). This means resuming from a complete checkpoint will always have start_layer == 0, causing the bootstrap to execute unnecessarily.

The checkpoint helper should either set start_layer = num_layers when complete, or the code should check for completion via a dedicated flag instead of relying on start_layer == num_layers.
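
A minimal sketch of the suggested fix; the CheckpointState shape and the last_saved_layer argument are illustrative, mirroring the detect_resume_point() logic described above:

from dataclasses import dataclass

@dataclass
class CheckpointState:  # illustrative stand-in for _CheckpointState
    start_layer: int
    completed: bool = False

    @classmethod
    def from_folder(cls, last_saved_layer, num_layers):
        """last_saved_layer: index of the last checkpointed layer, or None if none exist."""
        if last_saved_layer is None:
            return cls(start_layer=0)  # fresh run
        if last_saved_layer + 1 >= num_layers:
            # Surface completion instead of resetting to 0, so callers can skip bootstrap.
            return cls(start_layer=num_layers, completed=True)
        return cls(start_layer=last_saved_layer + 1)  # resume mid-run

assert CheckpointState.from_folder(None, 32) == CheckpointState(0)
assert CheckpointState.from_folder(31, 32) == CheckpointState(32, completed=True)
assert CheckpointState.from_folder(15, 32) == CheckpointState(16)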

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1651 - 1664, The
current fast-path relying on start_layer == num_layers is unreachable because
_CheckpointState.from_folder() resets start_layer to 0 when
detect_resume_point() returns None; fix this by having the checkpoint helper
indicate completion instead of hiding it: update _CheckpointState.from_folder
(and detect_resume_point) to set start_layer = num_layers (or expose a completed
flag like ckpt.completed) when calibration is already complete, and then use
that value/flag here (the start_layer check in model_calib.py that decides
whether to call ckpt.setup_resume and input_getter.get_first_layer_inputs).
Reference symbols to change: detect_resume_point, _CheckpointState.from_folder,
_CheckpointState (add completed or set start_layer=num_layers),
ckpt.setup_resume, start_layer, and input_getter.get_first_layer_inputs.

