
[OMNIML-4021]: align local JSONL loading with HF datasets path + keep original behaviour#1345

Open
shengliangxu wants to merge 9 commits into main from shengliangx/local-jsonl-dataset

Conversation

@shengliangxu
Collaborator

@shengliangxu shengliangxu commented Apr 24, 2026

Summary

Local .jsonl paths fed to get_dataset_samples / get_dataset_dataloader previously went through a text-key-only reader, while HF dataset names flowed through an auto-preprocess pipeline that recognizes messages / conversations / prompt / text / input columns. This split meant a calibration dataset behaved differently depending on whether it lived on HF Hub or on disk.

This PR routes local .jsonl through HF's json builder and the same auto-preprocess pipeline so the format is detected from columns, not the source. The legacy text-field reader is preserved as a fallback for files where the HF builder fails (e.g. PyArrow schema unification across heterogeneous rows). Existing callers passing a plain {"text": ...} JSONL file keep working unchanged; chat-shaped JSONL now works without a separate code path.
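As background, the column-based auto-detection can be sketched in plain Python. This is a minimal reconstruction for illustration only — `detect_text` and `load_jsonl_samples` are hypothetical names, not modelopt's actual helpers, and the real pipeline goes through HF's `json` builder rather than hand-rolled parsing:

```python
import json

def detect_text(sample: dict) -> str:
    """Normalize one row to plain text based on which columns it carries.

    Hypothetical sketch of column detection; the real _auto_preprocess_sample
    may differ in names and details (e.g., it applies a chat template).
    """
    # Chat-shaped rows: a list of {"role": ..., "content": ...} turns.
    chat_key = next(
        (k for k in ("messages", "conversations") if sample.get(k) is not None), None
    )
    if chat_key is not None:
        return "\n".join(turn["content"] for turn in sample[chat_key])
    # Prompt-shaped rows, with optional continuation columns.
    if sample.get("prompt") is not None:
        parts = [sample["prompt"]]
        parts += [
            sample[k] for k in ("completion", "response", "output")
            if sample.get(k) is not None
        ]
        return "\n".join(parts)
    # Plain text rows.
    if sample.get("text") is not None:
        return sample["text"]
    # Input/output rows.
    if sample.get("input") is not None:
        parts = [sample["input"]]
        if sample.get("output") is not None:
            parts.append(sample["output"])
        return "\n".join(parts)
    raise ValueError(f"Unrecognized columns: {sorted(sample)}")

def load_jsonl_samples(path: str, num_samples: int) -> list[str]:
    """Read a .jsonl file and normalize each row regardless of its shape."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return [detect_text(r) for r in rows[:num_samples]]
```

The point of the sketch is that detection keys off columns, so the same function handles text-shaped and chat-shaped rows without a source-dependent branch.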

Changes

  • modelopt/torch/utils/dataset_utils.py
    • Route .jsonl paths through load_dataset(path="json", data_files=...) + _auto_preprocess_sample.
    • Wrap the HF load/preprocess block in try/except; on failure for .jsonl, fall back to get_jsonl_text_samples. If the fallback also fails, re-raise the original HF error so the diagnostic is preserved.
    • Update docstrings; clarify the splits = ["train"] invariant for HF's file-based builders.
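The fallback wiring described above follows a standard try/except pattern. A schematic version — `hf_loader` and `legacy_reader` are stand-in parameters for illustration, not modelopt's actual signatures:

```python
def get_samples_with_fallback(path: str, num_samples: int, hf_loader, legacy_reader):
    """Try the HF-style loader first; for .jsonl, fall back to the legacy
    text-field reader; if both fail, re-raise the original HF error so the
    more informative diagnostic is preserved."""
    try:
        return hf_loader(path, num_samples)
    except Exception as hf_error:
        # Only .jsonl inputs have a legacy reader to fall back to.
        if not path.endswith(".jsonl"):
            raise
        try:
            return legacy_reader(path, num_samples)
        except Exception:
            # Surface the original HF diagnostic, not the fallback's.
            raise hf_error
```

Re-raising the saved `hf_error` (rather than the fallback's exception) matches the behaviour described in the bullet above: the HF builder's error message is usually the actionable one.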

Tests

tests/unit/torch/utils/test_dataset_utils.py — three new test classes (18 cases, all passing):

  • TestLocalJsonlLoading — text / messages / conversations / prompt+completion / input+output columns; num_samples honored; tools kwarg forwarded; ValueError on unrecognized columns; legacy text-key fallback on schema-unification failure.
  • TestGetDatasetDataloaderBlending — single JSONL, list of JSONL files concatenated, mixed-format JSONL files blended, length-mismatch assertion.
  • TestHfTinyDataset — uses hf-internal-testing/dataset_with_data_files (10 rows x {train, test}) for end-to-end coverage: single split, multiple splits, default split, HF -> JSONL -> reload round-trip, two-HF-dataset blending, HF + local-JSONL mixing.
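For reference, the two JSONL shapes the tests exercise look like this (illustrative rows, not the repo's actual fixtures):

```python
import json

# Text-shaped row: handled by both the legacy reader and the new HF path.
text_row = {"text": "The quick brown fox."}

# Chat-shaped row: previously needed a separate code path; now auto-detected
# from the "messages" column like any HF chat dataset.
chat_row = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]
}

# A .jsonl file is just one serialized row per line.
jsonl_blob = "\n".join(json.dumps(r) for r in (text_row, chat_row)) + "\n"
print(jsonl_blob)
```

Both shapes can now live in the same file list passed to the dataloader, since format detection happens per row set rather than per source.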

Test plan

  • pytest tests/unit/torch/utils/test_dataset_utils.py — 28 passed
  • mypy modelopt/torch/utils/dataset_utils.py — no new errors
  • ruff check — clean

Summary by CodeRabbit

  • New Features

    • Local JSONL files are now loaded via the HuggingFace JSON pipeline so local and HF datasets share the same auto-detection and preprocessing; mixing dataset identifiers and JSONL paths is supported (local JSONL is limited to the train split or an unspecified split).
  • Bug Fixes

    • Null/None fields no longer count as present text, preventing incorrect concatenation or chat-template application.
    • Better error propagation with a narrow legacy fallback and warnings for schema failures.
  • Documentation

    • Updated to reflect JSONL/mixed-source behavior and concatenation before tokenization.
  • Tests

    • Expanded tests for schemas, sparse/null handling, concatenation, sample limits, tool forwarding, fallback paths, and live HF integration.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 24, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ead9dfad-29ee-4d35-a4ec-843c4769646e

📥 Commits

Reviewing files that changed from the base of the PR and between 409eb38 and ccbfc64.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py

📝 Walkthrough

Walkthrough

Local .jsonl inputs are now loaded via HuggingFace load_dataset(path="json", ...) to use the same column auto-detection and preprocessing as HF datasets; on HF JSON/Arrow schema failures the code warns and falls back to legacy line-by-line text reading. Tests and docstrings updated.

Changes

Cohort / File(s) — Summary

  • Dataset utils implementation (modelopt/torch/utils/dataset_utils.py): Switch the .jsonl fast path to HF load_dataset(path="json", data_files=...) with a lazy datasets import; use HF auto column detection (text, messages/conversations, prompt/completion, input/output); treat only non-None fields as present and append optional fields only when non-None; validate that .jsonl inputs use the train split; warn and fall back to the legacy line-by-line text reader on JSON/Arrow/schema-unification failures; update the get_dataset_dataloader docstring to cover mixed HF ids + local JSONL concatenation before tokenization.
  • Unit tests (tests/unit/torch/utils/test_dataset_utils.py): Add comprehensive tests exercising get_dataset_dataloader and .jsonl behavior: schema auto-detection across supported patterns, null/sparse column handling, empty-string handling, per-row tools forwarded into tokenizer.apply_chat_template, num_samples behavior, legacy fallback when the HF JSON builder fails, dataloader concatenation across single/multiple/mixed HF+local sources, split semantics, and live HF tiny-dataset integration tests.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Detector as get_dataset_samples
    participant HF as HuggingFace load_dataset
    participant Preprocess as AutoPreprocess
    participant Fallback as LegacyReader
    participant Return

    Client->>Detector: call with dataset ids and/or `.jsonl` paths
    Detector->>Detector: classify inputs (HF id vs local `.jsonl`)
    alt input includes `.jsonl`
        Detector->>HF: load_dataset(path="json", data_files=...)
        alt HF load + preprocessing succeeds
            HF->>Preprocess: stream rows / auto-detect columns
            Preprocess->>Return: emit normalized samples
        else HF load or schema unification fails
            HF--xDetector: raises JSON/Arrow/schema error
            Detector->>Fallback: read file line-by-line, extract `text`
            Fallback->>Return: emit fallback samples (with warning)
        end
    end
    alt input includes HF ids
        Detector->>HF: load_dataset(id, split=...)
        HF->>Preprocess: return preprocessed samples
        Preprocess->>Return: emit normalized samples
    end
    Return->>Client: combined concatenated samples list

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 61.29%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

  • Description Check ✅ — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title check ✅ — The title directly summarizes the main objective: aligning local JSONL loading with HF datasets while preserving original behavior, which matches the core changes described in the PR.
  • Linked Issues check ✅ — Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ — Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns ✅ — PR passes security review: no torch.load weights_only=False, numpy.load allow_pickle=True, hardcoded trust_remote_code=True, eval/exec on user input, nosec comments, or unsafe new dependencies detected.



Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1345/

Built to branch gh-pages at 2026-04-29 22:05 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@shengliangxu shengliangxu changed the title from "feat: align local JSONL loading with HF datasets path + keep original behaviour" to "[OMNIML-4021]: align local JSONL loading with HF datasets path + keep original behaviour" on Apr 24, 2026
@shengliangxu shengliangxu marked this pull request as ready for review April 24, 2026 23:40
@shengliangxu shengliangxu requested a review from a team as a code owner April 24, 2026 23:40
@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 76.59574% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.32%. Comparing base (077e29a) to head (ccbfc64).
⚠️ Report is 2 commits behind head on main.

Files with missing lines:
  • modelopt/torch/utils/dataset_utils.py — patch 76.59%, 11 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1345      +/-   ##
==========================================
- Coverage   76.48%   72.32%   -4.17%     
==========================================
  Files         471      471              
  Lines       50487    50511      +24     
==========================================
- Hits        38617    36531    -2086     
- Misses      11870    13980    +2110     
Flag Coverage Δ
examples 38.99% <53.19%> (+8.24%) ⬆️
gpu 51.16% <53.19%> (-9.00%) ⬇️
regression 14.90% <0.00%> (+0.19%) ⬆️
unit 52.85% <76.59%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)

445-506: Consider adding a network-skip marker for CI reliability.

These tests download from hf-internal-testing/* datasets, which are stable but still involve network I/O. If CI environments have unreliable network access, these tests may flake.

Consider marking them with a custom pytest marker (e.g., @pytest.mark.network) so they can be selectively skipped in constrained environments, while still running in standard CI.

Example marker usage:

@pytest.mark.network
class TestHfTinyDataset:
    """End-to-end coverage with a real (tiny) HF dataset."""
    ...

Then configure pytest.ini or pyproject.toml:

markers =
    network: tests that require network access to HuggingFace Hub
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `tests/unit/torch/utils/test_dataset_utils.py` around lines 445-506, add a
network-skippable marker to the TestHfTinyDataset test class so CI can skip HF
Hub network tests when desired: annotate the TestHfTinyDataset class with a
pytest marker such as `@pytest.mark.network` (referencing the TestHfTinyDataset
class and tests like test_load_single_split_directly,
test_dataloader_blending_two_hf_datasets, etc.), and add the corresponding
marker declaration ("network: tests that require network access to HuggingFace
Hub") to pytest.ini or pyproject.toml so pytest recognizes it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e2e1cb6a-ec09-41ed-ab40-d1b508f61467

📥 Commits

Reviewing files that changed from the base of the PR and between 7c80d85 and 8f2f0ae.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)

511-589: Consider gating live Hub tests as integration to reduce unit-test flakiness

These cases are valuable, but remote dataset availability/network instability can make tests/unit non-deterministic. Consider marking them (e.g., @pytest.mark.integration) or guarding behind an env flag in CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `tests/unit/torch/utils/test_dataset_utils.py` around lines 511-589, the
live HF dataset tests in TestHfTinyDataset (e.g.,
test_load_single_split_directly, test_load_multiple_splits_directly,
test_default_split_is_train, test_download_to_jsonl_then_load,
test_dataloader_blending_two_hf_datasets,
test_dataloader_mixing_hf_and_local_jsonl) are network-dependent and should be
gated as integration tests: either add a pytest marker (e.g., annotate the
TestHfTinyDataset class with `@pytest.mark.integration` or decorate each test) and
register that marker in pytest.ini, or guard execution with an environment flag
check (e.g., skip unless os.environ.get("RUN_INTEGRATION_TESTS") is set) so unit
test runs remain deterministic in CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `modelopt/torch/utils/dataset_utils.py`:
- Around line 337-356: The current except block around HF JSON parsing (the
except Exception as e handling is_jsonl) falls back to get_jsonl_text_samples
too broadly and can hide caller errors like invalid splits; update the except
block to only attempt the legacy text-field fallback when the original HF error
is clearly a schema/column-inference or parsing error, and re-raise immediately
for errors that indicate caller/config issues (e.g., the exception message or
type references a missing/unsupported split, "not found", or similar);
implement this by inspecting e (str(e) and/or its type) before calling
get_jsonl_text_samples(local_dataset_path, num_samples, key="text") so invalid
split/config errors are not silently recovered from.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 24c5dbc5-fa7a-4e97-9451-2c88a59bb346

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2f0ae and b874912.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py

Comment thread modelopt/torch/utils/dataset_utils.py Outdated
Comment thread modelopt/torch/utils/dataset_utils.py Outdated
Comment on lines +351 to +354
warn(
    f"Failed to load {local_dataset_path} via the HF 'json' builder "
    f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
)
Contributor


⚠️ Potential issue | 🟠 Major

Sanitize fallback warning content to avoid leaking local path/error details

Lines 352-353 log the full file path plus raw exception text. For untrusted dataset inputs, this can expose sensitive filesystem information or data-derived error payloads in logs.

💡 Proposed fix
-            warn(
-                f"Failed to load {local_dataset_path} via the HF 'json' builder "
-                f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-            )
+            safe_name = os.path.basename(local_dataset_path)
+            warn(
+                f"Failed to load JSONL file '{safe_name}' via HF 'json' builder "
+                f"({type(e).__name__}); fell back to legacy text-field reader."
+            )

As per coding guidelines, SECURITY.md says: “Be mindful of logging: avoid emitting sensitive paths/tokens/content in warnings/errors.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `modelopt/torch/utils/dataset_utils.py` around lines 351-354, the current
warn call logs the full local_dataset_path and raw exception text (e), which can
leak sensitive filesystem or error payload details; update the warning in the
dataset load fallback so it does not include raw paths or exception messages—use
a sanitized identifier (e.g., Path(local_dataset_path).name or other
non-sensitive dataset id) and only include the exception type (type(e).__name__)
or a generic phrase like "loader error" instead of including {e} or the full
path; modify the warn(...) invocation accordingly (referencing warn,
local_dataset_path, and e) so logs avoid exposing sensitive content.

Comment on lines +395 to +408
def test_legacy_text_fallback_on_hf_builder_failure(self, tmp_path):
    """If the HF json builder raises, fall back to the legacy text-field reader."""
    pytest.importorskip("datasets")
    # Mixed-type ``meta`` field across rows — int vs string — trips PyArrow
    # schema unification in the HF json builder. The rows still carry a
    # ``text`` field, so the legacy reader can recover the samples.
    rows = [
        {"text": "row a", "meta": 1},
        {"text": "row b", "meta": "two"},
        {"text": "row c", "meta": 3},
    ]
    path = _write_jsonl(tmp_path / "mixed.jsonl", rows)
    samples = get_dataset_samples(path, num_samples=3)
    assert samples == ["row a", "row b", "row c"]
Contributor


⚠️ Potential issue | 🟠 Major

This fallback test can pass even if fallback is never executed

Lines 407-408 only validate output values; if HF JSON loading succeeds, the assertion still passes. This does not prove the fallback branch ran.

💡 Proposed fix
-    def test_legacy_text_fallback_on_hf_builder_failure(self, tmp_path):
+    def test_legacy_text_fallback_on_hf_builder_failure(self, tmp_path, monkeypatch):
@@
-        samples = get_dataset_samples(path, num_samples=3)
+        import datasets
+
+        monkeypatch.setattr(datasets, "load_dataset", Mock(side_effect=RuntimeError("boom")))
+        with pytest.warns(UserWarning, match="fell back to legacy text-field reader"):
+            samples = get_dataset_samples(path, num_samples=3)
         assert samples == ["row a", "row b", "row c"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `tests/unit/torch/utils/test_dataset_utils.py` around lines 395-408, the
test currently only checks output values so it can pass without exercising the
fallback branch; modify test_legacy_text_fallback_on_hf_builder_failure to force
the HF json builder to raise and/or assert the fallback path was executed: patch
or monkeypatch the HF loading function used inside get_dataset_samples (e.g.,
the datasets.load_dataset / DatasetBuilder call) to raise an exception, then
call get_dataset_samples(path, num_samples=3) and verify both the returned
samples and that the legacy reader function inside get_dataset_samples (the code
path handling text-field fallback) was invoked (for example by spying/mocking
that fallback helper or checking a logged/returned indicator). Ensure references
to get_dataset_samples, test_legacy_text_fallback_on_hf_builder_failure, and
_write_jsonl are used to locate the code to change.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)

391-394: ⚠️ Potential issue | 🟠 Major

Sanitize fallback warning to avoid leaking local path/error payloads.

This warning still includes local_dataset_path and raw exception text e, which can leak sensitive filesystem details or data-derived parser payloads for untrusted JSONL inputs.

Proposed fix
-        warn(
-            f"Failed to load {local_dataset_path} via the HF 'json' builder "
-            f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-        )
+        safe_name = Path(local_dataset_path).name
+        warn(
+            f"Failed to load JSONL file '{safe_name}' via HF 'json' builder "
+            f"({type(e).__name__}); fell back to legacy text-field reader."
+        )

As per coding guidelines, SECURITY.md requires avoiding sensitive paths/content in warnings/errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `modelopt/torch/utils/dataset_utils.py` around lines 391-394, the warning
leaks sensitive filesystem paths and raw exception text; update the warn call in
the dataset loader (the block referencing local_dataset_path and exception e) to
remove local_dataset_path and the full exception payload and instead emit a
sanitized message (e.g., "Failed to load dataset via HF 'json' builder; falling
back to legacy reader") and, if needed, include only the exception class name
(type(e).__name__) or a short safe error code; if you must preserve full details
for debugging, send them to a debug-level logger instead of the user-facing
warn.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 130f9e2e-967d-4ec7-93fc-d93a676bc488

📥 Commits

Reviewing files that changed from the base of the PR and between b874912 and 9a94525.

📒 Files selected for processing (1)
  • modelopt/torch/utils/dataset_utils.py

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/utils/dataset_utils.py (1)

181-212: ⚠️ Potential issue | 🟠 Major

Don't conflate None with other falsy column values.

These sample.get(...) checks also skip valid empty-string fields, so rows like {"prompt": "", "completion": "A"} or {"input": "", "output": "A"} now miss auto-detection and fail even though only None needs special handling here.

Proposed fix
-    chat_key = next((k for k in ("messages", "conversations") if sample.get(k)), None)
+    def _has_non_null_value(key: str) -> bool:
+        return key in sample and sample[key] is not None
+
+    chat_key = next((k for k in ("messages", "conversations") if _has_non_null_value(k)), None)
@@
-    if sample.get("prompt"):
+    if _has_non_null_value("prompt"):
         parts = [sample["prompt"]]
-        parts.extend(sample[k] for k in ("completion", "response", "output") if sample.get(k))
+        parts.extend(
+            sample[k]
+            for k in ("completion", "response", "output")
+            if _has_non_null_value(k)
+        )
         return "\n".join(parts)
@@
-    if sample.get("text"):
+    if _has_non_null_value("text"):
         return sample["text"]
@@
-    if sample.get("input"):
+    if _has_non_null_value("input"):
         parts = [sample["input"]]
-        if sample.get("output"):
+        if _has_non_null_value("output"):
             parts.append(sample["output"])
         return "\n".join(parts)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `modelopt/torch/utils/dataset_utils.py` around lines 181-212, the checks
using truthy sample.get(...) incorrectly treat valid empty-string fields as
missing; change them to explicit None checks so empty strings are accepted:
replace the chat_key detection with next((k for k in
("messages","conversations") if sample.get(k) is not None), None), change tools
handling to if tools is not None, and update the subsequent conditionals (if
sample.get("prompt"), if sample.get("text"), if sample.get("input"), and any
sample.get("completion"/"response"/"output" checks) to use "is not None" so
tokenizer.apply_chat_template, the prompt/completion join logic, and
input/output handling treat "" as valid while still skipping None.
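The distinction this comment draws comes down to Python truthiness; a standalone snippet (independent of the modelopt code) makes it concrete:

```python
sample = {"prompt": "", "completion": "A", "meta": None}

# Truthy check: empty string and None are both skipped, so a valid row
# with an empty prompt would be rejected alongside genuinely null fields.
assert not sample.get("prompt")   # "" is falsy
assert not sample.get("meta")     # None is falsy too

# Explicit None check: "" is kept as a valid (empty) field; only None
# and absent keys are skipped, since dict.get defaults to None.
assert sample.get("prompt") is not None
assert sample.get("meta") is None
assert sample.get("missing") is None
```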
♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)

397-400: ⚠️ Potential issue | 🟠 Major

Sanitize the JSONL fallback warning.

This still logs the full local path and raw exception text for untrusted dataset inputs. Please keep the warning to a safe identifier plus the exception type.

Proposed fix
-        warn(
-            f"Failed to load {local_dataset_path} via the HF 'json' builder "
-            f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-        )
+        warn(
+            f"Failed to load JSONL file '{Path(local_dataset_path).name}' via the HF "
+            f"'json' builder ({type(e).__name__}); fell back to legacy text-field reader."
+        )

As per coding guidelines, SECURITY.md says: “Be careful not to log sensitive data (paths, tokens, proprietary content).”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `modelopt/torch/utils/dataset_utils.py` around lines 397-400, the warning
currently logs the full local path and raw exception text; change the warn call
in dataset_utils.py to avoid sensitive data by removing the full path and
exception message and instead log a safe identifier plus the exception type. Use
a safe identifier such as os.path.basename(local_dataset_path) or an existing
dataset identifier variable (e.g., dataset_name) in place of local_dataset_path,
and keep only type(e).__name__ (remove {e}) in the message for the warn(...)
call to produce: "Failed to load dataset '<safe_identifier>' via the HF 'json'
builder (<ExceptionType>); fell back to legacy text-field reader."
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `modelopt/torch/utils/dataset_utils.py`:
- Around line 343-373: After normalizing splits via _normalize_splits (or
defaulting to ["train"]), validate that the resulting splits list is not empty
and raise a ValueError if it is; place this check before calling load_dataset
and before computing num_per_split to avoid a ZeroDivisionError. Update the
logic around the splits variable (used when calling load_dataset(..., split=s)
and when computing num_per_split) to fail fast with a clear message like "empty
splits list after normalization" rather than allowing a division by zero.

In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 531-609: The TestHfTinyDataset tests use real HF data (_HF_TINY,
_hf_dump_to_jsonl) and call get_dataset_samples and get_dataset_dataloader which
invoke datasets.load_dataset; move these tests out of unit tests or mark them as
integration tests (e.g., add `@pytest.mark.integration` to TestHfTinyDataset or
each test) OR replace live calls by monkeypatching datasets.load_dataset (or
patch get_dataset_samples/get_dataset_dataloader to use local fixtures) so the
unit suite is hermetic and no network download occurs.


---

Outside diff comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 181-212: The checks using truthy sample.get(...) incorrectly treat
valid empty-string fields as missing; change them to explicit None checks so
empty strings are accepted: replace the chat_key detection with next((k for k in
("messages","conversations") if sample.get(k) is not None), None), change tools
handling to if tools is not None, and update the subsequent conditionals (if
sample.get("prompt"), if sample.get("text"), if sample.get("input"), and any
sample.get("completion"/"response"/"output" checks) to use "is not None" so
tokenizer.apply_chat_template, the prompt/completion join logic, and
input/output handling treat "" as valid while still skipping None.

---

Duplicate comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 397-400: The warning currently logs the full local path and raw
exception text; change the warn call in dataset_utils.py to avoid sensitive data
by removing the full path and exception message and instead log a safe
identifier plus the exception type. Use a safe identifier such as
os.path.basename(local_dataset_path) or an existing dataset identifier variable
(e.g., dataset_name) in place of local_dataset_path, and keep only
type(e).__name__ (remove {e}) in the message for the warn(...) call to produce:
"Failed to load dataset '<safe_identifier>' via the HF 'json' builder
(<ExceptionType>); fell back to legacy text-field reader."

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2976e442-d11c-40e3-ac14-90f57ac4afab

📥 Commits

Reviewing files that changed from the base of the PR and between 9a94525 and 409eb38.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py

Comment on lines 343 to +373
splits = _normalize_splits(split) if split is not None else ["train"]

def _preprocess(sample: dict) -> str:
    return _auto_preprocess_sample(sample, dataset_name, tokenizer)

# Narrow the legacy fallback to JSON-parsing / Arrow schema failures. Any
# other error (split-not-found, IO, OOM, ...) should surface to the caller
# rather than be hidden by the text-field reader. Imported lazily because
# the exact module paths vary across versions; an empty tuple is a valid
# ``except`` target that catches nothing if neither is importable.
fallback_types: tuple[type[BaseException], ...] = ()
try:
    from datasets.exceptions import DatasetGenerationError

    fallback_types += (DatasetGenerationError,)
except ImportError:
    pass
try:
    from pyarrow.lib import ArrowInvalid

    fallback_types += (ArrowInvalid,)
except ImportError:
    pass

# load_dataset does not support a list of splits while streaming, so load each separately.
print(f"Loading dataset with {config=} and {splits=}")
try:
    dataset_splits = [load_dataset(streaming=True, **config, split=s) for s in splits]

    num_per_split = [num_samples // len(dataset_splits)] * len(dataset_splits)
    num_per_split[-1] += num_samples - sum(num_per_split)

    samples: list[str] = []
    for dataset, n in zip(dataset_splits, num_per_split):
        for i, sample in enumerate(dataset):
            if i >= n:
                break
            text = _preprocess(sample)
            if text:
                samples.append(text)
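The lazy `fallback_types` tuple drives a try-primary/try-fallback/re-raise-original pattern. A generic sketch of that control flow (the helper name `load_with_fallback` and its parameters are invented here; the real code inlines this around `load_dataset` and `get_jsonl_text_samples`):

```python
def load_with_fallback(primary, fallback, fallback_types=()):
    """Run ``primary``; on one of ``fallback_types`` try ``fallback`` and, if
    that also fails, re-raise the ORIGINAL error so the diagnostic survives.
    An empty tuple is a valid ``except`` target that catches nothing, so the
    pattern degrades gracefully when neither exception type is importable."""
    try:
        return primary()
    except fallback_types as primary_err:
        try:
            return fallback()
        except Exception:
            raise primary_err
```

The key property is the last `raise primary_err`: a failure in the legacy reader never masks the HF builder's original error message.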

⚠️ Potential issue | 🟡 Minor

Reject an empty split list before computing per-split quotas.

split=[] currently reaches num_samples // len(dataset_splits) and crashes with ZeroDivisionError. Please fail fast with a ValueError once splits has been normalized.

Proposed fix
         splits = _normalize_splits(split) if split is not None else ["train"]
+        if not splits:
+            raise ValueError("``split`` must contain at least one split name.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 343 - 373, After
normalizing splits via _normalize_splits (or defaulting to ["train"]), validate
that the resulting splits list is not empty and raise a ValueError if it is;
place this check before calling load_dataset and before computing num_per_split
to avoid a ZeroDivisionError. Update the logic around the splits variable (used
when calling load_dataset(..., split=s) and when computing num_per_split) to
fail fast with a clear message like "empty splits list after normalization"
rather than allowing a division by zero.
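The quota logic and the proposed fail-fast check can be sketched together. The helper name `plan_split_quotas` is invented for illustration; the arithmetic mirrors the `num_per_split` lines in the diff above.

```python
def plan_split_quotas(num_samples: int, splits: list[str]) -> list[int]:
    """Divide ``num_samples`` across ``splits``, giving the remainder to the
    last split, and fail fast on an empty list instead of hitting a
    ZeroDivisionError (hypothetical helper mirroring ``num_per_split``)."""
    if not splits:
        raise ValueError("``split`` must contain at least one split name.")
    quotas = [num_samples // len(splits)] * len(splits)
    quotas[-1] += num_samples - sum(quotas)
    return quotas
```

For example, 6 samples over `["train", "test"]` yields 3 each, while 7 samples over three splits yields `[2, 2, 3]`, with the remainder absorbed by the last split.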

Comment on lines +531 to +609
# ---------------------------------------------------------------------------
# Live HF dataset round-trips. ``hf-internal-testing/dataset_with_data_files``
# is a 10-row x {train,test} fixture maintained by HF for their own CI — tiny
# enough to download in a unit test and stable across releases.
# ---------------------------------------------------------------------------

_HF_TINY = "hf-internal-testing/dataset_with_data_files" # train, test splits, ``text`` col


def _hf_dump_to_jsonl(name: str, split: str, path) -> str:
    from datasets import load_dataset

    ds = load_dataset(name, split=split)
    ds.to_json(str(path), lines=True)
    return str(path)


class TestHfTinyDataset:
    """End-to-end coverage with a real (tiny) HF dataset."""

    def test_load_single_split_directly(self):
        pytest.importorskip("datasets")
        samples = get_dataset_samples(_HF_TINY, num_samples=4, split="train")
        assert len(samples) == 4
        assert all(isinstance(s, str) and s for s in samples)

    def test_load_multiple_splits_directly(self):
        """``split=["train", "test"]`` divides ``num_samples`` across both splits."""
        pytest.importorskip("datasets")
        samples = get_dataset_samples(_HF_TINY, num_samples=6, split=["train", "test"])
        assert len(samples) == 6
        # Default per-split is num_samples // n + remainder; for 6/2 → 3 from each.
        # We can't assert exact origin without re-reading, but both splits should
        # contribute, which we'll confirm by comparing against direct loads below.
        train_only = set(get_dataset_samples(_HF_TINY, num_samples=10, split="train"))
        test_only = set(get_dataset_samples(_HF_TINY, num_samples=10, split="test"))
        assert any(s in train_only for s in samples)
        assert any(s in test_only for s in samples)

    def test_default_split_is_train(self):
        pytest.importorskip("datasets")
        default_samples = get_dataset_samples(_HF_TINY, num_samples=4)
        train_samples = get_dataset_samples(_HF_TINY, num_samples=4, split="train")
        assert default_samples == train_samples

    def test_download_to_jsonl_then_load(self, tmp_path):
        """Dump the HF dataset to JSONL, then reload it via the local-jsonl path."""
        pytest.importorskip("datasets")
        jsonl_path = _hf_dump_to_jsonl(_HF_TINY, "train", tmp_path / "train.jsonl")
        from_jsonl = get_dataset_samples(jsonl_path, num_samples=10)
        from_hf = get_dataset_samples(_HF_TINY, num_samples=10, split="train")
        assert from_jsonl == from_hf

    def test_dataloader_blending_two_hf_datasets(self, pad_tokenizer):
        """Two HF datasets concatenated via ``get_dataset_dataloader``."""
        pytest.importorskip("datasets")
        loader = get_dataset_dataloader(
            dataset_name=[_HF_TINY, "hf-internal-testing/multi_dir_dataset"],
            tokenizer=pad_tokenizer,
            batch_size=4,
            num_samples=[3, 1],
            max_sample_length=16,
        )
        batches = list(loader)
        assert sum(b["input_ids"].shape[0] for b in batches) == 4

    def test_dataloader_mixing_hf_and_local_jsonl(self, tmp_path, pad_tokenizer):
        """Live HF dataset blended with a local synthetic JSONL file."""
        pytest.importorskip("datasets")
        local = _write_jsonl(tmp_path / "local.jsonl", [{"text": f"local {i}"} for i in range(2)])
        loader = get_dataset_dataloader(
            dataset_name=[_HF_TINY, local],
            tokenizer=pad_tokenizer,
            batch_size=5,
            num_samples=[3, 2],
            max_sample_length=16,
        )
        batches = list(loader)
        assert sum(b["input_ids"].shape[0] for b in batches) == 5

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the implementation of get_dataset_samples and get_dataset_dataloader
find . -type f -name "*.py" -path "*/modelopt/*" | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 814


🏁 Script executed:

# Search for the get_dataset_samples function definition
rg "def get_dataset_samples" --type py -B 2 -A 10

Repository: NVIDIA/Model-Optimizer

Length of output: 954


🏁 Script executed:

# Search for get_dataset_dataloader definition
rg "def get_dataset_dataloader" --type py -B 2 -A 10

Repository: NVIDIA/Model-Optimizer

Length of output: 1009


🏁 Script executed:

# Check if there's a conftest.py in the tests/unit directory
find tests/unit -name "conftest.py" -type f

Repository: NVIDIA/Model-Optimizer

Length of output: 176


🏁 Script executed:

# Check the test file itself for any pytest markers or fixtures
rg "pytest.mark|@pytest|conftest|mock|patch" tests/unit/torch/utils/test_dataset_utils.py | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1039


🏁 Script executed:

# Get the line count first to see file size
wc -l modelopt/torch/utils/dataset_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 108


🏁 Script executed:

# Check the implementation of get_dataset_samples to see how it calls load_dataset
sed -n '/def get_dataset_samples/,/^def /p' modelopt/torch/utils/dataset_utils.py | head -80

Repository: NVIDIA/Model-Optimizer

Length of output: 4405


🏁 Script executed:

# Check conftest.py for any fixtures that might mock datasets
cat tests/unit/torch/conftest.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2146


🏁 Script executed:

# Check the test file for any imports and setup at the top
head -100 tests/unit/torch/utils/test_dataset_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3254


🏁 Script executed:

# Look for pytest markers on the TestHfTinyDataset class
rg "TestHfTinyDataset" tests/unit/torch/utils/test_dataset_utils.py -B 5 -A 30

Repository: NVIDIA/Model-Optimizer

Length of output: 1810


🏁 Script executed:

# Check for pytest markers around TestHfTinyDataset more carefully
rg "class TestHfTinyDataset" tests/unit/torch/utils/test_dataset_utils.py -B 10 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 355


🏁 Script executed:

# Look for pytest configuration
find . -name "pytest.ini" -o -name "pyproject.toml" -o -name "setup.cfg" | grep -E "(pytest|pyproject|setup)" | head -5

Repository: NVIDIA/Model-Optimizer

Length of output: 149


🏁 Script executed:

# Check pyproject.toml for test configuration
rg "pytest|testpaths|markers" pyproject.toml -i -A 3 -B 1 | head -60

Repository: NVIDIA/Model-Optimizer

Length of output: 996


🏁 Script executed:

# Check if there are any skip markers in the entire test_dataset_utils.py file
rg "pytest.mark|skip|xfail" tests/unit/torch/utils/test_dataset_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1136


TestHfTinyDataset downloads from live Hugging Face infrastructure; move to integration tests or mock the dataset loading.

The tests in lines 531-609 call get_dataset_samples() and get_dataset_dataloader() with real HF dataset identifiers (_HF_TINY = "hf-internal-testing/dataset_with_data_files"), which triggers live downloads via load_dataset() with no mocking in place. These tests lack integration markers and are placed in tests/unit/, creating flakiness risk when CI is offline, rate-limited, or upstream datasets change. Either relocate these to an integration suite (with @pytest.mark.integration) or replace load_dataset() calls with locally-cached fixtures or monkeypatched mock behavior to keep the unit suite hermetic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 531 - 609, The
TestHfTinyDataset tests use real HF data (_HF_TINY, _hf_dump_to_jsonl) and call
get_dataset_samples and get_dataset_dataloader which invoke
datasets.load_dataset; move these tests out of unit tests or mark them as
integration tests (e.g., add `@pytest.mark.integration` to TestHfTinyDataset or
each test) OR replace live calls by monkeypatching datasets.load_dataset (or
patch get_dataset_samples/get_dataset_dataloader to use local fixtures) so the
unit suite is hermetic and no network download occurs.
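The monkeypatching option can be sketched as below. This is a hermetic-test illustration, not the repo's actual test code: `fake_load_dataset` and its signature are invented stand-ins, and the patch target string assumes `load_dataset` is imported into `dataset_utils`'s namespace (patch the name where it is looked up at call time, per `unittest.mock` conventions).

```python
from unittest import mock


def fake_load_dataset(path=None, data_files=None, split=None, streaming=False, **kwargs):
    """Offline stand-in for ``datasets.load_dataset``: yields rows shaped
    like the tiny HF fixture (a ``text`` column) so no network access or
    download occurs in the unit suite."""
    split = split or "train"
    return iter([{"text": f"{split} sample {i}"} for i in range(10)])


# In a test, one would patch the lookup site, e.g. (target path is an assumption):
# with mock.patch("modelopt.torch.utils.dataset_utils.load_dataset", fake_load_dataset):
#     samples = get_dataset_samples(_HF_TINY, num_samples=4, split="train")
```

The alternative is simpler still: decorate the class with a registered marker such as `@pytest.mark.integration` and exclude that marker from the default unit run.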
