[OMNIML-4021]: align local JSONL loading with HF datasets path + keep original behaviour #1345

shengliangxu wants to merge 9 commits into main from
Conversation
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
> **Note: Reviews paused.** It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info — ⚙️ Run configuration: Path: .coderabbit.yaml · Review profile: CHILL · Plan: Enterprise
📒 Files selected for processing (2)
📝 Walkthrough

Local changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Detector as get_dataset_samples
    participant HF as HuggingFace load_dataset
    participant Preprocess as AutoPreprocess
    participant Fallback as LegacyReader
    participant Return

    Client->>Detector: call with dataset ids and/or `.jsonl` paths
    Detector->>Detector: classify inputs (HF id vs local `.jsonl`)
    alt input includes `.jsonl`
        Detector->>HF: load_dataset(path="json", data_files=...)
        alt HF load + preprocessing succeeds
            HF->>Preprocess: stream rows / auto-detect columns
            Preprocess->>Return: emit normalized samples
        else HF load or schema unification fails
            HF--xDetector: raises JSON/Arrow/schema error
            Detector->>Fallback: read file line-by-line, extract `text`
            Fallback->>Return: emit fallback samples (with warning)
        end
    end
    alt input includes HF ids
        Detector->>HF: load_dataset(id, split=...)
        HF->>Preprocess: return preprocessed samples
        Preprocess->>Return: emit normalized samples
    end
    Return->>Client: combined concatenated samples list
```
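In code terms, the routing above boils down to something like the following minimal sketch. This is illustrative only: names such as `legacy_text_reader` and `get_samples_sketch` are made up here, and the real implementation auto-detects chat/prompt columns rather than reading only `text`.

```python
import json
import warnings

from datasets import load_dataset


def legacy_text_reader(path: str, n: int) -> list[str]:
    # Line-by-line fallback that keeps only the ``text`` field,
    # in the spirit of the PR's get_jsonl_text_samples.
    with open(path) as f:
        return [json.loads(line)["text"] for _, line in zip(range(n), f)]


def get_samples_sketch(names: list[str], num_samples: int) -> list[str]:
    samples: list[str] = []
    for name in names:
        is_jsonl = name.endswith(".jsonl")
        try:
            # Local JSONL and Hub ids now share the same builder-based path.
            ds = (
                load_dataset("json", data_files=name, split="train", streaming=True)
                if is_jsonl
                else load_dataset(name, split="train", streaming=True)
            )
            samples += [row.get("text", "") for _, row in zip(range(num_samples), ds)]
        except Exception as exc:
            if not is_jsonl:
                raise  # only local JSONL gets the legacy fallback
            warnings.warn(
                f"HF 'json' builder failed ({type(exc).__name__}); using legacy reader"
            )
            samples += legacy_text_reader(name, num_samples)
    return samples
```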
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 5 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1345      +/-   ##
==========================================
- Coverage   76.48%   72.32%   -4.17%
==========================================
  Files         471      471
  Lines       50487    50511      +24
==========================================
- Hits        38617    36531    -2086
- Misses      11870    13980    +2110
```

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)
**445-506**: Consider adding a network-skip marker for CI reliability.

These tests download from `hf-internal-testing/*` datasets, which are stable but still involve network I/O. If CI environments have unreliable network access, these tests may flake. Consider marking them with a custom pytest marker (e.g., `@pytest.mark.network`) so they can be selectively skipped in constrained environments, while still running in standard CI.

Example marker usage:

```python
@pytest.mark.network
class TestHfTinyDataset:
    """End-to-end coverage with a real (tiny) HF dataset."""
    ...
```

Then configure `pytest.ini` or `pyproject.toml`:

```ini
markers =
    network: tests that require network access to HuggingFace Hub
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 445 - 506, Add a network-skippable marker to the TestHfTinyDataset test class so CI can skip HF Hub network tests when desired: annotate the TestHfTinyDataset class with a pytest marker such as `@pytest.mark.network` (referencing the TestHfTinyDataset class and tests like test_load_single_split_directly, test_dataloader_blending_two_hf_datasets, etc.), and add the corresponding marker declaration ("network: tests that require network access to HuggingFace Hub") to pytest.ini or pyproject.toml so pytest recognizes it.
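Assuming the marker above is registered, a constrained environment can then deselect the network tests with standard pytest selection:

```shell
pytest tests/unit/torch/utils/test_dataset_utils.py -m "not network"
```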
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e2e1cb6a-ec09-41ed-ab40-d1b508f61467
📒 Files selected for processing (2)
modelopt/torch/utils/dataset_utils.py, tests/unit/torch/utils/test_dataset_utils.py
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)
**511-589**: Consider gating live Hub tests as integration to reduce unit-test flakiness.

These cases are valuable, but remote dataset availability/network instability can make `tests/unit` non-deterministic. Consider marking them (e.g., `@pytest.mark.integration`) or guarding behind an env flag in CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 511 - 589, The live HF dataset tests in TestHfTinyDataset (e.g., test_load_single_split_directly, test_load_multiple_splits_directly, test_default_split_is_train, test_download_to_jsonl_then_load, test_dataloader_blending_two_hf_datasets, test_dataloader_mixing_hf_and_local_jsonl) are network-dependent and should be gated as integration tests: either add a pytest marker (e.g., annotate the TestHfTinyDataset class with `@pytest.mark.integration` or decorate each test) and register that marker in pytest.ini, or guard execution with an environment flag check (e.g., skip unless os.environ.get("RUN_INTEGRATION_TESTS") is set) so unit test runs remain deterministic in CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 351-354: The current warn call logs the full local_dataset_path
and raw exception text (e), which can leak sensitive filesystem or error payload
details; update the warning in the dataset load fallback so it does not include
raw paths or exception messages—use a sanitized identifier (e.g.,
Path(local_dataset_path).name or other non-sensitive dataset id) and only
include the exception type (type(e).__name__) or a generic phrase like "loader
error" instead of including {e} or the full path; modify the warn(...)
invocation accordingly (referencing warn, local_dataset_path, and e) so logs
avoid exposing sensitive content.
- Around line 337-356: The current except block around HF JSON parsing (the
except Exception as e handling is_jsonl) falls back to get_jsonl_text_samples
too broadly and can hide caller errors like invalid splits; update the except
block in dataset_utils.py to only attempt the legacy text-field fallback when
the original HF error is clearly a schema/column inference or parsing error, and
re-raise immediately for errors that indicate caller/config issues (e.g., the
exception message or type references a missing/unsupported split, "split", "not
found", or similar); implement this by inspecting e (str(e) and/or its type)
before calling get_jsonl_text_samples(local_dataset_path, num_samples,
key="text") so invalid split/config errors are not silently recovered from.
In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 395-408: The test currently only checks output values so it can
pass without exercising the fallback branch; modify
test_legacy_text_fallback_on_hf_builder_failure to force the HF json builder to
raise and/or assert the fallback path was executed: patch or monkeypatch the HF
loading function used inside get_dataset_samples (e.g., the
datasets.load_dataset / DatasetBuilder call) to raise an exception, then call
get_dataset_samples(path, num_samples=3) and verify both the returned samples
and that the legacy reader function inside get_dataset_samples (the code path
handling text-field fallback) was invoked (for example by spying/mocking that
fallback helper or checking a logged/returned indicator). Ensure references to
get_dataset_samples, test_legacy_text_fallback_on_hf_builder_failure, and
_write_jsonl are used to locate the code to change.
---
Nitpick comments:
In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 511-589: The live HF dataset tests in TestHfTinyDataset (e.g.,
test_load_single_split_directly, test_load_multiple_splits_directly,
test_default_split_is_train, test_download_to_jsonl_then_load,
test_dataloader_blending_two_hf_datasets,
test_dataloader_mixing_hf_and_local_jsonl) are network-dependent and should be
gated as integration tests: either add a pytest marker (e.g., annotate the
TestHfTinyDataset class with `@pytest.mark.integration` or decorate each test) and
register that marker in pytest.ini, or guard execution with an environment flag
check (e.g., skip unless os.environ.get("RUN_INTEGRATION_TESTS") is set) so unit
test runs remain deterministic in CI.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 24c5dbc5-fa7a-4e97-9451-2c88a59bb346
📒 Files selected for processing (2)
modelopt/torch/utils/dataset_utils.py, tests/unit/torch/utils/test_dataset_utils.py
```python
warn(
    f"Failed to load {local_dataset_path} via the HF 'json' builder "
    f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
)
```
**Sanitize fallback warning content to avoid leaking local path/error details**

Lines 352-353 log the full file path plus raw exception text. For untrusted dataset inputs, this can expose sensitive filesystem information or data-derived error payloads in logs.

💡 Proposed fix

```diff
-        warn(
-            f"Failed to load {local_dataset_path} via the HF 'json' builder "
-            f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-        )
+        safe_name = os.path.basename(local_dataset_path)
+        warn(
+            f"Failed to load JSONL file '{safe_name}' via HF 'json' builder "
+            f"({type(e).__name__}); fell back to legacy text-field reader."
+        )
```

As per coding guidelines, SECURITY.md says: "Be mindful of logging: avoid emitting sensitive paths/tokens/content in warnings/errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/utils/dataset_utils.py` around lines 351 - 354, The current
warn call logs the full local_dataset_path and raw exception text (e), which can
leak sensitive filesystem or error payload details; update the warning in the
dataset load fallback so it does not include raw paths or exception messages—use
a sanitized identifier (e.g., Path(local_dataset_path).name or other
non-sensitive dataset id) and only include the exception type (type(e).__name__)
or a generic phrase like "loader error" instead of including {e} or the full
path; modify the warn(...) invocation accordingly (referencing warn,
local_dataset_path, and e) so logs avoid exposing sensitive content.
```python
def test_legacy_text_fallback_on_hf_builder_failure(self, tmp_path):
    """If the HF json builder raises, fall back to the legacy text-field reader."""
    pytest.importorskip("datasets")
    # Mixed-type ``meta`` field across rows — int vs string — trips PyArrow
    # schema unification in the HF json builder. The rows still carry a
    # ``text`` field, so the legacy reader can recover the samples.
    rows = [
        {"text": "row a", "meta": 1},
        {"text": "row b", "meta": "two"},
        {"text": "row c", "meta": 3},
    ]
    path = _write_jsonl(tmp_path / "mixed.jsonl", rows)
    samples = get_dataset_samples(path, num_samples=3)
    assert samples == ["row a", "row b", "row c"]
```
**This fallback test can pass even if fallback is never executed**

Lines 407-408 only validate output values; if HF JSON loading succeeds, the assertion still passes. This does not prove the fallback branch ran.

💡 Proposed fix

```diff
-    def test_legacy_text_fallback_on_hf_builder_failure(self, tmp_path):
+    def test_legacy_text_fallback_on_hf_builder_failure(self, tmp_path, monkeypatch):
@@
-        samples = get_dataset_samples(path, num_samples=3)
+        import datasets
+
+        monkeypatch.setattr(datasets, "load_dataset", Mock(side_effect=RuntimeError("boom")))
+        with pytest.warns(UserWarning, match="fell back to legacy text-field reader"):
+            samples = get_dataset_samples(path, num_samples=3)
         assert samples == ["row a", "row b", "row c"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 395 - 408, The
test currently only checks output values so it can pass without exercising the
fallback branch; modify test_legacy_text_fallback_on_hf_builder_failure to force
the HF json builder to raise and/or assert the fallback path was executed: patch
or monkeypatch the HF loading function used inside get_dataset_samples (e.g.,
the datasets.load_dataset / DatasetBuilder call) to raise an exception, then
call get_dataset_samples(path, num_samples=3) and verify both the returned
samples and that the legacy reader function inside get_dataset_samples (the code
path handling text-field fallback) was invoked (for example by spying/mocking
that fallback helper or checking a logged/returned indicator). Ensure references
to get_dataset_samples, test_legacy_text_fallback_on_hf_builder_failure, and
_write_jsonl are used to locate the code to change.
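Put together, a self-contained version of that suggestion could look like the sketch below, assuming `get_dataset_samples` and the `_write_jsonl` helper are in scope in the test module and that the fallback emits the `UserWarning` quoted above:

```python
from unittest.mock import Mock

import pytest


def test_fallback_branch_is_exercised(tmp_path, monkeypatch):
    import datasets

    rows = [{"text": "row a"}, {"text": "row b"}, {"text": "row c"}]
    path = _write_jsonl(tmp_path / "mixed.jsonl", rows)  # helper from the test module
    # Force the HF loader to fail so only the legacy reader can produce the output.
    monkeypatch.setattr(datasets, "load_dataset", Mock(side_effect=RuntimeError("boom")))
    with pytest.warns(UserWarning, match="fell back to legacy text-field reader"):
        samples = get_dataset_samples(path, num_samples=3)
    assert samples == ["row a", "row b", "row c"]
```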
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)
**391-394**: ⚠️ Potential issue | 🟠 Major — Sanitize fallback warning to avoid leaking local path/error payloads.

This warning still includes `local_dataset_path` and raw exception text `e`, which can leak sensitive filesystem details or data-derived parser payloads for untrusted JSONL inputs.

Proposed fix

```diff
-        warn(
-            f"Failed to load {local_dataset_path} via the HF 'json' builder "
-            f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-        )
+        safe_name = Path(local_dataset_path).name
+        warn(
+            f"Failed to load JSONL file '{safe_name}' via HF 'json' builder "
+            f"({type(e).__name__}); fell back to legacy text-field reader."
+        )
```

As per coding guidelines, SECURITY.md requires avoiding sensitive paths/content in warnings/errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 391 - 394, The warning leaks sensitive filesystem paths and raw exception text; update the warn call in the dataset loader (the block referencing local_dataset_path and exception e) to remove local_dataset_path and the full exception payload and instead emit a sanitized message (e.g., "Failed to load dataset via HF 'json' builder; falling back to legacy reader") and, if needed, include only the exception class name (type(e).__name__) or a short safe error code; if you must preserve full details for debugging, send them to a debug-level logger instead of the user-facing warn.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 130f9e2e-967d-4ec7-93fc-d93a676bc488
📒 Files selected for processing (1)
modelopt/torch/utils/dataset_utils.py
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/dataset_utils.py (1)
**181-212**: ⚠️ Potential issue | 🟠 Major — Don't conflate `None` with other falsy column values.

These `sample.get(...)` checks also skip valid empty-string fields, so rows like `{"prompt": "", "completion": "A"}` or `{"input": "", "output": "A"}` now miss auto-detection and fail even though only `None` needs special handling here.

Proposed fix

```diff
-    chat_key = next((k for k in ("messages", "conversations") if sample.get(k)), None)
+    def _has_non_null_value(key: str) -> bool:
+        return key in sample and sample[key] is not None
+
+    chat_key = next((k for k in ("messages", "conversations") if _has_non_null_value(k)), None)
@@
-    if sample.get("prompt"):
+    if _has_non_null_value("prompt"):
         parts = [sample["prompt"]]
-        parts.extend(sample[k] for k in ("completion", "response", "output") if sample.get(k))
+        parts.extend(
+            sample[k]
+            for k in ("completion", "response", "output")
+            if _has_non_null_value(k)
+        )
         return "\n".join(parts)
@@
-    if sample.get("text"):
+    if _has_non_null_value("text"):
         return sample["text"]
@@
-    if sample.get("input"):
+    if _has_non_null_value("input"):
         parts = [sample["input"]]
-        if sample.get("output"):
+        if _has_non_null_value("output"):
             parts.append(sample["output"])
         return "\n".join(parts)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 181 - 212, The checks using truthy sample.get(...) incorrectly treat valid empty-string fields as missing; change them to explicit None checks so empty strings are accepted: replace the chat_key detection with next((k for k in ("messages","conversations") if sample.get(k) is not None), None), change tools handling to if tools is not None, and update the subsequent conditionals (if sample.get("prompt"), if sample.get("text"), if sample.get("input"), and any sample.get("completion"/"response"/"output" checks) to use "is not None" so tokenizer.apply_chat_template, the prompt/completion join logic, and input/output handling treat "" as valid while still skipping None.
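The pitfall is plain Python truthiness: an empty string is falsy, so a truthiness check cannot distinguish it from a missing or `None` column:

```python
sample = {"prompt": "", "completion": "A"}

# Truthiness check treats the valid empty prompt as absent:
assert not sample.get("prompt")

# An explicit None check keeps it:
assert sample.get("prompt") is not None
assert sample.get("missing") is None
```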
♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)
**397-400**: ⚠️ Potential issue | 🟠 Major — Sanitize the JSONL fallback warning.

This still logs the full local path and raw exception text for untrusted dataset inputs. Please keep the warning to a safe identifier plus the exception type.

Proposed fix

```diff
-        warn(
-            f"Failed to load {local_dataset_path} via the HF 'json' builder "
-            f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-        )
+        warn(
+            f"Failed to load JSONL file '{Path(local_dataset_path).name}' via the HF "
+            f"'json' builder ({type(e).__name__}); fell back to legacy text-field reader."
+        )
```

As per coding guidelines, SECURITY.md says: "Be careful not to log sensitive data (paths, tokens, proprietary content)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 397 - 400, The warning currently logs the full local path and raw exception text; change the warn call in dataset_utils.py to avoid sensitive data by removing the full path and exception message and instead log a safe identifier plus the exception type. Use a safe identifier such as os.path.basename(local_dataset_path) or an existing dataset identifier variable (e.g., dataset_name) in place of local_dataset_path, and keep only type(e).__name__ (remove {e}) in the message for the warn(...) call to produce: "Failed to load dataset '<safe_identifier>' via the HF 'json' builder (<ExceptionType>); fell back to legacy text-field reader."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2976e442-d11c-40e3-ac14-90f57ac4afab
📒 Files selected for processing (2)
modelopt/torch/utils/dataset_utils.py, tests/unit/torch/utils/test_dataset_utils.py
```python
splits = _normalize_splits(split) if split is not None else ["train"]

def _preprocess(sample: dict) -> str:
    return _auto_preprocess_sample(sample, dataset_name, tokenizer)

# Narrow the legacy fallback to JSON-parsing / Arrow schema failures. Any
# other error (split-not-found, IO, OOM, ...) should surface to the caller
# rather than be hidden by the text-field reader. Imported lazily because
# the exact module paths vary across versions; an empty tuple is a valid
# ``except`` target that catches nothing if neither is importable.
fallback_types: tuple[type[BaseException], ...] = ()
try:
    from datasets.exceptions import DatasetGenerationError

    fallback_types += (DatasetGenerationError,)
except ImportError:
    pass
try:
    from pyarrow.lib import ArrowInvalid

    fallback_types += (ArrowInvalid,)
except ImportError:
    pass

# load_dataset does not support a list of splits while streaming, so load each separately.
print(f"Loading dataset with {config=} and {splits=}")
try:
    dataset_splits = [load_dataset(streaming=True, **config, split=s) for s in splits]

    num_per_split = [num_samples // len(dataset_splits)] * len(dataset_splits)
    num_per_split[-1] += num_samples - sum(num_per_split)

    samples: list[str] = []
    for dataset, n in zip(dataset_splits, num_per_split):
        for i, sample in enumerate(dataset):
            if i >= n:
                break
            text = _preprocess(sample)
            if text:
                samples.append(text)
```
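As an aside, the empty-tuple `except` target mentioned in the comment above is ordinary Python behaviour: a tuple of exception types that happens to be empty simply matches nothing, so the error flows to the next handler or the caller.

```python
fallback_types: tuple[type[BaseException], ...] = ()

try:
    raise ValueError("boom")
except fallback_types:
    handled = "by fallback"  # never reached: () matches no exception type
except ValueError:
    handled = "by caller"    # the error surfaces here instead
print(handled)  # -> by caller
```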
**Reject an empty split list before computing per-split quotas.**

`split=[]` currently reaches `num_samples // len(dataset_splits)` and crashes with `ZeroDivisionError`. Please fail fast with a `ValueError` once `splits` has been normalized.

Proposed fix

```diff
     splits = _normalize_splits(split) if split is not None else ["train"]
+    if not splits:
+        raise ValueError("``split`` must contain at least one split name.")
@@
-    num_per_split = [num_samples // len(dataset_splits)] * len(dataset_splits)
+    num_per_split = [num_samples // len(dataset_splits)] * len(dataset_splits)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/utils/dataset_utils.py` around lines 343 - 373, After
normalizing splits via _normalize_splits (or defaulting to ["train"]), validate
that the resulting splits list is not empty and raise a ValueError if it is;
place this check before calling load_dataset and before computing num_per_split
to avoid a ZeroDivisionError. Update the logic around the splits variable (used
when calling load_dataset(..., split=s) and when computing num_per_split) to
fail fast with a clear message like "empty splits list after normalization"
rather than allowing a division by zero.
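For reference, the failure mode the comment describes is a plain integer division by zero once the split list is empty:

```python
num_samples = 6
dataset_splits: list = []  # what an empty ``split=[]`` would produce

# Raises ZeroDivisionError: integer division or modulo by zero
num_per_split = [num_samples // len(dataset_splits)] * len(dataset_splits)
```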
```python
# ---------------------------------------------------------------------------
# Live HF dataset round-trips. ``hf-internal-testing/dataset_with_data_files``
# is a 10-row x {train,test} fixture maintained by HF for their own CI — tiny
# enough to download in a unit test and stable across releases.
# ---------------------------------------------------------------------------

_HF_TINY = "hf-internal-testing/dataset_with_data_files"  # train, test splits, ``text`` col


def _hf_dump_to_jsonl(name: str, split: str, path) -> str:
    from datasets import load_dataset

    ds = load_dataset(name, split=split)
    ds.to_json(str(path), lines=True)
    return str(path)


class TestHfTinyDataset:
    """End-to-end coverage with a real (tiny) HF dataset."""

    def test_load_single_split_directly(self):
        pytest.importorskip("datasets")
        samples = get_dataset_samples(_HF_TINY, num_samples=4, split="train")
        assert len(samples) == 4
        assert all(isinstance(s, str) and s for s in samples)

    def test_load_multiple_splits_directly(self):
        """``split=["train", "test"]`` divides ``num_samples`` across both splits."""
        pytest.importorskip("datasets")
        samples = get_dataset_samples(_HF_TINY, num_samples=6, split=["train", "test"])
        assert len(samples) == 6
        # Default per-split is num_samples // n + remainder; for 6/2 → 3 from each.
        # We can't assert exact origin without re-reading, but both splits should
        # contribute, which we'll confirm by comparing against direct loads below.
        train_only = set(get_dataset_samples(_HF_TINY, num_samples=10, split="train"))
        test_only = set(get_dataset_samples(_HF_TINY, num_samples=10, split="test"))
        assert any(s in train_only for s in samples)
        assert any(s in test_only for s in samples)

    def test_default_split_is_train(self):
        pytest.importorskip("datasets")
        default_samples = get_dataset_samples(_HF_TINY, num_samples=4)
        train_samples = get_dataset_samples(_HF_TINY, num_samples=4, split="train")
        assert default_samples == train_samples

    def test_download_to_jsonl_then_load(self, tmp_path):
        """Dump the HF dataset to JSONL, then reload it via the local-jsonl path."""
        pytest.importorskip("datasets")
        jsonl_path = _hf_dump_to_jsonl(_HF_TINY, "train", tmp_path / "train.jsonl")
        from_jsonl = get_dataset_samples(jsonl_path, num_samples=10)
        from_hf = get_dataset_samples(_HF_TINY, num_samples=10, split="train")
        assert from_jsonl == from_hf

    def test_dataloader_blending_two_hf_datasets(self, pad_tokenizer):
        """Two HF datasets concatenated via ``get_dataset_dataloader``."""
        pytest.importorskip("datasets")
        loader = get_dataset_dataloader(
            dataset_name=[_HF_TINY, "hf-internal-testing/multi_dir_dataset"],
            tokenizer=pad_tokenizer,
            batch_size=4,
            num_samples=[3, 1],
            max_sample_length=16,
        )
        batches = list(loader)
        assert sum(b["input_ids"].shape[0] for b in batches) == 4

    def test_dataloader_mixing_hf_and_local_jsonl(self, tmp_path, pad_tokenizer):
        """Live HF dataset blended with a local synthetic JSONL file."""
        pytest.importorskip("datasets")
        local = _write_jsonl(tmp_path / "local.jsonl", [{"text": f"local {i}"} for i in range(2)])
        loader = get_dataset_dataloader(
            dataset_name=[_HF_TINY, local],
            tokenizer=pad_tokenizer,
            batch_size=5,
            num_samples=[3, 2],
            max_sample_length=16,
        )
        batches = list(loader)
        assert sum(b["input_ids"].shape[0] for b in batches) == 5
```
🧩 Analysis chain

🏁 Script executed:
```shell
# Find the implementation of get_dataset_samples and get_dataset_dataloader
find . -type f -name "*.py" -path "*/modelopt/*" | head -20
```
Repository: NVIDIA/Model-Optimizer · Length of output: 814

🏁 Script executed:
```shell
# Search for the get_dataset_samples function definition
rg "def get_dataset_samples" --type py -B 2 -A 10
```
Repository: NVIDIA/Model-Optimizer · Length of output: 954

🏁 Script executed:
```shell
# Search for get_dataset_dataloader definition
rg "def get_dataset_dataloader" --type py -B 2 -A 10
```
Repository: NVIDIA/Model-Optimizer · Length of output: 1009

🏁 Script executed:
```shell
# Check if there's a conftest.py in the tests/unit directory
find tests/unit -name "conftest.py" -type f
```
Repository: NVIDIA/Model-Optimizer · Length of output: 176

🏁 Script executed:
```shell
# Check the test file itself for any pytest markers or fixtures
rg "pytest.mark|@pytest|conftest|mock|patch" tests/unit/torch/utils/test_dataset_utils.py | head -20
```
Repository: NVIDIA/Model-Optimizer · Length of output: 1039

🏁 Script executed:
```shell
# Get the line count first to see file size
wc -l modelopt/torch/utils/dataset_utils.py
```
Repository: NVIDIA/Model-Optimizer · Length of output: 108

🏁 Script executed:
```shell
# Check the implementation of get_dataset_samples to see how it calls load_dataset
sed -n '/def get_dataset_samples/,/^def /p' modelopt/torch/utils/dataset_utils.py | head -80
```
Repository: NVIDIA/Model-Optimizer · Length of output: 4405

🏁 Script executed:
```shell
# Check conftest.py for any fixtures that might mock datasets
cat tests/unit/torch/conftest.py
```
Repository: NVIDIA/Model-Optimizer · Length of output: 2146

🏁 Script executed:
```shell
# Check the test file for any imports and setup at the top
head -100 tests/unit/torch/utils/test_dataset_utils.py
```
Repository: NVIDIA/Model-Optimizer · Length of output: 3254

🏁 Script executed:
```shell
# Look for pytest markers on the TestHfTinyDataset class
rg "TestHfTinyDataset" tests/unit/torch/utils/test_dataset_utils.py -B 5 -A 30
```
Repository: NVIDIA/Model-Optimizer · Length of output: 1810

🏁 Script executed:
```shell
# Check for pytest markers around TestHfTinyDataset more carefully
rg "class TestHfTinyDataset" tests/unit/torch/utils/test_dataset_utils.py -B 10 -A 2
```
Repository: NVIDIA/Model-Optimizer · Length of output: 355

🏁 Script executed:
```shell
# Look for pytest configuration
find . -name "pytest.ini" -o -name "pyproject.toml" -o -name "setup.cfg" | grep -E "(pytest|pyproject|setup)" | head -5
```
Repository: NVIDIA/Model-Optimizer · Length of output: 149

🏁 Script executed:
```shell
# Check pyproject.toml for test configuration
rg "pytest|testpaths|markers" pyproject.toml -i -A 3 -B 1 | head -60
```
Repository: NVIDIA/Model-Optimizer · Length of output: 996

🏁 Script executed:
```shell
# Check if there are any skip markers in the entire test_dataset_utils.py file
rg "pytest.mark|skip|xfail" tests/unit/torch/utils/test_dataset_utils.py
```
Repository: NVIDIA/Model-Optimizer · Length of output: 1136
TestHfTinyDataset downloads from live Hugging Face infrastructure; move to integration tests or mock the dataset loading.
The tests in lines 531-609 call get_dataset_samples() and get_dataset_dataloader() with real HF dataset identifiers (_HF_TINY = "hf-internal-testing/dataset_with_data_files"), which triggers live downloads via load_dataset() with no mocking in place. These tests lack integration markers and are placed in tests/unit/, creating flakiness risk when CI is offline, rate-limited, or upstream datasets change. Either relocate these to an integration suite (with @pytest.mark.integration) or replace load_dataset() calls with locally-cached fixtures or monkeypatched mock behavior to keep the unit suite hermetic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 531 - 609, The
TestHfTinyDataset tests use real HF data (_HF_TINY, _hf_dump_to_jsonl) and call
get_dataset_samples and get_dataset_dataloader which invoke
datasets.load_dataset; move these tests out of unit tests or mark them as
integration tests (e.g., add `@pytest.mark.integration` to TestHfTinyDataset or
each test) OR replace live calls by monkeypatching datasets.load_dataset (or
patch get_dataset_samples/get_dataset_dataloader to use local fixtures) so the
unit suite is hermetic and no network download occurs.
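One way to implement the env-flag variant of that suggestion (names here are illustrative, not from the PR):

```python
import os

import pytest

# Skip live-Hub tests unless the caller opts in explicitly.
requires_hub = pytest.mark.skipif(
    not os.environ.get("RUN_INTEGRATION_TESTS"),
    reason="set RUN_INTEGRATION_TESTS=1 to run tests that download from HF Hub",
)


@requires_hub
class TestHfTinyDataset:
    ...
```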
Summary
Local `.jsonl` paths fed to `get_dataset_samples` / `get_dataset_dataloader` previously went through a `text`-key-only reader, while HF dataset names flowed through an auto-preprocess pipeline that recognizes `messages` / `conversations` / `prompt` / `text` / `input` columns. This split meant a calibration dataset behaved differently depending on whether it lived on HF Hub or on disk.

This PR routes local `.jsonl` through HF's `json` builder and the same auto-preprocess pipeline, so the format is detected from columns, not the source. The legacy `text`-field reader is preserved as a fallback for files where the HF builder fails (e.g. PyArrow schema unification across heterogeneous rows). Existing callers passing a plain `{"text": ...}` JSONL file keep working unchanged; chat-shaped JSONL now works without a separate code path.
Changes

`modelopt/torch/utils/dataset_utils.py`
- Route local `.jsonl` paths through `load_dataset(path="json", data_files=...)` + `_auto_preprocess_sample`.
- Wrap the load in `try/except`; on failure for `.jsonl`, fall back to `get_jsonl_text_samples`. If the fallback also fails, re-raise the original HF error so the diagnostic is preserved.
- Keep the `splits = ["train"]` invariant for HF's file-based builders.
Tests

`tests/unit/torch/utils/test_dataset_utils.py` — three new test classes (18 cases, all passing):
- `TestLocalJsonlLoading` — text / messages / conversations / prompt+completion / input+output columns; `num_samples` honored; `tools` kwarg forwarded; `ValueError` on unrecognized columns; legacy `text`-key fallback on schema-unification failure.
- `TestGetDatasetDataloaderBlending` — single JSONL, list of JSONL files concatenated, mixed-format JSONL files blended, length-mismatch assertion.
- `TestHfTinyDataset` — uses `hf-internal-testing/dataset_with_data_files` (10 rows x {train, test}) for end-to-end coverage: single split, multiple splits, default split, HF -> JSONL -> reload round-trip, two-HF-dataset blending, HF + local-JSONL mixing.
Test plan
- `pytest tests/unit/torch/utils/test_dataset_utils.py` — 28 passed
- `mypy modelopt/torch/utils/dataset_utils.py` — no new errors
- `ruff check` — clean

Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests