Removed deep-copy data.table ops from the dataProcess pipeline#208

Open

tonywu1999 wants to merge 3 commits into devel from MSstats/work/20260514_avoid-deep-copy-ops

Conversation

@tonywu1999 (Contributor) commented May 14, 2026

User description

  • Replaced input[, cols, with = FALSE] deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
  • Replaced row-shuffle input = input[order(...), ] in .prepareForDataProcess with data.table::setorder() (in place).
  • Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes — avoids deep-copying the whole table.
  • Replaced the synthesised tmp string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.
  • Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes.
  • Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy.
  • Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame().
  • Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL — a downstream test in test_dataProcess.R relies on LABEL being kept.
  • Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4 — 28 assertions, all green. Full suite 224/224 OK.
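As a rough illustration of the first two bullets (toy data and column names, not the actual MSstats schema), dropping columns with data.table::set() and reordering with setorder() both operate on the existing table instead of allocating a copy:

```r
library(data.table)

# Toy stand-in for `input`; columns are illustrative only.
input = data.table(PROTEIN = c("B", "A"), INTENSITY = c(2, 1), junk = c(9, 9))
cols = c("PROTEIN", "INTENSITY")

# input[, cols, with = FALSE] would allocate a new table holding copies of
# the kept columns; set(j = col, value = NULL) removes columns in place.
drop_cols = setdiff(colnames(input), cols)
for (col in drop_cols) set(input, j = col, value = NULL)

# input[order(PROTEIN), ] would build a reshuffled copy;
# setorder() permutes the rows of the same object by reference.
setorder(input, PROTEIN)
```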

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md
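The ifelse() bullet above can be sketched as follows (hypothetical values mirroring the predicted/newABUNDANCE columns):

```r
library(data.table)

# Hypothetical rows; `censored` marks values that imputation should overwrite.
dt = data.table(newABUNDANCE = c(1, NA, 3, NA),
                predicted    = c(9, 5, 9, 7),
                censored     = c(FALSE, TRUE, FALSE, TRUE))

# ifelse(censored, predicted, newABUNDANCE) materializes a full-length result
# vector even though only two rows change; the targeted form below writes
# just the matching rows in place.
dt[censored == TRUE, newABUNDANCE := predicted]
```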

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have run the devtools::document() command after my changes and committed the added files

PR Type

Enhancement, Bug fix, Tests


Description

  • Replace full-table copies with in-place updates

  • Use keyed lookups for joins

  • Split survival outputs before finalization

  • Add memory-regression pipeline tests
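A minimal sketch of the keyed-lookup idea (illustrative tables, not the real fraction data): instead of merge(all.x = TRUE) building a brand-new merged table, a join with which = TRUE returns row indices that feed an in-place set():

```r
library(data.table)

input = data.table(FEATURE = c("f1", "f2", "f3"), FRACTION = c(1L, 1L, 2L))
runs  = data.table(FEATURE = c("f1", "f3"), FRACTION = c(1L, 2L),
                   newRun  = c("run_A", "run_B"))

# For each row of `input`, find the first matching row index in `runs`
# (NA where there is no match), then write the looked-up column in place.
idx = runs[input, on = c("FEATURE", "FRACTION"), which = TRUE, mult = "first"]
set(input, j = "newRun", value = runs$newRun[idx])
```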


Diagram Walkthrough

flowchart LR
  a["Copy-heavy data.table operations"]
  b["In-place set and setorder updates"]
  c["Keyed survival and fraction lookups"]
  d["Lower-memory summarization pipeline"]
  a -- "replaced by" --> b
  b -- "combined with" --> c
  c -- "reduces copies in" --> d

File Walkthrough

Relevant files
Enhancement
dataProcess.R
Limit censored-value updates to matching rows                       

R/dataProcess.R

  • Replace ifelse() rewrites with targeted := updates
  • Only keep predicted on applicable censored rows
  • Only overwrite newABUNDANCE where imputation applies
+12/-10 
utils_checks.R
Avoid copies during input trimming and sorting                     

R/utils_checks.R

  • Drop unwanted columns via data.table::set(..., NULL)
  • Avoid retained-column deep copies during preparation
  • Sort rows in place with data.table::setorder()
+13/-5   
utils_feature_selection.R
Collapse feature preprocessing into one pass                         

R/utils_feature_selection.R

  • Build the feature-selection working table once
  • Handle optional censored values inline
  • Compute is_obs without intermediate tables
+20/-22 
utils_normalize.R
Use in-place cleanup and keyed fraction merges                     

R/utils_normalize.R

  • Remove normalization temp columns in place
  • Return the modified input after cleanup
  • Replace merge() with keyed newRun assignment
  • Filter fractions by direct (FEATURE, FRACTION) lookup
+23/-13 
Bug fix
utils_output.R
Streamline summary output and imputation joins                     

R/utils_output.R

  • Extract combined predicted_survival before finalization
  • Free nested survival entries before binding summaries
  • Replace merge-based imputation with keyed writes
  • Guard join columns and update flags in place
+51/-27 
Tests
test_memory_optimization_copies.R
Add memory regression tests for copy avoidance                     

inst/tinytest/test_memory_optimization_copies.R

  • Add address-based assertions for in-place behavior
  • Test .normalizeMedian temp-column cleanup
  • Test .finalizeTMP keyed matches and unmatched NAs
  • Verify MSstatsSummarizationOutput list splitting behavior
+469/-0 
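The address-based assertions in the new test file can be illustrated like this (a generic sketch, not the actual test code): an in-place operation must leave the table at the same memory address, while a copying operation produces a fresh allocation.

```r
library(data.table)

dt = data.table(x = c(3L, 1L, 2L))
addr_before = address(dt)

setorder(dt, x)                  # in-place path: the same object survives
stopifnot(identical(address(dt), addr_before))

dt_copy = dt[order(x), ]         # copying path: a new object is allocated
stopifnot(!identical(address(dt_copy), addr_before))
```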

coderabbitai Bot commented May 14, 2026

Warning

Rate limit exceeded

@Rudhik1904 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 22 minutes and 1 second before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2ba1749b-700f-4199-8355-b48e8959b078

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa0bd1 and e2bd771.

📒 Files selected for processing (6)
  • R/dataProcess.R
  • R/utils_checks.R
  • R/utils_feature_selection.R
  • R/utils_normalize.R
  • R/utils_output.R
  • inst/tinytest/test_memory_optimization_copies.R

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Input mutation

MSstatsPrepareForDataProcess now removes columns from input in place and then adds derived columns on the same object. If the caller passes a data.table they intend to reuse, this function will now silently delete all non-selected columns from that original object, which did not happen before when the code first created a subset copy.

drop_cols = setdiff(colnames(input), cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

input$PEPTIDE = paste(input$PEPTIDESEQUENCE,
                      input$PRECURSORCHARGE, sep = "_")
input$TRANSITION = paste(input$FRAGMENTION, 
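If that side effect is unwanted, one possible mitigation (a sketch, not code from this PR) is a single explicit copy before trimming, which keeps the memory win of the set() loop while leaving the caller's object intact:

```r
library(data.table)

input = data.table(PEPTIDESEQUENCE = "PEPTIDEK", PRECURSORCHARGE = 2L, extra = 0)
cols  = c("PEPTIDESEQUENCE", "PRECURSORCHARGE")

work = copy(input)  # one deliberate copy; `input` is never touched below
for (col in setdiff(colnames(work), cols)) set(work, j = col, value = NULL)
work[, PEPTIDE := paste(PEPTIDESEQUENCE, PRECURSORCHARGE, sep = "_")]
```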
Output mutation

MSstatsSummarizationOutput now mutates its input argument by reference by dropping columns and converting it with setDF(). Because input is a data.table, the caller's object will be stripped down to output columns and lose its data.table class after this call, which can break any downstream code that reuses the same table.

drop_cols = setdiff(colnames(input), output_cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

if (is.element("remove", colnames(processed))) {
    processed = processed[(remove),
                          intersect(output_cols,
                                    colnames(processed)), with = FALSE]
    input = rbind(input, processed, fill = TRUE)
}
data.table::setDF(input)
data.table::setDF(rqall)
list(FeatureLevelData = input,

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Avoid mutating caller objects

data.table::setDF() mutates by reference, so this function now converts the caller's
input object into a plain data.frame. That is a behavioral regression from
as.data.frame(...) and can break later code that still expects input to be a
data.table.

R/utils_output.R [101-105]

-data.table::setDF(input)
-data.table::setDF(rqall)
-list(FeatureLevelData = input,
-     ProteinLevelData = rqall,
+feature_output = data.table::copy(input)
+protein_output = data.table::copy(rqall)
+data.table::setDF(feature_output)
+data.table::setDF(protein_output)
+list(FeatureLevelData = feature_output,
+     ProteinLevelData = protein_output,
      SummaryMethod = method)
Suggestion importance[1-10]: 7


Why: This is correct and important for API behavior: data.table::setDF() mutates input and rqall by reference, unlike the previous as.data.frame(...) conversion. Copying before setDF() preserves the optimization while avoiding surprising side effects on caller-visible data.table objects.

Medium
Guard nested summary extraction

Validate summarized before indexing x[[2]]. The new code dereferences every element
immediately, so any try-error or short result now crashes here and bypasses the
function's existing failure handling.

R/utils_output.R [41-44]

-predicted_survival = data.table::rbindlist(lapply(summarized, function(x) x[[2]]),
-                                            fill = TRUE)
+invalid_summary = vapply(
+    summarized,
+    function(x) inherits(x, "try-error") || length(x) < 2L,
+    logical(1)
+)
+if (any(invalid_summary)) {
+    stop("`summarized` contains failed or incomplete protein summaries.")
+}
+predicted_survival = data.table::rbindlist(lapply(summarized, `[[`, 2L),
+                                           fill = TRUE)
 for (i in seq_along(summarized)) summarized[[i]][[2]] = NULL
 input = .finalizeInput(input, predicted_survival, method, impute, censored_symbol)
Suggestion importance[1-10]: 6


Why: This is a valid robustness concern: MSstatsSummarizationOutput() now dereferences x[[2]] before the later inherits(summarized, "try-error") check, so malformed entries in summarized can fail earlier with an uncontrolled error. The fix is relevant, but it mainly improves error handling rather than changing core results.

Low

@mstaniak (Contributor) commented:

Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr?


@mstaniak left a comment


Hi,
thanks again for this update. I have a few minor comments

# If they were similar sizes, we couldn't distinguish "allocated 2 temp
# columns" from "copied the whole table" and the test would be meaningless.
# For our 11-column, 100K-row table: 2 cols ≈ 1.6 MB, full table ≈ 6 MB.
expect_true(one_col_size * 2 < table_size * 0.5,

I wonder, is this test not too sensitive to implementation details out of our control? It's reasonable, but how about we ensure that this being wrong doesn't break output of pkg test, and instead only gives us a message?

features_quality = data.table::rbindlist(lapply(split(input, input$label),

has_censored = is.element("censored", colnames(input))
work = input[, list(protein = as.character(PROTEIN),

"work" was a name used in MSstats 3, I tried to change it to input and the likes, may have missed some uses. Can't we use "input" as the variable name here?

Comment thread R/utils_normalize.R
on = c("FEATURE", "FRACTION"),
which = TRUE, mult = "first"]
input = input[!is.na(keep_idx)]
input$originalRUN = input$newRun

We should use := here, right? Otherwise I think a copy is made
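The reviewer's point can be sketched on a toy table: assigning with `$<-` goes through R's copy-on-modify replacement of the whole object, whereas `:=` updates the column on the existing data.table by reference.

```r
library(data.table)

input = data.table(newRun = c("r1", "r2"), originalRUN = c("old1", "old2"))
addr_before = address(input)

input[, originalRUN := newRun]   # update by reference: no new table allocated
stopifnot(identical(address(input), addr_before))
```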
