Removed deep-copy data.table ops from the dataProcess pipeline#208

Open

tonywu1999 wants to merge 3 commits into devel from MSstats/work/20260514_avoid-deep-copy-ops

Conversation

@tonywu1999 (Contributor) commented May 14, 2026

User description

  • Replaced input[, cols, with = FALSE] deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
  • Replaced row-shuffle input = input[order(...), ] in .prepareForDataProcess with data.table::setorder() (in place).
  • Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes — avoids deep-copying the whole table.
  • Replaced the synthesised tmp string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.
  • Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes.
  • Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy.
  • Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame().
  • Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL — a downstream test in test_dataProcess.R relies on LABEL being kept.
  • Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4 — 28 assertions, all green. Full suite 224/224 OK.
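As a rough illustration of the first two bullets (toy data and column names, not the actual MSstats schema), dropping columns with data.table::set() and reordering with setorder() both operate on the existing table instead of allocating a copy:

```r
library(data.table)

# Toy stand-in for `input`; columns are illustrative only.
input = data.table(PROTEIN = c("B", "A"), INTENSITY = c(2, 1), junk = c(9, 9))
cols = c("PROTEIN", "INTENSITY")

# input[, cols, with = FALSE] would allocate a new table holding copies of
# the kept columns; set(j = col, value = NULL) removes columns in place.
drop_cols = setdiff(colnames(input), cols)
for (col in drop_cols) set(input, j = col, value = NULL)

# input[order(PROTEIN), ] would build a reshuffled copy;
# setorder() permutes the rows of the same object by reference.
setorder(input, PROTEIN)
```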

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md
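The ifelse() bullet above can be sketched as follows (hypothetical values mirroring the predicted/newABUNDANCE columns):

```r
library(data.table)

# Hypothetical rows; `censored` marks values that imputation should overwrite.
dt = data.table(newABUNDANCE = c(1, NA, 3, NA),
                predicted    = c(9, 5, 9, 7),
                censored     = c(FALSE, TRUE, FALSE, TRUE))

# ifelse(censored, predicted, newABUNDANCE) materializes a full-length result
# vector even though only two rows change; the targeted form below writes
# just the matching rows in place.
dt[censored == TRUE, newABUNDANCE := predicted]
```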

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have run the devtools::document() command after my changes and committed the added files

PR Type

Enhancement, Bug fix, Tests


Description

  • Replace full-table copies with in-place updates

  • Use keyed lookups for joins

  • Split survival outputs before finalization

  • Add memory-regression pipeline tests
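A minimal sketch of the keyed-lookup idea (illustrative tables, not the real fraction data): instead of merge(all.x = TRUE) building a brand-new merged table, a join with which = TRUE returns row indices that feed an in-place set():

```r
library(data.table)

input = data.table(FEATURE = c("f1", "f2", "f3"), FRACTION = c(1L, 1L, 2L))
runs  = data.table(FEATURE = c("f1", "f3"), FRACTION = c(1L, 2L),
                   newRun  = c("run_A", "run_B"))

# For each row of `input`, find the first matching row index in `runs`
# (NA where there is no match), then write the looked-up column in place.
idx = runs[input, on = c("FEATURE", "FRACTION"), which = TRUE, mult = "first"]
set(input, j = "newRun", value = runs$newRun[idx])
```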


Diagram Walkthrough

flowchart LR
  a["Copy-heavy data.table operations"]
  b["In-place set and setorder updates"]
  c["Keyed survival and fraction lookups"]
  d["Lower-memory summarization pipeline"]
  a -- "replaced by" --> b
  b -- "combined with" --> c
  c -- "reduces copies in" --> d

File Walkthrough

Relevant files
Enhancement
dataProcess.R
Limit censored-value updates to matching rows                       

R/dataProcess.R

  • Replace ifelse() rewrites with targeted := updates
  • Only keep predicted on applicable censored rows
  • Only overwrite newABUNDANCE where imputation applies
+12/-10 
utils_checks.R
Avoid copies during input trimming and sorting                     

R/utils_checks.R

  • Drop unwanted columns via data.table::set(..., NULL)
  • Avoid retained-column deep copies during preparation
  • Sort rows in place with data.table::setorder()
+13/-5   
utils_feature_selection.R
Collapse feature preprocessing into one pass                         

R/utils_feature_selection.R

  • Build the feature-selection working table once
  • Handle optional censored values inline
  • Compute is_obs without intermediate tables
+20/-22 
utils_normalize.R
Use in-place cleanup and keyed fraction merges                     

R/utils_normalize.R

  • Remove normalization temp columns in place
  • Return the modified input after cleanup
  • Replace merge() with keyed newRun assignment
  • Filter fractions by direct (FEATURE, FRACTION) lookup
+23/-13 
Bug fix
utils_output.R
Streamline summary output and imputation joins                     

R/utils_output.R

  • Extract combined predicted_survival before finalization
  • Free nested survival entries before binding summaries
  • Replace merge-based imputation with keyed writes
  • Guard join columns and update flags in place
+51/-27 
Tests
test_memory_optimization_copies.R
Add memory regression tests for copy avoidance                     

inst/tinytest/test_memory_optimization_copies.R

  • Add address-based assertions for in-place behavior
  • Test .normalizeMedian temp-column cleanup
  • Test .finalizeTMP keyed matches and unmatched NAs
  • Verify MSstatsSummarizationOutput list splitting behavior
+469/-0 
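The address-based assertions in the new test file can be illustrated like this (a generic sketch, not the actual test code): an in-place operation must leave the table at the same memory address, while a copying operation produces a fresh allocation.

```r
library(data.table)

dt = data.table(x = c(3L, 1L, 2L))
addr_before = address(dt)

setorder(dt, x)                  # in-place path: the same object survives
stopifnot(identical(address(dt), addr_before))

dt_copy = dt[order(x), ]         # copying path: a new object is allocated
stopifnot(!identical(address(dt_copy), addr_before))
```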

coderabbitai Bot commented May 14, 2026

Warning

Rate limit exceeded

@Rudhik1904 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 22 minutes and 1 second before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2ba1749b-700f-4199-8355-b48e8959b078

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa0bd1 and e2bd771.

📒 Files selected for processing (6)
  • R/dataProcess.R
  • R/utils_checks.R
  • R/utils_feature_selection.R
  • R/utils_normalize.R
  • R/utils_output.R
  • inst/tinytest/test_memory_optimization_copies.R

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Input mutation

MSstatsPrepareForDataProcess now removes columns from input in place and then adds derived columns on the same object. If the caller passes a data.table they intend to reuse, this function will now silently delete all non-selected columns from that original object, which did not happen before when the code first created a subset copy.

drop_cols = setdiff(colnames(input), cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

input$PEPTIDE = paste(input$PEPTIDESEQUENCE,
                      input$PRECURSORCHARGE, sep = "_")
input$TRANSITION = paste(input$FRAGMENTION, 
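If that side effect is unwanted, one possible mitigation (a sketch, not code from this PR) is a single explicit copy before trimming, which keeps the memory win of the set() loop while leaving the caller's object intact:

```r
library(data.table)

input = data.table(PEPTIDESEQUENCE = "PEPTIDEK", PRECURSORCHARGE = 2L, extra = 0)
cols  = c("PEPTIDESEQUENCE", "PRECURSORCHARGE")

work = copy(input)  # one deliberate copy; `input` is never touched below
for (col in setdiff(colnames(work), cols)) set(work, j = col, value = NULL)
work[, PEPTIDE := paste(PEPTIDESEQUENCE, PRECURSORCHARGE, sep = "_")]
```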
Output mutation

MSstatsSummarizationOutput now mutates its input argument by reference by dropping columns and converting it with setDF(). Because input is a data.table, the caller's object will be stripped down to output columns and lose its data.table class after this call, which can break any downstream code that reuses the same table.

drop_cols = setdiff(colnames(input), output_cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

if (is.element("remove", colnames(processed))) {
    processed = processed[(remove),
                          intersect(output_cols,
                                    colnames(processed)), with = FALSE]
    input = rbind(input, processed, fill = TRUE)
}
data.table::setDF(input)
data.table::setDF(rqall)
list(FeatureLevelData = input,

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Avoid mutating caller objects

data.table::setDF() mutates by reference, so this function now converts the caller's
input object into a plain data.frame. That is a behavioral regression from
as.data.frame(...) and can break later code that still expects input to be a
data.table.

R/utils_output.R [101-105]

-data.table::setDF(input)
-data.table::setDF(rqall)
-list(FeatureLevelData = input,
-     ProteinLevelData = rqall,
+feature_output = data.table::copy(input)
+protein_output = data.table::copy(rqall)
+data.table::setDF(feature_output)
+data.table::setDF(protein_output)
+list(FeatureLevelData = feature_output,
+     ProteinLevelData = protein_output,
      SummaryMethod = method)
Suggestion importance[1-10]: 7


Why: This is correct and important for API behavior: data.table::setDF() mutates input and rqall by reference, unlike the previous as.data.frame(...) conversion. Copying before setDF() preserves the optimization while avoiding surprising side effects on caller-visible data.table objects.

Medium
Guard nested summary extraction

Validate summarized before indexing x[[2]]. The new code dereferences every element
immediately, so any try-error or short result now crashes here and bypasses the
function's existing failure handling.

R/utils_output.R [41-44]

-predicted_survival = data.table::rbindlist(lapply(summarized, function(x) x[[2]]),
-                                            fill = TRUE)
+invalid_summary = vapply(
+    summarized,
+    function(x) inherits(x, "try-error") || length(x) < 2L,
+    logical(1)
+)
+if (any(invalid_summary)) {
+    stop("`summarized` contains failed or incomplete protein summaries.")
+}
+predicted_survival = data.table::rbindlist(lapply(summarized, `[[`, 2L),
+                                           fill = TRUE)
 for (i in seq_along(summarized)) summarized[[i]][[2]] = NULL
 input = .finalizeInput(input, predicted_survival, method, impute, censored_symbol)
Suggestion importance[1-10]: 6


Why: This is a valid robustness concern: MSstatsSummarizationOutput() now dereferences x[[2]] before the later inherits(summarized, "try-error") check, so malformed entries in summarized can fail earlier with an uncontrolled error. The fix is relevant, but it mainly improves error handling rather than changing core results.

Low

@mstaniak (Contributor) commented:

Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr?


@mstaniak left a comment


Hi,
thanks again for this update. I have a few minor comments

# If they were similar sizes, we couldn't distinguish "allocated 2 temp
# columns" from "copied the whole table" and the test would be meaningless.
# For our 11-column, 100K-row table: 2 cols ≈ 1.6 MB, full table ≈ 6 MB.
expect_true(one_col_size * 2 < table_size * 0.5,

I wonder, is this test not too sensitive to implementation details out of our control? It's reasonable, but how about we ensure that this being wrong doesn't break output of pkg test, and instead only gives us a message?

features_quality = data.table::rbindlist(lapply(split(input, input$label),

has_censored = is.element("censored", colnames(input))
work = input[, list(protein = as.character(PROTEIN),

"work" was a name used in MSstats 3, I tried to change it to input and the likes, may have missed some uses. Can't we use "input" as the variable name here?

Comment thread R/utils_normalize.R
on = c("FEATURE", "FRACTION"),
which = TRUE, mult = "first"]
input = input[!is.na(keep_idx)]
input$originalRUN = input$newRun

We should use := here, right? Otherwise I think a copy is made
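The reviewer's point can be sketched on a toy table: assigning with `$<-` goes through R's copy-on-modify replacement of the whole object, whereas `:=` updates the column on the existing data.table by reference.

```r
library(data.table)

input = data.table(newRun = c("r1", "r2"), originalRUN = c("old1", "old2"))
addr_before = address(input)

input[, originalRUN := newRun]   # update by reference: no new table allocated
stopifnot(identical(address(input), addr_before))
```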
