Removed deep-copy data.table ops from the dataProcess pipeline#208
Removed deep-copy data.table ops from the dataProcess pipeline#208tonywu1999 wants to merge 3 commits into
Conversation
* Replaced `input[, cols, with = FALSE]` deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL). * Replaced row-shuffle `input = input[order(...), ]` in .prepareForDataProcess with data.table::setorder() (in place). * Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes — avoids deep-copying the whole table. * Replaced the synthesised `tmp` string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call. * Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes. * Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy. * Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame(). * Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL — a downstream test in test_dataProcess.R relies on LABEL being kept. * Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4 — 28 assertions, all green. Full suite 224/224 OK. See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md Co-Authored-By: Claude <noreply@anthropic.com>
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
|
Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr? |
mstaniak
left a comment
There was a problem hiding this comment.
Hi,
thanks again for this update. I have a few minor comments
| # If they were similar sizes, we couldn't distinguish "allocated 2 temp | ||
| # columns" from "copied the whole table" and the test would be meaningless. | ||
| # For our 11-column, 100K-row table: 2 cols ≈ 1.6 MB, full table ≈ 6 MB. | ||
| expect_true(one_col_size * 2 < table_size * 0.5, |
There was a problem hiding this comment.
I wonder, is this test not too sensitive to implementation details out of our control? It's reasonable, but how about we ensure that this being wrong doesn't break output of pkg test, and instead only gives us a message?
| features_quality = data.table::rbindlist(lapply(split(input, input$label), | ||
|
|
||
| has_censored = is.element("censored", colnames(input)) | ||
| work = input[, list(protein = as.character(PROTEIN), |
There was a problem hiding this comment.
"work" was a name used in MSstats 3, I tried to change it to input and the likes, may have missed some uses. Can't we use "input" as the variable name here?
| on = c("FEATURE", "FRACTION"), | ||
| which = TRUE, mult = "first"] | ||
| input = input[!is.na(keep_idx)] | ||
| input$originalRUN = input$newRun |
There was a problem hiding this comment.
We should use := here, right? Otherwise I think a copy is made
User description
input[, cols, with = FALSE]deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).input = input[order(...), ]in .prepareForDataProcess with data.table::setorder() (in place).tmpstring-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md
Motivation and Context
Please include relevant motivation and context of the problem along with a short summary of the solution.
Changes
Please provide a detailed bullet point list of your changes.
Testing
Please describe any unit tests you added or modified to verify your changes.
Checklist Before Requesting a Review
PR Type
Enhancement, Bug fix, Tests
Description
Replace full-table copies with in-place updates
Use keyed lookups for joins
Split survival outputs before finalization
Add memory-regression pipeline tests
Diagram Walkthrough
File Walkthrough
dataProcess.R
Limit censored-value updates to matching rowsR/dataProcess.R
ifelse()rewrites with targeted:=updatespredictedon applicable censored rowsnewABUNDANCEwhere imputation appliesutils_checks.R
Avoid copies during input trimming and sortingR/utils_checks.R
data.table::set(..., NULL)data.table::setorder()utils_feature_selection.R
Collapse feature preprocessing into one passR/utils_feature_selection.R
censoredvalues inlineis_obswithout intermediate tablesutils_normalize.R
Use in-place cleanup and keyed fraction mergesR/utils_normalize.R
merge()with keyednewRunassignment(FEATURE, FRACTION)lookuputils_output.R
Streamline summary output and imputation joinsR/utils_output.R
predicted_survivalbefore finalizationtest_memory_optimization_copies.R
Add memory regression tests for copy avoidanceinst/tinytest/test_memory_optimization_copies.R
.normalizeMediantemp-column cleanup.finalizeTMPkeyed matches and unmatchedNAsMSstatsSummarizationOutputlist splitting behavior