feat: Add decoupled weight decay (AdamW) to Adam optimizer #1037
Open
AliAlimohammadi wants to merge 1 commit into TheAlgorithms:master from
Conversation
AliAlimohammadi (Contributor, Author)
@siriak, this is ready to be merged.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1037 +/- ##
==========================================
- Coverage 96.01% 95.94% -0.07%
==========================================
Files 392 392
Lines 29722 29806 +84
==========================================
+ Hits 28537 28597 +60
- Misses 1185 1209 +24
☔ View full report in Codecov by Sentry.
Description
Extends the existing `Adam` optimizer in `src/machine_learning/optimization/adam.rs` to support decoupled weight decay (AdamW), as introduced in Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2019).

Rather than adding a separate `AdamW` struct, a single `weight_decay: f64` field (defaulting to `0.0`) is added to the existing `Adam` struct. When `weight_decay` is `0.0` the update is identical to standard Adam. When positive, the decay term $\lambda \cdot \theta_{t-1}$ is subtracted directly from the parameters after the adaptive gradient step, keeping it independent of the second-moment scaling, which is the key correction AdamW makes over naive L2 regularisation inside Adam.

The `step` signature changes from `step(&mut self, gradients: &[f64])` to `step(&mut self, gradients: &[f64], params: &[f64])`, since the decoupled decay term requires the current parameter values. All existing tests are updated accordingly (zero-initialised params preserve the original expected values).

Algorithm
Both variants share the same moment update rules:
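For reference, these are the standard first- and second-moment estimates with bias correction (as in Kingma & Ba, 2015):

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$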
They differ only in the parameter update step:
Adam: weight decay is absent (or equivalently, folded into $g_t$ as L2 regularisation, where it gets scaled down by $1/\sqrt{\hat{v}_t}$):
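$$
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$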
AdamW: weight decay is applied directly to $\theta_{t-1}$, fully decoupled from the adaptive scaling, so its effect is constant regardless of gradient history:
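Following the description above, where $\lambda \cdot \theta_{t-1}$ is subtracted directly from the parameters outside the adaptive scaling, the update is of the form:

$$
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \lambda\, \theta_{t-1}
$$

(The Loshchilov & Hutter paper additionally multiplies both terms by a schedule factor $\eta_t$; the essential point is only that $\lambda\, \theta_{t-1}$ bypasses the $\sqrt{\hat{v}_t}$ scaling.)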
Setting $\lambda = 0$ in the AdamW update recovers Adam exactly, both mathematically and in the implementation (verified by `test_adamw_step_weight_decay_zero_matches_adam`).
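For concreteness, here is a minimal sketch of what a step with decoupled weight decay might look like. The field names, constructor, and return convention (an update vector to subtract from the parameters) are illustrative assumptions, not the PR's actual code in `adam.rs`; only the extra `params` argument and the `weight_decay` field follow the description above.

```rust
/// Minimal sketch of an Adam step with optional decoupled weight decay (AdamW).
/// Names and the return convention are assumptions for illustration.
struct Adam {
    learning_rate: f64,
    beta1: f64,
    beta2: f64,
    epsilon: f64,
    weight_decay: f64, // lambda; 0.0 reproduces plain Adam
    m: Vec<f64>,       // first moment estimates
    v: Vec<f64>,       // second moment estimates
    t: u64,            // time step
}

impl Adam {
    fn new(dim: usize, learning_rate: f64, weight_decay: f64) -> Self {
        Self {
            learning_rate,
            beta1: 0.9,
            beta2: 0.999,
            epsilon: 1e-8,
            weight_decay,
            m: vec![0.0; dim],
            v: vec![0.0; dim],
            t: 0,
        }
    }

    /// The decoupled decay term needs the current parameter values,
    /// hence the extra `params` argument in the new signature.
    fn step(&mut self, gradients: &[f64], params: &[f64]) -> Vec<f64> {
        self.t += 1;
        let bias1 = 1.0 - self.beta1.powi(self.t as i32);
        let bias2 = 1.0 - self.beta2.powi(self.t as i32);

        gradients
            .iter()
            .zip(params)
            .enumerate()
            .map(|(i, (&g, &p))| {
                // Moment updates are identical for Adam and AdamW.
                self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * g;
                self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * g * g;
                let m_hat = self.m[i] / bias1;
                let v_hat = self.v[i] / bias2;
                // Adaptive gradient step ...
                let adaptive = self.learning_rate * m_hat / (v_hat.sqrt() + self.epsilon);
                // ... plus the decay term, which bypasses the 1/sqrt(v_hat) scaling.
                adaptive + self.weight_decay * p
            })
            .collect()
    }
}

fn main() {
    let params = vec![1.0, -2.0];
    let grads = vec![0.1, 0.3];
    let mut opt = Adam::new(params.len(), 1e-3, 1e-2);
    let updates = opt.step(&grads, &params);
    let new_params: Vec<f64> = params.iter().zip(&updates).map(|(p, u)| p - u).collect();
    println!("{new_params:?}");
}
```

With `weight_decay = 0.0` the second term vanishes and the update reduces to plain Adam, which is exactly what `test_adamw_step_weight_decay_zero_matches_adam` checks in the PR.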
Type of change
Checklist
- I ran `cargo clippy --all -- -D warnings` just before my last commit and fixed any issue that was found.
- I ran `cargo fmt` just before my last commit.
- I ran `cargo test` just before my last commit and all tests passed.
- I added my algorithm to the corresponding `mod.rs` file within its own folder, and in any parent folder(s).
- I added my algorithm to `DIRECTORY.md` with the correct link.
- I checked `CONTRIBUTING.md` and my code follows its guidelines.