[SVLS-8349] Add CPU Enhanced Metrics in Windows Azure Functions#133
[SVLS-8349] Add CPU Enhanced Metrics in Windows Azure Functions#133kathiehuang wants to merge 7 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds Windows support for the Azure Functions enhanced CPU usage metric (azure.functions.enhanced.cpu.usage) by reading cumulative CPU time from the Windows Job Object API and enabling enhanced metrics in Windows Azure Functions.
Changes:
- Enable
DD_ENHANCED_METRICS_ENABLED-controlled enhanced metrics for Azure Functions on Windows (no longer hard-disabled). - Implement Windows CPU total collection via
QueryInformationJobObjectand convert Job Object accounting time to nanoseconds for downstream usage-rate computation. - Add the
windowscrate dependency behind the existingwindows-enhanced-metricsfeature.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| crates/datadog-serverless-compat/src/main.rs | Enables enhanced metrics in Azure Functions based on DD_ENHANCED_METRICS_ENABLED for all platforms (including Windows). |
| crates/datadog-metrics-collector/src/azure_windows.rs | Implements Job Object CPU time reads for Windows to support enhanced CPU usage metric computation. |
| crates/datadog-metrics-collector/Cargo.toml | Adds optional windows dependency and wires it to windows-enhanced-metrics. |
| Cargo.lock | Locks new windows crate transitive dependencies for Windows builds. |
Comments suppressed due to low confidence (1)
crates/datadog-metrics-collector/src/azure_windows.rs:41
- This adds new Windows-specific CPU collection logic, but there are no unit tests covering the conversion from Job Object 100-ns units to the
CpuStats.totalnanoseconds value (including overflow/negative handling). Consider extracting the conversion into a pure function and adding a small#[cfg(windows)]test to prevent regressions.
// TotalUserTime and TotalKernelTime are in 100-nanosecond units - multiply by 100 to get nanoseconds
let total_ns = (info.TotalUserTime + info.TotalKernelTime) as u64 * 100;
Some(CpuStats { total: total_ns })
}
| //! This module provides functionality to read CPU usage from Windows Job Objects. | ||
| //! | ||
| //! All CPU metrics will be reported in nanocores (1 core = 1,000,000,000 nanocores). | ||
| //! All CPU metrics are reported in nanocores (1 core = 1,000,000,000 nanocores). | ||
|
|
||
| use crate::azure_cpu::{CpuStats, CpuStatsReader}; |
There was a problem hiding this comment.
Specified that CPU time is stored in nanoseconds and reported in nanocores in 0e97d78
| fn read_cpu_usage_from_job_object() -> Option<CpuStats> { | ||
| let mut info = JOBOBJECT_BASIC_ACCOUNTING_INFORMATION::default(); | ||
| let result = unsafe { | ||
| QueryInformationJobObject( | ||
| None, // If the handle is None, the current process's job object is used | ||
| JobObjectBasicAccountingInformation, // The type of info to retrieve | ||
| &mut info as *mut _ as *mut _, // Pointer to the struct that will store the info | ||
| std::mem::size_of::<JOBOBJECT_BASIC_ACCOUNTING_INFORMATION>() as u32, | ||
| None, | ||
| ) | ||
| }; | ||
| match result { | ||
| Ok(()) => { |
There was a problem hiding this comment.
Added a safety comment in 0e97d78 following https://std-dev-guide.rust-lang.org/policy/safety-comments.html
| match result { | ||
| Ok(()) => { | ||
| // TotalUserTime and TotalKernelTime are in 100-nanosecond units - multiply by 100 to get nanoseconds | ||
| let total_ns = (info.TotalUserTime + info.TotalKernelTime) as u64 * 100; |
There was a problem hiding this comment.
Guarded against overflow by using try_from and checked add/mul in 0e97d78
| Err(_) => { | ||
| debug!("Failed to read CPU usage from Job Object"); |
|
@codex review |
|
Codex Review: Didn't find any major issues. Hooray! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e5cf36972a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| libdd-common = { git = "https://github.com/DataDog/libdatadog", rev = "4ae8ebe252451374c292efd159ce254c3f5a72e0", default-features = false } | ||
|
|
||
| [target.'cfg(windows)'.dependencies] | ||
| windows = { version = "0.62", features = ["Win32_System_JobObjects"], optional = true } |
There was a problem hiding this comment.
Add Win32_Foundation to the Windows bindings
When building the Windows target with windows-enhanced-metrics, this dependency only enables Win32_System_JobObjects, but the windows bindings gate QueryInformationJobObject/HANDLE behind the Win32_Foundation feature as well. As a result, azure_windows.rs cannot resolve the imported API in the Windows enhanced-metrics build; include Win32_Foundation in this feature list.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Win32_Foundation is already transitively included, defined in the windows crate's Cargo.toml:
Win32_System_JobObjects = ["Win32_System"]
Win32_System = ["Win32"]
Win32 = ["Win32_Foundation"]
The CI build passing confirms this
| libdd-common = { git = "https://github.com/DataDog/libdatadog", rev = "4ae8ebe252451374c292efd159ce254c3f5a72e0", default-features = false } | ||
|
|
||
| [target.'cfg(windows)'.dependencies] | ||
| windows = { version = "0.62", features = ["Win32_System_JobObjects"], optional = true } |
There was a problem hiding this comment.
It looks like windows pulls in additional dependencies. Are all of these dependencies needed? If you set default-features = false are some or all of the dependencies omitted while still being able to collect cpu stats?
Let's try to minimize the increase of the build artifact as much as possible.
serverless-components/Cargo.lock
Lines 3773 to 3778 in e5cf369
There was a problem hiding this comment.
I tried setting default-features = false but it didn't change the Cargo.lock - it looks like all four of these are hard dependencies of the windows crate and not features: https://github.com/microsoft/windows-rs/blob/d3a079bfe1bde71bceebfc6628f4c55a044cd501/crates/libs/windows/Cargo.toml#L15
I think if we want to avoid these subcrates we could switch to windows-sys. The code would be slightly more verbose but the compile footprint would be smaller: https://kennykerr.ca/rust-getting-started/windows-or-windows-sys.html
There was a problem hiding this comment.
I tested windows-sys - output was the same, metrics look right, and binary size went down 11,264 bytes. Updated in c404f84
| // SAFETY: `info` is a stack-allocated `JOBOBJECT_BASIC_ACCOUNTING_INFORMATION` initialized via `default()`, so the compiler guarantees its alignment. | ||
| // The buffer size argument is `size_of::<JOBOBJECT_BASIC_ACCOUNTING_INFORMATION>()`, which exactly matches `info`, so the API cannot write out of bounds. | ||
| // Passing `None` for the job handle is documented to use the current process's job object. | ||
| let result = unsafe { |
There was a problem hiding this comment.
can we add unit tests for this? do we already have windows test runners in ci? if not, maybe now with "unsafe" in the mix is a good time to add them?
There was a problem hiding this comment.
Yes we have Windows test runners in CI! I added unit tests for the CPU time conversion logic to check negative values and overflow a75f3b5 but I think the only way I can test the unsafe block would be to just see if the API call works and that read_cpu_usage_from_job_object doesn't return none?
What does this PR do?
Adds CPU usage enhanced metric
azure.functions.enhanced.cpu.usage(nanocores) for Windows Azure Functions. This feature is already available on Linux Azure Functions (see #77).Motivation
https://datadoghq.atlassian.net/browse/SVLS-8349
Describe how to test/QA your changes
Build with serverless-compat-self-monitoring.
This was deployed with the serverless-compat-self-monitoring pipeline across all runtimes and hosting plans. All hosting plans in Windows were tested to make sure metrics submit correctly.
Testing to compare against Azure Monitor is documented in an internal doc in Enhanced Metrics in the Serverless Compatibility Layer.