OneDNN BRGeMM Micro-Kernel Integration for BF16 MatMul #903

bbhattar wants to merge 5 commits into google:dev
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Force-pushed from 629b569 to e072d70
jan-wassenberg left a comment:
Very nice work :) Just some fairly minor suggestions:
```cpp
struct BRGeMMConfig {
  int64_t M_blk;
  int64_t N_blk;
```
We could set these to 32 directly, as member initializers? Possibly also make them const to make clear that they do not change.
Also, prefer size_t for all size-like things to prevent sign-conversion warnings.
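A minimal sketch of the suggestion above: `size_t` members with default member initializers. (The 32 default matches the smallest tuning value; making the members `const` would document immutability but also delete copy-assignment, which matters if configs are stored in containers.)

```cpp
#include <cstddef>

// Sketch only: size_t block sizes with in-class default initializers,
// per the review suggestion. Defaults of 32 are from the tuning values
// quoted in this PR.
struct BRGeMMConfig {
  size_t M_blk = 32;
  size_t N_blk = 32;
};
```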
```cpp
// Tunable: M_blk in {32,64}, batch_size in {16,32,64,128,256}.
inline std::vector<BRGeMMConfig> BRGeMMCandidates(size_t M, size_t K,
                                                  size_t N) {
  std::vector<BRGeMMConfig> out;
```
Let's .reserve with some estimate, also to document how many there will be?
```cpp
static constexpr int64_t kMBlkValues[] = {32, 64};
static constexpr int64_t kBatchValues[] = {16, 32, 64, 128, 256};
```
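A sketch of the `.reserve` suggestion combined with the candidate arrays above: the count is known up front as the product of the two tuning axes (2 × 5 = 10), which also documents the upper bound. The M/K/N-dependent filtering from the real function is omitted here for brevity, so the signature is simplified.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Simplified stand-in for the PR's config struct.
struct BRGeMMConfig {
  size_t M_blk;
  size_t batch_size;
};

static constexpr size_t kMBlkValues[] = {32, 64};
static constexpr size_t kBatchValues[] = {16, 32, 64, 128, 256};

// Sketch: reserve the known upper bound before pushing candidates.
// Any shape-dependent filtering (elided) can only shrink the vector.
inline std::vector<BRGeMMConfig> BRGeMMCandidates() {
  std::vector<BRGeMMConfig> out;
  out.reserve(std::size(kMBlkValues) * std::size(kBatchValues));  // 10
  for (size_t m : kMBlkValues) {
    for (size_t b : kBatchValues) {
      out.push_back({m, b});
    }
  }
  return out;
}
```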
```cpp
const int64_t k_chunks = static_cast<int64_t>(K) / kKBlk;
```
Should this round up? We have hwy::DivCeil.
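`hwy::DivCeil` is a one-liner; a standalone equivalent showing the rounding behavior the comment asks about. With a `K` that is not a multiple of `kKBlk`, plain division would silently drop the final partial chunk.

```cpp
#include <cstddef>

// Equivalent of hwy::DivCeil: integer division rounded up, so a partial
// trailing chunk still counts as a chunk.
constexpr size_t DivCeil(size_t a, size_t b) { return (a + b - 1) / b; }
```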
```cpp
}
madvise(ptr_, size_, MADV_HUGEPAGE);
for (size_t off = 0; off < size_; off += kHugePageSize) {
  static_cast<volatile uint8_t*>(ptr_)[off] = 0;
```
Possibly safer/more portable: consider `ptr_[off] = 0; hwy::PreventElision(ptr_[off]);`.
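A sketch of the suggested pattern: plain stores plus an elision barrier, rather than casting the mapping to `volatile`. The `PreventElision` stand-in below mimics the Highway helper with a GCC/Clang asm barrier; the 2 MiB `kHugePageSize` is an assumption matching common x86-64 huge pages.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for hwy::PreventElision (GCC/Clang only): makes the compiler
// treat `value` as used, so the page-touching store is not optimized away.
template <typename T>
inline void PreventElision(T&& value) {
  asm volatile("" : : "g"(value) : "memory");
}

constexpr size_t kHugePageSize = 2u << 20;  // assumed 2 MiB

// Touch one byte per huge page to force the pages to be populated.
inline void TouchPages(uint8_t* ptr, size_t size) {
  for (size_t off = 0; off < size; off += kHugePageSize) {
    ptr[off] = 0;
    PreventElision(ptr[off]);
  }
}
```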
```cpp
// Kernel cache key: identifies a JIT-compiled kernel set.
struct BRGeMMKernelKey {
  size_t M, K, N;
  int64_t M_blk, N_blk, K_blk, batch_size;
```
Can these also be size_t? And below.
```cpp
ke.M_blk =
    static_cast<int64_t>(std::min(static_cast<size_t>(cfg.M_blk), M));
```

```cpp
ke.M_tail = M % ke.M_blk;
```
Do we want precomputed hwy::Divisor here to avoid actual division?
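The idea behind `hwy::Divisor` can be sketched with Lemire's "fastmod": the divisor-dependent constant is computed once, after which each remainder is a multiply-and-shift instead of a hardware division. This is an illustration of the technique, not Highway's implementation; it assumes 32-bit operands and the GCC/Clang `unsigned __int128` extension.

```cpp
#include <cstdint>

// Precompute M = ceil(2^64 / d) once; Remainder(n) then needs no division.
// (Lemire's fastmod; requires unsigned __int128, i.e. GCC/Clang.)
class CachedDivisor {
 public:
  explicit CachedDivisor(uint32_t d) : M_(UINT64_MAX / d + 1), d_(d) {}
  uint32_t Remainder(uint32_t n) const {
    const uint64_t lowbits = M_ * n;  // intentionally wraps mod 2^64
    return static_cast<uint32_t>(
        (static_cast<unsigned __int128>(lowbits) * d_) >> 64);
  }

 private:
  uint64_t M_;
  uint32_t d_;
};
```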
```cpp
const int64_t ldb_for[2] = {ke.N_blk, ke.N_tail ? ke.N_tail : ke.N_blk};
const int64_t ldc_for[2] = {ke.N_blk, ke.N_tail ? ke.N_tail : ke.N_blk};
```

```cpp
// Create brgemm kernels for each (M-tile, N-tile) variant.
```
I think these are "do we have an M and N tail" variants, could the comment be rephrased to make that more clear?
```cpp
auto& kern_cache = GetBRGeMMKernelCache();
auto kern_it = kern_cache.find(kern_key);
```

```cpp
if (kern_it == kern_cache.end()) {
```
This block is quite big. Might help readability and codegen to put it into a HWY_NOINLINE helper function?
```cpp
if (!MakeBrgemm(ke.brg_first_all[mi][ni], ms, ns, ke.K_blk,
                ke.K_super_size, ke.lda, ldb_for[ni], ldc_for[ni],
                a_dt, b_dt, c_dt, false)) {
  return;
```
Should we HWY_WARN on failure? Or even HWY_ABORT? If failure can happen, should we fall back to the prior matmul?
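A sketch of the fallback suggestion (names hypothetical): on kernel-creation failure, emit a warning and report failure so the caller can route the operation back to the existing Highway MatMul instead of silently returning.

```cpp
#include <cstdio>

// Hypothetical wrapper: warn on failure and let the caller fall back to
// the prior matmul path rather than dropping the operation.
inline bool TryCreateKernels(bool make_brgemm_ok) {
  if (!make_brgemm_ok) {
    std::fprintf(stderr,
                 "BRGeMM kernel creation failed; using Highway MatMul\n");
    return false;  // caller invokes the standard Highway path
  }
  return true;
}
```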
```cpp
const auto va = hn::Load(df, add_row + n);
const auto result = hn::MulAdd(v, vscale, va);
if constexpr (hwy::IsSame<TC, float>()) {
  hn::Store(result, df, reinterpret_cast<float*>(C_row) + n);
```
Better to use HWY_RCAST_ALIGNED to tell the compiler this is element-aligned. (also below)
This PR integrates OneDNN BRGeMM (Batch-Reduced General Matrix Multiply) micro-kernels as an alternative compute path for BF16 MatMul on Intel Xeon platforms with AMX or AVX-512 BF16 support.
What
When enabled via the `GEMMA_ONEDNN_BRGEMM` compile-time flag, BF16×BF16 MatMul operations are dispatched to JIT-compiled BRGeMM kernels instead of the Highway SIMD path. This targets Gemma model workloads (FFW projections, attention) on Intel Xeon Scalable (SPR/EMR) processors. Support has been added to both the CMake and Bazel build systems.

How to Enable
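The extracted description does not show the enable commands. A sketch assuming the CMake option and Bazel `config_setting` named in the Changes list; the exact Bazel flag wiring in particular is a guess:

```shell
# CMake: assumed -D syntax for the GEMMA_ONEDNN_BRGEMM option added in CMakeLists.txt.
cmake -B build -DGEMMA_ONEDNN_BRGEMM=ON
cmake --build build -j

# Bazel: assumed --define wiring for the gemma_onednn_brgemm config_setting.
bazel build --define=gemma_onednn_brgemm=true //...
```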
Runtime Fallback
When `GEMMA_ONEDNN_BRGEMM` is enabled at compile time, the BRGeMM path activates for BF16×BF16 operations whose dimensions meet AMX tile constraints (M, N, K ≥ 32 and K % 32 == 0). All other cases (non-BF16 types, smaller or non-aligned dimensions, mixed precision) fall through to the standard Highway SIMD MatMul path automatically.
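The dispatch rule above can be sketched as a predicate (function and parameter names hypothetical): BRGeMM handles only BF16×BF16 with AMX-friendly shapes, and everything else reports false so the caller uses the Highway path.

```cpp
#include <cstddef>

// Sketch of the stated eligibility rule: both inputs BF16, all dims >= 32,
// and K a multiple of 32 (the AMX tile constraint quoted in this PR).
inline bool UseBrgemmPath(bool a_is_bf16, bool b_is_bf16, size_t M, size_t N,
                          size_t K) {
  return a_is_bf16 && b_is_bf16 && M >= 32 && N >= 32 && K >= 32 &&
         (K % 32 == 0);
}
```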
Changes

- `ops/brgemm.h`: `UseOneDnnBrgemm()`, autotuning candidates
- `ops/brgemm-inl.h`: `DoMatMul_BRGeMM()`: kernel JIT/caching, B-packing with hugepages, tiled parallel execution
- `ops/matmul-inl.h`: `MatMul()` guarded by `#if GEMMA_ONEDNN_BRGEMM`
- `ops/matmul.h`: `#include "ops/brgemm.h"`, `brgemm_autotune` field in `MMPerKey`
- `ops/bench_matmul.cc`: `brgemm_autotune.Best()` to avoid infinite loop when BRGeMM handles dispatch
- `CMakeLists.txt`: `GEMMA_ONEDNN_BRGEMM` option, FetchContent for OneDNN v3.11, conditional target linking
- `BUILD.bazel`: `config_setting` for `gemma_onednn_brgemm`, conditional OneDNN dep and defines for x86_64
- `MODULE.bazel`: `http_archive` dependency
- `bazel/onednn.BUILD`
- `util/zones.h`: `kBRGeMM` caller enum for thread pool dispatch
- `util/zones.cc`: `CallerName` mapping for `kBRGeMM`

Testing
- `matmul_test` passes with and without `GEMMA_ONEDNN_BRGEMM` (all original test shapes, types, and correctness checks preserved)
- `bench_matmul` runs successfully with BRGeMM enabled