
Fix memory issues in CUDA EXX screening #182

Open

vmitq wants to merge 3 commits into wavefunction91:master from vmitq:fixup/cuda_exx_memory_issues

Conversation

@vmitq vmitq commented Mar 13, 2026

This PR addresses memory-related issues in the exact-exchange calculation on CUDA GPU devices.

Copilot AI left a comment

Pull request overview

This PR fixes correctness issues in CUDA exact-exchange (EXX) EK screening for large workloads by adjusting how task patches are generated/processed on device, and by widening several CUDA screening counters to 64-bit to avoid overflow.

Changes:

  • Reworks the CUDA exx_ek_screening task loop to process each generate_buffers() patch independently (avoiding buffer overwrite across inner-loop iterations).
  • Promotes CUDA EXX screening collision counts / position lists from 32-bit to 64-bit.
  • Adds a new device-memory requirement flag for reserving EXX collision scratch space, and increases Scheme1 static padding.
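
For concreteness, the restructured task loop follows the pattern sketched below. This is a minimal sketch with illustrative identifiers, not the actual GauXC code; generate_buffers comes from the PR description, while the container names and the screening helper are assumptions:

    // Before: an inner batching loop queued several patches before
    // processing them, but generate_buffers() reuses the same device
    // scratch, so later inner-loop iterations overwrote buffers still
    // holding results from earlier ones.
    //
    // After: each patch is screened as soon as it is generated, so its
    // buffers are fully consumed before the next patch can overwrite them.
    auto task_it = tasks.begin();
    while( task_it != tasks.end() ) {
      // Carve off as many tasks as fit in device memory (one patch)
      auto patch_end = device_data.generate_buffers( state, basis_map,
                                                     task_it, tasks.end() );
      // Run the EK screening / collision kernels on this patch only
      run_ek_screening_patch( device_data );  // hypothetical helper
      task_it = patch_end;
    }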

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • src/xc_integrator/xc_data/device/xc_device_data.hpp — Adds task_exx_collision and a sizing helper to reserve scratch space for EXX collision work.
  • src/xc_integrator/xc_data/device/xc_device_aos_data.cxx — Accounts for the new EXX collision scratch reservation in AoS device memory requirements.
  • src/xc_integrator/local_work_driver/device/scheme1_data_base.cxx — Increases static allocation padding used for alignment.
  • src/xc_integrator/local_work_driver/device/cuda/kernels/exx_ek_screening_bfn_stats.cu — Widens collision counts/position lists to 64-bit and updates associated kernel signatures and allocations.
  • src/xc_integrator/integrator_util/exx_screening.cxx — Removes the problematic double-loop batching and runs the collision step once per generated task patch.
Comments suppressed due to low confidence (1)

src/xc_integrator/local_work_driver/device/cuda/kernels/exx_ek_screening_bfn_stats.cu:299

  • print_counts now takes uint64_t* counts, but the printf format string still uses %d for counts[i_task], which is now a 64-bit value. This is undefined behavior and can corrupt output or crash when debugging. Use the correct 64-bit format specifier, or cast the argument to a type that matches the specifier.

    __global__ void print_counts(size_t ntasks, uint64_t* counts) {
      for(auto i_task = 0 ; i_task < ntasks; ++i_task) {
        printf("[GPU] ITASK %d: %d\n", i_task, counts[i_task]);
      }
    }
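
One possible fix, with explicit casts so the arguments match the format string (CUDA device-side printf supports the ll length modifier; this is a sketch, not the PR's actual patch):

    __global__ void print_counts(size_t ntasks, uint64_t* counts) {
      for(size_t i_task = 0; i_task < ntasks; ++i_task) {
        // Cast both arguments so each matches %llu exactly
        printf("[GPU] ITASK %llu: %llu\n",
               (unsigned long long)i_task,
               (unsigned long long)counts[i_task]);
      }
    }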

Comment on lines 305 to 313

      template <int32_t buffer_size, typename buffer_type = uint32_t>
      __global__ void bitvector_to_position_list_shellpair(
        size_t ntasks,
        size_t nsp,
        size_t LD_bit,
        const uint32_t* collisions,
    -   const uint32_t* counts,
    -   uint32_t* position_list
    +   const uint64_t* counts,
    +   uint64_t* position_list
      ) {

Copilot AI Mar 31, 2026

In bitvector_to_position_list_shellpair, the signature was widened to uint64_t counts/position_list, but the kernel body still uses 32-bit int for indices/counters (e.g., current, j_block). For large nsp/counts (the scenario that motivated 64-bit counts), these int variables can overflow and generate incorrect position lists. Promote these loop counters/offsets to size_t/uint64_t consistently.
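
Illustratively, the promotion asked for here looks like the following. The kernel body is not quoted in this review, so apart from current and j_block (named in the comment above), the surrounding code is assumed:

    // Before (32-bit, can wrap for large nsp / counts):
    //   int current = counts[i_task];
    //   int j_block = i_block * buffer_size;
    // After (widened to match the 64-bit counts):
    uint64_t current = counts[i_task];
    size_t   j_block = size_t(i_block) * buffer_size;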

Comment on lines 367 to 376

      template <int32_t buffer_size, typename buffer_type = uint32_t>
      __global__ void bitvector_to_position_list_shells(
        size_t ntasks,
        size_t nshells,
        size_t LD_bit,
        const uint32_t* collisions,
    -   const uint32_t* counts,
    +   const uint64_t* counts,
        const int32_t* shell_size,
    -   uint32_t* position_list,
    +   uint64_t* position_list,
        size_t* nbe_list

Copilot AI Mar 31, 2026

In bitvector_to_position_list_shells, counts/position_list were widened to 64-bit, but the kernel still uses 32-bit int counters/loop variables (e.g., current, j_block). This can overflow for large shell counts and defeat the purpose of moving to 64-bit counts. Use size_t/uint64_t for these indices and for offset arithmetic into position_list.
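
The offset arithmetic into position_list needs the same treatment; sketched below with hypothetical names, since the kernel body is again not quoted:

    // Accumulate the write offset in 64 bits so large position lists
    // index correctly past 2^31 - 1 entries.
    uint64_t offset = task_offset + uint64_t(current);
    position_list[offset] = shell_idx;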

Comment on lines +510 to +513

    inline size_t task_exx_collision_size(size_t nshells) {
      const size_t nslt = (nshells * (nshells+1)) / 2 + nshells;

Copilot AI Mar 31, 2026

task_exx_collision_size computes the scratch size using the dense lower-triangular shell-pair count (nshells*(nshells+1))/2, but ShellPairCollection is stored sparsely (only nonzero primitive-screened pairs). Using the dense upper bound can massively overestimate per-task memory, forcing very small GPU task patches or even preventing batching on large systems. Consider basing this size on the actual allocated nshell_pairs (e.g., global_dims.nshell_pairs) instead of nshells.

Suggested change

    - inline size_t task_exx_collision_size(size_t nshells) {
    -   const size_t nslt = (nshells * (nshells+1)) / 2 + nshells;
    + inline size_t task_exx_collision_size(size_t nshell_pairs) {
    +   // Size the EXX collision workspace based on the actual number of
    +   // allocated shell pairs rather than the dense lower-triangular
    +   // shell-pair count (which can massively overestimate the required
    +   // memory for sparse ShellPairCollection objects).
    +   const size_t nslt = nshell_pairs;
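
To gauge the scale of the overestimate with illustrative numbers: for a hypothetical nshells = 2000, the dense bound is 2000·2001/2 + 2000 = 2,003,000 entries per task, while a primitive-screened ShellPairCollection for a spatially extended system can retain only a small fraction of those pairs, so the dense bound may inflate the per-task reservation several-fold.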

    reqt.task_bfn_shell_indirection_size( nbe_bfn ) * sizeof(int32_t) +

    // Scratch memory to store shell pairs
    reqt.task_exx_collision_size( nshells ) * sizeof(int64_t) +

Copilot AI Mar 31, 2026

The EXX collision scratch reservation is currently computed from nshells (dense (nshells*(nshells+1))/2 upper bound) rather than the actually allocated sparse shell-pair count (global_dims.nshell_pairs). On large/sparse systems this can drastically inflate get_mem_req, causing generate_buffers to choose much smaller task patches than necessary. Prefer using global_dims.nshell_pairs (or passing nshell_pairs into the sizing helper) for a tighter bound.

Suggested change

    - reqt.task_exx_collision_size( nshells ) * sizeof(int64_t) +
    + reqt.task_exx_collision_size( task.global_dims.nshell_pairs ) * sizeof(int64_t) +

@awvwgk awvwgk linked an issue Apr 1, 2026 that may be closed by this pull request
@wavefunction91 (Owner)

Thanks for the patch @vmitq!

You seem to have disabled the batching in the EXX screening. I'm not opposed to that offhand, but can you quantify the performance penalty (if any) on systems for which the memory problem wasn't a problem?

vmitq (Author) commented Apr 7, 2026

> Thanks for the patch @vmitq!
>
> You seem to have disabled the batching in the EXX screening. I'm not opposed to that offhand, but can you quantify the performance penalty (if any) on systems for which the memory problem wasn't a problem?

It doesn’t remove batching completely; it just removes the second level of batching, which seems redundant here. I haven’t observed a notable performance difference, but if you’d like concrete numbers, I can run a few benchmarks.

@vmitq vmitq closed this Apr 7, 2026
@vmitq vmitq reopened this Apr 7, 2026
vmitq added 2 commits April 24, 2026 10:50

  • Compute collision scratch requirements as max(mem_coll + mem_bfn_stats, mem_collision) per task. Replace task_exx_collision_size with task_exx_coll_bitvec_size, task_exx_coll_fmax_size, and task_exx_coll_position_size helpers.
  • Only use uint64_t for prefix sum counters.
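
A sketch of what that sizing rule might look like in code (identifiers follow the commit message; this is an assumption about the implementation, not the actual GauXC source):

    #include <algorithm>
    #include <cstddef>

    // The collision counters are live alongside the bfn-stats buffers in
    // one phase, while another phase needs only the collision scratch, so
    // the reservation is the max of the two phases rather than their sum.
    inline size_t exx_screening_scratch( size_t mem_coll,
                                         size_t mem_bfn_stats,
                                         size_t mem_collision ) {
      return std::max( mem_coll + mem_bfn_stats, mem_collision );
    }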
vmitq (Author) commented Apr 24, 2026

I investigated the performance and found it was about 10% slower than master. The main reason was excessive use of uint64_t for the position list, which increased memory traffic. This has been fixed in a recent commit. Previously, I had only benchmarked with large batch sizes (see the table below), where this issue was less noticeable.

The scheduling changes do not have a measurable impact on performance.

The memory size for the screening step is now computed more accurately. This allows for larger batch sizes, which leads to a slight overall performance improvement compared to master.

Below are the final benchmark results for several molecules with the cc-pVDZ basis set and different screening tolerances:

| Molecule    | Atoms | NBF  | Batch | Tol   | Master (s)      | Current (s)     | Diff   |
|-------------|-------|------|-------|-------|-----------------|-----------------|--------|
| chloroquine | 48    | 464  | 1024  | 1e-10 | 1.892 ± 0.012   | 1.871 ± 0.014   | -1.1%  |
| chloroquine | 48    | 464  | 2048  | 1e-10 | 1.039 ± 0.001   | 1.027 ± 0.009   | -1.1%  |
| chloroquine | 48    | 464  | 10240 | 1e-10 | 0.315 ± 0.001   | 0.311 ± 0.001   | -1.0%  |
| chloroquine | 48    | 464  | 1024  | 1e-50 | 2.807 ± 0.006   | 2.776 ± 0.013   | -1.1%  |
| chloroquine | 48    | 464  | 2048  | 1e-50 | 1.753 ± 0.004   | 1.725 ± 0.007   | -1.6%  |
| chloroquine | 48    | 464  | 10240 | 1e-50 | 0.850 ± 0.002   | 0.837 ± 0.001   | -1.6%  |
| c60         | 60    | 900  | 1024  | 1e-10 | 5.753 ± 0.021   | 5.703 ± 0.034   | -0.9%  |
| c60         | 60    | 900  | 2048  | 1e-10 | 3.320 ± 0.015   | 3.302 ± 0.016   | -0.5%  |
| c60         | 60    | 900  | 10240 | 1e-10 | 1.140 ± 0.003   | 1.138 ± 0.002   | -0.1%  |
| c60         | 60    | 900  | 1024  | 1e-50 | 9.395 ± 0.051   | 9.333 ± 0.036   | -0.7%  |
| c60         | 60    | 900  | 2048  | 1e-50 | 5.642 ± 0.030   | 5.567 ± 0.013   | -1.3%  |
| c60         | 60    | 900  | 10240 | 1e-50 | 2.930 ± 0.009   | 2.944 ± 0.009   | +0.5%  |
| valinomycin | 168   | 1620 | 1024  | 1e-10 | 22.449 ± 0.272  | 22.657 ± 0.086  | +0.9%  |
| valinomycin | 168   | 1620 | 2048  | 1e-10 | 12.270 ± 0.038  | 12.173 ± 0.031  | -0.8%  |
| valinomycin | 168   | 1620 | 10240 | 1e-10 | 3.172 ± 0.006   | 3.171 ± 0.014   | -0.0%  |
| valinomycin | 168   | 1620 | 1024  | 1e-50 | 48.831 ± 0.040  | 47.718 ± 0.133  | -2.3%  |
| valinomycin | 168   | 1620 | 2048  | 1e-50 | 32.718 ± 0.073  | 33.217 ± 0.113  | +1.5%  |
| valinomycin | 168   | 1620 | 10240 | 1e-50 | 17.337 ± 0.026  | 17.135 ± 0.077  | -1.2%  |
| crambin     | 642   | 6504 | 1024  | 1e-10 | 311.094 ± 3.624 | 259.484 ± 0.488 | -16.6% |
| crambin     | 642   | 6504 | 2048  | 1e-10 | 146.633 ± 0.197 | 150.920 ± 0.164 | +2.9%  |
| crambin     | 642   | 6504 | 10240 | 1e-10 | 47.117 ± 1.645  | 45.911 ± 0.063  | -2.6%  |

Development

Successfully merging this pull request may close these issues:

  • Bug in CUDA exx ek screening
