Skip to content

fix error utils to work on architectures where std::is_normal is not …#637

Open
Yurlungur wants to merge 6 commits intomainfrom
jmm/fix_error_utils
Open

fix error utils to work on architectures where std::is_normal is not …#637
Yurlungur wants to merge 6 commits intomainfrom
jmm/fix_error_utils

Conversation

@Yurlungur
Copy link
Copy Markdown
Collaborator

@Yurlungur Yurlungur commented May 6, 2026

…supported

PR Summary

On a V100 architecture, @swjones discovered that bad_value was triggering on perfectly normal numbers, like $10^6$. This is apparently because std::is_normal is not guaranteed to be available or functional on all GPU architectures. This MR (hopefully!) remedies the situation by (a) implementing the constexpr version of is_normal ourselves and (b) testing it to make sure that sticks.

PR Checklist

  • Adds a test for any bugs fixed. Adds tests for new features.
  • Format your changes by using the make format command after configuring with cmake.
  • Document any new features, update documentation for changes made.
  • Make sure the copyright notice on any files you modified is up to date.
  • After creating a pull request, note it in the CHANGELOG.md file.
  • LANL employees: make sure tests pass both on the github CI and on the Darwin CI
  • If ML was used, make sure to add a disclaimer at the top of a file indicating ML was used to assist in generating the file.
  • If Agentic AI was used, have the AI generate a "proposed changes" markdown file and store it in the plan_histories folder, with a filename the same as the MR number.

If preparing for a new release, in addition please check the following:

  • Update the version in cmake.
  • Move the changes in the CHANGELOG.md file under a new header for the new release, and reset the categories.
  • Maintainers: ensure spackages are up to date:
    • LANL-internal team, update XCAP spackages
    • Current maintainer of upstream spackages, submit MR to spack

@Yurlungur Yurlungur self-assigned this May 6, 2026
@Yurlungur Yurlungur added the bug Something isn't working label May 6, 2026
Comment on lines -81 to +90
return violates_condition(std::forward<valT>(value), is_strictly_positive{},
return violates_condition(std::forward<valT>(value), is_non_negative{},
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just a typo. this condition was identical to the one above.

@Yurlungur
Copy link
Copy Markdown
Collaborator Author

One thing to note, the code

// Checks whether a value obeys some sort of provided condition. If not, returns true and
// prints the provided error message, variable name, and value (but does not abort!)
template <typename valT, typename condT, typename nameT>
PORTABLE_FORCEINLINE_FUNCTION bool violates_condition(valT &&value, condT &&condition,
                                                      nameT &&var_name) {
  const bool good = condition(std::forward<valT>(value));
  if (!good) {
    printf("### ERROR: Bad singularity-eos value\n  Var:   %s\n  Value: %.15e\n",
           var_name, value);
  }
  return !good;
}

is what actually calls the is_normal condition. I am torn about the print statement not being behind a NDEBUG preprocessor flag. On the one hand, the code continues if the check fails, and so this could be a very annoying error message. And it could impact performance. On the other hand, I caught this problem because of this print statement. So I'm not sure. Curious what people think.

@Yurlungur Yurlungur requested a review from buechlerm May 6, 2026 20:20
Copy link
Copy Markdown
Collaborator

@adamdempsey90 adamdempsey90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be something worth putting in ports of call since it's a portable std:: function?

Comment thread singularity-eos/base/error_utils.hpp Outdated
Comment on lines +34 to +44
constexpr double NORMAL_FACTOR = 1.0e10;

struct is_normal_or_zero {
template <typename valT>
constexpr bool PORTABLE_FORCEINLINE_FUNCTION operator()(valT value) const {
static_assert(std::is_floating_point<valT>::value);
using limits = std::numeric_limits<valT>;
const valT abs_value = (value < valT{0}) ? -value : value;
return (value == valT{0}) ||
(std::isnormal(_NORMAL_FACTOR * value) && std::isnormal(value));
((abs_value >= limits::min()) &&
(abs_value * static_cast<valT>(NORMAL_FACTOR) <= limits::max()));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
constexpr double NORMAL_FACTOR = 1.0e10;
struct is_normal_or_zero {
template <typename valT>
constexpr bool PORTABLE_FORCEINLINE_FUNCTION operator()(valT value) const {
static_assert(std::is_floating_point<valT>::value);
using limits = std::numeric_limits<valT>;
const valT abs_value = (value < valT{0}) ? -value : value;
return (value == valT{0}) ||
(std::isnormal(_NORMAL_FACTOR * value) && std::isnormal(value));
((abs_value >= limits::min()) &&
(abs_value * static_cast<valT>(NORMAL_FACTOR) <= limits::max()));
constexpr double PORTABLE_FORCEINLINE_FUNCTION NORMAL_FACTOR() { return 1.0e10;}
struct is_normal_or_zero {
template <typename valT>
constexpr bool PORTABLE_FORCEINLINE_FUNCTION operator()(valT value) const {
static_assert(std::is_floating_point<valT>::value);
using limits = std::numeric_limits<valT>;
const valT abs_value = (value < valT{0}) ? -value : value;
return (value == valT{0}) ||
((abs_value >= limits::min()) &&
(abs_value * static_cast<valT>(NORMAL_FACTOR()) <= limits::max()));

Can't use globals on device with nvcc

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constexpr scalars (and static arrays, insanely) are an exception:
https://godbolt.org/z/dWMP9or8q

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I've definitely tried to do this across multiple code bases and it never would compile

@Yurlungur
Copy link
Copy Markdown
Collaborator Author

One thing to note, the code

// Checks whether a value obeys some sort of provided condition. If not, returns true and
// prints the provided error message, variable name, and value (but does not abort!)
template <typename valT, typename condT, typename nameT>
PORTABLE_FORCEINLINE_FUNCTION bool violates_condition(valT &&value, condT &&condition,
                                                      nameT &&var_name) {
  const bool good = condition(std::forward<valT>(value));
  if (!good) {
    printf("### ERROR: Bad singularity-eos value\n  Var:   %s\n  Value: %.15e\n",
           var_name, value);
  }
  return !good;
}

is what actually calls the is_normal condition. I am torn about the print statement not being behind a NDEBUG preprocessor flag. On the one hand, the code continues if the check fails, and so this could be a very annoying error message. And it could impact performance. On the other hand, I caught this problem because of this print statement. So I'm not sure. Curious what people think.

Needs to be behind debug because it's a performance hit

@adamdempsey90
Copy link
Copy Markdown
Collaborator

One thing to note, the code

// Checks whether a value obeys some sort of provided condition. If not, returns true and
// prints the provided error message, variable name, and value (but does not abort!)
template <typename valT, typename condT, typename nameT>
PORTABLE_FORCEINLINE_FUNCTION bool violates_condition(valT &&value, condT &&condition,
                                                      nameT &&var_name) {
  const bool good = condition(std::forward<valT>(value));
  if (!good) {
    printf("### ERROR: Bad singularity-eos value\n  Var:   %s\n  Value: %.15e\n",
           var_name, value);
  }
  return !good;
}

is what actually calls the is_normal condition. I am torn about the print statement not being behind a NDEBUG preprocessor flag. On the one hand, the code continues if the check fails, and so this could be a very annoying error message. And it could impact performance. On the other hand, I caught this problem because of this print statement. So I'm not sure. Curious what people think.

Needs to be behind debug because it's a performance hit

The alternative is to remove the printing of the name and value so that there are no arguments passed to it.

@Yurlungur
Copy link
Copy Markdown
Collaborator Author

Now uses the ports-of-call method
lanl/ports-of-call#112

To support that, I also bumped us to C++20.

-DSINGULARITY_PLUGINS=$(pwd)/../example/plugin \
-DCMAKE_LINKER=ld.gold \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_STANDARD=20 \
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer needed as we're now just on C++20

PORTABLE_FORCEINLINE_FUNCTION bool violates_condition(valT &&value, condT &&condition,
nameT &&var_name) {
const bool good = condition(std::forward<valT>(value));
#ifndef NDEBUG
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now added for performance and convenience

@Yurlungur
Copy link
Copy Markdown
Collaborator Author

In the bump to C++20, I also bump the kokkos and kokkos-kernels submodules to 5.1.0. Note that this does not (yet) impact the spack builds. When we do our next release, we need to note the updated kokkos dependency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants