Backend time/failure reporting fixes for timeouts of local jobs#188
Merged
AlexJones0 merged 2 commits intolowRISC:masterfrom May 1, 2026
Merged
Backend time/failure reporting fixes for timeouts of local jobs#188AlexJones0 merged 2 commits intolowRISC:masterfrom
AlexJones0 merged 2 commits intolowRISC:masterfrom
Conversation
hcallahan-lowrisc
approved these changes
Apr 30, 2026
Contributor
hcallahan-lowrisc
left a comment
There was a problem hiding this comment.
Makes sense to me. Thanks @AlexJones0
This fixes an issue inherited from the old base launcher in DVSim, but only for the new runtime backend interface (i.e. not backported to the launchers, due to their fragility). Jobs that are killed after they time out directly emit a completion event instead of going through the regular `_finish_job` call, which mean that we do not get the job runtime / simulated time reported in the final completed spec, which is not correct. To alleviate this issue, change the LocalRuntimeBackend logic to always go through this `RuntimeBackend._finish_job` interface, where an exit code of `None` is now used to denote the fact that the process did not finish naturally and instead timed out. This case is the first thing checked before any log statuses, and so will always be reported as a timeout failure without the different backends needing to implement it manually (which should reduce some duplication in the future, and potential for inconsistencies in the failure messages). Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
This was a small bug introduced in the new local runtime backend. Originally the `_kill_job` method was just used by `_kill_many` (for when a SIGINT/SIGTERM is sent to the scheduler), so I used this `kill_requested` flag on the handle to determine when to override the kill reason. However, I then reused this function for the timeout case to kill the ongoing job, without realising that this flag was still being set in that function. In reality, to fix this issue we should just set it directly in `kill_many` to denote that we are manually trying to kill a job. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
5e27dff to
e9a0d72
Compare
rswarbrick
approved these changes
May 1, 2026
Contributor
rswarbrick
left a comment
There was a problem hiding this comment.
Thanks for sorting this out.
The changes look right to me. For the "in reality..." comment on the second commit, should we add an issue tracking that this is worth doing?
Contributor
Author
I think my wording here might have been a bit misleading - that comment is what the commit is doing! No more issue to be tracked 😃 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR fixes two bugs in the runtime backends:
LocalRuntimeBackend, where the error message that was being reported was not correct ("Job killed!" instead of "Job timed out after X minutes").See the commit messages for more info.
Example change: