Queue cache - add reverse workload mapping to stop relying on previous workload state during update/delete #8001
base: main
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Hi @Singularity23x0. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
olekzabl
left a comment
Nice! Though some more minor comments.
/lgtm
LGTM label has been added. Git tree hash: 2ef83dc41dd58962f464055f949c596915d04c5a
		continue
	}
	wl := cq.Pop()
	m.reportPendingWorkloads(cqName, cq)
Why is this moved? It looks unrelated to the PR.
To summarize the discussion we had on this point in the previous iteration of the PR:
One of the tests (test/integration/singlecluster/controller/jobs/appwrapper/appwrapper_controller_test.go -> "Should schedule AppWrappers as they fit in their ClusterQueue") had an issue with reporting pending workloads: the existence of the "inflight" workload made it think there was one pending.
With the reporting call placed after the nil check, this remained the case, whereas now, if the inflight workload is set back to nil, the metric is corrected accordingly.
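For context, here is a minimal, self-contained Go sketch of the ordering being discussed. The types and names below (clusterQueue, inflight, Pending, a free reportPendingWorkloads function) are simplified stand-ins for illustration only, not Kueue's actual implementation; the point is just that reporting the pending count after the inflight bookkeeping means a later reset of the inflight field gets reflected in the metric.

```go
package main

import "fmt"

// Simplified stand-ins for the real types; names are illustrative only.
type workload struct{ name string }

type clusterQueue struct {
	heap     []*workload
	inflight *workload // workload popped but not yet dispatched
}

// Pop removes the head workload and records it as inflight.
func (cq *clusterQueue) Pop() *workload {
	if len(cq.heap) == 0 {
		return nil
	}
	wl := cq.heap[0]
	cq.heap = cq.heap[1:]
	cq.inflight = wl
	return wl
}

// Pending counts queued workloads plus any inflight one.
func (cq *clusterQueue) Pending() int {
	n := len(cq.heap)
	if cq.inflight != nil {
		n++
	}
	return n
}

// reportPendingWorkloads stands in for updating the pending-workloads metric.
func reportPendingWorkloads(cqName string, cq *clusterQueue) {
	fmt.Printf("metric: pending_workloads{cluster_queue=%q} = %d\n", cqName, cq.Pending())
}

func main() {
	cq := &clusterQueue{heap: []*workload{{name: "wl-1"}}}

	// Reporting after Pop reflects the current state, including the
	// inflight workload that was just taken off the heap.
	wl := cq.Pop()
	_ = wl
	reportPendingWorkloads("cq-a", cq) // -> 1 (wl-1 counted as inflight)

	// If the scheduling attempt is abandoned and inflight is cleared,
	// reporting again corrects the metric.
	cq.inflight = nil
	reportPendingWorkloads("cq-a", cq) // -> 0
}
```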
This makes sense, but is this test failing only after the PR? I'm wondering if we should consider this as a small bug we would cherrypick, wdyt?
My investigations were inconclusive as to what exactly in the changes causes this issue to become visible. My best guess so far is that the changed logic delayed some async operations, so they happened later than planned and did not cache this particular update to the inflight field in time.
I can split it out as a bug fix and put it in a separate PR to be merged beforehand; working on it.
So if we don't change the location of the line in the new version, is the test failing consistently or flaking?
Also, do you have an example run that failed?
Update: un-drafted and ready
Yeah, thank you for extracting it. It would still be great to spend some time trying to understand what the issue here is, and its impact. It would be greatly appreciated if you could push the investigation a bit further.
Added more info on test flakiness: #8037 (comment)
sgtm
mimowo
left a comment
LGTM, just one nit / question
@Singularity23x0 please rebase
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: kshalot, olekzabl, Singularity23x0. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Part 2/3 of updating the queue cache.
The PR introduces changes allowing O(1)-time retrieval of the current state of a workload assumed into the queue cache. This makes the update/delete methods independent of an object representing the workload's state before the update.
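As a rough illustration of the idea only (the data structures and names below, such as workloadToQueue, are hypothetical and not Kueue's actual queue cache), a reverse mapping keyed by workload lets Update/Delete find the currently cached copy in O(1) time without the caller supplying the pre-update object:

```go
package main

import "fmt"

// Illustrative stand-ins only; the real Kueue cache types differ.
type workloadInfo struct {
	key   string
	queue string
}

type queueCache struct {
	// queues maps a queue name to the workloads it currently holds.
	queues map[string]map[string]*workloadInfo
	// workloadToQueue is the reverse mapping: workload key -> queue name,
	// giving O(1) access to the cached state without the old object.
	workloadToQueue map[string]string
}

func newQueueCache() *queueCache {
	return &queueCache{
		queues:          map[string]map[string]*workloadInfo{},
		workloadToQueue: map[string]string{},
	}
}

// Add stores the workload and records the reverse mapping.
func (c *queueCache) Add(wl *workloadInfo) {
	if c.queues[wl.queue] == nil {
		c.queues[wl.queue] = map[string]*workloadInfo{}
	}
	c.queues[wl.queue][wl.key] = wl
	c.workloadToQueue[wl.key] = wl.queue
}

// Delete no longer needs the previous workload object: the reverse mapping
// tells us which queue currently holds the cached copy.
func (c *queueCache) Delete(key string) {
	q, ok := c.workloadToQueue[key]
	if !ok {
		return
	}
	delete(c.queues[q], key)
	delete(c.workloadToQueue, key)
}

// Update replaces the cached copy, moving it between queues if needed.
func (c *queueCache) Update(wl *workloadInfo) {
	c.Delete(wl.key)
	c.Add(wl)
}

func main() {
	c := newQueueCache()
	c.Add(&workloadInfo{key: "ns/wl-1", queue: "cq-a"})
	c.Update(&workloadInfo{key: "ns/wl-1", queue: "cq-b"})
	c.Delete("ns/wl-1")
	fmt.Println(len(c.workloadToQueue)) // 0
}
```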
Which issue(s) this PR fixes:
Part 1.2 of addressing: #5310
Special notes for your reviewer:
Created from #7915
Does this PR introduce a user-facing change?