-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flux job info: improve error messages #6331
Conversation
Problem: flux job info does not display detailed error messages when it cannot connect to the broker. Call flux_open_ex() and display the textual error message on error.
Problem: some code in job-lookup does not break long parameter lists per modern project conventions. Break long parameter lists to one per line.
Problem: the destructor for a lookup context is not errno-safe, which could result in the wrong errno being set in lookup(), and subsequently returned to a user lookup request. Make the function errno safe and adjust callers.
Problem: job-info log errors when it receives a malformed request, but it also sends EPROTO to the requestor so the log messages are redundant. Drop the logging.
Problem: when a job-info.lookup request fails, often the detailed information is logged and the user receives only an errno. Replace logging with a textual error response. Consolidate handling of some unlikely errors. Change one function that cannot fail to return void instead of int. Ensure errno is set when zlist_append() fails. Adjust expected output in tests.
Problem: the lookup response manually encodes json to a string before placing it in the response payload. Use flux_respond_pack() instead, which simplifies the code.
Problem: when flux job info fails, it does not print a different error message for a bad job ID vs missing key. Ensure that when the job owner is not found in the LRU cache, the eventlog is looked up, even for the instance owner. Then if the lookup of the eventlog fails with ENOENT, we know it's a bad job ID. Fixes flux-framework#6325
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Only spotted a couple very minor commit message typos
@@ -79,147 +79,148 @@ test_expect_success 'flux job eventlog works (owner)' ' | |||
test_expect_success 'flux job eventlog works (user)' ' | |||
jobid=$(submit_job 9000) && | |||
set_userid 9000 && | |||
flux job eventlog $jobid && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit message typo: whcih
@@ -6,51 +6,17 @@ test_description='Test flux job info service security' | |||
|
|||
test_under_flux 1 job | |||
|
|||
# We have to fake a job submission by a guest into the KVS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit message typo: "but the job-info LRU will now more aggressively caches eventlog" -- caches -> cache
Problem: if a test in t2232-job-info-security.t fails, sometimes the fake userid is left set, which can cause the next test to fail and so on. Clear the fake userid at the beginning of the next test, not in the && chain of the test that sets it.
Problem: t2232-job-info-security.t tests might succeed if the flux job info command segfaults. Use test_must_fail instead of the shell "!" test.
Problem: t2232-job-info-security.t starts 4 brokers but only uses 1. Reduce the size of the test instance to 1.
Problem: these tests mimic guest jobs by modifying eventlogs in the KVS, but the job-info LRU will now more aggressively cache eventlog results, causing many tests to fail. Use the submit_as_alternate_user() shell function from other tests and make its use contingent on flux being compiled with flux-security. To make the test faster, reuse jobs in multiple tests.
Problem: the entire eventlog is parsed when only the first event is needed for authorization. Simplify this function so it only parses the first event.
Problem: the job owner LRU is only populated when job info requests are received from a guest, but the instance owner can now benefit from it since it needs to look up eventlogs to validate the job ID. Don't exclude the instance owner from LRU validation. Now that lookups result in an eventlog parse, the t4413 issue test is now expecting the wrong message (again). Adjust for the new message.
Problem: there is no coverage for flux job info error messages when the job ID or key are unknown. Add tests.
Thanks! Fixed those two and forced a push. Will set MWP. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6331 +/- ##
==========================================
+ Coverage 83.32% 83.33% +0.01%
==========================================
Files 523 523
Lines 86167 86143 -24
==========================================
- Hits 71795 71788 -7
+ Misses 14372 14355 -17
|
Problem: as noted in #6325,
flux job info
prints the same error message for a missing key as for an unknown job id.Fixing this turned out to be tricky because the
job-info.lookup
service is just acting as a proxy for a KVS lookup and doesn't have a direct way to check for an invalid ID - it just knows that a KVS lookup forjob.<id>.R
(or whatever) got ENOENT. However it does also lookup the job eventlog for access control purposes for guests only. The instance owner is immune from such checks.The solution is to always look up the eventlog regardless of request credentials if that fails with ENOENT then we know the job ID does not exist.
Because this code is pretty old, a fair bit of modernization was also required. And there is more to do but I tried to keep this focused on the code involved in the reported problem.