Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux job info: improve error messages #6331

Merged
merged 14 commits into from
Oct 1, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented Oct 1, 2024

Problem: as noted in #6325, flux job info prints the same error message for a missing key as for an unknown job id.

Fixing this turned out to be tricky because the job-info.lookup service is just acting as a proxy for a KVS lookup and doesn't have a direct way to check for an invalid ID - it just knows that a KVS lookup for job.<id>.R (or whatever) got ENOENT. However it does also lookup the job eventlog for access control purposes for guests only. The instance owner is immune from such checks.

The solution is to always look up the eventlog regardless of request credentials if that fails with ENOENT then we know the job ID does not exist.

Because this code is pretty old, a fair bit of modernization was also required. And there is more to do but I tried to keep this focused on the code involved in the reported problem.

Problem: flux job info does not display detailed error messages when
it cannot connect to the broker.

Call flux_open_ex() and display the textual error message on error.
Problem: some code in job-lookup does not break long parameter lists
per modern project conventions.

Break long parameter lists to one per line.
Problem: the destructor for a lookup context is not errno-safe,
which could result in the wrong errno being set in lookup(),
and subsequently returned to a user lookup request.

Make the function errno safe and adjust callers.
Problem: job-info log errors when it receives a malformed request,
but it also sends EPROTO to the requestor so the log messages
are redundant.

Drop the logging.
Problem: when a job-info.lookup request fails, often the detailed
information is logged and the user receives only an errno.

Replace logging with a textual error response.

Consolidate handling of some unlikely errors.
Change one function that cannot fail to return void instead of int.
Ensure errno is set when zlist_append() fails.
Adjust expected output in tests.
Problem: the lookup response manually encodes json to a string
before placing it in the response payload.

Use flux_respond_pack() instead, which simplifies the code.
Problem: when flux job info fails, it does not print a different
error message for a bad job ID vs missing key.

Ensure that when the job owner is not found in the LRU cache,
the eventlog is looked up, even for the instance owner.
Then if the lookup of the eventlog fails with ENOENT, we know
it's a bad job ID.

Fixes flux-framework#6325
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Only spotted a couple very minor commit message typos

@@ -79,147 +79,148 @@ test_expect_success 'flux job eventlog works (owner)' '
test_expect_success 'flux job eventlog works (user)' '
jobid=$(submit_job 9000) &&
set_userid 9000 &&
flux job eventlog $jobid &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit message typo: whcih

@@ -6,51 +6,17 @@ test_description='Test flux job info service security'

test_under_flux 1 job

# We have to fake a job submission by a guest into the KVS.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit message typo: "but the job-info LRU will now more aggressively caches eventlog" -- caches -> cache

Problem: if a test in t2232-job-info-security.t fails, sometimes
the fake userid is left set, which can cause the next test to fail
and so on.

Clear the fake userid at the beginning of the next test, not in the
&& chain of the test that sets it.
Problem: t2232-job-info-security.t tests might succeed if the
flux job info command segfaults.

Use test_must_fail instead of the shell "!" test.
Problem: t2232-job-info-security.t starts 4 brokers but only uses 1.

Reduce the size of the test instance to 1.
Problem: these tests mimic guest jobs by modifying eventlogs in the
KVS, but the job-info LRU will now more aggressively cache eventlog
results, causing many tests to fail.

Use the submit_as_alternate_user() shell function from other tests
and make its use contingent on flux being compiled with flux-security.

To make the test faster, reuse jobs in multiple tests.
Problem: the entire eventlog is parsed when only the first event is
needed for authorization.

Simplify this function so it only parses the first event.
Problem: the job owner LRU is only populated when job info
requests are received from a guest, but the instance owner can
now benefit from it since it needs to look up eventlogs to validate
the job ID.

Don't exclude the instance owner from LRU validation.

Now that lookups result in an eventlog parse, the t4413 issue test
is now expecting the wrong message (again).  Adjust for the new message.
Problem: there is no coverage for flux job info error messages
when the job ID or key are unknown.

Add tests.
@garlick
Copy link
Member Author

garlick commented Oct 1, 2024

Thanks! Fixed those two and forced a push. Will set MWP.

@mergify mergify bot merged commit 12017cc into flux-framework:master Oct 1, 2024
32 of 33 checks passed
Copy link

codecov bot commented Oct 1, 2024

Codecov Report

Attention: Patch coverage is 72.64957% with 32 lines in your changes missing coverage. Please review.

Project coverage is 83.33%. Comparing base (5736952) to head (00ef18d).
Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
src/modules/job-info/lookup.c 70.66% 22 Missing ⚠️
src/modules/job-info/watch.c 53.84% 6 Missing ⚠️
src/modules/job-info/util.c 0.00% 2 Missing ⚠️
src/cmd/job/info.c 66.66% 1 Missing ⚠️
src/modules/job-info/update.c 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6331      +/-   ##
==========================================
+ Coverage   83.32%   83.33%   +0.01%     
==========================================
  Files         523      523              
  Lines       86167    86143      -24     
==========================================
- Hits        71795    71788       -7     
+ Misses      14372    14355      -17     
Files with missing lines Coverage Δ
src/modules/job-info/allow.c 92.68% <100.00%> (+7.68%) ⬆️
src/modules/job-info/job-info.c 74.64% <100.00%> (ø)
src/cmd/job/info.c 92.68% <66.66%> (+0.18%) ⬆️
src/modules/job-info/update.c 76.77% <87.50%> (-0.08%) ⬇️
src/modules/job-info/util.c 70.49% <0.00%> (ø)
src/modules/job-info/watch.c 65.82% <53.84%> (-2.96%) ⬇️
src/modules/job-info/lookup.c 74.24% <70.66%> (+1.40%) ⬆️

... and 14 files with indirect coverage changes

@garlick garlick deleted the issue#6325 branch October 1, 2024 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants