flux job info: improve error messages #6331

garlick · 2024-10-01T00:55:03Z

Problem: as noted in #6325, flux job info prints the same error message for a missing key as for an unknown job id.

Fixing this turned out to be tricky because the job-info.lookup service is just acting as a proxy for a KVS lookup and doesn't have a direct way to check for an invalid ID - it just knows that a KVS lookup for job.<id>.R (or whatever) got ENOENT. However it does also lookup the job eventlog for access control purposes for guests only. The instance owner is immune from such checks.

The solution is to always look up the eventlog regardless of request credentials if that fails with ENOENT then we know the job ID does not exist.

Because this code is pretty old, a fair bit of modernization was also required. And there is more to do but I tried to keep this focused on the code involved in the reported problem.

Problem: flux job info does not display detailed error messages when it cannot connect to the broker. Call flux_open_ex() and display the textual error message on error.

Problem: some code in job-lookup does not break long parameter lists per modern project conventions. Break long parameter lists to one per line.

Problem: the destructor for a lookup context is not errno-safe, which could result in the wrong errno being set in lookup(), and subsequently returned to a user lookup request. Make the function errno safe and adjust callers.

Problem: job-info log errors when it receives a malformed request, but it also sends EPROTO to the requestor so the log messages are redundant. Drop the logging.

Problem: when a job-info.lookup request fails, often the detailed information is logged and the user receives only an errno. Replace logging with a textual error response. Consolidate handling of some unlikely errors. Change one function that cannot fail to return void instead of int. Ensure errno is set when zlist_append() fails. Adjust expected output in tests.

Problem: the lookup response manually encodes json to a string before placing it in the response payload. Use flux_respond_pack() instead, which simplifies the code.

Problem: when flux job info fails, it does not print a different error message for a bad job ID vs missing key. Ensure that when the job owner is not found in the LRU cache, the eventlog is looked up, even for the instance owner. Then if the lookup of the eventlog fails with ENOENT, we know it's a bad job ID. Fixes flux-framework#6325

grondo

LGTM! Only spotted a couple very minor commit message typos

grondo · 2024-10-01T01:25:29Z

t/t2232-job-info-security.t

@@ -79,147 +79,148 @@ test_expect_success 'flux job eventlog works (owner)' '
 test_expect_success 'flux job eventlog works (user)' '
 	jobid=$(submit_job 9000) &&
 	set_userid 9000 &&
-	flux job eventlog $jobid &&


commit message typo: whcih

grondo · 2024-10-01T01:27:13Z

t/t2232-job-info-security.t

@@ -6,51 +6,17 @@ test_description='Test flux job info service security'

 test_under_flux 1 job

-# We have to fake a job submission by a guest into the KVS.


commit message typo: "but the job-info LRU will now more aggressively caches eventlog" -- caches -> cache

Problem: if a test in t2232-job-info-security.t fails, sometimes the fake userid is left set, which can cause the next test to fail and so on. Clear the fake userid at the beginning of the next test, not in the && chain of the test that sets it.

Problem: t2232-job-info-security.t tests might succeed if the flux job info command segfaults. Use test_must_fail instead of the shell "!" test.

Problem: t2232-job-info-security.t starts 4 brokers but only uses 1. Reduce the size of the test instance to 1.

Problem: these tests mimic guest jobs by modifying eventlogs in the KVS, but the job-info LRU will now more aggressively cache eventlog results, causing many tests to fail. Use the submit_as_alternate_user() shell function from other tests and make its use contingent on flux being compiled with flux-security. To make the test faster, reuse jobs in multiple tests.

Problem: the entire eventlog is parsed when only the first event is needed for authorization. Simplify this function so it only parses the first event.

Problem: the job owner LRU is only populated when job info requests are received from a guest, but the instance owner can now benefit from it since it needs to look up eventlogs to validate the job ID. Don't exclude the instance owner from LRU validation. Now that lookups result in an eventlog parse, the t4413 issue test is now expecting the wrong message (again). Adjust for the new message.

Problem: there is no coverage for flux job info error messages when the job ID or key are unknown. Add tests.

garlick · 2024-10-01T02:04:17Z

Thanks! Fixed those two and forced a push. Will set MWP.

codecov · 2024-10-01T13:41:04Z

Codecov Report

Attention: Patch coverage is 72.64957% with 32 lines in your changes missing coverage. Please review.

Project coverage is 83.33%. Comparing base (5736952) to head (00ef18d).
Report is 15 commits behind head on master.

Files with missing lines	Patch %	Lines
src/modules/job-info/lookup.c	70.66%	22 Missing ⚠️
src/modules/job-info/watch.c	53.84%	6 Missing ⚠️
src/modules/job-info/util.c	0.00%	2 Missing ⚠️
src/cmd/job/info.c	66.66%	1 Missing ⚠️
src/modules/job-info/update.c	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6331      +/-   ##
==========================================
+ Coverage   83.32%   83.33%   +0.01%     
==========================================
  Files         523      523              
  Lines       86167    86143      -24     
==========================================
- Hits        71795    71788       -7     
+ Misses      14372    14355      -17

Files with missing lines	Coverage Δ
src/modules/job-info/allow.c	`92.68% <100.00%> (+7.68%)`	⬆️
src/modules/job-info/job-info.c	`74.64% <100.00%> (ø)`
src/cmd/job/info.c	`92.68% <66.66%> (+0.18%)`	⬆️
src/modules/job-info/update.c	`76.77% <87.50%> (-0.08%)`	⬇️
src/modules/job-info/util.c	`70.49% <0.00%> (ø)`
src/modules/job-info/watch.c	`65.82% <53.84%> (-2.96%)`	⬇️
src/modules/job-info/lookup.c	`74.24% <70.66%> (+1.40%)`	⬆️

... and 14 files with indirect coverage changes

garlick added 7 commits September 30, 2024 17:15

flux job info: provide detailed flux_open() errors

f9e6793

Problem: flux job info does not display detailed error messages when it cannot connect to the broker. Call flux_open_ex() and display the textual error message on error.

job-info: fix whitespace issues

8adf709

Problem: some code in job-lookup does not break long parameter lists per modern project conventions. Break long parameter lists to one per line.

job-info: lookup_ctx_destroy() clobbers errno

d2f6f5e

Problem: the destructor for a lookup context is not errno-safe, which could result in the wrong errno being set in lookup(), and subsequently returned to a user lookup request. Make the function errno safe and adjust callers.

job-info: don't log request unpack errors

11ae48d

Problem: job-info log errors when it receives a malformed request, but it also sends EPROTO to the requestor so the log messages are redundant. Drop the logging.

job-info: use flux_respond_pack()

c8de40b

Problem: the lookup response manually encodes json to a string before placing it in the response payload. Use flux_respond_pack() instead, which simplifies the code.

grondo approved these changes Oct 1, 2024

View reviewed changes

garlick added 7 commits September 30, 2024 19:00

testsuite: ensure tests clean up

abc6503

Problem: if a test in t2232-job-info-security.t fails, sometimes the fake userid is left set, which can cause the next test to fail and so on. Clear the fake userid at the beginning of the next test, not in the && chain of the test that sets it.

testsuite: use test_must_fail

b2c2945

Problem: t2232-job-info-security.t tests might succeed if the flux job info command segfaults. Use test_must_fail instead of the shell "!" test.

testsuite: reduce test size instance

23c3a29

Problem: t2232-job-info-security.t starts 4 brokers but only uses 1. Reduce the size of the test instance to 1.

job-info: reduce eventlog_get_userid() complexity

418394d

Problem: the entire eventlog is parsed when only the first event is needed for authorization. Simplify this function so it only parses the first event.

testsuite: cover flux job info key/jobid errors

00ef18d

Problem: there is no coverage for flux job info error messages when the job ID or key are unknown. Add tests.

garlick force-pushed the issue#6325 branch from 75ef9cc to 00ef18d Compare October 1, 2024 02:02

garlick added the merge-when-passing label Oct 1, 2024

mergify bot merged commit 12017cc into flux-framework:master Oct 1, 2024
32 of 33 checks passed

garlick deleted the issue#6325 branch October 1, 2024 13:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flux job info: improve error messages #6331

flux job info: improve error messages #6331

garlick commented Oct 1, 2024

grondo left a comment

grondo Oct 1, 2024

grondo Oct 1, 2024

garlick commented Oct 1, 2024

codecov bot commented Oct 1, 2024

		@@ -6,51 +6,17 @@ test_description='Test flux job info service security'

		test_under_flux 1 job

		# We have to fake a job submission by a guest into the KVS.

flux job info: improve error messages #6331

flux job info: improve error messages #6331

Conversation

garlick commented Oct 1, 2024

grondo left a comment

Choose a reason for hiding this comment

grondo Oct 1, 2024

Choose a reason for hiding this comment

grondo Oct 1, 2024

Choose a reason for hiding this comment

garlick commented Oct 1, 2024

codecov bot commented Oct 1, 2024

Codecov Report