
cc_test coverage fails when fetching from remote cache #20556

Closed
anhlinh123 opened this issue Dec 15, 2023 · 31 comments
Assignees: tjgq
Labels: coverage, P1 (I'll work on this now; assignee required), team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug

Comments

anhlinh123 commented Dec 15, 2023

Description of the bug:

Running coverage on a cc_test randomly fails with the following error:
I/O exception during sandboxed execution: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path_to_the_test_dir>/_coverage
As far as I know, the condition under which the bug happens is the one described in the reproduction steps below.

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  • Use disk_cache instead of remote_cache so the cache can be manipulated easily (add --disk_cache=<path_to_local_cache> to the command line). A consolidated command sketch follows after these steps.
  • Run a cc_test: bazel coverage <test> --disk_cache=<cache> --remote_upload_local_results=true --execution_log_json_file=<file_name>.
  • Open the JSON file and find the action whose "commandArgs" is ["external/bazel_tools/tools/test/collect_coverage.sh"].
  • Find the hash of the action output (near the end of the action object).
  • Delete the file corresponding to the hash value in the cache directory.
  • Rerun the test.
  • The test fails with the error: I/O exception during sandboxed execution: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path_to_the_test_dir>/_coverage.
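
A consolidated command sketch of the steps above (a sketch only: //my:test, the cache path, and the log path are placeholders, and the exact file layout inside the disk cache may vary, so find is used to locate the cached blob):

# 1. Run coverage once, populating the disk cache and writing the execution log.
bazel coverage //my:test \
  --disk_cache=/tmp/bazel-disk-cache \
  --remote_upload_local_results=true \
  --execution_log_json_file=/tmp/exec_log.json

# 2. In /tmp/exec_log.json, locate the action whose "commandArgs" is
#    ["external/bazel_tools/tools/test/collect_coverage.sh"] and note the
#    hash of its output (near the end of that action object).

# 3. Delete the cache entry whose file name matches that hash.
rm "$(find /tmp/bazel-disk-cache -type f -name '<hash>' | head -n 1)"

# 4. Rerun coverage; the test fails with
#    "I/O exception during sandboxed execution: Input is a directory: ..._coverage".
bazel coverage //my:test --disk_cache=/tmp/bazel-disk-cache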

Which operating system are you running Bazel on?

Ubuntu 20.04.6 LTS

What is the output of bazel info release?

release 7.0.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

It probably started after commit 1267631, which tries to fetch the metadata of a TreeArtifact (the _coverage directory).

Have you found anything relevant by searching the web?

No.

Any other information, logs, or outputs that you want to share?

No response

sputt (Contributor) commented Dec 17, 2023

Good writeup, thank you. This is blocking our upgrade to Bazel 7.0.0. It also occurs with these flags set:

coverage --experimental_fetch_all_coverage_outputs
coverage --experimental_split_coverage_postprocessing

which we set because we require --nobuild_runfile_links.

comius added the coverage and team-Remote-Exec (Issues and PRs for the Execution (Remote) team) labels and removed the team-Rules-CPP (Issues for C++ rules) label on Dec 27, 2023
tjgq self-assigned this on Jan 2, 2024
tjgq (Contributor) commented Jan 2, 2024

@anhlinh123 @sputt If you run with --verbose_failures, are you able to get the stack trace leading up to the "I/O exception during sandboxed execution" error? That would help narrow down the issue.

nlou9 commented Jan 5, 2024

We ran into this issue with py_test too. While migrating to Bazel 7.0.0, running coverage on a py_test fails with this error:
I/O exception during sandboxed execution: Input is a directory: bazel-out/aarch64-fastbuild/testlogs/tests/ci/release_stable/release_stable_lint_flake8_test/_coverage

Our coverage options:

test --experimental_fetch_all_coverage_outputs
test --remote_download_minimal
test --experimental_split_coverage_postprocessing
coverage --combined_report=lcov

fmeum (Collaborator) commented Jan 5, 2024

@bazel-io flag

bazel-io added the potential release blocker label (flagged by community members using "@bazel-io flag"; should be added to a release blocker milestone) on Jan 5, 2024
fmeum (Collaborator) commented Jan 5, 2024

@iancha1992 This looks like another 7.0.0 regression; I just want to make sure it's tracked.

iancha1992 (Member) commented Jan 5, 2024

@fmeum @Wyverald @meteorcloudy Should this be included in the 7.0.1 patch, or is it okay to go straight to 7.1.0?

cc: @bazelbuild/triage

Wyverald (Member) commented Jan 5, 2024

Since this is a regression in 7.0.0, ideally we should include a fix in 7.0.1. But that depends on the timeline. @tjgq do you have an estimate for how long the fix might take?

iancha1992 changed the title from "[7.0.0] cc_test coverage fails when fetching from remote cache" to "cc_test coverage fails when fetching from remote cache" on Jan 8, 2024
iancha1992 (Member) commented:

@bazel-io fork 7.0.1

iancha1992 (Member) commented:

@bazel-io fork 7.1.0

bazel-io removed the potential release blocker label on Jan 8, 2024
oquenchil added the P1 (I'll work on this now; assignee required) label and removed the untriaged label on Jan 9, 2024
tjgq (Contributor) commented Jan 9, 2024

@nlou9 Same request as above: can you please run with --verbose_failures and post the full stack trace here?

lberki (Contributor) commented Jan 9, 2024

Are you setting --experimental_split_coverage_postprocessing by any chance?

If so, I suspect that the root cause is the same as the coverage failure in #20753, and I have a fix for it that is in the process of being submitted (although CI is not really a happy camper right now, so it will take a bit of time to materialize at HEAD).

nlou9 commented Jan 9, 2024

@lberki We are using --experimental_split_coverage_postprocessing, as mentioned above.

nlou9 commented Jan 9, 2024

@tjgq, here is the log

bazel coverage //... (failed in 02:17)

2024/01/09 02:29:07 Downloading https://releases.bazel.build/7.0.0/release/bazel-7.0.0-linux-arm64...
2024/01/09 02:29:07 Skipping basic authentication for releases.bazel.build because no credentials found in /home/semaphore/.netrc
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
(02:29:13) INFO: Invocation ID: d0d3983c-0fcd-4822-974a-8cda4a8476d5
(02:29:13) INFO: Options provided by the client:
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/.bazelrc:
  Inherited 'common' options:
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/ci-tools/.bazelrc:
  Inherited 'build' options: --remote_retries=2 --remote_timeout=7200 --java_runtime_version=remotejdk_11 --enable_bzlmod=false
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/.bazelrc:
  Inherited 'build' options: --announce_rc --show_timestamps --show_progress_rate_limit=60 --curses=no --remote_download_minimal
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/ci-tools/.bazelrc:
  Inherited 'test' options: --test_output=errors --test_timeout=-1,-1,-1,7200
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/.bazelrc:
  Inherited 'test' options: --test_output=errors --experimental_fetch_all_coverage_outputs --experimental_split_coverage_postprocessing
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/ci-tools/.bazelrc:
  'coverage' options: --combined_report=lcov --verbose_failures --instrumentation_filter=^// --instrumentation_filter=^//confluent[/:],-.*(test|tests|lint)
(02:29:13) INFO: Current date is 2024-01-09
(02:29:13) Computing main repo mapping:
(02:29:18) DEBUG: /home/semaphore/.cache/bazel/_bazel_semaphore/c2e391e2c21d1440479c3b26017f505f/external/rules_python/python/pip.bzl:47:10: pip_install is deprecated. Please switch to pip_parse. pip_install will be removed in a future release.
(02:29:19) Loading:
(02:29:19) Loading: 0 packages loaded
(02:29:20) Analyzing: 324 targets (26 packages loaded, 0 targets configured)
(02:29:20) Analyzing: 324 targets (26 packages loaded, 0 targets configured)
[0 / 1] checking cached actions
(02:30:21) Analyzing: 324 targets (199 packages loaded, 10070 targets configured)
[1 / 2] checking cached actions
(02:30:32) INFO: Analyzed 324 targets (219 packages loaded, 11036 targets configured).
(02:31:21) [622 / 741] Creating runfiles tree bazel-out/aarch64-fastbuild/bin/tests/ci/release_stable/release/test_builder_first_rc_from_repo.runfiles; 45s local ... (112 actions, 93 running)
(02:31:23) ERROR: /home/semaphore/ci-tools/tests/ci/scripts/BUILD.bazel:44:8: Testing //tests/ci/scripts:test_update_version_integration_lint_flake8_test failed: I/O exception during sandboxed execution: Input is a directory: bazel-out/aarch64-fastbuild/testlogs/tests/ci/scripts/test_update_version_integration_lint_flake8_test/_coverage
(02:31:23) INFO: Elapsed time: 135.863s, Critical Path: 47.72s

lberki (Contributor) commented Jan 9, 2024

Then I believe iancha1992@9463a4f is the fix (it'll be in both 7.0.1 and 7.1.0).

nlou9 commented Jan 9, 2024

@lberki What's the ETA for 7.0.1 or 7.1.0?

lberki (Contributor) commented Jan 9, 2024

I think the idea is that 7.0.1 should be out sometime next week at the latest (don't take this as a promise until @meteorcloudy confirms).

lberki (Contributor) commented Jan 9, 2024

Also, @tjgq suspects that there is another bug lurking in the deep that could cause your build to fail like this, and he is working on confirming or denying that. My guess is that the above change is enough, but that's just a guess, and an uninformed one.

tjgq (Contributor) commented Jan 10, 2024

If we have confirmation that the repro requires --experimental_split_coverage_postprocessing, I believe b0db044 will fix it. (Otherwise, if it's supposed to repro without --experimental_split_coverage_postprocessing, I couldn't do so following the instructions above.)

lberki (Contributor) commented Jan 10, 2024

Thanks, your mention of AbstractActionInputPrefetcher scared me, but if that's not involved in this breakage, I'm relieved.

UebelAndre (Contributor) commented:

I see this on python tests as well. This issue blocks my ability to upgrade Bazel.

anhlinh123 (Author) commented Jan 11, 2024

@tjgq Sorry for the late reply. This is the error message when using --verbose_failures:

ERROR: /<path_to_the_test>/BUILD:50:8: Testing //<path>:UnitTest failed: I/O exception during sandboxed execution: 11 errors during bulk transfer:
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage

I don't think there will be a stack trace, as there is a function that consumes all exceptions. I don't remember which one it is since it's been a while. Let me try to recall.

anhlinh123 (Author) commented:

Ok, so it all started with this function:
https://github.com/bazelbuild/bazel/blob/7.0.0/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnCache.java#L82
which ultimately swallows all IOExceptions:
https://github.com/bazelbuild/bazel/blob/7.0.0/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnCache.java#L142

But the critical point seems to be here:
https://github.com/bazelbuild/bazel/blob/7.0.0/src/main/java/com/google/devtools/build/lib/remote/AbstractActionInputPrefetcher.java#L418
where it tries to fetch the metadata of a tree artifact.
I'm not sure why it does that, but at that point the tree artifact is a directory (I guess because the directory's contents were already evicted from the remote cache), and that breaks this code:
https://github.com/bazelbuild/bazel/blob/7.0.0/src/main/java/com/google/devtools/build/lib/exec/SingleBuildFileCache.java#L78
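
As a rough command-line analogy of the failing check (not Bazel's actual code path; the path below is a placeholder): the _coverage test output is a tree artifact, i.e. a directory in the output tree, so any attempt to compute a regular file digest for it fails, which is what shows up here as DigestOfDirectoryException.

# _coverage is a directory, not a regular file:
ls -ld bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage

# so digesting it as if it were a file fails with "Is a directory",
# the same condition the Java code reports as DigestOfDirectoryException:
sha256sum bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage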

anhlinh123 (Author) commented:

My repo also uses --experimental_split_coverage_postprocessing.
But I believe that flag isn't needed to reproduce the error (assuming it's disabled by default).
It can easily be reproduced locally by following the reproduction steps from the issue description above.


tjgq (Contributor) commented Jan 11, 2024

@anhlinh123 Do you mind giving b0db044 a try? (The simplest way is to use Bazelisk with USE_BAZEL_VERSION=last_green.) That would tell us whether there's indeed a separate issue that that commit didn't fix.

Otherwise, I think there might be something missing from your repro steps. You are building once, deleting a file from the disk cache, then rebuilding incrementally. I'd thus expect the incremental rebuild to be a no-op, since the output tree hasn't been touched and Bazel can tell it's up-to-date (and that's also what I'm seeing experimentally.) Is the second build a clean build? Are there any other flags in a .bazelrc? (Use --announce_rc to print all of the flags Bazel is using.)
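
For reference, one way to run both checks with Bazelisk (a sketch; the target label and cache path are placeholders, and USE_BAZEL_VERSION=last_green makes Bazelisk download the most recent green Bazel build):

# Rerun the repro against Bazel's latest green build.
USE_BAZEL_VERSION=last_green bazelisk coverage //my:test \
  --disk_cache=/tmp/bazel-disk-cache --verbose_failures

# Print every option picked up from rc files, to rule out stray .bazelrc flags.
bazelisk coverage //my:test --announce_rc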

tjgq (Contributor) commented Jan 11, 2024

(It's USE_BAZEL_VERSION=last_green, not USE_BAZEL_VERSION=latest. I've amended the previous comment.)

tjgq added a commit to tjgq/bazel that referenced this issue Jan 11, 2024
anhlinh123 (Author) commented:

@tjgq This is what I've found so far.

  • I made a small project to reproduce the bug. You're right, I missed the step of cleaning up the Bazel cache after tampering with the remote cache.
  • The flag --experimental_split_coverage_postprocessing actually matters. Turning it off did fix the error.
  • Bazel last_green fixes the error!

tjgq (Contributor) commented Jan 11, 2024

That's great to hear, thanks. I will close this issue, since the fix has also been cherry-picked into the 7.0.1 branch (in #20819).

tjgq closed this as completed on Jan 11, 2024
anhlinh123 (Author) commented:

@tjgq thank you for your support!

anhlinh123 (Author) commented:

@tjgq Version 7.0.1rc1 has a weird bug related to this: turning --experimental_split_coverage_postprocessing on always breaks coverage.

ERROR: <path>/test/BUILD:1:8: Testing //:test failed: I/O exception during sandboxed execution: Input is a directory: bazel-out/k8-fastbuild/testlogs/test/_coverage

The last_green build doesn't have this problem.

tjgq (Contributor) commented Jan 15, 2024

7.0.1rc1 was cut before the fix made it into the 7.0.1 branch. Can you try USE_BAZEL_VERSION=f4da34dcfe7b83388e3d963f35581a4fe710fc14 (current tip of the branch)?

anhlinh123 (Author) commented:

@tjgq It works!!! Thank you!
