-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remote worker execution sometimes fails with exit -1 #3356
Comments
If this becomes more pressing, the best we can do is retry the subprocess execution, with the downside that that could hide other errors. |
Wait, this is an IOException, not a CommandException or AbnormalTerminationException. That means after my fix is in, it will be converted to INTERNAL status and retried by the client. |
This was happening less often because I accidentally made all remote execution single-threaded. :-( Now that we've fixed that, I can see this happening fairly commonly, so we should have a fix for this before we release. |
Ola says that it still happens even after my change, so reopening this. |
So even always calling |
I'm not sure which version Ola was looking at. I've never personally seen it happen with the single-threaded impl. |
I haven't had time to test the current workaround, so I'm not how likely it is. |
This might be lower priority if it's actually rare. |
@ulfjack |
@ulfjack We've seen this in mainline bazel a couple times in our verification builds. |
I just tried to run: |
I could also reproduce it on third run using it Ola's command. |
So I have a fix that seems to work reliably. I had it running for 4 hours with 128 concurrent actions on a cloud vm with 64 cores, as well as with 40 jobs for an hour on my workstation and I haven't experienced the error once. The fix is to have a single thread dedicated to launching the process and then waiting for its completion asynchronously - very straightforward. I ll provide a CL soonish. |
I am marking this a release blocker, since we already have a fix for it and thus should also include it in 0.5.4. |
Spawning processes concurrently from multiple threads doesn't work reliably on Linux (see bug for details). This change introduces simply puts a synchronized block around fork(). I have tested this change by repeatedly running a build of bazel for 2 hours on a 64 core machine with up to 200 concurrent actions. There wasn't a single failure. PiperOrigin-RevId: 164450559
Spawning processes concurrently from multiple threads doesn't work reliably on Linux (see bug for details). This change introduces simply puts a synchronized block around fork(). I have tested this change by repeatedly running a build of bazel for 2 hours on a 64 core machine with up to 200 concurrent actions. There wasn't a single failure. PiperOrigin-RevId: 164450559
Spawning processes concurrently from multiple threads doesn't work reliably on Linux (see bug for details). This change introduces simply puts a synchronized block around fork(). I have tested this change by repeatedly running a build of bazel for 2 hours on a 64 core machine with up to 200 concurrent actions. There wasn't a single failure. PiperOrigin-RevId: 164450559
Spawning processes concurrently from multiple threads doesn't work reliably on Linux (see bug for details). This change introduces simply puts a synchronized block around fork(). I have tested this change by repeatedly running a build of bazel for 2 hours on a 64 core machine with up to 200 concurrent actions. There wasn't a single failure. PiperOrigin-RevId: 164450559
Baseline: 6563b2d Cherry picks: + d4fa181: Use getExecPathString when getting bash_main_file + 837e1b3: Windows, sh_bin. launcher: export runfiles envvars + fe9ba89: grpc: Consolidate gRPC code from BES and Remote Execution. Fixes #3460, #3486 + e8d4366: Automated rollback of commit 496d3de. + 242a434: bes,remote: update default auth scope. + 793b409: Windows, sh_bin. launcher: fix manifest path + 7e4fbbe: Add --windows_exe_launcher option + 91fb38e: remote_worker: Serialize fork() calls. Fixes #3356 + b79a9fc: Quote python_path and launcher in python_stub_template_windows.txt + 4a2e17f: Add build_windows_jni.sh back + ce61d63: don't use methods and classes removed in upstream dx RELNOTES: update dexing tools to Android SDK 26.0.1 + 5393a49: Make Bazel enforce requirement on build-tools 26.0.1 or later. + 5fac03570f80856c063c6019f5beb3bdc1672dee: Fix --verbose_failures w/ sandboxing to print the full command line + f7bd1acf1f96bb7e3e19edb9483d9e07eb5af070: Only patch in C++ compile features when they are not already defined in crosstool + d7f5c12: Bump python-gflags to 3.1.0, take two + 3cb136d: Add python to bazel's dockerfiles New features: - Do not disable fully dynamic linking with ThinLTO when invoked via LIPO options. Important changes: - Ignore --glibc in the Android transition. - Remove --experimental_android_use_singlejar_for_multidex. - nocopts now also filter copts - The Build Event Service (BES) client now properly supports Google Applicaton Default Credentials. - update dexing tools to Android SDK 26.0.1 - Bazel Android support now requires build-tools 26.0.1 or later. - Fix a bug in the remote_worker that would at times make it crash on Linux. See #3356 - The java_proto_library rule now supports generated sources. See #2265
I split this out from #3251. I get this exception:
I am highly confident that it's this bug:
https://bugs.openjdk.java.net/browse/JDK-8068370
I.e., the problem is that there's a race when multiple threads try to fork and exec binaries that were recently written to disk. The underlying problem is that Linux has no API to do this reliably (as far as we can tell). Workarounds are either to retry when it happens, to use a helper process to do the forking, or otherwise serialize the forking.
The text was updated successfully, but these errors were encountered: