Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote worker execution sometimes fails with exit -1 #3356

Closed
ulfjack opened this issue Jul 11, 2017 · 14 comments
Closed

Remote worker execution sometimes fails with exit -1 #3356

ulfjack opened this issue Jul 11, 2017 · 14 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) type: bug

Comments

@ulfjack
Copy link
Contributor

ulfjack commented Jul 11, 2017

I split this out from #3251. I get this exception:

java.io.IOException: Cannot run program "/tmp/remote.pTK0EkckA/build-0121c395-7267-4941-abb9-405621684cdf/external/bazel_tools/tools/jdk/ijar/ijar" (in directory "/tmp/remote.pTK0EkckA/build-0121c395-7267-4941-abb9-405621684cdf"): error=26, Text file busy
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at com.google.devtools.build.lib.shell.JavaSubprocessFactory.create(JavaSubprocessFactory.java:116)
	at com.google.devtools.build.lib.shell.SubprocessBuilder.start(SubprocessBuilder.java:194)
	at com.google.devtools.build.lib.shell.Command.startProcess(Command.java:738)
	at com.google.devtools.build.lib.shell.Command.doExecute(Command.java:705)
	at com.google.devtools.build.lib.shell.Command.execute(Command.java:418)
	at com.google.devtools.build.remote.WatcherServer.execute(WatcherServer.java:250)
	at com.google.devtools.build.remote.WatcherServer.watch(WatcherServer.java:318)
	at com.google.watcher.v1.WatcherGrpc$MethodHandlers.invoke(WatcherGrpc.java:214)
	at io.grpc.stub.ServerCalls$1$1.onHalfClose(ServerCalls.java:148)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:262)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$2.runInContext(ServerImpl.java:572)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:117)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=26, Text file busy
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 16 more

I am highly confident that it's this bug:
https://bugs.openjdk.java.net/browse/JDK-8068370

I.e., the problem is that there's a race when multiple threads try to fork and exec binaries that were recently written to disk. The underlying problem is that Linux has no API to do this reliably (as far as we can tell). Workarounds are either to retry when it happens, to use a helper process to do the forking, or otherwise serialize the forking.

@ulfjack ulfjack added category: service APIs P2 We'll consider working on this in future. (Assignee optional) type: bug labels Jul 11, 2017
@ulfjack
Copy link
Contributor Author

ulfjack commented Jul 11, 2017

If this becomes more pressing, the best we can do is retry the subprocess execution, with the downside that that could hide other errors.

@ola-rozenfeld
Copy link
Contributor

Wait, this is an IOException, not a CommandException or AbnormalTerminationException. That means after my fix is in, it will be converted to INTERNAL status and retried by the client.
Still bad, but probably will eliminate it in practice.

bazel-io pushed a commit that referenced this issue Jul 13, 2017
TESTED=remote worker, triggered some errors
RELNOTES: fixes #3305, helps #3356
PiperOrigin-RevId: 161722997
@ulfjack ulfjack added P1 I'll work on this now. (Assignee required) Release blocker and removed P2 We'll consider working on this in future. (Assignee optional) labels Jul 14, 2017
@ulfjack
Copy link
Contributor Author

ulfjack commented Jul 14, 2017

This was happening less often because I accidentally made all remote execution single-threaded. :-( Now that we've fixed that, I can see this happening fairly commonly, so we should have a fix for this before we release.

@ulfjack ulfjack self-assigned this Jul 14, 2017
@ulfjack
Copy link
Contributor Author

ulfjack commented Jul 17, 2017

Ola says that it still happens even after my change, so reopening this.

@ulfjack ulfjack reopened this Jul 17, 2017
@ulfjack ulfjack assigned buchgr and unassigned ulfjack Jul 28, 2017
@buchgr
Copy link
Contributor

buchgr commented Aug 1, 2017

This was happening less often because I accidentally made all remote execution single-threaded. :-(

So even always calling fork in the same thread does not solve the problem?

@ulfjack
Copy link
Contributor Author

ulfjack commented Aug 1, 2017

I'm not sure which version Ola was looking at. I've never personally seen it happen with the single-threaded impl.

@ulfjack
Copy link
Contributor Author

ulfjack commented Aug 1, 2017

I haven't had time to test the current workaround, so I'm not how likely it is.

@ulfjack
Copy link
Contributor Author

ulfjack commented Aug 1, 2017

This might be lower priority if it's actually rare.

@buchgr
Copy link
Contributor

buchgr commented Aug 1, 2017

@ulfjack
Ok, I am having a look. I have also not seen it in the single threaded version. I am working on an implementation that has a single thread dedicated to spawning processes and then to asynchronously collect the results. Will let you know once I have results.

@AustinSchuh
Copy link
Contributor

@ulfjack We've seen this in mainline bazel a couple times in our verification builds.

@ola-rozenfeld
Copy link
Contributor

I just tried to run:
while bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 clean && bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 build --remote_accept_cached=false --remote_local_fallback=false --remote_timeout=3600 --spawn_strategy=remote --verbose_failures=true --tls_enabled=false --auth_enabled=false --remote_cache=localhost:8080 --remote_executor=localhost:8080 --jobs=20 //src:bazel; do echo; done
With a worker also having --jobs=20.
Took ~25 iterations to reproduce the error.

@buchgr
Copy link
Contributor

buchgr commented Aug 2, 2017

I could also reproduce it on third run using it Ola's command.

@buchgr
Copy link
Contributor

buchgr commented Aug 2, 2017

So I have a fix that seems to work reliably. I had it running for 4 hours with 128 concurrent actions on a cloud vm with 64 cores, as well as with 40 jobs for an hour on my workstation and I haven't experienced the error once.

The fix is to have a single thread dedicated to launching the process and then waiting for its completion asynchronously - very straightforward.

I ll provide a CL soonish.

@buchgr
Copy link
Contributor

buchgr commented Aug 3, 2017

I am marking this a release blocker, since we already have a fix for it and thus should also include it in 0.5.4.

buchgr added a commit that referenced this issue Aug 7, 2017
Spawning processes concurrently from multiple threads doesn't work
reliably on Linux (see bug for details).

This change introduces simply puts a synchronized block around fork().

I have tested this change by repeatedly running a build of bazel for 2
hours on a 64 core machine with up to 200 concurrent actions. There
wasn't a single failure.

PiperOrigin-RevId: 164450559
buchgr added a commit that referenced this issue Aug 11, 2017
Spawning processes concurrently from multiple threads doesn't work
reliably on Linux (see bug for details).

This change introduces simply puts a synchronized block around fork().

I have tested this change by repeatedly running a build of bazel for 2
hours on a 64 core machine with up to 200 concurrent actions. There
wasn't a single failure.

PiperOrigin-RevId: 164450559
buchgr added a commit that referenced this issue Aug 21, 2017
Spawning processes concurrently from multiple threads doesn't work
reliably on Linux (see bug for details).

This change introduces simply puts a synchronized block around fork().

I have tested this change by repeatedly running a build of bazel for 2
hours on a 64 core machine with up to 200 concurrent actions. There
wasn't a single failure.

PiperOrigin-RevId: 164450559
buchgr added a commit that referenced this issue Aug 22, 2017
Spawning processes concurrently from multiple threads doesn't work
reliably on Linux (see bug for details).

This change introduces simply puts a synchronized block around fork().

I have tested this change by repeatedly running a build of bazel for 2
hours on a 64 core machine with up to 200 concurrent actions. There
wasn't a single failure.

PiperOrigin-RevId: 164450559
bazel-io pushed a commit that referenced this issue Aug 25, 2017
Baseline: 6563b2d

Cherry picks:
   + d4fa181:
     Use getExecPathString when getting bash_main_file
   + 837e1b3:
     Windows, sh_bin. launcher: export runfiles envvars
   + fe9ba89:
     grpc: Consolidate gRPC code from BES and Remote Execution. Fixes
     #3460, #3486
   + e8d4366:
     Automated rollback of commit
     496d3de.
   + 242a434:
     bes,remote: update default auth scope.
   + 793b409:
     Windows, sh_bin. launcher: fix manifest path
   + 7e4fbbe:
     Add --windows_exe_launcher option
   + 91fb38e:
     remote_worker: Serialize fork() calls. Fixes #3356
   + b79a9fc:
     Quote python_path and launcher in
     python_stub_template_windows.txt
   + 4a2e17f:
     Add build_windows_jni.sh back
   + ce61d63:
     don't use methods and classes removed in upstream dx RELNOTES:
     update dexing tools to Android SDK 26.0.1
   + 5393a49:
     Make Bazel enforce requirement on build-tools 26.0.1 or later.
   + 5fac03570f80856c063c6019f5beb3bdc1672dee:
     Fix --verbose_failures w/ sandboxing to print the full command
     line
   + f7bd1acf1f96bb7e3e19edb9483d9e07eb5af070:
     Only patch in C++ compile features when they are not already
     defined in crosstool
   + d7f5c12:
     Bump python-gflags to 3.1.0, take two
   + 3cb136d:
     Add python to bazel's dockerfiles

New features:

  - Do not disable fully dynamic linking with ThinLTO when invoked
    via LIPO options.

Important changes:

  - Ignore --glibc in the Android transition.
  - Remove --experimental_android_use_singlejar_for_multidex.
  - nocopts now also filter copts
  - The Build Event Service (BES) client now properly supports
    Google Applicaton Default Credentials.
  - update dexing tools to Android SDK 26.0.1
  - Bazel Android support now requires build-tools 26.0.1 or later.
  - Fix a bug in the remote_worker that would at times make it crash on Linux. See #3356
  - The java_proto_library rule now supports generated sources. See #2265
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) type: bug
Projects
None yet
Development

No branches or pull requests

4 participants