Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add use_default_shell_env = True to all ctx.actions.run_shell rules #44549

Merged
merged 2 commits into from
Aug 31, 2021

Conversation

Flamefire
Copy link
Contributor

To start subprograms, even simple bash snippets, Bazel uses an
executable process-wrapper which is potentially built using a custom
toolchain and hence requires a set up LD_LIBRARY_PATH.
Ommitting the use_default_shell_env (defaulting to false) clears the
whole environment and the binary may try to use older system libs such
as /lib64/libstdc++.so causing it to fail in case it is (much) older
than the used libstdc++ from the custom toolchain which is very common
in HPC environments.
Hence I added use_default_shell_env = True as already done in e.g.
_local_genrule_impl.

@google-ml-butler google-ml-butler bot added the size:XS CL Change Size: Extra Small label Nov 3, 2020
@google-cla google-cla bot added the cla: yes label Nov 3, 2020
@gbaned gbaned self-assigned this Nov 3, 2020
@gbaned gbaned added the comp:core issues related to core part of tensorflow label Nov 3, 2020
@gbaned gbaned requested a review from chsigg November 3, 2020 13:23
@gbaned gbaned added the awaiting review Pull request awaiting review label Nov 5, 2020
@Flamefire
Copy link
Contributor Author

For some reason I was also seeing the environment dropped from a flatc --python invocation which happens through flatbuffer_py_library. However prior to 2.4 that was working. Checking the compile log for 2.3.1 I don't see any invocation of flatc --python so it seems it was simply not used until 2.4 and hence not compiled which explains why the faulty Bazel rule didn't cause problems before

@chsigg
Copy link
Contributor

chsigg commented Nov 18, 2020

@buchgr, would you mind to take a look at this change? I fear it might not play nice with RBE.

@chsigg
Copy link
Contributor

chsigg commented Nov 19, 2020

@buchgr, would you mind to take a look at this change? I fear it might not play nice with RBE.

@buchgr confirmed that this prevent RBE caching between different users, and would cause problems when the host's LD_LIBRARY_PATH doesn't fit the remote environment.

Is there any chance you could ldconfig to work around this in your environment?

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Nov 21, 2020
@gbaned
Copy link
Contributor

gbaned commented Nov 25, 2020

@Flamefire Can you please check @chsigg's comments and keep us posted ? Thanks!

@gbaned gbaned added awaiting review Pull request awaiting review stat:awaiting response Status - Awaiting response from author and removed awaiting review Pull request awaiting review labels Nov 25, 2020
@Flamefire
Copy link
Contributor Author

@gbaned If you are referring to

Is there any chance you could ldconfig to work around this in your environment?

No, this is not possible. Usual HPC environments have multiple versions of a software installed and decide via environment variables which is active (in particular LD_LIBRARY_PATH and PATH for binaries, CPATH & LIBRARY_PATH for dependencies)

I'm unsure why this is now a problem to add this as e.g. _local_genrule_impl already uses this.

Unfortunately I don't understand enough about Bazel and especially RBE but for us sysadmins it is a real pain that the environment gets cleared on multiple levels

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Nov 29, 2020
@gbaned
Copy link
Contributor

gbaned commented Dec 4, 2020

@chsigg Can you please take a look on above comments from @Flamefire. Thanks!

@gbaned gbaned added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 4, 2020
@chsigg
Copy link
Contributor

chsigg commented Dec 17, 2020

Another idea, would --action_env or --host_action_env maybe do what you need?

@Flamefire
Copy link
Contributor Author

@chsigg No this does not work. This came up in a recent Bazel issue (bazelbuild/bazel#12579 (comment)) where it was observed that action_env and such are seemingly not passed when use_default_shell_env=True is not set.
I haven't tried --host_action_env yet (it is very new) but I assume --action_env already sets those, especially as we use --distinct_host_configuration=false already. It also seems like there is an as-of-yet unreleased bugfix to the host_action_env flag: bazelbuild/bazel@e667082

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 20, 2020
@gbaned
Copy link
Contributor

gbaned commented Jan 5, 2021

@Flamefire Can you please resolve conflicts? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Jan 5, 2021
@chsigg
Copy link
Contributor

chsigg commented Jan 6, 2021

To be honest, I don't know how we can resolve this. @buchgr, do you have any suggestions?

There is a need to properly pass environment variables to build actions, but bazel sometimes drops them (due to bugs, IIUC). The change here would work around those bugs, but it doesn't play well with RBE.

@Flamefire
Copy link
Contributor Author

There is a need to properly pass environment variables to build actions, but bazel sometimes drops them (due to bugs, IIUC). The change here would work around those bugs, but it doesn't play well with RBE.

IIUC the use_default_shell_env=True makes Bazel pass the --action_env variables and PATH/LD_LIBRARY_PATH while without (=False) it will pass a cleaned up environment with nothing in it.
So this "sometimes" is the case for use_default_shell_env=False, which seems to me like the design, not a bug.
Having said that: There certainly are bugs, some of which I reported and some seem to be "unfixable" (Bazel contains a binary which requires LD_LIBRARY_PATH itself, "fixed" by building Bazel statically)

In the case of flatc (one of the fixed things here) a binary is called which has dependencies, so LD_LIBRARY_PATH must be passed. And as Bazel doesn't seem to have anything else but use_default_shell_env=True to do so, it should be used.
It could be checked what effect --distinct_host_configuration=true (default) has here in respect to RBE, maybe it already means that the hosts LD_LIBRARY_PATH isn't passed to the RBE even with use_default_shell_env=True? I'd expect so but am not using RBE.

Anyway, 3 strong arguments in favor of this change:

  • It is only the flatbuffers compilation i.e. a code generation which is likely fast, so even a potential cache miss isn't expensive
  • Short of patching the source there is no other way (AFAIK) by now to run a flatc build with a different (LD_)LIBRARY_PATH (which most likely happens when it is modified at all)
  • The same is already used for protoc, where it is seemingly ok. So why the discussion here?

@gbaned
Copy link
Contributor

gbaned commented Aug 24, 2021

@Flamefire Can you please resolve conflicts? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Aug 27, 2021
To compile flatbuffers definition Bazel starts the  flatbuffers compiler
flatc which is potentially built using a custom
toolchain and hence requires a set up LD_LIBRARY_PATH.
Ommitting the `use_default_shell_env` (defaulting to false) clears the
whole environment and the binary may try to use older system libs such
as /lib64/libstdc++.so causing it to fail in case it is (much) older
than the used libstdc++ from the custom toolchain which is very common
in HPC environments.
Hence I added `use_default_shell_env = True` as already done in e.g.
`_local_genrule_impl`.
@Flamefire
Copy link
Contributor Author

@gbaned Rebased

@Flamefire Flamefire closed this Aug 30, 2021
@Flamefire Flamefire reopened this Aug 30, 2021
@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Aug 30, 2021
@Flamefire
Copy link
Contributor Author

Had to close+reopen due to Github not detecting the change...

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 30, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 30, 2021
@copybara-service copybara-service bot merged commit 52aa0e9 into tensorflow:master Aug 31, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 31, 2021
@Flamefire Flamefire deleted the add_default_shell_env branch September 1, 2021 07:34
Flamefire added a commit to Flamefire/llvm-project that referenced this pull request Sep 16, 2021
When building a tool in a non-standard environment (e.g. custom compiler path -> LD_LIBRARY_PATH set) then `use_default_shell_env = True` is required to run that tool in the same environment or otherwise the build will fail due to missing symbols.
See jax-ml/jax#7842 for this issue and tensorflow/tensorflow#44549 for related fix in TF
GMNGeoffrey pushed a commit to llvm/llvm-project that referenced this pull request Sep 20, 2021
When building a tool in a non-standard environment (e.g. custom
compiler path -> LD_LIBRARY_PATH set) then
`use_default_shell_env = True` is required to run that tool in the same
environment or otherwise the build will fail due to missing symbols.
See jax-ml/jax#7842 for this issue and
tensorflow/tensorflow#44549 for related fix in
TF.

Reviewed By: GMNGeoffrey

Differential Revision: https://reviews.llvm.org/D109873
rsanthanam-amd pushed a commit to ROCm/tensorflow-upstream that referenced this pull request Aug 29, 2022
PiperOrigin-RevId: 394106593
Change-Id: I2226f80fd3bef5e04a42174a8a519f612ae39985
rsanthanam-amd added a commit to ROCm/tensorflow-upstream that referenced this pull request Aug 29, 2022
…_without_soft_link

Merge pull request tensorflow#44549 from Flamefire:add_default_shell_env
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes comp:core issues related to core part of tensorflow size:XS CL Change Size: Extra Small
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants