Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default to a UTF-8 locale in Java stub template #15159

Closed
wants to merge 1 commit into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Apr 1, 2022

On non-macOS Unix, without any locale variable set, the OpenJDK defaults
to using ASCII rather than UTF-8 as the encoding for file system paths
(i.e., the value of the sun.jnu.encoding property).

Fixes #15106

@fmeum
Copy link
Collaborator Author

fmeum commented Apr 1, 2022

@comius I opted to go with the more hermetic variant of always setting LC_ALL=C.UTF-8 rather than inheriting it. I am a bit worried about compatibility though: Can we assume that every host has a UTF-8 locale?

@fmeum
Copy link
Collaborator Author

fmeum commented Apr 1, 2022

The test failures indicate that there are all sorts of compatibility issues with this approach:

  1. Not all systems seem have a UTF-8 locale installed by default in a way picked up by all tools: https://storage.googleapis.com/bazel-untrusted-buildkite-artifacts/4c13a09c-ed0c-4c06-857a-282bd909b126/src/test/shell/bazel/bazel_test_test/shard_3_of_3/test.log.
  2. Setting a UTF-8 locale messes with tools output as reported by Bazel (see https://storage.googleapis.com/bazel-untrusted-buildkite-artifacts/48d00f3b-d64f-4940-b45d-bfc8d8df4928/src/test/shell/integration/run_test/shard_1_of_3/test.log).

@fmeum fmeum changed the title Set LC_ALL=C.UTF-8 for all actions Default to a UTF-8 locale in Java stub template Apr 2, 2022
@fmeum
Copy link
Collaborator Author

fmeum commented Apr 2, 2022

Given the large number of test failures and the fact that the test encyclopedia explicitly documents that all locale variables are unset, I gave up on the original approach of setting LC_ALL for all actions and am now only setting it in java_stub_template.txt and only if none of the locale variables is set explicitly.

@fmeum fmeum force-pushed the non-ascii-file-names branch 3 times, most recently from d84452d to d1d1725 Compare April 2, 2022 07:33
@sgowroji sgowroji added the team-Rules-Java Issues for Java rules label Apr 5, 2022
@comius
Copy link
Contributor

comius commented Apr 8, 2022

cc @cushon
There is some prior art in JavaCompileAction and JavaHeaderCompileAction: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/rules/java/JavaCompileActionBuilder.java;l=208;drc=cb01bde9369f8197cb54c34b14aa41f0dcc903cf

If something along this way was possible for java_test rule, I would prefer this instead of making the stub template more complex. Testing such change internally would probably be more involved, but I don't mind doing it if it's the right thing.

It seems to me the right thing to do would be to have it on all test rules, because I don't see Java anything special in the bug report (test in other languages could use UTF-8 filenames - and this is also another reason why not to put it into Java stub), however this would be more involved and would probably require migration by parts.

  1. Not all systems seem have a UTF-8 locale installed by default in a way picked up by all tools

Does setting LANG instead of LC_TYPE help here?

  1. Setting a UTF-8 locale messes with tools output as reported by Bazel

Is there a way to fix this? How does the output look like? Should just the test be fixed?

@comius comius self-requested a review April 8, 2022 05:39
@fmeum
Copy link
Collaborator Author

fmeum commented Apr 8, 2022

It seems that C.UTF-8 is not universally supported, see https://unix.stackexchange.com/questions/597962/how-widespread-is-the-c-utf-8-locale. I will try again with en-US.UTF-8, which should be.

@comius Assuming this would require a migration anyway, would you prefer to set this on all actions or just on all test actions?

@comius
Copy link
Contributor

comius commented Apr 8, 2022

@comius Assuming this would require a migration anyway, would you prefer to set this on all actions or just on all test actions?

All test actions is where such filenames may occur.

All actions is too wide. This might conflict some other requirements of specific actions, so let's avoid it.

@fmeum
Copy link
Collaborator Author

fmeum commented Apr 9, 2022

@comius This seems pretty subtle based on the CI results: The Ubuntu 18.04 runner doesn't have the en_US.UTF-8 locale, the CentOS runner doesn't have the C.UTF-8 locale. That makes it difficult to determine a working locale statically as setting a non-existent locale can cause warnings to be printed. I'm not sure I see options other than determining a suitable locale at runtime in the Java stub, but of course that doesn't handle non-Java tests.

@sgowroji
Copy link
Member

Hello @fmeum, Thank you for sharing your contributions in PR. Can you please fix the above buildkite presubmit build failures ?

@sgowroji sgowroji added the awaiting-user-response Awaiting a response from the author label Apr 21, 2022
@fmeum
Copy link
Collaborator Author

fmeum commented Apr 21, 2022

@comius Given the problem that neither C.UTF-8 nor en_US.UTF-8 can be assumed to exist, what kind of approach do you suggest? Would including conditional logic in the Java stub be fine knowing that a static choice is difficult?

@cushon
Copy link
Contributor

cushon commented Apr 21, 2022

The Ubuntu 18.04 runner doesn't have the en_US.UTF-8 locale, the CentOS runner doesn't have the C.UTF-8 locale. That makes it difficult to determine a working locale statically

Could we use platforms for this, to differentiate between those target environments and then statically pick the locale to use here?

Also, for my understanding, is this going to make the default behaviour the JDK will have after JEP 400: UTF-8 by Default?

@fmeum
Copy link
Collaborator Author

fmeum commented Apr 21, 2022

Could we use platforms for this, to differentiate between those target environments and then statically pick the locale to use here?

That may be possible. It isn't going to be easy though if the available locales already differ between major distributions - should those be reflected in platform constraints?

What I'm thinking: Maybe a prominent hint in the docs instructing users to set either --action_env=LC_CTYPE=C.UTF-8 or --action_env=LC_CTYPE=C.en_US.UTF-8 in their .bazelrc would serve the same purpose with less additional complexity if we can't find something that "just works".

Also, for my understanding, is this going to make the default behaviour the JDK will have after JEP 400: UTF-8 by Default?

I wasn't aware of this effort, but based on a quick glance I would say yes. It even mentions sun.jnu.encoding.

@cushon
Copy link
Contributor

cushon commented Apr 22, 2022

should those be reflected in platform constraints?

My sense is that it would be nice if platforms could easily express this, but that it probably isn't a straightforward approach. I don't have as crisp an answer to 'what belongs in platform constraints' as I'd like. Maybe someone else reading this does?

Would including conditional logic in the Java stub be fine knowing that a static choice is difficult?

What does the change to do this in the stub template look like? All else being equal it'd be nice to keep more logic out of the stub template, but there's precedent for putting work-arounds in the stub template (e.g. create_and_run_classpath_jar / #6354), and it doesn't seem like any of the options here are perfect, so that seems work considering to me.

@fmeum
Copy link
Collaborator Author

fmeum commented Apr 22, 2022

Would including conditional logic in the Java stub be fine knowing that a static choice is difficult?

What does the change to do this in the stub template look like? All else being equal it'd be nice to keep more logic out of the stub template, but there's precedent for putting work-arounds in the stub template (e.g. create_and_run_classpath_jar / #6354), and it doesn't seem like any of the options here are perfect, so that seems work considering to me.

It would essentially amount to querying locale for C.UTF-8, en_US.UTF-8, and POSIX in order, setting the first one that is available as LC_CTYPE (setting POSIX explicitly is probably not necessary though, as it seems to be the default).

@cushon
Copy link
Contributor

cushon commented Apr 22, 2022

It would essentially amount to querying locale for C.UTF-8, en_US.UTF-8, and POSIX in order, setting the first one that is available as LC_CTYPE (setting POSIX explicitly is probably not necessary though, as it seems to be the default).

Thanks, that sounds reasonable to me.

On non-macOS Unix, without any locale variable set, the JVM defaults to
using ASCII rather than UTF-8 as the encoding for file system paths
(i.e., the value of the `sun.jnu.encoding` property).
@fmeum
Copy link
Collaborator Author

fmeum commented Apr 22, 2022

@cushon I implemented this approach.

@cushon
Copy link
Contributor

cushon commented Apr 22, 2022

This makes sense to me, although I also want @comius's input.

What do you think about also putting this change behind an --incompatible_ flag? Internally we'd need to do more testing and deal with tests relying on the current behaviour before making this change, and I imagine other users will have the same issue, and a flag is the usual way to manage that.

@fmeum
Copy link
Collaborator Author

fmeum commented Apr 22, 2022

I will wait for @comius input on the general approach, but agree that this should be moved behind a flag.

@sgowroji sgowroji added awaiting-review PR is awaiting review from an assigned reviewer and removed awaiting-user-response Awaiting a response from the author labels Apr 25, 2022
@comius
Copy link
Contributor

comius commented Apr 27, 2022

This implementation looks like the most reasonable default behaviour.

What do you think about also putting this change behind an --incompatible_ flag? Internally we'd need to do more testing and deal with tests relying on the current behaviour before making this change, and I imagine other users will have the same issue, and a flag is the usual way to manage that.

I'd avoid additional flag if possible, because we already have many and I wouldn't expect a significant breakages to happen. In this case Bazel LTS policy doesn't require us to do the flag dance - introducing it for one release, flipping it and removing it.

I'd also assume the stub template is something, that is probably already overriden/replaced by large users (like Google does)

@comius comius removed the awaiting-review PR is awaiting review from an assigned reviewer label Apr 28, 2022
@bazel-io bazel-io closed this in 17cfa01 Apr 29, 2022
@fmeum fmeum deleted the non-ascii-file-names branch July 1, 2022 07:21
@fmeum
Copy link
Collaborator Author

fmeum commented Jul 1, 2022

@bazel-io flag

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Rules-Java Issues for Java rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-ascii filenames cause InvalidPathException in java_test
4 participants