Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decode popen output using the system locale if UTF-8 decoding fails. #380

Merged
merged 11 commits into from
Feb 11, 2021
Merged

Decode popen output using the system locale if UTF-8 decoding fails. #380

merged 11 commits into from
Feb 11, 2021

Conversation

Cielquan
Copy link
Contributor

@Cielquan Cielquan commented Feb 2, 2021

Based on my comments in the issue #309 I created a decoding wrapper function. As utf8 is the python default I kept it hard coded as first choice of the encodings to try. If utf8 fails sys.stdout.encoding, sys.getdefaultencoding() and locale.getpreferredencoding() (in this order) deliver possible other encodings to try. If all decoding attempts fail the UnicodeDecodeError exception is risen.

I'm open for other orders of the encoding sources.

If this fix is well received, I will add the missing tests and fix the broken ones if there are any.

fixes #309

fixes #309 by trying to decode popen output with utf8 first
and on error tries other encodings provided by the systems
preferences.
Copy link
Collaborator

@cjolowicz cjolowicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this maybe be simplified to try UTF-8 first, and fall back to locale.getpreferredencoding in case of a UnicodeDecodeError?

I would skip sys.getdefaultencoding because it appears to be hardcoded to UTF-8. See CPython source (1, 2) and Victor Stinner's notes.

sys.stdout.encoding does not seem relevant here, as we are decoding the output of another process. IIUC, on a legacy Windows system the process output will typically be CP-1252, while our own stdout might be a console device, which is always UTF-8. On non-Windows systems, it's just the locale encoding.

(The code assumes that a wrong encoding will result in a UnicodeDecodeError, and not simply in garbled output. As far as I'm aware, this is rare when decoding to UTF-8, at least from ASCII-based encodings [citation needed 😉], so I guess this is fine.)

@stsewd
Copy link
Collaborator

stsewd commented Feb 2, 2021

@Cielquan you could try installing nox from this branch to test if this solves your problem (if you haven't already!)

@Cielquan
Copy link
Contributor Author

Cielquan commented Feb 3, 2021

Thanks for the fast replies you two!

@cjolowicz
Thanks for the additional information. I should have looked more thoroughly through the docs. 😅
I think its a good idea to discard sys.stdout.encoding and sys.getdefaultencoding. 👍🏾
Yeah garbled output is not detected. That was another reason to try utf-8 as first encoding. E.g. latin-1 will work on our german umlaut ü but the output is not what one wants (the same goes for cp1252):

>>> "ü".encode("utf8").decode("latin-1")
'ü'

But I think this is better than an UnicodeDecodeError.

@stsewd
I monkeypatched nox.command.popen inside my noxfile.py to overwrite the hard coded utf8 with sys.stdout.encoding, which worked for me and the windows GHA CI succeeded.
After that I developed the added function inside my noxfile.py.
I tested it on my dev machine (linux), a windows vm and all three OS in CI. It ran everywhere.

@Cielquan
Copy link
Contributor Author

Cielquan commented Feb 3, 2021

Sooo ...

I made adjustments proposed by @cjolowicz and added some tests.

I use the b"\x95" byte which caused my problems (see my comment) as a test example. It does not work with utf-8 and asccii but does work with cp1252 so it was a good candidate to test all possible outcomes.

@Cielquan
Copy link
Contributor Author

Cielquan commented Feb 3, 2021

Some remarks on the CI test:

  • python3.9 does not run in appveyor somehow (see here):
...
nox > Running session tests-3.9
nox > Session tests-3.9 skipped: Python interpreter 3.9 not found.
Build success
  • coverage is 100% in Travis for linux, but on my local linux machine I don't reach 100% when running $ nox -e tests:
...
Name                  Stmts   Miss Branch BrPart  Cover   Missing
-----------------------------------------------------------------
nox/__init__.py           6      0      0      0   100%
nox/__main__.py          17      0      4      0   100%
nox/_decorators.py       38      0      2      0   100%
nox/_option_set.py       99      0     34      0   100%
nox/_options.py          57      0     24      0   100%
nox/_parametrize.py      74      0     36      0   100%
nox/command.py           57      0     26      0   100%
nox/logger.py            46      0     12      0   100%
nox/manifest.py         131      0     72      0   100%
nox/popen.py             33      0      6      0   100%
nox/registry.py          20      0      6      0   100%
nox/sessions.py         245      0     76      0   100%
nox/tasks.py             91      0     32      0   100%
nox/tox_to_nox.py        18      0      2      0   100%
nox/virtualenv.py       177     22     54      0    87%   194-206, 219-248
nox/workflow.py          15      0      6      0   100%
-----------------------------------------------------------------
TOTAL                  1124     22    392      0    98%
  • coverage in appveyor is not 100% because of missing parts in virtualenv.py and workflow.py. I can recommend coverage-conditional-plugin to add conditional pragma: no covers to the code base. You then can test for 100% coverage in any case.

EDIT1: spelling

@cjolowicz
Copy link
Collaborator

on my local linux machine I don't reach 100%

The test suite skips conda tests if you don't have conda installed, and those missing lines are from CondaEnv. Could that be what caused it?

@Cielquan
Copy link
Contributor Author

Cielquan commented Feb 3, 2021

Well yeah, I totally forgot about that. Thanks.

@Cielquan
Copy link
Contributor Author

Cielquan commented Feb 8, 2021

Updated the branch

@Cielquan
Copy link
Contributor Author

Updated the branch.

Anything else needed from me? @theacodes

Copy link
Collaborator

@cjolowicz cjolowicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Cielquan thanks for moving this forward!

Today I found some time to test this on a legacy system running Windows Server
2008 (64-bit). I could reproduce the original issue, and I could also confirm
that your fix works 🚀 See details below.

I have one suggestion, though. As you can see in the last command output below,
when both encodings fail, Python will chain the tracebacks for both
UnicodeDecodeError exceptions, including these two error messages:

  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
  • UnicodeDecodeError: '[utf-8, BIG5]' codec can't decode byte 0x95 in position 0: incomplete multibyte sequence

You can see that modifying the second exception during flight is both
unnecessary and slightly confusing. I would therefore suggest that you eliminate
the second try...catch block, keeping only the body of the try clause.
I'll leave a code suggestion to clarify what I mean.

Stylistically, I would also suggest to eliminate the decoded_output variable,
and simply return instead of assigning to it. But this is just a nitpick, feel
free to ignore.

Here are the details, FTR:

On the Windows system I used, locale.getpreferredencoding returns CP-1252.
sys.stdout.encoding returns UTF-8 when writing to the command prompt, and
CP-1252 when redirecting output to a file.

Like you, I used the Unicode BULLET codepoint for testing, which is 0x95
when encoded in CP-1252.

>>> import unicodedata
>>> unicodedata.name(bytes([0x95]).decode("cp1252"))
'BULLET'

This byte sequence causes an error when decoded to UTF-8:

>>> bytes([0x95]).decode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte

I used the following noxfile.py for testing:

import nox

@nox.session
def tests(session):
    session.run("python", "-c", 'import sys; print(f"{sys.stdout.encoding = }")')
    session.run("python", "-c", 'import sys; sys.stdout.buffer.write(bytes([0x95]))', silent=True)

We pass silent to get Nox to capture stdout. This is done by default when you get here via session.install.

Below is the output when running nox from the latest release, and from your branch. Expand the output using the arrows.

Reproducing the error:
(venv) C:\Users\cjolowicz>nox
nox > Running session tests
nox > Creating virtual environment (virtualenv) using python.exe in .nox\tests
nox > python -c import sys; print(f"{sys.stdout.encoding = }")
sys.stdout.encoding = 'utf-8'
nox > python -c import sys; sys.stdout.buffer.write(bytes([0x95]))
nox > Session tests raised exception UnicodeDecodeError('utf-8', b'\x95', 0, 1, 'invalid start byte')
Traceback (most recent call last):
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\sessions.py", line 549, in execute
    self.func(session)
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\_decorators.py", line 53, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\cjolowicz\noxfile.py", line 6, in tests
    session.run("python", "-c", 'import sys; sys.stdout.buffer.write(bytes([0x95]))', silent=True)
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\sessions.py", line 262, in run
    return self._run(*args, env=env, **kwargs)
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\sessions.py", line 318, in _run
    return nox.command.run(args, env=env, paths=self.bin_paths, **kwargs)
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\command.py", line 111, in run
    return_code, output = popen(
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\popen.py", line 48, in popen
    return return_code, out.decode("utf-8") if out else ""
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
nox > Session tests failed.
Reproducing the error with output redirected:
(venv) C:\Users\cjolowicz>nox > nox.log 2>&1
(venv) C:\Users\cjolowicz>type nox.log
nox > Running session tests
nox > Creating virtual environment (virtualenv) using python.exe in .nox\tests
nox > python -c import sys; print(f"{sys.stdout.encoding = }")
sys.stdout.encoding = 'cp1252'
nox > python -c import sys; sys.stdout.buffer.write(bytes([0x95]))
nox > Session tests raised exception UnicodeDecodeError('utf-8', b'\x95', 0, 1, 'invalid start byte')
Traceback (most recent call last):
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\sessions.py", line 549, in execute
    self.func(session)
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\_decorators.py", line 53, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\cjolowicz\noxfile.py", line 6, in tests
    session.run("python", "-c", 'import sys; sys.stdout.buffer.write(bytes([0x95]))', silent=True)
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\sessions.py", line 262, in run
    return self._run(*args, env=env, **kwargs)
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\sessions.py", line 318, in _run
    return nox.command.run(args, env=env, paths=self.bin_paths, **kwargs)
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\command.py", line 111, in run
    return_code, output = popen(
  File "c:\users\cjolowicz\venv\lib\site-packages\nox\popen.py", line 48, in popen
    return return_code, out.decode("utf-8") if out else ""
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
nox > Session tests failed.
Confirming that the fix works (happy case):
(venv-fix) C:\Users\build\cjolowicz>nox
nox > Running session tests
nox > Creating virtual environment (virtualenv) using python.exe in .nox\tests
nox > python -c import sys; print(f"{sys.stdout.encoding = }")
sys.stdout.encoding = 'utf-8'
nox > python -c import sys; sys.stdout.buffer.write(bytes([0x95]))
nox > Session tests was successful.

(venv-fix) C:\Users\build\cjolowicz>nox > nox.log 2>&1

(venv-fix) C:\Users\build\cjolowicz>type nox.log
nox > Running session tests
nox > Creating virtual environment (virtualenv) using python.exe in .nox\tests
nox > python -c import sys; print(f"{sys.stdout.encoding = }")
sys.stdout.encoding = 'cp1252'
nox > python -c import sys; sys.stdout.buffer.write(bytes([0x95]))
nox > Session tests was successful.
Confirming that the fix works (error case: output cannot be decoded from the preferred encoding either):
❯ env LC_ALL=zh_TW.big5 nox
nox > Running session tests
nox > Creating virtual environment (virtualenv) using python in .nox/tests
nox > python -c import sys; print(f"{sys.stdout.encoding = }")
sys.stdout.encoding = 'big5'
nox > python -c import sys; sys.stdout.buffer.write(bytes([0x95]))
nox > Session tests raised exception UnicodeDecodeError('big5', b'\x95', 0, 1, 'incomplete multibyte sequence')
Traceback (most recent call last):
  File "/tmp/nox-unicode/venv/lib/python3.9/site-packages/nox/popen.py", line 31, in decode_output
    decoded_output = output.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/nox-unicode/venv/lib/python3.9/site-packages/nox/sessions.py", line 551, in execute
    self.func(session)
  File "/tmp/nox-unicode/venv/lib/python3.9/site-packages/nox/_decorators.py", line 53, in __call__
    return self.func(*args, **kwargs)
  File "/tmp/nox-unicode/noxfile.py", line 6, in tests
    session.run("python", "-c", 'import sys; sys.stdout.buffer.write(bytes([0x95]))', silent=True)
  File "/tmp/nox-unicode/venv/lib/python3.9/site-packages/nox/sessions.py", line 264, in run
    return self._run(*args, env=env, **kwargs)
  File "/tmp/nox-unicode/venv/lib/python3.9/site-packages/nox/sessions.py", line 320, in _run
    return nox.command.run(args, env=env, paths=self.bin_paths, **kwargs)
  File "/tmp/nox-unicode/venv/lib/python3.9/site-packages/nox/command.py", line 111, in run
    return_code, output = popen(
  File "/tmp/nox-unicode/venv/lib/python3.9/site-packages/nox/popen.py", line 74, in popen
    return return_code, decode_output(out) if out else ""
  File "/tmp/nox-unicode/venv/lib/python3.9/site-packages/nox/popen.py", line 38, in decode_output
    decoded_output = output.decode(second_encoding)
UnicodeDecodeError: '[utf-8, BIG5]' codec can't decode byte 0x95 in position 0: incomplete multibyte sequence
nox > Session tests failed.

nox/popen.py Outdated Show resolved Hide resolved
@Cielquan
Copy link
Contributor Author

@cjolowicz Thanks for your thorough answer. I also learned some new things 😀

I agree in all your points and will gladly take your suggestions and fix the tests accordingly later (hopefully today elsewise tomorrow).


I think the decoded_output variable must be a relic from earlier versions for the code or something even before the PR.
The double exceptions originate from an older version also I think. Earlier I raised the manipulated UnicodeDecodeError with a from UnicodeDecodeError or so.

@Cielquan
Copy link
Contributor Author

Ok fixed the tests. Was only one though.

With this the PR should be ready for merging?

Copy link
Collaborator

@cjolowicz cjolowicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@theacodes theacodes changed the title fix decoding issues with popen output Decode popen output using the system locale if UTF-8 decoding fails. Feb 11, 2021
@theacodes theacodes merged commit 211702d into wntrblm:master Feb 11, 2021
@theacodes
Copy link
Collaborator

Looks great! Thank you both!

@theacodes
Copy link
Collaborator

Just a heads up @Cielquan and @cjolowicz:

This pull request is eligible for payment via our Open Collective. You can each file an expense for $20 USD for this contribution. Thanks!

@cjolowicz
Copy link
Collaborator

This pull request is eligible for payment via our Open Collective.

thank you for setting this up @theacodes ❤️

@Cielquan
Copy link
Contributor Author

Thanks for merging @theacodes and thanks again for reviewing @cjolowicz.
This OpenCollective thing is nice 👍🏾

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

UnicodeDecodeError bug
4 participants