Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Line length incorrect for non-ascii text #3825

Closed
snstanton opened this issue Mar 30, 2023 · 13 comments
Closed

Line length incorrect for non-ascii text #3825

snstanton opened this issue Mar 30, 2023 · 13 comments
Labels
needs-decision Awaiting a decision from a maintainer

Comments

@snstanton
Copy link

version: ruff 0.0.260

The line length calculation for non-ascii characters is incorrect:

x = '12345678901234567890123456789012345678901234567890123456789012345678901234'
y = 'これは長い列です90これは長い列です90これは長い列です90これは長い列です90これは長い列です90これは長い列です90これは長い列です90これは長'

Both lines are under the character limit, but ruff flags the second line as too long, reporting the number of bytes rather than the number of characters. This changed recently. 0.254 accepted this file, but 0.260 rejects it:

ruff bug.py
bug.py:2:81: E501 Line too long (140 > 80 characters)

pyproject.toml:

[tool.ruff]
line-length = 80

select = ["E"]

[tool.ruff.flake8-quotes]
inline-quotes = "single"
@snstanton snstanton changed the title Line length calculation does not work on Line length incorrect for non-ascii text Mar 30, 2023
@charliermarsh
Copy link
Member

It's not quite the number of bytes (or, at least, it shouldn't be). Instead, it's the unicode width (or, at least, it should be). The relevant PR is here: #3714.

@Yasu-umi
Copy link
Contributor

Yasu-umi commented Apr 5, 2023

Hi! #3714 is finished, so this issue is already solved ?
I added the following case to the crates/ruff/resources/test/fixtures/pycodestyle/E501.py in current main branch, and running cargo run -p ruff_cli -- check crates/ruff/resources/test/fixtures/pycodestyle/E501.py --no-cache --select E501, but I didn't get expected result.

# Ok length=84
class JapaneseOk:
    """
    このドキュメントはテスト用のダミーです。十分に長い文章を用意しても大丈夫なことを確認してください。このくらいの長さではまだエラーになりません、
    """


# Error length=91
class JapaneseError:
    """
    このドキュメントはテスト用のダミーです。十分に長い文章を用意しても大丈夫なことを確認してください。このくらいの長さではまだエラーになりません、もーっともっと伸ばすとエラーになります。
    """

@charliermarsh
Copy link
Member

It should be, yeah. I'll \cc @MichaReiser who may know better!

@MichaReiser
Copy link
Member

MichaReiser commented Apr 5, 2023

We don't report either of the two examples because both strings contain no whitespace, meaning there's no reasonable position where the user can break the text over multiple lines.

https://github.com/charliermarsh/ruff/blob/46bcb1f725999540a315bc6fd44f455ec3cc38a9/crates/ruff/src/rules/pycodestyle/helpers.rs#L32-L36

Ruff reports both lines when adding whitespace somewhere inside the text

# Ok length=84
class JapaneseOk:
    """
    このドキュメントは テスト用のダミーです。十分に長い文章を用意しても大丈夫なことを確認してください。このくらいの長さではまだエラーになりません、
    """


# Error length=91
class JapaneseError:
    """
    このドキュメントは テスト用のダミーです。十分に長い文章を用意しても大丈夫なことを確認してください。このくらいの長さではまだエラーになりません、もーっともっと伸ばすとエラーになります。
    """

Ruff reports both lines because both exceed the unicode width of 88:

  • The first line has a width of 120 characters
  • The second line has a width of 152 characters

@Yasu-umi
Copy link
Contributor

Yasu-umi commented Apr 6, 2023

thanks for detailed explanation !

because both strings contain no whitespace

I misunderstood about this.

unicode width

I also misunderstood about this. (I assumed ruff uses &str.chars().count().)

Do you have any plan for move to grapheme cluster count ? (ex: use unicode_segmentation)
Of course, this counting is more expensive than &str.len(). but, is important (& easy for understanding err) for people use non-ascii chars language. please consider this!

@MichaReiser
Copy link
Member

I also misunderstood about this. (I assumed ruff uses &str.chars().count().)

Ruff did use chars().count() but we changed it in #3714 to align with black. Pycodestyle does use chars().count() because their reasoning is that they don't want to add the complexity to their tool and PEP8 recommends using ASCII only.

This commit in the black repository has a good explanation why they (and Ruff) choose unicode width over character or grapheme count.

@eviljeff
Copy link

eviljeff commented May 5, 2023

We have a case where ruff is reporting an over count line (112 vs our standard 88) but if we attempt to split it into two lines black auto-formats it back to one line because it'll fit on one line.

Is there still some misalignment between ruff and black, or is there a correct "fix" to this (other than adding #noqa: E501 to the end)?

         query = '남포역립카페추천 ˇjjtat닷컴ˇ ≡제이제이♠♣ 남포역스파 남포역op남포역유흥≡남포역안마남포역오피 ♠♣'

@MichaReiser
Copy link
Member

Oh, I only now realized that Black only changed the behavior for the preview style but not for the "stable" style. You can see this in the changes.md. Black also only changed the behavior for string literals only, not for all characters.

@eviljeff could you try using Black's preview style to see if that fixes the issue? I wonder if we should add a configuration setting to allow users to pick their preferred "width" measurement unit.

@eviljeff
Copy link

eviljeff commented May 5, 2023

Oh, I only now realized that Black only changed the behavior for the preview style but not for the "stable" style. You can see this in the changes.md. Black also only changed the behavior for string literals only, not for all characters.

@eviljeff could you try using Black's preview style to see if that fixes the issue?

It does fix that problem - black --preview splits the the string onto two lines which ruff accepts.

(Although if we wanted to use that as a workaround it wouldn't be a solution today as black --preview - for some reason - is rewriting "['categories']." to '[\'categories\'].' which ruff then raises as Q003 🤷‍♂️. I hope that's fixed in black before the change moves to stable)

I wonder if we should add a configuration setting to allow users to pick their preferred "width" measurement unit.

Personally, I prefer measuring for the line length seen in a (modern) code editor that supports unicode, because a line limit is mostly about readability, but practically, aligning with black is the most efficient solution (unless black adds a similar setting - which I would guess is unlikely given the project ethos of minimal configuration).

@eviljeff
Copy link

eviljeff commented May 5, 2023

(Although if we wanted to use that as a workaround it wouldn't be a solution today as black --preview - for some reason - is rewriting "['categories']." to '[\'categories\'].' which ruff then raises as Q003 🤷‍♂️. I hope that's fixed in black before the change moves to stable)

fwiw, looks like that bug(?) has been reported already as psf/black#3643 (the example I included was also a multi-line string)

@sgryjp
Copy link
Contributor

sgryjp commented May 16, 2023

@MichaReiser I think that the new behavior (using unicode-width instead of byte-count) should be an opt-in feature because Black's preview style is also an opt-in feature until Jan 2024... it's too far.

Note that, for the Black's preview style, "there are no guarantees around the stability of the output" according to their stability policy. I don't think we should use it for production code.

@charliermarsh charliermarsh added the question Asking for support or clarification label Jun 16, 2023
@charliermarsh charliermarsh added needs-decision Awaiting a decision from a maintainer and removed question Asking for support or clarification labels Jul 10, 2023
@MichaReiser
Copy link
Member

Black's preview style is about to be released and it includes the change to measure strings using the unicode width rather the character length. Ruff's formatter already uses unicode width.

We also updated our documentation to improve the explanation that it isn't the character limit when using emoji or east asian characters.

@MichaReiser MichaReiser closed this as not planned Won't fix, can't repro, duplicate, stale Dec 7, 2023
@jackjyq
Copy link

jackjyq commented Jul 18, 2024

Even ruff uses Unicode width to measure line length, the editor font might not respect that, which appears as an in-consistency line length in ruff's auto-formatting or E501 reporting.

To resolve this, your favorite editor shall employ a carefully designed font.

For example Sarasa-Gothic

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-decision Awaiting a decision from a maintainer
Projects
None yet
Development

No branches or pull requests

7 participants