Line length incorrect for non-ascii text #3825

snstanton · 2023-03-30T18:40:49Z

version: ruff 0.0.260

The line length calculation for non-ascii characters is incorrect:

x = '12345678901234567890123456789012345678901234567890123456789012345678901234'
y = 'これは長い列です90これは長い列です90これは長い列です90これは長い列です90これは長い列です90これは長い列です90これは長い列です90これは長'

Both lines are under the character limit, but ruff flags the second line as too long, reporting the number of bytes rather than the number of characters. This changed recently. 0.254 accepted this file, but 0.260 rejects it:

ruff bug.py
bug.py:2:81: E501 Line too long (140 > 80 characters)

pyproject.toml:

[tool.ruff]
line-length = 80

select = ["E"]

[tool.ruff.flake8-quotes]
inline-quotes = "single"

The text was updated successfully, but these errors were encountered:

charliermarsh · 2023-03-30T18:47:44Z

It's not quite the number of bytes (or, at least, it shouldn't be). Instead, it's the unicode width (or, at least, it should be). The relevant PR is here: #3714.

Yasu-umi · 2023-04-05T01:48:29Z

Hi! #3714 is finished, so this issue is already solved ?
I added the following case to the crates/ruff/resources/test/fixtures/pycodestyle/E501.py in current main branch, and running cargo run -p ruff_cli -- check crates/ruff/resources/test/fixtures/pycodestyle/E501.py --no-cache --select E501, but I didn't get expected result.

# Ok length=84
class JapaneseOk:
    """
    このドキュメントはテスト用のダミーです。十分に長い文章を用意しても大丈夫なことを確認してください。このくらいの長さではまだエラーになりません、
    """


# Error length=91
class JapaneseError:
    """
    このドキュメントはテスト用のダミーです。十分に長い文章を用意しても大丈夫なことを確認してください。このくらいの長さではまだエラーになりません、もーっともっと伸ばすとエラーになります。
    """

charliermarsh · 2023-04-05T01:53:56Z

It should be, yeah. I'll \cc @MichaReiser who may know better!

MichaReiser · 2023-04-05T12:32:59Z

We don't report either of the two examples because both strings contain no whitespace, meaning there's no reasonable position where the user can break the text over multiple lines.

https://github.com/charliermarsh/ruff/blob/46bcb1f725999540a315bc6fd44f455ec3cc38a9/crates/ruff/src/rules/pycodestyle/helpers.rs#L32-L36

Ruff reports both lines when adding whitespace somewhere inside the text

# Ok length=84
class JapaneseOk:
    """
    このドキュメントは テスト用のダミーです。十分に長い文章を用意しても大丈夫なことを確認してください。このくらいの長さではまだエラーになりません、
    """


# Error length=91
class JapaneseError:
    """
    このドキュメントは テスト用のダミーです。十分に長い文章を用意しても大丈夫なことを確認してください。このくらいの長さではまだエラーになりません、もーっともっと伸ばすとエラーになります。
    """

Ruff reports both lines because both exceed the unicode width of 88:

The first line has a width of 120 characters
The second line has a width of 152 characters

Yasu-umi · 2023-04-06T03:22:04Z

thanks for detailed explanation !

because both strings contain no whitespace

I misunderstood about this.

unicode width

I also misunderstood about this. (I assumed ruff uses &str.chars().count().)

Do you have any plan for move to grapheme cluster count ? (ex: use unicode_segmentation)
Of course, this counting is more expensive than &str.len(). but, is important (& easy for understanding err) for people use non-ascii chars language. please consider this!

MichaReiser · 2023-04-06T07:07:26Z

I also misunderstood about this. (I assumed ruff uses &str.chars().count().)

Ruff did use chars().count() but we changed it in #3714 to align with black. Pycodestyle does use chars().count() because their reasoning is that they don't want to add the complexity to their tool and PEP8 recommends using ASCII only.

This commit in the black repository has a good explanation why they (and Ruff) choose unicode width over character or grapheme count.

eviljeff · 2023-05-05T10:46:13Z

We have a case where ruff is reporting an over count line (112 vs our standard 88) but if we attempt to split it into two lines black auto-formats it back to one line because it'll fit on one line.

Is there still some misalignment between ruff and black, or is there a correct "fix" to this (other than adding #noqa: E501 to the end)?

         query = '남포역립카페추천 ˇjjtat닷컴ˇ ≡제이제이♠♣ 남포역스파 남포역op남포역유흥≡남포역안마남포역오피 ♠♣'

MichaReiser · 2023-05-05T15:22:17Z

Oh, I only now realized that Black only changed the behavior for the preview style but not for the "stable" style. You can see this in the changes.md. Black also only changed the behavior for string literals only, not for all characters.

@eviljeff could you try using Black's preview style to see if that fixes the issue? I wonder if we should add a configuration setting to allow users to pick their preferred "width" measurement unit.

eviljeff · 2023-05-05T16:53:46Z

Oh, I only now realized that Black only changed the behavior for the preview style but not for the "stable" style. You can see this in the changes.md. Black also only changed the behavior for string literals only, not for all characters.

@eviljeff could you try using Black's preview style to see if that fixes the issue?

It does fix that problem - black --preview splits the the string onto two lines which ruff accepts.

(Although if we wanted to use that as a workaround it wouldn't be a solution today as black --preview - for some reason - is rewriting "['categories']." to '[\'categories\'].' which ruff then raises as Q003 🤷‍♂️. I hope that's fixed in black before the change moves to stable)

I wonder if we should add a configuration setting to allow users to pick their preferred "width" measurement unit.

Personally, I prefer measuring for the line length seen in a (modern) code editor that supports unicode, because a line limit is mostly about readability, but practically, aligning with black is the most efficient solution (unless black adds a similar setting - which I would guess is unlikely given the project ethos of minimal configuration).

eviljeff · 2023-05-05T17:00:44Z

(Although if we wanted to use that as a workaround it wouldn't be a solution today as black --preview - for some reason - is rewriting "['categories']." to '[\'categories\'].' which ruff then raises as Q003 🤷‍♂️. I hope that's fixed in black before the change moves to stable)

fwiw, looks like that bug(?) has been reported already as psf/black#3643 (the example I included was also a multi-line string)

sgryjp · 2023-05-16T15:54:31Z

@MichaReiser I think that the new behavior (using unicode-width instead of byte-count) should be an opt-in feature because Black's preview style is also an opt-in feature until Jan 2024... it's too far.

Note that, for the Black's preview style, "there are no guarantees around the stability of the output" according to their stability policy. I don't think we should use it for production code.

MichaReiser · 2023-12-07T23:43:00Z

Black's preview style is about to be released and it includes the change to measure strings using the unicode width rather the character length. Ruff's formatter already uses unicode width.

We also updated our documentation to improve the explanation that it isn't the character limit when using emoji or east asian characters.

jackjyq · 2024-07-18T06:08:35Z

Even ruff uses Unicode width to measure line length, the editor font might not respect that, which appears as an in-consistency line length in ruff's auto-formatting or E501 reporting.

To resolve this, your favorite editor shall employ a carefully designed font.

For example Sarasa-Gothic

snstanton changed the title ~~Line length calculation does not work on~~ Line length incorrect for non-ascii text Mar 30, 2023

eviljeff mentioned this issue May 5, 2023

switch from flake8 to ruff mozilla/addons-server#20652

Merged

This was referenced Jun 16, 2023

Ruff treats every Chinese character as two and often falsely reports the "Line too long (E501)" warning #5133

Closed

E501 false positive when using east_asian_unicode string #3902

Closed

charliermarsh added the question Asking for support or clarification label Jun 16, 2023

charliermarsh added needs-decision Awaiting a decision from a maintainer and removed question Asking for support or clarification labels Jul 10, 2023

MichaReiser closed this as not planned Won't fix, can't repro, duplicate, stale Dec 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Line length incorrect for non-ascii text #3825

Line length incorrect for non-ascii text #3825

snstanton commented Mar 30, 2023

charliermarsh commented Mar 30, 2023

Yasu-umi commented Apr 5, 2023 •

edited

Loading

charliermarsh commented Apr 5, 2023

MichaReiser commented Apr 5, 2023 •

edited

Loading

Yasu-umi commented Apr 6, 2023

MichaReiser commented Apr 6, 2023

eviljeff commented May 5, 2023

MichaReiser commented May 5, 2023

eviljeff commented May 5, 2023

eviljeff commented May 5, 2023

sgryjp commented May 16, 2023

MichaReiser commented Dec 7, 2023

jackjyq commented Jul 18, 2024 •

edited

Loading

Line length incorrect for non-ascii text #3825

Line length incorrect for non-ascii text #3825

Comments

snstanton commented Mar 30, 2023

charliermarsh commented Mar 30, 2023

Yasu-umi commented Apr 5, 2023 • edited Loading

charliermarsh commented Apr 5, 2023

MichaReiser commented Apr 5, 2023 • edited Loading

Yasu-umi commented Apr 6, 2023

MichaReiser commented Apr 6, 2023

eviljeff commented May 5, 2023

MichaReiser commented May 5, 2023

eviljeff commented May 5, 2023

eviljeff commented May 5, 2023

sgryjp commented May 16, 2023

MichaReiser commented Dec 7, 2023

jackjyq commented Jul 18, 2024 • edited Loading

Yasu-umi commented Apr 5, 2023 •

edited

Loading

MichaReiser commented Apr 5, 2023 •

edited

Loading

jackjyq commented Jul 18, 2024 •

edited

Loading