Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider 2-character EOL before line continuation #12035

Merged
merged 1 commit into from
Jun 26, 2024

Conversation

dhruvmanila
Copy link
Member

@dhruvmanila dhruvmanila commented Jun 26, 2024

Summary

This PR fixes a bug introduced in #12008 which didn't consider the two character newline after the line continuation character.

For example, consider the following code highlighted with whitespaces:

call(foo # comment \\r\n
\r\n
def bar():\r\n
....pass\r\n

The lexer is at def when it's running the re-lexing logic and trying to move back to a newline character. It encounters \n and it's being escaped (incorrect) but \r is being escaped, so it moves the lexer to \n character. This creates an overlap in token ranges which causes the panic.

Name 0..4
Lpar 4..5
Name 5..8
Comment 9..20
NonLogicalNewline 20..22 <-- overlap between
Newline 21..22           <-- these two tokens
NonLogicalNewline 22..23
Def 23..26
...

fixes: #12028

Test Plan

Add a test case with line continuation and windows style newline character.

@dhruvmanila dhruvmanila added bug Something isn't working parser Related to the parser labels Jun 26, 2024
Copy link
Contributor

github-actions bot commented Jun 26, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Comment on lines +1399 to +1408
match ch {
'\n' => {
current_position -= ch.text_len();
if let Some(carriage_return) = reverse_chars.next_if_eq(&'\r') {
current_position -= carriage_return.text_len();
}
current_position -= TextSize::new('\\'.text_len().to_u32() * backslash_count);
}
'\r' => {
current_position -= ch.text_len();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main change is to split the match with \r and \n into it's own branch and then a common check for line continuation character.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. But I agree with you that this is all backwards lexing again. Could we use the previous token kind (e.g. by storing it on the lexer) and test if it is a NonLogicalNewline to avoid all the complexity of guessing if something's a newline or not?

Or is this not sufficient if there are multiple newlines (in which case storing just one previous token isn't enough?)

I guess there's also:

[a, 
	# comment
	# more comment

def foo

where there's more in between def and ,

Comment on lines +1399 to +1408
match ch {
'\n' => {
current_position -= ch.text_len();
if let Some(carriage_return) = reverse_chars.next_if_eq(&'\r') {
current_position -= carriage_return.text_len();
}
current_position -= TextSize::new('\\'.text_len().to_u32() * backslash_count);
}
'\r' => {
current_position -= ch.text_len();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this logic a bit. It doesn't seem to matter how many new line characters there are, we just want to skip all of them

if matches!(c, '\n' | '\r') {
	while Some(newline) = reverse_chars.next_if(|c| matches!(c, '\n' | '\r')) {
		current_position -= newline.text_len();
	}

 // after any new line, handle line continuation
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work because we want the position of the last unescaped newline character. We would still need to keep track of that position, so we might as well be explicit about this.

@dhruvmanila
Copy link
Member Author

dhruvmanila commented Jun 26, 2024

Looks good to me. But I agree with you that this is all backwards lexing again. Could we use the previous token kind (e.g. by storing it on the lexer) and test if it is a NonLogicalNewline to avoid all the complexity of guessing if something's a newline or not?

Or is this not sufficient if there are multiple newlines (in which case storing just one previous token isn't enough?)

I guess there's also:

[a, 
	# comment
	# more comment

def foo

where there's more in between def and ,

Yeah, that's correct. There can be multiple trivia tokens in between which makes it hard to just look at the last token.

Although we could split the logic and move "finding the non-logical newline" part to the TokenSource who has access to all the tokens and then it'll ask the lexer to move back to the provided location. I'm not sure if this is a good way to model this behavior.

@dhruvmanila dhruvmanila merged commit 47c9ed0 into main Jun 26, 2024
20 checks passed
@dhruvmanila dhruvmanila deleted the dhruv/line-continuation-eol branch June 26, 2024 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working parser Related to the parser
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rule E731 cause panic (from_tokens panic - never ending story)
2 participants