Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increased Support for Printing Escape Sequences #1502

Merged
merged 9 commits into from
Mar 6, 2023

Conversation

gptsarthak
Copy link
Contributor

Closes #1473

LPython now replicates python behavior with all types of combinations of escape sequences.

def test_new_line():
    print('I said \'Wow\' ')
    print("I said \"hi\" ")
    print("I said \\\"hi\" ")
    print("abc \\ ") 
    print("abc \\\ wow")

test_new_line()

(lp) sarthak@pop-os:~/lpython/examples$ python test.py
I said 'Wow' 
I said "hi" 
I said \"hi" 
abc \ 
abc \\ wow

(lp) sarthak@pop-os:~/lpython/examples$ lpython test.py
I said 'Wow' 
I said "hi" 
I said \"hi" 
abc \ 
abc \\ wow

@@ -803,6 +803,15 @@ char* unescape(Allocator &al, LCompilers::Str &s) {
} else if (s.p[idx] == '\\' && s.p[idx+1] == '\'') {
x += "'";
idx++;
} else if (s.p[idx] == '\\' && s.p[idx+1] == '\"') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you recheck for this case -

A tricky one I came across while randomly trying out a bunch of strings. Although a very improbable input, I would like this pr to perfectly handle all cases.

print('I said \'\\\'hi\\''hey\' ')

(lp) C:\Users\kunni\lpython>python try.py
I said '\'hi\hey'

(lp) C:\Users\kunni\lpython>src\bin\lpython try.py
I said '\'hi\hey\'

Thanks !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have observed that LPython generates incorrect output for the following cases:

print('  ''hey\' ')
print(" " " hey\" ")

Meaning, when there are 2 strings in a print function, which have no comma or any operator between them. This is accepted behavior in python.
Like you said this is an impossible input, since a user should put an operator like a comma between two strings, in which case lpython executes correctly.
I do not yet understand why lpython outputs that escape sequence after 'hey', but will try after my classes end. But right now I think that this might be a more parser related issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But right now I think that this might be a more parser related issue.

Try checking the AST using --show-ast command line argument in lpython.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was kinda right - lpython considers these both of the print statements I wrote as single strings and ignores the quotes in the middle, which somehow leads to the problem of printing the escape sequence.
I believe that in ("a" " b") or ('a' 'b'), a and b should be considered two different strings by lpython, which it does not.
Some AST results for further investigations:

print('a  b\' ')
[(ConstantStr "a  b' " ())]                           // Correct

print('a'' b\' ')
[(ConstantStr "a b\' " ())]                           // Incorrect (Same AST but '\' also printed)

print('a'              ' b\' ')
[(ConstantStr "a b\' " ())]                           // Incorrect (Whitespace ignored) 

print('a' , ' b\' ')
[(ConstantStr "a" ()) (ConstantStr " b' " ())]        // Correct (Two strings because of comma) 

print("a " " b\' ")
[(ConstantStr "a  b\' " ())]                          // Incorrect (Same issue with double quotes)

print("a' ' b\' ")
[(ConstantStr "a' ' b' " ())]                         // Correct (Nested quotes work)

I think we should open a new issue for this, if it is a problem that needs to be solved.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, thanks for investigating. You can create a new issue or directly open a new PR and work towards fixing it.
If you get any doubts regarding the parser, do ping me I can help you.

@czgdp1807 czgdp1807 marked this pull request as draft February 6, 2023 04:42
Copy link
Collaborator

@czgdp1807 czgdp1807 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a runnable test for this. See https://github.com/lcompilers/lpython/tree/main/integration_tests for the pattern.

Comment on lines +8 to +14
def test_escape_sequences():
print('I said \'Wow\' ')
print("I said \"hi\" ")
print("I said \\\"hi\" ")
print('I said \'\\\'hi\\''hey\' ')
print('a''b')
print("a" "b")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add this test to tests/parser/string2.py as well? (so that we test the AST changes)

Comment on lines 1 to 2
(Module [(Expr (ConstantStr ""\" ())) (Expr (ConstantStr "
\" ())) (Expr (ConstantStr "\" ())) (Expr (ConstantStr " \" ())) (Expr (ConstantStr "\ " ())) (Expr (ConstantStr "\" ())) (Expr (ConstantStr "\|" ())) (Expr (ConstantStr "Text\\" ())) (Expr (ConstantStr "\" ())) (Expr (Call (Name x Load) [(ConstantStr "\class a:
Copy link
Collaborator

@Thirumalai-Shaktivel Thirumalai-Shaktivel Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, ""\", "\", ... seems to be wrong.

Can you cross check with Python's output and make the changes accordingly in semantics.h? you can use python -m ast tests/parser/string2.py to verify the Python's AST.

@@ -774,9 +799,12 @@ static inline ast_t* concat_string(Allocator &al, Location &l,
throw LCompilers::LPython::parser_local::ParserError(
"The byte and non-byte literals can not be combined", l);
}
} else {
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {
} else {

Comment on lines 735 to 762
} else if (s.p[idx] == '\\' && s.p[idx+1] == ' ') {
x += "\\ ";
idx++;
} else if (s.p[idx] == '\\') {
x += "\\";
idx++;
Copy link
Collaborator

@Thirumalai-Shaktivel Thirumalai-Shaktivel Feb 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment this, check the output, uncomment or add new lines, check the output...; play around with this. Please make sure that you get the same output as CPython ast

@gptsarthak
Copy link
Contributor Author

gptsarthak commented Feb 15, 2023

In the latest commit I have :

  • Added code to escape the strings in the AST. This required me to edit the asdl_cpp.py file to emit the correct code in the visit_ConstantStr function in the Pickle Visitor class.

  • Made the strings in LPython AST be same as you would get in python -m AST {file-name} by escaping newline

  • Fixed printing of backslash when used as a continuation character at the end of a line. We already had tests for this in string2.py and many other files, but they produced incorrect AST and output before.

  • Refactored code according to suggested changes.

  • Added test in parser/string2.py

There is one issue left to solve. After making the changes mentioned in first point, I observed that visit_ConstantBytes is also getting edited along the visit_ConstantStr function. I could not find a way to differentiate between the two inside make_visitor. If that is a problem, we can discuss and fix it asap. After that I believe we can merge this PR.

Comment on lines 1448 to 1460
elif field.name == "value":
self.emit( 's.append(\"\\\"\");',2)
self.emit( "std::string str = std::string(x.m_value);",2)
self.emit( "for (size_t idx=0; idx < str.size(); idx++) {",2)
self.emit( "if (str[idx] == \'\\n\') {",3)
self.emit( "s += \"\\\\n\";",4)
self.emit( "} else if (str[idx] == '\\\\') {",3)
self.emit( "s += \"\\\\\\\\\";",4)
self.emit( "} else {",3)
self.emit( "s += str[idx];",4)
self.emit( "}",3)
self.emit( "}",2)
self.emit( "s.append(\"\\\"\");",2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to handle this in semantics.h itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried handling this in the semantics alone, but could not do it because that changes the output of the code. It changes the value in the ConstantStr node. If we change that, we'll have to handle this again in the backend (python_ast_to_asr.cpp i guess) to get the same output as python. If that is what we need I'll try to do that myself first, otherwise you should take over this PR. Thank you

" ())) (Expr (ConstantStr "
'\'
" ()))] [])
(Module [(Expr (ConstantStr ""\\" ())) (Expr (ConstantStr "\n\\" ())) (Expr (ConstantStr "\\" ())) (Expr (ConstantStr " \\" ())) (Expr (ConstantStr "\\ " ())) (Expr (ConstantStr "" ())) (Expr (ConstantStr "\\|" ())) (Expr (ConstantStr "I said 'Wow' " ())) (Expr (ConstantStr "I said "hi" " ())) (Expr (ConstantStr "I said \\"hi" " ())) (Expr (ConstantStr "I said '\\'hi\\hey' " ())) (Expr (ConstantStr "ab" ())) (Expr (ConstantStr "ab" ())) (Expr (ConstantStr "Text\\\\" ())) (Expr (ConstantStr "" ())) (Expr (Call (Name x Load) [(ConstantStr "class a:\n def test():\n return r"\\LaTeX"\n" ())] [])) (Expr (ConstantStr "\n "\\\\"\n" ())) (Expr (ConstantStr "\n '\\'\n" ()))] [])
Copy link
Collaborator

@Thirumalai-Shaktivel Thirumalai-Shaktivel Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything seems to be correct. Except ""\\", which should be '"\\' or "\"\\" (and "\n "\\\\"\n" too). I think we don't have to unescape " in this PR. Instead, Let's create a new issue or work on another PR.

Copy link
Contributor Author

@gptsarthak gptsarthak Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please create a new issue to work on this.
Right now what i think is, in Lpython's ConstantStr node we use string value while python uses constant value. I think this is why the strings are wrapped in double quotes instead of single quotes, which is the only difference I see here. If fixing this is simply not unescaping ", that will be easy and I'll do it in my next commit in this PR only.
In python,
Update : Python AST has value exactly what we get using the repr() method,

  • mostly wrapped in single quotes (example 'I said \\"hi" ')
  • unless there is a single quote inside string in which case it wraps with double quotes. (example - "I said 'Wow' ")
  • If both single and double quotes are present inside string, it wraps with single quotes and escapes the single quotes inside. (example - 'I said "\'Wow\'" ')

I can try implementing this but I need help with how we would wrap our string value around with single quote

@Thirumalai-Shaktivel
Copy link
Collaborator

Thank you, @gptsarthak for working on this. It was very much required.
I have left some comments, you can work on it if you know what to do. Otherwise, I will try to work on it coming weekend.

@Thirumalai-Shaktivel
Copy link
Collaborator

@gptsarthak I made the required changes.
Now, you can review and test it. If you have any doubts or find any mistakes, please ping me.

@Thirumalai-Shaktivel Thirumalai-Shaktivel marked this pull request as ready for review February 20, 2023 13:54
Copy link
Collaborator

@czgdp1807 czgdp1807 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay to me.

@anutosh491
Copy link
Collaborator

LGTM ! I did some testing and it looks fine .

@Thirumalai-Shaktivel
Copy link
Collaborator

Thanks for the approval and testing.

@Thirumalai-Shaktivel
Copy link
Collaborator

@gptsarthak is it okay if I merge this?
Do you wanna test or make some other changes??

@czgdp1807
Copy link
Collaborator

Conflicts should be resolved here.

@czgdp1807 czgdp1807 marked this pull request as draft March 2, 2023 14:31
@gptsarthak
Copy link
Contributor Author

I believe I resolved the conflict. Its my first time dealing with them so hopefully it went right.
@Thirumalai-Shaktivel thanks for working on this! I did some tests and they ran fine. Your commits taught me a lot and gave an insight on how the issue was supposed to be solved. It can be merged now.
I believe I can help out with #1529 (comment) after this is merged.

@Thirumalai-Shaktivel Thirumalai-Shaktivel marked this pull request as ready for review March 6, 2023 05:48
@Thirumalai-Shaktivel Thirumalai-Shaktivel merged commit 3c53ffb into lcompilers:main Mar 6, 2023
@gptsarthak gptsarthak deleted the fix branch March 21, 2023 07:47
@certik
Copy link
Contributor

certik commented May 4, 2023

I think this PR will have to be fixed. It seems it keeps the strings escaped in ASR. I think that's not a good design, because then we need to be unescaping the strings in almost all the backends. Also, the escaping and unescaping will be different in every surface language, in particular it is different in Fortran and Python.

Rather, a proper design is what we had in master before this PR, that is:

  • Strings should be unescaped, represented as utf-8 strings, directly. I think they can be null terminated, that's probably fine.
  • The frontend's job is to unescape the strings, based on the surface level language rules.
  • The backend's job is to properly escape the string as needed, based on the language, it will in general be different for Julia, Python, Fortran, C, etc. Most lowering backends like LLVM and WASM will not be escaping the string at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Escape sequence recognized but not ignored in printing.
6 participants