Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EQL: Replace ?"..." with """...""" for unescaped strings #62539

Merged
merged 14 commits into from
Oct 2, 2020

Conversation

matriv
Copy link
Contributor

@matriv matriv commented Sep 17, 2020

Use triple double quotes enclosing a string literal to interpret it
as unescaped, in order to use ? for marking query params and avoid
user confusion.

Relates to #61659

Use triple doulbe quotes enclosing a string literal to interpret it
as unescaped, in order to use `?` for marking query params and avoid
user confusion.

Relates to elastic#61659
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (:Query Languages/EQL)

@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Sep 17, 2020
@matriv
Copy link
Contributor Author

matriv commented Sep 17, 2020

/cc @rw-access
/cc @jrodewig

@matriv
Copy link
Contributor Author

matriv commented Sep 17, 2020

Currently the usage of ?'.......' is handled with the error message regarding the single quotes.
Personally I prefer this to handling it as a specific case alongside ?"......", but let me know if you have a different opinion.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some parser tests to validate the error being thrown with the old and new format.

@@ -122,10 +122,18 @@ public static String unquoteString(Source source) {
return null;
}

// unescaped strings can be interpreted directly
// catch old method of ?" and ?' to define unescaped strings
if (text.startsWith("?")) {
checkForSingleQuotedString(source, text, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect - in case of ?' the error message is invalid - it tells that double quotes should be used which is not the case.
In fact I would argue that since the string is always quoted, if it starts with ? an exception should be thrown regardless of what follows it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, It's a 2-step error for the user: ?' -> ?" -> """. I just thought that explicitly handling the single quotes (no matter of the preceding ?) was a "better" approach. I'll gladly change it.

@matriv
Copy link
Contributor Author

matriv commented Sep 17, 2020

@costin other that this: https://github.com/elastic/elasticsearch/pull/62539/files#diff-8b3f6645d4cf54ff461c3e337b4bbe84R108 ?

Copy link
Contributor

@rw-access rw-access left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this has been decided as a new syntax so I would like to hold off on merging and run this by the EQL rules team.

@@ -195,6 +195,7 @@ STRING
| '"' ('\\' [btnfr"'\\] | ~[\r\n"\\])* '"'
| '?"' ('\\"' |~["\r\n])* '"'
| '?\'' ('\\\'' |~['\r\n])* '\''
| '"""' ('\\\'' |~['\r\n])* '"""'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like ' needs to be escaped under this format.
I also have some questions about the desired behavior of triple double quotes:

How would each of these be interpreted? I think we should know our edge cases up front. I'm wondering what the implication is for " at the beginning or at the end of a string. And how many of them are supported

  • """"""
  • """""x"""
  • """x"""""
  • """"""""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like ' needs to be escaped under this format.

Correct, I rushed and overlooked that.

I also have some questions about the desired behavior of triple double quotes:

How would each of these be interpreted? I think we should know our edge cases up front. I'm wondering what the implication is for " at the beginning or at the end of a string. And how many of them are supported

  • """"""
  • """""x"""
  • """x"""""
  • """"""""

The extra " in the beginning or end of the string would be part of the final string:

  • `` -> empty string
  • ""x
  • x""
  • ""

@rw-access
Copy link
Contributor

in order to use ? for marking query params and avoid
user confusion.

there is no grammatical ambiguity for a lone ?. it was passing tests here: https://github.com/elastic/elasticsearch/pull/52301/files#diff-75fcba04994edcd265cc994992f6124bR113

sure there may be an ambiguity from the user's expectation of what ? means, but since that's deprecated that should no longer be a problem. you should be safe to pursue ? as a templated param without needing to make any other grammatical changes

@matriv
Copy link
Contributor Author

matriv commented Sep 18, 2020

there is no grammatical ambiguity for a lone ?. it was passing tests here: https://github.com/elastic/elasticsearch/pull/52301/files#diff-75fcba04994edcd265cc994992f6124bR113

sure there may be an ambiguity from the user's expectation of what ? means, but since that's deprecated that should no longer be a problem. you should be safe to pursue ? as a templated param without needing to make any other grammatical changes

That's true, It's only to have a clear user of ? only for the params, there's no problem with the grammar, that's why I mentioned "user confusion", not implementation wise.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -199,6 +199,7 @@ STRING
| '"' ('\\' [btnfr"'\\] | ~[\r\n"\\])* '"'
| '?"' ('\\"' |~["\r\n])* '"'
| '?\'' ('\\\'' |~['\r\n])* '\''
| '"""' ('\\"' |~['\r\n])* '"""'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sttill doesn't seem right and needs more tests.
Under this regex, you start and end with """, but you can't have any ' characters in between. In addition, you can have an infinite number of " characters. This would mean """"""""""""""" is valid syntax. But I'm not sure that's good behavior.

Copy link
Contributor

@rw-access rw-access left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this particular implementation needs some more tests and an updated regex.

Regardless of the implementation though, I think we still need to hold off on merging this PR. I think the turnaround was impressive, but that we skipped a few crucial steps. Using """ for quoting literal strings went from proposal straight to implementation, but I think we still need the opportunity for a discussion, ironing out the behavior for edge cases, and ultimately agreement.

The specification going forward

I'm thinking that any changes to the EQL specification, like this example, should be collectively deliberated and agreed upon. I'm not completely sure if this syntax is desirable, and would like some more time to explore options. And when things merge, it's harder to back them out. Especially for 7.10+ changes or whenever we land as beta.

I think that all of the interested bodies should have weight, and I'm wondering if our existing syncs are sufficient to discuss changes to the specification. An idea was pitched on this PR elastic/beats#20994 (comment) to separate out the specification into a separate repo. I think this could help draw clearer lines between implementation and specification.

Thoughts @costin @philkra @paulewing @tsg?

@matriv
Copy link
Contributor Author

matriv commented Sep 21, 2020

Please see my comment here

e.getMessage());
}

public void testTripleDoubleQuotedUnescapedString() {
Copy link
Contributor

@rw-access rw-access Sep 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is """hello""" == """world""" interpreted?

it should be equivalent to this expression Equals("hello", "world):
"hello" == "world"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a test for that

@@ -200,6 +200,7 @@ STRING
| '"' ('\\' [btnfr"'\\] | ~[\r\n"\\])* '"'
| '?"' ('\\"' |~["\r\n])* '"'
| '?\'' ('\\\'' |~['\r\n])* '\''
| '"""' (~['\r\n])* '"""'
Copy link
Contributor

@rw-access rw-access Sep 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't need to forbid ' inside the string.

Suggested change
| '"""' (~['\r\n])* '"""'
| '"""' (~[\r\n])* '"""'

@matriv
Copy link
Contributor Author

matriv commented Oct 1, 2020

@rw-access Was right about the greediness, so in my previous implementation an expression like:

"""foo""" == """bar"""

would be parsed as a literal:

foo""" == """bar

So instead I've pushed the python approach that within the """....""" you can have anything except for 3 double quotes in a raw, one of them at least must be escaped with \. That means that at the end of string you always need to escape the last " that is adjacent to the closing """:
Wrong: """foo"""", correct: """foo\""""

Please advice for any more tests than the current ones.

/cc @jrodewig

@@ -200,6 +200,7 @@ STRING
| '"' ('\\' [btnfr"'\\] | ~[\r\n"\\])* '"'
| '?"' ('\\"' |~["\r\n])* '"'
| '?\'' ('\\\'' |~['\r\n])* '\''
| '"""' ('\\"' |~[\r\n])*? '"""'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could really come to bite us. Mostly because it means that a raw string is not a raw string. With this, there's no way to express a literal \" anywhere in the string.

This branch allows a literal \":
matriv/elasticsearch@replace-unescaped-triple-doublequotes...rw-access:raw-string-changes

Copy link
Contributor Author

@matriv matriv Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This grammar has other issues:

"""""hello\"""world!\"""" == """"foo"\""bar""\""""

=>

org.elasticsearch.xpack.eql.parser.ParsingException: line 1:21: token recognition error at: '!\'

and with the proposed solution you can have a \" literal if you double escape it \\"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could replace \" with " only if part of an escaped triple double quote sequence, but I think that produces more confusion for a user.

Copy link
Contributor

@rw-access rw-access Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it should give that syntax error because it's invalid syntax. the string was completed after seeing """.

that sounds like the correct behavior. If we introduce any escapes at all we defeat the purpose of this being a raw string, and require some potentially complex parsing logic and strange edge cases. although I don't think we should stop immediately when seeing """ because then a string can't end with """. we could process a few more quotes after without issue. then a string will be able to end with quotes, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"""""hello\"""world!\"""" it's a valid syntax, it has a closing """.
The previous ?" couldn't accept any " and every " should be escaped.
With this approach we only need to escape a """ which is very rare to occur.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of issues at hand:

  1. " is a common character. To avoid escaping it too often, the raw string relies on 3 triple quotes """. This however can become a problem at the end of the raw string where a trailing " can mess up quoting
  2. " are allowed inside and can be escaped to avoid closing a string """ through \. This again is it occurs at the end of the string.

To wit:

  1. """ dir "c:\dir"""" <-- 4 ending quotes
  2. """ dir c:\dir\""" <-- \ escapes the """

Due to file paths on windows 2, aka \" is likely a common occurrence.
I see several options to solve this:

A. Require raw strings to not end with " and advice folks to add a space for example:
""" dir "c:\dir" """
B. Use a different escape character from \, say |. It solves 2 but not 1.
C. Do not allow escaping inside raw strings. This solves 2 but not 1.
D. Use a different set for quoting characters. Say "?""and""?`. However this does not solve the problem, rather just minimizes the occurrences in which they appear. Which might be a win in itself.

Considering \ is traditionally used for escaping and in Window paths, I have a preference for C over B. That's because introducing a new character for escaping is surprising and might backfire again depending on the raw string .
B however mean that one cannot escape """ inside a raw string but then for that there's the regular string definition. Unfortunately this doesn't solve 1 either...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A. I thing introducing a mandatory whitespace can lead to other issues, and is it only one space? what happens with more spaces, we keep all but one?, or with tabs?
B. I also don't like a new escape char, doesn't address the main issue no1.
D. I definitely don't like D since it still contains the ? char which usually refers to regex or query param.

C. I'm ok with C so if a user needs to escape chars or needs a """, he/she has to resort to the normal "....." syntax and properly escape what's needed.

Copy link
Contributor

@rw-access rw-access Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

option E:

Allow for two optional additional trailing quotes with no escape sequences. then you can solve the trailing quotes issue without escapes.

The string can't contain """ anywhere inside. If you ever need to write a string with that sequence, a raw string is insufficient, so users will need to switch to a traditional escaped " string.

"""[^\r\n]*?""""?"?

Screenshot_20201002-055915~2.png

and it doesn't suffer from these contextual issues with \"
Screenshot_20201002-055234~2.png

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirmed that """ isn't allowed anywhere in the string but "" is;

Screenshot_20201002-060749~2.png

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rw-access I'm also happy with that, pushed the change and fixed all relevant tests.

@matriv
Copy link
Contributor Author

matriv commented Oct 2, 2020

@elasticmachine run elasticsearch-ci/packaging-sample-windows

@matriv matriv requested a review from rw-access October 2, 2020 11:00
Copy link
Contributor

@rw-access rw-access left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
thanks for your endurance!

@matriv
Copy link
Contributor Author

matriv commented Oct 2, 2020

thanks for your endurance!

Thank you too from my side!

@matriv
Copy link
Contributor Author

matriv commented Oct 2, 2020

@elasticmachine run elasticsearch-ci/packaging-sample-windows

@matriv
Copy link
Contributor Author

matriv commented Oct 2, 2020

@jrodewig So finally the change is that we replace ?"......." with """.....""".
Everything inside """....""" is allowed as is (raw) and the only thing that is not allowed is triple double quotes as those are considered to be marking the end of the raw string. So if one has """ in their string they need to use the usual string literal syntax: "....." and properly escape it, e.g.: "foo\"\"\"bar" or "foobar\"\"\""

@matriv matriv merged commit d87c2ca into elastic:master Oct 2, 2020
@matriv matriv deleted the replace-unescaped-triple-doublequotes branch October 2, 2020 12:00
matriv added a commit that referenced this pull request Oct 2, 2020
…3174)

Use triple double quotes enclosing a string literal to interpret it
as unescaped, in order to use `?` for marking query params and avoid
user confusion. `?` also usually implies regex expressions.

Any character inside the `"""` beginning-closing markings is considered
raw and the only thing that is not permitted is the `"""` sequence itself.
If a user wants to use that, needs to resort to the normal `"` string literal
and use proper escaping.

Relates to #61659

(cherry picked from commit d87c2ca)
@costin
Copy link
Member

costin commented Oct 2, 2020

Woohoo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants