Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

protoc: doesn't accept capital X in hex escapes in string literals #10756

Closed
jhump opened this issue Oct 10, 2022 · 1 comment · Fixed by #10757
Closed

protoc: doesn't accept capital X in hex escapes in string literals #10756

jhump opened this issue Oct 10, 2022 · 1 comment · Fixed by #10757
Labels

Comments

@jhump
Copy link
Contributor

jhump commented Oct 10, 2022

In the specs on the developer site, string literals can use \X (capital X) to initiate a hexadecimal byte literal:

hexEscape = '\' ( "x" | "X" ) hexDigit hexDigit

However, attempting to do so results in a syntax error:

// test.proto
syntax = "proto2";

message Foo {
  optional bytes bar = 1 [default = "\XBC"];
}
$ protoc test.proto -o /dev/null
test.proto:5:39: Invalid escape sequence in string literal.

Changing the string literal to "\xBC" (lower-case X) fixes the problem.

This appears to be a bug in the tokenizer, not in the documentation. I say this for a couple of reasons:

  1. Both lower-case and capital X are accepted for hexadecimal integer literals.
  2. Decoding the contents of default_value for a bytes field accepts hex escapes that use capital X.

To test the latter, I modified a file descriptor set so it included "\\X74" in a default value:

protoc test.proto -o /dev/stdout | xxd > test.protoset.txt
# edit the output of xxd (see third line, third word) to use capital X escape
00000000: 0a2b 0a0a 7465 7374 2e70 726f 746f 221d  .+..test.proto".
00000010: 0a03 466f 6f12 160a 0362 6172 1801 2001  ..Foo....bar.. .
00000020: 280c 3a04 5c58 3734 5203 6261 72         (.:.\X74R.bar

It turns out that protoc happily accepts this, even though it would complain about the escape if it appeared in a proto source file.

$ xxd -r < test.protoset.txt | protoc --descriptor_set_in /dev/stdin -o /dev/stdout test.proto | xxd
00000000: 0a28 0a0a 7465 7374 2e70 726f 746f 221a  .(..test.proto".
00000010: 0a03 466f 6f12 130a 0362 6172 1801 2001  ..Foo....bar.. .
00000020: 280c 3a01 7452 0362 6172                 (.:.tR.bar

As can be seen above, the \X74 escape was accepted and translated to t.

@jhump jhump added the untriaged auto added to all issues by default when created. label Oct 10, 2022
@fowles fowles added protoc release notes: yes and removed untriaged auto added to all issues by default when created. labels Oct 10, 2022
@fowles
Copy link
Contributor

fowles commented Oct 10, 2022

Agreed that this is a tokenizer bug. Feel free to send a fix if you have one ready!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants