Illegal Unicode Escaping #8

Open
brcolow opened this issue May 10, 2018 · 1 comment

brcolow commented May 10, 2018

This is kind of a "gotcha" type issue, but I think my reasoning is sound.
nanojson parses literal (unescaped) characters correctly (and goes to great
lengths to make sure the UTF-8 is valid) in the consumeTokenStringUtf8Char
method. But when characters are written as \u escapes it uses the following
simplified logic:

// Decode the four hex digits that follow "\u" into a single 16-bit value.
int escaped = 0;

for (int i = 0; i < 4; i++) {
    escaped <<= 4;
    int digit = buffer[index++];
    if (digit >= '0' && digit <= '9') {
        escaped |= (digit - '0');
    } else if (digit >= 'A' && digit <= 'F') {
        escaped |= (digit - 'A') + 10;
    } else if (digit >= 'a' && digit <= 'f') {
        escaped |= (digit - 'a') + 10;
    } else {
        throw createParseException(null, "Expected unicode hex escape character: " +
                (char) digit + " (" + digit + ")", false);
    }
}

// Append the decoded value as a single UTF-16 code unit, with no surrogate pairing check.
reusableBuffer.append((char) escaped);

which is not sufficient, because there are illegal combinations of escaped
Unicode code units (as well as illegal lone ones), which I will now try to
show.
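
To make the gap concrete, here is a stand-alone sketch (plain Java, not
nanojson code) of what that loop produces for the escape \uD800: the four
hex digits decode to the single code unit 0xD800, which is appended with no
check that it ever gets paired with a low surrogate:

// Decode the four hex digits of \uD800 the same way the loop above does.
int escaped = 0;
for (char digit : "D800".toCharArray()) {
    escaped = (escaped << 4) | Character.digit(digit, 16);
}

StringBuilder reusableBuffer = new StringBuilder();
reusableBuffer.append((char) escaped);   // appended unconditionally

System.out.println(Character.isHighSurrogate((char) escaped)); // true
System.out.println(reusableBuffer.length());                   // 1 -- a lone high surrogate, ill-formed UTF-16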

The JSON spec states that any character may be escaped. For characters
outside the Basic Multilingual Plane (i.e., above U+FFFF) it says:

"To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E"."

In other words, characters above U+FFFF are encoded as surrogate pairs.
Thus, the JSON string "\uD800\uDC00" should decode to the single Unicode
character U+10000.
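
This round trip is easy to confirm with the standard java.lang.Character
helpers (plain Java, independent of nanojson):

char[] pair = Character.toChars(0x10000);
System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // prints: D800 DC00

int gClef = Character.toCodePoint('\uD834', '\uDD1E');
System.out.printf("U+%X%n", gClef);                             // prints: U+1D11E (the G clef from the spec example)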

The following JSON Strings:

1.) "\uD800"
2.) "\uD800\uCC00"

Should fail parsing because the high surrogate character (U+D800) is:

in 1.) not followed by anything at all (the escape is the last thing in the string)
in 2.) not followed by a low surrogate (U+CC00 is outside the low-surrogate range U+DC00-U+DFFF)

In other words, the following tests should pass:

@Test
public void testFailBustedString8() {
    try {
        // High-surrogate character not followed by another Unicode character
        JsonParser.any().from("\"\\uD800\"");
        fail();
    } catch (JsonParserException e) {
        testException(e, 1, 7); // 7 may be the wrong char, but is irrelevant for the issue
    }
}

@Test
public void testFailBustedString9() {
    try {
        // High-surrogate character not followed by a low-surrogate character
        JsonParser.any().from("\"\\uD800\\uCC00\"");
        fail();
    } catch (JsonParserException e) {
        testException(e, 1, 7); // 7 may be the wrong char, but is irrelevant for the issue
    }
}
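
The property the parser would need to enforce after decoding escapes can be
written as a small stand-alone check (plain Java, not a patch against
nanojson's internals): every high surrogate must be immediately followed by a
low surrogate, and no low surrogate may appear on its own.

static boolean isWellFormedUtf16(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isHighSurrogate(c)) {
            if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                return false; // lone or mispaired high surrogate
            }
            i++; // skip the low surrogate we just validated
        } else if (Character.isLowSurrogate(c)) {
            return false; // low surrogate with no preceding high surrogate
        }
    }
    return true;
}

// isWellFormedUtf16("\uD834\uDD1E") -> true
// isWellFormedUtf16("\uD800")       -> false (case 1 above)
// isWellFormedUtf16("\uD800\uCC00") -> false (case 2 above)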

Thanks for your time, let me know if anything is not clear.

mmastrac (Owner) commented

Great explanation. You are correct - the parser probably should not accept these. However, since JavaScript's JSON methods do stringify and parse illegal surrogates (at least Firefox did in my quick test), this should probably be gated behind a parser flag that isn't set by default.

Ideally the parser and writer would have a "strict UTF-16 surrogate" mode, i.e.:

JsonWriter.strictSurrogateSupport().indent("  ") ...
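
Purely as an illustration of that shape (this builder method does not exist
in nanojson today), the parser side could expose the same opt-in:

// Hypothetical, for illustration only -- strictSurrogateSupport() is not a real nanojson API yet.
JsonParser.any().strictSurrogateSupport().from("\"\\uD800\""); // would throw JsonParserException when enabled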
