Illegal Unicode Escaping #8

Open
brcolow opened this issue May 10, 2018 · 1 comment

brcolow commented May 10, 2018

This is kind of a "gotcha" type issue, but I think my reasoning is sound.
nanojson parses literal (unescaped) characters correctly (and goes to great
lengths to make sure the UTF-8 is valid) in the consumeTokenStringUtf8Char
method. But when characters are written as \u escapes it uses the following
simplified logic:

// Decode the four hex digits that follow "\u" into a single 16-bit value.
int escaped = 0;

for (int i = 0; i < 4; i++) {
    escaped <<= 4;
    int digit = buffer[index++];
    if (digit >= '0' && digit <= '9') {
        escaped |= (digit - '0');
    } else if (digit >= 'A' && digit <= 'F') {
        escaped |= (digit - 'A') + 10;
    } else if (digit >= 'a' && digit <= 'f') {
        escaped |= (digit - 'a') + 10;
    } else {
        throw createParseException(null, "Expected unicode hex escape character: " +
                (char) digit + " (" + digit + ")", false);
    }
}

// Append the decoded value as a single UTF-16 code unit, with no surrogate pairing check.
reusableBuffer.append((char) escaped);

which is not sufficient, because there are illegal combinations of escaped
Unicode code units (as well as illegal lone ones), which I will now try to
show.
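
To make the gap concrete, here is a stand-alone sketch (plain Java, not
nanojson code) of what that loop produces for the escape \uD800: the four
hex digits decode to the single code unit 0xD800, which is appended with no
check that it ever gets paired with a low surrogate:

// Decode the four hex digits of \uD800 the same way the loop above does.
int escaped = 0;
for (char digit : "D800".toCharArray()) {
    escaped = (escaped << 4) | Character.digit(digit, 16);
}

StringBuilder reusableBuffer = new StringBuilder();
reusableBuffer.append((char) escaped);   // appended unconditionally

System.out.println(Character.isHighSurrogate((char) escaped)); // true
System.out.println(reusableBuffer.length());                   // 1 -- a lone high surrogate, ill-formed UTF-16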

The JSON spec states that any character may be escaped. For characters
outside the Basic Multilingual Plane (i.e., above U+FFFF) it says:

"To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E"."

In other words, characters above U+FFFF are encoded as surrogate pairs.
Thus, the JSON string "\uD800\uDC00" should decode to the single Unicode
character U+10000.
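
This round trip is easy to confirm with the standard java.lang.Character
helpers (plain Java, independent of nanojson):

char[] pair = Character.toChars(0x10000);
System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // prints: D800 DC00

int gClef = Character.toCodePoint('\uD834', '\uDD1E');
System.out.printf("U+%X%n", gClef);                             // prints: U+1D11E (the G clef from the spec example)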

The following JSON Strings:

1.) "\uD800"
2.) "\uD800\uCC00"

Should fail parsing because the high surrogate character (U+D800) is:

in 1.) not followed by anything at all (the escape is the last thing in the string)
in 2.) not followed by a low surrogate (U+CC00 is outside the low-surrogate range U+DC00-U+DFFF)

In other words, the following tests should pass:

@Test
public void testFailBustedString8() {
    try {
        // High-surrogate character not followed by another Unicode character
        JsonParser.any().from("\"\\uD800\"");
        fail();
    } catch (JsonParserException e) {
        testException(e, 1, 7); // 7 may be the wrong char, but is irrelevant for the issue
    }
}

@Test
public void testFailBustedString9() {
    try {
        // High-surrogate character not followed by a low-surrogate character
        JsonParser.any().from("\"\\uD800\\uCC00\"");
        fail();
    } catch (JsonParserException e) {
        testException(e, 1, 7); // 7 may be the wrong char, but is irrelevant for the issue
    }
}
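
The property the parser would need to enforce after decoding escapes can be
written as a small stand-alone check (plain Java, not a patch against
nanojson's internals): every high surrogate must be immediately followed by a
low surrogate, and no low surrogate may appear on its own.

static boolean isWellFormedUtf16(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isHighSurrogate(c)) {
            if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                return false; // lone or mispaired high surrogate
            }
            i++; // skip the low surrogate we just validated
        } else if (Character.isLowSurrogate(c)) {
            return false; // low surrogate with no preceding high surrogate
        }
    }
    return true;
}

// isWellFormedUtf16("\uD834\uDD1E") -> true
// isWellFormedUtf16("\uD800")       -> false (case 1 above)
// isWellFormedUtf16("\uD800\uCC00") -> false (case 2 above)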

Thanks for your time, let me know if anything is not clear.

mmastrac (Owner) commented

Great explanation. You are correct - the parser probably should not accept these. However, since JavaScript's JSON methods do stringify and parse illegal surrogates (at least Firefox did in my quick test), this should probably be gated behind a parser flag that isn't set by default.

Ideally the parser and writer would have a "strict UTF-16 surrogate" mode, i.e.:

JsonWriter.strictSurrogateSupport().indent("  ") ...
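
Purely as an illustration of that shape (this builder method does not exist
in nanojson today), the parser side could expose the same opt-in:

// Hypothetical, for illustration only -- strictSurrogateSupport() is not a real nanojson API yet.
JsonParser.any().strictSurrogateSupport().from("\"\\uD800\""); // would throw JsonParserException when enabled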
