Fix decoding of some UTF-16 strings that use surrogate pairs #529
When extracting PDF text to UTF-8, some fonts use a ToUnicode mapping that's defined using a CMap table. CMap tables define Unicode values using UTF-16 and, for reasons, we unwisely decode UTF-16 to codepoints ourselves instead of deferring to a library.
Turns out we had a boundary bug: codepoints whose surrogate pair starts with 0xD800 or 0xDBFF (the first and last values in the high surrogate range) weren't detected as surrogate pairs and were decoded incorrectly.
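For illustration, here is a minimal Python sketch of surrogate pair decoding with inclusive range checks. It is not the project's actual code; the function name `decode_utf16_units` and the constants are hypothetical. One plausible way a boundary bug like this arises (an assumption, not confirmed by the change itself) is a strict comparison such as `HIGH_MIN < unit < HIGH_MAX`, which skips both endpoints.

```python
# Hypothetical sketch, not the project's implementation.
HIGH_MIN, HIGH_MAX = 0xD800, 0xDBFF  # high (lead) surrogate range, inclusive
LOW_MIN, LOW_MAX = 0xDC00, 0xDFFF    # low (trail) surrogate range, inclusive


def decode_utf16_units(units):
    """Decode a list of 16-bit code units into a list of codepoints."""
    codepoints = []
    i = 0
    while i < len(units):
        unit = units[i]
        # Both comparisons must be inclusive, or the endpoints are missed.
        is_high = HIGH_MIN <= unit <= HIGH_MAX
        has_low = i + 1 < len(units) and LOW_MIN <= units[i + 1] <= LOW_MAX
        if is_high and has_low:
            high_bits = (unit - HIGH_MIN) << 10
            low_bits = units[i + 1] - LOW_MIN
            codepoints.append(0x10000 + high_bits + low_bits)
            i += 2
        else:
            # BMP code unit (or an unpaired surrogate, passed through as-is).
            codepoints.append(unit)
            i += 1
    return codepoints


# U+10107 (AEGEAN NUMBER ONE) is encoded as the pair 0xD800 0xDD07.
assert decode_utf16_units([0xD800, 0xDD07]) == [0x10107]
```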
The bug would usually manifest as an incompatible encoding error while extracting text.
I believe Unicode codepoints in the range 0x10000 (decimal 65536) to 0x103FF (decimal 66559) were impacted, a total of 1024 codepoints. (Technically higher codepoints were also impacted, but in an unallocated range.) The impacted codepoints are mostly ancient scripts and numbers, like Aegean Numbers, Ancient Greek, Phaistos Disc, and Old Persian.
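The range arithmetic can be checked with a short sketch. Each high surrogate covers a block of 1024 codepoints (one per possible low surrogate), so the two boundary values map to the ranges printed below. The helper name `codepoint_range_for_high_surrogate` is hypothetical and only used for this illustration.

```python
def codepoint_range_for_high_surrogate(high):
    """Return (first, last) codepoints encoded with the given high surrogate."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    first = 0x10000 + ((high - 0xD800) << 10)
    # The low surrogate contributes the bottom 10 bits: 0x000 to 0x3FF.
    return first, first + 0x3FF


for high in (0xD800, 0xDBFF):
    first, last = codepoint_range_for_high_surrogate(high)
    count = last - first + 1
    print(f"0x{high:04X}: U+{first:X} to U+{last:X} ({count} codepoints)")

# 0xD800: U+10000 to U+103FF (1024 codepoints)
# 0xDBFF: U+10FC00 to U+10FFFF (1024 codepoints)
```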