Fix decoding of some UTF-16 strings that use surrogate pairs
When extracting PDF text to UTF-8, some fonts use a ToUnicode mapping
that's defined using a CMap table. CMap tables define Unicode values
using UTF-16 and, for reasons, we unwisely decode UTF-16 to codepoints
ourselves instead of deferring to a library.
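
For comparison, a minimal sketch of what deferring to Ruby's built-in transcoder could look like. The helper name `utf16be_to_codepoints` is hypothetical and is not part of pdf-reader; it assumes the input is a string of big-endian UTF-16 bytes, as found in a CMap hex string.

```ruby
# Hypothetical helper, not part of pdf-reader: let Ruby's transcoder handle
# surrogate pairs instead of doing the arithmetic by hand.
def utf16be_to_codepoints(bytes)
  bytes.dup.force_encoding(Encoding::UTF_16BE).encode(Encoding::UTF_8).codepoints
end

# The four bytes D8 00 DD 02 form a single surrogate pair.
pair = ["D800DD02"].pack("H*")
p utf16be_to_codepoints(pair).map { |c| format("0x%X", c) }
```

A side effect of this approach is that malformed input (such as a lone surrogate) raises an encoding error at decode time rather than producing a broken string later on.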

Turns out we had a boundary bug, where codepoints encoded as a surrogate
pair whose high surrogate is exactly 0xD800 or 0xDBFF weren't detected
as surrogate pairs and were decoded incorrectly.
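
The boundary problem can be reproduced with a small standalone decoder. This is an illustrative sketch, not the pdf-reader implementation: `decode_utf16_units` and its `inclusive:` flag are made-up names that switch between the old exclusive comparison and the corrected inclusive one.

```ruby
# Illustrative decoder, not the pdf-reader code. `inclusive: false` mimics the
# old (>, <) high-surrogate check; `inclusive: true` mimics the fix (>=, <=).
def decode_utf16_units(units, inclusive: true)
  units = units.dup
  result = []
  until units.empty?
    first = units.first
    high = if inclusive
      first.between?(0xD800, 0xDBFF)
    else
      first > 0xD800 && first < 0xDBFF
    end
    if units.size >= 2 && high
      hi, lo = units.shift(2)
      # Standard UTF-16 surrogate pair arithmetic
      result << ((hi - 0xD800) * 0x400) + (lo - 0xDC00) + 0x10000
    else
      result << units.shift
    end
  end
  result
end

units = "\xD8\x00\xDD\x02".b.unpack("n*") # two 16-bit units: 0xD800, 0xDD02
p decode_utf16_units(units)                   # inclusive check: one codepoint
p decode_utf16_units(units, inclusive: false) # exclusive check: two lone surrogates
```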

This would usually manifest as an incompatible encoding error while
extracting text:

    /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `strip': invalid byte sequence in UTF-8 (Encoding::CompatibilityError)
        from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `block in interesting_rows'
        from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `map'
        from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `interesting_rows'
        from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:46:in `to_s'
        from /home/jh/git/pdf-reader/lib/pdf/reader/page.rb:121:in `text'
        from bin/pdf_text:12:in `block in <main>'
        from bin/pdf_text:11:in `each'
        from bin/pdf_text:11:in `<main>'
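
A short sketch of why the mis-decoded values lead to broken UTF-8, assuming the undetected surrogates end up packed with `pack("U*")` like ordinary codepoints. The variable names are illustrative only.

```ruby
# Lone surrogates packed as if they were ordinary codepoints yield bytes
# that are not valid UTF-8.
bad  = [0xD800, 0xDD02].pack("U*") # result of the missed surrogate pair
good = [0x10102].pack("U*")        # result once the pair is combined

puts bad.valid_encoding?
puts good.valid_encoding?
```

Checking `valid_encoding?` on extracted text is one way to spot this class of problem before a later string operation raises.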

I believe Unicode codepoints in the range 0x10000 (decimal 65536) to
0x103FF (decimal 66559) were impacted, a total of 1024 codepoints.
(Technically the higher codepoints 0x10FC00 to 0x10FFFF were also
impacted, but that range is private use with no assigned characters.)
The impacted codepoints are mostly ancient scripts and numbers, like
[Aegean
Numbers](https://en.wikipedia.org/wiki/Aegean_Numbers_(Unicode_block)),
[Ancient
Greek Numbers](https://en.wikipedia.org/wiki/Ancient_Greek_Numbers_(Unicode_block)),
[Phaistos
Disc](https://en.wikipedia.org/wiki/Phaistos_Disc_(Unicode_block)), and
[Old
Persian](https://en.wikipedia.org/wiki/Old_Persian_(Unicode_block)).

    (65536..66559).map { |c| [c].pack("U*") }.each_slice(20) { |s| puts s.join(" ") }
    𐀀 𐀁 𐀂 𐀃 𐀄 𐀅 𐀆 𐀇 𐀈 𐀉 𐀊 𐀋  𐀍 𐀎 𐀏 𐀐 𐀑 𐀒 𐀓
    𐀔 𐀕 𐀖 𐀗 𐀘 𐀙 𐀚 𐀛 𐀜 𐀝 𐀞 𐀟 𐀠 𐀡 𐀢 𐀣 𐀤 𐀥 𐀦
    𐀨 𐀩 𐀪 𐀫 𐀬 𐀭 𐀮 𐀯 𐀰 𐀱 𐀲 𐀳 𐀴 𐀵 𐀶 𐀷 𐀸 𐀹 𐀺
    𐀼 𐀽  𐀿 𐁀 𐁁 𐁂 𐁃 𐁄 𐁅 𐁆 𐁇 𐁈 𐁉 𐁊 𐁋 𐁌 𐁍
    𐁐 𐁑 𐁒 𐁓 𐁔 𐁕 𐁖 𐁗 𐁘 𐁙 𐁚 𐁛 𐁜 𐁝

            𐂀 𐂁 𐂂 𐂃 𐂄 𐂅 𐂆 𐂇 𐂈 𐂉 𐂊 𐂋
    𐂌 𐂍 𐂎 𐂏 𐂐 𐂑 𐂒 𐂓 𐂔 𐂕 𐂖 𐂗 𐂘 𐂙 𐂚 𐂛 𐂜 𐂝 𐂞 𐂟
    𐂠 𐂡 𐂢 𐂣 𐂤 𐂥 𐂦 𐂧 𐂨 𐂩 𐂪 𐂫 𐂬 𐂭 𐂮 𐂯 𐂰 𐂱 𐂲 𐂳
    𐂴 𐂵 𐂶 𐂷 𐂸 𐂹 𐂺 𐂻 𐂼 𐂽 𐂾 𐂿 𐃀 𐃁 𐃂 𐃃 𐃄 𐃅 𐃆 𐃇
    𐃈 𐃉 𐃊 𐃋 𐃌 𐃍 𐃎 𐃏 𐃐 𐃑 𐃒 𐃓 𐃔 𐃕 𐃖 𐃗 𐃘 𐃙 𐃚 𐃛
    𐃜 𐃝 𐃞 𐃟 𐃠 𐃡 𐃢 𐃣 𐃤 𐃥 𐃦 𐃧 𐃨 𐃩 𐃪 𐃫 𐃬 𐃭 𐃮 𐃯
    𐃰 𐃱 𐃲 𐃳 𐃴 𐃵 𐃶 𐃷 𐃸 𐃹 𐃺      𐄀 𐄁 𐄂
       𐄇 𐄈 𐄉 𐄊 𐄋 𐄌 𐄍 𐄎 𐄏 𐄐 𐄑 𐄒 𐄓 𐄔 𐄕 𐄖 𐄗
    𐄘 𐄙 𐄚 𐄛 𐄜 𐄝 𐄞 𐄟 𐄠 𐄡 𐄢 𐄣 𐄤 𐄥 𐄦 𐄧 𐄨 𐄩 𐄪 𐄫
    𐄬 𐄭 𐄮 𐄯 𐄰 𐄱 𐄲 𐄳    𐄷 𐄸 𐄹 𐄺 𐄻 𐄼 𐄽 𐄾 𐄿
    𐅀 𐅁 𐅂 𐅃 𐅄 𐅅 𐅆 𐅇 𐅈 𐅉 𐅊 𐅋 𐅌 𐅍 𐅎 𐅏 𐅐 𐅑 𐅒 𐅓
    𐅔 𐅕 𐅖 𐅗 𐅘 𐅙 𐅚 𐅛 𐅜 𐅝 𐅞 𐅟 𐅠 𐅡 𐅢 𐅣 𐅤 𐅥 𐅦 𐅧
    𐅨 𐅩 𐅪 𐅫 𐅬 𐅭 𐅮 𐅯 𐅰 𐅱 𐅲 𐅳 𐅴 𐅵 𐅶 𐅷 𐅸 𐅹 𐅺 𐅻
    𐅼 𐅽 𐅾 𐅿 𐆀 𐆁 𐆂 𐆃 𐆄 𐆅 𐆆 𐆇 𐆈 𐆉 𐆊 𐆋 𐆌 𐆍 𐆎
    𐆐 𐆑 𐆒 𐆓 𐆔 𐆕 𐆖 𐆗 𐆘 𐆙 𐆚 𐆛 𐆜    𐆠

        𐇐 𐇑 𐇒 𐇓 𐇔 𐇕 𐇖 𐇗 𐇘 𐇙 𐇚 𐇛 𐇜 𐇝 𐇞 𐇟
    𐇠 𐇡 𐇢 𐇣 𐇤 𐇥 𐇦 𐇧 𐇨 𐇩 𐇪 𐇫 𐇬 𐇭 𐇮 𐇯 𐇰 𐇱 𐇲 𐇳
    𐇴 𐇵 𐇶 𐇷 𐇸 𐇹 𐇺 𐇻 𐇼

    𐊀 𐊁 𐊂 𐊃 𐊄 𐊅 𐊆 𐊇 𐊈 𐊉 𐊊 𐊋 𐊌 𐊍 𐊎 𐊏 𐊐 𐊑 𐊒 𐊓
    𐊔 𐊕 𐊖 𐊗 𐊘 𐊙 𐊚 𐊛 𐊜    𐊠 𐊡 𐊢 𐊣 𐊤 𐊥 𐊦 𐊧
    𐊨 𐊩 𐊪 𐊫 𐊬 𐊭 𐊮 𐊯 𐊰 𐊱 𐊲 𐊳 𐊴 𐊵 𐊶 𐊷 𐊸 𐊹 𐊺 𐊻
    𐊼 𐊽 𐊾 𐊿 𐋀 𐋁 𐋂 𐋃 𐋄 𐋅 𐋆 𐋇 𐋈 𐋉 𐋊 𐋋 𐋌 𐋍 𐋎 𐋏
    𐋐                𐋠 𐋡 𐋢 𐋣
    𐋤 𐋥 𐋦 𐋧 𐋨 𐋩 𐋪 𐋫 𐋬 𐋭 𐋮 𐋯 𐋰 𐋱 𐋲 𐋳 𐋴 𐋵 𐋶 𐋷
    𐋸 𐋹 𐋺 𐋻     𐌀 𐌁 𐌂 𐌃 𐌄 𐌅 𐌆 𐌇 𐌈 𐌉 𐌊 𐌋
    𐌌 𐌍 𐌎 𐌏 𐌐 𐌑 𐌒 𐌓 𐌔 𐌕 𐌖 𐌗 𐌘 𐌙 𐌚 𐌛 𐌜 𐌝 𐌞 𐌟
    𐌠 𐌡 𐌢 𐌣          𐌭 𐌮 𐌯 𐌰 𐌱 𐌲 𐌳
    𐌴 𐌵 𐌶 𐌷 𐌸 𐌹 𐌺 𐌻 𐌼 𐌽 𐌾 𐌿 𐍀 𐍁 𐍂 𐍃 𐍄 𐍅 𐍆 𐍇
    𐍈 𐍉 𐍊      𐍐 𐍑 𐍒 𐍓 𐍔 𐍕 𐍖 𐍗 𐍘 𐍙 𐍚 𐍛
    𐍜 𐍝 𐍞 𐍟 𐍠 𐍡 𐍢 𐍣 𐍤 𐍥 𐍦 𐍧 𐍨 𐍩 𐍪 𐍫 𐍬 𐍭 𐍮 𐍯
    𐍰 𐍱 𐍲 𐍳 𐍴 𐍵 𐍶 𐍷 𐍸 𐍹 𐍺      𐎀 𐎁 𐎂 𐎃
    𐎄 𐎅 𐎆 𐎇 𐎈 𐎉 𐎊 𐎋 𐎌 𐎍 𐎎 𐎏 𐎐 𐎑 𐎒 𐎓 𐎔 𐎕 𐎖 𐎗
    𐎘 𐎙 𐎚 𐎛 𐎜 𐎝  𐎟 𐎠 𐎡 𐎢 𐎣 𐎤 𐎥 𐎦 𐎧 𐎨 𐎩 𐎪 𐎫
    𐎬 𐎭 𐎮 𐎯 𐎰 𐎱 𐎲 𐎳 𐎴 𐎵 𐎶 𐎷 𐎸 𐎹 𐎺 𐎻 𐎼 𐎽 𐎾 𐎿
    𐏀 𐏁 𐏂 𐏃     𐏈 𐏉 𐏊 𐏋 𐏌 𐏍 𐏎 𐏏 𐏐 𐏑 𐏒 𐏓
    𐏔 𐏕
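
The affected ranges follow from the surrogate pair arithmetic: each high surrogate covers a block of 0x400 codepoints. A small sketch, using a hypothetical helper name, that computes the block for the two boundary values:

```ruby
# Hypothetical helper: the codepoint range reachable through one high surrogate.
def codepoints_for_high_surrogate(high)
  first = ((high - 0xD800) * 0x400) + 0x10000
  first..(first + 0x3FF)
end

[0xD800, 0xDBFF].each do |high|
  range = codepoints_for_high_surrogate(high)
  puts format("0x%X covers U+%X..U+%X (%d codepoints)",
              high, range.first, range.last, range.size)
end
```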
yob committed Dec 26, 2023
1 parent 7762166 commit 65623e7
Showing 3 changed files with 53 additions and 2 deletions.
4 changes: 2 additions & 2 deletions lib/pdf/reader/cmap.rb
@@ -118,8 +118,8 @@ def str_to_int(str)
         result = []
         while unpacked_string.any? do
           if unpacked_string.size >= 2 &&
-              unpacked_string.first.to_i > 0xD800 &&
-              unpacked_string.first.to_i < 0xDBFF
+              unpacked_string.first.to_i >= 0xD800 &&
+              unpacked_string.first.to_i <= 0xDBFF
             # this is a Unicode UTF-16 "Surrogate Pair" see Unicode Spec. Chapter 3.7
             # lets convert to a UTF-32. (the high bit is between 0xD800-0xDBFF, the
             # low bit is between 0xDC00-0xDFFF) for example: U+1D44E (U+D835 U+DC4E)
13 changes: 13 additions & 0 deletions spec/cmap_spec.rb
@@ -116,5 +116,18 @@
         expect(map.decode(0x00C1)).to eql([0x00C1])
       end
     end
+
+    context "cmap with bfchar and surrogate pairs, where the surrogate pair starts with D800" do
+      it "correctly loads character mapping" do
+        filename = File.dirname(__FILE__) + "/data/cmap_with_surrogate_pairs_on_boundary.txt"
+        map = PDF::Reader::CMap.new(binread(filename))
+        expect(map.map).to be_a_kind_of(Hash)
+        expect(map.size).to eq(27)
+        expect(map.map[0x0]).to eq([0x10102])
+        expect(map.map[0xB]).to eq([0x28])
+        expect(map.map[0x1E]).to eq([0x3B])
+        expect(map.map[0x0194]).to eq([0x25CF])
+      end
+    end
   end
 end
38 changes: 38 additions & 0 deletions spec/data/cmap_with_surrogate_pairs_on_boundary.txt
@@ -0,0 +1,38 @@
+/CIDInit /ProcSet findresource begin
+14 dict begin
+begincmap
+/CIDSystemInfo
+<< /Registry (Adobe)
+/Ordering (UCS)
+/Supplement 0
+>> def
+/CMapName /Adobe-Identity-UCS def
+/CMapType 2 def
+1 begincodespacerange
+<0000> <FFFF>
+endcodespacerange
+2 beginbfchar
+<0000> <D800DD02>
+<0003> <0020>
+endbfchar
+1 beginbfrange
+<000B> <000C> <0028>
+endbfrange
+1 beginbfchar
+<001E> <003B>
+endbfchar
+3 beginbfrange
+<0044> <004C> <0061>
+<004F> <0053> <006C>
+<0055> <0058> <0072>
+endbfrange
+4 beginbfchar
+<005C> <0079>
+<00B2> <2014>
+<00B6> <2019>
+<0194> <25CF>
+endbfchar
+endcmap
+CMapName currentdict /CMap defineresource pop
+end
+end
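
The `<0000> <D800DD02>` entry in the fixture is the UTF-16BE form of U+10102, the value the new spec expects for code 0x0. A sketch of how such a fixture entry could be derived from a codepoint; `utf16be_hex` is a hypothetical helper and not part of the commit.

```ruby
# Hypothetical helper for building bfchar entries; not part of the commit.
# Encodes a codepoint as UTF-16BE and returns the bytes as upper-case hex.
def utf16be_hex(codepoint)
  [codepoint].pack("U").encode(Encoding::UTF_16BE).unpack1("H*").upcase
end

puts "<0000> <#{utf16be_hex(0x10102)}>"
puts "<0194> <#{utf16be_hex(0x25CF)}>"
```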