Fix decoding of some UTF-16 strings that use surrogate pairs

When extracting PDF text to UTF-8, some fonts use a ToUnicode mapping that's defined using a CMap table. CMap tables define unicode using UTF-16 and for reasons, we unwisely do the decoding of UTF16 to codepoints ourselves instead of deferring to a library. Turns out we had a boundary bug, where some codepoints that get encoded with the surrogate pair 0xD800 or 0xDBFF weren't detected as surrogate pairs and were decoded incorrectly. This would usually manifest as an incompatible encoding error while extracting text: /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `strip': invalid byte sequence in UTF-8 (Encoding::CompatibilityError) from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `block in interesting_rows' from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `map' from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `interesting_rows' from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:46:in `to_s' from /home/jh/git/pdf-reader/lib/pdf/reader/page.rb:121:in `text' from bin/pdf_text:12:in `block in <main>' from bin/pdf_text:11:in `each' from bin/pdf_text:11:in `<main>' I believe Unicode codepoints in the range 0x10000 (decimal 65536) to 0x103FF (decimal 66559) were impacted, a total of 1023 codepoints. Technically higher codepoints were also impacted, but in an unallocated range). They're mostly ancient languages and numbers, like [Aegean Numbers](https://en.wikipedia.org/wiki/Aegean_Numbers_(Unicode_block)), [Ancient Greek](https://en.wikipedia.org/wiki/Ancient_Greek_Numbers_(Unicode_block)), [Phaistos Disc](https://en.wikipedia.org/wiki/Phaistos_Disc_(Unicode_block)), and [Old Persian](https://en.wikipedia.org/wiki/Old_Persian_(Unicode_block)). (65536...66559).to_a.map { |c| [c].pack("U*") }.each_slice(20) { |s| puts s.join(" " )} 𐀀 𐀁 𐀂 𐀃 𐀄 𐀅 𐀆 𐀇 𐀈 𐀉 𐀊 𐀋 𐀍 𐀎 𐀏 𐀐 𐀑 𐀒 𐀓 𐀔 𐀕 𐀖 𐀗 𐀘 𐀙 𐀚 𐀛 𐀜 𐀝 𐀞 𐀟 𐀠 𐀡 𐀢 𐀣 𐀤 𐀥 𐀦 𐀨 𐀩 𐀪 𐀫 𐀬 𐀭 𐀮 𐀯 𐀰 𐀱 𐀲 𐀳 𐀴 𐀵 𐀶 𐀷 𐀸 𐀹 𐀺 𐀼 𐀽 𐀿 𐁀 𐁁 𐁂 𐁃 𐁄 𐁅 𐁆 𐁇 𐁈 𐁉 𐁊 𐁋 𐁌 𐁍 𐁐 𐁑 𐁒 𐁓 𐁔 𐁕 𐁖 𐁗 𐁘 𐁙 𐁚 𐁛 𐁜 𐁝 𐂀 𐂁 𐂂 𐂃 𐂄 𐂅 𐂆 𐂇 𐂈 𐂉 𐂊 𐂋 𐂌 𐂍 𐂎 𐂏 𐂐 𐂑 𐂒 𐂓 𐂔 𐂕 𐂖 𐂗 𐂘 𐂙 𐂚 𐂛 𐂜 𐂝 𐂞 𐂟 𐂠 𐂡 𐂢 𐂣 𐂤 𐂥 𐂦 𐂧 𐂨 𐂩 𐂪 𐂫 𐂬 𐂭 𐂮 𐂯 𐂰 𐂱 𐂲 𐂳 𐂴 𐂵 𐂶 𐂷 𐂸 𐂹 𐂺 𐂻 𐂼 𐂽 𐂾 𐂿 𐃀 𐃁 𐃂 𐃃 𐃄 𐃅 𐃆 𐃇 𐃈 𐃉 𐃊 𐃋 𐃌 𐃍 𐃎 𐃏 𐃐 𐃑 𐃒 𐃓 𐃔 𐃕 𐃖 𐃗 𐃘 𐃙 𐃚 𐃛 𐃜 𐃝 𐃞 𐃟 𐃠 𐃡 𐃢 𐃣 𐃤 𐃥 𐃦 𐃧 𐃨 𐃩 𐃪 𐃫 𐃬 𐃭 𐃮 𐃯 𐃰 𐃱 𐃲 𐃳 𐃴 𐃵 𐃶 𐃷 𐃸 𐃹 𐃺 𐄀 𐄁 𐄂 𐄇 𐄈 𐄉 𐄊 𐄋 𐄌 𐄍 𐄎 𐄏 𐄐 𐄑 𐄒 𐄓 𐄔 𐄕 𐄖 𐄗 𐄘 𐄙 𐄚 𐄛 𐄜 𐄝 𐄞 𐄟 𐄠 𐄡 𐄢 𐄣 𐄤 𐄥 𐄦 𐄧 𐄨 𐄩 𐄪 𐄫 𐄬 𐄭 𐄮 𐄯 𐄰 𐄱 𐄲 𐄳 𐄷 𐄸 𐄹 𐄺 𐄻 𐄼 𐄽 𐄾 𐄿 𐅀 𐅁 𐅂 𐅃 𐅄 𐅅 𐅆 𐅇 𐅈 𐅉 𐅊 𐅋 𐅌 𐅍 𐅎 𐅏 𐅐 𐅑 𐅒 𐅓 𐅔 𐅕 𐅖 𐅗 𐅘 𐅙 𐅚 𐅛 𐅜 𐅝 𐅞 𐅟 𐅠 𐅡 𐅢 𐅣 𐅤 𐅥 𐅦 𐅧 𐅨 𐅩 𐅪 𐅫 𐅬 𐅭 𐅮 𐅯 𐅰 𐅱 𐅲 𐅳 𐅴 𐅵 𐅶 𐅷 𐅸 𐅹 𐅺 𐅻 𐅼 𐅽 𐅾 𐅿 𐆀 𐆁 𐆂 𐆃 𐆄 𐆅 𐆆 𐆇 𐆈 𐆉 𐆊 𐆋 𐆌 𐆍 𐆎 𐆐 𐆑 𐆒 𐆓 𐆔 𐆕 𐆖 𐆗 𐆘 𐆙 𐆚 𐆛 𐆜 𐆠 𐇐 𐇑 𐇒 𐇓 𐇔 𐇕 𐇖 𐇗 𐇘 𐇙 𐇚 𐇛 𐇜 𐇝 𐇞 𐇟 𐇠 𐇡 𐇢 𐇣 𐇤 𐇥 𐇦 𐇧 𐇨 𐇩 𐇪 𐇫 𐇬 𐇭 𐇮 𐇯 𐇰 𐇱 𐇲 𐇳 𐇴 𐇵 𐇶 𐇷 𐇸 𐇹 𐇺 𐇻 𐇼 𐊀 𐊁 𐊂 𐊃 𐊄 𐊅 𐊆 𐊇 𐊈 𐊉 𐊊 𐊋 𐊌 𐊍 𐊎 𐊏 𐊐 𐊑 𐊒 𐊓 𐊔 𐊕 𐊖 𐊗 𐊘 𐊙 𐊚 𐊛 𐊜 𐊠 𐊡 𐊢 𐊣 𐊤 𐊥 𐊦 𐊧 𐊨 𐊩 𐊪 𐊫 𐊬 𐊭 𐊮 𐊯 𐊰 𐊱 𐊲 𐊳 𐊴 𐊵 𐊶 𐊷 𐊸 𐊹 𐊺 𐊻 𐊼 𐊽 𐊾 𐊿 𐋀 𐋁 𐋂 𐋃 𐋄 𐋅 𐋆 𐋇 𐋈 𐋉 𐋊 𐋋 𐋌 𐋍 𐋎 𐋏 𐋐 𐋠 𐋡 𐋢 𐋣 𐋤 𐋥 𐋦 𐋧 𐋨 𐋩 𐋪 𐋫 𐋬 𐋭 𐋮 𐋯 𐋰 𐋱 𐋲 𐋳 𐋴 𐋵 𐋶 𐋷 𐋸 𐋹 𐋺 𐋻 𐌀 𐌁 𐌂 𐌃 𐌄 𐌅 𐌆 𐌇 𐌈 𐌉 𐌊 𐌋 𐌌 𐌍 𐌎 𐌏 𐌐 𐌑 𐌒 𐌓 𐌔 𐌕 𐌖 𐌗 𐌘 𐌙 𐌚 𐌛 𐌜 𐌝 𐌞 𐌟 𐌠 𐌡 𐌢 𐌣 𐌭 𐌮 𐌯 𐌰 𐌱 𐌲 𐌳 𐌴 𐌵 𐌶 𐌷 𐌸 𐌹 𐌺 𐌻 𐌼 𐌽 𐌾 𐌿 𐍀 𐍁 𐍂 𐍃 𐍄 𐍅 𐍆 𐍇 𐍈 𐍉 𐍊 𐍐 𐍑 𐍒 𐍓 𐍔 𐍕 𐍖 𐍗 𐍘 𐍙 𐍚 𐍛 𐍜 𐍝 𐍞 𐍟 𐍠 𐍡 𐍢 𐍣 𐍤 𐍥 𐍦 𐍧 𐍨 𐍩 𐍪 𐍫 𐍬 𐍭 𐍮 𐍯 𐍰 𐍱 𐍲 𐍳 𐍴 𐍵 𐍶 𐍷 𐍸 𐍹 𐍺 𐎀 𐎁 𐎂 𐎃 𐎄 𐎅 𐎆 𐎇 𐎈 𐎉 𐎊 𐎋 𐎌 𐎍 𐎎 𐎏 𐎐 𐎑 𐎒 𐎓 𐎔 𐎕 𐎖 𐎗 𐎘 𐎙 𐎚 𐎛 𐎜 𐎝 𐎟 𐎠 𐎡 𐎢 𐎣 𐎤 𐎥 𐎦 𐎧 𐎨 𐎩 𐎪 𐎫 𐎬 𐎭 𐎮 𐎯 𐎰 𐎱 𐎲 𐎳 𐎴 𐎵 𐎶 𐎷 𐎸 𐎹 𐎺 𐎻 𐎼 𐎽 𐎾 𐎿 𐏀 𐏁 𐏂 𐏃 𐏈 𐏉 𐏊 𐏋 𐏌 𐏍 𐏎 𐏏 𐏐 𐏑 𐏒 𐏓 𐏔 𐏕
yob · Dec 26, 2023 · 65623e7 · 65623e7
1 parent 7762166
commit 65623e7
Show file tree

Hide file tree

Showing 3 changed files with 53 additions and 2 deletions.
diff --git a/lib/pdf/reader/cmap.rb b/lib/pdf/reader/cmap.rb
@@ -118,8 +118,8 @@ def str_to_int(str)
       result = []
       while unpacked_string.any? do
         if unpacked_string.size >= 2 &&
-            unpacked_string.first.to_i > 0xD800 &&
-            unpacked_string.first.to_i < 0xDBFF
+            unpacked_string.first.to_i >= 0xD800 &&
+            unpacked_string.first.to_i <= 0xDBFF
           # this is a Unicode UTF-16 "Surrogate Pair" see Unicode Spec. Chapter 3.7
           # lets convert to a UTF-32. (the high bit is between 0xD800-0xDBFF, the
           # low bit is between 0xDC00-0xDFFF) for example: U+1D44E (U+D835 U+DC4E)

diff --git a/spec/cmap_spec.rb b/spec/cmap_spec.rb
@@ -116,5 +116,18 @@
         expect(map.decode(0x00C1)).to eql([0x00C1])
       end
     end
+
+    context "cmap with bfchar and surrogate pairs, where the surrogate pair starts with D800" do
+      it "correctly loads character mapping" do
+        filename = File.dirname(__FILE__) + "/data/cmap_with_surrogate_pairs_on_boundary.txt"
+        map = PDF::Reader::CMap.new(binread(filename))
+        expect(map.map).to be_a_kind_of(Hash)
+        expect(map.size).to        eq(27)
+        expect(map.map[0x0]).to    eq([0x10102])
+        expect(map.map[0xB]).to    eq([0x28])
+        expect(map.map[0x1E]).to   eq([0x3B])
+        expect(map.map[0x0194]).to eq([0x25CF])
+      end
+    end
   end
 end
diff --git a/spec/data/cmap_with_surrogate_pairs_on_boundary.txt b/spec/data/cmap_with_surrogate_pairs_on_boundary.txt
@@ -0,0 +1,38 @@
+/CIDInit /ProcSet findresource begin
+14 dict begin
+begincmap
+/CIDSystemInfo
+<< /Registry (Adobe)
+/Ordering (UCS)
+/Supplement 0
+>> def
+/CMapName /Adobe-Identity-UCS def
+/CMapType 2 def
+1 begincodespacerange
+<0000> <FFFF>
+endcodespacerange
+2 beginbfchar
+<0000> <D800DD02>
+<0003> <0020>
+endbfchar
+1 beginbfrange
+<000B> <000C> <0028>
+endbfrange
+1 beginbfchar
+<001E> <003B>
+endbfchar
+3 beginbfrange
+<0044> <004C> <0061>
+<004F> <0053> <006C>
+<0055> <0058> <0072>
+endbfrange
+4 beginbfchar
+<005C> <0079>
+<00B2> <2014>
+<00B6> <2019>
+<0194> <25CF>
+endbfchar
+endcmap
+CMapName currentdict /CMap defineresource pop
+end
+end