Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix decoding of some UTF-16 strings that use surrogate pairs #529

Merged
merged 1 commit into from
Dec 26, 2023

Conversation

yob
Copy link
Owner

@yob yob commented Dec 26, 2023

When extracting PDF text to UTF-8, some fonts use a ToUnicode mapping that's defined using a CMap table. CMap tables define unicode using UTF-16 and for reasons, we unwisely do the decoding of UTF16 to codepoints ourselves instead of deferring to a library.

Turns out we had a boundary bug, where some codepoints that get encoded with the surrogate pair 0xD800 or 0xDBFF weren't detected as surrogate pairs and were decoded incorrectly.

This would usually manifest as an incompatible encoding error while extracting text:

/home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `strip': invalid byte sequence in UTF-8 (Encoding::CompatibilityError)
    from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `block in interesting_rows'
    from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `map'
    from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `interesting_rows'
    from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:46:in `to_s'
    from /home/jh/git/pdf-reader/lib/pdf/reader/page.rb:121:in `text'
    from bin/pdf_text:12:in `block in <main>'
    from bin/pdf_text:11:in `each'
    from bin/pdf_text:11:in `<main>'

I believe Unicode codepoints in the range 0x10000 (decimal 65536) to 0x103FF (decimal 66559) were impacted, a total of 1023 codepoints. Technically higher codepoints were also impacted, but in an unallocated range). They're mostly ancient languages and numbers, like Aegean Numbers, Ancient Greek, Phaistos Disc, and Old Persian.

(65536...66559).to_a.map { |c| [c].pack("U*") }.each_slice(20)  { |s| puts s.join(" " )}                                     
𐀀 𐀁 𐀂 𐀃 𐀄 𐀅 𐀆 𐀇 𐀈 𐀉 𐀊 𐀋  𐀍 𐀎 𐀏 𐀐 𐀑 𐀒 𐀓
𐀔 𐀕 𐀖 𐀗 𐀘 𐀙 𐀚 𐀛 𐀜 𐀝 𐀞 𐀟 𐀠 𐀡 𐀢 𐀣 𐀤 𐀥 𐀦                                                                                                                                                                                                                                                       
𐀨 𐀩 𐀪 𐀫 𐀬 𐀭 𐀮 𐀯 𐀰 𐀱 𐀲 𐀳 𐀴 𐀵 𐀶 𐀷 𐀸 𐀹 𐀺                                                                                                                                                                                                                                                       
𐀼 𐀽  𐀿 𐁀 𐁁 𐁂 𐁃 𐁄 𐁅 𐁆 𐁇 𐁈 𐁉 𐁊 𐁋 𐁌 𐁍                                                                                                                                                                                                                                                          
𐁐 𐁑 𐁒 𐁓 𐁔 𐁕 𐁖 𐁗 𐁘 𐁙 𐁚 𐁛 𐁜 𐁝                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                            
        𐂀 𐂁 𐂂 𐂃 𐂄 𐂅 𐂆 𐂇 𐂈 𐂉 𐂊 𐂋                                                                                                                                                                                                                                                             
𐂌 𐂍 𐂎 𐂏 𐂐 𐂑 𐂒 𐂓 𐂔 𐂕 𐂖 𐂗 𐂘 𐂙 𐂚 𐂛 𐂜 𐂝 𐂞 𐂟                                                                                                                                                                                                                                                     
𐂠 𐂡 𐂢 𐂣 𐂤 𐂥 𐂦 𐂧 𐂨 𐂩 𐂪 𐂫 𐂬 𐂭 𐂮 𐂯 𐂰 𐂱 𐂲 𐂳                                                                                                                                                                                                                                                     
𐂴 𐂵 𐂶 𐂷 𐂸 𐂹 𐂺 𐂻 𐂼 𐂽 𐂾 𐂿 𐃀 𐃁 𐃂 𐃃 𐃄 𐃅 𐃆 𐃇                                                                                                                                                                                                                                                     
𐃈 𐃉 𐃊 𐃋 𐃌 𐃍 𐃎 𐃏 𐃐 𐃑 𐃒 𐃓 𐃔 𐃕 𐃖 𐃗 𐃘 𐃙 𐃚 𐃛                                                                                                                                                                                                                                                     
𐃜 𐃝 𐃞 𐃟 𐃠 𐃡 𐃢 𐃣 𐃤 𐃥 𐃦 𐃧 𐃨 𐃩 𐃪 𐃫 𐃬 𐃭 𐃮 𐃯                                                                                                                                                                                                                                                     
𐃰 𐃱 𐃲 𐃳 𐃴 𐃵 𐃶 𐃷 𐃸 𐃹 𐃺      𐄀 𐄁 𐄂                                                                                                                                                                                                                                                            
   𐄇 𐄈 𐄉 𐄊 𐄋 𐄌 𐄍 𐄎 𐄏 𐄐 𐄑 𐄒 𐄓 𐄔 𐄕 𐄖 𐄗                                                                                                               
𐄘 𐄙 𐄚 𐄛 𐄜 𐄝 𐄞 𐄟 𐄠 𐄡 𐄢 𐄣 𐄤 𐄥 𐄦 𐄧 𐄨 𐄩 𐄪 𐄫                                                                                                            
𐄬 𐄭 𐄮 𐄯 𐄰 𐄱 𐄲 𐄳    𐄷 𐄸 𐄹 𐄺 𐄻 𐄼 𐄽 𐄾 𐄿
𐅀 𐅁 𐅂 𐅃 𐅄 𐅅 𐅆 𐅇 𐅈 𐅉 𐅊 𐅋 𐅌 𐅍 𐅎 𐅏 𐅐 𐅑 𐅒 𐅓
𐅔 𐅕 𐅖 𐅗 𐅘 𐅙 𐅚 𐅛 𐅜 𐅝 𐅞 𐅟 𐅠 𐅡 𐅢 𐅣 𐅤 𐅥 𐅦 𐅧
𐅨 𐅩 𐅪 𐅫 𐅬 𐅭 𐅮 𐅯 𐅰 𐅱 𐅲 𐅳 𐅴 𐅵 𐅶 𐅷 𐅸 𐅹 𐅺 𐅻
𐅼 𐅽 𐅾 𐅿 𐆀 𐆁 𐆂 𐆃 𐆄 𐆅 𐆆 𐆇 𐆈 𐆉 𐆊 𐆋 𐆌 𐆍 𐆎 
𐆐 𐆑 𐆒 𐆓 𐆔 𐆕 𐆖 𐆗 𐆘 𐆙 𐆚 𐆛 𐆜    𐆠   
                    
    𐇐 𐇑 𐇒 𐇓 𐇔 𐇕 𐇖 𐇗 𐇘 𐇙 𐇚 𐇛 𐇜 𐇝 𐇞 𐇟
𐇠 𐇡 𐇢 𐇣 𐇤 𐇥 𐇦 𐇧 𐇨 𐇩 𐇪 𐇫 𐇬 𐇭 𐇮 𐇯 𐇰 𐇱 𐇲 𐇳
𐇴 𐇵 𐇶 𐇷 𐇸 𐇹 𐇺 𐇻 𐇼 𐇽          
                    
𐊀 𐊁 𐊂 𐊃 𐊄 𐊅 𐊆 𐊇 𐊈 𐊉 𐊊 𐊋 𐊌 𐊍 𐊎 𐊏 𐊐 𐊑 𐊒 𐊓
𐊔 𐊕 𐊖 𐊗 𐊘 𐊙 𐊚 𐊛 𐊜    𐊠 𐊡 𐊢 𐊣 𐊤 𐊥 𐊦 𐊧
𐊨 𐊩 𐊪 𐊫 𐊬 𐊭 𐊮 𐊯 𐊰 𐊱 𐊲 𐊳 𐊴 𐊵 𐊶 𐊷 𐊸 𐊹 𐊺 𐊻
𐊼 𐊽 𐊾 𐊿 𐋀 𐋁 𐋂 𐋃 𐋄 𐋅 𐋆 𐋇 𐋈 𐋉 𐋊 𐋋 𐋌 𐋍 𐋎 𐋏
𐋐                𐋠 𐋡 𐋢 𐋣
𐋤 𐋥 𐋦 𐋧 𐋨 𐋩 𐋪 𐋫 𐋬 𐋭 𐋮 𐋯 𐋰 𐋱 𐋲 𐋳 𐋴 𐋵 𐋶 𐋷
𐋸 𐋹 𐋺 𐋻     𐌀 𐌁 𐌂 𐌃 𐌄 𐌅 𐌆 𐌇 𐌈 𐌉 𐌊 𐌋
𐌌 𐌍 𐌎 𐌏 𐌐 𐌑 𐌒 𐌓 𐌔 𐌕 𐌖 𐌗 𐌘 𐌙 𐌚 𐌛 𐌜 𐌝 𐌞 𐌟
𐌠 𐌡 𐌢 𐌣          𐌭 𐌮 𐌯 𐌰 𐌱 𐌲 𐌳
𐌴 𐌵 𐌶 𐌷 𐌸 𐌹 𐌺 𐌻 𐌼 𐌽 𐌾 𐌿 𐍀 𐍁 𐍂 𐍃 𐍄 𐍅 𐍆 𐍇
𐍈 𐍉 𐍊      𐍐 𐍑 𐍒 𐍓 𐍔 𐍕 𐍖 𐍗 𐍘 𐍙 𐍚 𐍛
𐍜 𐍝 𐍞 𐍟 𐍠 𐍡 𐍢 𐍣 𐍤 𐍥 𐍦 𐍧 𐍨 𐍩 𐍪 𐍫 𐍬 𐍭 𐍮 𐍯
𐍰 𐍱 𐍲 𐍳 𐍴 𐍵 𐍶 𐍷 𐍸 𐍹 𐍺      𐎀 𐎁 𐎂 𐎃
𐎄 𐎅 𐎆 𐎇 𐎈 𐎉 𐎊 𐎋 𐎌 𐎍 𐎎 𐎏 𐎐 𐎑 𐎒 𐎓 𐎔 𐎕 𐎖 𐎗
𐎘 𐎙 𐎚 𐎛 𐎜 𐎝  𐎟 𐎠 𐎡 𐎢 𐎣 𐎤 𐎥 𐎦 𐎧 𐎨 𐎩 𐎪 𐎫
𐎬 𐎭 𐎮 𐎯 𐎰 𐎱 𐎲 𐎳 𐎴 𐎵 𐎶 𐎷 𐎸 𐎹 𐎺 𐎻 𐎼 𐎽 𐎾 𐎿
𐏀 𐏁 𐏂 𐏃     𐏈 𐏉 𐏊 𐏋 𐏌 𐏍 𐏎 𐏏 𐏐 𐏑 𐏒 𐏓
𐏔 𐏕

When extracting PDF text to UTF-8, some fonts use a ToUnicode mapping
that's defined using a CMap table. CMap tables define unicode using
UTF-16 and for reasons, we unwisely do the decoding of UTF16 to
codepoints ourselves instead of deferring to a library.

Turns out we had a boundary bug, where some codepoints that get encoded
with the surrogate pair 0xD800 or 0xDBFF weren't detected as surrogate
pairs and were decoded incorrectly.

This would usually manifest as an incompatible encoding error while
extracting text:

    /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `strip': invalid byte sequence in UTF-8 (Encoding::CompatibilityError)
        from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `block in interesting_rows'
        from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `map'
        from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:66:in `interesting_rows'
        from /home/jh/git/pdf-reader/lib/pdf/reader/page_layout.rb:46:in `to_s'
        from /home/jh/git/pdf-reader/lib/pdf/reader/page.rb:121:in `text'
        from bin/pdf_text:12:in `block in <main>'
        from bin/pdf_text:11:in `each'
        from bin/pdf_text:11:in `<main>'

I believe Unicode codepoints in the range 0x10000 (decimal 65536) to
0x103FF (decimal 66559) were impacted, a total of 1023 codepoints.
Technically higher codepoints were also impacted, but in an unallocated
range). They're mostly ancient languages and numbers, like [Aegean
Numbers](https://en.wikipedia.org/wiki/Aegean_Numbers_(Unicode_block)),
[Ancient
Greek](https://en.wikipedia.org/wiki/Ancient_Greek_Numbers_(Unicode_block)),
[Phaistos
Disc](https://en.wikipedia.org/wiki/Phaistos_Disc_(Unicode_block)), and
[Old
Persian](https://en.wikipedia.org/wiki/Old_Persian_(Unicode_block)).

    (65536...66559).to_a.map { |c| [c].pack("U*") }.each_slice(20)  { |s| puts s.join(" " )}
    𐀀 𐀁 𐀂 𐀃 𐀄 𐀅 𐀆 𐀇 𐀈 𐀉 𐀊 𐀋  𐀍 𐀎 𐀏 𐀐 𐀑 𐀒 𐀓
    𐀔 𐀕 𐀖 𐀗 𐀘 𐀙 𐀚 𐀛 𐀜 𐀝 𐀞 𐀟 𐀠 𐀡 𐀢 𐀣 𐀤 𐀥 𐀦
    𐀨 𐀩 𐀪 𐀫 𐀬 𐀭 𐀮 𐀯 𐀰 𐀱 𐀲 𐀳 𐀴 𐀵 𐀶 𐀷 𐀸 𐀹 𐀺
    𐀼 𐀽  𐀿 𐁀 𐁁 𐁂 𐁃 𐁄 𐁅 𐁆 𐁇 𐁈 𐁉 𐁊 𐁋 𐁌 𐁍
    𐁐 𐁑 𐁒 𐁓 𐁔 𐁕 𐁖 𐁗 𐁘 𐁙 𐁚 𐁛 𐁜 𐁝

            𐂀 𐂁 𐂂 𐂃 𐂄 𐂅 𐂆 𐂇 𐂈 𐂉 𐂊 𐂋
    𐂌 𐂍 𐂎 𐂏 𐂐 𐂑 𐂒 𐂓 𐂔 𐂕 𐂖 𐂗 𐂘 𐂙 𐂚 𐂛 𐂜 𐂝 𐂞 𐂟
    𐂠 𐂡 𐂢 𐂣 𐂤 𐂥 𐂦 𐂧 𐂨 𐂩 𐂪 𐂫 𐂬 𐂭 𐂮 𐂯 𐂰 𐂱 𐂲 𐂳
    𐂴 𐂵 𐂶 𐂷 𐂸 𐂹 𐂺 𐂻 𐂼 𐂽 𐂾 𐂿 𐃀 𐃁 𐃂 𐃃 𐃄 𐃅 𐃆 𐃇
    𐃈 𐃉 𐃊 𐃋 𐃌 𐃍 𐃎 𐃏 𐃐 𐃑 𐃒 𐃓 𐃔 𐃕 𐃖 𐃗 𐃘 𐃙 𐃚 𐃛
    𐃜 𐃝 𐃞 𐃟 𐃠 𐃡 𐃢 𐃣 𐃤 𐃥 𐃦 𐃧 𐃨 𐃩 𐃪 𐃫 𐃬 𐃭 𐃮 𐃯
    𐃰 𐃱 𐃲 𐃳 𐃴 𐃵 𐃶 𐃷 𐃸 𐃹 𐃺      𐄀 𐄁 𐄂
       𐄇 𐄈 𐄉 𐄊 𐄋 𐄌 𐄍 𐄎 𐄏 𐄐 𐄑 𐄒 𐄓 𐄔 𐄕 𐄖 𐄗
    𐄘 𐄙 𐄚 𐄛 𐄜 𐄝 𐄞 𐄟 𐄠 𐄡 𐄢 𐄣 𐄤 𐄥 𐄦 𐄧 𐄨 𐄩 𐄪 𐄫
    𐄬 𐄭 𐄮 𐄯 𐄰 𐄱 𐄲 𐄳    𐄷 𐄸 𐄹 𐄺 𐄻 𐄼 𐄽 𐄾 𐄿
    𐅀 𐅁 𐅂 𐅃 𐅄 𐅅 𐅆 𐅇 𐅈 𐅉 𐅊 𐅋 𐅌 𐅍 𐅎 𐅏 𐅐 𐅑 𐅒 𐅓
    𐅔 𐅕 𐅖 𐅗 𐅘 𐅙 𐅚 𐅛 𐅜 𐅝 𐅞 𐅟 𐅠 𐅡 𐅢 𐅣 𐅤 𐅥 𐅦 𐅧
    𐅨 𐅩 𐅪 𐅫 𐅬 𐅭 𐅮 𐅯 𐅰 𐅱 𐅲 𐅳 𐅴 𐅵 𐅶 𐅷 𐅸 𐅹 𐅺 𐅻
    𐅼 𐅽 𐅾 𐅿 𐆀 𐆁 𐆂 𐆃 𐆄 𐆅 𐆆 𐆇 𐆈 𐆉 𐆊 𐆋 𐆌 𐆍 𐆎
    𐆐 𐆑 𐆒 𐆓 𐆔 𐆕 𐆖 𐆗 𐆘 𐆙 𐆚 𐆛 𐆜    𐆠

        𐇐 𐇑 𐇒 𐇓 𐇔 𐇕 𐇖 𐇗 𐇘 𐇙 𐇚 𐇛 𐇜 𐇝 𐇞 𐇟
    𐇠 𐇡 𐇢 𐇣 𐇤 𐇥 𐇦 𐇧 𐇨 𐇩 𐇪 𐇫 𐇬 𐇭 𐇮 𐇯 𐇰 𐇱 𐇲 𐇳
    𐇴 𐇵 𐇶 𐇷 𐇸 𐇹 𐇺 𐇻 𐇼

    𐊀 𐊁 𐊂 𐊃 𐊄 𐊅 𐊆 𐊇 𐊈 𐊉 𐊊 𐊋 𐊌 𐊍 𐊎 𐊏 𐊐 𐊑 𐊒 𐊓
    𐊔 𐊕 𐊖 𐊗 𐊘 𐊙 𐊚 𐊛 𐊜    𐊠 𐊡 𐊢 𐊣 𐊤 𐊥 𐊦 𐊧
    𐊨 𐊩 𐊪 𐊫 𐊬 𐊭 𐊮 𐊯 𐊰 𐊱 𐊲 𐊳 𐊴 𐊵 𐊶 𐊷 𐊸 𐊹 𐊺 𐊻
    𐊼 𐊽 𐊾 𐊿 𐋀 𐋁 𐋂 𐋃 𐋄 𐋅 𐋆 𐋇 𐋈 𐋉 𐋊 𐋋 𐋌 𐋍 𐋎 𐋏
    𐋐                𐋠 𐋡 𐋢 𐋣
    𐋤 𐋥 𐋦 𐋧 𐋨 𐋩 𐋪 𐋫 𐋬 𐋭 𐋮 𐋯 𐋰 𐋱 𐋲 𐋳 𐋴 𐋵 𐋶 𐋷
    𐋸 𐋹 𐋺 𐋻     𐌀 𐌁 𐌂 𐌃 𐌄 𐌅 𐌆 𐌇 𐌈 𐌉 𐌊 𐌋
    𐌌 𐌍 𐌎 𐌏 𐌐 𐌑 𐌒 𐌓 𐌔 𐌕 𐌖 𐌗 𐌘 𐌙 𐌚 𐌛 𐌜 𐌝 𐌞 𐌟
    𐌠 𐌡 𐌢 𐌣          𐌭 𐌮 𐌯 𐌰 𐌱 𐌲 𐌳
    𐌴 𐌵 𐌶 𐌷 𐌸 𐌹 𐌺 𐌻 𐌼 𐌽 𐌾 𐌿 𐍀 𐍁 𐍂 𐍃 𐍄 𐍅 𐍆 𐍇
    𐍈 𐍉 𐍊      𐍐 𐍑 𐍒 𐍓 𐍔 𐍕 𐍖 𐍗 𐍘 𐍙 𐍚 𐍛
    𐍜 𐍝 𐍞 𐍟 𐍠 𐍡 𐍢 𐍣 𐍤 𐍥 𐍦 𐍧 𐍨 𐍩 𐍪 𐍫 𐍬 𐍭 𐍮 𐍯
    𐍰 𐍱 𐍲 𐍳 𐍴 𐍵 𐍶 𐍷 𐍸 𐍹 𐍺      𐎀 𐎁 𐎂 𐎃
    𐎄 𐎅 𐎆 𐎇 𐎈 𐎉 𐎊 𐎋 𐎌 𐎍 𐎎 𐎏 𐎐 𐎑 𐎒 𐎓 𐎔 𐎕 𐎖 𐎗
    𐎘 𐎙 𐎚 𐎛 𐎜 𐎝  𐎟 𐎠 𐎡 𐎢 𐎣 𐎤 𐎥 𐎦 𐎧 𐎨 𐎩 𐎪 𐎫
    𐎬 𐎭 𐎮 𐎯 𐎰 𐎱 𐎲 𐎳 𐎴 𐎵 𐎶 𐎷 𐎸 𐎹 𐎺 𐎻 𐎼 𐎽 𐎾 𐎿
    𐏀 𐏁 𐏂 𐏃     𐏈 𐏉 𐏊 𐏋 𐏌 𐏍 𐏎 𐏏 𐏐 𐏑 𐏒 𐏓
    𐏔 𐏕
@yob yob merged commit b5dbae9 into main Dec 26, 2023
1 check passed
@yob yob deleted the fix-utf16-surrogate-pairs branch December 26, 2023 00:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant