Reduce allocations when parsing hex strings
Running a script based on one shared by Aaron at [1], I noticed we allocate a
surprising number of objects when parsing hex strings.

When parsing a file with lots of hex strings, the allocations.rb script (see
below) shows the hex_string method as the top source of allocations. We can
fix that!

-- before

    $ ruby allocations.rb | head -n 10
                          sourcefile                        sourceline                   class                   count
    ------------------------------------------------------  ----------  ---------------------------------------  -----
    <PWD>/lib/pdf/reader/parser.rb                                 176  Array                                    65246
    <PWD>/lib/pdf/reader/parser.rb                                 176  String                                   63124
    <PWD>/lib/pdf/reader/parser.rb                                 177  String                                   53500
    <PWD>/lib/pdf/reader/buffer.rb                                 362  String                                   41386
    <PWD>/lib/pdf/reader/buffer.rb                                 384  String                                   27386
    <PWD>/lib/pdf/reader/transformation_matrix.rb                   20  Array                                    19238
    <PWD>/lib/pdf/reader/page_state.rb                             243  Array                                    14846
    <PWD>/lib/pdf/reader/encoding.rb                               143  Array                                    14336

    $ ruby benchmark.rb
    ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
    Warming up --------------------------------------
                             1.000 i/100ms
    Calculating -------------------------------------
                              1.973 (± 0.0%) i/s -     20.000 in  10.135409s
    {:ALLOCATIONS=>772349}

-- after

    $ ruby allocations.rb | head -n 10
                          sourcefile                        sourceline                   class                   count
    ------------------------------------------------------  ----------  ---------------------------------------  -----
    <PWD>/lib/pdf/reader/buffer.rb                                 362  String                                   41386
    <PWD>/lib/pdf/reader/buffer.rb                                 384  String                                   27386
    <PWD>/lib/pdf/reader/transformation_matrix.rb                   20  Array                                    19238
    <internal:pack>                                                  8  String                                   17047
    <PWD>/lib/pdf/reader/page_state.rb                             243  Array                                    14846
    <PWD>/lib/pdf/reader/encoding.rb                               143  Array                                    14336
    <PWD>/lib/pdf/reader/page_state.rb                             342  PDF::Reader::TransformationMatrix        10743
    <PWD>/lib/pdf/reader/transformation_matrix.rb                  115  Array                                    10641

    $ ruby benchmark.rb
    ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
    Warming up --------------------------------------
                             1.000 i/100ms
    Calculating -------------------------------------
                              2.097 (± 0.0%) i/s -     21.000 in  10.017634s
    {:ALLOCATIONS=>561561}
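
For a sense of scale, the before and after figures above can be turned into relative changes. This is only arithmetic on the single-run numbers reported in this message; the variable names are illustrative, and nothing here says how stable the numbers are across runs.

```ruby
# Figures copied from the "before" and "after" benchmark output above.
before = { ips: 1.973, allocations: 772_349 }
after  = { ips: 2.097, allocations: 561_561 }

# Absolute and relative drop in allocated objects for one pass over the PDF.
saved = before[:allocations] - after[:allocations]
alloc_reduction = (100.0 * saved / before[:allocations]).round(1)

# Relative change in iterations per second.
speedup = (100.0 * (after[:ips] / before[:ips] - 1)).round(1)

puts "#{saved} fewer objects allocated (#{alloc_reduction}% reduction)"
puts "#{speedup}% more iterations per second"
```

On these figures that is roughly a 27% drop in allocations for about a 6% gain in throughput.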

-- benchmark.rb

    $ cat benchmark.rb
    #!/bin/env ruby

    $LOAD_PATH << "lib"
    require "pdf/reader"
    require "benchmark/ips"

    def allocations
      x = GC.stat(:total_allocated_objects)
      yield
      GC.stat(:total_allocated_objects) - x
    end

    def go
      doc = PDF::Reader.new(File.join(File.dirname(__FILE__), "spec/data/cairo-unicode.pdf"))
      doc.pages.each do |page|
        page.text #extract the text but do nothing with it
      end
    end

    Benchmark.ips { |x|
      x.config(:time => 10, :warmup => 5)
      x.report {
        go
      }
    }
    p ALLOCATIONS: allocations { go }

-- allocations.rb

    $ cat allocations.rb
    #!/bin/env ruby

    $LOAD_PATH << "lib"
    require "pdf/reader"
    require "allocation_stats"

    FILENAME = File.join(File.dirname(__FILE__), "spec/data/cairo-unicode.pdf")

    def go
      doc = PDF::Reader.new(FILENAME)
      doc.pages.each do |page|
        page.text #extract the text but do nothing with it
      end
    end

    stats = AllocationStats.trace { go }
    puts stats.allocations(alias_paths: true).group_by(:sourcefile, :sourceline, :class).sort_by_size.to_text

[1] https://tenderlovemaking.com/2023/09/02/fast-tokenizers-with-stringscanner.html
yob committed Dec 25, 2023
1 parent ea4370c commit 803c1b6
Showing 1 changed file with 1 addition and 3 deletions.
4 changes: 1 addition & 3 deletions lib/pdf/reader/parser.rb
    @@ -173,9 +173,7 @@ def hex_string

           # add a missing digit if required, as required by the spec
           str << "0" unless str.size % 2 == 0
    -      str.chars.each_slice(2).map { |nibbles|
    -        nibbles.join("").hex.chr
    -      }.join.force_encoding("binary")
    +      [str].pack('H*')
         end
         ################################################################################
         # Reads a PDF String from the buffer and converts it to a Ruby String
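
The change can be tried outside the parser with a minimal sketch that runs the removed and the added expression side by side. The helper names `hex_to_binary_old`, `hex_to_binary_new` and `allocations` are hypothetical stand-ins (the real `hex_string` method first reads its tokens from the buffer), and the comparison assumes the input contains only hex digits; behaviour on other characters is not checked here.

```ruby
# Removed implementation: builds an array of characters, an array per pair,
# and several short strings per byte before joining everything back together.
def hex_to_binary_old(hex)
  str = hex.dup # copy so the caller's string is not modified
  str << "0" unless str.size % 2 == 0
  str.chars.each_slice(2).map { |nibbles|
    nibbles.join("").hex.chr
  }.join.force_encoding("binary")
end

# Added implementation: Array#pack with the 'H*' directive decodes the whole
# hex string (high nibble first) into a binary string in one call.
def hex_to_binary_new(hex)
  str = hex.dup
  str << "0" unless str.size % 2 == 0
  [str].pack('H*')
end

# Counts objects allocated while the block runs, in the same way as the
# allocations helper in benchmark.rb.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# An even-length string, an odd-length one that needs padding, and edge cases.
samples = ["48656c6c6f", "901FA", "0", ""]
samples.each do |hex|
  raise "mismatch for #{hex.inspect}" unless hex_to_binary_old(hex) == hex_to_binary_new(hex)
end

long_hex  = "48656c6c6f" * 100
old_count = allocations { hex_to_binary_old(long_hex) }
new_count = allocations { hex_to_binary_new(long_hex) }
puts "old: #{old_count} objects, new: #{new_count} objects"
```

The exact counts depend on the Ruby version. The old expression allocates several objects per byte, whereas the new one allocates a small, fixed number of objects however long the string is.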
