Reduce allocations when parsing hex strings
Running a script based on one shared by Aaron at [1], I noticed we allocate a
surprising number of objects when parsing hex strings.

The allocations.rb script (see below), when parsing a file with lots of hex
strings, shows the hex_string method as the top source of allocations. We can
fix that!

-- before

    $ ruby allocations.rb | head -n 10
    sourcefile                                     sourceline  class                              count
    ---------------------------------------------  ----------  ---------------------------------  -----
    <PWD>/lib/pdf/reader/parser.rb                 176         Array                              65246
    <PWD>/lib/pdf/reader/parser.rb                 176         String                             63124
    <PWD>/lib/pdf/reader/parser.rb                 177         String                             53500
    <PWD>/lib/pdf/reader/buffer.rb                 362         String                             41386
    <PWD>/lib/pdf/reader/buffer.rb                 384         String                             27386
    <PWD>/lib/pdf/reader/transformation_matrix.rb  20          Array                              19238
    <PWD>/lib/pdf/reader/page_state.rb             243         Array                              14846
    <PWD>/lib/pdf/reader/encoding.rb               143         Array                              14336

    $ ruby benchmark.rb
    ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
    Warming up --------------------------------------
                             1.000 i/100ms
    Calculating -------------------------------------
                             1.973 (± 0.0%) i/s -     20.000 in  10.135409s
    {:ALLOCATIONS=>772349}

-- after

    $ ruby allocations.rb | head -n 10
    sourcefile                                     sourceline  class                              count
    ---------------------------------------------  ----------  ---------------------------------  -----
    <PWD>/lib/pdf/reader/buffer.rb                 362         String                             41386
    <PWD>/lib/pdf/reader/buffer.rb                 384         String                             27386
    <PWD>/lib/pdf/reader/transformation_matrix.rb  20          Array                              19238
    <internal:pack>                                8           String                             17047
    <PWD>/lib/pdf/reader/page_state.rb             243         Array                              14846
    <PWD>/lib/pdf/reader/encoding.rb               143         Array                              14336
    <PWD>/lib/pdf/reader/page_state.rb             342         PDF::Reader::TransformationMatrix  10743
    <PWD>/lib/pdf/reader/transformation_matrix.rb  115         Array                              10641

    $ ruby benchmark.rb
    ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
    Warming up --------------------------------------
                             1.000 i/100ms
    Calculating -------------------------------------
                             2.097 (± 0.0%) i/s -     21.000 in  10.017634s
    {:ALLOCATIONS=>561561}

-- benchmark.rb

    $ cat benchmark.rb
    #!/bin/env ruby

    $LOAD_PATH << "lib"

    require "pdf/reader"
    require "benchmark/ips"

    def allocations
      x = GC.stat(:total_allocated_objects)
      yield
      GC.stat(:total_allocated_objects) - x
    end

    def go
      doc = PDF::Reader.new(File.join(File.dirname(__FILE__), "spec/data/cairo-unicode.pdf"))
      doc.pages.each do |page|
        page.text # extract the text but do nothing with it
      end
    end

    Benchmark.ips { |x|
      x.config(:time => 10, :warmup => 5)
      x.report { go }
    }

    p ALLOCATIONS: allocations { go }

-- allocations.rb

    $ cat allocations.rb
    #!/bin/env ruby

    $LOAD_PATH << "lib"

    require "pdf/reader"
    require "allocation_stats"

    FILENAME = File.join(File.dirname(__FILE__), "spec/data/cairo-unicode.pdf")

    def go
      doc = PDF::Reader.new(FILENAME)
      doc.pages.each do |page|
        page.text # extract the text but do nothing with it
      end
    end

    stats = AllocationStats.trace { go }
    puts stats.allocations(alias_paths: true).group_by(:sourcefile, :sourceline, :class).sort_by_size.to_text

[1] https://tenderlovemaking.com/2023/09/02/fast-tokenizers-with-stringscanner.html
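
For illustration only, here is a minimal sketch of the kind of change that can remove allocations like these. It assumes the hex digits end up being decoded with Array#pack, which the <internal:pack> row in the "after" output suggests but this message does not state outright. Both helper names are hypothetical and are not taken from pdf-reader. The allocations helper mirrors the one in benchmark.rb.

```ruby
# Hypothetical helpers for comparison; not the actual pdf-reader code.

# Slice-based decoding: allocates a one-character String per digit, an Array
# per pair of digits, and further Strings for each joined pair and each byte.
def hex_to_bytes_slices(hex)
  hex += "0" if hex.size.odd? # a missing final digit is treated as 0
  hex.chars.each_slice(2).map { |pair| pair.join.hex.chr }.join.force_encoding("binary")
end

# Pack-based decoding: the "H*" directive converts the whole run of hex
# digits (high nibble first) into a binary String in one call.
def hex_to_bytes_pack(hex)
  hex += "0" if hex.size.odd?
  [hex].pack("H*")
end

# Count objects allocated while the block runs.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

sample = "48656c6c6f2c20776f726c64" * 100
raise "outputs differ" unless hex_to_bytes_slices(sample) == hex_to_bytes_pack(sample)

puts "slices: #{allocations { hex_to_bytes_slices(sample) }} objects"
puts "pack:   #{allocations { hex_to_bytes_pack(sample) }} objects"
```

The exact counts depend on the Ruby version, but the slice-based version should allocate objects in proportion to the length of the input, while the pack-based version allocates only a handful per call.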