Optimize `Range#sample(n)` for integer ranges, and allow float ranges too #10310

asterite · 2021-01-27T14:20:58Z

In a Gitter discussion it was mentioned that this:

(0..1_000_000_000).sample(10)

is pretty slow, while in Python it's very fast.

I don't know how Python does it, but one way to do this, if the range is an integer range, is to call rand(range) (which is optimized to pick a number between the two bounds) repeatedly until we have collected n distinct values. Without this optimization, the default implementation ends up traversing the entire Enumerable.
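The idea described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `sample_distinct` is a hypothetical helper name, it only handles `Int32` ranges, and it uses a `Set` to reject repeated draws.

```crystal
# Hypothetical helper (not the PR's implementation): draws up to `n`
# distinct integers from `range` by calling `rand(range)` repeatedly.
def sample_distinct(range : Range(Int32, Int32), n : Int32, random = Random::DEFAULT) : Array(Int32)
  raise ArgumentError.new("Can't sample a negative number of elements") if n < 0

  # Never ask for more distinct values than the range holds,
  # otherwise the loop below could not terminate.
  n = Math.min(n, range.size)

  seen = Set(Int32).new
  result = Array(Int32).new(n)
  while result.size < n
    value = random.rand(range)
    # Set#add? returns true only when the value was not already present.
    result << value if seen.add?(value)
  end
  result
end

p sample_distinct(0..1_000_000_000, 10)
```

The cost depends on `n` rather than on the size of the range, which is why this is cheap for a huge range and a small `n`. When `n` approaches the range size, repeated draws become common and the loop slows down.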

A benchmark for the above gives, before this PR:

sample(n) 138.12m (  7.24s ) (± 0.00%)

With this PR:

sample(n)   7.02M (142.43ns) (± 3.11%)

So it goes down from 7 seconds to 142 nanoseconds :-)

Then we can do a similar thing for float ranges, which previously weren't supported by this functionality.

asterite · 2021-01-27T14:28:57Z

On a separate note, the same optimization could be done for a range of chars... after all we can convert a char to a number (via ord), then choose the numbers, then map them back to Char with chr.

That makes me think that this could actually work with anything that can be mapped to integers. In Haskell this is the Enum typeclass. Maybe we could have a similar "interface" in Crystal.

That said, the only things I can think about right now are integers and chars... well, enums can also be mapped to integers, though not always from 0 to n. So maybe it can be done exclusively for Char, if we really wanted that.
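As a rough sketch of the Char idea from this comment: map both bounds to code points with `ord`, pick distinct integers, then map them back with `chr`. The helper name `sample_chars` is hypothetical, and the sketch assumes the range does not span the surrogate code points (0xD800 to 0xDFFF), for which `chr` would raise.

```crystal
# Hypothetical helper: samples up to `n` distinct chars from `range`
# by working on their code points.
# Assumption: the range contains no surrogate code points.
def sample_chars(range : Range(Char, Char), n : Int32, random = Random::DEFAULT) : Array(Char)
  raise ArgumentError.new("Can't sample a negative number of elements") if n < 0

  first = range.begin.ord
  last = range.excludes_end? ? range.end.ord - 1 : range.end.ord
  return [] of Char if last < first

  # Clamp so the loop below always terminates.
  n = Math.min(n, last - first + 1)

  codes = Set(Int32).new
  while codes.size < n
    codes << random.rand(first..last)
  end

  # Map the chosen code points back to chars.
  codes.map(&.chr)
end

p sample_chars('a'..'z', 5)
```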

straight-shoota

Maybe it would be a good idea to extract the entire implementations for int and float ranges into separate private methods to reduce the complexity of Range#sample.

straight-shoota · 2021-01-27T14:44:42Z

src/range.cr

+
+        if min == max
+          [min]
+        elsif n <= 16


Maybe use Array::SMALL_ARRAY_SIZE?

It's actually Hash::MAX_INDICES_SIZE_LINEAR_SCAN / 2 but it's a private constant.

Regardless, the heuristic of using a small number (whatever that number is) is independent of the actual values used in other types.

Array::SMALL_ARRAY_SIZE is used for deciding between linear scan and hash lookup, so that looks pretty much like the same heuristic use case.

Okay... but it's a private constant.

Sure we can change it to be public + nodoc.

And here it's just a number literal.

Definitely agree to that. But the constant gives some context. Now it's just a magic number without any explanation.

What do you mean? There's a comment right below that condition.

Yeah, I mean without explanation of the specific value.

It's just the same number we use for equivalent situations. So IMO using a common constant would make sense. But I'm not going to press this any further.

This constant will never, ever change. So we can keep copying it over. The only problem would be if we later had to update this number everywhere. But this number has been proven to be the optimal one.

straight-shoota · 2021-01-27T14:52:24Z

Time and Time::Span can also be converted to numbers. However the full range is larger than Int64. But it could work easily for ranges where the difference between begin and end fits into Int64.

straight-shoota · 2021-01-27T15:28:42Z

There's a change in behaviour: for n == 0, Enumerable#sample returns an empty array no matter what, but Range#sample raises on an open range. It's not a very relevant edge case, but sample(n: 0) could be made to work for any range.

asterite · 2021-01-27T15:54:34Z

I think for any "empty" range, regardless of n being 0 or not, the method should raise.

I'll fix that later.

HertzDevil · 2021-01-27T18:05:21Z

How fast is the new version compared to the old one when n is O(size)?

Also it'd be interesting to know how #10271 compares to the following after this PR: (the memory consumption here is probably larger since a temporary is involved in map)

module Indexable(T)
  def sample(n, random)
    (0...size).sample(n, random).map { |i| unsafe_fetch(i) }
  end
end

asterite · 2021-01-27T18:20:08Z

@HertzDevil Good point.

I did this benchmark:

require "benchmark"

Benchmark.ips do |x|
  x.report("sample (old)") do
    (0..1_000).sample_old(999)
  end

  x.report("sample (new)") do
    (0..1_000).sample(999)
  end
end

Results:

sample (old)  89.52k ( 11.17µs) (± 5.17%)  3.94kB/op        fastest
sample (new)  11.07k ( 90.31µs) (± 2.67%)  16.0kB/op   8.08× slower

So we should probably use the old approach when n is close to size. I don't know exactly what condition to use, though. Maybe if n > size // 4.
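One way to picture the threshold suggested here is the sketch below. It is an assumption-laden illustration, not the PR's code: `hybrid_sample` is a hypothetical name, the dense branch uses a plain shuffle as a stand-in for the old full-traversal path, and `size // 4` is only the tentative cut-off floated in this comment.

```crystal
# Hypothetical helper: picks a strategy based on how large `n` is
# relative to the range size.
def hybrid_sample(range : Range(Int32, Int32), n : Int32, random = Random::DEFAULT) : Array(Int32)
  raise ArgumentError.new("Can't sample a negative number of elements") if n < 0

  size = range.size
  n = Math.min(n, size)

  if n > size // 4
    # Dense case: most draws would collide, so materialize the range once,
    # shuffle it, and keep the first n elements (stand-in for the old path).
    range.to_a.shuffle!(random).first(n)
  else
    # Sparse case: collisions are rare, so draw until n distinct values exist.
    seen = Set(Int32).new
    while seen.size < n
      seen << random.rand(range)
    end
    seen.to_a
  end
end

p hybrid_sample(0..1_000, 999).size
p hybrid_sample(0..1_000_000_000, 10)
```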

HertzDevil · 2021-06-10T09:27:12Z

Python uses reservoir sampling for large sample counts and small populations, and a temporary set otherwise. They have their own heuristic for this.
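For readers unfamiliar with the term, reservoir sampling in its simplest form (often called Algorithm R) can be sketched as below. This only illustrates the technique named in the comment; it is not taken from CPython or from Crystal's standard library, and `reservoir_sample` is a hypothetical name.

```crystal
# Hypothetical helper: Algorithm R. Traverses `values` once and keeps a
# uniformly random subset of at most `n` elements.
def reservoir_sample(values : Enumerable(T), n : Int32, random = Random::DEFAULT) : Array(T) forall T
  raise ArgumentError.new("Can't sample a negative number of elements") if n < 0

  reservoir = Array(T).new(n)
  values.each_with_index do |value, i|
    if i < n
      # Fill the reservoir with the first n elements.
      reservoir << value
    else
      # Replace a random slot with probability n / (i + 1).
      j = random.rand(i + 1)
      reservoir[j] = value if j < n
    end
  end
  reservoir
end

p reservoir_sample(1..1_000, 5)
```

Because it always walks the whole collection, this approach costs time proportional to the population size regardless of `n`, which matches the slowdown observed for very large ranges earlier in this thread.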

asterite · 2021-09-20T17:26:00Z

Closing because I don't have time to work on this PR.

beta-ziliani · 2021-09-20T18:27:11Z

We'll take it from here

asterite · 2022-09-27T10:45:40Z

> We'll take it from here

Or maybe not 😄

Let's close this for now.

asterite added 2 commits January 27, 2021 11:11

Range: optimize Range#sample(n) for integer ranges

8f81390

Range: enable Range#sample(n) to work for float ranges

0a8c512

asterite added performance topic:stdlib:numeric topic:stdlib:collection labels Jan 27, 2021

Oops, forgot to pass random

9b58b0f

straight-shoota reviewed Jan 27, 2021

asterite added 2 commits January 27, 2021 11:55

Simplify condition

7e4e402

Simplify and optimize range sampling

55f4c16

Yet another way to write a condition...

61f8839

...that's probably optimized by LLVM anyway

HertzDevil reviewed Jan 27, 2021

asterite added 4 commits January 27, 2021 15:25

Use the old algorithm when n >= size // 4

ca80782

Always raise on invalid range on sample(0)

0c765d8

Pass random to shuffle!

5a41eaa

Fix specs again

656b608

straight-shoota reviewed Sep 20, 2021

asterite closed this Sep 20, 2021

asterite deleted the opt/range-sample-for-int-and-float branch September 20, 2021 17:25

beta-ziliani restored the opt/range-sample-for-int-and-float branch September 20, 2021 18:26

beta-ziliani reopened this Sep 20, 2021

asterite closed this Sep 27, 2022

asterite deleted the opt/range-sample-for-int-and-float branch September 27, 2022 10:45

This was referenced Sep 27, 2022

Optimize Range#sample(n) #12531

Closed

Optimize Range#sample(n) for integers and floats #12535

Merged
