Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Range#sample(n) for integer ranges, and allow float ranges too #10310

Closed
wants to merge 10 commits into from

Conversation

asterite
Copy link
Member

On a gitter discussion it was mentioned that this:

(0..1_000_000_000).sample(10)

is pretty slow, while in Python it's very fast.

I don't know how Python does it, but one way to do that is, if the range is an integer range, then we can call rand(range) (which is optimized to pick a number between those two) n times until we get all the necessary values. Without this optimization the default implementation ends up traversing the entire Enumerable.

A benchmark for the above gives, before this PR:

sample(n) 138.12m (  7.24s ) (± 0.00%) 

With this PR:

sample(n)   7.02M (142.43ns) (± 3.11%) 

So it goes down from 7 seconds to 142 nanoseconds :-)

Then we can do a similar thing for float ranges, which previously weren't supported by this functionality.

@asterite
Copy link
Member Author

On a separate note, the same optimization could be done for a range of chars... after all we can convert a char to a number (via ord), then choose the numbers, then map them back to Char with chr.

That makes me thing that this could actually work with anything that could be mapped to integers. In Haskell this is the Enum typeclass. Maybe we could have a similar "interface" in Crystal.

That said, the only things I can think about right now are integers and chars... well, enums can also be mapped to integers, though not always from 0 to n. So maybe it can be done exclusively for Char, if we really wanted that.

Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be a good idea to extract the entire implementations for int and float ranges into separate private methods to reduce the complexity of Range#sample.

src/range.cr Outdated Show resolved Hide resolved
src/range.cr Outdated

if min == max
[min]
elsif n <= 16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use Array::SMALL_ARRAY_SIZE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually Hash::MAX_INDICES_SIZE_LINEAR_SCAN / 2 but it's a private constant.

Regardless, the heuristic of using a small number (whatever that number is) is independent of the actual values using in other types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array::SMALL_ARRAY_SIZE is used for deciding between linear scan and hash lookup, so that's looks pretty much like the same heuristic use case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay... but it's a private constant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we can change it to be public + nodoc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here it's just a number literal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree to that. But the constant gives some context. Now it's just a magic number without any explanation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? There's a comment right below that condition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I mean without explanation of the specific value.

It's just the same number we use for equivalent situations. So IMO using a common constant would make sense. But I'm not going to press this any further.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constant will never, ever change. So we can continue copying over. The problem would be if we later had to update this number everyone. But this number has been proven to be the optimal one.

src/range.cr Outdated Show resolved Hide resolved
@straight-shoota
Copy link
Member

Time and Time::Span can also be converted to numbers. However the full range is larger than Int64. But it could work easily for ranges where the difference between begin and end fits into Int64.

@straight-shoota
Copy link
Member

There's a change in behaviour: For n == 0, Enumerable#sample returns an empty array no matter what. But Range#sample raises on an open range. It's not very relevant edge case, but sample(n: 0) could be made to work for any range.

@asterite
Copy link
Member Author

I think for any "empty" range, regardless of n being 0 or not, the method should raise.

I'll fix that later.

...that's probably optimized by LLVM anyway
@HertzDevil
Copy link
Contributor

HertzDevil commented Jan 27, 2021

How fast is the new version compared to the old one when n is O(size)?

Also it'd be interesting to know how #10271 compares to the following after this PR: (the memory consumption here is probably larger since a temporary is involved in map)

module Indexable(T)
  def sample(n, random)
    (0...n).sample(n, random).map { |i| unsafe_fetch(i) }
  end
end

src/range.cr Outdated Show resolved Hide resolved
@asterite
Copy link
Member Author

@HertzDevil Good point.

I did this benchmark:

require "benchmark"

Benchmark.ips do |x|
  x.report("sample (old)") do
    (0..1_000).sample_old(999)
  end

  x.report("sample (new)") do
    (0..1_000).sample(999)
  end
end

Results:

sample (old)  89.52k ( 11.17µs) (± 5.17%)  3.94kB/op        fastest
sample (new)  11.07k ( 90.31µs) (± 2.67%)  16.0kB/op   8.08× slower

So we should probably use the old approach when n is close to size. I don't know exactly what condition to use, though. Maybe if n > size // 4.

@HertzDevil
Copy link
Contributor

HertzDevil commented Jun 10, 2021

Python uses reservoir sampling for large sample counts and small populations, and a temporary set otherwise. They have their own heuristic for this.

src/range.cr Show resolved Hide resolved
src/range.cr Show resolved Hide resolved
src/range.cr Show resolved Hide resolved
@asterite asterite closed this Sep 20, 2021
@asterite asterite deleted the opt/range-sample-for-int-and-float branch September 20, 2021 17:25
@asterite
Copy link
Member Author

Closing because I don't have time to work on this PR.

@beta-ziliani beta-ziliani restored the opt/range-sample-for-int-and-float branch September 20, 2021 18:26
@beta-ziliani
Copy link
Member

We'll take it from here

@beta-ziliani beta-ziliani reopened this Sep 20, 2021
@asterite
Copy link
Member Author

We'll take it from here

Or maybe not 😄

Let's close this for now.

@asterite asterite closed this Sep 27, 2022
@asterite asterite deleted the opt/range-sample-for-int-and-float branch September 27, 2022 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants