-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Range#sample(n)
for integer ranges, and allow float ranges too
#10310
Conversation
On a separate note, the same optimization could be done for a range of chars... after all we can convert a char to a number (via That makes me thing that this could actually work with anything that could be mapped to integers. In Haskell this is the That said, the only things I can think about right now are integers and chars... well, enums can also be mapped to integers, though not always from 0 to n. So maybe it can be done exclusively for Char, if we really wanted that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be a good idea to extract the entire implementations for int and float ranges into separate private methods to reduce the complexity of Range#sample
.
src/range.cr
Outdated
|
||
if min == max | ||
[min] | ||
elsif n <= 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use Array::SMALL_ARRAY_SIZE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually Hash::MAX_INDICES_SIZE_LINEAR_SCAN / 2
but it's a private constant.
Regardless, the heuristic of using a small number (whatever that number is) is independent of the actual values using in other types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array::SMALL_ARRAY_SIZE
is used for deciding between linear scan and hash lookup, so that's looks pretty much like the same heuristic use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay... but it's a private constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure we can change it to be public + nodoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here it's just a number literal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely agree to that. But the constant gives some context. Now it's just a magic number without any explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? There's a comment right below that condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I mean without explanation of the specific value.
It's just the same number we use for equivalent situations. So IMO using a common constant would make sense. But I'm not going to press this any further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constant will never, ever change. So we can continue copying over. The problem would be if we later had to update this number everyone. But this number has been proven to be the optimal one.
|
There's a change in behaviour: For |
I think for any "empty" range, regardless of I'll fix that later. |
...that's probably optimized by LLVM anyway
How fast is the new version compared to the old one when Also it'd be interesting to know how #10271 compares to the following after this PR: (the memory consumption here is probably larger since a temporary is involved in module Indexable(T)
def sample(n, random)
(0...n).sample(n, random).map { |i| unsafe_fetch(i) }
end
end |
@HertzDevil Good point. I did this benchmark: require "benchmark"
Benchmark.ips do |x|
x.report("sample (old)") do
(0..1_000).sample_old(999)
end
x.report("sample (new)") do
(0..1_000).sample(999)
end
end Results:
So we should probably use the old approach when |
Python uses reservoir sampling for large sample counts and small populations, and a temporary set otherwise. They have their own heuristic for this. |
Closing because I don't have time to work on this PR. |
We'll take it from here |
Or maybe not 😄 Let's close this for now. |
On a gitter discussion it was mentioned that this:
is pretty slow, while in Python it's very fast.
I don't know how Python does it, but one way to do that is, if the range is an integer range, then we can call
rand(range)
(which is optimized to pick a number between those two)n
times until we get all the necessary values. Without this optimization the default implementation ends up traversing the entireEnumerable
.A benchmark for the above gives, before this PR:
With this PR:
So it goes down from 7 seconds to 142 nanoseconds :-)
Then we can do a similar thing for float ranges, which previously weren't supported by this functionality.