Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move sampling methods to Enumerable #10129

Merged

Conversation

HertzDevil
Copy link
Contributor

Resolves #10118. There are two caveats here:

  • Since Indexable's optimization calls the RNG exactly once, it will produce different results than types that aren't Indexable, most notably the collections' own external iterators:
    ary = [1, 2, 3, 4, 5]
    ary.sample(Random.new(1))      # => 3
    ary.each.sample(Random.new(1)) # => 1
  • Technically Range(Float, Float)#sample(random) doesn't mean the same thing as the other sampling methods, because in this case it represents a continuous uniform distribution rather than a discrete collection of items. I allowed it anyway for consistency with Range(Int, Int). The #sample(n, random) overload still won't work as it requires Float#succ (c.f. Float-based ranges #9339).

src/range.cr Outdated Show resolved Hide resolved
Co-authored-by: Johannes Müller <johannes.mueller@smj-fulda.org>
@asterite
Copy link
Member

asterite commented Jan 6, 2021

Oh, needs a conflict resolution, but after that I'll merge it. Thank you!

@asterite asterite added this to the 1.0.0 milestone Jan 11, 2021
Copy link
Member

@bcardiff bcardiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I would suggest to mention somewhere in docs about reservoir sampling. Currently I need to get into the issue discussion to understand why the equal probability statement holds.

@asterite
Copy link
Member

asterite commented Jan 12, 2021

@bcardiff Do you mean add a doc comment, or just an internal comment about reservoir sampling? The old code was also using it but it wasn't mentioned. I would mention it as an internal comment in case it needs to be updated, but I wouldn't expose this fact. You don't need to know this to use the method; if you are curious you can dig into the code; there are no alternatives; and reading "reservoir sampling" might confuse users more than help them (for instance, Ruby docs never mention algorithms).

@bcardiff
Copy link
Member

I was thinking of internal docs. I missed that the same algorithm was used already.

@HertzDevil
Copy link
Contributor Author

Thinking if we could implement Algorithm L on Indexable: https://carc.in/#/r/a975

That algorithm seems to require that Random.rand return a sample in (0, 1) instead of [0, 1). I don't know what would happen when it actually returns zero...

@oprypin
Copy link
Member

oprypin commented Jan 12, 2021

If you're worried about it returning zero, then there's always the strategy of just trying again if it gives you zero. It's definitely correct and not a bad strategy overall, it's already being used elsewhere.

@asterite
Copy link
Member

Maybe the optimization can be undone and moved to a separate PR? Because now this needs 2 reviews again.

@straight-shoota
Copy link
Member

Thank you

@straight-shoota straight-shoota merged commit a7d8ce5 into crystal-lang:master Jan 13, 2021
@HertzDevil HertzDevil deleted the feature/enumerable-sample branch January 13, 2021 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] Sampling in any Enumerable
5 participants