Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch sample_single int to O'neill's modulo method & update gen_rang… #1154

Closed

Conversation

TheIronBorn
Copy link
Collaborator

Fixes #1145 by using the modulo-widening-multiplication method while shortcutting modulo in many cases. Specifically the two shortcuts proposed by O'Neill here.

See this chart for 16-bit rejection/acceptance probabilities and modulo calculation chance:
image

At some ranges below one third the full size the rejection probabilities are equal and so the modulo may slow things down but the fewer rejections of other low ranges more than make up for it. Note that above one-third of the range size no modulo is ever performed and the rejection chances are equal. Note also that we use 32-bits internally for 8/16 bits so this chart isn't comparable for those.

This is around 2.5x faster than the previous method for lower ranges and about 0.92x slower for high ranges, discounting 8/16 bits. It seems the the extra branches slow things down slightly when using the same rejection chance. The 128-bit modulo also seems to slow things down a bit. For 8/16 bits the high ranges are not so affected due to the 32-bit internals.

Benchmarks:

 name       lz_apx ns/iter          shortcut_modulo ns/iter   diff ns/iter   diff %  speedup
-i128_high  2,365,025 (676 MB/s)    2,889,420 (553 MB/s)           524,395   22.17%   x 0.82
+i128_low   1,771,940 (902 MB/s)    1,297,330 (1233 MB/s)         -474,610  -26.78%   x 1.37
+i16_high   683,815 (292 MB/s)      281,753 (709 MB/s)            -402,062  -58.80%   x 2.43
+i16_low    676,730 (295 MB/s)      253,068 (790 MB/s)            -423,662  -62.60%   x 2.67
+i32_high   1,305,030 (306 MB/s)    1,302,990 (306 MB/s)            -2,040   -0.16%   x 1.00
+i32_low    839,545 (476 MB/s)      238,871 (1674 MB/s)           -600,674  -71.55%   x 3.51
-i64_high   1,327,230 (602 MB/s)    1,414,850 (565 MB/s)            87,620    6.60%   x 0.94
+i64_low    849,788 (941 MB/s)      242,690 (3296 MB/s)           -607,098  -71.44%   x 3.50
+i8_high    675,245 (148 MB/s)      241,050 (414 MB/s)            -434,195  -64.30%   x 2.80
+i8_low     675,745 (147 MB/s)      236,577 (422 MB/s)            -439,168  -64.99%   x 2.86

I also changed the gen_range_high benchmarks to use a more sensible region of the integer, just above half where both rejection chances are highest. The old ones didn't seem to have any pattern to them. For 8/16 bits we could try to use ranges even closer to the 8/16 bit max for more rejections thanks to 32-bit internals.

This breaks value stability of course. I wasn't sure if I should update those tests with new values.

And here's a Criterion benchmark as well:
image

…e_high benchmarks

doc shortcutting sample_single int method

mention which methods
@vks
Copy link
Collaborator

vks commented Jul 28, 2021

This breaks the value-stability tests, so they would need to be updated.

The performance gains look good! How does it compare to just using Uniform?

@newpavlov
Copy link
Member

Broken value stability means that we should release it only in rand v0.9, no? It looks like we don't have an explicit value stability policy yet, so we probably should add it as well.

@vks
Copy link
Collaborator

vks commented Jul 28, 2021

@newpavlov Here is our policy. I don't quite understand what is meant by "portable item", but it sounds like this PR would be considered a breaking change.

@newpavlov
Copy link
Member

@vks
Ah, yes. I thought I've seen it somewhere, but didn't get as far as the book. We probably should mention it in the rand's README.

I don't quite understand what is meant by "portable item"

I think it was an attempt to introduce two classes of deterministic algorithms: value stable and not. For example, into the latter class probably should fall algorithms working with floats, since we can not provide bit-level value stability for them. Floats being floats and all. But it's not clear where exactly we define class for an algorithm (maybe we should introduce yet another marker trait?).

@dhardy
Copy link
Member

dhardy commented Jul 28, 2021

I don't quite understand what is meant by "portable item"

It's not well worded I agree. Possibly "item" should be replaced with "method".

But, yes, this would be considered a breaking change, and yes the test results need to be updated.

@newpavlov see the last section on that page:

... However, we strive to make them as portable as possible.

@dhardy dhardy added the B-value Breakage: changes output values label Jul 28, 2021
@TheIronBorn
Copy link
Collaborator Author

Compared to just Uniform

 name       Uniform ns/iter       gen_range ns/iter    diff ns/iter   diff %  speedup
+i128_high  6,775,080 (236 MB/s)  3,057,767 (523 MB/s)   -3,717,313  -54.87%   x 2.22
+i128_low   5,291,480 (302 MB/s)  1,287,980 (1242 MB/s)  -4,003,500  -75.66%   x 4.11
+i16_high   691,000 (289 MB/s)    286,856 (697 MB/s)       -404,144  -58.49%   x 2.41
+i16_low    701,163 (285 MB/s)    262,461 (762 MB/s)       -438,702  -62.57%   x 2.67
+i32_high   1,748,010 (228 MB/s)  1,390,462 (287 MB/s)     -357,548  -20.45%   x 1.26
+i32_low    674,240 (593 MB/s)    227,620 (1757 MB/s)      -446,620  -66.24%   x 2.96
+i64_high   2,415,940 (331 MB/s)  1,478,100 (541 MB/s)     -937,840  -38.82%   x 1.63
+i64_low    1,245,925 (642 MB/s)  255,013 (3137 MB/s)      -990,912  -79.53%   x 4.89
+i8_high    703,913 (142 MB/s)    247,025 (404 MB/s)       -456,888  -64.91%   x 2.85
+i8_low     713,518 (140 MB/s)    244,256 (409 MB/s)       -469,262  -65.77%   x 2.92

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me. Thanks @TheIronBorn.

We should hold off on merging until we're preparing to release 0.9 (or 1.0 as the case may be).

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Ignore]

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This modifies sample_single_inclusive but not sample.

dhardy added a commit that referenced this pull request Oct 20, 2021
@vks
Copy link
Collaborator

vks commented Feb 22, 2022

@dhardy Can we fix sample and merge this, or should we go for #1196 instead?

@dhardy
Copy link
Member

dhardy commented Feb 22, 2022

I'm (still) working on #1196, which will replace this. I'd like to leave it open for now as a reference (though I think we won't end up using this code for any of the variants).

dhardy added a commit that referenced this pull request Feb 24, 2022
dhardy added a commit that referenced this pull request Feb 24, 2022
dhardy added a commit that referenced this pull request Feb 28, 2022
@dhardy dhardy closed this Mar 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-value Breakage: changes output values
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rng::gen_range() is slowest with power-of-2 input sizes?
4 participants