Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make arbitrary generate full range of integers #240

Closed
wants to merge 16 commits into from

Conversation

maxbla
Copy link
Contributor

@maxbla maxbla commented Aug 19, 2019

Aims to fix #119, #190 and #233. Changes the implementation of Arbitrary for numbers to ignore the size of gen, and instead pick uniformly from the entire range of the type.

Todo:

  • change behavior
    • integer types
    • floating-point types
  • add tests
  • explain new meaning of Gen's size() in documentation

@maxbla
Copy link
Contributor Author

maxbla commented Aug 19, 2019

For the purpose of semantic versioning, this PR (or any like it) should be considered a breaking change, as code may have (unwittingly) depended on the fact that integers were in -100..=100.

src/arbitrary.rs Outdated Show resolved Hide resolved
@bluss
Copy link
Contributor

bluss commented Sep 16, 2019

as code may have (unwittingly) depended on the fact that integers were in -100..=100.

There is also code wittingly using usize::arbitrary() to size data structures.

@maxbla
Copy link
Contributor Author

maxbla commented Sep 18, 2019

@bluss yep, intentional dependence on $integer_type::arbitrary() to limit the range to gen::size() is a real backwards compatibility problem for this PR (which must consequently come with a minor semver increment).

This PR adds one workaround (for now) in this crate, for a type that clearly needs it. std::time::Duration's Arbitrary implementation depends on u64 and u32's Arbitrary. the workaround is just _ % gen.size() (see commit abdd617) (also I think it should be using gen.gen_range() instead).

There is another category of types, including std::net::Ipv4Addr, where I'm not sure what the behavior should be -- what is the size of an ipv4 address? I think its not entirely clear and I'm leaning towards IpAddr ignoring gen::size() completely. note: the previous behavior was to respect gen:size() for each individual u8 that goes into the constructor of the ipv4 (or u16/ipv6 address), so 101.101.101.101 through 255.255.255.255 would not be selected with the default gen::size() of 100.

A third category is for types that clearly should ignore gen::size(), including all integer types, when they are called directly.

For crates that define a type, implement Arbitrary on that type, and depend on integer types' implementation of Arbitrary to size their own type, the situation is not great. Those crates will have to make a decision to either use gen.gen_range(0, gen.size()) (or equivalent) or use the new behavior by default (probably bumping their semver).

gen::size() is necessary is for types whose memory or CPU usage is dependent on gen::size() (like vec) and it is a good api for types that are overwhelmingly used for small values (Duration). Any types outside these problem categories could be free to disregard gen::size(). This is a substantial change, requiring documentation and possibly marker traits.

If you want to show me some code I'm breaking to give me a little perspective, I'm all ears.

@bluss
Copy link
Contributor

bluss commented Sep 18, 2019

Good point about the _ % gen.size() workaround, although the renewed focus on problem values sounds like it is a problem for this (size is weirdly distributed then).

I have no special reason to share this petgraph code (that should be trivial to port), but it uses usize::arbitrary in a now-misguided way, and can easily migrate to using .gen_range().

Though I think that in the long run, I would prefer if users of quickcheck used quickcheck as the only source of randomness and randomness API (maybe it's called arbitrariness for quickcheck). Reexporting rand is a valid way to do that, but that's just one of the ways.

To me it would make sense that this is more quickcheck specific. For example, .gen_size_range() gives you an *arbitrary size compatible with current limits. Then we have quickcheck-controlled sizes and don't have to hardcode them as uniformly distributed, for example.

@maxbla
Copy link
Contributor Author

maxbla commented Sep 19, 2019

Yeah, _ % gen.size() isn't perfect because of special values. It also isn't great because it can mess with the distribution. For example with u8 and when gen.size() is 100, each number in 0..55 has a greater chance to be picked than one in 56..99 (3/255 vs 2/255). Of course, this difference approaches zero as the range of the underlying type is much bigger than gen.size().

I agree that third party crates should be able to depend only on Quickcheck for randomness (at least in their respective implementations of Arbitrary) (especially since we're considering dropping rand #241 ), but I'm unsure of the best way to achieve that goal.

It would make sense to wrap gen.gen_range() in some way, but that way can't involve adding requirements to Arbitrary (like RngCore). I could see there being another trait like RawArbitrary that generates arbitrary values without special values and requires explicit sizing, but this feels a bit over-engineered (and more difficult to maintain). Is it possible that the extra weirdness from special/problem values doesn't really matter and we can just ignore it, suggesting people use the _ % gen.size() trick (or similar) to fit their needs?

@maxbla
Copy link
Contributor Author

maxbla commented Sep 19, 2019

While formulating a previous response, I looked through the types that Arbitrary is implemented on in this crate and realized I might have some issues

type issue solution
f32/f64 still respects gen.size() future PR to address desired behavior of floats
Range<T>/Bound<T> Not sure if it should respect gen::size() for integer types wait and think
SocketAddrV6 I don't know what flowinfo or scope_id are in constructor. Is it legal for them to take any value? seems like yes, but unsure wait for review

This ensures the distribution will be uniform, rather than just close
to uniform (because of problem values).
gen_range is exclusive of the max value in the range, but gen() includes
all values in the range
@maxbla
Copy link
Contributor Author

maxbla commented Oct 3, 2019

@BurntSushi and/or @bluss I would appreciate guidance on what should happen to Range<T>/Bound<T> (when T is some integer type)

The basic options are

  • make them not respect gen::size() (current behavior, easy)
  • make them respect gen::size()
    • not sure the best way to accomplish this - the % gen.size() trick won't work unless I add additional trait bound of std::ops::Rem to Range<T>. Most uses of Range are integer types, so this probably wouldn't break much code in the wild, but it is hacky and technically backwards incompatible (e.g. Range<Ipv4Addr> would no longer have an arbitrary implementation)
    • create an internal api that always respects gen.size(), involving lots of macros

@maxbla maxbla marked this pull request as ready for review December 7, 2019 07:52
range.contains is unstable in rust 1.34, which is supported by this
library.
Copy link
Owner

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this PR and the thoughtful attention to detail. :-)

So firstly, I do kind of feel like this is the right direction to head in. It's definitely a breaking change, so this will require a semver breaking change release. But I'm fine with that.

With respect to Range, it would seem like it should just delegate to its component types Arbitrary impl, like it does today, no? Could you say more about why you were thinking differently? Same for SocketAddrV6.

Floats, on the other hand, should probably be consistent with integers. Is there any way you might be willing to include that in this PR? I'd rather not merge it piecemeal since I'd rather do both changes in a single release. It would be a little crazy to ignore size for integers but not for floats!

@bluss What do you think? Are you okay with this general direction?

src/arbitrary.rs Outdated Show resolved Hide resolved
@maxbla
Copy link
Contributor Author

maxbla commented Jan 11, 2020

For Range<T> My thinking was that since it was often used as the index for a slice, it might break code that contains something like array[Arbitrary::<Range<usize>>()], and there would be no perfect way for a user to fix that (and still have a uniform distribution). I think relying on the underlying type makes the most sense, but I understand how this might be a pain point for some users.

For SocketAddrV6 I don't know what range of values it is allowed to take (if all values are valid, that's good, b/c that's how this PR works right now)

@BurntSushi
Copy link
Owner

Interesting. I personally find array[Arbitrary::<Range<usize>>()] to be a bit weird. Namely, it seems just as weird as array[Arbitrary::<usize>()], which would similarly fail to work. I think if we're okay squashing the latter, then we should be okay squashing the former. (I'd really expect someone to write array[Arbitrary::<usize>() % array.len()] instead I think. And one could do the same with ranges with a helper function.

@BurntSushi
Copy link
Owner

As for SocketAddrV6, I don't see any documented restrictions, so it seems fine to me!

@maxbla
Copy link
Contributor Author

maxbla commented Jan 11, 2020

It would be a little crazy to ignore size for integers but not for floats!

Yeah, it would. I'll take a crack at it tomorrow. It seems like a fairly quick fix.

@maxbla
Copy link
Contributor Author

maxbla commented Jan 12, 2020

It seems like a fairly quick fix.

I take it back.

I experimented a bit, but I think to match user expectations, the distribution needs to generate numbers in
..., (-100, -10), (-10, -1), (-1, 0), (0,1), (1,10), (10,100), ...
with equal probability. One distribution with this property is the log-uniform distribution. I couldn't find a crate with this distribution, but dependencies are bad and it was easy to fudge.

@maxbla
Copy link
Contributor Author

maxbla commented Jan 12, 2020

At this point, I have addressed all the review I received so far.

@maxbla
Copy link
Contributor Author

maxbla commented Jan 13, 2020

This PR now fixes #27

@BurntSushi BurntSushi mentioned this pull request Jan 22, 2020
@maxbla
Copy link
Contributor Author

maxbla commented Mar 24, 2020

Are you waiting for me to resolve merge conflicts with a merge commit or rebase?

Or is this waiting on @bluss? Whether it is or not, I think the float code could do with review.

@bluss
Copy link
Contributor

bluss commented Apr 6, 2020

Please don't wait for me :) Thanks for working on good solutions

@maxbla
Copy link
Contributor Author

maxbla commented Apr 13, 2020

@BurntSushi This PR is ready to be merged. I'm curious what you think of the log-uniform distribution for floats.

@cognivore
Copy link

cognivore commented Jul 14, 2020

Is there a published fork with this bug fixed? It's a deal-breaker.

In the meantime, to those looking for a workaround, I advice to do something like this:

impl Arbitrary for Pos {                      
  fn arbitrary<G : Gen>(g : &mut G) -> Pos {  
    let workaround_seed : u8 = random();      
    if workaround_seed % 10 == 1 {            
      Pos(i8::MAX, i8::MAX)                   
    } else if workaround_seed % 10 == 0 {     
      Pos(i8::MIN, i8::MIN)                   
    } else {                                  
      Pos(i8::arbitrary(g), i8::arbitrary(g)) 
    }                                         
  }                                           
}                                             

@maxbla
Copy link
Contributor Author

maxbla commented Jul 14, 2020

@cognivore this PR is being made from my personal fork, so you could use that for now, but I don't advise it. I have no plans to maintain a fork of quickcheck and I will probably delete my fork soon after this PR is merged (but may create a new fork to make additional changes).

Regarding your workaround, if you're using quickcheck internally i.e. not exposing a struct that implements Arbitrary in the public API of a library, the following is nicer workaround in my opinion. Be careful not to use this workaround if your test function takes a Vec<T> as a parameter, as arbitrary will then generate the maximum length Vecs, eating all your memory.

let gen = StdThreadGen::new(usize::max_value());
let mut qc = QuickCheck::with_gen(gen);
qc.quickcheck(test_function as fn(u8, u8) -> TestResult);

@emlun
Copy link

emlun commented Sep 13, 2020

Any movement on this? I'd be happy to help get this merged if need be.

@aldanor
Copy link

aldanor commented Nov 1, 2020

Also interested in what's the status of this, is anything blocking merging it in? (3 out of 10 open issues are also directly related to this)

Copy link
Owner

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, this LGTM. I'll merge this via a rollup PR soon.

@BurntSushi
Copy link
Owner

All right, I've managed to bring this into my rollup branch. Note that for the future, when submitting reasonably small patches like this, please keep them to a single commit and force push. Having 16 commits in this branch tracking every winding change along with merges of master into the branch made it a real pain to squash it down into one commit. With that said, thank you for sticking with this PR over such a long time period! Apologies for taking a long time to merge it. QuickCheck isn't high on my list of crates to maintain.

@@ -1039,8 +1056,8 @@ impl Arbitrary for RangeFull {

impl Arbitrary for Duration {
fn arbitrary<G: Gen>(gen: &mut G) -> Self {
let seconds = u64::arbitrary(gen);
let nanoseconds = u32::arbitrary(gen) % 1_000_000;
let seconds = gen.gen_range(0, gen.size() as u64);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why did you change this to use gen.size()? It was implemented this way before, but it seems like we should probably open up the range of durations generated too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was an oversight and based on previous discussion, I think the criterion for whether a type respects gen.size() should be whether there are practical (memory, speed or some other resource) limits which make ignoring gen.size() impractical. Based on this criterion, Duration should not respect gen.size(). If you agree with this criterion, I could add it to the documentation for Arbitrary.

BurntSushi pushed a commit that referenced this pull request Dec 27, 2020
This commit tweaks the Arbitrary impls of number types (integers,
floats) to use the full range with a small bias toward "problem" values.
This is a change from prior behavior that would use the `size` parameter
to control the range of integers.

In retrospect, using the `size` parameter this way was probably
misguided. Instead, it should only be used to control the sizes of data
structures instead of also constraining numeric ranges. By constraining
numeric ranges, we leave out a huge space of values that are never
tested.

Fixes #27, Fixes #119, Fixes #190, Fixes #233, Closes #240
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The random selection doesn't appear to be very random
6 participants