
depr(python): Deprecate allow_infinities and null_probability args to parametric test strategies #16183

Merged · 3 commits into main · May 13, 2024
Conversation

@stinodego (Member) commented May 13, 2024

Changes

  • Add **kwargs to series and dataframes, which forward the kwargs to the underlying data generation strategies.
  • Deprecate allow_infinities. Users should use allow_infinity instead, which is passed through to the float strategy.
    • This is not very discoverable yet, but adding every possible configuration for every data type to the top-level strategies is not sustainable. I am considering making the underlying data generation strategies public.
  • Deprecate null_probability. Users should not be micromanaging probabilities - hypothesis will find appropriate examples. Replaced by an allow_null boolean arg (see the sketch after this list).
  • Update the categories strategy to sample from a fixed set of categories, instead of generating free text.
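A minimal sketch of the post-deprecation usage (argument names follow this PR's description; the exact signatures are an assumption and may differ):

```python
from hypothesis import given
import polars as pl
from polars.testing.parametric import series


# allow_null replaces null_probability; dtype-specific options such as
# allow_infinity are forwarded to the underlying float strategy via **kwargs.
@given(s=series(dtype=pl.Float64, allow_null=False, allow_infinity=False))
def test_no_nulls_or_infinities(s: pl.Series) -> None:
    assert s.null_count() == 0
    assert not s.is_infinite().any()
```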

@github-actions github-actions bot added deprecation Add a deprecation warning to outdated functionality python Related to Python Polars labels May 13, 2024
@stinodego stinodego marked this pull request as ready for review May 13, 2024 08:25
codecov bot commented May 13, 2024

Codecov Report

Attention: Patch coverage is 82.00000%, with 9 lines in your changes missing coverage. Please review.

Project coverage is 80.98%. Comparing base (dbfc6b2) to head (b639211).
Report is 2 commits behind head on main.

Files Patch % Lines
...olars/polars/testing/parametric/strategies/core.py 70.96% 5 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16183      +/-   ##
==========================================
- Coverage   80.99%   80.98%   -0.01%     
==========================================
  Files        1392     1392              
  Lines      178884   178937      +53     
  Branches     2893     2897       +4     
==========================================
+ Hits       144884   144911      +27     
- Misses      33504    33522      +18     
- Partials      496      504       +8     


@stinodego stinodego merged commit 77aaec3 into main May 13, 2024
17 checks passed
@stinodego stinodego deleted the hyp-strats-2 branch May 13, 2024 08:46
@alexander-beedie (Collaborator) commented May 13, 2024

  • Deprecate null_probability. Users should not be micromanaging probabilities - hypothesis will find appropriate examples. Replaced by allow_null boolean arg.

@stinodego: I'd be slightly tempted to allow both True/False and a float probability here, since another use case for these strategies is feeding synthetic dataframes into third-party code that relies on Polars (i.e. these strategies aren't just for testing Polars' own code; they also let libraries that use Polars generate frames to test their code).

This is probably also an argument to integrate some sort of dedicated synthetic data generation facility as a separate consideration, of course; while there is overlap between the two use-cases I certainly wouldn't argue that they are the same thing! (Might have to go take a look at the synthetic data landscape now, hmmm 🤔)

@stinodego (Member, Author) commented:
One of the key 'insights' that motivated this rewrite is that generating data for parametric tests and generating generic synthetic data are two very different things.

Hypothesis intentionally 'skews' the data in all kinds of ways to try to produce failures. You could have 1 billion rows with a 0.99 null probability, and hypothesis would still produce a column without any nulls in it. So that number doesn't really mean anything in the context of parametric testing.

You can see the same in hypothesis' own API: for example, their floats strategy has allow_nan as a bool, not a nan_probability.
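For illustration, a boolean toggle in hypothesis's floats strategy looks like this (allow_nan and allow_infinity are the parameters being referred to; there is no probability knob):

```python
from hypothesis import strategies as st

# hypothesis exposes boolean switches rather than probabilities:
finite_floats = st.floats(allow_nan=False, allow_infinity=False)
```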

Synthetic data generation can be really useful, but hypothesis should not be used for that. It's really not suitable for that purpose, as far as I've seen.

@alexander-beedie (Collaborator) commented:
Synthetic data generation can be really useful, but hypothesis should not be used for that. It's really not suitable for that purpose, as far as I've seen.

Indeed; that was also my eventual conclusion above. I just need to rush to find a replacement for doing it now, heh 😂

@c-peters c-peters added the accepted Ready for implementation label May 21, 2024
Wouittone pushed a commit to Wouittone/polars that referenced this pull request Jun 22, 2024