test(python): Parametric test coverage for EWM functions #5011

Merged
merged 1 commit into from Sep 30, 2022

Conversation

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Sep 28, 2022

Massive expansion of EWM test coverage, using pandas (with ignore_na=True) as a reference implementation for comparison purposes.

Parametric tests cover the following (a rough sketch of the overall test shape is included after the list):

  • use of all three decay params: com, span, half_life.
  • floating point values between -1e8 and 1e8.
  • null values present / not present.
  • different min_period values.
  • chunked / unchunked series.
  • series of different lengths.
  • int and float series.
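
For illustration, a rough sketch of the overall shape these tests take (the strategy bounds and test name here are illustrative, not the actual test code):

    from hypothesis import given, strategies as st
    import polars as pl

    @given(
        values=st.lists(
            st.floats(min_value=-1e8, max_value=1e8, allow_nan=False),
            min_size=2, max_size=10,
        ),
        com=st.floats(min_value=0.5, max_value=10.0),
        min_periods=st.integers(min_value=1, max_value=3),
    )
    def test_ewm_mean_vs_pandas(values, com, min_periods):
        # compare against pandas (ignore_na=True) as the reference implementation
        s = pl.Series(values)
        result = s.ewm_mean(com=com, min_periods=min_periods)
        expected = s.to_pandas().ewm(com=com, min_periods=min_periods, ignore_na=True).mean()
        for got, ref in zip(result, expected):
            if got is None:
                continue  # below min_periods: polars yields null, pandas yields NaN
            assert abs(got - ref) <= 1e-6 * max(1.0, abs(ref))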

@github-actions github-actions bot added the python Related to Python Polars label Sep 28, 2022
@ritchie46
Member

Wow! That's a lot of tests!

I think we should really utilize this for the operations that are sensitive to breaking (because I optimize them a lot, and create new fast paths). Here I'm thinking of combinations of groupby, sort, joins, extend, append (without rechunk). I think those operation combinations are most sensitive to edge cases.

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 28, 2022

Wow! That's a lot of tests!

The joy of parametric testing; you get a lot of bang for your buck ;)

I think we should really utilize this for the operations that are sensitive to breaking (because I optimize them a lot, and create new fast paths). Here I'm thinking of combinations of groupby, sort, joins, extend, append (without rechunk). I think those operation combinations are most sensitive to edge cases.

Definitely; in conjunction with tests like this (where there is an existing well-tested implementation that we can reference) it hugely expands the coverage surface, actively probing all kinds of different corner-cases.

There is an interesting option for future consideration, which would map very well onto what you describe above: stateful testing with rule-based state machines. The combination of parametrically-generated operator chains (groupby, join, etc.) plus numerical testing could be really strong.
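
For example, a minimal sketch of what a rule-based state machine over chained frame operations might look like (the operations and invariant here are illustrative placeholders):

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, invariant, rule
    import polars as pl

    class ChainedOps(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.df = pl.DataFrame({"k": ["a", "b", "a", "c"], "v": [1, 2, 3, 4]})

        @rule()
        def sort(self):
            # sorting must never change the number of rows
            before = self.df.height
            self.df = self.df.sort("v")
            assert self.df.height == before

        @rule(threshold=st.integers(min_value=0, max_value=5))
        def filter(self, threshold):
            # filtering can only remove rows, never add them
            before = self.df.height
            self.df = self.df.filter(pl.col("v") >= threshold)
            assert self.df.height <= before

        @invariant()
        def schema_unchanged(self):
            assert self.df.columns == ["k", "v"]

    TestChainedOps = ChainedOps.TestCase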

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 28, 2022

@matteosantama - hope you don't mind looking over these ewm_mean results... ;)
(There are more, but these are quite clean; the calculated value differs by +1).

  • Example 1:

    import polars as pl
    s = pl.Series([100000, 9007199254640994, -99999])
    
    print( s.ewm_mean(com=0.0,min_periods=2).to_list() )
    print( s.to_pandas().ewm(com=0.0,min_periods=2).mean().to_list() )
    
    # [None, 9007199254640994.0, -99998.0]  << polars
    # [ nan, 9007199254640994.0, -99999.0]  << pandas

    Note: changing the middle value above by ±1 (to either 9007199254640993 or to 9007199254640995) brings the values back in-line again (eg: both -99999.0). The pandas calculation appears numerically stable throughout the same range.

  • Example 2: (very similar)

    s = pl.Series([10000000000, 9007199254740992, -1])
    
    print( s.ewm_mean(com=0.0,min_periods=2).to_list() )
    print( s.to_pandas().ewm(com=0.0,min_periods=2).mean().to_list() )
    
    # [None, 9007199254740992.0,  0.0]  << polars
    # [ nan, 9007199254740992.0, -1.0]  << pandas

Misc: Some of the other tests that don't tie-out may be connected to subnormal number handling (or something along those lines) as they are generating extreme values; I'm going to see about constraining the tolerance for these before considering them a problem.

@matteosantama
Contributor

Will have to look closer at the ewm_mean issue; I will get back to you shortly.

As for ewm_std and ewm_var, we might want to test the bias parameter too. The simplest way would be to add another for loop...

@matteosantama
Contributor

Where is the chunking happening? Is that taken care of in the generation of the input Series?

@matteosantama
Contributor

OK, so com=0 corresponds to alpha=1. The math says an ewm_mean with alpha=1 should just return the original values, but we do need to account for min_periods.

I'm inclined to say we should just special-case the alpha=1 situation. Doing so would be an optimization too as we can bypass a lot of intermediary calculations.

Since this problem only seems to affect ewm_mean and not ewm_var or ewm_std, we would just special-case it there. If that approach sounds reasonable I am happy to open a PR.
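
(For illustration only: a tiny Python sketch of the semantics of that special case, assuming no nulls in the input; the real change would live in the Rust implementation.)

    def ewm_mean_alpha_one(xs, min_periods):
        # with alpha=1 the recurrence y_j = (1 - alpha)*y_{j-1} + alpha*x_j
        # collapses to y_j = x_j, so the output is the input itself, with the
        # first (min_periods - 1) positions masked out as null/None
        return [x if i + 1 >= min_periods else None for i, x in enumerate(xs)]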

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 28, 2022

Where is the chunking happening? Is that taken care of in the generation of the input Series?

Yup; unless you explicitly set it to True (always chunked) or False (never chunked), chunking is applied randomly to 50% of the generated Series. This way everything subject to parametric testing is automatically validated over chunked/non-chunked data.
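
Roughly, the series strategy does something along these lines (a simplified sketch with illustrative names, not the actual polars strategy code):

    import hypothesis.strategies as st
    import polars as pl

    @st.composite
    def maybe_chunked_series(draw, chunked=None):
        values = draw(st.lists(
            st.floats(min_value=-1e8, max_value=1e8, allow_nan=False),
            min_size=1, max_size=10,
        ))
        s = pl.Series("x", values)
        # unless explicitly forced on/off, chunk ~50% of the generated series
        apply_chunks = draw(st.booleans()) if chunked is None else chunked
        if apply_chunks and len(values) > 1:
            mid = len(values) // 2
            s = pl.concat([s[:mid], s[mid:]], rechunk=False)
        return s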

we might want to test the bias parameter too.

I knew I'd forgotten something... will take care of it!

If that approach sounds reasonable I am happy to open a PR.

Doesn't sound crazy - and with this new coverage we'll find out pretty fast if it has any unforeseen side-effects... ;)

@matteosantama
Contributor

@alexander-beedie Great! I just opened a pull request with the change. Feel free to merge it into this PR, or we can keep them separate.

@alexander-beedie
Collaborator Author

@alexander-beedie Great! I just opened a pull request with the change. Feel free to merge it into this PR, or we can keep them separate.

Fantastic; let's keep them separate for now - there's still some other noise that I want to track down/explain. (It looks more like noise than something problematic; I'll take a look against your patch above to see what remains ;)

Just added adjust validation and rebased.

@matteosantama
Contributor

Just added adjust validation and rebased.

Don't forget the bias parameter too, though :)

@alexander-beedie
Collaborator Author

Don't forget the bias parameter too, though :)

Was about to ask - what is bias equivalent to on the pandas functions?
https://pandas.pydata.org/docs/reference/api/pandas.Series.ewm.html

@matteosantama
Contributor

matteosantama commented Sep 28, 2022

Ah, it’s a parameter you pass directly to either std or var (it is meaningless for mean).

https://pandas.pydata.org/docs/reference/api/pandas.core.window.ewm.ExponentialMovingWindow.var.html

pd.Series(xs).ewm().var(bias=True)

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 28, 2022

Ah, it’s a parameter you pass directly to either std or var (it is meaningless for mean).

Got it. FYI: with all the parameter/data variations, I think we're now looking at a little over 4,000 validating tests per EWM type (each of std, mean, var); congrats on the implementation ;)

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 29, 2022

@matteosantama: I've compiled in your update and validated it against the parametric tests; found a difference in behaviour with the alpha optimization, and have one question about acceptable tolerance (vs the pandas reference):

  • Behaviour: alpha=1 optimization, original values returned as-is (inc. mid-series nulls)

    import polars as pl
    s = pl.Series([0, 0, 1, None, 0])
    p = s.to_pandas()
    
    s.ewm_mean( com=0.0, min_periods=2, adjust=True )
    # Series: '' [f64]
    # [
    #     null
    #     0.0
    #     1.0
    #     null   # << null is preserved
    #     0.0
    # ]
    
    p.ewm( com=0.0, min_periods=2, adjust=True, ignore_na=True ).mean()
    # 0    NaN
    # 1    0.0
    # 2    1.0
    # 3    1.0   # << not null; previous value filled forward?
    # 4    0.0
    # dtype: float64
  • Tolerance: assert_series_equal requires raising atol to 1e-06 (usually 1e-08)

    from polars.testing import assert_series_equal
    import polars as pl
    
    s = pl.Series([0, 0, 0, 954])
    p = s.to_pandas()
    
    hl = 0.023902893066406257
    
    std_pl = s.ewm_std( min_periods=2, half_life=hl, bias=True )
    std_pd = pl.Series( p.ewm(min_periods=2, halflife=hl).std(bias=True) )
    
    assert_series_equal(std_pl, std_pd)
    # Value mismatch
    # [left]:  [0.0004815538224696626]
    # [right]: [0.0004816587247304581]

    Note: this is for std; it took a larger number of iterations over half_life values to find it. If this degree of mismatch doesn't cause any particular concern, I'll raise the tolerance.

@matteosantama
Contributor

matteosantama commented Sep 29, 2022

Behaviour: alpha=1 optimization, original values returned as-is (inc. mid-series nulls)

Wow, your tests are catching everything! This is definitely an unforeseen change, but I do think it is correct. Consider the standard equation for ewm mean, y_{j} = (1 - \alpha) y_{j - 1} + \alpha x_{j}. With alpha=1, this is simply y_{j} = x_{j}. Note that this is the formula for adjust=False, but the math works out the same when adjust=True.
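
For completeness, the adjust=True form collapses the same way:

    y_j = \frac{\sum_{i=0}^{j} (1-\alpha)^i \, x_{j-i}}{\sum_{i=0}^{j} (1-\alpha)^i}

and with \alpha = 1 every term with i >= 1 vanishes, leaving y_j = x_j.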

I'm OK with returning that null -- it's defensible because it's exactly what the math says. If someone wants the pandas behavior, they can use fill_null instead.

Tolerance: assert_series_equal requires raising atol to 1e-06 (usually 1e-08)

I think that's probably acceptable. If someone opens an issue and provides us a reason to tighten it we can revisit.

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 29, 2022

Wow, your tests are catching everything!

Parametric testing... "tell your friends" 🤣

I'm OK with returning that null -- it's defensible because it's exactly what the math says.

Great; have made a note to that effect in the tests, and applied fill_null/forward when alpha=1 (just in the tests), which takes care of that entire class of discrepancy.

I think that's probably acceptable. If someone opens an issue and provides us a reason to tighten it we can revisit.

With a little more experimentation it seems that by excluding half_life and com values in the range 0 < abs(x) < 1e-08 most of the noise disappears (with a few edge-cases, as above).


@ritchie46: I think we're good to go; I've rebased/squashed to capture @matteosantama's alpha optimization and have uncommented mean in the parametric tests. Unless the additional iterations run in CI find something else, we can merge.

Any other areas you'd like subjected to more intense testing?
All of the above shows it off quite well ;)

@alexander-beedie
Collaborator Author

think we're good to go

I just had to tempt fate... ;p
Taking a look at the current error.

@ritchie46
Member

Any other areas you'd like subjected to more intense testing?

Yes, the relational primitives combined: sort, join, groupby, select, with different optimizations toggled on and off. I think those are most sensitive to bugs, especially as we try to do smart things there.

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 29, 2022

Yes, the relational primitives combined: sort, join, groupby, select, with different optimizations toggled on and off. I think those are most sensitive to bugs, especially as we try to do smart things there.

Permuting the optimisations is very doable. Need to create some sort of mechanism for combining chained operations with reasonable parameters (which is also something I want at work for our own data API, so it'll happen one way or another ;) I suspect the rules-based state machine approach is going to be useful here. Will investigate properly when I have a good block of time.
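
As a rough illustration of the optimization-toggling part (a hand-rolled sketch with a placeholder frame and query, not the parametric version):

    import polars as pl

    df = pl.DataFrame({"k": ["a", "b", "a", "c"], "v": [1, 2, 3, 4]})
    q = df.lazy().filter(pl.col("v") > 1).groupby("k").agg(pl.col("v").sum()).sort("k")

    # the same lazy query should produce identical results with and without
    # the optimizer passes enabled
    optimized = q.collect()
    unoptimized = q.collect(predicate_pushdown=False, projection_pushdown=False)
    assert optimized.frame_equal(unoptimized)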

In the meantime I'll probably look at hardening other areas as well (for example: IO round-trips via different formats, rolling functions, etc. could be fruitful areas).

@stinodego
Member

Interesting stuff!

One thing I noticed though: the test pipelines are a lot slower now. What about the trade-off here? Are the tests slow or is calculating the coverage slowing everything down?

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 30, 2022

@ritchie46:
Now we're good to go... :)

@matteosantama:
Identified/addressed the source of all the remaining noise by applying an alpha filter to the floats-generation strategy that suppresses it; it prevents passing through any values of com, span, or half_life that would result in an alpha closer to 0 or 1 than 1e-06 (so the final alpha is now bounded between 0.000001 and 0.999999 for these tests).
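
For reference, the filter amounts to something like this (a simplified sketch; the conversions are just the standard EWM decay-parameter definitions):

    import math

    def implied_alpha(param, value):
        # standard EWM conversions: com, span and half_life all map to an alpha
        if param == "com":
            return 1.0 / (1.0 + value)
        if param == "span":
            return 2.0 / (value + 1.0)
        if param == "half_life":
            return 1.0 - math.exp(-math.log(2.0) / value)
        raise ValueError(param)

    def alpha_ok(param, value, eps=1e-06):
        # reject any decay param whose implied alpha is within eps of 0 or 1
        a = implied_alpha(param, value)
        return eps < a < (1.0 - eps)

    # e.g. used as a strategy filter:
    # st.floats(min_value=0.0, max_value=1e8).filter(lambda v: alpha_ok("com", v))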

For final validation I ran 20,000 iterations locally (which translates to ~1.1 million successful tests for each of the three decay params); I'm happy to report that every one of those tests passed with flying colours ;)

@stinodego:
Hmm, this type of testing is (inevitably) more expensive than a handful of unit tests (hence it doesn't run by default locally). However, in this PR I also raised the number of iterations under CI from 100 (the same as the default local profile) to 500; perhaps that was a bit much, if coverage has a significant negative interaction? I've scaled it back down to 200, which looks much less impactful. I'll characterise the performance with/without coverage (which I haven't tested yet) and see if we can tune things to a happy medium :)

@ritchie46
Member

Thanks a lot for the extensive tests.

Given that this adds a huge number of tests for only one of our many operations, I wonder a bit how we should go about scaling this.

Would it be possible to keep the parameters that led to bugs around? Maybe we could keep running those, as these were the found edge cases.

Any suggestions/ideas?

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 30, 2022

Would it be possible to keep the parameters that led to bugs around?

Encoding too many assumptions tends to be a bad idea, as it's hard to say in advance what combination of parameters/values may lead to bugs. Arguably you're more likely to find bugs where you weren't expecting them, for exactly that reason, heh. (Which isn't to say you shouldn't encode any assumptions, just that you should do so judiciously).

Any suggestions/ideas?

Plenty :) There are lots of ways to tune things; given how few parameterised tests there are at the moment I have spent essentially no time assessing them in terms of performance tradeoff.

One immediately available option is the ability to control the number of iterations on a per-test basis. When we have a heavier test (like this new EWM one) we can dial it back a bit. For instance, the Series slice test runs 500 iterations in ~450ms, but I think 200 iterations of EWM takes around 16 secs (I can optimize this significantly; every single combination of every param on every iteration is overkill).

Setting per-test iterations is trivial; just a decorator:

@settings(max_examples=250)
def test_( ... ) -> None:

Could also keep the default number of iterations during CI the same as a local run (100), rather than have CI do more (200). The author of the test can always run more before committing (I ran 20,000 iterations on this one).
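
(For context, these profiles are plain hypothesis settings profiles; roughly, with illustrative names:)

    import os
    from hypothesis import settings

    # a modest local/CI default, plus a heavier opt-in profile
    settings.register_profile("local", max_examples=100)
    settings.register_profile("ci-max", max_examples=500)
    settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "local"))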

I'm also very curious about @stinodego's suggestion that coverage may interact badly here; I want to look into that and see if it's a factor.

How about I optimize this test a bit more, then we merge it, and I'll start looking at things from a higher-level performance perspective next?

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 30, 2022

Also: this test is a bit of an outlier in another way - you don't necessarily want to be limited to testing against a reference implementation (though here it makes sense to do so). One of the ways this test becomes more expensive is that we aren't just testing our own functionality, we're testing pandas as well.

One of the best ways to employ this type of testing more generically is to validate invariants; essentially you test the properties of your results to check for violations of the invariants, rather than limiting yourself to testing the values of your results against some reference implementation (hence the term "property-based testing"). By chance I happened to find "Considerations for control of data invariants" by @jorgecarleitao on YouTube recently, so that ties in nicely ;)

There's a very clean description of how to think about property-based testing here.
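
As a concrete example of that style applied here, one could assert an invariant of ewm_mean directly instead of comparing against pandas (a minimal sketch; bounds and tolerance are illustrative):

    from hypothesis import given, strategies as st
    import polars as pl

    @given(
        values=st.lists(
            st.floats(min_value=-1e3, max_value=1e3, allow_nan=False),
            min_size=1, max_size=10,
        ),
        alpha=st.floats(min_value=0.01, max_value=0.99),
    )
    def test_ewm_mean_is_bounded(values, alpha):
        # invariant: an exponentially weighted mean is a convex combination of
        # the values seen so far, so each output lies within their min/max
        out = pl.Series(values).ewm_mean(alpha=alpha, min_periods=1)
        for i, y in enumerate(out):
            lo, hi = min(values[: i + 1]), max(values[: i + 1])
            assert lo - 1e-9 <= y <= hi + 1e-9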

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 30, 2022

@stinodego / @ritchie46: hah, OK - there was actually a lot of low-hanging fruit; I just committed a patch that generically speeds up a lot of the polars-specific parametric testing functionality. That plus some test-specific optimisations made these EWM tests run about 10x faster. Probably more can be done (haven't looked at coverage yet), but already it's a big improvement; much-reduced CI impact.


Generic speedups

  • reduced max generated dataframe/series length to 10 (from 20).
  • constrained max length of Utf8 text strategy results to 10 characters (could be as high as 80 before).
  • made an opt-in CI_MAX profile and defaulted regular CI back to 100 iterations.
  • optimized application of all/no null values to generated data.

EWM speedups

  • moved alpha/bias params into the @given decorator rather than running each permutation on every iteration (see the sketch at the end of this comment).
  • test against a single float series instead of a float and an int series (which just gets turned into a float series anyway).

Misc speedups

  • optimized the tests that test the parametric tests: parametric/tests_testing.py :)
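
Regarding the first EWM item above (moving params into @given), the change is essentially this shape (illustrative names only):

    from hypothesis import given, strategies as st

    # instead of looping over every (adjust, bias) combination inside the test
    # body, let hypothesis draw one combination per iteration
    @given(
        adjust=st.booleans(),
        bias=st.booleans(),
        com=st.floats(min_value=0.5, max_value=10.0),
    )
    def test_ewm_variants(adjust, bias, com):
        ...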

@stinodego
Member

I think the test pipeline is now faster than it was before this PR 😄 nice work!

@alexander-beedie
Collaborator Author

I think the test pipeline is now faster than it was before this PR 😄 nice work!

Heh. It is actually... ;)
I'll look deeper at the weekend.

@ritchie46
Member

I think the test pipeline is now faster than it was before this PR 😄 nice work!

Heh. It is actually... ;) I'll look deeper at the weekend.

Should we wait then, or should we follow up with that?

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 30, 2022

Should we wait then, or should we follow up with that?

No, let's not wait; this is now ready to go and speeds up the existing pipeline (by making all the non-EWM parametric tests faster). If there's a good follow-up (eg: can tweak coverage behaviour, or make other worthwhile adjustments to the polars-specific hypothesis primitives) it will be separate from this.

@alexander-beedie
Collaborator Author

alexander-beedie commented Sep 30, 2022

Rebased/squashed (merge conflict with updated requirements).

@ritchie46 ritchie46 merged commit 1bfb378 into pola-rs:master Sep 30, 2022
@ritchie46
Member

Alright! Weekend! 🎉😉

@alexander-beedie alexander-beedie deleted the parametric-ewm-tests branch September 30, 2022 19:21
@stinodego stinodego changed the title tests[python]: parametric test coverage for EWM functions test(python): parametric test coverage for EWM functions Oct 1, 2022
@github-actions github-actions bot added the test Related to the test suite label Oct 1, 2022
@stinodego stinodego changed the title test(python): parametric test coverage for EWM functions test(python): Parametric test coverage for EWM functions Oct 1, 2022