Refactor Frame reductions #8944

vyasr · 2021-08-04T00:01:24Z

This PR moves implementations of reductions out of the Series/DataFrame classes and into Frame. The resulting reduction code is implemented in terms of columns, which improves the performance of DataFrame reductions, and using a single code path makes it easier to maintain. The median and sum_of_squares reductions, which were previously only available for Series, are now transparently enabled for DataFrame as well.

This PR also explicitly disables reductions for Index objects to match pandas Index APIs. Since a few reductions had previously been implemented, removing these features constitutes a breaking change.

This reverts commit 61f09eb36445beabc7a01dc065cb9c2666e8ee3c.

…me any/all and clean up Series any/all implementations.

…for DataFrame.

This reverts commit 034036fdc371576d8be1d7e49b753525fe1fc943.

vyasr · 2021-08-04T00:03:56Z

EDIT 8/4: I had originally posted these benchmarks backwards, my mistake.

Benchmarks before

--------------------------------------------------------------------------------------------------------- benchmark: 36 tests ----------------------------------------------------------------------------------------------------------
Name (time in us)                                            Min                   Max                  Mean              StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reductions[False-1000-DataFrame-product]           459.1811 (4.51)     6,591.2269 (46.06)      471.6086 (4.56)     148.6678 (94.95)      465.2860 (4.52)       3.0380 (4.81)        3;263  2,120.4025 (0.22)       1832           1
test_reductions[False-100000-DataFrame-sum]             465.8159 (4.58)     4,530.4187 (31.66)      474.5430 (4.59)     102.1458 (65.24)      470.3626 (4.57)       2.7325 (4.33)        2;227  2,107.2906 (0.22)       1586           1
test_reductions[False-100000-DataFrame-product]         466.6895 (4.59)     2,949.7053 (20.61)      474.2220 (4.59)      58.0874 (37.10)      471.4187 (4.58)       2.9150 (4.62)        1;229  2,108.7171 (0.22)       1832           1
test_reductions[False-1000-DataFrame-mean]              471.2828 (4.63)     1,656.7800 (11.58)      478.5818 (4.63)      31.6981 (20.25)      475.8220 (4.62)       2.9625 (4.69)       15;210  2,089.5068 (0.22)       1451           1
test_reductions[False-100000-DataFrame-mean]            478.3235 (4.70)     2,539.7614 (17.75)      484.7092 (4.69)      48.4824 (30.97)      482.2602 (4.68)       2.5472 (4.03)        2;236  2,063.0926 (0.21)       1816           1
test_reductions[False-1000-DataFrame-sum]               491.4440 (4.83)     4,474.4834 (31.26)      504.2512 (4.88)     152.5271 (97.42)      496.7842 (4.82)       2.8848 (4.57)         1;95  1,983.1384 (0.21)        681           1
test_reductions[False-10000000-DataFrame-sum]           753.4828 (7.40)       863.5093 (6.03)       759.3434 (7.34)       6.1838 (3.95)       757.9075 (7.36)       2.9467 (4.67)       80;104  1,316.9272 (0.14)        820           1
test_reductions[False-10000000-DataFrame-product]       755.5410 (7.42)     2,644.9803 (18.48)      763.1034 (7.38)      57.6807 (36.84)      760.0840 (7.38)       2.7400 (4.34)        2;130  1,310.4384 (0.14)       1077           1
test_reductions[False-10000000-DataFrame-mean]          760.3224 (7.47)     4,612.8873 (32.23)      772.8637 (7.47)     119.1516 (76.10)      767.4918 (7.45)       4.1737 (6.61)         2;76  1,293.8891 (0.13)       1061           1
test_reductions[True-1000-DataFrame-product]            767.5551 (7.54)     2,562.7799 (17.91)      777.7369 (7.52)      52.6391 (33.62)      774.1274 (7.52)       4.9379 (7.82)         5;75  1,285.7819 (0.13)       1176           1
test_reductions[True-100000-DataFrame-product]          770.0827 (7.57)     2,564.3315 (17.92)      789.4741 (7.63)      54.1230 (34.57)      790.2645 (7.67)      15.0320 (23.81)        2;10  1,266.6661 (0.13)       1115           1
test_reductions[True-1000-DataFrame-mean]               775.7321 (7.62)     4,494.1679 (31.40)      801.4046 (7.75)     112.9491 (72.14)      798.7209 (7.76)       5.6066 (8.88)        2;258  1,247.8092 (0.13)       1082           1
test_reductions[True-100000-DataFrame-sum]              782.6481 (7.69)       894.6210 (6.25)       792.6479 (7.66)       7.2169 (4.61)       790.7227 (7.68)       4.9504 (7.84)        98;53  1,261.5942 (0.13)       1053           1
test_reductions[True-1000-DataFrame-sum]                785.0789 (7.71)       855.8016 (5.98)       792.9524 (7.67)       6.1919 (3.95)       791.2787 (7.68)       4.8168 (7.63)       167;57  1,261.1098 (0.13)       1030           1
test_reductions[True-100000-DataFrame-mean]             792.6673 (7.79)       886.3751 (6.19)       806.0984 (7.80)      14.7627 (9.43)       800.1458 (7.77)       8.9784 (14.22)     144;144  1,240.5433 (0.13)       1127           1
test_reductions[True-10000000-DataFrame-product]      2,051.5919 (20.16)    4,179.7161 (29.21)    2,103.5902 (20.34)    118.6643 (75.79)    2,089.4688 (20.29)      8.3195 (13.18)        9;37    475.3778 (0.05)        417           1
test_reductions[True-10000000-DataFrame-sum]          2,058.0981 (20.22)    4,288.7162 (29.97)    2,094.8560 (20.26)    110.9394 (70.86)    2,087.6573 (20.27)     10.1924 (16.14)        1;36    477.3598 (0.05)        398           1
test_reductions[True-10000000-DataFrame-mean]         2,068.6761 (20.33)    2,649.0465 (18.51)    2,110.3968 (20.41)     44.0728 (28.15)    2,103.7245 (20.43)     20.0011 (31.68)       17;18    473.8445 (0.05)        406           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Benchmarks after

Name (time in us)                                            Min                    Max                  Mean              StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reductions[False-1000-DataFrame-product]           376.9975 (3.80)      2,969.2184 (21.67)      387.3554 (3.83)      74.7089 (46.57)      381.9857 (3.79)       2.8973 (3.78)        6;333  2,581.6083 (0.26)       2143           1
test_reductions[False-1000-DataFrame-sum]               377.8562 (3.81)        614.9188 (4.49)       383.7564 (3.79)      11.7315 (7.31)       381.8981 (3.79)       2.7176 (3.54)        12;63  2,605.8201 (0.26)        501           1
test_reductions[False-100000-DataFrame-sum]             384.1836 (3.87)        533.0686 (3.89)       390.7614 (3.86)       7.5587 (4.71)       388.3205 (3.85)       2.5691 (3.35)      196;307  2,559.1063 (0.26)       1887           1
test_reductions[False-100000-DataFrame-product]         384.7275 (3.88)      2,475.6696 (18.07)      393.0814 (3.89)      46.6877 (29.10)      389.4400 (3.86)       2.8079 (3.66)        7;355  2,544.0022 (0.26)       2047           1
test_reductions[False-1000-DataFrame-mean]              384.8076 (3.88)        486.6216 (3.55)       393.1574 (3.89)       9.4375 (5.88)       390.2679 (3.87)       2.7958 (3.64)      165;327  2,543.5104 (0.26)       1984           1
test_reductions[False-100000-DataFrame-mean]            393.4763 (3.96)        450.4528 (3.29)       401.0123 (3.96)       7.8656 (4.90)       398.2447 (3.95)       2.9644 (3.86)      241;359  2,493.6892 (0.25)       2033           1
test_reductions[False-10000000-DataFrame-sum]           671.3420 (6.76)        883.9816 (6.45)       679.3628 (6.72)      12.2985 (7.67)       676.3516 (6.71)       4.0028 (5.22)       61;111  1,471.9675 (0.15)        904           1
test_reductions[False-10000000-DataFrame-product]       673.5548 (6.78)        807.8441 (5.90)       680.9655 (6.73)       7.2397 (4.51)       678.6939 (6.73)       3.9046 (5.09)      128;152  1,468.5032 (0.15)       1173           1
test_reductions[False-10000000-DataFrame-mean]          678.6920 (6.84)        811.1354 (5.92)       686.5401 (6.79)       7.7184 (4.81)       684.4029 (6.79)       4.6762 (6.09)      114;110  1,456.5791 (0.15)       1161           1
test_reductions[True-1000-DataFrame-mean]               702.9921 (7.08)      2,809.5022 (20.51)      721.2702 (7.13)      59.7265 (37.23)      722.0190 (7.16)      15.6453 (20.39)        3;15  1,386.4431 (0.14)       1264           1
test_reductions[True-100000-DataFrame-product]          704.9441 (7.10)      4,696.9540 (34.29)      721.1655 (7.13)     112.3117 (70.00)      712.4506 (7.07)      13.6299 (17.76)        2;69  1,386.6442 (0.14)       1276           1
test_reductions[True-1000-DataFrame-sum]                707.6617 (7.13)        815.4251 (5.95)       716.2585 (7.08)       6.2876 (3.92)       714.5908 (7.09)       3.9618 (5.16)      171;140  1,396.1439 (0.14)       1181           1
test_reductions[True-1000-DataFrame-product]            709.3251 (7.14)        801.0324 (5.85)       717.5478 (7.09)       6.9645 (4.34)       715.5538 (7.10)       3.9986 (5.21)      147;153  1,393.6355 (0.14)       1291           1
test_reductions[True-100000-DataFrame-sum]              716.3677 (7.22)        832.3677 (6.08)       724.1857 (7.16)       7.1275 (4.44)       722.0749 (7.16)       4.1611 (5.42)      131;135  1,380.8614 (0.14)       1278           1
test_reductions[True-100000-DataFrame-mean]             725.6586 (7.31)        918.2934 (6.70)       736.7115 (7.28)      20.1752 (12.57)      731.9003 (7.26)       5.7896 (7.54)        40;66  1,357.3835 (0.14)       1083           1
test_reductions[True-10000000-DataFrame-product]      2,036.5752 (20.51)     2,747.7220 (20.06)    2,096.6431 (20.73)     41.2705 (25.72)    2,090.6944 (20.74)     10.7386 (13.99)       13;44    476.9529 (0.05)        413           1
test_reductions[True-10000000-DataFrame-sum]          2,047.1606 (20.62)     2,757.9069 (20.13)    2,092.3061 (20.68)     37.3036 (23.25)    2,088.9034 (20.73)      7.7048 (10.04)       10;47    477.9415 (0.05)        416           1
test_reductions[True-10000000-DataFrame-mean]         2,055.8219 (20.71)    12,020.5674 (87.74)    2,121.0629 (20.97)    484.3582 (301.90)   2,095.7980 (20.79)      8.7516 (11.40)        1;42    471.4617 (0.05)        420           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

vyasr · 2021-08-04T17:03:41Z

Could be reading the benchmarks wrong, but are we somemwhat slower after the refactor? For example:

Before:

test_reductions[True-100000-DataFrame-mean]             725.6586 (7.31)        918.2934 (6.70)       736.7115 (7.28)      20.1752 (12.57)      731.9003 (7.26)       5.7896 (7.54)        40;66  1,357.3835 (0.14)       1083           1

After:

test_reductions[True-100000-DataFrame-mean]             792.6673 (7.79)       886.3751 (6.19)       806.0984 (7.80)      14.7627 (9.43)       800.1458 (7.77)       8.9784 (14.22)     144;144  1,240.5433 (0.13)       1127           1

@shwina my bad, I had posted the benchmarks backwards :)

python/cudf/cudf/core/index.py

python/cudf/cudf/core/frame.py

…e_reductions

codecov · 2021-08-05T23:01:12Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@4b5853d). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.10    #8944   +/-   ##
===============================================
  Coverage                ?   10.60%           
===============================================
  Files                   ?      116           
  Lines                   ?    19003           
  Branches                ?        0           
===============================================
  Hits                    ?     2015           
  Misses                  ?    16988           
  Partials                ?        0

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b5853d...50f92c0. Read the comment docs.

kkraus14 · 2021-08-06T00:49:33Z

hmm, apparently I'm not a python codeowner anymore?

quasiben · 2021-08-06T17:49:12Z

@kkraus14 we are working through your question. In the mean time, @shwina can you review/approve ?

python/cudf/cudf/tests/test_dataframe.py

…e_reductions

vyasr · 2021-08-06T22:36:43Z

@shwina I just pushed changes to add tests of index reductions again. Regarding the other change (removing a NotImplementedError), I don't think that's a breaking change because it takes a case that failed before (was marked as unsupported even though it should be supported) and lets it function as expected.

shwina · 2021-08-07T01:06:29Z

@gpucibot merge

This PR moves implementations of reductions out of the `Series`/`DataFrame` classes and into `Frame`. The resulting reduction code is implemented in terms of columns, which improves the performance of `DataFrame` reductions, and using a single code path makes it easier to maintain. The `median` and `sum_of_squares` reductions, which were previously only available for `Series`, are now transparently enabled for `DataFrame` as well. This PR also explicitly disables reductions for Index objects to match pandas Index APIs. Since a few reductions had previously been implemented, removing these features constitutes a breaking change. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Keith Kraus (https://github.com/kkraus14) - Ashwin Srinath (https://github.com/shwina) URL: rapidsai#8944

This PR rewrites DataFrame's `kurtosis` and `skew` to use the `_reduce` method introduced in #8944, and it inlines the logic for the `count` to bypass the `_apply_support_method` machinery. This allows us to remove most of that logic entirely aside from the code for row-wise reductions and scans that dispatches to `cupy`. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #9068

vyasr added 17 commits August 3, 2021 16:16

Move logic for any/all down from Series into Column.

0b373d9

Simplify _apply_support_method for improved performance.

3a0e11e

Alias DataFrame.product as DataFrame.prod.

e6a1e2a

Revert "Simplify _apply_support_method for improved performance."

a70eebd

This reverts commit 61f09eb36445beabc7a01dc065cb9c2666e8ee3c.

Standardize reduction operation argument handling.

3d19fec

Replace asserts with exception handlers.

b437f50

Split axis 0 and axis 1 support methods into separate functions.

ad025b6

Switch reductions to use a new helper method.

823f558

Reimplement axis 0 reductions as column rather than Series ops.

5442189

Move most reductions up to the Frame level.

30781d8

Fix behavior of bool_only to not fail for bool_only=False for DataFra…

e2e37c1

…me any/all and clean up Series any/all implementations.

Move any and all into Frame.

f9d68f4

Optimize DataFrame-Series binop.

8b51959

Move sum_of_squares to Frame.

25f1233

Move median to Frame and add tests of both median and sum_of_squares …

57b4d0c

…for DataFrame.

Revert "Optimize DataFrame-Series binop."

7ed91c7

This reverts commit 034036fdc371576d8be1d7e49b753525fe1fc943.

Fix axis lookup logic.

933f578

vyasr added the 3 - Ready for Review Ready for review by team label Aug 4, 2021

vyasr added this to the CuDF Python Refactoring milestone Aug 4, 2021

vyasr self-assigned this Aug 4, 2021

vyasr requested a review from a team as a code owner August 4, 2021 00:01

vyasr requested review from cwharris and shwina August 4, 2021 00:01

github-actions bot added the Python Affects Python cuDF API. label Aug 4, 2021

vyasr added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue and removed Python Affects Python cuDF API. labels Aug 4, 2021

Disable reductions for index types.

3cd7903

vyasr added breaking Breaking change and removed non-breaking Non-breaking change labels Aug 4, 2021

github-actions bot added the Python Affects Python cuDF API. label Aug 4, 2021

kkraus14 reviewed Aug 4, 2021

View reviewed changes

python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved

vyasr added 5 commits August 5, 2021 10:03

Enable reductions for Index types.

8336fb5

Use f-string for exception.

76cdc13

Fix assertion.

dd756a7

Fix NaN handling for ColumnBase's any and all.

8ab25b5

Merge remote-tracking branch 'origin/branch-21.10' into refactor/fram…

6a52830

…e_reductions

kkraus14 approved these changes Aug 5, 2021

View reviewed changes

kkraus14 approved these changes Aug 6, 2021

View reviewed changes

shwina reviewed Aug 6, 2021

View reviewed changes

python/cudf/cudf/tests/test_dataframe.py Show resolved Hide resolved

Merge remote-tracking branch 'origin/branch-21.10' into refactor/fram…

d1cdb9d

…e_reductions

vyasr requested a review from shwina August 6, 2021 22:34

Add tests for more index reductions.

50f92c0

shwina approved these changes Aug 7, 2021

View reviewed changes

rapids-bot bot merged commit 115f3b6 into rapidsai:branch-21.10 Aug 7, 2021

vyasr mentioned this pull request Aug 18, 2021

Remove remaining "support" methods from DataFrame #9068

Merged

vyasr deleted the refactor/frame_reductions branch March 31, 2022 20:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor Frame reductions #8944

Refactor Frame reductions #8944

vyasr commented Aug 4, 2021 •

edited

Loading

vyasr commented Aug 4, 2021 •

edited

Loading

vyasr commented Aug 4, 2021

codecov bot commented Aug 5, 2021 •

edited

Loading

kkraus14 commented Aug 6, 2021

quasiben commented Aug 6, 2021

vyasr commented Aug 6, 2021

shwina commented Aug 7, 2021

Refactor Frame reductions #8944

Refactor Frame reductions #8944

Conversation

vyasr commented Aug 4, 2021 • edited Loading

vyasr commented Aug 4, 2021 • edited Loading

vyasr commented Aug 4, 2021

codecov bot commented Aug 5, 2021 • edited Loading

Codecov Report

kkraus14 commented Aug 6, 2021

quasiben commented Aug 6, 2021

vyasr commented Aug 6, 2021

shwina commented Aug 7, 2021

vyasr commented Aug 4, 2021 •

edited

Loading

vyasr commented Aug 4, 2021 •

edited

Loading

codecov bot commented Aug 5, 2021 •

edited

Loading