Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Frame reductions #8944

Merged
merged 25 commits into from
Aug 7, 2021

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Aug 4, 2021

This PR moves implementations of reductions out of the Series/DataFrame classes and into Frame. The resulting reduction code is implemented in terms of columns, which improves the performance of DataFrame reductions, and using a single code path makes it easier to maintain. The median and sum_of_squares reductions, which were previously only available for Series, are now transparently enabled for DataFrame as well.

This PR also explicitly disables reductions for Index objects to match pandas Index APIs. Since a few reductions had previously been implemented, removing these features constitutes a breaking change.

@vyasr vyasr added the 3 - Ready for Review Ready for review by team label Aug 4, 2021
@vyasr vyasr added this to the CuDF Python Refactoring milestone Aug 4, 2021
@vyasr vyasr self-assigned this Aug 4, 2021
@vyasr vyasr requested a review from a team as a code owner August 4, 2021 00:01
@vyasr vyasr requested review from cwharris and shwina August 4, 2021 00:01
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 4, 2021
@vyasr vyasr added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue and removed Python Affects Python cuDF API. labels Aug 4, 2021
@vyasr
Copy link
Contributor Author

vyasr commented Aug 4, 2021

EDIT 8/4: I had originally posted these benchmarks backwards, my mistake.

Benchmarks before

--------------------------------------------------------------------------------------------------------- benchmark: 36 tests ----------------------------------------------------------------------------------------------------------
Name (time in us)                                            Min                   Max                  Mean              StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reductions[False-1000-DataFrame-product]           459.1811 (4.51)     6,591.2269 (46.06)      471.6086 (4.56)     148.6678 (94.95)      465.2860 (4.52)       3.0380 (4.81)        3;263  2,120.4025 (0.22)       1832           1
test_reductions[False-100000-DataFrame-sum]             465.8159 (4.58)     4,530.4187 (31.66)      474.5430 (4.59)     102.1458 (65.24)      470.3626 (4.57)       2.7325 (4.33)        2;227  2,107.2906 (0.22)       1586           1
test_reductions[False-100000-DataFrame-product]         466.6895 (4.59)     2,949.7053 (20.61)      474.2220 (4.59)      58.0874 (37.10)      471.4187 (4.58)       2.9150 (4.62)        1;229  2,108.7171 (0.22)       1832           1
test_reductions[False-1000-DataFrame-mean]              471.2828 (4.63)     1,656.7800 (11.58)      478.5818 (4.63)      31.6981 (20.25)      475.8220 (4.62)       2.9625 (4.69)       15;210  2,089.5068 (0.22)       1451           1
test_reductions[False-100000-DataFrame-mean]            478.3235 (4.70)     2,539.7614 (17.75)      484.7092 (4.69)      48.4824 (30.97)      482.2602 (4.68)       2.5472 (4.03)        2;236  2,063.0926 (0.21)       1816           1
test_reductions[False-1000-DataFrame-sum]               491.4440 (4.83)     4,474.4834 (31.26)      504.2512 (4.88)     152.5271 (97.42)      496.7842 (4.82)       2.8848 (4.57)         1;95  1,983.1384 (0.21)        681           1
test_reductions[False-10000000-DataFrame-sum]           753.4828 (7.40)       863.5093 (6.03)       759.3434 (7.34)       6.1838 (3.95)       757.9075 (7.36)       2.9467 (4.67)       80;104  1,316.9272 (0.14)        820           1
test_reductions[False-10000000-DataFrame-product]       755.5410 (7.42)     2,644.9803 (18.48)      763.1034 (7.38)      57.6807 (36.84)      760.0840 (7.38)       2.7400 (4.34)        2;130  1,310.4384 (0.14)       1077           1
test_reductions[False-10000000-DataFrame-mean]          760.3224 (7.47)     4,612.8873 (32.23)      772.8637 (7.47)     119.1516 (76.10)      767.4918 (7.45)       4.1737 (6.61)         2;76  1,293.8891 (0.13)       1061           1
test_reductions[True-1000-DataFrame-product]            767.5551 (7.54)     2,562.7799 (17.91)      777.7369 (7.52)      52.6391 (33.62)      774.1274 (7.52)       4.9379 (7.82)         5;75  1,285.7819 (0.13)       1176           1
test_reductions[True-100000-DataFrame-product]          770.0827 (7.57)     2,564.3315 (17.92)      789.4741 (7.63)      54.1230 (34.57)      790.2645 (7.67)      15.0320 (23.81)        2;10  1,266.6661 (0.13)       1115           1
test_reductions[True-1000-DataFrame-mean]               775.7321 (7.62)     4,494.1679 (31.40)      801.4046 (7.75)     112.9491 (72.14)      798.7209 (7.76)       5.6066 (8.88)        2;258  1,247.8092 (0.13)       1082           1
test_reductions[True-100000-DataFrame-sum]              782.6481 (7.69)       894.6210 (6.25)       792.6479 (7.66)       7.2169 (4.61)       790.7227 (7.68)       4.9504 (7.84)        98;53  1,261.5942 (0.13)       1053           1
test_reductions[True-1000-DataFrame-sum]                785.0789 (7.71)       855.8016 (5.98)       792.9524 (7.67)       6.1919 (3.95)       791.2787 (7.68)       4.8168 (7.63)       167;57  1,261.1098 (0.13)       1030           1
test_reductions[True-100000-DataFrame-mean]             792.6673 (7.79)       886.3751 (6.19)       806.0984 (7.80)      14.7627 (9.43)       800.1458 (7.77)       8.9784 (14.22)     144;144  1,240.5433 (0.13)       1127           1
test_reductions[True-10000000-DataFrame-product]      2,051.5919 (20.16)    4,179.7161 (29.21)    2,103.5902 (20.34)    118.6643 (75.79)    2,089.4688 (20.29)      8.3195 (13.18)        9;37    475.3778 (0.05)        417           1
test_reductions[True-10000000-DataFrame-sum]          2,058.0981 (20.22)    4,288.7162 (29.97)    2,094.8560 (20.26)    110.9394 (70.86)    2,087.6573 (20.27)     10.1924 (16.14)        1;36    477.3598 (0.05)        398           1
test_reductions[True-10000000-DataFrame-mean]         2,068.6761 (20.33)    2,649.0465 (18.51)    2,110.3968 (20.41)     44.0728 (28.15)    2,103.7245 (20.43)     20.0011 (31.68)       17;18    473.8445 (0.05)        406           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Benchmarks after

Name (time in us)                                            Min                    Max                  Mean              StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_reductions[False-1000-DataFrame-product]           376.9975 (3.80)      2,969.2184 (21.67)      387.3554 (3.83)      74.7089 (46.57)      381.9857 (3.79)       2.8973 (3.78)        6;333  2,581.6083 (0.26)       2143           1
test_reductions[False-1000-DataFrame-sum]               377.8562 (3.81)        614.9188 (4.49)       383.7564 (3.79)      11.7315 (7.31)       381.8981 (3.79)       2.7176 (3.54)        12;63  2,605.8201 (0.26)        501           1
test_reductions[False-100000-DataFrame-sum]             384.1836 (3.87)        533.0686 (3.89)       390.7614 (3.86)       7.5587 (4.71)       388.3205 (3.85)       2.5691 (3.35)      196;307  2,559.1063 (0.26)       1887           1
test_reductions[False-100000-DataFrame-product]         384.7275 (3.88)      2,475.6696 (18.07)      393.0814 (3.89)      46.6877 (29.10)      389.4400 (3.86)       2.8079 (3.66)        7;355  2,544.0022 (0.26)       2047           1
test_reductions[False-1000-DataFrame-mean]              384.8076 (3.88)        486.6216 (3.55)       393.1574 (3.89)       9.4375 (5.88)       390.2679 (3.87)       2.7958 (3.64)      165;327  2,543.5104 (0.26)       1984           1
test_reductions[False-100000-DataFrame-mean]            393.4763 (3.96)        450.4528 (3.29)       401.0123 (3.96)       7.8656 (4.90)       398.2447 (3.95)       2.9644 (3.86)      241;359  2,493.6892 (0.25)       2033           1
test_reductions[False-10000000-DataFrame-sum]           671.3420 (6.76)        883.9816 (6.45)       679.3628 (6.72)      12.2985 (7.67)       676.3516 (6.71)       4.0028 (5.22)       61;111  1,471.9675 (0.15)        904           1
test_reductions[False-10000000-DataFrame-product]       673.5548 (6.78)        807.8441 (5.90)       680.9655 (6.73)       7.2397 (4.51)       678.6939 (6.73)       3.9046 (5.09)      128;152  1,468.5032 (0.15)       1173           1
test_reductions[False-10000000-DataFrame-mean]          678.6920 (6.84)        811.1354 (5.92)       686.5401 (6.79)       7.7184 (4.81)       684.4029 (6.79)       4.6762 (6.09)      114;110  1,456.5791 (0.15)       1161           1
test_reductions[True-1000-DataFrame-mean]               702.9921 (7.08)      2,809.5022 (20.51)      721.2702 (7.13)      59.7265 (37.23)      722.0190 (7.16)      15.6453 (20.39)        3;15  1,386.4431 (0.14)       1264           1
test_reductions[True-100000-DataFrame-product]          704.9441 (7.10)      4,696.9540 (34.29)      721.1655 (7.13)     112.3117 (70.00)      712.4506 (7.07)      13.6299 (17.76)        2;69  1,386.6442 (0.14)       1276           1
test_reductions[True-1000-DataFrame-sum]                707.6617 (7.13)        815.4251 (5.95)       716.2585 (7.08)       6.2876 (3.92)       714.5908 (7.09)       3.9618 (5.16)      171;140  1,396.1439 (0.14)       1181           1
test_reductions[True-1000-DataFrame-product]            709.3251 (7.14)        801.0324 (5.85)       717.5478 (7.09)       6.9645 (4.34)       715.5538 (7.10)       3.9986 (5.21)      147;153  1,393.6355 (0.14)       1291           1
test_reductions[True-100000-DataFrame-sum]              716.3677 (7.22)        832.3677 (6.08)       724.1857 (7.16)       7.1275 (4.44)       722.0749 (7.16)       4.1611 (5.42)      131;135  1,380.8614 (0.14)       1278           1
test_reductions[True-100000-DataFrame-mean]             725.6586 (7.31)        918.2934 (6.70)       736.7115 (7.28)      20.1752 (12.57)      731.9003 (7.26)       5.7896 (7.54)        40;66  1,357.3835 (0.14)       1083           1
test_reductions[True-10000000-DataFrame-product]      2,036.5752 (20.51)     2,747.7220 (20.06)    2,096.6431 (20.73)     41.2705 (25.72)    2,090.6944 (20.74)     10.7386 (13.99)       13;44    476.9529 (0.05)        413           1
test_reductions[True-10000000-DataFrame-sum]          2,047.1606 (20.62)     2,757.9069 (20.13)    2,092.3061 (20.68)     37.3036 (23.25)    2,088.9034 (20.73)      7.7048 (10.04)       10;47    477.9415 (0.05)        416           1
test_reductions[True-10000000-DataFrame-mean]         2,055.8219 (20.71)    12,020.5674 (87.74)    2,121.0629 (20.97)    484.3582 (301.90)   2,095.7980 (20.79)      8.7516 (11.40)        1;42    471.4617 (0.05)        420           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@vyasr
Copy link
Contributor Author

vyasr commented Aug 4, 2021

Could be reading the benchmarks wrong, but are we somemwhat slower after the refactor? For example:

Before:

test_reductions[True-100000-DataFrame-mean]             725.6586 (7.31)        918.2934 (6.70)       736.7115 (7.28)      20.1752 (12.57)      731.9003 (7.26)       5.7896 (7.54)        40;66  1,357.3835 (0.14)       1083           1

After:

test_reductions[True-100000-DataFrame-mean]             792.6673 (7.79)       886.3751 (6.19)       806.0984 (7.80)      14.7627 (9.43)       800.1458 (7.77)       8.9784 (14.22)     144;144  1,240.5433 (0.13)       1127           1

@shwina my bad, I had posted the benchmarks backwards :)

@vyasr vyasr added breaking Breaking change and removed non-breaking Non-breaking change labels Aug 4, 2021
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 4, 2021
python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Aug 5, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@4b5853d). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #8944   +/-   ##
===============================================
  Coverage                ?   10.60%           
===============================================
  Files                   ?      116           
  Lines                   ?    19003           
  Branches                ?        0           
===============================================
  Hits                    ?     2015           
  Misses                  ?    16988           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b5853d...50f92c0. Read the comment docs.

@kkraus14
Copy link
Collaborator

kkraus14 commented Aug 6, 2021

hmm, apparently I'm not a python codeowner anymore?

@quasiben
Copy link
Member

quasiben commented Aug 6, 2021

@kkraus14 we are working through your question. In the mean time, @shwina can you review/approve ?

@vyasr vyasr requested a review from shwina August 6, 2021 22:34
@vyasr
Copy link
Contributor Author

vyasr commented Aug 6, 2021

@shwina I just pushed changes to add tests of index reductions again. Regarding the other change (removing a NotImplementedError), I don't think that's a breaking change because it takes a case that failed before (was marked as unsupported even though it should be supported) and lets it function as expected.

@shwina
Copy link
Contributor

shwina commented Aug 7, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 115f3b6 into rapidsai:branch-21.10 Aug 7, 2021
shwina pushed a commit to shwina/cudf that referenced this pull request Aug 9, 2021
This PR moves implementations of reductions out of the `Series`/`DataFrame` classes and into `Frame`. The resulting reduction code is implemented in terms of columns, which improves the performance of `DataFrame` reductions, and using a single code path makes it easier to maintain. The `median` and `sum_of_squares` reductions, which were previously only available for `Series`, are now transparently enabled for `DataFrame` as well.

This PR also explicitly disables reductions for Index objects to match pandas Index APIs. Since a few reductions had previously been implemented, removing these features constitutes a breaking change.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Ashwin Srinath (https://github.com/shwina)

URL: rapidsai#8944
rapids-bot bot pushed a commit that referenced this pull request Aug 20, 2021
This PR rewrites DataFrame's `kurtosis` and `skew` to use the `_reduce` method introduced in #8944, and it inlines the logic for the `count` to bypass the `_apply_support_method` machinery. This allows us to remove most of that logic entirely aside from the code for row-wise reductions and scans that dispatches to `cupy`.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #9068
@vyasr vyasr deleted the refactor/frame_reductions branch March 31, 2022 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change improvement Improvement / enhancement to an existing function Performance Performance related issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants