[REVIEW] Optimize cudf.concat for axis=0 #9222

Merged
merged 2 commits into from
Sep 14, 2021

Conversation

galipremsagar
Contributor

@galipremsagar galipremsagar commented Sep 13, 2021

This PR optimizes cudf.concat for axis=0 by not materializing the RangeIndex objects that serve as the index of the input DataFrame objects.

Partially addresses #9200. This is the first of two optimization PRs. A follow-up PR to optimize axis=1 will be opened separately, as it involves multiple large changes.
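To illustrate the idea of "not materializing" an index, here is a minimal conceptual sketch in plain Python. It uses the built-in range as a stand-in for RangeIndex and a hypothetical helper, concat_materialized, that does not exist in cuDF. It is not the code changed by this PR, only a picture of why keeping a range lazy is cheaper than expanding it.

```python
import sys


def concat_materialized(indexes):
    """Hypothetical helper: concatenate by expanding every range into values."""
    out = []
    for idx in indexes:
        out.extend(idx)  # O(n) work and memory for each input range
    return out


# Stand-ins for the two RangeIndex objects used in the benchmark.
left = range(0, 300)
right = range(300, 600)

materialized = concat_materialized([left, right])

# Because `right` starts exactly where `left` stops, the result can stay a range:
# only start/stop/step are stored, regardless of length.
lazy = range(left.start, right.stop)

assert list(lazy) == materialized
print(len(materialized), sys.getsizeof(lazy) < sys.getsizeof(materialized))
```

The lazy result describes the same 600 labels while storing only three integers, which is the kind of saving the axis=0 path aims for.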

Here is a benchmark:
On branch-21.10:

IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*100})

In [3]: df2 = cudf.DataFrame({'a':[1, 2, 3]*100}, index=cudf.RangeIndex(300, 600))

In [4]: %timeit cudf.concat([df, df2])
806 µs ± 8.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This PR:

IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*100})

In [3]: df2 = cudf.DataFrame({'a':[1, 2, 3]*100}, index=cudf.RangeIndex(300, 600))

In [4]: %timeit cudf.concat([df, df2])
434 µs ± 4.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
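
For a quick read of the two timings above, the relative improvement can be worked out as follows. This is plain arithmetic on the reported means (806 µs and 434 µs), not a new measurement.

```python
# Mean timings reported by %timeit in the two sessions above (microseconds).
baseline_us = 806.0   # branch-21.10
optimized_us = 434.0  # this PR

speedup = baseline_us / optimized_us
reduction_pct = (baseline_us - optimized_us) / baseline_us * 100

print(f"speedup: {speedup:.2f}x, time saved: {reduction_pct:.1f}%")
# prints: speedup: 1.86x, time saved: 46.2%
```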

@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 13, 2021
@galipremsagar galipremsagar self-assigned this Sep 13, 2021
@galipremsagar galipremsagar requested a review from a team as a code owner September 13, 2021 20:13
@codecov

codecov bot commented Sep 13, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@c6ddd46). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head dd6338c differs from pull request most recent head f3f134d. Consider uploading reports for the commit f3f134d to get more accurate results

@@               Coverage Diff               @@
##             branch-21.10    #9222   +/-   ##
===============================================
  Coverage                ?   10.81%           
===============================================
  Files                   ?      115           
  Lines                   ?    19170           
  Branches                ?        0           
===============================================
  Hits                    ?     2074           
  Misses                  ?    17096           
  Partials                ?        0           

Comment on lines +1227 to +1230
elif are_all_range_index and not ignore_index:
    out._index = cudf.core.index.GenericIndex._concat(
        [o._index for o in objs]
    )
Contributor

Is this case not included in line 1218?

Contributor Author

@galipremsagar galipremsagar Sep 13, 2021

Nope, that line expects the index columns to be materialized. Here we want to avoid materializing them and instead hit the specialized RangeIndex concat logic already present in index.py:

https://github.com/rapidsai/cudf/blob/branch-21.10/python/cudf/cudf/core/index.py#L688-L702

Contributor Author

In particular, we want to hit _concat_range_index in this case:

def _concat_range_index(indexes: List[RangeIndex]) -> BaseIndex:
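
The body of _concat_range_index is not quoted in this thread, so the following is only a hypothetical sketch of what a specialized range concat can look like, written in plain Python with the built-in range standing in for RangeIndex. The function name concat_ranges and its fallback behavior are assumptions made for illustration, not cuDF's implementation.

```python
from typing import List, Sequence, Union


def concat_ranges(ranges: Sequence[range]) -> Union[range, List[int]]:
    """Hypothetical sketch: keep the result a range when the inputs chain together."""
    non_empty = [r for r in ranges if len(r) > 0]
    if not non_empty:
        return range(0)

    first = non_empty[0]
    step = first.step
    expected_start = first.start
    for r in non_empty:
        if r.step != step or r.start != expected_start:
            # The inputs do not form one arithmetic sequence:
            # fall back to materializing every value.
            return [value for rng in ranges for value in rng]
        # The next range must begin one step after this range's last element.
        expected_start = r.start + len(r) * step

    return range(first.start, expected_start, step)


print(concat_ranges([range(0, 300), range(300, 600)]))  # stays lazy
print(concat_ranges([range(0, 3), range(5, 8)]))        # gap, so values are materialized
```

In the benchmark from the PR description, the two indexes chain (0 to 300, then 300 to 600), which is the case where a fast path of this kind can skip materialization entirely.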

@shwina
Contributor

shwina commented Sep 14, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cc13060 into rapidsai:branch-21.10 Sep 14, 2021
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Sep 14, 2021