Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Improve rename performace for Lazy API #18890

Merged
merged 2 commits into from
Sep 24, 2024

Conversation

siddharth-vi
Copy link
Contributor

Closes #17336

For the python rename API, if we pass a dictionary arguments, we convert to LazyFrame and internally call the rename operation; the issue is that we need to get index number for all the columns we are renaming; this is an O(n^2) operation.
If we pass a function as argument to rename, we convert to LazyFrame and internally call the select operation; here we get the index for each column from the Schema, which is an IndexMap and hence overall the operation becomes O(n).
Have tried to fix this issue in a similar fashion by constructing the schema first and then getting index from it.

After the fix :

import polars as pl
df = pl.DataFrame({f"column_{i}": ['x', 'y', 'z'] for i in range(10000)})
cols_map = {f"column_{i}": f"renamed_{i}" for i in range(10000)}
start = time.perf_counter()
_ = df.rename(cols_map)
end = time.perf_counter()
print(f"Time for rename(): {end - start} | Version: {pl.__version__}")

Time for rename(): 0.041061488911509514 | Version: 1.7.1

df2 = df.rename(lambda col: cols_map.get(col, col))
end = time.perf_counter()
print(f"Time for rename(): {end - start} | Version: {pl.__version__}")

Time for rename(): 0.06575666600838304 | Version: 1.7.1

In the Eager API for Rust, this was fixed in #1028 by implementing a hash map.

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Sep 24, 2024
Copy link

codecov bot commented Sep 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.87%. Comparing base (dddf0b7) to head (314929a).
Report is 75 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18890      +/-   ##
==========================================
- Coverage   79.88%   79.87%   -0.02%     
==========================================
  Files        1513     1520       +7     
  Lines      203466   206558    +3092     
  Branches     2892     2906      +14     
==========================================
+ Hits       162546   164985    +2439     
- Misses      40372    41025     +653     
  Partials      548      548              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@coastalwhite
Copy link
Collaborator

LGTM. If you could quickly fix the clippy error, I feel like this is quite okay to merge.

@siddharth-vi siddharth-vi marked this pull request as ready for review September 24, 2024 14:10
@ritchie46 ritchie46 merged commit 0f4360b into pola-rs:main Sep 24, 2024
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame rename method is extremely slow for dictionary argument
3 participants