Refactor and optimize Frame.where (NVIDIA#11168)
This PR is a substantial refactoring of `Frame.where`. It removes many dead code paths, excises numerous unnecessary copies, and simplifies and consolidates various parts of the logic. It also splits up parts of the implementation into the specific Frame classes for which they are used. Prior to this PR, all the code was contained in a single function that essentially had completely independent code paths for DataFrame vs SingleColumnFrame. Splitting these into methods of the appropriate classes also makes mypy much happier.

The resulting code is significantly faster. I'll post more benchmarks soon, but we see improvements of 20% to 70%, even for reasonable data sizes (e.g. 1 million rows). You have to go past 10 million rows before the improvements are washed out by the sheer volume of computation.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai/cudf#11168
vyasr authored Jul 4, 2022
1 parent d8f5e46 commit 9e08c73
Showing 8 changed files with 228 additions and 364 deletions.
6 changes: 6 additions & 0 deletions python/cudf/benchmarks/API/bench_dataframe.py
@@ -6,6 +6,7 @@

import numpy
import pytest
import pytest_cases
from config import cudf, cupy
from utils import benchmark_with_object

@@ -115,3 +116,8 @@ def bench_sort_values(benchmark, dataframe, num_cols_to_sort):
def bench_nsmallest(benchmark, dataframe, num_cols_to_sort, n):
    by = list(dataframe.columns[:num_cols_to_sort])
    benchmark(dataframe.nsmallest, n, by)


@pytest_cases.parametrize_with_cases("dataframe, cond, other", prefix="where")
def bench_where(benchmark, dataframe, cond, other):
    benchmark(dataframe.where, cond, other)
14 changes: 14 additions & 0 deletions python/cudf/benchmarks/API/bench_dataframe_cases.py
@@ -0,0 +1,14 @@
# Copyright (c) 2022, NVIDIA CORPORATION.

from utils import benchmark_with_object


@benchmark_with_object(cls="dataframe", dtype="int", nulls=False)
def where_case_1(dataframe):
    return dataframe, dataframe % 2 == 0, 0


@benchmark_with_object(cls="dataframe", dtype="int", nulls=False)
def where_case_2(dataframe):
    cond = dataframe[dataframe.columns[0]] % 2 == 0
    return dataframe, cond, 0
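
For readers unfamiliar with the operation being benchmarked, here is a minimal sketch of the element-wise semantics that `where_case_1` exercises: values are kept where the mask is true and replaced by `other` (here `0`) elsewhere. It is a pure-Python stand-in over a dict of lists, not cudf code; the helper `where` and the sample `frame` are hypothetical names chosen for illustration, and the sketch does not cover how a single-column mask such as the one in `where_case_2` is applied.

```python
# Pure-Python stand-in for the element-wise semantics of `where`.
# Illustrative only: this is not cudf's implementation, and the
# dict-of-lists "frame" is an assumption made to keep the sketch
# runnable without a GPU.


def where(frame, cond, other):
    """Keep frame[col][i] where cond[col][i] is true, else use `other`."""
    return {
        col: [value if keep else other for value, keep in zip(values, cond[col])]
        for col, values in frame.items()
    }


frame = {"a": [1, 2, 3, 4], "b": [10, 11, 12, 13]}

# Same mask shape as where_case_1: one boolean per element (value is even).
cond = {col: [value % 2 == 0 for value in values] for col, values in frame.items()}

result = where(frame, cond, 0)
print(result)  # {'a': [0, 2, 0, 4], 'b': [10, 0, 12, 0]}
```

Odd values are replaced by `0` and even values pass through unchanged, which is the work each `bench_where` iteration asks `dataframe.where` to do for the first case.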