[BUG] Explicit-comms shuffle does not obey "_partitions" column #1239

Closed
rjzamora opened this issue Sep 26, 2023 · 1 comment · Fixed by #1240

rjzamora commented Sep 26, 2023

While debugging a data-curation workflow, I discovered that the explicit-comms shuffle has a subtle bug in the logic used to assign data to the final partitions when "_partitions" is specified. For example:

from dask_cuda import LocalCUDACluster
from distributed import Client
import dask.dataframe as dd
from dask.datasets import timeseries

from dask_cuda.explicit_comms.dataframe.shuffle import (
    shuffle as explicit_comms_shuffle,
)

client = Client(LocalCUDACluster(n_workers=2))

ddf = timeseries().reset_index(drop=True).to_backend("cudf")
ddf["_partitions"] = 0

result = explicit_comms_shuffle(ddf, ["_partitions"])
result.partitions[0].compute()

Since every row has the value 0 in the "_partitions" column, all data should be moved to partition 0 after the shuffle. However, I get an empty DataFrame when I execute this:

Empty DataFrame
Columns: [name, id, x, y, _partitions]
Index: []
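
To make the expected behavior concrete, here is a minimal pure-Python sketch of the semantics I would expect from a "_partitions"-based shuffle. The helper `expected_partitions` and the row format are hypothetical and only for illustration; this is not dask-cuda code.

```python
def expected_partitions(rows, npartitions):
    """Reference semantics (illustrative only): each row lands in the
    output partition whose ID is stored in its "_partitions" field."""
    out = {part_id: [] for part_id in range(npartitions)}
    for row in rows:
        out[row["_partitions"]].append(row)
    return out


# Mirror the reproducer: every row is assigned to partition 0.
rows = [{"x": i, "_partitions": 0} for i in range(6)]
parts = expected_partitions(rows, npartitions=2)

# Partition 0 should hold all rows and partition 1 should be empty.
print(len(parts[0]), len(parts[1]))
```

Under these semantics, `result.partitions[0]` in the reproducer should contain every row rather than none.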

As far as I can tell, the cause is that `shuffle_result[rank]` is not in the same order as `rank_to_out_part_ids[rank]` in the `client.submit` loop (the order is reversed), so each output partition ID is paired with the wrong data.
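
A toy model of how such an ordering mismatch would produce the symptom above. The variable names echo the ones in this issue, but the data and the pairing logic are assumptions made for illustration, not the actual dask-cuda implementation.

```python
# Toy model only: two output partitions owned by a single rank.
out_part_ids = [0, 1]  # stand-in for rank_to_out_part_ids[rank]

# What each output partition should contain after the shuffle.
correct = {0: ["row-a", "row-b", "row-c"], 1: []}

# Assumption: the task returns its partitions as a plain list,
# and that list happens to be in reversed order.
shuffle_result = [correct[part_id] for part_id in reversed(out_part_ids)]

# Pairing IDs with results by position silently mislabels the data.
paired = dict(zip(out_part_ids, shuffle_result))

print(paired[0])  # partition 0 comes back empty
print(paired[1])  # partition 1 holds the rows meant for partition 0
```

No data is lost in this model; it simply ends up under the wrong partition ID, which matches the empty `partitions[0]` in the reproducer.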

rjzamora added the bug label Sep 26, 2023
madsbk self-assigned this Sep 26, 2023

madsbk commented Sep 26, 2023

Good catch @rjzamora !

rapids-bot bot pushed a commit that referenced this issue Sep 26, 2023
`shuffle_task()` now returns a dict mapping partition IDs to dataframes

Fixes #1239

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1240
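
A rough sketch of why returning a dict keyed by partition ID removes the dependence on ordering. `shuffle_task_sketch` is a hypothetical stand-in written for this example; it does not reproduce the signature or body of the real `shuffle_task()`.

```python
def shuffle_task_sketch(out_part_ids, data):
    """Hypothetical stand-in: return {partition ID: data} so the caller
    never has to rely on the order in which partitions were produced."""
    # Iterate in reversed order on purpose to show that order is irrelevant.
    return {part_id: data[part_id] for part_id in reversed(out_part_ids)}


out_part_ids = [0, 1]
data = {0: ["row-a", "row-b", "row-c"], 1: []}

result = shuffle_task_sketch(out_part_ids, data)

# The caller looks each partition up by ID instead of by position.
final = [result[part_id] for part_id in out_part_ids]

print(final[0])  # rows assigned to partition 0
print(final[1])  # empty
```

Looking results up by key means that the production order inside the task can change without affecting which data ends up in which output partition.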