Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose pack/unpack API to Python #8153

Merged
merged 44 commits into from
Jun 30, 2021

Conversation

charlesbluca
Copy link
Member

@charlesbluca charlesbluca commented May 4, 2021

Closes #7601

Adds a Python API for pack/unpack, so that we might be able to pack/unpack DataFrames in serialization:

  • PackedColumns is a Python representation of the cudf::packed_columns struct containing the struct itself along with some Python metadata for the DataFrame being packed; supports Dask/pickle serialization
  • pack() takes in a Table and returns a PackedColumns
  • unpack() takes in a PackedColumns and returns a Table

cc @brandon-b-miller

@charlesbluca charlesbluca requested review from a team as code owners May 4, 2021 15:49
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels May 4, 2021
@charlesbluca charlesbluca marked this pull request as draft May 4, 2021 16:35
@codecov
Copy link

codecov bot commented May 4, 2021

Codecov Report

Merging #8153 (82c6aba) into branch-21.08 (1e53776) will increase coverage by 0.00%.
The diff coverage is 0.00%.

❗ Current head 82c6aba differs from pull request most recent head 3fa48c1. Consider uploading reports for the commit 3fa48c1 to get more accurate results
Impacted file tree graph

@@              Coverage Diff               @@
##           branch-21.08    #8153    +/-   ##
==============================================
  Coverage         10.62%   10.62%            
==============================================
  Files               109      109            
  Lines             18246    18627   +381     
==============================================
+ Hits               1938     1980    +42     
- Misses            16308    16647   +339     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/decimal.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <ø> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/window/rolling.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
... and 51 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e53776...3fa48c1. Read the comment docs.

@jakirkham jakirkham added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 12, 2021
@jakirkham
Copy link
Member

Yep that makes sense. Figure we should test directly in case we change how pickling works later (not that I expect we will just want to make sure we have coverage in that event)

@jakirkham
Copy link
Member

One last question, so IIUC we are simply adding the utilities for pack/unpack now, but not using them yet. Is that right?

Asking based on a discussion Ashwin and I had a while back about making pack/unpack configureable ( #5311 )

@charlesbluca
Copy link
Member Author

Yeah, this PR doesn't do anything to change how dataframes are currently serialized - I imagined that the task of adding pack/unpack as an option for serialization would probably bring up a larger conversation about how we want to make cuDF configurable, which I thought would be better for a follow up PR

@jakirkham
Copy link
Member

Yep that makes a lot of sense. Thanks for working on this btw 🙂

@charlesbluca
Copy link
Member Author

rerun tests

@charlesbluca
Copy link
Member Author

Not sure why this is failing on mypy in the style checks, seems to pass locally

@shwina
Copy link
Contributor

shwina commented Jun 23, 2021

The style checks should resolve with #8595.

@charlesbluca
Copy link
Member Author

rerun tests

1 similar comment
@jakirkham
Copy link
Member

rerun tests

@charlesbluca
Copy link
Member Author

rerun tests

1 similar comment
@jakirkham
Copy link
Member

rerun tests

@jakirkham
Copy link
Member

@gpucibot merge

@jakirkham
Copy link
Member

It appears we are seeing a testing failure related to this change

20:43:20 ___________________ ERROR collecting cudf/tests/test_pack.py ___________________
20:43:20 ImportError while importing test module '/workspace/python/cudf/cudf/tests/test_pack.py'.
20:43:20 Hint: make sure your test modules/packages have valid Python names.
20:43:20 Traceback:
20:43:20 /opt/conda/envs/rapids/lib/python3.7/importlib/__init__.py:127: in import_module
20:43:20     return _bootstrap._gcd_import(name[level:], package, level)
20:43:20 cudf/tests/test_pack.py:23: in <module>
20:43:20     from cudf.tests.utils import assert_eq
20:43:20 E   ModuleNotFoundError: No module named 'cudf.tests'

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=centos7,PYTHON=3.7/2326/console

@charlesbluca
Copy link
Member Author

Yup it looks like assert_eq was relocated to cudf.testing._utils

@charlesbluca
Copy link
Member Author

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Python bindings to pack/unpack
8 participants