feat: Add support for binary size method to Expr and Series "bin" namespace #17924

Merged: 4 commits, Aug 1, 2024

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Jul 29, 2024

Feature

  • We didn't have a way to get the size of Binary elements; this PR adds size to the "bin" namespace for Expr and Series.

  • Behaves like the frame-level estimated_size: the Rust side returns the size as integer bytes, and the Python side optionally accepts a size unit to return a scaled float value in kb/mb/gb.
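
The unit scaling described above can be sketched in plain Python. This is a minimal illustration only, assuming 1024-based units (which matches the example output in this PR, where 512 bytes is reported as 0.5 kb). The helper name `scale_bytes_sketch` and the `_UNIT_FACTORS` table are hypothetical and are not claimed to be the Polars implementation.

```python
from typing import Union

# Assumption: units are 1024-based, as suggested by the PR's example output.
_UNIT_FACTORS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def scale_bytes_sketch(n_bytes: int, unit: str = "b") -> Union[int, float]:
    """Return the byte count unchanged for "b", else a scaled float."""
    unit = unit.lower()
    if unit not in _UNIT_FACTORS:
        raise ValueError(f"unsupported unit: {unit!r}")
    if unit == "b":
        return n_bytes  # integer bytes, no scaling applied
    return n_bytes / _UNIT_FACTORS[unit]


# Byte strings with the same lengths as the blobs in the PR's first example.
sizes = [len(blob) for blob in (bytes(512), bytes(256), bytes(2560), bytes(1024))]
scaled = [scale_bytes_sketch(n, "kb") for n in sizes]
print(sizes)   # [512, 256, 2560, 1024]
print(scaled)  # [0.5, 0.25, 2.5, 1.0]
```

Returning the raw integer for the default unit and a float otherwise mirrors the split described above: an integer byte count by default, with scaling applied only when a unit is requested.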

Example

Get the size of individual binary elements:

from os import urandom
import polars as pl

df = pl.DataFrame({"data": [urandom(n) for n in (512, 256, 2560, 1024)]})
df.with_columns(
    n_bytes=pl.col("data").bin.size(),
    n_kilobytes=pl.col("data").bin.size("kb"),
)
# shape: (4, 3)
# ┌─────────────────────────────────┬─────────┬─────────────┐
# │ data                            ┆ n_bytes ┆ n_kilobytes │
# │ ---                             ┆ ---     ┆ ---         │
# │ binary                          ┆ u32     ┆ f64         │
# ╞═════════════════════════════════╪═════════╪═════════════╡
# │ b"\xb7\xa7\xba\x92\xa0\xe1\x0d… ┆ 512     ┆ 0.5         │
# │ b"z\xa4\x9d\xad\x1c\xc1\x10\xa… ┆ 256     ┆ 0.25        │
# │ b"A\xdd\x94\x12Q\x85\xbc\x14\x… ┆ 2560    ┆ 2.5         │
# │ b"\x92n\xf8Ow\x11h\x8b\xd2\xd6… ┆ 1024    ┆ 1.0         │
# └─────────────────────────────────┴─────────┴─────────────┘

Check size of the total binary payload for the frame:

from os import urandom
import polars as pl
import polars.selectors as cs

df = pl.DataFrame({
    "b1": [urandom(n) for n in (512, 256, 2560, 1024)],
    "b2": [urandom(n) for n in (8096, 2048, 64, 32)],
})
df.select(
    sz_kb=pl.sum_horizontal(cs.binary().bin.size("kb").sum())
)
# shape: (1, 1)
# ┌───────┐
# │ sz_kb │
# │ ---   │
# │ f64   │
# ╞═══════╡
# │ 14.25 │
# └───────┘
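
As a quick cross-check of the 14.25 figure, the same total can be computed by hand from the element lengths used in the example. This sketch is plain Python arithmetic and assumes the "kb" unit divides by 1024; the variable names are illustrative.

```python
# Element lengths taken from the "b1" and "b2" columns in the example.
b1_sizes = (512, 256, 2560, 1024)
b2_sizes = (8096, 2048, 64, 32)

# Total payload in bytes, then scaled to kb (assuming 1 kb = 1024 bytes).
total_bytes = sum(b1_sizes) + sum(b2_sizes)
total_kb = total_bytes / 1024

print(total_bytes)  # 14592
print(total_kb)     # 14.25
```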

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jul 29, 2024
@alexander-beedie alexander-beedie force-pushed the bin-size-expr branch 2 times, most recently from 1706c12 to 461f1d5 Compare July 29, 2024 05:25
codecov bot commented Jul 29, 2024

Codecov Report

Attention: Patch coverage is 96.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.35%. Comparing base (fae85ff) to head (90ff20e).
Report is 1 commit behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| crates/polars-plan/src/dsl/function_expr/binary.rs | 85.71% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17924      +/-   ##
==========================================
+ Coverage   80.34%   80.35%   +0.01%     
==========================================
  Files        1492     1495       +3     
  Lines      196303   196480     +177     
  Branches     2813     2817       +4     
==========================================
+ Hits       157724   157891     +167     
- Misses      38058    38069      +11     
+ Partials      521      520       -1     


@stinodego
Member

Seems useful to have. Just commenting to point out that the equivalent method in the str namespace is called len_bytes. Not sure that's better than size here though.

@alexander-beedie
Collaborator Author

alexander-beedie commented Jul 29, 2024

> Seems useful to have. Just commenting to point out that the equivalent method in the str namespace is called len_bytes. Not sure that's better than size here though.

I thought about it, but binary blobs are likely to be payloads of some kind, which you would more naturally think of as having a certain size, rather than a length (and you definitely wouldn't think of length in terms of kilo/mega/giga bytes) ;)

eg: this is a bit odd -

"how long is my image data?"
"its length is about 64kb"

vs:

"what size is my image data?"
"its size is about 64kb"

@alexander-beedie
Collaborator Author

alexander-beedie commented Jul 29, 2024

Looks like I might be missing a feature gate somewhere - will have a look after lunch 🤔

Update: found & fixed.

@alexander-beedie
Collaborator Author

Rebased 👌

@ritchie46 ritchie46 merged commit a7a6461 into pola-rs:main Aug 1, 2024
27 checks passed
@alexander-beedie alexander-beedie deleted the bin-size-expr branch August 1, 2024 07:54