Simple arithmetic operations on the "list" type columns #8006

Closed
mkleinbort-ic opened this issue Apr 5, 2023 · 7 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@mkleinbort-ic

Problem description

It'd be nice for this code to work:

import polars as pl

df = pl.DataFrame({
    'x1': [[1,2],[3,4]],
    'x2': [10,20]
})

df.with_columns(scaled_x1 = pl.col('x1')/pl.col('x2'))

# Desired output: 

shape: (2, 3)
┌───────────┬─────┬─────────────┐
│ x1        ┆ x2  ┆ scaled_x1   │
│ ---       ┆ --- ┆ ---         │
│ list[i64] ┆ i64 ┆ list[f64]   │
╞═══════════╪═════╪═════════════╡
│ [1, 2]    ┆ 10  ┆ [0.1, 0.2]  │
│ [3, 4]    ┆ 20  ┆ [0.15, 0.2] │
└───────────┴─────┴─────────────┘

The idea here is that x1 is a column of lists, and I want to divide each element of every list by the value of pl.col('x2') in the same row.

It'd be nice to support the basic arithmetic operations from numpy: +, -, *, /, %
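
To make the requested row-wise semantics concrete, here is a minimal plain-Python sketch. The helper `list_op_scalar` and the `OPS` table are hypothetical names used only for illustration; they are not part of Polars, and this only models the behaviour asked for above.

```python
import operator

# Hypothetical lookup table for the five requested operators (not a Polars API).
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "%": operator.mod,
}


def list_op_scalar(lists, scalars, op):
    """Apply `op` between every element of each row's list and that row's scalar."""
    fn = OPS[op]
    return [[fn(item, s) for item in row] for row, s in zip(lists, scalars)]


x1 = [[1, 2], [3, 4]]
x2 = [10, 20]

scaled_x1 = list_op_scalar(x1, x2, "/")
print(scaled_x1)  # [[0.1, 0.2], [0.15, 0.2]]
```

Each row's scalar is applied to every element of that row's list, which matches the desired output table above.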

@mkleinbort-ic mkleinbort-ic added the enhancement New feature or an improvement of an existing feature label Apr 5, 2023
@lunarspectrum

This is a much-needed feature.

It seems that the expected syntax would be

df.with_columns(pl.col("x1").arr.eval(pl.element() / pl.col("x2")).alias("scaled_x1"))

A clunky way to get to the desired result can currently be accomplished by

pl.concat(
    [
        _df.with_columns(
            pl.col("x1").arr.eval(pl.element() / x2).alias("scaled_x1")
        )
        for x2, _df in df.groupby("x2")
    ]
)

@tim-x-y-z
Contributor

tim-x-y-z commented Jul 18, 2023

If you are okay using numpy, this is another way to do it:

import numpy as np
df.with_columns(scaled_x1 = pl.struct(["x1", "x2"]).apply(lambda x: np.array(x["x1"]) / x["x2"]))

@itamarst
Contributor

I have implemented a working prototype (see branch 8006-list-arithmetic-part-2 in my fork), based on the work in #17823.

@itamarst
Contributor

itamarst commented Sep 19, 2024

Remaining work:

  • Existing list arithmetic still works
  • Support all numeric dtypes and all 5 arithmetic operations
  • RHS is primitive numeric, LHS is list (addition and multiplication only)
  • Test basic operations
  • Test nested lists
  • Test nulls (in the primitive numeric series)
  • Test nulls (list items)
  • Test error cases
  • Test RHS is primitive numeric, LHS is list (addition and multiplication only)
  • File follow-up issue for dates etc (dtypes where physical is numeric but logical is not)
  • File follow-up issue for scalars (if #18858, "fix: Properly broadcast list arithmetic", is merged first, this might be easy enough to do in this branch; we'll see)
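
The test categories in the list above (basic operations, nested lists, nulls in the numeric series, nulls in list items, error cases) could be checked against a small reference model like the sketch below. This is plain Python, not the code from the branch. The names `broadcast` and `column_op` are hypothetical, and the null-propagation and length-mismatch behaviour shown here are assumptions about what such tests might expect, not confirmed Polars semantics.

```python
import operator


def broadcast(op, value, scalar):
    """Hypothetical reference model: recurse into nested lists, propagate None."""
    if value is None or scalar is None:
        # Assumption: a null on either side yields a null result.
        return None
    if isinstance(value, list):
        return [broadcast(op, item, scalar) for item in value]
    return op(value, scalar)


def column_op(op, lists, scalars):
    """Apply `op` row by row; mismatched column lengths are treated as an error."""
    if len(lists) != len(scalars):
        raise ValueError("columns must have the same length")
    return [broadcast(op, row, s) for row, s in zip(lists, scalars)]


# Basic operation
assert column_op(operator.mul, [[1, 2], [3, 4]], [10, 20]) == [[10, 20], [60, 80]]
# Nested lists
assert column_op(operator.add, [[[1], [2, 3]]], [1]) == [[[2], [3, 4]]]
# Null in the primitive numeric column
assert column_op(operator.sub, [[1, 2], [3]], [None, 1]) == [None, [2]]
# Null among the list items
assert column_op(operator.add, [[1, None]], [5]) == [[6, None]]
# Error case: length mismatch
try:
    column_op(operator.add, [[1]], [1, 2])
    raised = False
except ValueError:
    raised = True
assert raised
```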

@itamarst
Contributor

Have made decent progress on making this work.

@itamarst
Contributor

The PR will also close #14711.

@cmdlineluser
Contributor

This has been added in #19162.

df.with_columns(scaled_x1 = pl.col.x1 / pl.col.x2)
# shape: (2, 3)
# ┌───────────┬─────┬─────────────┐
# │ x1        ┆ x2  ┆ scaled_x1   │
# │ ---       ┆ --- ┆ ---         │
# │ list[i64] ┆ i64 ┆ list[f64]   │
# ╞═══════════╪═════╪═════════════╡
# │ [1, 2]    ┆ 10  ┆ [0.1, 0.2]  │
# │ [3, 4]    ┆ 20  ┆ [0.15, 0.2] │
# └───────────┴─────┴─────────────┘

7 participants