
map_elements replaces all array elements with nulls #17873

Closed
2 tasks done
postatum opened this issue Jul 25, 2024 · 4 comments
Labels
bug Something isn't working P-low Priority: low python Related to Python Polars

Comments


postatum commented Jul 25, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df1 = pl.DataFrame({"arrays": [[1, 2], [3, 4]], "numbers": [None, 1000.0]})

def custom_map(x):
    def multiply(operand_1, operand_2):
        if operand_1 is None or operand_2 is None:
            return None
        return operand_1 * operand_2

    return [multiply(array_el, x["number"]) for array_el in x["array"]]


df1 = df1.with_columns(tmp_struct=pl.struct(array=pl.col("arrays"), number=pl.col("numbers")))
df1 = df1.with_columns(array_multi=pl.col("tmp_struct").map_elements(custom_map, return_dtype=pl.List(pl.Float64)))
print(df1)

"""
shape: (2, 4)
┌───────────┬─────────┬─────────────────┬──────────────┐
│ arrays    ┆ numbers ┆ tmp_struct      ┆ array_multi  │
│ ---       ┆ ---     ┆ ---             ┆ ---          │
│ list[i64] ┆ f64     ┆ struct[2]       ┆ list[f64]    │
╞═══════════╪═════════╪═════════════════╪══════════════╡
│ [1, 2]    ┆ null    ┆ {[1, 2],null}   ┆ [null, null] │
│ [3, 4]    ┆ 1000.0  ┆ {[3, 4],1000.0} ┆ [null, null] │
└───────────┴─────────┴─────────────────┴──────────────┘
"""

Log output

None

Issue description

When applying a UDF to a struct column where the UDF produces lists, if any produced list consists entirely of Nones, the valid results in all other rows are also replaced with Nones. A workaround is to return float('nan') instead of None.

Expected behavior

I expect the dataframe to look like this:

shape: (2, 4)
┌───────────┬─────────┬─────────────────┬──────────────────┐
│ arrays    ┆ numbers ┆ tmp_struct      ┆ array_multi      │
│ ---       ┆ ---     ┆ ---             ┆ ---              │
│ list[i64] ┆ f64     ┆ struct[2]       ┆ list[f64]        │
╞═══════════╪═════════╪═════════════════╪══════════════════╡
│ [1, 2]    ┆ null    ┆ {[1, 2],null}   ┆ [null, null]     │
│ [3, 4]    ┆ 1000.0  ┆ {[3, 4],1000.0} ┆ [3000.0, 4000.0] │
└───────────┴─────────┴─────────────────┴──────────────────┘

Installed versions

--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             macOS-14.5-x86_64-i386-64bit
Python:               3.10.12 (main, Mar  4 2024, 12:35:22) [Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.23.5
openpyxl:             <not installed>
pandas:               1.5.3
pyarrow:              8.0.0
pydantic:             1.10.15
pyiceberg:            <not installed>
sqlalchemy:           1.4.37
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           3.2.0
@postatum postatum added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 25, 2024
@cmdlineluser
Contributor

Can reproduce.

It seems like it also happens with non-struct columns.

It also seems to be specific to the first return value being null?

pl.DataFrame({"a": [1, 2, 3]}).with_columns(pl.all().map_elements(
    {1: [None], 2: [9.0], 3: [10.0]}.get,
    return_dtype=pl.List(pl.Float64)
))
# shape: (3, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[f64] │
# ╞═══════════╡
# │ [null]    │
# │ [null]    │ # ???
# │ [null]    │ # ???
# └───────────┘
pl.DataFrame({"a": [1, 2, 3]}).with_columns(pl.all().map_elements(
    {1: [9.0], 2: [None], 3: [10.0]}.get,
    return_dtype=pl.List(pl.Float64)
))
# shape: (3, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[f64] │
# ╞═══════════╡
# │ [9.0]     │
# │ [null]    │
# │ [10.0]    │
# └───────────┘

@deanm0000
Collaborator

A much more performant workaround (assuming your arrays column is truly fixed-width and you don't mind using numpy) would be:

def custom_map(x):
    width = x.struct[0].list.len().max()
    s1 = x.struct[0].list.to_array(width).to_numpy()  # shape (n_rows, width)
    # reshape to a column vector so the multiply broadcasts row-wise
    s2 = x.struct[1].to_numpy().reshape(-1, 1)        # shape (n_rows, 1)
    return pl.Series(s1 * s2)

df1.with_columns(array_multi=pl.struct('arrays','numbers').map_batches(custom_map))

Alternatively, here is an all-Polars approach (no map_*) that makes no assumptions, at slightly worse performance than numpy:

(
    df1
    .with_row_index('i')
    .explode('arrays')
    .group_by('i', maintain_order=True)
    .agg(
        'arrays',
        pl.col('numbers').first(),
        array_multi=pl.col('arrays') * pl.col('numbers'),
    )
    .drop('i')
)

And yet another way (this one again assumes the 'arrays' are fixed-width), which is the most performant of the three:

array_width = df1['arrays'].list.len().max()
(
    df1
    .with_columns(
        array_multi=(
            pl.col('arrays').list.explode() * pl.col('numbers').repeat_by(array_width).explode()
        ).reshape((df1.shape[0], array_width))
    )
)

@deanm0000 deanm0000 added P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Aug 15, 2024
@cmdlineluser
Contributor

So it appears this is actually due to skip_nulls=True (the default):

df1.with_columns(
    array_multi=pl.col("tmp_struct").map_elements(
        custom_map,
        return_dtype=pl.List(pl.Float64),
        skip_nulls=False,
    )
)

# shape: (2, 4)
# ┌───────────┬─────────┬─────────────────┬──────────────────┐
# │ arrays    ┆ numbers ┆ tmp_struct      ┆ array_multi      │
# │ ---       ┆ ---     ┆ ---             ┆ ---              │
# │ list[i64] ┆ f64     ┆ struct[2]       ┆ list[f64]        │
# ╞═══════════╪═════════╪═════════════════╪══════════════════╡
# │ [1, 2]    ┆ null    ┆ {[1, 2],null}   ┆ [null, null]     │
# │ [3, 4]    ┆ 1000.0  ┆ {[3, 4],1000.0} ┆ [3000.0, 4000.0] │
# └───────────┴─────────┴─────────────────┴──────────────────┘

@cmdlineluser
Contributor

This is now fixed on main.
