
map_elements replaces all array elements with nulls #17873

Closed
2 tasks done
postatum opened this issue Jul 25, 2024 · 4 comments
Labels
bug Something isn't working P-low Priority: low python Related to Python Polars

Comments


postatum commented Jul 25, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df1 = pl.DataFrame({"arrays": [[1, 2], [3, 4]], "numbers": [None, 1000.0]})

def custom_map(x):
    def multiply(operand_1, operand_2):
        if operand_1 is None or operand_2 is None:
            return None
        return operand_1 * operand_2

    return [multiply(array_el, x["number"]) for array_el in x["array"]]


df1 = df1.with_columns(tmp_struct=pl.struct(array=pl.col("arrays"), number=pl.col("numbers")))
df1 = df1.with_columns(array_multi=pl.col("tmp_struct").map_elements(custom_map, return_dtype=pl.List(pl.Float64)))
print(df1)

"""
shape: (2, 4)
┌───────────┬─────────┬─────────────────┬──────────────┐
│ arrays    ┆ numbers ┆ tmp_struct      ┆ array_multi  │
│ ---       ┆ ---     ┆ ---             ┆ ---          │
│ list[i64] ┆ f64     ┆ struct[2]       ┆ list[f64]    │
╞═══════════╪═════════╪═════════════════╪══════════════╡
│ [1, 2]    ┆ null    ┆ {[1, 2],null}   ┆ [null, null] │
│ [3, 4]    ┆ 1000.0  ┆ {[3, 4],1000.0} ┆ [null, null] │
└───────────┴─────────┴─────────────────┴──────────────┘
"""

Log output

None

Issue description

When applying a UDF to a struct column where the UDF produces lists, if any produced list consists entirely of Nones, the valid results in all other rows are also replaced with Nones. A workaround is to return float('nan') instead of None.

Expected behavior

I expect the dataframe to look like this:

shape: (2, 4)
┌───────────┬─────────┬─────────────────┬──────────────────┐
│ arrays    ┆ numbers ┆ tmp_struct      ┆ array_multi      │
│ ---       ┆ ---     ┆ ---             ┆ ---              │
│ list[i64] ┆ f64     ┆ struct[2]       ┆ list[f64]        │
╞═══════════╪═════════╪═════════════════╪══════════════════╡
│ [1, 2]    ┆ null    ┆ {[1, 2],null}   ┆ [null, null]     │
│ [3, 4]    ┆ 1000.0  ┆ {[3, 4],1000.0} ┆ [3000.0, 4000.0] │
└───────────┴─────────┴─────────────────┴──────────────────┘

Installed versions

--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             macOS-14.5-x86_64-i386-64bit
Python:               3.10.12 (main, Mar  4 2024, 12:35:22) [Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.23.5
openpyxl:             <not installed>
pandas:               1.5.3
pyarrow:              8.0.0
pydantic:             1.10.15
pyiceberg:            <not installed>
sqlalchemy:           1.4.37
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           3.2.0
@postatum postatum added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 25, 2024
@cmdlineluser
Contributor

Can reproduce.

It seems like it also happens with non-struct columns.

It also seems to be specific to the first return value being null?

pl.DataFrame({"a": [1, 2, 3]}).with_columns(pl.all().map_elements(
    {1: [None], 2: [9.0], 3: [10.0]}.get,
    return_dtype=pl.List(pl.Float64)
))
# shape: (3, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[f64] │
# ╞═══════════╡
# │ [null]    │
# │ [null]    │ # ???
# │ [null]    │ # ???
# └───────────┘
pl.DataFrame({"a": [1, 2, 3]}).with_columns(pl.all().map_elements(
    {1: [9.0], 2: [None], 3: [10.0]}.get,
    return_dtype=pl.List(pl.Float64)
))
# shape: (3, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[f64] │
# ╞═══════════╡
# │ [9.0]     │
# │ [null]    │
# │ [10.0]    │
# └───────────┘

@deanm0000
Collaborator

A much more performant workaround (assuming your arrays column is truly fixed-width and you don't mind using numpy) would be:

def custom_map(x):
    width = x.struct[0].list.len().max()
    s1 = x.struct[0].list.to_array(width).to_numpy()  # shape (n_rows, width)
    # reshape to a column vector so the multiply broadcasts row-wise
    s2 = x.struct[1].to_numpy().reshape(-1, 1)        # shape (n_rows, 1)
    return pl.Series(s1 * s2)

df1.with_columns(array_multi=pl.struct('arrays','numbers').map_batches(custom_map))

Alternatively, here is an all-Polars approach (no map_*) that makes no assumptions, at slightly worse performance than numpy:

(
    df1
    .with_row_index('i')
    .explode('arrays')
    .group_by('i', maintain_order=True)
    .agg(
        'arrays',
        pl.col('numbers').first(),
        array_multi=pl.col('arrays') * pl.col('numbers'),
    )
    .drop('i')
)

And yet another way (this one again assumes the 'arrays' are fixed-width), which is the most performant of the three:

array_width = df1['arrays'].list.len().max()
(
    df1
    .with_columns(
        array_multi=(
            pl.col('arrays').list.explode() * pl.col('numbers').repeat_by(array_width).explode()
        ).reshape((df1.shape[0], array_width))
    )
)

@deanm0000 deanm0000 added P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Aug 15, 2024
@cmdlineluser
Contributor

So it appears this is actually due to skip_nulls=True (the default):

df1.with_columns(
    array_multi=pl.col("tmp_struct").map_elements(
        custom_map,
        return_dtype=pl.List(pl.Float64),
        skip_nulls=False,
    )
)

# shape: (2, 4)
# ┌───────────┬─────────┬─────────────────┬──────────────────┐
# │ arrays    ┆ numbers ┆ tmp_struct      ┆ array_multi      │
# │ ---       ┆ ---     ┆ ---             ┆ ---              │
# │ list[i64] ┆ f64     ┆ struct[2]       ┆ list[f64]        │
# ╞═══════════╪═════════╪═════════════════╪══════════════════╡
# │ [1, 2]    ┆ null    ┆ {[1, 2],null}   ┆ [null, null]     │
# │ [3, 4]    ┆ 1000.0  ┆ {[3, 4],1000.0} ┆ [3000.0, 4000.0] │
# └───────────┴─────────┴─────────────────┴──────────────────┘

@cmdlineluser
Contributor

This is now fixed on main.
