Usage of apply and exploding arrays in sequence causes memory to grow and script to crash #6880

Closed
2 tasks done
Zyell opened this issue Feb 14, 2023 · 2 comments · Fixed by #6890
Labels
bug Something isn't working python Related to Python Polars

Comments


Zyell commented Feb 14, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

While building a larger project, I uncovered a memory bug (OOM) triggered by a sequence of operations in both lazy and eager mode. It occurs when exploding array columns that contain Nones, although not every None triggers it. It seems to occur when the Nones come from a new column created using apply with a lambda or named function that returns a list and yields None for some rows, or from a join operation. If these null values are filtered out before calling explode on the column(s), there is no issue. If the nulls are not filtered out, the script appears to hang while memory grows until it consumes all available memory.

I confirmed this with the last few released versions on PyPI, the latest of which is 0.16.5, and on both Windows and Linux on separate machines. The reproducible example I provided fails consistently for me on both Windows and Linux. However, if I drop the creation of column 'd' and just join, it fails on Linux but not on Windows (where I was running a slightly older version of Polars, 0.15.16).

This may be related/similar to a closed issue: #4108

I will update this issue with any additional findings as I explore this more. Thanks!

Reproducible example

import polars as pl


df = pl.DataFrame([
    {'a': 1, 'b': ['1', '2', '4']},
    {'a': 2, 'b': ['1', '9768', '4']},
    {'a': 1, 'b': ['1', '23', '1']},
    {'a': 3, 'b': ['3', 'b', '56']},
])

df_b = pl.DataFrame([
    {'a': 1, 'c': ['1', '2', 'c']},
    {'a': 2, 'c': ['1', '23', 'c']},
])

exploded = (
    df
    .join(df_b, on=['a'], how='left')
    .with_columns([
        pl.struct(['b', 'c']).apply(lambda x: [e in x['b'] for e in x['c']]).alias('d')
    ])
    .drop(['b'])
    .explode(['c', 'd'])
)

Expected behavior

I would expect this to handle null cases by returning the Nones in the response (explode sometimes does this) or raising a shape error (which it does in other cases).
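
To make the first expectation concrete, here is a plain-Python sketch of the semantics I have in mind: a row whose list is None is passed through as a single row with None. The helper `explode_rows` is hypothetical and only illustrates the expected result; it is not how Polars implements explode.

```python
from typing import Any, Iterator


def explode_rows(rows: list[dict[str, Any]], column: str) -> Iterator[dict[str, Any]]:
    """Yield one row per list element; a None list yields one row with None."""
    for row in rows:
        values = row[column]
        if values is None:
            # Null list: keep the row and carry the None through.
            yield {**row, column: None}
            continue
        # Non-null list: one output row per element (an empty list yields none here).
        for value in values:
            yield {**row, column: value}


rows = [
    {"a": 1, "c": ["1", "2"]},
    {"a": 3, "c": None},
]
result = list(explode_rows(rows, "c"))
print(result)
```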

Installed versions

---Version info---
Polars: 0.16.5
Index type: UInt32
Platform: Linux-6.0.12-76060006-generic-x86_64-with-glibc2.35
Python: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
---Optional dependencies---
pyarrow: 11.0.0
pandas: 1.5.3
numpy: 1.24.2
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: <not installed>
@Zyell Zyell added bug Something isn't working python Related to Python Polars labels Feb 14, 2023
Member

ritchie46 commented Feb 15, 2023

The apply is a red herring.

Minimal example

import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 1, 3]
})

df_b = pl.DataFrame({
    "a": [1, 2],
    "c": [['1', '2', 'c'], ['1', '2', 'c']]
})

(
    df
    .join(df_b, on=['a'], how='left')
    .select(['c'])
    .explode('c')
)

Author

Zyell commented Feb 16, 2023

@ritchie46 Thank you for the fast fix! I will endeavor to filter out the red herrings better in future reports.
