
[python-package] Pandas memory overhead leads to OOM #6280

Open
mglowacki100 opened this issue Jan 18, 2024 · 4 comments
@mglowacki100

Description

I'm not sure whether this is a bug, expected behaviour, or a place for improvement, but I've noticed something odd: when I provide the data as a pandas DataFrame directly:

model = lgb.LGBMRegressor(
    n_estimators=5_000, 
    learning_rate=0.01, 
    max_depth=5, 
    num_leaves=2**5-1, 
    colsample_bytree=0.1,
    data_sample_strategy='GOSS',
    random_state=0,
)
model.fit(
    df[f_all], #provide pandas df directly
    df["target"]
)

it crashes the notebook with an OOM, but when I provide the same data converted to numpy with .values, there is still a memory peak, only a smaller one, so it doesn't lead to OOM:

model = lgb.LGBMRegressor(
    n_estimators=5_000, 
    learning_rate=0.01, 
    max_depth=5, 
    num_leaves=2**5-1, 
    colsample_bytree=0.1,
    data_sample_strategy='GOSS',
    random_state=0,
)
model.fit(
    df[f_all].values, #provide pandas dataframe converted to numpy
    df["target"]
)

In both cases memory surges at the initial stage of fitting, even before the log line [LightGBM] [Info] Start training from score ... appears.
One peculiar thing about my data is that all columns are of type int8.
I've also looked at https://lightgbm.readthedocs.io/en/latest/FAQ.html#when-running-lightgbm-on-a-large-dataset-my-computer-runs-out-of-ram but it doesn't help.
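
A rough way to observe the peak outside the notebook (a sketch only; it assumes a Linux runtime such as Colab, where resource.getrusage reports ru_maxrss in KiB, and that df and f_all are defined as above):

import resource

import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=5_000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2**5 - 1,
    colsample_bytree=0.1,
    data_sample_strategy='GOSS',
    random_state=0,
)
model.fit(df[f_all], df["target"])  # or df[f_all].values for the numpy variant

# ru_maxrss is the peak resident set size of this process so far (KiB on Linux)
peak_gib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2
print(f"peak RSS so far: {peak_gib:.2f} GiB")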

Reproducible example

Here is a Google Colab notebook with data:
https://colab.research.google.com/drive/1W4ohwLPP76rpvxRXJlziQ9NxZVtbrfbp?usp=sharing

Environment info

  • google colab
  • lightgbm 4.1.0
@jameslamb changed the title from "Pandas memory overhead leads to OOM" to "[python-package] Pandas memory overhead leads to OOM" on Jan 18, 2024
@jmoralez
Collaborator

Hey @mglowacki100, thanks for using LightGBM. The core library uses floats (either float32 or float64), so if you provide ints they need to be converted to float first, which creates a copy and also increases memory usage (from int8 to float32 you'd see a 4x increase). You can try casting your DataFrame to a floating type first and in that case no copies will be made when building the LightGBM dataset (so no memory peak). Here's an example:

# cast all of your columns to float32
df = df.astype('float32')
model.fit(
    df[f_all],
    df["target"]
)

Please let us know if this helps.
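
As a rough illustration of that 4x factor on its own, the footprint of an int8 frame versus its float32 copy can be compared directly (a sketch with synthetic data, not taken from the issue):

import numpy as np
import pandas as pd

# synthetic stand-in: 1M rows x 50 int8 feature columns
df_int8 = pd.DataFrame(np.random.randint(0, 5, size=(1_000_000, 50), dtype=np.int8))
print(df_int8.memory_usage(deep=True).sum() / 1024**2)                    # ~48 MiB as int8
print(df_int8.astype('float32').memory_usage(deep=True).sum() / 1024**2)  # ~191 MiB as float32 (4x)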

@mglowacki100
Author

Thanks @jmoralez for looking into this. Your explanation makes perfect sense, but I've checked df = df.astype('float32') and it doesn't fix the problem with pandas (still OOM); worse, after this type casting the numpy workaround also stops working (OOM).
Just out of curiosity: LightGBM introduced histogram binning, so how does it handle a feature that only takes the values 0, 1, 2, 3, 4? Is it also converted to float and "re-binned" in some sense?

@jmoralez
Collaborator

jmoralez commented Jan 25, 2024

Is your target column located in the middle of the dataframe? We have a test,

def test_no_copy_when_single_float_dtype_dataframe(dtype, feature_name):

which checks that the df is not copied when it's of a single float dtype, but that only covers the data, and I'm guessing it has to be contiguous. So maybe you can try X_df = df[f_all].astype('float32') instead and then provide that to fit.
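
Concretely, that suggestion would look something like this (a sketch; model, df, and f_all are the objects from the earlier snippets, and the printouts are just sanity checks):

# build a single-dtype float32 feature frame once and reuse it
X_df = df[f_all].astype('float32')

print(X_df.dtypes.unique())  # expect only float32
arr = X_df.to_numpy()
print(arr.dtype, arr.flags['C_CONTIGUOUS'], arr.flags['F_CONTIGUOUS'])  # layout of the backing array

model.fit(X_df, df["target"])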

@mglowacki100
Author

@jmoralez I've tried it this way:

model.fit(
    df[f_all].astype('float32'), 
    df["target"]
)

[screenshot: ver_C]

and now it works without OOM, but peak memory usage is higher than for

model.fit(
    df[f_all].values, 
    df["target"]
)

[screenshot: ver_B]
Here is a notebook with sample data: https://colab.research.google.com/drive/1W4ohwLPP76rpvxRXJlziQ9NxZVtbrfbp?usp=sharing
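
For comparison, the size of the two inputs handed to fit() can be checked directly (a sketch; it assumes all feature columns are int8 as described above and ignores whatever copies LightGBM makes internally):

X_int8 = df[f_all].values            # ver_B input: stays int8 until LightGBM converts it
X_f32 = df[f_all].astype('float32')  # ver_C input: float32 copy materialized up front
print(X_int8.nbytes / 1024**2, "MiB as int8 ndarray")
print(X_f32.to_numpy().nbytes / 1024**2, "MiB as float32 frame")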
