
[python-package] Pandas memory overhead leads to OOM #6280

Open
mglowacki100 opened this issue Jan 18, 2024 · 4 comments
@mglowacki100

Description

I'm not sure whether this is a bug, expected behaviour, or a place for improvement, but I've noticed something odd: when I provide the data as a pandas DataFrame directly:

model = lgb.LGBMRegressor(
    n_estimators=5_000, 
    learning_rate=0.01, 
    max_depth=5, 
    num_leaves=2**5-1, 
    colsample_bytree=0.1,
    data_sample_strategy='GOSS',
    random_state=0,
)
model.fit(
    df[f_all], #provide pandas df directly
    df["target"]
)

it crashes the notebook with an OOM, but when I provide the same data converted to numpy with .values, there is still a memory peak, only a smaller one, so it doesn't lead to OOM:

model = lgb.LGBMRegressor(
    n_estimators=5_000, 
    learning_rate=0.01, 
    max_depth=5, 
    num_leaves=2**5-1, 
    colsample_bytree=0.1,
    data_sample_strategy='GOSS',
    random_state=0,
)
model.fit(
    df[f_all].values, #provide pandas dataframe converted to numpy
    df["target"]
)

In both cases memory surges at the initial stage of fitting, even before the log line [LightGBM] [Info] Start training from score ... appears.
One peculiar thing about my data is that all columns are of type int8.
I've also looked at https://lightgbm.readthedocs.io/en/latest/FAQ.html#when-running-lightgbm-on-a-large-dataset-my-computer-runs-out-of-ram but it doesn't help.
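
A rough way to observe the peak outside the notebook (a sketch only; it assumes a Linux runtime such as Colab, where resource.getrusage reports ru_maxrss in KiB, and that df and f_all are defined as above):

import resource

import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=5_000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2**5 - 1,
    colsample_bytree=0.1,
    data_sample_strategy='GOSS',
    random_state=0,
)
model.fit(df[f_all], df["target"])  # or df[f_all].values for the numpy variant

# ru_maxrss is the peak resident set size of this process so far (KiB on Linux)
peak_gib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2
print(f"peak RSS so far: {peak_gib:.2f} GiB")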

Reproducible example

Here is a Google Colab notebook with data:
https://colab.research.google.com/drive/1W4ohwLPP76rpvxRXJlziQ9NxZVtbrfbp?usp=sharing

Environment info

  • google colab
  • lightgbm 4.1.0
@jameslamb changed the title from "Pandas memory overhead leads to OOM" to "[python-package] Pandas memory overhead leads to OOM" on Jan 18, 2024
@jmoralez
Collaborator

Hey @mglowacki100, thanks for using LightGBM. The core library uses floats (either float32 or float64), so if you provide ints they need to be converted to float first, which creates a copy and also increases memory usage (from int8 to float32 you'd see a 4x increase). You can try casting your DataFrame to a floating type first and in that case no copies will be made when building the LightGBM dataset (so no memory peak). Here's an example:

# cast all of your columns to float32
df = df.astype('float32')
model.fit(
    df[f_all],
    df["target"]
)

Please let us know if this helps.
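
As a rough illustration of that 4x factor on its own, the footprint of an int8 frame versus its float32 copy can be compared directly (a sketch with synthetic data, not taken from the issue):

import numpy as np
import pandas as pd

# synthetic stand-in: 1M rows x 50 int8 feature columns
df_int8 = pd.DataFrame(np.random.randint(0, 5, size=(1_000_000, 50), dtype=np.int8))
print(df_int8.memory_usage(deep=True).sum() / 1024**2)                    # ~48 MiB as int8
print(df_int8.astype('float32').memory_usage(deep=True).sum() / 1024**2)  # ~191 MiB as float32 (4x)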

@mglowacki100
Author

Thanks @jmoralez for looking into this. Your explanation makes perfect sense, but I've checked df = df.astype('float32') and it doesn't fix the problem with pandas (still OOM); worse, after this type casting the numpy workaround also stops working (OOM).
Just out of curiosity: LightGBM introduced histogram binning, so how does it handle a feature that only takes the values 0, 1, 2, 3, 4? Is it also converted to float and "re-binned" in some sense?

@jmoralez
Collaborator

jmoralez commented Jan 25, 2024

Is your target column located in the middle of the dataframe? We have a test,

def test_no_copy_when_single_float_dtype_dataframe(dtype, feature_name):

which checks that the df is not copied when it's of a single float dtype, but that only covers the data, and I'm guessing it has to be contiguous. So maybe you can try X_df = df[f_all].astype('float32') instead and then provide that to fit.
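
Concretely, that suggestion would look something like this (a sketch; model, df, and f_all are the objects from the earlier snippets, and the printouts are just sanity checks):

# build a single-dtype float32 feature frame once and reuse it
X_df = df[f_all].astype('float32')

print(X_df.dtypes.unique())  # expect only float32
arr = X_df.to_numpy()
print(arr.dtype, arr.flags['C_CONTIGUOUS'], arr.flags['F_CONTIGUOUS'])  # layout of the backing array

model.fit(X_df, df["target"])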

@mglowacki100
Author

@jmoralez I've tried it this way:

model.fit(
    df[f_all].astype('float32'), 
    df["target"]
)

[screenshot: ver_C]

and now it works without OOM, but peak memory usage is higher than for

model.fit(
    df[f_all].values, 
    df["target"]
)

[screenshot: ver_B]
Here is a notebook with sample data: https://colab.research.google.com/drive/1W4ohwLPP76rpvxRXJlziQ9NxZVtbrfbp?usp=sharing
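
For comparison, the size of the two inputs handed to fit() can be checked directly (a sketch; it assumes all feature columns are int8 as described above and ignores whatever copies LightGBM makes internally):

X_int8 = df[f_all].values            # ver_B input: stays int8 until LightGBM converts it
X_f32 = df[f_all].astype('float32')  # ver_C input: float32 copy materialized up front
print(X_int8.nbytes / 1024**2, "MiB as int8 ndarray")
print(X_f32.to_numpy().nbytes / 1024**2, "MiB as float32 frame")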
