[python-package] Pandas memory overhead leads to OOM #6280
Hey @mglowacki100, thanks for using LightGBM. The core library uses floats (either float32 or float64), so if you provide ints they need to be converted to float first, which creates a copy and increases memory usage (going from int8 to float32 you'd see a 4x increase). You can try casting your DataFrame to a floating type first; in that case no copies are made when building the LightGBM dataset, so there is no memory peak. Here's an example:

```python
# cast all of your columns to float32
df = df.astype('float32')
model.fit(
    df[f_all],
    df["target"]
)
```

Please let us know if this helps.
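The 4x figure above is easy to verify with a small sketch. The frame below is a hypothetical random int8 stand-in, not the issue's actual data:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the all-int8 data described in this issue
df = pd.DataFrame(
    np.random.randint(0, 100, size=(1000, 10), dtype=np.int8),
    columns=[f"f{i}" for i in range(10)],
)

int8_bytes = df.memory_usage(index=False).sum()
float32_bytes = df.astype("float32").memory_usage(index=False).sum()
print(float32_bytes / int8_bytes)  # 4.0: each 1-byte int8 becomes a 4-byte float32
```

Casting up front trades a one-time 4x allocation for avoiding an extra temporary copy during `fit`.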
Thanks @jmoralez for looking into this. Your explanation makes perfect sense, but I've checked
Is your target column in the middle of the dataframe? We have a test to check that the df is not copied when it's of a single float dtype, but that's only for the data, and I'm guessing it has to be contiguous. So maybe you can try:

```python
X_df = df[f_all].astype('float32')
```

instead, and then provide that to `fit`.
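A quick way to sanity-check the single-dtype point is `np.shares_memory`, which tells you whether pandas handed back a view or a fresh allocation. This is a sketch with made-up data and column names:

```python
import numpy as np
import pandas as pd

# made-up data; column names are hypothetical
base = np.arange(12, dtype=np.float32).reshape(4, 3)
single = pd.DataFrame(base, columns=["f0", "f1", "target"])

# single float dtype: to_numpy() can return a view of the original buffer
view = single.to_numpy()
print(np.shares_memory(view, base))

# mixed dtypes force an upcasting copy instead
mixed = pd.DataFrame({"f0": np.ones(4, np.float32),
                      "target": np.ones(4, np.int8)})
print(mixed.to_numpy().dtype)  # float32: the int8 column is upcast into a new array
```

If the target column sits between feature columns, the feature selection is non-contiguous, which is why pulling the features out into their own single-dtype frame first can avoid the extra copy.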
@jmoralez I've tried it this way:

and now it works without OOM, but peak memory usage is higher than for
Description
I'm not sure if this is a bug, expected behaviour, or a place to improve, but I've noticed a weird thing: when I provide the data as a pandas DataFrame directly, it crashes the notebook with OOM, but when I provide it converted to numpy with `.values`, there is also a memory usage peak, but a smaller one, so it doesn't lead to OOM.

In both cases memory surges at the initial stage of fitting, even before the log line

```
[LightGBM] [Info] Start training from score ...
```

appears. One peculiar thing about my data is that all columns are of type `int8`.

I've also looked at https://lightgbm.readthedocs.io/en/latest/FAQ.html#when-running-lightgbm-on-a-large-dataset-my-computer-runs-out-of-ram but it doesn't help.
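Since all columns are int8, one workaround is to perform the int8-to-float32 conversion once, explicitly, before calling `fit`, so the conversion is the only extra allocation. This is a sketch; the frame, the `f_all` feature list, and the `target` column are placeholders for the notebook's actual data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# placeholder int8 frame mimicking the data layout described in this issue
df = pd.DataFrame(
    rng.integers(-128, 128, size=(100, 5), dtype=np.int8),
    columns=[f"f{i}" for i in range(4)] + ["target"],
)
f_all = [f"f{i}" for i in range(4)]  # hypothetical feature list

# one explicit conversion instead of an implicit one inside LightGBM
X = df[f_all].to_numpy(dtype=np.float32)
y = df["target"].to_numpy()
print(X.dtype, X.shape)  # float32 (100, 4)
# model.fit(X, y) should then see float32 input and need no further dtype conversion
```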
Reproducible example
Here is google colab notebook with data:
https://colab.research.google.com/drive/1W4ohwLPP76rpvxRXJlziQ9NxZVtbrfbp?usp=sharing
Environment info