
Use less memory when decreasing parameter max_bin #6319

Closed
zansibal opened this issue Feb 15, 2024 · 6 comments

@zansibal

Summary

Smaller max_bin should decrease the memory footprint used during training. In my tests, it does not.

Motivation

A lower memory requirement makes it possible to train on larger datasets. This is especially important in the gpu and cuda modes, where VRAM is scarce.

Description

It is recommended to test different max_bin settings for the gpu and cuda device types to speed up training, e.g. 15, 63, and 255. While testing these settings, I saw no significant change in GPU memory usage. This is surprising, as each value in the training array should require fewer bits (4 bits for 15, 6 bits for 63, and 8 bits for 255). I can appreciate that this is hard to do, given that all of these sizes are at most 1 byte. Is it possible?

References

Test results from my particular dataset (running mse regression):
Data shape (41_865_312, 88) and 14.0 GB (float32) size in numpy before constructing LightGBM dataset.

| max_bin | VRAM usage | training time |
|---------|------------|---------------|
| 255     | 8900 MB    | 223 s         |
| 63      | 8700 MB    | 200 s         |
| 15      | 8700 MB    | 204 s         |

Finally, the GPU memory usage is more than half of the numpy memory usage (which uses single-precision floats). Shouldn't it be about a quarter of that (around 3500 MB)?
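
For reference, the ~3500 MB figure comes from assuming 1 byte per binned value across the whole matrix:

```python
# Back-of-the-envelope estimate: one byte per binned value for the full matrix.
rows, cols = 41_865_312, 88
binned_bytes = rows * cols                   # 3,684,147,456 bytes
print(f"{binned_bytes / 1024**2:.0f} MB")    # ~3514 MB, i.e. roughly 3500 MB
```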

Btw, the recently added cuda support is a tremendous improvement over the old gpu.

@jameslamb
Collaborator

Thanks for using LightGBM.

Can you share the code you used to estimate that memory usage? For example, is that the memory usage of just the Dataset after construction, or peak memory usage throughout training?

I ask because it's possible that for a sufficiently large model (in terms of n_trees * num_leaves), the memory usage of the model could be larger than that of the Dataset.

@zansibal
Author

Hi, thanks for the quick response.

I am using nvidia-smi to monitor the VRAM usage.
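
(For reference, roughly the same number can be read programmatically; here is a sketch using pynvml, though I just watch nvidia-smi:)

```python
# Sample the GPU's current memory usage, similar to what nvidia-smi reports.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used VRAM: {mem.used / 1024**2:.0f} MB")
pynvml.nvmlShutdown()
```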

I checked just now and, from what I can see, there are three steps in the memory allocation.

  1. When running dataset.construct(), VRAM usage goes to 315 MB.
  2. When running lgb.train(), VRAM usage jumps to 3800 MB during initialization.
  3. When training actually starts, usage jumps to the aforementioned peak of 8700 MB and stays there the whole time.

Some of the training params:

model_params = {
    'n_estimators': 400,
    'learning_rate': 0.01,
    'min_data_in_leaf': 7300,
    'num_leaves': 1000,
    'max_depth': -1,
    'boosting': 'gbdt',
    'objective': 'mse',
    'device_type': 'cuda',
    'max_bin': 63,
}
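
For context, the construction and training calls look roughly like this (a sketch; X and y stand in for my numpy feature matrix and targets):

```python
import lightgbm as lgb

# X: (41_865_312, 88) float32 feature matrix; y: regression targets (placeholders here).
dataset = lgb.Dataset(X, label=y, params={'max_bin': model_params['max_bin']})
dataset.construct()                          # step 1: VRAM goes to ~315 MB

booster = lgb.train(model_params, dataset)   # steps 2-3: VRAM climbs to ~8700 MB
```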

The final model takes about 35 MB of disk space when saved.

(Three screenshots from 2024-02-15 attached.)

@jameslamb
Collaborator

Great, thanks for that!

So to me, it looks like this statement is not true:

Smaller max_bin should decrease the memory footprint used during training. In my tests, it does not.

It seems that you did observe a smaller memory footprint with a smaller max_bin (e.g. 200 MB less VRAM going from 255 to 63 bins).

It also seems that the size of the model, not the Dataset, is the dominant source of memory usage in your application.

num_leaves=1000 will generate very large trees, and with n_estimators=400 you're asking LightGBM to generate up to 400 of them.

I recommend trying some combination of the following to reduce the size of the model: a smaller num_leaves, fewer n_estimators, a max_depth limit, or a larger min_data_in_leaf.

You can also try quantized training, which is available in the CUDA version since #5933. See https://lightgbm.readthedocs.io/en/latest/Parameters.html#use_quantized_grad. With quantized training, the gradients and hessians are represented with smaller data types. That allows you to trade some precision in exchange for lower memory usage.
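
A minimal sketch of enabling it (reusing the model_params and dataset from your comment above; num_grad_quant_bins is shown at its documented default of 4):

```python
# Quantized training: gradients/hessians are discretized into a small number of
# integer bins, which reduces the memory they take up during training.
quantized_params = {
    **model_params,
    'use_quantized_grad': True,
    'num_grad_quant_bins': 4,   # fewer bins -> less memory, at some cost in precision
}
booster = lgb.train(quantized_params, dataset)
```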

@zansibal
Author

Thanks for taking the time.

Although I am not sure it is the model itself taking up this amount of space, I am starting to realize that other necessary data structures consume memory as well (like the gradients and hessians you mention).

I will experiment with quantized training. Thanks for the tip.

@jameslamb
Collaborator

Although I am not sure it is the model itself taking up this amount of space, I am starting to realize that other necessary data structures consume memory as well (like the gradients and hessians you mention).

You are totally right! It was a bit imprecise for me to say "the model".

The training-time memory usage has these 4 main sources:

  • the raw data
  • the LightGBM Dataset (which includes things like init_score and weight)
  • the LightGBM Booster (really "the model")
  • other data structures used in training but not preserved when you save the model (e.g. the gradients and hessians, which I was carelessly also including in what I referred to as "the model")

You can avoid the memory usage for the raw data by constructing a Dataset directly from a file (either a CSV/TSV/LibSVM file or a LightGBM Dataset binary file).
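
A rough sketch of that (train.csv and train.bin are hypothetical paths; adjust parsing parameters such as label_column to match your file):

```python
import lightgbm as lgb

# Build the Dataset directly from a file so the raw numpy array never has to
# be held in memory alongside the binned data.
dataset = lgb.Dataset('train.csv', params={'max_bin': 63})
dataset.construct()

# Save the binned representation; later runs can load it directly and skip
# the text-parsing step entirely.
dataset.save_binary('train.bin')
dataset_from_binary = lgb.Dataset('train.bin')
```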

You can reduce the memory usage of the Dataset by using a smaller max_bin or a higher min_data_in_bin, or by removing irrelevant features before construction. In the Python package, if you construct a Dataset in the same process where you perform training, you can avoid LightGBM storing a copy of the raw data by passing free_raw_data=True.
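
For example (a sketch with in-memory data; X and y are placeholders):

```python
import lightgbm as lgb

# Coarser binning and a higher minimum count per bin shrink the binned Dataset;
# free_raw_data=True lets LightGBM drop its reference to X after construction.
dataset = lgb.Dataset(
    X,
    label=y,
    params={'max_bin': 63, 'min_data_in_bin': 10},
    free_raw_data=True,
)
```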

You can reduce the memory usage of the Booster by some of the strategies I mentioned in #6319 (comment).

You can reduce the memory usage of the other data structures by trying quantized training. If you have a lot of rows and any are identical or very similar, you could also try collapsing those into a single row and using weighted training to capture the relative representation of those samples in the whole training dataset.
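
A sketch of that last idea (np.unique is just one way to deduplicate; X and y are placeholders, and this only makes sense if duplicated rows share the same label):

```python
import numpy as np
import lightgbm as lgb

# Collapse exact duplicate rows and keep each row's multiplicity as its weight.
Xy = np.column_stack([X, y])
unique_rows, counts = np.unique(Xy, axis=0, return_counts=True)
X_unique, y_unique = unique_rows[:, :-1], unique_rows[:, -1]

dataset = lgb.Dataset(X_unique, label=y_unique, weight=counts.astype(float))
```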

We should get more of this information into the docs, sorry 😅

@jameslamb
Collaborator

The other complication in your case is which of these data structures are stored in host memory, in the GPU's memory, or both. That's an area of active development in LightGBM right now. If you're familiar with CUDA and want to look through the code, we'd welcome contributions that identify ways to cut out any unnecessary copies being held in both places.
