
Optimizing RAM Usage: lgb.Dataset Creation from CSV by columns #6358

Open
jaguerrerod opened this issue Mar 12, 2024 · 7 comments

@jaguerrerod
Is there a way to generate an lgb.Dataset by reading a file column-wise? I don't think so, but I'm not sure why this functionality doesn't exist. Dataset creation is a bottleneck that prevents us from utilizing all the RAM. Once the bins for each variable are created, especially if there are few like in my case (around 25 numeric values per variable), the dataset size is drastically reduced. However, to achieve this, we need to load the data into RAM, typically from a disk file, and this intermediate step consumes several times the RAM required by the lgb.Dataset. In practice, if we have X RAM, we can only use X/2 or even less. Would it not be possible to read a CSV where each row contains the data for one column, perform binning, and sequentially free up the RAM? Is there any other alternative to fully utilize all the RAM?

@jaguerrerod jaguerrerod changed the title Optimizing RAM Usage: lgb.Dataset Creation from CSV Columns Optimizing RAM Usage: lgb.Dataset Creation from CSV by columns Mar 12, 2024
@jmoralez (Collaborator)

Hey @jaguerrerod, thanks for using LightGBM. You can create single column datasets one at a time and add them to your full dataset. Here's an example:

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_features=5)
# create dataset with single column and target
ds = lgb.Dataset(X[:, [0]], y, feature_name=['x0']).construct()
# add the rest of the columns, these can be read from a file one at a time
for j in range(1, X.shape[1]):
    ds.add_features_from(lgb.Dataset(X[:, [j]], feature_name=[f'x{j}']).construct())
print(ds.num_feature())
# 5

Please let us know if you have further doubts.

@jaguerrerod (Author) commented Mar 12, 2024

Great, thank you!
Does it work in R? If not, could I construct the dataset in Python, save it, and load it from R?

@jmoralez (Collaborator)

I think the R package doesn't have that feature (I looked for calls to LGBM_DatasetAddFeaturesFrom in R's Dataset and didn't find any), but you should be able to save it from Python and load it in R.

@jaguerrerod (Author)

It would be great to have this in the R interface, but at least using Python to generate the dataset incrementally is a workaround. Thank you.

@jameslamb (Collaborator)

would be great to have this in the R interface

Would you like to contribute that? We'd welcome the help.

@jaguerrerod (Author)

Unfortunately, I lack the C++ knowledge for this task. I could offer a reward if someone is willing to do the work, but I don't know where to post such a request. If anyone interested contacts me, we can discuss it further; it would of course be for incorporation into master and shared with the community.

@jameslamb (Collaborator)

No problem, thanks anyway for considering it.
