
Optimizing RAM Usage: lgb.Dataset Creation from CSV by columns #6358

Open
jaguerrerod opened this issue Mar 12, 2024 · 7 comments

@jaguerrerod
Is there a way to generate an lgb.Dataset by reading a file column-wise? I don't think so, but I'm not sure why this functionality doesn't exist. Dataset creation is a bottleneck that prevents us from utilizing all the RAM. Once the bins for each variable are created, especially if there are few like in my case (around 25 numeric values per variable), the dataset size is drastically reduced. However, to achieve this, we need to load the data into RAM, typically from a disk file, and this intermediate step consumes several times the RAM required by the lgb.Dataset. In practice, if we have X RAM, we can only use X/2 or even less. Would it not be possible to read a CSV where each row contains the data for one column, perform binning, and sequentially free up the RAM? Is there any other alternative to fully utilize all the RAM?

@jaguerrerod jaguerrerod changed the title Optimizing RAM Usage: lgb.Dataset Creation from CSV Columns Optimizing RAM Usage: lgb.Dataset Creation from CSV by columns Mar 12, 2024
@jmoralez (Collaborator)

Hey @jaguerrerod, thanks for using LightGBM. You can create single column datasets one at a time and add them to your full dataset. Here's an example:

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_features=5)
# create dataset with single column and target
ds = lgb.Dataset(X[:, [0]], y, feature_name=['x0']).construct()
# add the rest of the columns, these can be read from a file one at a time
for j in range(1, X.shape[1]):
    ds.add_features_from(lgb.Dataset(X[:, [j]], feature_name=[f'x{j}']).construct())
print(ds.num_feature())
# 5

Please let us know if you have further doubts.

@jaguerrerod (Author) commented Mar 12, 2024

Great, thank you!
Does it work in R? If not, could I construct the dataset in Python, save it, and load it from R?

@jmoralez (Collaborator)

I think the R package doesn't have that feature (I looked for calls to LGBM_DatasetAddFeaturesFrom in R's Dataset and didn't find any), but you should be able to save it from Python and load it in R.

@jaguerrerod (Author)

It would be great to have this in the R interface, but at least using Python to generate the dataset incrementally is a workaround. Thank you.

@jameslamb (Collaborator)

would be great to have this in the R interface

Would you like to contribute that? We'd welcome the help.

@jaguerrerod (Author)

Unfortunately, I lack the C++ knowledge for this task. I could offer a reward if someone is willing to do the work, but I don't know where to post such a request. If anyone interested contacts me, we can discuss it further; it would of course be for incorporation into master and shared with the community.

@jameslamb (Collaborator)

No problem, thanks anyway for considering it.
