# Optimizing RAM Usage: lgb.Dataset Creation from CSV by columns #6358
Hey @jaguerrerod, thanks for using LightGBM. You can create single-column datasets one at a time and add them to your full dataset. Here's an example:

```python
import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_features=5)

# create a dataset with a single column and the target
ds = lgb.Dataset(X[:, [0]], y, feature_name=['x0']).construct()

# add the rest of the columns; these can be read from a file one at a time
for j in range(1, X.shape[1]):
    ds.add_features_from(lgb.Dataset(X[:, [j]], feature_name=[f'x{j}']).construct())

print(ds.num_feature())
# 5
```

Please let us know if you have further doubts.
Great, thank you!
I think the R package doesn't have that feature (I looked for calls to …).
It would be great to have this in the R interface, but at least using Python to generate the dataset incrementally is a workaround. Thank you.
Would you like to contribute that? We'd welcome the help.
Unfortunately, I lack the C++ knowledge for this task. I could offer a reward if someone is willing to do the work; I'm just not sure where to post such a request. If anyone interested contacts me, we can discuss it further, and of course the result would be merged into master and shared with the community.
No problem, thanks anyway for considering it.
Is there a way to generate an `lgb.Dataset` by reading a file column-wise? I don't think so, but I'm not sure why this functionality doesn't exist. Dataset creation is a bottleneck that prevents us from utilizing all the RAM. Once the bins for each variable are created, the dataset size is drastically reduced, especially when there are few distinct values as in my case (around 25 numeric values per variable). However, to achieve this we first need to load the raw data into RAM, typically from a disk file, and this intermediate step consumes several times the RAM required by the `lgb.Dataset` itself. In practice, if we have X RAM, we can only use X/2 or even less. Would it not be possible to read a CSV where each row contains the data for one column, perform the binning, and sequentially free up the RAM? Is there any other alternative to fully utilize all the RAM?