Slow dataset creation #4037

Open
pseudotensor opened this issue Mar 1, 2021 · 9 comments


pseudotensor commented Mar 1, 2021

import datatable as dt
import numpy as np

rows=200
cols=800000
table = dt.Frame(np.random.rand(rows, cols))
table.names = ["name_" + str(x) for x in range(table.shape[1])]
target = "name_0"

y = table[:, target].to_numpy().ravel()
del table[target]

import lightgbm as lgb
model = lgb.LGBMRegressor()
model.fit(table, y)

LightGBM version: 3.1.1.99

Using datatable or numpy gives the same result. It gets "stuck" here, using 1 core for 10-20 minutes:

  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 1204 in __init_from_np2d
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 1158 in _lazy_init
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 1356 in construct
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2096 in __init__
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 230 in train
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637 in fit
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 794 in fit
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/lib_lightgbm.so(_ZN8LightGBM10FindGroupsERKSt6vectorISt10unique_ptrINS_9BinMapperESt14default_deleteIS2_EESaIS5_EERKS0_IiSaIiEEPPiPKiiiibbPS0_IaSaIaEE+0x1f4)[0x7fc824bb1404]
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/lib_lightgbm.so(_ZN8LightGBM19FastFeatureBundlingERKSt6vectorISt10unique_ptrINS_9BinMapperESt14default_deleteIS2_EESaIS5_EEPPiPPdPKiiiRKS0_IiSaIiEEibbPS0_IaSaIaEE+0xbc2)[0x7fc824bb44e2]
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/lib_lightgbm.so(_ZN8LightGBM7Dataset9ConstructEPSt6vectorISt10unique_ptrINS_9BinMapperESt14default_deleteIS3_EESaIS6_EEiRKS1_IS1_IdSaIdEESaISB_EEPPiPPdPKiimRKNS_6ConfigE+0x278)[0x7fc824bb50b8]
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/lib_lightgbm.so(_ZN8LightGBM13DatasetLoader31ConstructBinMappersFromTextDataEiiRKSt6vectorISsSaISsEEPKNS_6ParserEPNS_7DatasetE+0x1a29)[0x7fc824bcf179]
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/lib_lightgbm.so(_ZN8LightGBM13DatasetLoader12LoadFromFileEPKcii+0x1bb)[0x7fc824bd2b7b]
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/lib_lightgbm.so(LGBM_DatasetCreateFromFile+0x17b)[0x7fc824df36eb]

After the 10-20 minutes, I get a system OOM even though I have 64GB of RAM.

I read #1081, which is also about a very wide dataset. But in general, what recommendations are there for handling many columns and speeding things up without using too much memory?

I'm trying various things, but it seems like LightGBM spending 20 minutes on 1 core here is something that could be improved. E.g. in xgboost I/we used OpenMP for data ingestion, which speeds things up even though a lot of the operations are memory-bandwidth limited. A speedup seems plausible here too, since the LightGBM dataset construction is about 100x slower than generating the data itself, which is bad.

So it should be possible to parallelize the dataset construction, since features are independent. E.g. one could even fork many jobs that each take a portion of the columns and create Dataset objects, then use add_features_from to column-bind the features, as in the untested sketch below. Why isn't that done internally using OpenMP?
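
Something like this untested sketch (serial here just to show the API; the per-slice Dataset construction is the part one would hope to run in parallel, and I haven't checked whether this is actually faster):

import numpy as np
import lightgbm as lgb

rows, cols = 200, 8000  # smaller than the repro above, just to illustrate the idea
X = np.random.rand(rows, cols)
y = np.random.rand(rows)

# Build one Dataset per column slice; this loop is what would ideally run in parallel.
parts = []
for idx in np.array_split(np.arange(cols), 4):
    part = lgb.Dataset(X[:, idx],
                       label=y if not parts else None,  # label only on the first piece
                       free_raw_data=False)
    part.construct()
    parts.append(part)

# Column-bind the pieces back into a single Dataset.
combined = parts[0]
for part in parts[1:]:
    combined.add_features_from(part)

booster = lgb.train({"verbose": -1}, combined)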

@pseudotensor

Using a CSV file is still very slow and single-threaded:

import datatable as dt
import numpy as np

rows=200
cols=800000
table = dt.Frame(np.random.rand(rows, cols))
table.names = ["name_" + str(x) for x in range(table.shape[1])]
target = "name_0"
table_csv = "table.csv"
table.to_csv(table_csv, header=False)

y = table[:, target].to_numpy().ravel()

import lightgbm as lgb
train_set = lgb.Dataset(table_csv, label=y)
booster = lgb.train({}, train_set)




Current thread 0x00007fc8c279d740 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 1148 in _lazy_init
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 1356 in construct
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2096 in __init__
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 230 in train
  File "slow_lgb3.py", line 17 in <module>

@pseudotensor

Setting max_bin = 4 doesn't help; it still takes the same time. Is there possibly an O(N^2) operation in the dataset creation process?
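
For reference, this is roughly how max_bin is being passed (a reconstruction, not the exact script); since max_bin is a Dataset parameter it goes through the params dict:

import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 10_000)   # smaller than the original repro
y = np.random.rand(200)

params = {"max_bin": 4, "verbose": -1}
train_set = lgb.Dataset(X, label=y, params=params)
booster = lgb.train(params, train_set)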


pseudotensor commented Mar 1, 2021

Repeated stack samples in gdb show it is always stuck in the FindGroups function using the BinMapper:

https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L120-L187

pseudotensor added a commit to h2oai/LightGBM that referenced this issue Mar 2, 2021

pseudotensor commented Mar 2, 2021

https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L125-L134

Basically, this inner features_in_group loop has a size that grows in direct proportion to fidx (the outer loop index). So even if at fidx=100000 the time is only ~100us per fidx value, by fidx=800000 it is ~1000us per fidx value.

I already moved the bin_mapper stuff out of the loop:

h2oai@e498551

but that didn't help this slowness or the O(N^2) behavior.
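
For scale, a back-of-the-envelope estimate of that quadratic cost (the per-check cost below is made up, not measured):

# If every feature ends up in its own group, feature fidx scans ~fidx existing groups,
# so the total number of group checks is roughly N*(N-1)/2.
n_features = 800_000
total_checks = n_features * (n_features - 1) // 2      # ~3.2e11 checks
per_check_seconds = 5e-9                               # hypothetical cost per check
print(total_checks * per_check_seconds / 60)           # ~27 minutes, same ballpark as observed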


guolinke commented Mar 2, 2021

ping @shiyu1994


pseudotensor commented Mar 2, 2021

I tried to use rand.Sample() to sample features_in_group, but it turns out rand.Sample() is just as bad: the way it is designed, it scales with the size of the population being sampled from, not the size of the sample. That's also quite bad.

So such an attempt still keeps things at O(N^2).

Example of random sampling without replacement that keeps things O(N): https://stackoverflow.com/questions/28287138/c-randomly-sample-k-numbers-from-range-0n-1-n-k-without-replacement
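
For example, Floyd's algorithm (my sketch in Python, not code from the linked answer) draws k distinct values with work proportional to k rather than the population size:

import random

def sample_without_replacement(n, k, rng=random):
    """Floyd's algorithm: k distinct integers from range(n) in O(k) expected time."""
    chosen = set()
    for j in range(n - k, n):
        t = rng.randint(0, j)                   # inclusive upper bound
        # If t was already picked, j cannot have been picked yet, so take j instead.
        chosen.add(j if t in chosen else t)
    return chosen

print(sorted(sample_without_replacement(10, 4)))   # e.g. [1, 5, 6, 9]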

@shiyu1994

@pseudotensor Thanks for using LightGBM.

The synthesized dataset is dense. So I think

LightGBM/src/io/dataset.cpp

Lines 125 to 134 in 37e9878

for (int gid = 0; gid < static_cast<int>(features_in_group.size()); ++gid) {
  auto cur_num_bin = group_num_bin[gid] + bin_mappers[fidx]->num_bin() +
                     (bin_mappers[fidx]->GetDefaultBin() == 0 ? -1 : 0);
  if (group_total_data_cnt[gid] + cur_non_zero_cnt <=
      total_sample_cnt + single_val_max_conflict_cnt) {
    if (!is_use_gpu || cur_num_bin <= max_bin_per_group) {
      available_groups.push_back(gid);
    }
  }
}

won't find any available groups. Each feature will end up in a separate group.
This is a very extreme case. A possible solution would be to limit the maximum number of trials (iterations) in line 125.

@StrikerRUS

@shiyu1994

A possible solution would be to limit the maximum number of trials (iterations) in line 125.

Do you have plans to implement this? Or should we add it to our feature requests?

@shiyu1994

@StrikerRUS We can have this in feature requests. A quick fix would be to randomly sample from feature groups in line 125. How to sample the groups when the total number of groups is large is an open question.
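
For illustration only, a Python sketch of that quick fix (not the actual C++ change, and max_trials is a made-up name): each new feature would only check a bounded random subset of the existing groups instead of scanning all of them.

import random

def assign_groups(num_features, conflict_free, max_trials=32, rng=random):
    """Greedy grouping that checks at most max_trials randomly chosen existing
    groups per feature, so the total work is O(num_features * max_trials)."""
    groups = []                                   # each group is a list of feature indices
    for fidx in range(num_features):
        candidates = rng.sample(range(len(groups)), min(max_trials, len(groups)))
        for gid in candidates:
            if conflict_free(groups[gid], fidx):  # stands in for the bin/conflict checks
                groups[gid].append(fidx)
                break
        else:
            groups.append([fidx])                 # no compatible group found
    return groups

# Dense toy case: no feature can share a group, so we still get one group per feature,
# but each feature only inspected a bounded number of candidates.
print(len(assign_groups(1_000, lambda group, fidx: False)))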

We have a plan to separate the dataset construction. I think we may leave it to that part.
