Multi file as input for LightGBM #2031

Closed
sosososo opened this issue Feb 26, 2019 · 17 comments
Comments

@sosososo

When I want to use LightGBM on 'Aether' (a platform inside Microsoft), passing multiple files as input would make uploading or setting up a folder faster, but LightGBM currently doesn't support multiple files or datasets as input. To use it today, we have to merge the files into a single one, which is time-consuming, especially when the data is large. Will this be supported in the future?

@andrewliuxxx

Good advice. I'm supporting a folder of files as input in my private LightGBM build.

@StrikerRUS
Collaborator

@andrewliuxxx Great! Would you mind creating a PR?

@snoe925

snoe925 commented May 13, 2019

I have also been working on fixing pipe read support, e.g. `data=<(process on the fly)` `valid_data=<(process on the fly ..)`
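For context, that style of invocation relies on bash process substitution, so the CLI reads data from a pipe rather than a regular file. A minimal sketch of what it looks like (the config file name, the compressed file names, and zcat are placeholders, and whether a given LightGBM release actually handles such pipes is exactly what this comment is about fixing):

```sh
./lightgbm config=train.conf data=<(zcat train.csv.gz) valid=<(zcat valid.csv.gz)
```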

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

Contributions for this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@StrikerRUS
Collaborator

One more request for using multiple files (preferably in parquet format) on one machine as input: #2638.

@zwqjoy

zwqjoy commented Apr 9, 2021

@StrikerRUS Hi, is there a target date for supporting training data from multiple Parquet files with the CLI version?

@StrikerRUS
Collaborator

@zwqjoy Hey! To my knowledge, no one has picked this feature request up yet. Maybe MMLSpark or dataset creation from multiple files in Python will fit your needs?
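For the second suggestion, a minimal Python sketch of building one Dataset from several Parquet part files (the paths, the "label" column name, and the training parameters are made-up placeholders):

```python
import glob

import lightgbm as lgb
import pandas as pd

# Read every Parquet part file and concatenate them in memory
# (pd.read_parquet requires pyarrow or fastparquet).
parts = sorted(glob.glob("data/train/part-*.parquet"))
df = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)

# Build a single LightGBM Dataset from the concatenated frame and train on it.
train_set = lgb.Dataset(df.drop(columns=["label"]), label=df["label"])
booster = lgb.train({"objective": "binary"}, train_set, num_boost_round=100)
```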

@zwqjoy

zwqjoy commented Apr 13, 2021

@StrikerRUS Thanks. Do you mean that if I want to train distributed LightGBM (with data in Hadoop split into many part files), I need to use MMLSpark rather than the CLI distributed method, because the CLI input doesn't support many part files?

@StrikerRUS
Collaborator

@zwqjoy Yeah, exactly. The CLI distributed version requires that the entire training data file be present on each machine. Please try MMLSpark for chunked data.
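A rough sketch of the MMLSpark route, assuming MMLSpark is installed on the cluster (in newer releases the project is called SynapseML and the import path is synapse.ml.lightgbm; the HDFS path and column names below are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from mmlspark.lightgbm import LightGBMClassifier  # synapse.ml.lightgbm in newer releases

spark = SparkSession.builder.getOrCreate()

# Spark reads all part files under the directory, so nothing has to be merged by hand.
df = spark.read.parquet("hdfs:///data/train/")

# Assemble the feature columns into a single vector column, then fit the model.
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"], outputCol="features"
)
model = LightGBMClassifier(labelCol="label", featuresCol="features").fit(
    assembler.transform(df)
)
```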

@shiyu1994
Collaborator

Actually, distributed training with the CLI does support partitioning the data across machines. See pre_partition:
https://lightgbm.readthedocs.io/en/latest/Parameters.html#pre_partition
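A sketch of how that could look in a distributed CLI config (host list, port, and file names are placeholders; only pre_partition, data, and valid are the point here):

```
task = train
objective = binary
tree_learner = data
num_machines = 4
machine_list_filename = mlist.txt
local_listen_port = 12400
pre_partition = true
# each machine points data/valid at the partition stored on that machine
data = data_split1
valid = valid_split1
```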

@StrikerRUS
Collaborator

@shiyu1994 Ah, forgot about this option, thanks for correcting me!

@zwqjoy

zwqjoy commented Apr 15, 2021

@shiyu1994 In CLI mode, both non-pre-partitioned and pre-partitioned data can work; for pre-partitioned data I set pre_partition=true. For example, I use 4 workers and store the data on NFS, split into data_split1, data_split2, data_split3, and data_split4 (the 4 file names differ, but all are stored on NFS and every worker can access the NFS storage).
Can I run the CLI distributed mode with pre_partition=true while each worker uses a different training data file name?

For example, worker1 uses data_split1 and worker2 uses data_split2.

@zwqjoy

zwqjoy commented Apr 15, 2021

@StrikerRUS @shiyu1994
In CLI mode, both non-pre-partitioned and pre-partitioned data can work; for pre-partitioned data I set pre_partition=true.
For example, I use 4 workers and store the data on NFS, split into data_split1, data_split2, data_split3, and data_split4 (the 4 file names differ, but all are stored on NFS and every worker can access the NFS storage).

  1. Can I run the CLI distributed mode with pre_partition=true while each worker uses a different training data file name?
     e.g. worker1 uses data_split1 and worker2 uses data_split2.

  2. pre_partition=true is for the training data; does the validation data still have to be a single file, i.e. it cannot be pre-partitioned?

@shiyu1994
Collaborator

Hi @zwqjoy

  1. Sure. Different names for the training data files on different machines are perfectly OK (see the config sketch below).
  2. If you set pre_partition=true, then both the training data and validation data are taken as pre-partitioned.
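A sketch of point 1 in practice, assuming each worker reads its own local config file: the configs are identical except for the data/valid lines (and, per point 2, the validation files are per-partition as well; all file names are placeholders):

```
# worker1 train.conf
pre_partition = true
data = data_split1
valid = valid_split1

# worker2 train.conf (identical except for the file names)
pre_partition = true
data = data_split2
valid = valid_split2
```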

@shiyu1994
Collaborator

And you should be careful that, with pre_partition=true, each machine only outputs the metric value calculated on the data partitioned to that machine.

@zwqjoy

zwqjoy commented Apr 16, 2021

> Hi @zwqjoy
>
> 1. Sure. Different names for the training data files on different machines are perfectly OK.
>
> 2. If you set `pre_partition=true`, then both the training data and validation data are taken as pre-partitioned.

@shiyu1994 Regarding pre_partition:

https://lightgbm.readthedocs.io/en/latest/Parameters.html#pre_partition says: "true if training data are pre-partitioned, and different machines use different partitions".
I thought pre_partition only applies to the training data, not the validation data, but you say both the training data and the validation data are taken as pre-partitioned, so I am confused.

Also, you say different names for the training data files on different machines are perfectly OK. Do you mean each worker has its own local training config file (with content like train = part-[number])?

@zwqjoy

zwqjoy commented Apr 16, 2021

> And you should be careful that, with pre_partition=true, each machine only outputs the metric value calculated on the data partitioned to that machine.

@shiyu1994 You say that with pre_partition=true, each machine only outputs the metric value calculated on the data partitioned to that machine.

  1. With pre_partition=false, every worker reports a different train AUC (4 train metrics), but the valid AUC is the same everywhere (1 valid metric).
  2. With pre_partition=true (I pre-partition only the training data; the validation data is the same on every worker), every worker still reports a different train AUC (4 train metrics), but the valid AUC is the same everywhere (1 valid metric).
     What is the difference?
