Multi file as input for LightGBM #2031

Closed
sosososo opened this issue Feb 26, 2019 · 17 comments
Comments

@sosososo

When I want to use LightGBM on 'Aether' (a platform inside Microsoft), passing multiple files as input would make uploading or setting up a folder faster, but LightGBM currently doesn't support multiple files or datasets as input. To use it today, we have to merge the files into a single one, which is time-consuming, especially when the data is large. Will this be supported in the future?

@andrewliuxxx

Good advice. I'm supporting a folder of files as input in my private LightGBM build.

@StrikerRUS
Collaborator

@andrewliuxxx Great! Would you mind creating a PR?

@snoe925

snoe925 commented May 13, 2019

I have also been working on fixing pipe read support, e.g. `data=<(process on the fly)` `valid_data=<(process on the fly ..)`
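For context, that style of invocation relies on bash process substitution, so the CLI reads data from a pipe rather than a regular file. A minimal sketch of what it looks like (the config file name, the compressed file names, and zcat are placeholders, and whether a given LightGBM release actually handles such pipes is exactly what this comment is about fixing):

```sh
./lightgbm config=train.conf data=<(zcat train.csv.gz) valid=<(zcat valid.csv.gz)
```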

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

Contributions for this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@StrikerRUS
Collaborator

One more request for using multiple files (preferably in parquet format) on one machine as input: #2638.

@zwqjoy

zwqjoy commented Apr 9, 2021

@StrikerRUS Hi, is there a target date for supporting training data from multiple Parquet files with the CLI version?

@StrikerRUS
Collaborator

@zwqjoy Hey! To my knowledge, no one has picked this feature request up yet. Maybe MMLSpark or dataset creation from multiple files in Python will fit your needs?
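For the second suggestion, a minimal Python sketch of building one Dataset from several Parquet part files (the paths, the "label" column name, and the training parameters are made-up placeholders):

```python
import glob

import lightgbm as lgb
import pandas as pd

# Read every Parquet part file and concatenate them in memory
# (pd.read_parquet requires pyarrow or fastparquet).
parts = sorted(glob.glob("data/train/part-*.parquet"))
df = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)

# Build a single LightGBM Dataset from the concatenated frame and train on it.
train_set = lgb.Dataset(df.drop(columns=["label"]), label=df["label"])
booster = lgb.train({"objective": "binary"}, train_set, num_boost_round=100)
```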

@zwqjoy

zwqjoy commented Apr 13, 2021

@StrikerRUS Thanks. Do you mean that if I want to train distributed LightGBM (with data in Hadoop split into many part files), I need to use MMLSpark rather than the CLI distributed method, because the CLI input doesn't support many part files?

@StrikerRUS
Collaborator

@zwqjoy Yeah, exactly. The CLI distributed version requires that the entire training data file be present on each machine. Please try MMLSpark for chunked data.
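A rough sketch of the MMLSpark route, assuming MMLSpark is installed on the cluster (in newer releases the project is called SynapseML and the import path is synapse.ml.lightgbm; the HDFS path and column names below are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from mmlspark.lightgbm import LightGBMClassifier  # synapse.ml.lightgbm in newer releases

spark = SparkSession.builder.getOrCreate()

# Spark reads all part files under the directory, so nothing has to be merged by hand.
df = spark.read.parquet("hdfs:///data/train/")

# Assemble the feature columns into a single vector column, then fit the model.
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"], outputCol="features"
)
model = LightGBMClassifier(labelCol="label", featuresCol="features").fit(
    assembler.transform(df)
)
```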

@shiyu1994
Collaborator

Actually, distributed training with the CLI does support partitioning the data across machines. See pre_partition:
https://lightgbm.readthedocs.io/en/latest/Parameters.html#pre_partition
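A sketch of how that could look in a distributed CLI config (host list, port, and file names are placeholders; only pre_partition, data, and valid are the point here):

```
task = train
objective = binary
tree_learner = data
num_machines = 4
machine_list_filename = mlist.txt
local_listen_port = 12400
pre_partition = true
# each machine points data/valid at the partition stored on that machine
data = data_split1
valid = valid_split1
```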

@StrikerRUS
Collaborator

@shiyu1994 Ah, forgot about this option, thanks for correcting me!

@zwqjoy

zwqjoy commented Apr 15, 2021

@shiyu1994 In CLI mode, both non-pre-partitioned and pre-partitioned data can work; for pre-partitioned data I set pre_partition=true. For example, I use 4 workers and store the data on NFS, split into data_split1, data_split2, data_split3, and data_split4 (the 4 file names differ, but all are stored on NFS and every worker can access the NFS storage).
Can I run the CLI distributed mode with pre_partition=true while each worker uses a different training data file name?

For example, worker1 uses data_split1 and worker2 uses data_split2.

@zwqjoy

zwqjoy commented Apr 15, 2021

@StrikerRUS @shiyu1994
In CLI mode, both non-pre-partitioned and pre-partitioned data can work; for pre-partitioned data I set pre_partition=true.
For example, I use 4 workers and store the data on NFS, split into data_split1, data_split2, data_split3, and data_split4 (the 4 file names differ, but all are stored on NFS and every worker can access the NFS storage).

  1. Can I run the CLI distributed mode with pre_partition=true while each worker uses a different training data file name?
     e.g. worker1 uses data_split1 and worker2 uses data_split2.

  2. pre_partition=true is for the training data; does the validation data still have to be a single file, i.e. it cannot be pre-partitioned?

@shiyu1994
Collaborator

Hi @zwqjoy

  1. Sure. Different names for the training data files on different machines are perfectly OK (see the config sketch below).
  2. If you set pre_partition=true, then both the training data and validation data are taken as pre-partitioned.
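A sketch of point 1 in practice, assuming each worker reads its own local config file: the configs are identical except for the data/valid lines (and, per point 2, the validation files are per-partition as well; all file names are placeholders):

```
# worker1 train.conf
pre_partition = true
data = data_split1
valid = valid_split1

# worker2 train.conf (identical except for the file names)
pre_partition = true
data = data_split2
valid = valid_split2
```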

@shiyu1994
Collaborator

And you should be careful that, with pre_partition=true, each machine only outputs the metric value calculated on the data partitioned to that machine.

@zwqjoy

zwqjoy commented Apr 16, 2021

> Hi @zwqjoy
>
> 1. Sure. Different names for the training data files on different machines are perfectly OK.
>
> 2. If you set `pre_partition=true`, then both the training data and validation data are taken as pre-partitioned.

@shiyu1994 Regarding pre_partition:

https://lightgbm.readthedocs.io/en/latest/Parameters.html#pre_partition says: "true if training data are pre-partitioned, and different machines use different partitions".
I thought pre_partition only applies to the training data, not the validation data, but you say both the training data and the validation data are taken as pre-partitioned, so I am confused.

Also, you say different names for the training data files on different machines are perfectly OK. Do you mean each worker has its own local training config file (with content like train = part-[number])?

@zwqjoy

zwqjoy commented Apr 16, 2021

> And you should be careful that, with pre_partition=true, each machine only outputs the metric value calculated on the data partitioned to that machine.

@shiyu1994 You say that with pre_partition=true, each machine only outputs the metric value calculated on the data partitioned to that machine.

  1. With pre_partition=false, every worker reports a different train AUC (4 train metrics), but the valid AUC is the same everywhere (1 valid metric).
  2. With pre_partition=true (I pre-partition only the training data; the validation data is the same on every worker), every worker still reports a different train AUC (4 train metrics), but the valid AUC is the same everywhere (1 valid metric).
     What is the difference?
