
[CLI, c/c++] Question about save_binary, train and improving performance for huge dataset #6190

wil70 opened this issue Nov 14, 2023 · 1 comment

wil70 commented Nov 14, 2023

Hello

I have a few questions about save_binary, i.e. "task = save_binary".
I have huge CSV files, and it takes days to convert them to .bin files; I would love to speed this up by more than 10x.

  1. Any idea how I can speed up this process? (See the config sketch after this list for the kind of setup I have in mind.)
  2. Would the cli-socket speed up save_binary?
  3. Can a GPU help?
  4. Can multiple machines' CPUs (via something like cli-socket), each with several GPUs, work together to speed up save_binary and train?
  5. Is there a way to add extra columns to an existing bin file produced by save_binary?
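For reference, this is roughly the configuration I am running; the file names, the "target" label column and the parameter values below are placeholders, not my real setup:

```
# minimal sketch of the save_binary setup (file names, the "target" label column
# and the parameter values are placeholders)
task = save_binary        # convert the text data to a LightGBM binary dataset
data = train.csv          # input CSV (TB-scale in my case)
header = true             # the CSV has a header row
label_column = name:target
max_bin = 255             # fewer bins -> faster binning and a smaller .bin file
num_threads = 16          # roughly the number of physical cores on the machine
two_round = true          # don't map the whole file into memory; helps when data >> RAM
```

Run with `lightgbm config=save_binary.conf`; as far as I can tell, the result is written next to the input as train.csv.bin.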

Thanks for your help!

Wil

Briefly explain your feature proposal.

Speed up save_binary and train for huge files (terabytes of data). It works as of today, but it takes days to weeks.

Why is it useful to have this feature in the LightGBM project?

Many problems have huge datasets, even after reduction techniques.

Detailed description of the new feature.

Being able to handle huge datasets faster than today with the CLI and C/C++ API.
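For the C/C++ side, this is the path I am using, sketched here with placeholder file names and parameters:

```c
#include <stdio.h>
#include <LightGBM/c_api.h>

/* Sketch of the C API equivalent of "task = save_binary": build a Dataset
 * from a CSV and save it as a .bin file. File names and parameter values
 * are placeholders. */
int main(void) {
  DatasetHandle dataset = NULL;
  const char* params = "header=true label_column=name:target max_bin=255 two_round=true";

  if (LGBM_DatasetCreateFromFile("train.csv", params, NULL, &dataset) != 0) {
    fprintf(stderr, "dataset creation failed: %s\n", LGBM_GetLastError());
    return 1;
  }
  if (LGBM_DatasetSaveBinary(dataset, "train.bin") != 0) {
    fprintf(stderr, "saving binary failed: %s\n", LGBM_GetLastError());
    LGBM_DatasetFree(dataset);
    return 1;
  }
  LGBM_DatasetFree(dataset);
  return 0;
}
```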

Environment information

I'm using Windows 10 and Windows Server with the latest LightGBM code (CLI, and the C/C++ API called from C#).

Thanks

Wil

ref: https://lightgbm.readthedocs.io/en/latest/Features.html#optimization-in-distributed-learning
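Regarding question 4, the linked page describes data-parallel distributed learning; my understanding is that the setup looks roughly like this (machine addresses and port are placeholders, and every machine runs the same config on its own partition of the data):

```
# sketch of a data-parallel distributed training setup (placeholders only)
tree_learner = data
num_machines = 2
machine_list_filename = mlist.txt   # one "ip port" line per machine
local_listen_port = 12400
```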
