Write tests for parallel code #3841

Open
StrikerRUS opened this issue Jan 24, 2021 · 8 comments

@StrikerRUS
Collaborator

I'm marking this issue with the question label, but I hope to convert it to a feature request after some discussion.

Right now LightGBM lacks any tests for the parallel code (located in the network folder). The MPI job that runs on some of our CIs simply executes the ordinary tests with the serial single-machine tree learner, but with the dynamic library compiled with MPI support. In other words, we only check that the MPI code compiles.

The recently added Dask module is getting more and more tests, which is very good, but those tests are

  • run only on Linux so far;
  • limited to Dask functionality;
  • not very clear about errors, in terms of what is going wrong: a failure in Dask internals or in the underlying LightGBM code.

I believe it would be good to write some basic tests that cover the low-level LightGBM C++ code. Ideally, they should cover both the socket and MPI implementations.

Adding such tests will help improve standalone LightGBM parallel training and, consequently, the Dask package built on top of it.

Linking #261.

Refer to #3839 (comment) for test example.

@StrikerRUS
Collaborator Author

OK, I see the "thumbs-up" reactions, so based on them I'm marking this issue with the "feature request" label 🙂

@StrikerRUS StrikerRUS changed the title [RFC] Need tests for parallel code Write tests for parallel code Jan 26, 2021
@StrikerRUS
Collaborator Author

Closed in favor of #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@jmoralez
Collaborator

Hi. I recently used the CLI interface to do "distributed" training on my local machine: I set the machine list to 127.0.0.1 with different ports, created two config files, and ran the lightgbm binary in two terminals. Would this be something along those lines?
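
For reference, here is a minimal sketch of that kind of setup (not the exact files used here; the ports, file names, and pre-partitioned data files are assumptions, and the lightgbm CLI binary is assumed to be on PATH):

```python
# Sketch: "distributed" LightGBM training on one machine via the CLI,
# with both "machines" on 127.0.0.1 and different ports.
import subprocess
from pathlib import Path

ports = [12400, 12401]
Path("mlist.txt").write_text("".join(f"127.0.0.1 {p}\n" for p in ports))

procs = []
for rank, port in enumerate(ports):
    conf = Path(f"train{rank}.conf")
    conf.write_text(
        "task = train\n"
        "objective = binary\n"
        f"data = train{rank}.txt\n"          # this machine's partition of the training data
        "tree_learner = data\n"              # data-parallel; feature- and voting-parallel also exist
        f"num_machines = {len(ports)}\n"
        "machine_list_filename = mlist.txt\n"
        f"local_listen_port = {port}\n"
        "pre_partition = true\n"
        f"output_model = model{rank}.txt\n"
    )
    # equivalent to running `lightgbm config=train0.conf` and
    # `lightgbm config=train1.conf` in two separate terminals
    procs.append(subprocess.Popen(["lightgbm", f"config={conf}"]))

assert all(p.wait() == 0 for p in procs)
```

Each process waits for all num_machines peers to connect before training starts, so the two runs have to be launched concurrently (via Popen here, or via two terminals as described above).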

@jameslamb
Collaborator

Yep exactly! We'd want a way to do that automatically in CI, but would support a PR that even just adds the tests and proposes a way to easily run and extend them.

@jmoralez
Collaborator

jmoralez commented Apr 4, 2021

I've been trying this out, and currently I have a Python script that uses threads to spawn the machines from the CLI. I'm using the data and configurations provided in examples/parallel_learning. I have a couple of questions:

  1. Is this a good approach?
  2. Would the CI stuff run in github actions or on azure?
  3. Would we install the python library for testing? Currently I'm only checking that the model file and the predictions exist. With the python library we could load the booster and do some more advanced checks.

@jameslamb
Collaborator

Would we install the python library for testing?

It's fine to install the Python library as part of such a test.
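
For example (a hypothetical check, assuming the model file and data layout from a CLI run like the one sketched above, with the label in the first column), the test could then go beyond checking that output files exist:

```python
import numpy as np
import lightgbm as lgb

# Load the model file written by the CLI run and inspect the trained model.
booster = lgb.Booster(model_file="model0.txt")
assert booster.num_trees() > 0

# For a binary objective, predictions should be valid probabilities.
X = np.loadtxt("train0.txt")[:, 1:]   # drop the label column
preds = booster.predict(X)
assert np.all((preds >= 0) & (preds <= 1))
```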

Would the CI stuff run in github actions or on azure?

I think it would be good to target Azure DevOps for such tests. Currently, this project's Python-package tests on GitHub Actions (https://github.com/microsoft/LightGBM/blob/d517ba12f2e7862ac533908304dddbd770655d2b/.github/workflows/python_package.yml) run only on Mac. This was done because of some limitations with Mac builds on Azure.

Please only add a single job, in the Linux_latest section of .vsts-ci.yml (the "- job: Linux_latest" entry). Please only use the socket-based build (the default). Running MPI-based training in tests would be great, but it will be more involved and shouldn't be part of a first PR for this issue.

This project's Azure capacity is limited, so I'd appreciate it if you start by testing on your own fork while you develop this. You can comment out most of https://github.com/microsoft/LightGBM/blob/d517ba12f2e7862ac533908304dddbd770655d2b/.vsts-ci.yml when testing on your own fork. You can also take advantage of the fact that all of those jobs run in the public ubuntu-latest container, and consider testing locally.

If we find out through your eventual submission that the tests take a very long time to run or are flaky, then we might ask you to move them to GitHub Actions and introduce them as tests that only run when manually triggered by a comment, instead of on every commit.

Is this a good approach?

It sounds reasonable to me, but it's hard to say without seeing the code. If you prepare a pull request, please consider how to make the tests extensible. For example, the data in examples/parallel_learning is a good first step, but we'll also want to be able to add tests in future PRs for some of the strange situations you've probably encountered working with lightgbm.dask, such as "what happens when only one worker has training data" or "what happens when you have features whose distributions have zero or very little overlap between partitions" (#4026).
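
To illustrate one possible way to keep such tests extensible (purely a sketch, not code from this thread; run_cli_training is a hypothetical helper that would write per-machine data files and configs and launch the CLI processes, as in the snippet earlier in this thread), the tests could be parametrized over data-partitioning scenarios:

```python
import numpy as np
import pytest


def partition(X, y, scenario, num_machines=2):
    """Split (X, y) across machines according to a named scenario."""
    if scenario == "balanced":
        idx = np.array_split(np.arange(len(y)), num_machines)
    elif scenario == "one_machine_has_all_data":
        idx = [np.arange(len(y))] + [np.array([], dtype=int)] * (num_machines - 1)
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return [(X[i], y[i]) for i in idx]


@pytest.mark.parametrize("scenario", ["balanced", "one_machine_has_all_data"])
def test_distributed_training(scenario, tmp_path):
    rng = np.random.default_rng(42)
    X = rng.random((1_000, 5))
    y = (X[:, 0] > 0.5).astype(int)
    # run_cli_training is a hypothetical helper, not part of LightGBM: it would
    # write one data file and one config per machine into tmp_path, launch the
    # CLI processes, and return the paths of the produced model files.
    model_files = run_cli_training(partition(X, y, scenario), tmp_path)
    assert all(path.exists() for path in model_files)
```

New scenarios (e.g. features with little overlap between partitions, as in #4026) could then be added by extending partition() and the parametrize list.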

@jmoralez
Collaborator

Hi @jameslamb. I've been working on this and have run a simple test in my fork. The relevant file is here and the workflow run is here.

The workflow right now builds the lightgbm binary, installs the python package and uses pytest to run the tests in that file.

I'd appreciate your feedback when you have time.

@jameslamb
Collaborator

Thanks so much for doing this, @jmoralez! Really, really nice implementation. The way you set up the tests looks like it would be very manageable to extend in the future.

I'm going to re-open this issue since you're doing work on this.

Could you open a draft PR with your changes? I can help there with other things like how to integrate with the rest of our CI structure.

@jameslamb jameslamb reopened this May 1, 2021
@jameslamb jameslamb removed their assignment Dec 18, 2023