Write tests for parallel code #3841

Open
StrikerRUS opened this issue Jan 24, 2021 · 8 comments

@StrikerRUS
Collaborator

I'm marking this issue with the question label, but I hope to convert it to a feature request after some discussion.

Right now LightGBM lacks any tests for the parallel code (located in the network folder). The MPI job that runs on some of our CIs simply executes the ordinary tests with the serial single-machine tree learner, but with the dynamic library compiled with MPI support. In other words, we only check that the MPI code compiles.

The recently added Dask module is getting more and more tests, which is very good, but those tests are

  • run only on Linux so far;
  • limited to Dask functionality;
  • not very clear about errors, in terms of what is going wrong: a failure in Dask internals or in the underlying LightGBM code.

I believe it would be good to write some basic tests that cover the low-level LightGBM C++ code. Ideally, they should cover both the socket and MPI implementations.

Adding such tests will help improve standalone LightGBM parallel training and, consequently, the Dask package built on top of it.

Linking #261.

Refer to #3839 (comment) for test example.

@StrikerRUS
Collaborator Author

OK, I see the "thumbs-up" reactions, so based on them I'm marking this issue with the "feature request" label 🙂

@StrikerRUS StrikerRUS changed the title [RFC] Need tests for parallel code Write tests for parallel code Jan 26, 2021
@StrikerRUS
Collaborator Author

Closed in favor of #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@jmoralez
Collaborator

Hi. I recently used the CLI interface to do "distributed" training on my local machine: I set the machine list to 127.0.0.1 with different ports, created two config files, and ran the lightgbm binary in two terminals. Would this be something along those lines?
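
For reference, here is a minimal sketch of that kind of setup (not the exact files used here; the ports, file names, and pre-partitioned data files are assumptions, and the lightgbm CLI binary is assumed to be on PATH):

```python
# Sketch: "distributed" LightGBM training on one machine via the CLI,
# with both "machines" on 127.0.0.1 and different ports.
import subprocess
from pathlib import Path

ports = [12400, 12401]
Path("mlist.txt").write_text("".join(f"127.0.0.1 {p}\n" for p in ports))

procs = []
for rank, port in enumerate(ports):
    conf = Path(f"train{rank}.conf")
    conf.write_text(
        "task = train\n"
        "objective = binary\n"
        f"data = train{rank}.txt\n"          # this machine's partition of the training data
        "tree_learner = data\n"              # data-parallel; feature- and voting-parallel also exist
        f"num_machines = {len(ports)}\n"
        "machine_list_filename = mlist.txt\n"
        f"local_listen_port = {port}\n"
        "pre_partition = true\n"
        f"output_model = model{rank}.txt\n"
    )
    # equivalent to running `lightgbm config=train0.conf` and
    # `lightgbm config=train1.conf` in two separate terminals
    procs.append(subprocess.Popen(["lightgbm", f"config={conf}"]))

assert all(p.wait() == 0 for p in procs)
```

Each process waits for all num_machines peers to connect before training starts, so the two runs have to be launched concurrently (via Popen here, or via two terminals as described above).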

@jameslamb
Collaborator

Yep exactly! We'd want a way to do that automatically in CI, but would support a PR that even just adds the tests and proposes a way to easily run and extend them.

@jmoralez
Collaborator

jmoralez commented Apr 4, 2021

I've been trying this out, and currently I have a Python script that uses threads to spawn the machines from the CLI. I'm using the data and configurations provided in examples/parallel_learning. I have a couple of questions:

  1. Is this a good approach?
  2. Would the CI stuff run in github actions or on azure?
  3. Would we install the python library for testing? Currently I'm only checking that the model file and the predictions exist. With the python library we could load the booster and do some more advanced checks.

@jameslamb
Collaborator

Would we install the python library for testing?

It's fine to install the Python library as part of such a test.
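
For example (a hypothetical check, assuming the model file and data layout from a CLI run like the one sketched above, with the label in the first column), the test could then go beyond checking that output files exist:

```python
import numpy as np
import lightgbm as lgb

# Load the model file written by the CLI run and inspect the trained model.
booster = lgb.Booster(model_file="model0.txt")
assert booster.num_trees() > 0

# For a binary objective, predictions should be valid probabilities.
X = np.loadtxt("train0.txt")[:, 1:]   # drop the label column
preds = booster.predict(X)
assert np.all((preds >= 0) & (preds <= 1))
```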

Would the CI stuff run in github actions or on azure?

I think it would be good to target Azure DevOps for such tests. Currently, this project's Python-package tests on GitHub Actions (https://github.com/microsoft/LightGBM/blob/d517ba12f2e7862ac533908304dddbd770655d2b/.github/workflows/python_package.yml) run only on Mac. This was done because of some limitations with Mac builds on Azure.

Please only add a single job, in the Linux_latest section of .vsts-ci.yml (the "- job: Linux_latest" entry). Please only use the socket-based build (the default). Running MPI-based training in tests would be great, but it will be more involved and shouldn't be part of a first PR for this issue.

This project's Azure capacity is limited, so I'd appreciate it if you start by testing on your own fork while you develop this. You can comment out most of https://github.com/microsoft/LightGBM/blob/d517ba12f2e7862ac533908304dddbd770655d2b/.vsts-ci.yml when testing on your own fork. You can also take advantage of the fact that all of those jobs run in the public ubuntu-latest container, and consider testing locally.

If we find out through your eventual submission that the tests take a very long time to run or are flaky, then we might ask you to move them to GitHub Actions and introduce them as tests that only run when manually triggered by a comment, instead of on every commit.

Is this a good approach?

It sounds reasonable to me, but it's hard to say without seeing the code. If you prepare a pull request, please consider how to make the tests extensible. For example, the data in examples/parallel_learning is a good first step, but we'll also want to be able to add tests in future PRs for some of the strange situations you've probably encountered working with lightgbm.dask, such as "what happens when only one worker has training data" or "what happens when you have features whose distributions have zero or very little overlap between partitions" (#4026).
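
To illustrate one possible way to keep such tests extensible (purely a sketch, not code from this thread; run_cli_training is a hypothetical helper that would write per-machine data files and configs and launch the CLI processes, as in the snippet earlier in this thread), the tests could be parametrized over data-partitioning scenarios:

```python
import numpy as np
import pytest


def partition(X, y, scenario, num_machines=2):
    """Split (X, y) across machines according to a named scenario."""
    if scenario == "balanced":
        idx = np.array_split(np.arange(len(y)), num_machines)
    elif scenario == "one_machine_has_all_data":
        idx = [np.arange(len(y))] + [np.array([], dtype=int)] * (num_machines - 1)
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return [(X[i], y[i]) for i in idx]


@pytest.mark.parametrize("scenario", ["balanced", "one_machine_has_all_data"])
def test_distributed_training(scenario, tmp_path):
    rng = np.random.default_rng(42)
    X = rng.random((1_000, 5))
    y = (X[:, 0] > 0.5).astype(int)
    # run_cli_training is a hypothetical helper, not part of LightGBM: it would
    # write one data file and one config per machine into tmp_path, launch the
    # CLI processes, and return the paths of the produced model files.
    model_files = run_cli_training(partition(X, y, scenario), tmp_path)
    assert all(path.exists() for path in model_files)
```

New scenarios (e.g. features with little overlap between partitions, as in #4026) could then be added by extending partition() and the parametrize list.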

@jmoralez
Collaborator

Hi @jameslamb. I've been working on this and have run a simple test in my fork. The relevant file is here and the workflow run is here.

The workflow right now builds the lightgbm binary, installs the python package and uses pytest to run the tests in that file.

I'd appreciate your feedback when you have time.

@jameslamb
Collaborator

Thanks so much for doing this, @jmoralez! Really, really nice implementation. The way you set up the tests looks like it would be very manageable to extend in the future.

I'm going to re-open this issue since you're doing work on this.

Could you open a draft PR with your changes? I can help there with other things like how to integrate with the rest of our CI structure.

@jameslamb jameslamb reopened this May 1, 2021
@jameslamb jameslamb removed their assignment Dec 18, 2023