[dask] Support more operating systems #3782

Closed
StrikerRUS opened this issue Jan 18, 2021 · 4 comments

Comments

@StrikerRUS
Collaborator

Summary

The Dask interface in https://github.com/microsoft/LightGBM/blob/706f2af7badc26f6ec68729469ec6ec79a66d802/python-package/lightgbm/dask.py currently supports only Linux.

It should also support the other operating systems that LightGBM itself supports (macOS and Windows).

Motivation

Adding this feature would allow users to use lightgbm.dask on their favorite OS.

Description

To close this issue, modify the lightgbm.dask codebase so that the following skip directive can be removed:

if not sys.platform.startswith("linux"):
    pytest.skip("lightgbm.dask is currently supported in Linux environments", allow_module_level=True)
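
For illustration only, a possible intermediate step (a sketch, not a change that exists in the repository) would be to relax that guard so it skips only on operating systems that lightgbm.dask does not yet support, for example keeping Windows excluded while macOS support is being validated:

    import sys

    import pytest

    # Hypothetical relaxed guard: skip the Dask test module only on platforms
    # where lightgbm.dask is still unsupported (anything other than Linux/macOS here).
    if sys.platform not in ("linux", "darwin"):
        pytest.skip("lightgbm.dask is not yet supported on this platform", allow_module_level=True)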

@StrikerRUS
Collaborator Author

Closing this, as we use #2302 to track feature requests. Leave a comment below if you'd like to contribute this feature, and we'll be happy to re-open it!

@jameslamb
Collaborator

I tested today whether #3840 would allow use of the Dask package on macOS (based on #3839 (comment)).

I found that most tests from pytest tests/python_package_test/test_dask.py pass, but the ranker tests fail consistently.
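
If it helps anyone trying to reproduce this, the ranker tests can be re-run on their own with pytest's -k filter. A minimal sketch, assuming it is run from the repository root with lightgbm, dask, and distributed installed:

    import pytest

    # Run only the test_ranker parametrizations from the Dask test module.
    # Equivalent to running: pytest -v -k "test_ranker" tests/python_package_test/test_dask.py
    pytest.main(["-v", "-k", "test_ranker", "tests/python_package_test/test_dask.py"])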

==================================== short test summary info ====================================
------------------------------------- Captured stderr call --------------------------------------
[LightGBM] [Fatal] Socket recv error, code: 54
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53872', name: tcp://127.0.0.1:53872, memory: 5, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53872
distributed.worker - WARNING -  Compute Failed
Function:  _train_part
args:      ()
kwargs:    {'model_factory': <class 'lightgbm.sklearn.LGBMRanker'>, 'params': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 1, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 50, 'num_leaves': 20, 'objective': None, 'random_state': 42, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'time_out': 5, 'local_listen_port': 13300, 'tree_learner': 'data_parallel', 'num_threads': 1, 'machines': '127.0.0.1:13300,127.0.0.1:13301', 'num_machines': 2}, 'list_of_parts': [{'data': array([[ 6.99924775e-01, -2.42956264e+00,  1.68371729e+00,
        -9.88911943e-01,  5.45197463e-01,  5.35069719e-01,
         1.84462871e-01,  2.99595912e-01,  3.09936926e-01,
         3.97318115e-01,  4.26788905e-01,  7.99782022e-01,
         3.49386999e-01,  4.68231828e-01,  6.24975987e-01,
         3.77725947e-01,  8.36564813e-
Exception: LightGBMError('Socket recv error, code: 54')

FAILED test_dask.py::test_ranker[None-array] - lightgbm.basic.LightGBMError: Socket recv error...
FAILED test_dask.py::test_ranker[None-dataframe] - lightgbm.basic.LightGBMError: Socket recv e...
FAILED test_dask.py::test_ranker[group1-array] - lightgbm.basic.LightGBMError: Socket recv err...
ERROR test_dask.py::test_ranker[None-array] - asyncio.exceptions.TimeoutError
ERROR test_dask.py::test_ranker[None-dataframe] - asyncio.exceptions.TimeoutError
ERROR test_dask.py::test_ranker[group1-array] - asyncio.exceptions.TimeoutError
================= 3 failed, 33 passed, 1 warning, 3 errors in 175.56s (0:02:55) ================
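
For reference, the "code: 54" in the socket error is the OS errno, and on macOS/BSD that value corresponds to ECONNRESET ("Connection reset by peer"), i.e. the other rank closed the connection mid-training. A quick check, assuming it is run on macOS (errno numbers are platform-specific; the same name maps to 104 on Linux):

    import errno

    # On macOS/BSD, errno 54 maps to ECONNRESET; on Linux, ECONNRESET is 104.
    print(errno.errorcode[54])   # expected on macOS: 'ECONNRESET'
    print(errno.ECONNRESET)      # expected on macOS: 54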
full log of errors
 /Users/jlamb/miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:730>>
timeout = 3

    async def wait_for(fut, timeout, *, loop=None):
        """Wait for the single Future or coroutine to complete, with timeout.

        Coroutine will be wrapped in Task.

        Returns result of the Future or coroutine.  When a timeout occurs,
        it cancels the task and raises TimeoutError.  To avoid the task
        cancellation, wrap it in shield().

        If the wait is cancelled, the task is also cancelled.

        This function is a coroutine.
        """
        if loop is None:
            loop = events.get_running_loop()
        else:
            warnings.warn("The loop argument is deprecated since Python 3.8, "
                          "and scheduled for removal in Python 3.10.",
                          DeprecationWarning, stacklevel=2)

        if timeout is None:
            return await fut

        if timeout <= 0:
            fut = ensure_future(fut, loop=loop)

            if fut.done():
                return fut.result()

            fut.cancel()
            raise exceptions.TimeoutError()

        waiter = loop.create_future()
        timeout_handle = loop.call_later(timeout, _release_waiter, waiter)
        cb = functools.partial(_release_waiter, waiter)

        fut = ensure_future(fut, loop=loop)
        fut.add_done_callback(cb)

        try:
            # wait until the future completes or the timeout
            try:
                await waiter
            except exceptions.CancelledError:
                fut.remove_done_callback(cb)
                fut.cancel()
                raise

            if fut.done():
                return fut.result()
            else:
                fut.remove_done_callback(cb)
                # We must ensure that the task is not running
                # after wait_for() returns.
                # See https://bugs.python.org/issue32751
                await _cancel_and_wait(fut, loop=loop)
>               raise exceptions.TimeoutError()
E               asyncio.exceptions.TimeoutError

../../../../miniconda3/lib/python3.8/asyncio/tasks.py:490: TimeoutError
------------------------------------- Captured stderr setup -------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:53793
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53794
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53794
distributed.worker - INFO -          dashboard at:            127.0.0.1:53796
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53793
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53795
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53795
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-20bbc437-af08-41d9-8e55-aa3dacc21c85/dask-worker-space/worker-penwna_z
distributed.worker - INFO -          dashboard at:            127.0.0.1:53797
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53793
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-9f3bb684-b35f-4201-884d-b60e3e06ba07/dask-worker-space/worker-vvdlshrk
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53794', name: tcp://127.0.0.1:53794, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53794
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53793
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53795', name: tcp://127.0.0.1:53795, memory: 0, processing: 0>
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53795
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53793
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-08c38f16-62b7-11eb-8d62-8c8590957efe
distributed.core - INFO - Starting established connection
------------------------------------- Captured stdout call --------------------------------------
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13280...
[LightGBM] [Info] Binding port 13280 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Listening...
[LightGBM] [Info] Trying to bind port 13281...
[LightGBM] [Info] Binding port 13281 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
------------------------------------- Captured stderr call --------------------------------------
[LightGBM] [Fatal] Socket recv error, code: 54
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53794', name: tcp://127.0.0.1:53794, memory: 5, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53794
distributed.worker - WARNING -  Compute Failed
Function:  _train_part
args:      ()
kwargs:    {'model_factory': <class 'lightgbm.sklearn.LGBMRanker'>, 'params': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 1, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 50, 'num_leaves': 20, 'objective': None, 'random_state': 42, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'time_out': 5, 'local_listen_port': 13280, 'tree_learner': 'data_parallel', 'num_threads': 1, 'machines': '127.0.0.1:13280,127.0.0.1:13281', 'num_machines': 2}, 'list_of_parts': [{'data': array([[-1.35113428,  0.1509421 ,  0.28555344, -0.79690187,  0.96463185,
         0.1120389 ,  0.3978556 ,  0.96947043,  0.86550713,  0.81707207,
         0.25790283,  0.17088759,  0.66864322,  0.92937599,  0.55676289,
         0.57161269,  0.27997909,  0.76949293,  0.18704375,  0.32367924],
       [-1.25614053,  0.6623576 ,  0.253
Exception: LightGBMError('Socket recv error, code: 54')

----------------------------------- Captured stdout teardown ------------------------------------
[LightGBM] [Info] Listening...
----------------------------------- Captured stderr teardown ------------------------------------
distributed.scheduler - INFO - Remove client Client-08c38f16-62b7-11eb-8d62-8c8590957efe
distributed.scheduler - INFO - Remove client Client-08c38f16-62b7-11eb-8d62-8c8590957efe
distributed.scheduler - INFO - Close client connection: Client-08c38f16-62b7-11eb-8d62-8c8590957efe
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:53795
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53795', name: tcp://127.0.0.1:53795, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53795
distributed.scheduler - INFO - Lost all workers
_______________________ ERROR at teardown of test_ranker[None-dataframe] ________________________

loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x13c272d90>

    @pytest.fixture
    def cluster_fixture(loop):
        with cluster() as (scheduler, workers):
>           yield (scheduler, workers)

../../../../miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:521:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../miniconda3/lib/python3.8/contextlib.py:120: in __exit__
    next(self.gen)
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:676: in cluster
    loop.run_sync(
../../../../miniconda3/lib/python3.8/site-packages/tornado/ioloop.py:532: in run_sync
    return future_cell[0].result()
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:739: in disconnect_all
    await asyncio.gather(*[disconnect(addr, timeout, rpc_kwargs) for addr in addresses])
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:735: in disconnect
    await asyncio.wait_for(do_disconnect(), timeout=timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

fut = <Task cancelled name='Task-2027' coro=<disconnect.<locals>.do_disconnect() done, defined at /Users/jlamb/miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:730>>
timeout = 3

    async def wait_for(fut, timeout, *, loop=None):
        """Wait for the single Future or coroutine to complete, with timeout.

        Coroutine will be wrapped in Task.

        Returns result of the Future or coroutine.  When a timeout occurs,
        it cancels the task and raises TimeoutError.  To avoid the task
        cancellation, wrap it in shield().

        If the wait is cancelled, the task is also cancelled.

        This function is a coroutine.
        """
        if loop is None:
            loop = events.get_running_loop()
        else:
            warnings.warn("The loop argument is deprecated since Python 3.8, "
                          "and scheduled for removal in Python 3.10.",
                          DeprecationWarning, stacklevel=2)

        if timeout is None:
            return await fut

        if timeout <= 0:
            fut = ensure_future(fut, loop=loop)

            if fut.done():
                return fut.result()

            fut.cancel()
            raise exceptions.TimeoutError()

        waiter = loop.create_future()
        timeout_handle = loop.call_later(timeout, _release_waiter, waiter)
        cb = functools.partial(_release_waiter, waiter)

        fut = ensure_future(fut, loop=loop)
        fut.add_done_callback(cb)

        try:
            # wait until the future completes or the timeout
            try:
                await waiter
            except exceptions.CancelledError:
                fut.remove_done_callback(cb)
                fut.cancel()
                raise

            if fut.done():
                return fut.result()
            else:
                fut.remove_done_callback(cb)
                # We must ensure that the task is not running
                # after wait_for() returns.
                # See https://bugs.python.org/issue32751
                await _cancel_and_wait(fut, loop=loop)
>               raise exceptions.TimeoutError()
E               asyncio.exceptions.TimeoutError

../../../../miniconda3/lib/python3.8/asyncio/tasks.py:490: TimeoutError
------------------------------------- Captured stderr setup -------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:53833
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53835
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53835
distributed.worker - INFO -          dashboard at:            127.0.0.1:53836
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53833
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-00f16599-3796-426f-a3f5-8ffa270bd8be/dask-worker-space/worker-t_rqdzza
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53834
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53834
distributed.worker - INFO -          dashboard at:            127.0.0.1:53838
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53833
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-18710643-dde6-4930-9765-3bc92ba9c3e9/dask-worker-space/worker-wt1ixqz7
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53835', name: tcp://127.0.0.1:53835, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53835
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53833
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53834', name: tcp://127.0.0.1:53834, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53834
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53833
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-0c7c9134-62b7-11eb-8d62-8c8590957efe
distributed.core - INFO - Starting established connection
------------------------------------- Captured stdout call --------------------------------------
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13290...
[LightGBM] [Info] Binding port 13290 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13291...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Binding port 13291 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
------------------------------------- Captured stderr call --------------------------------------
[LightGBM] [Fatal] Socket recv error, code: 54
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53834', name: tcp://127.0.0.1:53834, memory: 1, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53834
distributed.worker - WARNING -  Compute Failed
Function:  _train_part
args:      ()
kwargs:    {'model_factory': <class 'lightgbm.sklearn.LGBMRanker'>, 'params': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 1, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 50, 'num_leaves': 20, 'objective': None, 'random_state': 42, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'time_out': 5, 'local_listen_port': 13290, 'tree_learner': 'data_parallel', 'num_threads': 1, 'machines': '127.0.0.1:13290,127.0.0.1:13291', 'num_machines': 2}, 'list_of_parts': [{'data':    feature_0  feature_1  feature_2  ...  feature_17  feature_18  feature_19
g                                   ...
5  -1.351134   0.150942   0.285553  ...    0.991169    0.231672    0.942732
5  -1.446128  -0.360473   0.317472  ...    0.696021    0.153896    0.815833
5  -1.351134   0.150942   0.2
Exception: LightGBMError('Socket recv error, code: 54')

----------------------------------- Captured stdout teardown ------------------------------------
[LightGBM] [Info] Listening...
----------------------------------- Captured stderr teardown ------------------------------------
distributed.scheduler - INFO - Remove client Client-0c7c9134-62b7-11eb-8d62-8c8590957efe
distributed.scheduler - INFO - Remove client Client-0c7c9134-62b7-11eb-8d62-8c8590957efe
distributed.scheduler - INFO - Close client connection: Client-0c7c9134-62b7-11eb-8d62-8c8590957efe
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:53835
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53835', name: tcp://127.0.0.1:53835, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53835
distributed.scheduler - INFO - Lost all workers
________________________ ERROR at teardown of test_ranker[group1-array] _________________________

loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x13c0bf130>

    @pytest.fixture
    def cluster_fixture(loop):
        with cluster() as (scheduler, workers):
>           yield (scheduler, workers)

../../../../miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:521:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../miniconda3/lib/python3.8/contextlib.py:120: in __exit__
    next(self.gen)
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:676: in cluster
    loop.run_sync(
../../../../miniconda3/lib/python3.8/site-packages/tornado/ioloop.py:532: in run_sync
    return future_cell[0].result()
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:739: in disconnect_all
    await asyncio.gather(*[disconnect(addr, timeout, rpc_kwargs) for addr in addresses])
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:735: in disconnect
    await asyncio.wait_for(do_disconnect(), timeout=timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

fut = <Task cancelled name='Task-2091' coro=<disconnect.<locals>.do_disconnect() done, defined at /Users/jlamb/miniconda3/lib/python3.8/site-packages/distributed/utils_test.py:730>>
timeout = 3

    async def wait_for(fut, timeout, *, loop=None):
        """Wait for the single Future or coroutine to complete, with timeout.

        Coroutine will be wrapped in Task.

        Returns result of the Future or coroutine.  When a timeout occurs,
        it cancels the task and raises TimeoutError.  To avoid the task
        cancellation, wrap it in shield().

        If the wait is cancelled, the task is also cancelled.

        This function is a coroutine.
        """
        if loop is None:
            loop = events.get_running_loop()
        else:
            warnings.warn("The loop argument is deprecated since Python 3.8, "
                          "and scheduled for removal in Python 3.10.",
                          DeprecationWarning, stacklevel=2)

        if timeout is None:
            return await fut

        if timeout <= 0:
            fut = ensure_future(fut, loop=loop)

            if fut.done():
                return fut.result()

            fut.cancel()
            raise exceptions.TimeoutError()

        waiter = loop.create_future()
        timeout_handle = loop.call_later(timeout, _release_waiter, waiter)
        cb = functools.partial(_release_waiter, waiter)

        fut = ensure_future(fut, loop=loop)
        fut.add_done_callback(cb)

        try:
            # wait until the future completes or the timeout
            try:
                await waiter
            except exceptions.CancelledError:
                fut.remove_done_callback(cb)
                fut.cancel()
                raise

            if fut.done():
                return fut.result()
            else:
                fut.remove_done_callback(cb)
                # We must ensure that the task is not running
                # after wait_for() returns.
                # See https://bugs.python.org/issue32751
                await _cancel_and_wait(fut, loop=loop)
>               raise exceptions.TimeoutError()
E               asyncio.exceptions.TimeoutError

../../../../miniconda3/lib/python3.8/asyncio/tasks.py:490: TimeoutError
------------------------------------- Captured stderr setup -------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:53871
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53872
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53873
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53872
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53873
distributed.worker - INFO -          dashboard at:            127.0.0.1:53875
distributed.worker - INFO -          dashboard at:            127.0.0.1:53874
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53871
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53871
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-f9f733ee-85cf-4673-aff7-bcb0e6065250/dask-worker-space/worker-kpaa_x1q
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-f9dd35fd-92c0-4c01-9cd8-640f39242b98/dask-worker-space/worker-lunpfq03
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53872', name: tcp://127.0.0.1:53872, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53872
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53871
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53873', name: tcp://127.0.0.1:53873, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53873
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53871
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-0ff01fac-62b7-11eb-8d62-8c8590957efe
distributed.core - INFO - Starting established connection
------------------------------------- Captured stdout call --------------------------------------
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13300...
[LightGBM] [Info] Binding port 13300 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13301...
[LightGBM] [Info] Binding port 13301 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
------------------------------------- Captured stderr call --------------------------------------
[LightGBM] [Fatal] Socket recv error, code: 54
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53872', name: tcp://127.0.0.1:53872, memory: 5, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53872
distributed.worker - WARNING -  Compute Failed
Function:  _train_part
args:      ()
kwargs:    {'model_factory': <class 'lightgbm.sklearn.LGBMRanker'>, 'params': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 1, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 50, 'num_leaves': 20, 'objective': None, 'random_state': 42, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'time_out': 5, 'local_listen_port': 13300, 'tree_learner': 'data_parallel', 'num_threads': 1, 'machines': '127.0.0.1:13300,127.0.0.1:13301', 'num_machines': 2}, 'list_of_parts': [{'data': array([[ 6.99924775e-01, -2.42956264e+00,  1.68371729e+00,
        -9.88911943e-01,  5.45197463e-01,  5.35069719e-01,
         1.84462871e-01,  2.99595912e-01,  3.09936926e-01,
         3.97318115e-01,  4.26788905e-01,  7.99782022e-01,
         3.49386999e-01,  4.68231828e-01,  6.24975987e-01,
         3.77725947e-01,  8.36564813e-
Exception: LightGBMError('Socket recv error, code: 54')

----------------------------------- Captured stdout teardown ------------------------------------
[LightGBM] [Info] Listening...
----------------------------------- Captured stderr teardown ------------------------------------
distributed.scheduler - INFO - Remove client Client-0ff01fac-62b7-11eb-8d62-8c8590957efe
distributed.scheduler - INFO - Remove client Client-0ff01fac-62b7-11eb-8d62-8c8590957efe
distributed.scheduler - INFO - Close client connection: Client-0ff01fac-62b7-11eb-8d62-8c8590957efe
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:53873
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53873', name: tcp://127.0.0.1:53873, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53873
distributed.scheduler - INFO - Lost all workers
=========================================== FAILURES ============================================
____________________________________ test_ranker[None-array] ____________________________________

output = 'array'
client = <Client: 'tcp://127.0.0.1:53793' processes=2 threads=2, memory=17.18 GB>
listen_port = 13280, group = None

    @pytest.mark.parametrize('output', ['array', 'dataframe'])
    @pytest.mark.parametrize('group', [None, group_sizes])
    def test_ranker(output, client, listen_port, group):

        X, y, w, g, dX, dy, dw, dg = _create_ranking_data(
            output=output,
            group=group
        )

        # use many trees + leaves to overfit, help ensure that dask data-parallel strategy matches that of
        # serial learner. See https://github.com/microsoft/LightGBM/issues/3292#issuecomment-671288210.
        params = {
            "random_state": 42,
            "n_estimators": 50,
            "num_leaves": 20,
            "min_child_samples": 1
        }
        dask_ranker = lgb.DaskLGBMRanker(
            time_out=5,
            local_listen_port=listen_port,
            tree_learner_type='data_parallel',
            **params
        )
>       dask_ranker = dask_ranker.fit(dX, dy, sample_weight=dw, group=dg, client=client)

test_dask.py:399:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:611: in fit
    return self._fit(
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:454: in _fit
    model = _train(
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:320: in _train
    results = client.gather(futures_classifiers)
../../../../miniconda3/lib/python3.8/site-packages/distributed/client.py:1986: in gather
    return self.sync(
../../../../miniconda3/lib/python3.8/site-packages/distributed/client.py:832: in sync
    return sync(
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils.py:340: in sync
    raise exc.with_traceback(tb)
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils.py:324: in f
    result[0] = yield future
../../../../miniconda3/lib/python3.8/site-packages/tornado/gen.py:735: in run
    value = future.result()
../../../../miniconda3/lib/python3.8/site-packages/distributed/client.py:1851: in _gather
    raise exception.with_traceback(traceback)
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:163: in _train_part
    model.fit(data, label, sample_weight=weight, group=group, **kwargs)
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/sklearn.py:983: in fit
    super().fit(X, y, sample_weight=sample_weight, init_score=init_score, group=group,
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/sklearn.py:629: in fit
    self._Booster = train(params, train_set,
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/engine.py:249: in train
    booster.update(fobj=fobj)
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/basic.py:2620: in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
E   lightgbm.basic.LightGBMError: Socket recv error, code: 54

../../../../miniconda3/lib/python3.8/site-packages/lightgbm/basic.py:110: LightGBMError
------------------------------------- Captured stderr setup -------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:53793
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53794
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53794
distributed.worker - INFO -          dashboard at:            127.0.0.1:53796
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53793
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53795
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53795
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-20bbc437-af08-41d9-8e55-aa3dacc21c85/dask-worker-space/worker-penwna_z
distributed.worker - INFO -          dashboard at:            127.0.0.1:53797
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53793
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-9f3bb684-b35f-4201-884d-b60e3e06ba07/dask-worker-space/worker-vvdlshrk
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53794', name: tcp://127.0.0.1:53794, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53794
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53793
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53795', name: tcp://127.0.0.1:53795, memory: 0, processing: 0>
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53795
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53793
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-08c38f16-62b7-11eb-8d62-8c8590957efe
distributed.core - INFO - Starting established connection
------------------------------------- Captured stdout call --------------------------------------
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13280...
[LightGBM] [Info] Binding port 13280 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Listening...
[LightGBM] [Info] Trying to bind port 13281...
[LightGBM] [Info] Binding port 13281 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
------------------------------------- Captured stderr call --------------------------------------
[LightGBM] [Fatal] Socket recv error, code: 54
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53794', name: tcp://127.0.0.1:53794, memory: 5, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53794
distributed.worker - WARNING -  Compute Failed
Function:  _train_part
args:      ()
kwargs:    {'model_factory': <class 'lightgbm.sklearn.LGBMRanker'>, 'params': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 1, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 50, 'num_leaves': 20, 'objective': None, 'random_state': 42, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'time_out': 5, 'local_listen_port': 13280, 'tree_learner': 'data_parallel', 'num_threads': 1, 'machines': '127.0.0.1:13280,127.0.0.1:13281', 'num_machines': 2}, 'list_of_parts': [{'data': array([[-1.35113428,  0.1509421 ,  0.28555344, -0.79690187,  0.96463185,
         0.1120389 ,  0.3978556 ,  0.96947043,  0.86550713,  0.81707207,
         0.25790283,  0.17088759,  0.66864322,  0.92937599,  0.55676289,
         0.57161269,  0.27997909,  0.76949293,  0.18704375,  0.32367924],
       [-1.25614053,  0.6623576 ,  0.253
Exception: LightGBMError('Socket recv error, code: 54')

__________________________________ test_ranker[None-dataframe] __________________________________

output = 'dataframe'
client = <Client: 'tcp://127.0.0.1:53833' processes=2 threads=2, memory=17.18 GB>
listen_port = 13290, group = None

    @pytest.mark.parametrize('output', ['array', 'dataframe'])
    @pytest.mark.parametrize('group', [None, group_sizes])
    def test_ranker(output, client, listen_port, group):

        X, y, w, g, dX, dy, dw, dg = _create_ranking_data(
            output=output,
            group=group
        )

        # use many trees + leaves to overfit, help ensure that dask data-parallel strategy matches that of
        # serial learner. See https://github.com/microsoft/LightGBM/issues/3292#issuecomment-671288210.
        params = {
            "random_state": 42,
            "n_estimators": 50,
            "num_leaves": 20,
            "min_child_samples": 1
        }
        dask_ranker = lgb.DaskLGBMRanker(
            time_out=5,
            local_listen_port=listen_port,
            tree_learner_type='data_parallel',
            **params
        )
>       dask_ranker = dask_ranker.fit(dX, dy, sample_weight=dw, group=dg, client=client)

test_dask.py:399:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:611: in fit
    return self._fit(
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:454: in _fit
    model = _train(
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:320: in _train
    results = client.gather(futures_classifiers)
../../../../miniconda3/lib/python3.8/site-packages/distributed/client.py:1986: in gather
    return self.sync(
../../../../miniconda3/lib/python3.8/site-packages/distributed/client.py:832: in sync
    return sync(
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils.py:340: in sync
    raise exc.with_traceback(tb)
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils.py:324: in f
    result[0] = yield future
../../../../miniconda3/lib/python3.8/site-packages/tornado/gen.py:735: in run
    value = future.result()
../../../../miniconda3/lib/python3.8/site-packages/distributed/client.py:1851: in _gather
    raise exception.with_traceback(traceback)
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:163: in _train_part
    model.fit(data, label, sample_weight=weight, group=group, **kwargs)
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/sklearn.py:983: in fit
    super().fit(X, y, sample_weight=sample_weight, init_score=init_score, group=group,
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/sklearn.py:629: in fit
    self._Booster = train(params, train_set,
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/engine.py:249: in train
    booster.update(fobj=fobj)
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/basic.py:2620: in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
E   lightgbm.basic.LightGBMError: Socket recv error, code: 54

../../../../miniconda3/lib/python3.8/site-packages/lightgbm/basic.py:110: LightGBMError
------------------------------------- Captured stderr setup -------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:53833
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53835
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53835
distributed.worker - INFO -          dashboard at:            127.0.0.1:53836
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53833
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-00f16599-3796-426f-a3f5-8ffa270bd8be/dask-worker-space/worker-t_rqdzza
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53834
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53834
distributed.worker - INFO -          dashboard at:            127.0.0.1:53838
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53833
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-18710643-dde6-4930-9765-3bc92ba9c3e9/dask-worker-space/worker-wt1ixqz7
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53835', name: tcp://127.0.0.1:53835, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53835
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53833
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53834', name: tcp://127.0.0.1:53834, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53834
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53833
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-0c7c9134-62b7-11eb-8d62-8c8590957efe
distributed.core - INFO - Starting established connection
------------------------------------- Captured stdout call --------------------------------------
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13290...
[LightGBM] [Info] Binding port 13290 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13291...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Binding port 13291 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
------------------------------------- Captured stderr call --------------------------------------
[LightGBM] [Fatal] Socket recv error, code: 54
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53834', name: tcp://127.0.0.1:53834, memory: 1, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53834
distributed.worker - WARNING -  Compute Failed
Function:  _train_part
args:      ()
kwargs:    {'model_factory': <class 'lightgbm.sklearn.LGBMRanker'>, 'params': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 1, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 50, 'num_leaves': 20, 'objective': None, 'random_state': 42, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'time_out': 5, 'local_listen_port': 13290, 'tree_learner': 'data_parallel', 'num_threads': 1, 'machines': '127.0.0.1:13290,127.0.0.1:13291', 'num_machines': 2}, 'list_of_parts': [{'data':    feature_0  feature_1  feature_2  ...  feature_17  feature_18  feature_19
g                                   ...
5  -1.351134   0.150942   0.285553  ...    0.991169    0.231672    0.942732
5  -1.446128  -0.360473   0.317472  ...    0.696021    0.153896    0.815833
5  -1.351134   0.150942   0.2
Exception: LightGBMError('Socket recv error, code: 54')

___________________________________ test_ranker[group1-array] ___________________________________

output = 'array'
client = <Client: 'tcp://127.0.0.1:53871' processes=1 threads=1, memory=8.59 GB>
listen_port = 13300, group = [5, 5, 5, 10, 10, 10, ...]

    @pytest.mark.parametrize('output', ['array', 'dataframe'])
    @pytest.mark.parametrize('group', [None, group_sizes])
    def test_ranker(output, client, listen_port, group):

        X, y, w, g, dX, dy, dw, dg = _create_ranking_data(
            output=output,
            group=group
        )

        # use many trees + leaves to overfit, help ensure that dask data-parallel strategy matches that of
        # serial learner. See https://github.com/microsoft/LightGBM/issues/3292#issuecomment-671288210.
        params = {
            "random_state": 42,
            "n_estimators": 50,
            "num_leaves": 20,
            "min_child_samples": 1
        }
        dask_ranker = lgb.DaskLGBMRanker(
            time_out=5,
            local_listen_port=listen_port,
            tree_learner_type='data_parallel',
            **params
        )
>       dask_ranker = dask_ranker.fit(dX, dy, sample_weight=dw, group=dg, client=client)

test_dask.py:399:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:611: in fit
    return self._fit(
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:454: in _fit
    model = _train(
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:320: in _train
    results = client.gather(futures_classifiers)
../../../../miniconda3/lib/python3.8/site-packages/distributed/client.py:1986: in gather
    return self.sync(
../../../../miniconda3/lib/python3.8/site-packages/distributed/client.py:832: in sync
    return sync(
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils.py:340: in sync
    raise exc.with_traceback(tb)
../../../../miniconda3/lib/python3.8/site-packages/distributed/utils.py:324: in f
    result[0] = yield future
../../../../miniconda3/lib/python3.8/site-packages/tornado/gen.py:735: in run
    value = future.result()
../../../../miniconda3/lib/python3.8/site-packages/distributed/client.py:1851: in _gather
    raise exception.with_traceback(traceback)
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:163: in _train_part
    model.fit(data, label, sample_weight=weight, group=group, **kwargs)
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/sklearn.py:983: in fit
    super().fit(X, y, sample_weight=sample_weight, init_score=init_score, group=group,
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/sklearn.py:629: in fit
    self._Booster = train(params, train_set,
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/engine.py:249: in train
    booster.update(fobj=fobj)
../../../../miniconda3/lib/python3.8/site-packages/lightgbm/basic.py:2620: in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
E   lightgbm.basic.LightGBMError: Socket recv error, code: 54

../../../../miniconda3/lib/python3.8/site-packages/lightgbm/basic.py:110: LightGBMError
------------------------------------- Captured stderr setup -------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:53871
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53872
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:53873
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53872
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:53873
distributed.worker - INFO -          dashboard at:            127.0.0.1:53875
distributed.worker - INFO -          dashboard at:            127.0.0.1:53874
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53871
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:53871
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -                Memory:                    8.59 GB
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-f9f733ee-85cf-4673-aff7-bcb0e6065250/dask-worker-space/worker-kpaa_x1q
distributed.worker - INFO -       Local Directory: /Users/jlamb/repos/LightGBM/tests/python_package_test/_test_worker-f9dd35fd-92c0-4c01-9cd8-640f39242b98/dask-worker-space/worker-lunpfq03
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53872', name: tcp://127.0.0.1:53872, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53872
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53871
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:53873', name: tcp://127.0.0.1:53873, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:53873
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:53871
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-0ff01fac-62b7-11eb-8d62-8c8590957efe
distributed.core - INFO - Starting established connection
------------------------------------- Captured stdout call --------------------------------------
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13300...
[LightGBM] [Info] Binding port 13300 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 13301...
[LightGBM] [Info] Binding port 13301 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
------------------------------------- Captured stderr call --------------------------------------
[LightGBM] [Fatal] Socket recv error, code: 54
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:53872', name: tcp://127.0.0.1:53872, memory: 5, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:53872
distributed.worker - WARNING -  Compute Failed
Function:  _train_part
args:      ()
kwargs:    {'model_factory': <class 'lightgbm.sklearn.LGBMRanker'>, 'params': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 1, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 50, 'num_leaves': 20, 'objective': None, 'random_state': 42, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'time_out': 5, 'local_listen_port': 13300, 'tree_learner': 'data_parallel', 'num_threads': 1, 'machines': '127.0.0.1:13300,127.0.0.1:13301', 'num_machines': 2}, 'list_of_parts': [{'data': array([[ 6.99924775e-01, -2.42956264e+00,  1.68371729e+00,
        -9.88911943e-01,  5.45197463e-01,  5.35069719e-01,
         1.84462871e-01,  2.99595912e-01,  3.09936926e-01,
         3.97318115e-01,  4.26788905e-01,  7.99782022e-01,
         3.49386999e-01,  4.68231828e-01,  6.24975987e-01,
         3.77725947e-01,  8.36564813e-
Exception: LightGBMError('Socket recv error, code: 54')

======================================= warnings summary ========================================
test_dask.py::test_errors
  /Users/jlamb/miniconda3/lib/python3.8/site-packages/lightgbm/dask.py:280: RuntimeWarning: coroutine '_wait' was never awaited
    wait(parts)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
==================================== short test summary info ====================================
FAILED test_dask.py::test_ranker[None-array] - lightgbm.basic.LightGBMError: Socket recv error...
FAILED test_dask.py::test_ranker[None-dataframe] - lightgbm.basic.LightGBMError: Socket recv e...
FAILED test_dask.py::test_ranker[group1-array] - lightgbm.basic.LightGBMError: Socket recv err...
ERROR test_dask.py::test_ranker[None-array] - asyncio.exceptions.TimeoutError
ERROR test_dask.py::test_ranker[None-dataframe] - asyncio.exceptions.TimeoutError
ERROR test_dask.py::test_ranker[group1-array] - asyncio.exceptions.TimeoutError
================= 3 failed, 33 passed, 1 warning, 3 errors in 175.56s (0:02:55) =================

I ran these tests on my Mac laptop (macOS 10.14).

  1. installed the package with cd python-package && python setup.py install
  2. commented out the module-level skip at the top of test_dask.py:
     if not sys.platform.startswith("linux"):
         pytest.skip("lightgbm.dask is currently supported in Linux environments", allow_module_level=True)
  3. ran pytest tests/python_package_test/test_dask.py

I tried this three times and only the ranker tests failed.

Some thoughts:

This looks like #3829, but I don't think it's exactly the same issue. The error "Socket recv error, code: 54" is important here.

That error comes from

Log::Fatal("Socket recv error, code: %d", GetLastError());

I'll investigate at some point whether this is related to dask/distributed#3356. I noticed dask/distributed#3356 (comment), which says the following:

the default ulimit for macOS is quite low at 256

I want to try the fixes for that described here, here, and here.
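
If the low macOS open-file limit really is the problem, one quick experiment (just a sketch, not something the tests currently do) is to check and raise the soft limit from inside the test process before the cluster starts:

import resource

# current (soft, hard) limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# raise the soft limit as far as the hard limit allows; macOS defaults to a
# soft limit of 256, which a LocalCluster plus training sockets can exhaust
new_soft = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))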

@StrikerRUS
Collaborator Author

Report from Windows with ea8e47e:

test_dask.py .FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF...                                                                [100%]
E   lightgbm.basic.LightGBMError: Machine list file doesn't contain the local machine

Attaching the log as a file, because GitHub rejected the inline version with:

You can't comment at this time — your comment is too long (maximum is 65536 characters).

Full log:

log.txt
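
That error suggests LightGBM did not find any of the addresses from the machines parameter among the local machine's own addresses. A small diagnostic sketch to compare the two (the machines value below is just an example; substitute the one reported in the attached log):

import socket

hostname = socket.gethostname()
# all IPv4 addresses that the local hostname resolves to
_, _, local_ips = socket.gethostbyname_ex(hostname)
print(f"hostname: {hostname}")
print(f"local addresses: {local_ips}")

# example machines string; replace with the value from the failing run
machines = "127.0.0.1:13300,127.0.0.1:13301"
machine_ips = {entry.split(":")[0] for entry in machines.split(",")}
print(f"addresses LightGBM would need to see locally: {machine_ips}")
print(f"overlap: {machine_ips & set(local_ips)}")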

@jameslamb
Collaborator

Tonight, I ran the Dask tests on my Mac again (based on master as of the latest commit, bcf443b) and got just two failures. I ran the tests twice, with the following commands.

pushd python-package
python setup.py install
popd
pytest tests/python_package_test/test_dask.py

Just like in #3782 (comment), most tests passed and only two test_ranker tests failed.

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================== short test summary info ======================================
FAILED test_dask.py::test_ranker[None-array] - lightgbm.basic.LightGBMError: Socket recv error, co...
FAILED test_dask.py::test_ranker[None-dataframe] - lightgbm.basic.LightGBMError: Socket recv error...
================ 2 failed, 153 passed, 4 skipped, 229 warnings in 706.33s (0:11:46) =================

Both test_ranker failures were with this error:

E lightgbm.basic.LightGBMError: Socket recv error, code: 54

According to https://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/Errors/unix_system_errors.html, error 54 is "Exchange full", but I haven't been able to find a good resource that explains what that means.
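
Note that errno numbers are platform-specific, so that Linux table may not apply to a failure observed on macOS. A quick standard-library check, run on the machine where the failure happened:

import errno
import os

code = 54
# symbolic name and message for errno 54 on the current platform; on Linux
# this is EXFULL ("Exchange full"), while on macOS/BSD it is ECONNRESET
# ("Connection reset by peer")
print(errno.errorcode.get(code), "-", os.strerror(code))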

There are many other tests that use DaskLGBMRanker, so I tried looking at what is different between test_ranker and the others. The main difference is that the test_ranker tests use deeper trees and more iterations: https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_dask.py#L597-L598. So maybe there is some leak in the ranker code that creates file handles or other resources during distributed training and doesn't clean them up? A rough way to check that is sketched below.
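
A minimal sketch of that check, assuming psutil is installed in the worker environment and client is the distributed.Client the test uses (the training call in the middle is a placeholder):

def open_fds():
    # number of file descriptors currently open in the worker process (POSIX only)
    import psutil
    return psutil.Process().num_fds()

# per-worker file-descriptor counts before and after distributed training;
# counts that keep growing across repeated fits would point to a leak
print(client.run(open_fds))
# ... fit a DaskLGBMRanker here ...
print(client.run(open_fds))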
