[Bug]: Python 3.10 installing apache-beam==2.43.0 with multiprocess>=0.70.12 has incompatibility for dill #24458

Open
rpeloff-id opened this issue Dec 1, 2022 · 20 comments
Labels
bug dependencies Pull requests that update a dependency file P2 python

Comments

@rpeloff-id

rpeloff-id commented Dec 1, 2022

What happened?

On Python 3.10 it is not possible to install apache-beam==2.43.0 together with multiprocess. Python 3.10 is only supported by multiprocess>=0.70.12, which requires dill>=0.3.4, and that conflicts with the apache-beam requirement of dill>=0.3.1.1,<0.3.2.

These libraries are used together, for example, in the datasets library.

Is there a specific reason for the apache-beam version requirement on the dill package? If not, maybe it could be updated to fix the issue for Python 3.10? Although I have not tested it, the same issue might apply to Python 3.9, which is only supported by multiprocess>=0.70.11, which in turn requires dill>=0.3.3.
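The conflict described above can be sketched in a few lines of Python. This is a rough illustration only: the release list is hand-picked rather than fetched from PyPI, the helper names are hypothetical, and comparing integer tuples is a simplification of real PEP 440 version matching (it ignores pre-releases and similar cases).

```python
# Illustrative, hand-picked list of dill releases (an assumption, not fetched from PyPI).
DILL_RELEASES = ["0.3.1.1", "0.3.2", "0.3.3", "0.3.4", "0.3.5", "0.3.5.1", "0.3.6"]


def as_tuple(version):
    """Turn '0.3.1.1' into (0, 3, 1, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def beam_allows(version):
    # apache-beam==2.43.0 declares: dill>=0.3.1.1,<0.3.2
    return as_tuple("0.3.1.1") <= as_tuple(version) < as_tuple("0.3.2")


def multiprocess_allows(version):
    # multiprocess>=0.70.12 declares: dill>=0.3.4
    return as_tuple(version) >= as_tuple("0.3.4")


# Versions that satisfy both constraints at once.
compatible = [v for v in DILL_RELEASES if beam_allows(v) and multiprocess_allows(v)]
print(compatible)  # prints []
```

The two ranges do not overlap, so the result is empty for any release list, which is why pip cannot resolve the pair.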

Issue Priority

Priority: 2

Issue Component

Component: dependencies

@github-actions github-actions bot added dependencies Pull requests that update a dependency file P2 labels Dec 1, 2022
@rpeloff-id
Author

rpeloff-id commented Dec 1, 2022

Just saw this:

beam/sdks/python/setup.py

Lines 235 to 240 in 2e71061

# Dill doesn't have forwards-compatibility guarantees within minor
# version. Pickles created with a new version of dill may not unpickle
# using older version of dill. It is best to use the same version of
# dill on client and server, therefore list of allowed versions is very
# narrow. See: https://github.com/uqfoundation/dill/issues/341.
'dill>=0.3.1.1,<0.3.2',

Handing this issue over to the multiprocess team since I think it is more relevant there: uqfoundation/multiprocess#125

However, I do wonder what the plan is for apache-beam as more libraries start to depend on newer versions of dill?

@kennknowles
Member

@tvalentyn

@kennknowles
Member

Is dill used in coders, and hence a very high risk of trouble? Or is it primarily or only used for pipeline serialization, and hence easy to match between client and server?

@tvalentyn
Contributor

We are attempting to vendor dill in #23870.

@tvalentyn
Contributor

Dill is used for pipeline serialization, not in coders.

@tvalentyn
Contributor

As far as Beam is concerned, Dill's version at startup and at runtime should match.

@AnandInguva
Contributor

cc: @ryanthompson591

@rpeloff-id
Author

rpeloff-id commented Dec 8, 2022

This is actually still an issue for me, since I am trying to develop a package for Python >=3.9 that has both apache-beam and multiprocess in the install_requires of setuptools.setup. The constraint on dill cannot be relaxed for either package, so I cannot build this package. Any guidance on how to solve this? This would be easy if pip had a simple way to relax constraints in cases like this, but pypa/pip#8076 ...
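When debugging a conflict like this, it can help to list which installed distributions declare a requirement on dill. The following is a minimal sketch using the standard-library importlib.metadata module (Python 3.8+); the function name requirements_on is hypothetical, and the name matching is a simple heuristic rather than a full requirement parser.

```python
import re
from importlib import metadata


def requirements_on(target):
    """Map each installed distribution to its declared requirements on `target`."""
    found = {}
    for dist in metadata.distributions():
        # dist.requires is None when a distribution declares no requirements.
        for req in dist.requires or []:
            # Heuristic: the requirement name is the leading run of name characters.
            match = re.match(r"[A-Za-z0-9._-]+", req)
            if match and match.group(0).lower().replace("_", "-") == target:
                found.setdefault(dist.metadata["Name"], []).append(req)
    return found


# Print every installed package that constrains dill, with its requirement string.
for name, reqs in sorted(requirements_on("dill").items(), key=lambda kv: str(kv[0])):
    print(name, reqs)
```

In an environment with both packages installed, this should show the apache-beam and multiprocess entries side by side, which makes the incompatible ranges easy to spot.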

@rpeloff-id
Author

The only solution I have found is to pin the multiprocess dependency:

setuptools.setup(
    ...,
    install_requires=["apache-beam", "multiprocess==0.70.9", ...],
)

But this would mean that the package is highly constrained with respect to multiprocess. It also means that for Python >= 3.9 we require a two-step installation, see uqfoundation/multiprocess#125 (comment):

pip install -e .
pip install --no-deps "multiprocess>=0.70.11"

@tvalentyn
Contributor

Unfortunately, right now you are dealing with two packages that have tight constraints, and I understand the inconvenience. We plan to address the inconvenience on our side by vendoring dill, but that will take some time. I don't see a clean solution right now, but it sounds like installing a newer version of multiprocess, while ignoring its constraints, would work for you. You can ignore Beam's constraint on dill, but if you do so, you need to make sure you install the same version of dill on the workers, or your pipelines will fail.
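One way to guard against a mismatch between the submission environment and the workers is to record the dill version where the pipeline is built and compare it where the code runs. The sketch below uses only the standard-library importlib.metadata module (Python 3.8+); the helper names are hypothetical and this is not a Beam API. How the expected version reaches the workers is left open here.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_same_version(package, expected):
    """Raise RuntimeError if the locally installed version differs from `expected`."""
    actual = installed_version(package)
    if actual != expected:
        raise RuntimeError(
            f"{package} mismatch: submission environment had {expected!r}, "
            f"this environment has {actual!r}"
        )
    return actual


# At submission time: record the version used to pickle the pipeline.
submitted_with = installed_version("dill")

# On a worker: compare against the recorded value before unpickling anything.
check_same_version("dill", submitted_with)
```

Failing early with a clear message is usually easier to diagnose than an unpickling error deep inside a running pipeline.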

@davidcavazos
Contributor

I have a similar issue. I'm trying to install HuggingFace datasets, which depends on multiprocess, so switching to multiprocessing is not an option. Right now dill is pinned to 0.3.1.1 in Beam, which is from 2019, while datasets 2.x is from 2022 and pins the latest available version of multiprocess, so it's impossible to make them work together.

Could it be an option to pin each Beam release to the latest available version of dill, similar to what multiprocess does? Beam would then also benefit from any improvements and bug fixes they make. Since both server and workers would have the same Beam version, they should also have the same dill version, right?

@tvalentyn
Contributor

Pinning the latest version is possible, and we do that for cloudpickle, but currently Beam is incompatible with the latest version of dill as far as I know. There is ongoing work in #23870 to vendor dill, which would remove the tight bound.

@davidcavazos
Contributor

Are there any updates on this? This is currently blocking me from updating some samples to include RunInference pipelines with PyTorch models.

@tvalentyn
Contributor

We plan to update to the next version of dill before the next Beam release. I am now looking into the issue.

@AnandInguva
Contributor

An update on dill is posted in #22893 (comment).

@tealgreen0503

Is there any update on this issue?

@tvalentyn
Contributor

There has been no update since the last one in #22893 (comment). The current course of action is to resolve adoption blockers for cloudpickle.

@henrytomsf

Has there been an update on this?

@tvalentyn
Contributor

tvalentyn commented May 29, 2024

No significant update. https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session mentions current workarounds. Did any of those work for you, @henrytomsf?

@janheinrichmerker janheinrichmerker mentioned this issue Jun 27, 2024
4 tasks
@BigBird01

pip install -U multiprocess solved my problem


8 participants