
Remove pandas upper limit now that SQLA is 1.4+ #22162

Merged Mar 21, 2022 (1 commit)

Conversation

dstandish
Contributor

No description provided.

@ashb ashb requested a review from potiuk March 10, 2022 19:25
@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Mar 10, 2022
@github-actions

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@ashb
Member

ashb commented Mar 10, 2022

Hmmm something is still keeping it at pandas v1.3.5. I guess it's still using the constraints from main for this build?

@potiuk
Member

potiuk commented Mar 10, 2022

> Hmmm something is still keeping it at pandas v1.3.5. I guess it's still using the constraints from main for this build?

The "Eager upgrade" builds running when setup.py and setup.cfg change (or run in main) do not use constraints at all.

My best guess is some of the dependencies in the google provider or apache-beam - but this is more intuition than certainty. Unfortunately (and @uranusjr might have some other tricks which I am not aware of), it may not be easy to tell whether the limit comes from an extra (or even a transitive extra, as happens), because pip only applies an extra's constraints during installation, and any metadata connected to extra limitations is not available for inspection after installation is complete.

The way to check it (at least what I use):

  1. Check with pipdeptree - if this is not an extra-originated limit, you will see it there. Run the CI image:

docker run -it ghcr.io/apache/airflow/main/ci/python3.7:2276fd417180d89da60d661023bf9b48d6b71c7f

Then run:

pipdeptree | less

and look for pandas.

  2. You can also take a look at the pip logs (starting here: https://github.com/apache/airflow/runs/5501411485?check_suite_focus=true#step:10:2520) and try to guess from them where the limitation came from. So far, though, they did not contain information that would let you determine it easily. A change is coming in pip to make the log more useful for tracking the source of conflicts (Present found conflicts when discarding some criterion pypa/pip#10937), and I hope it's going to help a lot, but for now I could not really figure out much from the log.

  3. Guess (intelligently). There are a few big packages with lots of dependencies - Apache Beam is one, and a few other Google and Amazon dependencies are the usual culprits - so the problem space is usually much smaller than the 580 dependencies we have.

What's helpful here is https://pypi.org/pypi/PACKAGE/json and specifically https://pypi.org/pypi/PACKAGE/VERSION/json, which contains the package metadata. But when it comes to extras and transitive extras it's more of a manual effort than automation - I think it's rather difficult to fully replicate what pip does (because of the complexity of qualifiers, for example) - but it is useful for manual verification, since you do not have to download and unpack the package.
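For one-off checks, that JSON API lookup can be scripted. A rough sketch (not from the thread - the helper names `fetch_requires_dist` and `pandas_limits` are mine): a release's declared dependencies live in the `requires_dist` list under the `info` key of PyPI's JSON response.

```python
import json
import re
from urllib.request import urlopen


def pandas_limits(requires_dist):
    """Filter a requires_dist list down to the entries that constrain pandas."""
    out = []
    for req in requires_dist or []:  # requires_dist may be null in the API response
        # The requirement name is the leading run of name characters,
        # e.g. 'pandas<2.0,>=1.0 ; extra == "gcp"' -> "pandas".
        m = re.match(r"[A-Za-z0-9._-]+", req.strip())
        if m and m.group(0).lower() == "pandas":
            out.append(req)
    return out


def fetch_requires_dist(package, version):
    """Fetch the declared dependencies of a release from PyPI's JSON API."""
    url = f"https://pypi.org/pypi/{package}/{version}/json"
    with urlopen(url) as resp:
        return json.load(resp)["info"]["requires_dist"]


if __name__ == "__main__":
    # Example: see whether a given apache-beam release pins pandas.
    print(pandas_limits(fetch_requires_dist("apache-beam", "2.37.0")))
```

As the comment above notes, extras still need manual reading - any `extra == "..."` marker in the returned entries is only applied when that extra is requested.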

@potiuk
Member

potiuk commented Mar 11, 2022

Another option (probably easiest):

Check which library uses pandas in the CI Airflow image:

  1. Run the image:
docker run -it ghcr.io/apache/airflow/main/ci/python3.7:2276fd417180d89da60d661023bf9b48d6b71c7f

  2. Find all packages importing pandas:

cd /usr/local/lib/python3.7/site-packages
grep -rE "from pandas|import pandas" . 2>/dev/null | grep -v "pycache" | grep ".py:" | cut -d"/" -f2 | sort | uniq

Result (short list of candidates to check):

apache_beam
azure
dask
db_dtypes
distributed
elasticsearch
fsspec
future
google
hdfs
influxdb_client
pandas
pandas_gbq
partd
pyarrow
pydruid
pyexasol
pyspark
redshift_connector
scrapbook
snowflake
tqdm

@potiuk
Member

potiuk commented Mar 16, 2022

Hey @ashb @dstandish @kaxil @jedcunningham - can you please rebase the PR?

I took a deeper look and the reason why pandas is held back is rather straightforward - Pandas 1.4+ does not support Python 3.7 any more 😱

So there is no way it can be upgraded for 3.7.

The PR has the full tests needed label, so we should see whether Python 3.8+ gets Pandas updated.

@potiuk potiuk closed this Mar 20, 2022
@potiuk potiuk reopened this Mar 20, 2022
@potiuk
Member

potiuk commented Mar 20, 2022

Reopened to rebuild and check for 3.8/3.9 pandas upgrade.

@potiuk
Member

potiuk commented Mar 21, 2022

Yep, as expected: Pandas was upgraded to 1.4 on Python 3.8+. Only Python 3.7 still uses the old Pandas, because Pandas 1.4 requires Python 3.8+.

@potiuk potiuk merged commit cef004d into apache:main Mar 21, 2022
@ephraimbuddy ephraimbuddy added the type:misc/internal Changelog: Misc changes that should appear in change log label Apr 11, 2022