DaskGeoDataFrame parquet write error - Series object has no attribute total_bounds #138

4andy opened this issue Feb 23, 2024 · 7 comments

4andy commented Feb 23, 2024

Hi - I'm running into an error when trying to write a DaskGeoDataFrame. I'm following the basic pattern here (see also) but using a smaller sample of a point dataset. Everything seems to run as expected until I try to write out the packed file, at which point I encounter the error below.

ALL software version info

pyarrow=15.0.0
spatialpandas=0.4.10
pandas=2.1.1
dask=2024.2.0
python=3.9.16

# Spatially pack the partitions, then write the packed frame to parquet
df = df.pack_partitions(npartitions=df.npartitions, shuffle='disk')
df.to_parquet(save_path)

[screenshots: traceback ending in AttributeError: 'Series' object has no attribute 'total_bounds']

4andy (Author) commented Feb 27, 2024

I was able to get a small file written without error but I still encounter the error with a large dataset.

I re-ran on a different system with pandas 2.2.1, and again with pandas 1.5.3, and encountered the error each time. Any ideas are appreciated. Here is a more complete stack trace:
[screenshot: full stack trace for the AttributeError above]

4andy (Author) commented Feb 27, 2024

If there is only one DataFrame partition, saving works fine; if there is more than one partition, this error is raised.
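
If that observation holds, a possible interim workaround (untested, just a sketch) would be to collapse to a single partition before writing:

df = df.repartition(npartitions=1)  # untested sketch: avoid the multi-partition code path
df.to_parquet(save_path)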

hoxbro (Member) commented Feb 28, 2024

I would guess that this was implemented with fastparquet, which has since been dropped by Dask. Can you try downgrading Dask to something like a 2020 release and see if that works with and without fastparquet?

4andy (Author) commented Feb 28, 2024

Thanks for that idea @hoxbro. I downgraded Dask to a 2020 release, but it returns the same error.

So far in looking into the issue, I've found that any call to df.geometry.total_bounds after df.pack_partitions() raises the error. However, you can call the total_bounds property any number of times before packing partitions and it returns correctly.
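
For reference, the check is essentially this (an illustrative sketch using the same df as in the original snippet):

bounds = df.geometry.total_bounds      # works any number of times before packing
packed = df.pack_partitions(npartitions=df.npartitions, shuffle='disk')
bounds = packed.geometry.total_bounds  # AttributeError: 'Series' object has no attribute 'total_bounds'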

hoxbro (Member) commented Feb 28, 2024

Did you try to set the parquet backend to fastparquet?
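
For reference, with a plain dask DataFrame the backend is selected like this (a sketch; whether spatialpandas' DaskGeoDataFrame.to_parquet forwards the keyword is an assumption):

df.to_parquet(save_path, engine='fastparquet')  # standard dask.dataframe keyword; spatialpandas support is assumed, not verified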

4andy (Author) commented Feb 28, 2024

I did try fastparquet (same error). However, I don't think it's related to that, or to saving directly. Something happens in pack_partitions that causes any future call to the geometry.total_bounds property to fail. It's failing at save because to_parquet calls that property.

4andy (Author) commented Feb 28, 2024

I found a trigger condition for the error: it occurs when one or more longitudes are negative. I attached a simple notebook that reproduces the error. If you change the negative longitude to positive, the error is resolved. Not sure where to look in the code to patch this. Thanks!
sp_error_example.ipynb.txt
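
For anyone who doesn't want to open the notebook, here is a minimal sketch along the same lines (illustrative coordinates, not the original data; it assumes spatialpandas' dask integration so that dd.from_pandas returns a DaskGeoDataFrame):

import dask.dataframe as dd
from spatialpandas import GeoDataFrame
from spatialpandas.geometry import PointArray

# One negative longitude among the points triggers the failure
points = PointArray([(-120.0, 35.0), (10.0, 45.0), (20.0, 55.0), (30.0, 65.0)])
gdf = GeoDataFrame({'geometry': points, 'value': [1, 2, 3, 4]})
ddf = dd.from_pandas(gdf, npartitions=2)  # needs more than one partition

packed = ddf.pack_partitions(npartitions=ddf.npartitions, shuffle='disk')
packed.to_parquet('packed.parquet')  # AttributeError: 'Series' object has no attribute 'total_bounds'
# Changing -120.0 to 120.0 lets the write succeed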
