Parquet files cannot be read from pre-signed S3 URLs due to S3 forbidding HTTP HEAD #18186
Comments
Potentially related to #17864.
Additional context: the provided examples are from MinIO with s3v4 signatures; we tested AWS S3 presigned URLs as well and hit the same issue.
I'm getting a 400 error with urllib just the same. Using Azure, I can make a SAS link, which seems to be the equivalent of a presigned S3 URL, and it works. For reference, that looks like pl.read_parquet("https://ACCOUNT.blob.core.windows.net/CONTAINER/PATH/part-00001-801b155e-34dd-49b9-9271-afc51ea67548-c000.zstd.parquet?secret_query_params")
As was pointed out, this affects both S3 and MinIO. S3 signed URLs are not the same as Azure's, especially depending on the signature algorithm selected. The issue was raised specifically for the S3 API, so let's focus on that :)
So - apologies for the confusion, I've generated a new S3 presigned URL for you, valid for 12 hours. In case you don't get here before it expires, I can generate one with a week's expiration later, but it is much easier if you create them yourself to try out; you just need an S3 bucket with write permissions :)
@nameexhaustion can you take a look?
This error is due to presigned URLs not allowing the HTTP HEAD method; it can also be reproduced in the terminal. There is a hack that can be used to retrieve the length using a GET request.
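The exact request used for the hack is not shown in the comment above. One common way to learn an object's size without HEAD, which may or may not be what was meant, is a ranged GET: ask for only the first byte and read the total from the `Content-Range` response header. The sketch below assumes the server honors `Range` requests (S3 does); the helper names `total_size_from_content_range` and `object_size_via_get` are made up for illustration and are not part of Polars or object_store.

```python
import urllib.request
from typing import Optional


def total_size_from_content_range(header: str) -> Optional[int]:
    """Parse a header such as 'bytes 0-0/12345' and return the total (12345).

    Returns None when the server reports an unknown total ('*').
    """
    unit, _, rest = header.strip().partition(" ")
    if unit != "bytes" or "/" not in rest:
        raise ValueError(f"unexpected Content-Range: {header!r}")
    total = rest.rsplit("/", 1)[1]
    return None if total == "*" else int(total)


def object_size_via_get(url: str, timeout: float = 30.0) -> Optional[int]:
    """Request only the first byte with GET and read the size from Content-Range."""
    # Assumption: the presigned URL was signed for GET, so this request is allowed.
    request = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_range = response.headers.get("Content-Range")
        # 206 Partial Content means the server honored the Range header.
        if response.status != 206 or content_range is None:
            raise RuntimeError("server did not honor the Range request")
        return total_size_from_content_range(content_range)
```

Calling `object_size_via_get(presigned_url)` with a GET-signed URL should return the object length in bytes, which is the piece of information the failing HEAD request was meant to provide.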
Seems like it'd be better to raise the issue with object_store than try to patch it in polars when they'll eventually want the same behavior, no?
Just checked: GET and HEAD cannot be combined in one signature, so the workaround is having two URLs, which of course won't fit into read_parquet :/
Checks
Reproducible example
Log output
Issue description
Polars is able to read data directly from a URL; however, for some reason it cannot read S3 presigned URLs, even though no authentication is required. Loading the same URL in pandas, or even in my Chrome browser, works fine.
When I use a presigned URL from Azure I have no problems.
If I use urllib to read the bytes and provide the bytes to Polars, it works just fine. Example:
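The original snippet did not survive on this page, so the following is only a minimal sketch of the workaround described: download the object with a plain GET via urllib, then hand the bytes to Polars. The helper names `fetch_bytes` and `read_parquet_via_get` are hypothetical, and the URL you pass in is assumed to be a presigned URL signed for GET.

```python
import io
import urllib.request


def fetch_bytes(url: str, timeout: float = 60.0) -> bytes:
    """Download the whole object with a single GET request (no HEAD involved)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def read_parquet_via_get(url: str):
    """Hypothetical helper: fetch the file first, then let Polars parse the bytes."""
    # Imported here so fetch_bytes stays usable without Polars installed.
    import polars as pl

    return pl.read_parquet(io.BytesIO(fetch_bytes(url)))
```

This loads the entire file into memory before parsing, so it sidesteps the HEAD request at the cost of losing any partial or lazy reads.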
Expected behavior
Expected behavior is that no exception is raised when reading the following URL:
Installed versions