
Parquet files cannot be read from pre-signed S3 URLs due to S3 forbidding HTTP HEAD #18186

Closed
2 tasks done
matt035343 opened this issue Aug 14, 2024 · 8 comments · Fixed by #18274
Labels
A-io-cloud Area: reading/writing to cloud storage accepted Ready for implementation bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

@matt035343

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

url = 'https://data-bolt-s3.awsp.sneaksanddata.com/dev/MATA%40ecco.com/dummy.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=CULIE8RJLVA02A82XSKQ%2F20240814%2Feu-central-1%2Fs3%2Faws4_request&X-Amz-Date=20240814T123605Z&X-Amz-Expires=2592000&X-Amz-SignedHeaders=host&X-Amz-Signature=6d0a6fdef31c350e0cd0dd5d1838b0d776e8867f38883d7c5f1084fd0c8159b0'
df = pl.read_parquet(url)

Log output

Traceback (most recent call last):
  File "/opt/pycharm-2023.1.2/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "/home/mata/.cache/pypoetry/virtualenvs/ecco-auto-replenishment-crystal-solver-hg2QoUWq-py3.11/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mata/.cache/pypoetry/virtualenvs/ecco-auto-replenishment-crystal-solver-hg2QoUWq-py3.11/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mata/.cache/pypoetry/virtualenvs/ecco-auto-replenishment-crystal-solver-hg2QoUWq-py3.11/lib/python3.11/site-packages/polars/io/parquet/functions.py", line 208, in read_parquet
    return lf.collect()
           ^^^^^^^^^^^^
  File "/home/mata/.cache/pypoetry/virtualenvs/ecco-auto-replenishment-crystal-solver-hg2QoUWq-py3.11/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: Generic HTTP error: Request error: Client error with status 400 Bad Request: No Body

Issue description

Polars is able to read data directly from a URL; however, it cannot read S3 pre-signed URLs, even though no authentication is required. The same URL loads fine in Pandas and even in my Chrome browser.
When I use a pre-signed URL from Azure, I have no problems.

If I use urllib to read the bytes and provide the bytes to Polars, it works just fine. Example:

import urllib.request

import polars as pl

url = 'https://data-bolt-s3.awsp.sneaksanddata.com/dev/MATA%40ecco.com/dummy.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=CULIE8RJLVA02A82XSKQ%2F20240814%2Feu-central-1%2Fs3%2Faws4_request&X-Amz-Date=20240814T123605Z&X-Amz-Expires=2592000&X-Amz-SignedHeaders=host&X-Amz-Signature=6d0a6fdef31c350e0cd0dd5d1838b0d776e8867f38883d7c5f1084fd0c8159b0'
data_bytes = urllib.request.urlopen(url).read()
df = pl.read_parquet(data_bytes)
shape: (3, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

Expected behavior

Expected behavior is that no exception is raised when reading the following URL:

import polars as pl

url = 'https://data-bolt-s3.awsp.sneaksanddata.com/dev/MATA%40ecco.com/dummy.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=CULIE8RJLVA02A82XSKQ%2F20240814%2Feu-central-1%2Fs3%2Faws4_request&X-Amz-Date=20240814T123605Z&X-Amz-Expires=2592000&X-Amz-SignedHeaders=host&X-Amz-Signature=6d0a6fdef31c350e0cd0dd5d1838b0d776e8867f38883d7c5f1084fd0c8159b0'
df = pl.read_parquet(url)
shape: (3, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.11.9 (main, Apr  6 2024, 17:59:24) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            0.18.2
fastexcel:            <not installed>
fsspec:               2024.6.1
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             2.6.4
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@matt035343 matt035343 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Aug 14, 2024
@matt035343
Author

Potentially related to #17864

@george-zubrienko

george-zubrienko commented Aug 14, 2024

Additional context: the provided examples are from MinIO with s3v4 signatures; we tested AWS S3 pre-signed URLs as well, with the same issue.

@deanm0000
Collaborator

I'm getting a 400 error with urllib just the same. Using Azure, I can make a SAS link, which seems to be the equivalent of a pre-signed S3 URL, and it works. For reference, that looks like:

pl.read_parquet("https://ACCOUNT.blob.core.windows.net/CONTAINER/PATH/part-00001-801b155e-34dd-49b9-9271-afc51ea67548-c000.zstd.parquet?secret_query_params")

@george-zubrienko

george-zubrienko commented Aug 15, 2024

@deanm0000

https://data-bolt-s3.awsp.sneaksanddata.com/dev/MATA%40ecco.com/dummy.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=CULIE8RJLVA02A82XSKQ%2F20240814%2Feu-central-1%2Fs3%2Faws4_request&X-Amz-Date=20240814T123605Z&X-Amz-Expires=2592000&X-Amz-SignedHeaders=host&X-Amz-Signature=6d0a6fdef31c350e0cd0dd5d1838b0d776e8867f38883d7c5f1084fd0c8159b0

As was pointed out, this affects both S3 and MinIO. S3 signed URLs are not the same as Azure's, especially depending on the signature algorithm selected. The issue was raised specifically for the S3 API, so let's focus on that :)
However, Matthias provided a URL with the wrong expiry period; if you open what he provided in a browser, it indeed gives a 400, with a reason:

<Error>
<Code>AuthorizationQueryParametersError</Code>
<Message>X-Amz-Expires must be less than a week (in seconds); that is, the given X-Amz-Expires must be less than 604800 seconds</Message>
<Key>MATA@ecco.com/dummy.parquet</Key>
<BucketName>dev</BucketName>
<Resource>/dev/MATA@ecco.com/dummy.parquet</Resource>
<Region>eu-central-1</Region>
<RequestId>17EBD280527D45F9</RequestId>
<HostId>97b38168d719f9a48744208e207be4917e284c5de759ed78908a910e7b62024f</HostId>
</Error>

So, apologies for the confusion. I've generated a new S3 pre-signed URL for you, valid for 12 hours. In case you don't get to it before it expires, I can generate one with a week's expiration later, but it is much easier to try this out if you create the URLs yourself: you just need an S3 bucket with write permissions :)

url = 'https://esd-airflow-dev-1-log-archive.s3.eu-central-1.amazonaws.com/dummy.parquet?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjENb%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGV1LWNlbnRyYWwtMSJHMEUCICu6oKsN4GzIAenRNl7l99Y4A2U80UwYIBMaJOxLdXAmAiEAh3sqm5sYMWpXwA3FQBbRMFd75wOyMWncvU7m%2B6JMrrcq2AMIz%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAAGgw5NTc3NDczMzQzNzIiDOi6%2Fx4ceHyXIhvIpyqsAyCBtYnbPF11wmdQ2lwrbR85tTBLLkjWLATvXjySQl2DsFsPbivC4ujrnoiaeBd4ettGqHMtqhjBUAVgs%2FzSRzBOZPD8Ylw08s220qgHo9JNOgih1xxbUCqONM3GV9gJdOlPV4tEDwqU61C7DTpeQREdCtFkiMo0lJYeFZfEzISGccBHxFBJkVT9czq52RxmOxXu0K1vppwy2ELsEHDh%2BGDESjmwxFUOn9i9ic4P1cNxeD6bIWCV4yiHhYyyHNNH134vPAnRzIURW2%2Fgh2nBaexoW3PUyG2KDIJxviC2X5u8MJ9GEWCGLYlZ9sdVx0PT3jI%2BXYufJILc2rU6ZobAt%2BVoh3ftaf0ftDtChRw0Im%2BLwQxH7vYZPm%2B8GSKpQh3AAmljjLM%2BCsD5O%2B%2FG7%2FAcRglpoG%2Fd51CtGPE81NEHW%2Bhv%2FjXfjfNshXnJvCo90Lfvgs7J7OkA%2FQzXSnH4lD4h1wlkkIbY8wTQb30vukDQKG6h1h4OERUr5XNTdpG7E0pG1FW10rqo6x9G2l2rBoAnftB33X6Iu7%2F%2BFpiLaoykW2ALUunnpQIafQURh32nMOO89rUGOpQCarVSSlu48LmKWUCWzSxOHw7O9f6N45%2BN%2B4Ub24f88RjSEJXW52izfOWQOz09ZE2cQSi0593uMNvjimlAnO6jrs0TrAIxAvvajBqbfvqhCALeSctjYshchOPRX9PpqSyK6Tn4DWRhOjeztKPLuT9xo6fJXbC5L6%2F0%2FnUuvTxu45VUnBjl7dWRLu6fWPdpHtfIqNgrG35Ev1fHRZlfpexb9lL%2BZIuCqwZ2eh2SemIyrnMA%2F93P%2BkYsCmfsSFdqqVFf9Gz1FDSbojbCFBto4q3MKnRkEaKXMk6aSX%2FLKAePi1TXp023iH9V6tNAvZkw3Xt81FxHcaRXsaMJOlmR4T30gB5S8q92XQU%2FXmQZu5usVdDEOlpC&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240815T062229Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIA557RQRTSEBQ32OWT%2F20240815%2Feu-central-1%2Fs3%2Faws4_request&X-Amz-Signature=3ec7f438461e7a0367cd10e32802a1fd789f77f44278ba067b9ff1e962519f27'

@ritchie46
Member

@nameexhaustion can you take a look?

@nameexhaustion
Collaborator

nameexhaustion commented Aug 15, 2024

This error is due to presigned URLs not allowing the HTTP HEAD method - it can also be reproduced in the terminal with curl -I <URL>, which outputs HTTP/1.1 403 Forbidden.

There is a hack that can be used to retrieve the length using a GET request with Range: bytes=0-0 header (https://stackoverflow.com/a/39663152), but it won't be simple to do since we currently use HttpStore from object_store which doesn't expose a way for us to do this.
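The ranged-GET trick described above can be sketched with the Python standard library. This is only an illustration of the workaround, not what Polars or object_store actually do, and the helper names are made up:

```python
import re
import urllib.request


def total_size_from_content_range(header: str) -> int:
    """Parse a Content-Range header such as 'bytes 0-0/12345' into the total size 12345."""
    match = re.fullmatch(r"bytes \d+-\d+/(\d+)", header)
    if match is None:
        raise ValueError(f"unexpected Content-Range header: {header!r}")
    return int(match.group(1))


def content_length_without_head(url: str) -> int:
    """Learn an object's size via a one-byte ranged GET, avoiding the HEAD
    request that pre-signed S3 URLs reject."""
    request = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(request) as response:  # expect 206 Partial Content
        return total_size_from_content_range(response.headers["Content-Range"])
```

A compliant server answering a `bytes=0-0` request returns 206 Partial Content with a `Content-Range: bytes 0-0/<total>` header, from which the total object size can be read without ever issuing a HEAD.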

@nameexhaustion nameexhaustion added A-io-cloud Area: reading/writing to cloud storage accepted Ready for implementation P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Aug 15, 2024
@nameexhaustion nameexhaustion self-assigned this Aug 15, 2024
@deanm0000
Collaborator

Seems like it'd be better to raise the issue with object_store than to try to patch it in Polars, since they'll eventually want the same behavior, no?

@george-zubrienko

george-zubrienko commented Aug 15, 2024

This error is due to presigned URLs not allowing the HTTP HEAD method - it can also be reproduced in the terminal with curl -I <URL>, which outputs HTTP/1.1 403 Forbidden.

There is a hack that can be used to retrieve the length using a GET request with Range: bytes=0-0 header (https://stackoverflow.com/a/39663152), but it won't be simple to do since we currently use HttpStore from object_store which doesn't expose a way for us to do this.

A pre-signed URL can be issued with the HEAD method allowed (not through the UI, though). So technically there is a non-hacky workaround on our end, but I'd say it would be nice to get this documented.

Just checked: GET and HEAD cannot be combined in one signature, so the workaround is having two URLs, which of course won't fit into read_parquet :/

@nameexhaustion nameexhaustion changed the title Raises ComputeError when reading parquet from presigned S3 URL Parquet files cannot be read from pre-signed S3 URLs due to S3 forbidding HTTP HEAD Aug 16, 2024