
[Gateway] Content-Encoding: gzip and Content-Type: text/html #7268

Closed
xmaysonnave opened this issue May 3, 2020 · 20 comments
Labels
kind/feature A new feature status/wontfix This will not be addressed

Comments

@xmaysonnave

Dear Friends,

When configured to do so, Nginx can serve gzip content either pre-compressed (static) or compressed on the fly.

The response headers received when requesting a hypothetical https://example.org web page (assuming your web site's default document is index.html) are:

Content-Encoding: gzip
Content-Type: text/html

My Chrome browser is able to inflate and process the web page.

I uploaded an index.html.gz to my local ipfs server and ran the following curl tests:

curl -X HEAD -I http://127.0.0.1:8080/ipfs/QmRhkAAucjWCdZmYVMgBf6oYEBQD9pWrMfKGHcgmMveHE2
Content-Type: application/x-gzip

curl -X HEAD -I http://127.0.0.1:8080/ipfs/QmRhkAAucjWCdZmYVMgBf6oYEBQD9pWrMfKGHcgmMveHE2\?filename\=index.html.gz
Content-Type: application/gzip

First point -> Notice that the Content-Type is not consistent.

curl -X HEAD -I http://127.0.0.1:8080/ipfs/QmRhkAAucjWCdZmYVMgBf6oYEBQD9pWrMfKGHcgmMveHE2\?filename\=index.html
Content-Type: text/html

This test mimics what you usually expect when you request an index.html from a web server. While the content is gzipped, you do not receive the proper Content-Encoding: gzip.

Nginx has a particular setup to achieve that: either it is configured to gzip on the fly, or it can serve a pre-compressed resource when an index.html and an index.html.gz exist at the same directory level.
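For reference, the two nginx modes described above boil down to two directives (a sketch; availability of the static variant depends on your nginx build):

```nginx
# On-the-fly compression of responses:
gzip on;

# Serve a pre-compressed index.html.gz sitting next to index.html
# (requires ngx_http_gzip_static_module):
gzip_static on;
```

With `gzip_static on`, nginx checks for a `.gz` sibling of the requested file and serves it with `Content-Encoding: gzip` when the client accepts it.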

The gain is very significant in my situation, as my index.html is 5.5MB while the index.html.gz is 1.8MB. I uploaded and pinned on infura an index.html.gz available here:

I ran some curl tests against infura, but it appears that there is some web front-end processing.

curl -X HEAD -I https://ipfs.infura.io/ipfs/QmQ2x72Nw9oDhrPckfdbbjBEc6WiB3gqrnRZqmqxHMdmVS\?filename\=index.html

I receive:
content-type: text/plain; charset=utf-8

curl -X HEAD -I https://ipfs.infura.io/ipfs/bafybeiadqvziczgdi4k5qgqyixg6q7b6yzzcwxxcupxd3fm2nwehfi6j2q\?filename=index.html
Even with a regular index.html page I receive:
content-type: text/plain; charset=utf-8

My browser does not complain and displays the content as a web page.

The go-ipfs server is able to detect whether a requested resource is gzipped. However, I'm considering whether this situation could be improved.

With the following curl:
curl -X HEAD -I http://127.0.0.1:8080/ipfs/QmRhkAAucjWCdZmYVMgBf6oYEBQD9pWrMfKGHcgmMveHE2\?filename\=index.html
I would expect to receive:

Content-Encoding: gzip
Content-Type: text/html

Browsers would then be able to properly inflate the content and process the web page. How infura processes/proxies the response headers is another topic.

Thanks

@xmaysonnave
Author

I missed one important point in my reasoning.
The browser sends Accept-Encoding: gzip, deflate, br in its request headers; otherwise there is no chance of getting the proper Content-Encoding from Nginx.
Is there any modern browser that doesn't do that?
Thanks

@Stebalien
Member

Stebalien commented May 4, 2020

Most of the strangeness you're noticing here comes from quirks of content-type detection. We:

  1. Prefer the extension.
  2. Then fall back on the actual content type.

A reasonable extension to the gateway would be to serve index.html.gz files along with index.html files, automatically decompressing them on the fly if the user agent doesn't specify that they accept gzipped files.

Would that cover your use-case? If so, would you be willing to implement it?

@xmaysonnave
Author

xmaysonnave commented May 5, 2020

@Stebalien
1 - Thanks for the explanations about the content-type detection.

Summary

Current behaviour as expected:

https://ipfs.infura.io/ipfs/QmQ2x72Nw9oDhrPckfdbbjBEc6WiB3gqrnRZqmqxHMdmVS
The content-type is application/x-gzip, and the behaviour observed in a browser is a "Save As" dialog with the suggested file name QmQ2x72Nw9oDhrPckfdbbjBEc6WiB3gqrnRZqmqxHMdmVS.gz.

https://ipfs.bluelightav.org/ipfs/QmeeqFYbLabqZA2KjmFTCRfAVpv4kjgRHNistw63V6Jp4X?filename=index.html.gz
The content-type is application/x-gzip, and the behaviour observed in a browser is a "Save As" dialog with the suggested file name index.html.gz.

Current Behaviour:
https://ipfs.infura.io/ipfs/QmQ2x72Nw9oDhrPckfdbbjBEc6WiB3gqrnRZqmqxHMdmVS?filename=index.html
content-type: text/html

The browser displays the raw gzipped bytes (pretty ugly)

Expected Behaviour:
https://ipfs.infura.io/ipfs/QmQ2x72Nw9oDhrPckfdbbjBEc6WiB3gqrnRZqmqxHMdmVS?filename=index.html
Content-Encoding: gzip
content-type: text/html

2 - I'm not sure I understand what you mean by:

to serve index.html.gz files along with index.html files

I don't expect to have two files (index.html and index.html.gz), as required by nginx in gzip static mode.

3 - I like the idea of inflating the content on the fly when the Accept-Encoding: gzip, deflate, br header does not contain gzip (as long as the detected content-type is application/gzip or application/x-gzip).
It's the exact opposite of what happens with nginx: nginx compresses, while you suggest decompressing the content.

4 - I see this enhancement as a feature rather than a project-specific use case. It will probably benefit a lot of users. @hsanjuan made a comment about a wider discussion on this topic.
#7252 (comment)
More people probably need to be involved in this discussion, as we have only spoken about a gzipped html page in particular. I'm thinking about gzipped json content, for instance. I don't have the larger picture yet. Are there other content-types or content-encodings that could benefit from this approach?

5 - My help will be very limited on this issue as I'm not a Go developer. However, you can expect feedback and testing.

Warmly

@Stebalien
Member

1

The current behavior is correct (ish). We first use the filename to detect the content type, only falling back on the file content if the filename is ambiguous.

2

Currently, when the user visits /ipfs/QmFoo/ and QmFoo is a directory, we serve an index.html file out of QmFoo, if it exists.

I'm suggesting that we alternatively serve an index.html.gz file. That is, when the user visits /ipfs/QmFoo/:

  • If the browser accepts gzipped HTML, we'd first try serving index.html.gz, falling back on index.html.
  • If the browser doesn't accept gzipped HTML, we'd first try serving index.html, falling back on transparently decompressing and serving index.html.gz.
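The fallback logic above could be sketched as follows. This is a minimal illustration, not go-ipfs code: `chooseIndex` is a hypothetical helper, and the `strings.Contains` check on Accept-Encoding is a simplification (it ignores q-values).

```go
package main

import (
	"fmt"
	"strings"
)

// chooseIndex sketches the proposed fallback logic for serving a directory.
// dirFiles is the set of file names present in the requested directory.
// It returns which file to serve and whether the gateway must decompress it.
func chooseIndex(acceptEncoding string, dirFiles map[string]bool) (file string, decompress bool) {
	acceptsGzip := strings.Contains(acceptEncoding, "gzip") // simplified check
	switch {
	case acceptsGzip && dirFiles["index.html.gz"]:
		return "index.html.gz", false // serve as-is, with Content-Encoding: gzip
	case dirFiles["index.html"]:
		return "index.html", false
	case dirFiles["index.html.gz"]:
		return "index.html.gz", true // transparently decompress before serving
	}
	return "", false
}

func main() {
	gz := map[string]bool{"index.html.gz": true}
	fmt.Println(chooseIndex("gzip, deflate, br", gz)) // index.html.gz false
	fmt.Println(chooseIndex("identity", gz))          // index.html.gz true
}
```

Note that when both files exist and the browser accepts gzip, the pre-compressed copy wins; when only index.html.gz exists and the browser does not accept gzip, the gateway pays the decompression cost.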

4

Are you talking about auto-compressing responses if the user-agent specifies an Accept-Encoding that accepts compressed responses? If so, then yes. We could auto-compress gateway responses (like nginx does).

If you'd like to add a foo.json.gz file to go-ipfs and have go-ipfs serve it as a "json" file, that won't work, unfortunately.

The index.html trick works because the user didn't ask for index.html or index.html.gz, they asked for the directory. That gives us some wiggle-room.

However, if the user asks to download some foo.json.gz file, we need to give them the exact file they asked for.


The general solution is to compress at the edges:

  1. Compress chunks in the local datastore to save disk space.
  2. Transport-layer compression in bitswap (the IPFS data exchange protocol).
  3. Transparently compress responses to the user as nginx does.

To avoid wasting CPU cycles, we can probably play some neat tricks to avoid ever having to re-compress data, but that can be done as an optimization later.

@xmaysonnave
Author

@Stebalien Let me think about that.
However, I realize I forgot to mention that I work in block mode; I don't use IPLD (I hope I'm correct). Thanks

@Stebalien
Member

Blocks are just serialized IPLD objects.

@xmaysonnave
Author

That seems correct, but I'm a little confused by your wording, as I don't use directories. The general idea is: when the content-type is application/x-gzip, the content is served directly if the request header contains Accept-Encoding: gzip; otherwise, fall back to decompressing and serving the content. However, to mimic the current browser behaviour, a filename=index.html is required.

@Stebalien
Member

That's not what I'm trying to say. If we're serving a file, we need to serve the file as-is for the reasons I listed in my response to your point 4.

I'm saying that in the special case where we're serving a directory, we could consider automatically decompressing and serving an index.html.gz instead of an index.html file.

@hsanjuan
Contributor

hsanjuan commented May 5, 2020

However, if the user asks to download some foo.json.gz file, we need to give them the exact file they asked for.

If the user asks for foo.json.gz but accepts gzip as a content-encoding, could we set the Content-Type to json directly? (This assumes that a gzipped file has been added to ipfs directly.)

I would not go to the lengths of gzipping things for users, as this does not really remove the need for nginx on production gateways, and nginx does it anyway. Otherwise we are talking mostly about local usage, where it does not help much.

@Stebalien
Member

If the user asks for foo.json.gz but accepts gzip as a content-encoding, could we set the Content-Type to json directly? (This assumes that a gzipped file has been added to ipfs directly.)

If we did that, we'd also have to set Transfer-Encoding to gzip or the user would get back garbage. However, in that case, we'd be giving the user a different file than the one they added to go-ipfs.

@hsn10
Copy link

hsn10 commented May 6, 2020

There is no need to decompress. It's the publisher's responsibility to choose formats understood by their expected readers.

@xmaysonnave
Author

@hsn10 No, the gateway is a web server and should act like a web server that fits into an infrastructure (proxy, ssl, cache, etc.). nginx is able to compress and decompress content.
https://docs.nginx.com/nginx/admin-guide/web-server/compression/

As a user, when I upload compressed HTML content I expect, with the filename trick (or no filename trick if we consider html the default), to receive the appropriate response headers in return, so the browser can do its decompression job.

We cannot simply rely on the proxy to do the compression job. Right now infura is not compressing, while gateway.ipfs.io does. Some improvement needs to happen at the gateway level. I opened a ticket at infura and invited them, as a public infrastructure provider, to participate in this thread: INFURA/infura#200

The questions now are the points raised by @Stebalien. I see two scenarios. In my current use case I upload a buffer to ipfs with js-ipfs-http-client; no filename is involved, and I work in block mode. The other use case is the file API: as @Stebalien suggested, we could imagine having two files (the typical nginx use case). I'm in the first scenario, as I want to improve upload/download network traffic and let the browser decompress; I do not intend to upload two files. A question here is how the gateway should behave when a client, as specified in its Accept-Encoding: gzip, deflate, br request header, is unable to decompress the content. It also opens the question of supporting other compression algorithms (brotli, etc.).

I also made some quick tests with compressed json content through my nginx, and I confirm that I received a Transfer-Encoding response header; however, I'm not really familiar with this header. From my perspective it is less critical, mainly because my json traffic is under my application's control. That is not the case for HTML: usually, through ipfs, ipns, ens or dns-link, a user types an address or uses a bookmark to load content (that content could also be controlled at the application level, as I do), and we cannot expect users to also add a filename to handle compressed content. The default in a web server is usually index.html.

If we can store compressed content and let the browser decompress it, we ease the network traffic; we could also cache smaller content in nginx, as we are dealing with immutable content. I agree that there are some edge cases that need to be addressed with uploaded compressed content (clients that don't support compression); the other direction is also a valid use case (uploaded uncompressed content where the client is able to decompress).

I would also suggest focusing on a minimal use case. It sounds to me that enlarging the conversation to all mime-types is probably premature.

Could we first agree on the need for compressed uploaded html content? It sounds to me like the minimal use case.

Thanks

@Stebalien
Member

@xmaysonnave what is your goal?

  • Is it to save bandwidth between the gateway and the browser?
  • Is it to save disk space for ipfs websites?
  • Is it to save bandwidth between IPFS peers when exchanging data?

Side note: please try to be brief and use ample punctuation and bullet points.

@xmaysonnave
Author

@Stebalien my initial motivation is the first point, but the other points are perfectly valid when compressed html content is uploaded.

@Stebalien
Member

In that case, the solution here really is to use an nginx reverse proxy. The gateway is not a full-featured HTTP server; it implements the bare minimum.

  • It does not cache rendered pages in memory.
  • It does not handle encryption.
  • It does not load balance.
  • It does not rate limit.
  • It does not deduplicate requests to the same resource.

If you're running a public go-ipfs gateway, you'll always have an nginx reverse proxy (for load balancing and caching if nothing else).

If you're running go-ipfs on a personal computer, you won't want compression between your local browser and your local go-ipfs node.

@xmaysonnave
Author

Who is supposed to set up the proper content-type and content-encoding, if not the gateway?

@Stebalien
Member

Both.

Given an uncompressed index.html file stored in IPFS:

  1. The gateway sets the content-type to text/html and sets the content-encoding to "identity" (or nothing).
  2. The reverse proxy keeps the content-type, compresses as necessary, and sets the content-encoding accordingly.

go-ipfs already does this correctly.
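As an illustration of that division of labor, a minimal nginx front-end for a local gateway might look like this (hostname and certificate details are placeholders; nginx compresses text/html by default):

```nginx
server {
    listen 443 ssl;
    server_name gateway.example.org;           # placeholder

    location / {
        proxy_pass http://127.0.0.1:8080;      # local go-ipfs gateway
        gzip on;                               # compress on the fly
        gzip_proxied any;                      # also compress proxied responses
        gzip_types application/json text/css;  # text/html is included by default
    }
}
```

The gateway keeps emitting uncompressed bodies with the right Content-Type, and nginx negotiates Content-Encoding with the browser.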

@xmaysonnave
Author

Thanks for your detailed behaviour description with uncompressed html content. It works well I agree.

What about compressed html content stored on ipfs ?

@Stebalien
Member

Not unless there's a really good motivation.

The correct solution is the one I posted at the bottom of #7268 (comment). That is:

  1. Compress on the network.
  2. Compress on disk (Expose Badger compression options #6848).

The consensus is that compressing content before hashing in IPFS is not the way to go. See ipld/specs#76 for a long discussion, but the TL;DR is:

  1. If different users choose different compression algorithms, the data isn't de-duplicated.
  2. Changing the compression algorithm changes the content's name on the network because it changes the hash.
  3. Importantly, all peers need to support all compression algorithms:
    1. I can't use a new compression algorithm until everyone I care about supports it.
    2. I can never remove support for compression algorithms.

On the other hand, if compression is done on the network between two peers and on disk:

  1. On the network: The peers can negotiate what compression formats they support (as HTTP does).
  2. On disk: the local node can pick the best compression algorithm for its use-case.

@xmaysonnave
Author

@Stebalien Thanks for the references.
I noticed that this topic/use case is also discussed here.
ipfs/notes#324
Thanks

@Stebalien Stebalien added status/wontfix This will not be addressed and removed need/triage Needs initial labeling and prioritization labels May 22, 2020