JSON in ipfs object get transform bytes > 127 into U+FFFD (replacement character �) #2454

Closed
mildred opened this issue Mar 8, 2016 · 3 comments

mildred commented Mar 8, 2016

I'm trying to decode a unixfs node I got from the Gateway API in the browser, but I can't figure out how the data is encoded:

$ ipfs object data /ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme | hexdump -C | head -n 5
00000000  08 02 12 c3 08 48 65 6c  6c 6f 20 61 6e 64 20 57  |.....Hello and W|
00000010  65 6c 63 6f 6d 65 20 74  6f 20 49 50 46 53 21 0a  |elcome to IPFS!.|
00000020  0a e2 96 88 e2 96 88 e2  95 97 e2 96 88 e2 96 88  |................|
00000030  e2 96 88 e2 96 88 e2 96  88 e2 96 88 e2 95 97 20  |............... |
00000040  e2 96 88 e2 96 88 e2 96  88 e2 96 88 e2 96 88 e2  |................|
$ ipfs object get /ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme | hexdump -C | head -n 5 
00000000  7b 22 4c 69 6e 6b 73 22  3a 5b 5d 2c 22 44 61 74  |{"Links":[],"Dat|
00000010  61 22 3a 22 5c 75 30 30  30 38 5c 75 30 30 30 32  |a":"\u0008\u0002|
00000020  5c 75 30 30 31 32 ef bf  bd 5c 75 30 30 30 38 48  |\u0012...\u0008H|
00000030  65 6c 6c 6f 20 61 6e 64  20 57 65 6c 63 6f 6d 65  |ello and Welcome|
00000040  20 74 6f 20 49 50 46 53  21 5c 6e 5c 6e e2 96 88  | to IPFS!\n\n...|

We can see here that in the ipfs object data output, the 4th byte is c3. In the ipfs object get output, this byte appears as three bytes instead: ef bf bd, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character (�). This indicates an encoding problem.

Now, when querying the API using curl (http://localhost:5001/api/v0/object), I get the same character, but encoded as a JSON escape sequence (\ufffd) instead.
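
The replacement described above can be reproduced outside of ipfs. The following is a minimal sketch in Python, used here only for illustration; it does not reproduce the go-ipfs code path, it only shows the same observable effect on the first bytes of the hexdump. Note that Python's json module writes the backspace byte as \b where the output above shows \u0008.

```python
import json

# First bytes of the unixfs node, copied from the hexdump above.
raw = bytes.fromhex("08 02 12 c3 08") + b"Hello"

# Lossy decode: every byte sequence that is not valid UTF-8 becomes U+FFFD.
text = raw.decode("utf-8", errors="replace")

# Re-encoded as UTF-8, the single byte c3 has turned into ef bf bd.
print(text.encode("utf-8").hex(" "))

# ASCII-only JSON writes the replacement character as the escape \ufffd.
print(json.dumps(text))

# With ensure_ascii=False the replacement character is written literally,
# which gives the ef bf bd bytes once the JSON text is encoded as UTF-8.
print(json.dumps(text, ensure_ascii=False).encode("utf-8").hex(" "))
```

In both JSON forms the original c3 byte can no longer be recovered.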

I would expect the byte to be encoded as U+00C3, which is Ã and is encoded as c3 83 in UTF-8, but this is open to debate. We have two choices here:

  • Encode every byte > 127 as the corresponding Unicode character in the U+0080 .. U+00FF range. Text that was already valid UTF-8 would be encoded a second time and would no longer be recognizable.
  • The current choice: assume the binary data is UTF-8 and replace every malformed UTF-8 sequence (c3 followed by 08 is one of them) with the Unicode replacement character (U+FFFD �). This is a destructive operation.

In a file like /ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme, for a character like ╗ (U+2557), which appears in UTF-8 as e2 95 97, this implies:

  • With the first solution, it is encoded as three characters: U+00E2, U+0095 and U+0097. In UTF-8 these are c3 a2, c2 95 and c2 97, appearing as â followed by two unprintable C1 control characters.
  • With the second solution, it is encoded in UTF-8 as its own character ╗ (e2 95 97).
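
To make the trade-off concrete, here is a rough sketch of both options in Python. The function names option_one and option_two are hypothetical and exist only for this illustration; the first option is modelled with the latin-1 codec, which maps each byte to the code point of the same value.

```python
def option_one(data: bytes) -> str:
    """Map each byte to the code point with the same value (ISO-8859-1)."""
    return data.decode("latin-1")


def option_two(data: bytes) -> str:
    """Treat the bytes as UTF-8 and replace malformed sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


samples = {
    "stray c3 byte": bytes.fromhex("c3 08"),
    "box character": bytes.fromhex("e2 95 97"),
}

for name, data in samples.items():
    one, two = option_one(data), option_two(data)
    print(name)
    # Option 1 can always be reversed by encoding back to latin-1.
    print("  option 1 as UTF-8:", one.encode("utf-8").hex(" "),
          "| reversible:", one.encode("latin-1") == data)
    # Option 2 can only be reversed when the input was valid UTF-8.
    print("  option 2 as UTF-8:", two.encode("utf-8").hex(" "),
          "| reversible:", two.encode("utf-8") == data)
```

The first option never loses information but makes real UTF-8 text unreadable; the second keeps valid text readable but cannot be reversed for the stray c3 byte.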

I would say JSON is not suitable for representing binary data, but on the web we might not have a choice. Perhaps we should think more about which option is best here. Perhaps we should not even try to encode binary data in JSON and should just tell people to use some other format.

mildred commented Mar 8, 2016

Another solution is to use an encoding scheme like base64, or any other scheme that works well and has a low overhead for UTF-8 data.
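
As a sketch of what the base64 variant could look like (Python for illustration only; putting base64 text into the "Data" field is an assumption made for this example, not something ipfs object get does in this report):

```python
import base64
import json

# Header bytes from the hexdump above, followed by some valid UTF-8 text.
raw = bytes.fromhex("08 02 12 c3 08") + "Hello ╗".encode("utf-8")

# base64 output is plain ASCII, so it is always a valid JSON string.
encoded = base64.b64encode(raw).decode("ascii")
document = json.dumps({"Links": [], "Data": encoded})

# The receiver decodes the field and gets the exact original bytes back.
restored = base64.b64decode(json.loads(document)["Data"])
assert restored == raw

print(document)
print(f"{len(raw)} raw bytes -> {len(encoded)} base64 characters")
```

The round trip is lossless, at the cost of roughly one third more data and a payload that is no longer human-readable, even where it was plain text.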

mildred commented Mar 8, 2016

A solution could be UTF-8B: http://bsittler.livejournal.com/10381.html
See also: http://legacy.python.org/dev/peps/pep-0383/

This might not work in a browser where the strings are converted internally into UTF-16 though.
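
For reference, PEP 383 is available in Python as the surrogateescape error handler. The sketch below shows the round trip within Python only; it says nothing about how a browser or another JSON parser would treat the lone surrogates, which is exactly the open question above.

```python
import json

# Header bytes from the hexdump above, followed by some valid UTF-8 text.
raw = bytes.fromhex("08 02 12 c3 08") + "Hello ╗".encode("utf-8")

# Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF,
# while valid UTF-8 sequences decode to their normal characters.
text = raw.decode("utf-8", errors="surrogateescape")
assert text[3] == "\udcc3"

# Python's default ensure_ascii=True writes the lone surrogate as \udcc3.
document = json.dumps({"Data": text})
print(document)

# Decoding the JSON and re-encoding with the same handler restores the bytes.
restored = json.loads(document)["Data"].encode("utf-8", errors="surrogateescape")
assert restored == raw
```

Valid text stays readable and invalid bytes survive, but only as long as every consumer of the JSON preserves unpaired surrogates.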

mildred commented Jun 16, 2016

duplicate of #1582
