Is there some mechanism to decode HTML charset #20

Closed
cybexr opened this issue Apr 26, 2018 · 3 comments

Comments

@cybexr

cybexr commented Apr 26, 2018

Symptom: xidel does not handle the charset correctly.
Example: xidel.exe --html 5.htm -e "//div[@Class='qxName']/a" --stdin-encoding=oem >5o.htm --output-encoding=input

This 5.htm has , but the output is incorrect.

After I convert the 5.htm file to UTF-8, the output is fine. It seems xidel always treats the input file as UTF-8. Am I missing something, or can xidel only work like this?
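For reference, the conversion workaround described above can be sketched in a few lines of Python. This is only a minimal sketch: the function name `convert_to_utf8` and the default of `gbk` as the source encoding are assumptions for illustration, not anything xidel provides.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "gbk") -> None:
    """Re-encode a text file from source_encoding to UTF-8.

    source_encoding is an assumption the caller must supply; this sketch
    does not try to detect it.
    """
    # Read the raw bytes and decode them with the assumed source encoding.
    text = src.read_bytes().decode(source_encoding)
    # Write the same text back out, encoded as UTF-8.
    dst.write_bytes(text.encode("utf-8"))


# Hypothetical usage (the output file name is made up):
#   convert_to_utf8(Path("5.htm"), Path("5.utf8.htm"))
```

Note that this only changes the bytes of the file. Any charset declaration inside the HTML itself is left untouched.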

@benibela
Owner

What are the encodings of the files?

With -output-encoding=input, 5o.htm should have the same encoding as 5.htm.

--stdin-encoding=oem is ignored here; it is only used when you read the file from stdin with - <5.htm

@cybexr
Author

cybexr commented Apr 27, 2018

The 5.htm file is GBK encoded, which is a DBCS (double-byte character set). That is quite different from the European-language encodings (all of which are SBCS):
https://en.wikipedia.org/wiki/GBK_(character_encoding)
https://en.wikipedia.org/wiki/DBCS
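
To make the DBCS point concrete, here is a small Python sketch. The sample string and the helper `decodes_as_utf8` are made up for illustration. It shows that GBK stores each Chinese character in two bytes, and that those bytes are generally not valid UTF-8, which is why a tool that assumes UTF-8 would misread such a file.

```python
text = "中文"  # two Chinese characters, chosen only for illustration

gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

print(len(gbk_bytes))   # 4: two bytes per character in GBK
print(len(utf8_bytes))  # 6: three bytes per character in UTF-8


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(gbk_bytes))   # False: these GBK bytes are not valid UTF-8
print(decodes_as_utf8(utf8_bytes))  # True
```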

Actually, 5.htm was downloaded from a website whose HTTP response had the header: Content-Type: text/html; charset=gb2312
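
As an aside, a header like the one above can be used to pick the decoder when saving or inspecting the response body. The following is a rough Python sketch using only the standard library. The helper names `charset_from_content_type` and `decode_body`, and the UTF-8 fallback, are assumptions for illustration and say nothing about how xidel itself handles the header.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default


def decode_body(body: bytes, header_value: str) -> str:
    """Decode a response body using the charset named in the header."""
    charset = charset_from_content_type(header_value)
    # errors="replace" keeps going on bad bytes instead of raising.
    return body.decode(charset, errors="replace")


print(charset_from_content_type("text/html; charset=gb2312"))  # gb2312
```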

@cybexr
Author

cybexr commented Apr 27, 2018

Tried again with just the option -output-encoding=input, and the result is correct, both on Windows and CentOS.
Maybe I was confused from debugging so many HTTP responses yesterday.
Thank you, benibela!

@cybexr cybexr closed this as completed Apr 27, 2018