UTF-8 support #49
We could still use the patchwork library, you know.
I looked into patchwork/utf8 as one possibility, but it seems to fall back to
Spec 0.14 introduced the need to peek() for Unicode whitespace (http://spec.commonmark.org/0.14/#right-facing-delimiter-run). The Cursor therefore cannot be ignorant of multi-byte encodings.
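As a rough illustration of what a Unicode-aware whitespace check might look like, here is a minimal sketch. The function name `isUnicodeWhitespace` is hypothetical and is not taken from this pull request. It assumes the spec's notion of Unicode whitespace is "any code point in the Zs category, plus tab, newline, form feed, and carriage return", and that the input is a single UTF-8 character.

```php
<?php
// Hypothetical helper (not the library's API): reports whether a single
// UTF-8 character is Unicode whitespace, assumed here to mean the Zs
// general category plus tab, newline, form feed, and carriage return.
function isUnicodeWhitespace(string $char): bool
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8,
    // so \p{Zs} can match multi-byte characters such as U+00A0.
    return preg_match('/^[\p{Zs}\t\n\f\r]\z/u', $char) === 1;
}

$results = [
    'space'          => isUnicodeWhitespace(' '),
    'no-break space' => isUnicodeWhitespace("\u{00A0}"),
    'latin letter'   => isUnicodeWhitespace('a'),
    'cyrillic'       => isUnicodeWhitespace('П'),
];
```

A byte-oriented check such as `ctype_space()` would not recognise the two-byte no-break space, which is why the neighboring character has to be read as a whole character first.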
dcafbfc to 8c9d004
Requiring the extension is probably the best way to go then. :)
Do you guys know if mbstring is that widespread? I personally have no idea.
I tried researching that but wasn't able to find any conclusive information. It does seem to be available in the most popular Linux distros though:
Thanks!
Spec version 0.14 enhanced some of the rules regarding emphasis parsing. It's now necessary to peek around at characters in a Unicode-aware way.

This introduces a major problem: the `Cursor` class is not currently Unicode-aware. For example, it thinks `[Толпой]` is 14 characters long (its length in UTF-8 bytes) instead of 8. This was fine for our purposes, since we never needed to address Unicode characters by position.

So the new challenge is that, given a `Cursor` in some position, we need to accurately obtain neighboring Unicode characters. Our `$this->line[$this->position]` trick won't work, as it will only obtain a single byte instead of the whole character. The proper solution would require the `Cursor` to become Unicode-aware.

`iconv` and `mbstring` both seemed to be strong candidates for this task, so I implemented both and benchmarked them. `iconv` resulted in a 1000% performance penalty, whereas `mbstring` represented a 26% drop, so I implemented the latter where needed.

As much as I hate adding new dependencies and reducing performance, I do feel this is the correct approach. I definitely welcome any feedback, alternatives, or performance enhancements. I'll keep this open for a few days to gather feedback before accepting.
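For anyone who wants to see the byte-versus-character difference described above, here is a minimal standalone sketch. It is not the code in this pull request: `peekChar` is a hypothetical helper whose name and signature are assumptions. It requires the `mbstring` extension and assumes the source file is saved as UTF-8.

```php
<?php
// Illustration only; requires the mbstring extension.
$line = '[Толпой]';

// Byte-oriented view: each Cyrillic letter occupies two bytes in UTF-8.
$byteLength = strlen($line);             // 14
// Character-oriented view.
$charLength = mb_strlen($line, 'UTF-8'); // 8

// Indexing the string yields one byte, which is only the first half of "Т".
$byteAtOne = $line[1];
// mb_substr() addresses by character offset and returns the whole character.
$charAtOne = mb_substr($line, 1, 1, 'UTF-8');

// Hypothetical helper (an assumption, not the library's API): returns the
// character at a character offset, or null when the offset is out of range.
function peekChar(string $line, int $charPosition): ?string
{
    if ($charPosition < 0 || $charPosition >= mb_strlen($line, 'UTF-8')) {
        return null;
    }

    return mb_substr($line, $charPosition, 1, 'UTF-8');
}
```

With this, `peekChar($line, 1)` returns `Т`, whereas `$line[1]` returns a lone byte that is not valid UTF-8 on its own. Tracking the position in characters rather than bytes is what makes the neighbor lookup reliable.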
/cc @philsturgeon @cebe @GrahamCampbell @aleemb @dshafik