UTF-8 support #49

colinodell · 2015-01-02T19:52:32Z

Spec version 0.14 tightened some of the rules regarding emphasis parsing. It's now necessary to inspect neighboring characters in a Unicode-aware way.

This introduces a major problem: the Cursor class is not currently Unicode-aware. For example, it thinks [Толпой] is 14 characters long instead of 8, because it counts bytes rather than characters. This was fine for our purposes since we never needed to address Unicode characters by position.
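
To make the 14 vs. 8 discrepancy concrete, here is a minimal sketch comparing a byte count with a character count. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// Byte length vs. character length for the string from the description.
$text = '[Толпой]';

$byteLength = strlen($text);              // counts bytes: 14
$charLength = mb_strlen($text, 'UTF-8');  // counts characters: 8

printf("bytes: %d, characters: %d\n", $byteLength, $charLength);
```

Each of the six Cyrillic letters takes two bytes in UTF-8, while the two brackets take one byte each, which is where the 14 comes from.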

So the new challenge is that, given a Cursor at some position, we need to accurately obtain the neighboring Unicode characters. Our $this->line[$this->position] trick won't work, as it only obtains a single byte instead of the whole character. The proper solution requires making the Cursor Unicode-aware.
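
As a rough illustration of the difference, the sketch below contrasts byte indexing with a character-based peek. The `Utf8CursorSketch` class and its `peek()` method are hypothetical names made up for this example and are not the library's actual Cursor; it assumes mbstring is loaded and that positions are tracked in characters.

```php
<?php
// Hypothetical illustration only; not the library's actual Cursor class.
final class Utf8CursorSketch
{
    private $line;
    private $position; // measured in characters, not bytes

    public function __construct($line, $position = 0)
    {
        $this->line = $line;
        $this->position = $position;
    }

    // Returns the character $offset positions away from the cursor,
    // or null when that position falls outside the line.
    public function peek($offset = 0)
    {
        $index = $this->position + $offset;
        if ($index < 0 || $index >= mb_strlen($this->line, 'UTF-8')) {
            return null;
        }

        return mb_substr($this->line, $index, 1, 'UTF-8');
    }
}

$line = 'Толпой*';
$byte = $line[1];                          // second byte of "Т", not a character

$cursor = new Utf8CursorSketch($line, 6);  // positioned on "*"
$previous = $cursor->peek(-1);             // "й"
$next = $cursor->peek(1);                  // null, end of line
```

Calling `mb_strlen()` on every peek is wasteful; a real implementation would probably cache the length, which may be part of why the measured overhead matters.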

iconv and mbstring both seemed to be strong candidates for this task, so I implemented both and benchmarked them. iconv resulted in a 1000% performance penalty, whereas mbstring represented a 26% drop, so I implemented the latter where needed.
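
The benchmark itself is not included in this thread. A comparison along these lines could be sketched as follows; the helper name `timeCharacterLookups`, the sample text, and the iteration count are all assumptions, and the timings it prints will vary by PHP version and machine, so it should not be expected to reproduce the figures above.

```php
<?php
// Rough timing sketch; absolute numbers depend on PHP version and machine.
function timeCharacterLookups(callable $substr, $text, $iterations)
{
    $length = mb_strlen($text, 'UTF-8');
    $start = microtime(true);

    for ($i = 0; $i < $iterations; $i++) {
        // Fetch one character at a rotating character offset.
        $substr($text, $i % $length, 1, 'UTF-8');
    }

    return microtime(true) - $start;
}

$text = str_repeat('[Толпой] ', 50);
$iterations = 20000;

$mbSeconds = timeCharacterLookups('mb_substr', $text, $iterations);
$iconvSeconds = timeCharacterLookups('iconv_substr', $text, $iterations);

printf("mb_substr: %.4fs, iconv_substr: %.4fs\n", $mbSeconds, $iconvSeconds);
```

Both `mb_substr()` and `iconv_substr()` take the string, offset, length, and encoding in the same order, which is why a single helper can time either one.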

As much as I hate adding new dependencies and reducing performance, I do feel this is the correct approach. I definitely welcome any feedback, alternatives, or performance enhancements. I'll keep this open for a few days to gather feedback before accepting.

/cc @philsturgeon @cebe @GrahamCampbell @aleemb @dshafik

GrahamCampbell · 2015-01-02T19:54:17Z

We could still use the patchwork library you know.

colinodell · 2015-01-02T19:56:51Z

I looked into patchwork/utf8 as one possibility, but it seems to fall back to iconv, which is terrible for performance.

GrahamCampbell · 2015-01-02T20:16:06Z

Requiring the extension is probably the best way to go then. :)

mnapoli · 2015-02-05T07:18:23Z

Do you guys know if mbstring is that widespread? I personally have no idea.

colinodell · 2015-02-05T13:31:36Z

I tried researching that but wasn't able to find any conclusive information. It does seem to be available in the most popular Linux distros though:

Debian/Ubuntu: built-in by default
RHEL/CentOS and Fedora: yum package available
openSUSE: yast package available
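
Whatever the distro, availability on a given machine can be checked at runtime. The sketch below uses PHP's built-in `extension_loaded()`; the helper name `loadedMultibyteExtensions` and the message text are made up for this example.

```php
<?php
// Reports which multibyte-capable extensions this PHP build has loaded.
function loadedMultibyteExtensions()
{
    $candidates = array('mbstring', 'iconv');

    return array_values(array_filter($candidates, 'extension_loaded'));
}

$loaded = loadedMultibyteExtensions();

echo in_array('mbstring', $loaded, true)
    ? "mbstring is available\n"
    : "mbstring is missing; install or enable it first\n";
```

Composer can also enforce the requirement at install time through an `ext-mbstring` entry in the `require` section of composer.json.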

mnapoli · 2015-02-05T20:35:54Z

Thanks!

colinodell self-assigned this Jan 2, 2015

colinodell added this to the Version 0.6 milestone Jan 2, 2015

colinodell added the "feedback wanted" label Jan 2, 2015

colinodell added the "spec compliance" label Jan 2, 2015

colinodell changed the title ~~UTF-8 support~~ Spec 0.15 & UTF-8 support Jan 2, 2015

colinodell changed the title ~~Spec 0.15 & UTF-8 support~~ UTF-8 support Jan 2, 2015

colinodell added 2 commits January 2, 2015 15:00

Enable UTF-8 support for emphasis parsing

89de853

Spec 0.14 introduced the need to peek() for unicode whitespace: http://spec.commonmark.org/0.14/#right-facing-delimiter-run The Cursor therefore cannot be ignorant to multi-byte encodings.

Undo previous mbstring function replacements

8c9d004

This reverts commits: - 6b28381 - c7331db - 82421eb
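
The first commit message above refers to peeking for Unicode whitespace. A minimal sketch of such a check is shown below. It assumes the spec's definition is the Unicode Zs class plus tab, carriage return, newline, and form feed, and the function name `isUnicodeWhitespace` is hypothetical rather than taken from these commits.

```php
<?php
// Assumed definition: Unicode class Zs plus tab, CR, LF, and form feed.
function isUnicodeWhitespace($char)
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so a multi-byte
    // character such as a no-break space is matched as a single character.
    return preg_match('/\A[\p{Zs}\t\r\n\f]\z/u', $char) === 1;
}

$noBreakSpace = "\xC2\xA0"; // U+00A0 encoded as UTF-8

var_dump(isUnicodeWhitespace(' '));           // bool(true)
var_dump(isUnicodeWhitespace($noBreakSpace)); // bool(true)
var_dump(isUnicodeWhitespace('й'));           // bool(false)
```

A byte-oriented check would see only the first byte of the no-break space and miss it, which is why the peek has to return a whole character.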

colinodell force-pushed the utf8-support branch from dcafbfc to 8c9d004 January 2, 2015 20:01

colinodell mentioned this pull request Jan 2, 2015

Spec 0.15 #50

Merged

colinodell added a commit that referenced this pull request Jan 6, 2015

Merge pull request #49 from thephpleague/utf8-support

3c0a3bc

UTF-8 support

colinodell merged commit 3c0a3bc into master Jan 6, 2015

colinodell deleted the utf8-support branch January 6, 2015 13:49
