Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 support #49

Merged
merged 2 commits into from
Jan 6, 2015
Merged

UTF-8 support #49

merged 2 commits into from
Jan 6, 2015

Conversation

colinodell
Copy link
Member

Spec version 0.14 enhanced some of the rules regarding emphasis parsing. It's now necessary to peek around at characters in a Unicode-aware way.

This introduces a major problem - the Cursor class is not currently Unicode-aware. For example, it thinks [Толпой] is 14 characters long instead of 8. This was fine for our purposes since we never needed to address Unicode characters by position.

So the new challenge is that given a Cursor in some position, we need to accurately obtain neighboring Unicode characters. Our $this->line[$this->position] trick won't work as it will only obtain a single byte instead of the whole character. The proper solution would require the Cursor becoming Unicode-aware.

iconv and mbstring both seemed to be strong candidates for this task, so I implemented both and benchmarked them. iconv resulted in a 1000% performance penalty, whereas mbstring represented a 26% drop, so I implemented the latter where needed.

As much as I hate adding new dependencies and reducing performance, I do feel this is the correct approach. I definitely welcome any feedback, alternatives, or performance enhancements. I'll keep this open for a few days to gather feedback before accepting.

/cc @philsturgeon @cebe @GrahamCampbell @aleemb @dshafik

@colinodell colinodell self-assigned this Jan 2, 2015
@colinodell colinodell added this to the Version 0.6 milestone Jan 2, 2015
@colinodell colinodell added the feedback wanted We need your input! label Jan 2, 2015
@GrahamCampbell
Copy link
Member

We could still use the patchwork library you know.

@colinodell colinodell added the spec compliance Issues or question about compliance with the CommonMark or GFM specs label Jan 2, 2015
@colinodell
Copy link
Member Author

I looked into patchwork/utf8 as one possibility, but it seems to fallback to iconv which is terrible for performance.

@colinodell colinodell changed the title UTF-8 support Spec 0.15 & UTF-8 support Jan 2, 2015
@colinodell colinodell changed the title Spec 0.15 & UTF-8 support UTF-8 support Jan 2, 2015
Spec 0.14 introduced the need to peek() for unicode whitespace: http://spec.commonmark.org/0.14/#right-facing-delimiter-run
The Cursor therefore cannot be ignorant to multi-byte encodings.
@GrahamCampbell
Copy link
Member

Requiring the extension is probably the best way to go then. :)

colinodell added a commit that referenced this pull request Jan 6, 2015
@colinodell colinodell merged commit 3c0a3bc into master Jan 6, 2015
@colinodell colinodell deleted the utf8-support branch January 6, 2015 13:49
@mnapoli
Copy link
Contributor

mnapoli commented Feb 5, 2015

Do you guys know if mbstring is that widespread? I personally have no idea

@colinodell
Copy link
Member Author

I tried researching that but wasn't able to find any conclusive information. It does seem to be available in the most-popular Linux distros though:

  • Debian/Ubuntu: built-in by default
  • RHEL/CentOS and Fedora: yum package available
  • openSUSE: yast package available

@mnapoli
Copy link
Contributor

mnapoli commented Feb 5, 2015

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feedback wanted We need your input! spec compliance Issues or question about compliance with the CommonMark or GFM specs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants