Punctuation regexp is incomplete #108

rlidwka · 2016-09-13T13:43:42Z

I came across a discrepancy between cmark and commonmark.js output:

$ echo '**。**话' | ./cmark/build/src/cmark 
<p>**。**话</p>
$ echo '**。**话' | ./commonmark.js/bin/commonmark 
<p><strong>。</strong>话</p>

So, according to spec v26,

A punctuation character is an ASCII punctuation character or anything in the Unicode classes Pc, Pd, Pe, Pf, Pi, Po, or Ps.

The character "。" (U+3002) belongs to the class Punctuation, Other [Po] (see http://www.fileformat.info/info/unicode/char/3002/index.htm), but it's not included here:

https://github.com/jgm/commonmark.js/blob/3587c91c62128e54a236648ff1ac4a1ad1cd5ad8/lib/inlines.js#L41

For reference, here's the regexp from the unicode-8.0.0 package (we're using that in markdown-it), which includes this character (and appears to be a lot larger):

https://github.com/mathiasbynens/unicode-8.0.0/blob/master/General_Category/Punctuation/regex.js
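
To illustrate the spec definition quoted above, here is a minimal sketch that checks U+3002 against two patterns. It assumes a runtime with ES2018 Unicode property escapes (e.g. Node.js 10 or later). The names `asciiPunctuation`, `unicodePunctuation` and `describe` are hypothetical and exist only for this example. This is not the regexp used by commonmark.js or markdown-it.

```javascript
// Assumption: the runtime supports Unicode property escapes (ES2018).
// \p{P} matches the general categories Pc, Pd, Pe, Pf, Pi, Po and Ps.
const unicodePunctuation = /\p{P}/u;

// ASCII punctuation only (U+0021-002F, U+003A-0040, U+005B-0060, U+007B-007E),
// shown for contrast with the Unicode-aware pattern.
const asciiPunctuation = /[!-\/:-@\[-`{-~]/;

// Report how a single character is classified by each pattern.
function describe(ch) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  return {
    codePoint: 'U+' + hex,
    ascii: asciiPunctuation.test(ch),
    unicode: unicodePunctuation.test(ch),
  };
}

console.log(describe('。')); // { codePoint: 'U+3002', ascii: false, unicode: true }
console.log(describe('.'));  // ascii: true, unicode: true
console.log(describe('话')); // ascii: false, unicode: false
```

A pattern that misses U+3002 treats "。" like an ordinary letter, which is enough to change how the surrounding `**` delimiters are classified.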

rlidwka · 2016-09-13T14:04:19Z

PS: that said, the example above is a fragment of this Chinese text:

**无论你是在打网球比赛，在打牌还是在选股票，有经验的人总会猜测对手会怎么做。**这样的话

I don't know the language, but it doesn't seem to use whitespace at all. Maybe the CommonMark rules don't quite work with Chinese.

Still, commonmark.js behaves differently from what is written in the spec, so I opened a bug report here.

jgm · 2016-09-14T14:07:15Z

Thanks!

puzrin · 2016-09-14T16:10:46Z

@jgm I'd recommend using require() to load the files from a unicode-* package instead of hardcoding the regexp. Or you can use our proxy, https://github.com/markdown-it/uc.micro, so that nothing needs updating when unicode-10.+ is released.

cinty8b · 2018-09-28T16:47:12Z

It seems the problem is still there.

jgm · 2018-09-28T23:04:08Z

markdown-it does not use commonmark.js.

cinty8b writes:

It seems the problem is still there. ![snipaste_2018-09-29_00-45-15](https://user-images.githubusercontent.com/5980459/46221831-1bd5e400-c381-11e8-97da-0a629f66df38.jpg)

puzrin · 2018-09-29T00:37:03Z

markdown-it does not use commonmark.js.

That's correct, but it uses the existing spec for its tests. The current sample from the commit passes without issues.

@cinty8b, I'd recommend making your reports more detailed. Screenshots are very inconvenient and almost useless:

Try https://spec.commonmark.org/dingus/ to find out whether your issue is related to commonmark.js/spec or not. Use permalinks (not screenshots) for samples.
If you are absolutely sure that the official dingus is OK, try https://markdown-it.github.io/ and report to its tracker, not here (with a permalink too).

Maybe that's a new issue.

@jgm IMHO the existing fix with a hardcoded regexp is neither human-readable nor maintainable: nobody knows the data source or how up to date it is. If you don't want to use external packages, it is worth adding comments with a link to the original and the Unicode version number.

cinty8b · 2018-09-29T01:46:12Z

Sorry for the screenshot.
I'm just a user, not a developer, so I am not familiar with the terminology or the conventional bug report process. I apologize for any inconvenience and for reporting in the wrong place.
I tracked the problem from Emphasis cannot be recognized when it has no space with following words #285, and @rlidwka started this issue from there.

I tried https://spec.commonmark.org/dingus/, and it behaves the same as markdown-it.
Here is the sample
The `fox.`, `.fox`, `狐狸。` and `。狐狸` parts are not italic as expected.

puzrin · 2018-09-29T17:26:10Z

@cinty8b thanks for the details. As far as I understand, your case is not specific to Asian languages, because it is reproducible in English too. So it's not related to this issue. IMHO it is worth creating a new one so that it does not get lost.

@jgm could you advise on a better place to forward this?

jgm · 2018-09-29T18:41:20Z

This behavior accords with the spec. (So, it is "expected" in that sense.)

brown*fox.*jumps

Here the second `*` delimiter is not "right-flanking." Notice that this gives you emphasis:

brown*fox.*.jumps

because now the second `*` is both right- and left-flanking.

If you really need the `fox.` to be emphasized here, you could try inserting a zero-width space:

the brown*fox.*jumps

If you think this is a flaw in the spec, you could bring it up on talk.commonmark.org. But be aware that there are always tradeoffs; the question is whether there's an improvement that could be made without messing up other things we currently get right.
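
The flanking test can be sketched roughly as follows. This is a simplified illustration of the spec's definitions, not the code in commonmark.js. It assumes ES2018 Unicode property escapes, treats the start or end of the text as a newline, and assumes the characters next to the run are in the Basic Multilingual Plane. The helper names `flanking` and `lastRun` are hypothetical. The position of the zero-width space (U+200B) in the third call, just before the closing `*`, is also an assumption made for the example.

```javascript
// Unicode whitespace per the spec: tab, LF, FF, CR, or anything in class Zs.
const isWhitespace = (ch) => /^[\t\n\f\r\p{Zs}]$/u.test(ch);
// Punctuation per the spec text quoted in this issue: ASCII punctuation or \p{P}.
const isPunctuation = (ch) => /^[!-\/:-@\[-`{-~\p{P}]$/u.test(ch);

// Classify a delimiter run from the characters immediately before and after it.
function flanking(before, after) {
  const left =
    !isWhitespace(after) &&
    (!isPunctuation(after) || isWhitespace(before) || isPunctuation(before));
  const right =
    !isWhitespace(before) &&
    (!isPunctuation(before) || isWhitespace(after) || isPunctuation(after));
  return { left, right };
}

// Hypothetical helper: classify the last occurrence of `delim` in `text`.
function lastRun(text, delim) {
  const i = text.lastIndexOf(delim);
  const before = i > 0 ? text[i - 1] : '\n';
  const end = i + delim.length;
  const after = end < text.length ? text[end] : '\n';
  return flanking(before, after);
}

console.log(lastRun('brown*fox.*jumps', '*'));       // { left: true, right: false }
console.log(lastRun('brown*fox.*.jumps', '*'));      // { left: true, right: true }
console.log(lastRun('brown*fox.\u200B*jumps', '*')); // { left: true, right: true }
console.log(lastRun('**。**话', '**'));               // { left: true, right: false }
```

A `*` run can close emphasis only if it is right-flanking, which is why the first and last inputs produce no emphasis. U+200B is neither Unicode whitespace nor punctuation under these definitions, so when it is placed before the closing `*` it acts like an ordinary character and makes the run right-flanking.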

cinty8b · 2018-09-30T02:43:07Z

@jgm @puzrin Thanks for your explanation and possible workaround.
I think there's something more to say.

In English there are always spaces and punctuation to separate words, so it is rare that part of a continuous string has to be emphasized or italicized together with a leading or trailing punctuation mark, as in brown*fox.*jumps. In other words, it is always a word as a whole that is emphasized, not just part of it.

But it's different in Chinese. We seldom use spaces in sentences. Punctuation does almost all the separating work in a paragraph. As a result, it's common in Chinese to want to emphasize a sentence together with its period (。 is the period in Chinese), like 棕色狐狸。**黄色狐狸。**黑色狐狸。 But CommonMark cannot emphasize the sentence in the middle.
Of course a zero-width space will solve the problem, but it is a little complicated for new users to remember and to understand why a special symbol has to be inserted. And I think it causes another problem: the zero-width space is invisible. I may look for 黄色狐狸。黑色狐狸。 in the text, but 黄色狐狸。黑色狐狸。 does not match. It can be confusing.
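
The search problem can be reproduced with a short sketch. The position of the zero-width space (just before the closing `**`) is an assumption, and the variable names are hypothetical. Stripping the `**` markers stands in for rendering, only to show what text a reader would end up searching.

```javascript
const ZWSP = '\u200B'; // zero-width space, invisible when displayed

// Source text with the workaround applied (ZWSP before the closing `**`).
const source = '棕色狐狸。**黄色狐狸。' + ZWSP + '**黑色狐狸。';

// Crude stand-in for rendering: drop the emphasis markers, keep the text.
const visibleText = source.replace(/\*\*/g, '');

// What a reader would type into a search box.
const query = '黄色狐狸。黑色狐狸。';

console.log(visibleText.includes(query));                      // false
console.log(visibleText.split(ZWSP).join('').includes(query)); // true
```

The second check shows one possible mitigation: removing U+200B from the text before matching. That only helps where the search code is under your control.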

I hope I make it clear.

puzrin · 2018-09-30T03:03:01Z

@cinty8b AFAIK, there are some known spec issues with Asian languages (no spaces) that have no good resolution. Most likely this one has already been discussed on the CommonMark forum. Try posting there. It is probably worth raising the topic again.

jgm · 2018-09-30T06:03:30Z

Here are some relevant links:
https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491
commonmark/cmark#208 (comment)

tats-u · 2023-12-18T14:09:13Z

This is because the definitions of left- and right-flanking delimiter runs introduced in CM 0.14+ were designed under the erroneous assumption that all languages (including Chinese and Japanese!) put spaces around punctuation marks.
Unless they are changed, we still cannot parse the following case:

当社の**[製品A](https://example.com/product-a)**をぜひお試しください！

If the spec were revised based on commonmark/commonmark-spec#650 (comment), most cases would be improved.

rlidwka mentioned this issue Sep 13, 2016

Emphasis cannot be recognized when it has no space with following words markdown-it/markdown-it#285

Closed

jgm closed this as completed in db0503b Sep 14, 2016

colinodell added a commit to thephpleague/commonmark that referenced this issue Nov 22, 2016

Fix incomplete punctuation regex (mirrors commonmark/commonmark.js#108)

6177c2e

cinty8b mentioned this issue Sep 28, 2018

Markdown-it 渲染斜体有问题 (Markdown-it has a problem rendering italics) vnotex/vnote#429

Closed

haqer1 mentioned this issue Apr 17, 2020

Add a sub-section on usage of zero-width space commonmark/commonmark-spec#643

Open

spencer246 mentioned this issue Aug 22, 2020

[markdown rendering issue] stick words with italic is not working javascript-tutorial/en.javascript.info#2040

Open
