Punctuation regexp is incomplete #108

Closed
rlidwka opened this issue Sep 13, 2016 · 13 comments

@rlidwka
Contributor

rlidwka commented Sep 13, 2016

I came across a discrepancy between cmark and commonmark.js output:

$ echo '**。**话' | ./cmark/build/src/cmark 
<p>****</p>
$ echo '**。**话' | ./commonmark.js/bin/commonmark 
<p><strong></strong></p>

So, according to spec version 0.26,

A punctuation character is an ASCII punctuation character or anything in the Unicode classes Pc, Pd, Pe, Pf, Pi, Po, or Ps.

The character "。" (U+3002) belongs to the class Punctuation, Other [Po] (see http://www.fileformat.info/info/unicode/char/3002/index.htm), but it is not included here:

https://github.com/jgm/commonmark.js/blob/3587c91c62128e54a236648ff1ac4a1ad1cd5ad8/lib/inlines.js#L41

For reference, here's the regexp from the unicode-8.0.0 package (which we use in markdown-it); it includes this character and appears to be a lot larger:

https://github.com/mathiasbynens/unicode-8.0.0/blob/master/General_Category/Punctuation/regex.js
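As a quick cross-check of the classification above, here is a minimal sketch that tests characters against the definition quoted from the spec. It assumes a JavaScript engine with Unicode property escapes (for example Node.js 10 or later); the helper name isPunctuation is hypothetical and is not part of commonmark.js or markdown-it.

```javascript
// ASCII punctuation as listed by the CommonMark spec
// (this range also covers symbols such as $ and +).
const ASCII_PUNCT = /^[!-\/:-@\[-`{-~]$/;

// \p{P} is the Unicode general category "Punctuation",
// i.e. the union of Pc, Pd, Pe, Pf, Pi, Po and Ps.
const UNICODE_PUNCT = /^\p{P}$/u;

// Hypothetical helper: true if `ch` (a single character) is a
// punctuation character under the definition quoted above.
function isPunctuation(ch) {
  return ASCII_PUNCT.test(ch) || UNICODE_PUNCT.test(ch);
}

console.log(isPunctuation('。')); // true  (U+3002, category Po)
console.log(isPunctuation('话')); // false (category Lo)
console.log(isPunctuation('$'));  // true  (ASCII punctuation)
```

Letting the engine supply the category data avoids a hand-maintained character list, at the cost of depending on whichever Unicode version the runtime ships.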

@rlidwka
Contributor Author

rlidwka commented Sep 13, 2016

PS: The example above is a fragment of this Chinese text:

**无论你是在打网球比赛,在打牌还是在选股票,有经验的人总会猜测对手会怎么做。**这样的话

I don't know the language, but it doesn't seem to use whitespace at all. Maybe the CommonMark rules don't quite work with Chinese.

Still, commonmark.js behaves differently from what the spec says, so I opened a bug report here.

jgm closed this as completed in db0503b Sep 14, 2016
@jgm
Member

jgm commented Sep 14, 2016

Thanks!

@puzrin
Contributor

puzrin commented Sep 14, 2016

@jgm I'd recommend using require() to load files from the unicode-* package instead of hardcoding the regexp. Or you can use our proxy, https://github.com/markdown-it/uc.micro, so that nothing needs updating when unicode-10.+ is released.

@cinty8b

cinty8b commented Sep 28, 2018

It seems the problem is still there.
[screenshot: snipaste_2018-09-29_00-45-15]

@jgm
Member

jgm commented Sep 28, 2018 via email

@puzrin
Contributor

puzrin commented Sep 29, 2018

markdown-it does not use commonmark.js.

That's correct, but it uses the existing spec for tests. The current sample from the commit passes without issues.

@cinty8b, I'd recommend making your reports more detailed. Screenshots are very inconvenient and almost useless:

  1. Try https://spec.commonmark.org/dingus/ to find out whether your issue is related to commonmark.js/the spec or not. Use permalinks (not screenshots) for samples.
  2. If you are absolutely sure that the official dingus is OK, try https://markdown-it.github.io/ and report to its tracker, not here (with a permalink too).

Maybe that's a new issue.

@jgm IMHO the existing fix with a hardcoded regexp is neither human-readable nor maintainable: nobody knows the data source or how up to date it is. If you'd rather not use external packages, it's worth adding comments with a link to the original and the Unicode version number.

@cinty8b

cinty8b commented Sep 29, 2018

Sorry for the screenshot.
I'm just a user, not a developer, so I am not familiar with the terminology or the conventional bug-report process. I apologize for any inconvenience that caused and for reporting in the wrong place.
I traced the problem from "Emphasis cannot be recognized when it has no space with following words" (#285); @rlidwka opened this issue from there.

I tried https://spec.commonmark.org/dingus/, and it behaves the same as markdown-it.
Here is the sample.
The "fox.", ".fox", "狐狸。", and "。狐狸" parts are not italic as expected.
[screenshot: snipaste_2018-09-29_00-45-15]

@puzrin
Contributor

puzrin commented Sep 29, 2018

@cinty8b thanks for the details. As far as I understand, your case is not specific to Asian languages, because it is reproducible in English too. So it's not related to this issue. IMHO it's worth creating a new one so it doesn't get lost.

@jgm could you advise on a better place to forward this?

@jgm
Member

jgm commented Sep 29, 2018 via email

@cinty8b

cinty8b commented Sep 30, 2018

@jgm @puzrin Thanks for your explanation and possible workaround.
I think there's something more to say.

In English there are always spaces and punctuation to separate words, so it is rare that part of a continuous string has to be emphasized or italicized together with a leading or trailing punctuation mark, as in brown*fox.*jumps. In other words, it is almost always a whole word that is emphasized, not just part of one.

But it's different in Chinese. We seldom use spaces in sentences. Punctuation does almost all the separating work in a paragraph. As a result, it's common in Chinese to want to emphasize a sentence together with its period (。 is the period in Chinese), like 棕色狐狸。**黄色狐狸。**黑色狐狸。 But CommonMark cannot emphasize the sentence in the middle.
Of course a zero-width space will solve the problem, but it is a little complicated for new users to remember and to understand why a special symbol has to be inserted. And I think it causes another problem: the zero-width space is invisible. I may search for 黄色狐狸。黑色狐狸。 in the text, but 黄色狐狸。&#x200b;黑色狐狸。 does not match. That can be confusing.

I hope that makes it clear.
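
The search mismatch described above can be reproduced with plain string operations. This is only an illustrative sketch; the variable names are hypothetical, and it says nothing about how any particular editor or browser implements search.

```javascript
// What the reader sees and types into a search box.
const visible = '黄色狐狸。黑色狐狸。';

// The same text with a zero-width space (U+200B) after the first period,
// as it would be stored if the workaround were used.
const withZwsp = '黄色狐狸。\u200b黑色狐狸。';

// A plain substring search fails because of the invisible character.
console.log(withZwsp.includes(visible)); // false

// Stripping U+200B before searching restores the match.
const stripped = withZwsp.replace(/\u200b/g, '');
console.log(stripped.includes(visible)); // true
```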

@puzrin
Contributor

puzrin commented Sep 30, 2018

@cinty8b AFAIK, there are some known spec issues with Asian languages (no spaces) that have no good resolution. Most likely this one has already been discussed on the CommonMark forum. Try posting there. It's probably worth raising the topic again.

@jgm
Member

jgm commented Sep 30, 2018 via email

@tats-u
Contributor

tats-u commented Dec 18, 2023

This is because the definitions of left- and right-flanking delimiter runs introduced in CM 0.14+ were designed under the erroneous assumption that all languages (including Chinese and Japanese!) put spaces around punctuation marks.
Unless they are changed, we still cannot parse the following case:

当社の**[製品A](https://example.com/product-a)**をぜひお試しください!

If the spec were revised based on commonmark/commonmark-spec#650 (comment), most cases would be improved.
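
To make the flanking rules concrete, here is a minimal sketch of how I read the spec's left- and right-flanking definitions, applied to the characters immediately before and after a delimiter run. It uses the punctuation definition quoted earlier in this thread and assumes an engine with Unicode property escapes. The names isWhitespace, isPunct and flanking are hypothetical, and this is not the code used by commonmark.js or cmark.

```javascript
// Start/end of line (undefined here) counts as whitespace.
const isWhitespace = (ch) =>
  ch === undefined || /^[\t\n\f\r\p{Zs}]$/u.test(ch);

// ASCII punctuation or Unicode category P (Pc, Pd, Pe, Pf, Pi, Po, Ps).
const isPunct = (ch) =>
  ch !== undefined && /^(?:[!-\/:-@\[-`{-~]|\p{P})$/u.test(ch);

// `before` / `after` are the characters adjacent to the delimiter run.
function flanking(before, after) {
  const leftFlanking =
    !isWhitespace(after) &&
    (!isPunct(after) || isWhitespace(before) || isPunct(before));
  const rightFlanking =
    !isWhitespace(before) &&
    (!isPunct(before) || isWhitespace(after) || isPunct(after));
  return { leftFlanking, rightFlanking };
}

// Closing ** in "**。**话": preceded by punctuation, followed by a letter.
// Not right-flanking, so it cannot close emphasis.
console.log(flanking('。', '话'));

// Opening ** in "当社の**[製品A](...)**を...": preceded by a letter,
// followed by "[". Not left-flanking, so it cannot open emphasis.
console.log(flanking('の', '['));

// Closing * in "*fox.* " (followed by a space): right-flanking.
console.log(flanking('.', ' '));
```

For `*` delimiters, a run can open emphasis only if it is left-flanking and close it only if it is right-flanking, which is why the first two cases are left as literal asterisks.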
