-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve _HTMLWordTruncator. Fix #2863 #2864
Conversation
It would be nice to add a test case for this with non-latin text. Also, I'm a bit puzzled by the shortening of summaries in the current test cases. I don't see any apparent reason for that, but I need to look into it a bit further. |
Sorry to bother you, but have you found the reason now? 🤣 |
@ImBearChild, sorry I didn't really have the chance to look at it in detail before. Now that I have looked into it, I think the issue is the SBC=r"[\u0030-\u024f]|[\u0400-\u04FF]", especially the first part. It is too permissive. It includes punctuations, braces ( SBC=r"[0-9a-zA-Z]|[\u00C0-\u024f]|[\u0400-\u04FF]"
# ASCII |Extended Latin | Cyrillic with this test cases remain same. |
Well, should I close this pull request and open a new one with your improved code? And I think I have to check DBC |
Hi NianQing. 👋 Thank you for asking first, instead of just closing and creating a new PR. Many folks do that without asking first, which is a shame because it is totally unnecessary. So thanks again for asking! Pull requests are designed to be modified, so it is fairly straightforward to make follow-up changes. Any new commits that are pushed to your forked repo's
Following are some related links from our documentation that you may find helpful:
I hope this was helpful. How else might I assist you? |
45aacfd
to
963c552
Compare
Thank you for your instruction! And now it's time to check these new code.😄 |
21ef383
to
3af8382
Compare
That's strange, I've run all the tests on my machine and no failure happen. 🤔 |
There is currently an issue with feedgenerator and tests are likely failing because of that. I don't think your changes are the cause. Speaking of tests, can you add one or two tests for |
3af8382
to
2417f85
Compare
Oh, I have found where the problem is. I used wrong regular expressions like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thank you!
Use more than one unicode block in _word_regex, making word count function behave properly with CJK, cyrillic and more latin characters when generating summary.
2417f85
to
73aca20
Compare
Sorry, I've just found another typo in regular expression. I used |
Merged manually via 22192c1. Many thanks to @ImBearChild for the nice enhancement and to @avaris for reviewing. ✨ |
Use more than one unicode block in _word_regex, making word count function behave properly with CJK, cyrillic and more latin characters when generating summary.
Pull Request Checklist
Resolves: #2863