OCRTesseract can not recognize Chinese #873

Sunzhuokai · 2020-02-27T06:45:54Z

Summary of your issue

OCRTesseract can not recognize Chinese

Environment

What did you do when you faced the problem?

//write here

Example code:

using (var ocr = OCRTesseract.Create("tessdata", "chi_sim"))
{
    ocr.Run(topRoi, out text, out _componentRects, out _componentTexts, out _confidences);
}

Output:

Only the characters 0-9, a-z, and A-Z are recognized.

What did you intend to be?

shimat · 2020-02-27T07:26:24Z

Does your tessdata folder contain chi_sim.traineddata?

Sunzhuokai · 2020-02-27T07:28:34Z

> Does your tessdata contain chi_sim.traineddata ?

Yes.

Sunzhuokai · 2020-02-27T07:34:09Z

@shimat When I use the https://github.com/charlesw/tesseract wrapper, it works.

shimat · 2020-02-27T23:19:40Z

I could reproduce this problem. It may be an OpenCV bug.
opencv/opencv_contrib#2062

stale · 2020-08-25T23:54:23Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jerry-Gump · 2021-09-09T01:55:41Z

I have the same problem. Is there any way to fix it?

OK, maybe this is the reason: tesseract-ocr/tesseract#1250

n0099 · 2023-03-07T21:00:40Z

From https://docs.opencv.org/4.7.0/d7/ddc/classcv_1_1text_1_1OCRTesseract.html#a391b1e753f0b779b72204ec15200a99a:

char_whitelist specifies the list of characters used for recognition. NULL defaults to "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".

opencvsharp/src/OpenCvSharp/Modules/text/OCRTesseract.cs, line 37 in 0498156:

    /// null defaults to "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".</param>

opencvsharp/src/OpenCvSharpExtern/text.h, line 87 in 0498156:

    const auto result = cv::text::OCRTesseract::create(datapath, language, char_whitelist, oem, psmode);

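OpenCvSharp passes the whitelist straight through to OpenCV, so leaving it null applies the default above. Manually setting the whitelist to an empty string "" works for me.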
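For reference, the workaround might look like the sketch below. It assumes the third parameter of `OCRTesseract.Create` is the character whitelist (as the source lines quoted above suggest), that `tessdata` contains `chi_sim.traineddata`, and that `chinese_text.png` is a hypothetical input image. It has not been checked against every OpenCvSharp release.

```csharp
using System;
using OpenCvSharp;
using OpenCvSharp.Text;

// Hypothetical input image; replace with your own file or ROI.
using var image = Cv2.ImRead("chinese_text.png");

// Pass "" as the whitelist instead of leaving it null, so the default
// [0-9a-zA-Z] whitelist described in the OpenCV docs is not applied.
using var ocr = OCRTesseract.Create("tessdata", "chi_sim", "");

ocr.Run(image, out var text, out var componentRects, out var componentTexts, out var confidences);
Console.WriteLine(text);
```

If the output still contains only Latin letters and digits, it may be worth confirming that the language data file is actually being loaded from the given `tessdata` path.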

stale bot added the wontfix label Aug 25, 2020

stale bot closed this as completed Sep 2, 2020

n0099 mentioned this issue Mar 7, 2023: OCR on cyrillic text #1364 (Closed)

n0099 added a commit to n0099/open-tbm that referenced this issue Mar 7, 2023

ac032ce: * fix only characters within the default whitelist [0-9a-zA-Z] will be recognized by tesseract: shimat/opencvsharp#873 (comment) @ `TesseractRecognizer.ctor()` @ crawler
