Best Traineddata Feedback - Hindi #66

Shreeshrii · 2017-08-01T13:26:12Z

hin.lstm-unicharset does not have the following Devanagari characters and combining marks:

ङ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 129 0 129 ङ # ङ [919 ]x
ऍ | 2317 | ऍ | 090D | DEVANAGARI LETTER CANDRA E
ॅ 0 0,255,0,255,0,0,0,0,0,0 Devanagari 124 17 124 ॅ # ॅ [945 ]
ॐ 1 0,255,0,255,0,0,0,0,0,0 Devanagari 158 0 158 ॐ # ॐ [950 ]x

पङ्कज
गङ्गा
ऍण्ड
ऍक्ट
डू यू हैव अ पॅन
फॅरनहाइट
ॐ
ॐकार
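A quick way to check a word list against a unicharset is sketched below. It assumes the usual unicharset text layout (first line is the entry count, then one entry per line with the unichar as the first whitespace-separated token). The function names and the tiny sample data are made up for illustration and are not part of Tesseract. Since an entry may hold a whole cluster rather than a single code point, a code point is treated as covered if it occurs inside any entry.

```python
def load_unichars(lines):
    """Collect the unichar strings from unicharset text lines.

    Assumes the first line is the entry count and that every later
    line starts with the unichar followed by its properties.
    """
    it = iter(lines)
    next(it, None)  # skip the entry-count header line
    unichars = set()
    for line in it:
        parts = line.split()
        if parts:
            unichars.add(parts[0])
    return unichars


def missing_codepoints(words, unichars):
    """Return code points used in words that occur in no unichar entry.

    The result keeps first-seen order and contains no duplicates.
    """
    covered = set("".join(unichars))
    missing = []
    for word in words:
        for ch in word:
            if ch.isspace():
                continue
            if ch not in covered and ch not in missing:
                missing.append(ch)
    return missing


# Hypothetical three-entry unicharset; the property columns are placeholders.
sample_unicharset = [
    "3",
    "NULL 0",
    "प 1",
    "क 1",
]

unichars = load_unichars(sample_unicharset)
for ch in missing_codepoints(["पङ्कज", "ॐ"], unichars):
    print(f"U+{ord(ch):04X} {ch}")
```

For a real check, the lines would come from the extracted hin.lstm-unicharset file, for example `open(path, encoding="utf-8")`, and the words from a test list like the one above.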

Shreeshrii · 2017-08-01T17:49:51Z

See https://shreeshrii.github.io/tess4eval-san/ for accuracy reports on Hindi and Bihari language samples (the two are not segregated).

The images used are listed in
https://github.com/Shreeshrii/tess4eval-san/blob/master/0createcache.sh

I have NOT looked at the wordlists yet because I was under the impression that they do not make much difference to accuracy for LSTM models. Is that correct, @theraysmith?

Shreeshrii · 2017-08-02T08:47:13Z

Some of the recognition errors for Hindi occur because the image uses a different orthographic style for some of the letters. Please see https://shreeshrii.github.io/tess4eval-san/index-4-hinbest.html where the errors relate to
अ
आ
ओ
औ
and
झ
for bhojpurilokgatha005035mbp_0278.tif

Interestingly, these are recognized correctly in the original hin.traineddata for 4.00.00-alpha.

These can be fixed by ensuring that fonts with the different orthographies are included in training.

@theraysmith If you provide a list of Devanagari fonts used for training, I can check for this.

Shreeshrii · 2017-08-02T09:07:54Z

For wordlists/training_text for modern languages, I would also suggest using the localization lists from unicode.org.

Please see:
http://www.unicode.org/cldr/charts/31/summary/hi.html
http://www.unicode.org/cldr/charts/31/summary/mr.html

See http://www.unicode.org/cldr/charts/31/summary/root.html
for the languages for which this info is available.

Shreeshrii · 2017-08-02T09:11:30Z

Also see the comments on #64 (feedback regarding Sanskrit).

Shreeshrii · 2017-09-04T10:15:59Z

See attached reports, run using https://github.com/eddieantonio/isri-ocr-evaluation-tools which supports utf-8 text.

ALL-hin-imageshin-rpt.txt
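For anyone who wants to sanity-check a single page without the full toolchain, here is a rough sketch of a character accuracy figure based on edit distance over Unicode code points. This is my own simplified version and is not the implementation used by the ISRI tools; the function names are made up, and counting per code point (rather than per grapheme cluster) is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted per code point."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ref, hyp):
    """(reference length - errors) / reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)


# Ground truth vs. an OCR result that dropped the ङ and the virama.
print(char_accuracy("गङ्गा", "गगा"))
```

Note that the value can go negative when the OCR output is much longer than the ground truth.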

Shreeshrii mentioned this issue Aug 1, 2017

Added best traineddatas for 4.00 alpha #62

Open
