Best Traineddata Feedback - Devanagari Script #69

Shreeshrii · 2017-08-04T05:43:39Z

The Best traineddata for Devanagari script includes the complete English alphabet. The wordlist also contains a complete English wordlist, including words in ALL CAPS.

@theraysmith What is the logic for this? It will be useful to have 0-9 numbers in Latin script as part of Devanagari traineddata, but why full English support? I think this reduces accuracy...

theraysmith · 2017-08-04T15:44:52Z

For Latin, I have ~4500 fonts to train with.
For Devanagari ~50, and for Kannada 15.
With a theory that poor accuracy on test data and over-fitting on training data was caused by the lack of fonts, I tried mixing the training data with English, thinking that English is often mixed in anyway, and some of the font diversity might generalize to the other script. The overall effect was slightly positive, so I left it that way.
It didn't resolve the issue with poor accuracy on test data, and even though a lot of languages are affected, I think there are multiple problems at work: with Hebrew/Arabic it's points, and with (at least some of) the Indic scripts, it looks like it might be older styles of typography.

Shreeshrii · 2017-08-04T16:37:00Z

I will list fonts with older orthography later.

I hope you already have the Devanagari fonts listed at https://github.com/tesseract-ocr/tesseract/wiki/Fonts

There are also a large number of fonts available from http://ildc.in/Hindi/GIST/hindi_cd_2/windows/index.htm
It is part of a larger zipped file.

Is there a method for creating box/tiff pairs from scanned images? We could use it to create more variations for training.

Shreeshrii · 2017-08-04T17:07:07Z

Fonts with older/northern orthography

Santipur OT
Siddhanta-Calcutta
Sahadeva
Uttara

Some older orthography alphabet images are at
https://github.com/Shreeshrii/tess4eval_deva/tree/master/images
with corresponding ground truth at
https://github.com/Shreeshrii/tess4eval_deva/tree/master/gt

Also, more variations of fonts can be created via exposures -3, -2, -1, 0, 1 and by the additional program to add more noise to the images (you had mentioned adding the feature to text2image).
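As a rough sketch of the exposure idea above, the following builds one text2image command line per exposure level for a single font. The training text name, output base, and fonts directory are placeholders chosen for illustration, not values from this thread; `--max_pages=3` mirrors the setting used in tesstrain_utils.sh. The commands are only printed, not executed.

```python
import shlex


def text2image_commands(font, training_text, outputbase,
                        exposures=(-3, -2, -1, 0, 1),
                        fonts_dir="/usr/share/fonts"):
    """Return one text2image command string per exposure level.

    All paths and names passed in are hypothetical examples; adjust them
    to the local setup before running anything.
    """
    commands = []
    for exposure in exposures:
        args = [
            "text2image",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
            f"--text={training_text}",
            # One output base per exposure so the pages do not overwrite each other.
            f"--outputbase={outputbase}.exp{exposure}",
            f"--exposure={exposure}",
            "--max_pages=3",
        ]
        # shlex.join quotes arguments that contain spaces (e.g. "Santipur OT").
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for cmd in text2image_commands("Uttara", "san.training_text", "san.Uttara"):
        print(cmd)
```

Printing the commands first makes it easy to check the font name and paths before generating any images.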

If I want to experiment further for Devanagari training -

How many lines of text do you suggest using for fine-tuning?
How many iterations? Does it need to go to 0.01%?
Should the training text be the same for all fonts, or should I use different text but the same number of lines for each?
Do the wordlists get generated from the training data? Can the word dawgs be replaced by u

Shreeshrii · 2017-08-05T04:04:33Z

@theraysmith

Just saw https://github.com/tesseract-ocr/tesseract/blob/2633fef0b6ac9b616eae3d457bf796076eb8f43c/training/tesstrain_utils.sh#L215

common_args+=" --outputbase=${outbase} --max_pages=3"

Does that mean the recommended training_text size for fine-tuning or replacing a layer is about 150-200 lines?

Shreeshrii · 2017-08-24T09:49:17Z

For an example image with a mix of Hindi and Sanskrit text and a lot of punctuation (quotes, hyphens, etc.), I found that the Devanagari traineddata gave the best result, although a few conjunct consonants are NOT being recognized, for example:

ह्र
ह्न
घ्र
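To check whether such conjuncts appear in a training text at all, it can help to look at their code point sequences. The sketch below assumes the three conjuncts are encoded as consonant + virama + consonant (the usual Unicode encoding); the dictionary keys and function names are made up for illustration.

```python
import unicodedata

# Assumed encodings: base consonant + VIRAMA (U+094D) + second consonant.
CONJUNCTS = {
    "hra": "\u0939\u094d\u0930",   # HA + VIRAMA + RA
    "hna": "\u0939\u094d\u0928",   # HA + VIRAMA + NA
    "ghra": "\u0918\u094d\u0930",  # GHA + VIRAMA + RA
}


def describe(cluster):
    """List the code points and Unicode names that make up a cluster."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in cluster]


def count_conjuncts(text):
    """Count how often each conjunct sequence occurs in the given text."""
    normalized = unicodedata.normalize("NFC", text)
    return {name: normalized.count(seq) for name, seq in CONJUNCTS.items()}


if __name__ == "__main__":
    for name, seq in CONJUNCTS.items():
        print(name, describe(seq))
```

Running `count_conjuncts` over a training text gives a quick indication of whether these sequences are rare or absent there, which would be one possible explanation for the misses.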

Image and ground truth are attached.

vallabhamahAgaNapatitrishatinAmAvalI0-gt.txt

Shreeshrii · 2017-09-04T10:14:34Z

Some more testing of Devanagari script traineddata vs san and hin.

In most cases Devanagari does better, but in samples with a large font size (48 px), accuracy drops to 50% or lower.

See attached reports, run using https://github.com/eddieantonio/isri-ocr-evaluation-tools which supports utf-8 text.

ALL-accuracy-imageshin.txt

ALL-deva-imageshin-rpt.txt

ALL-deva-imagessan-rpt.txt

Shreeshrii · 2017-09-04T11:23:47Z

best/mar does better than best/Devanagari for Marathi text.

ALL-deva-imagesmar-rpt.txt

ghost · 2018-06-24T19:18:32Z

@Shreeshrii
> and by the additional program to add more noise to the images

How can more noise be added to the generated images?
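One generic option, independent of Tesseract's own tooling, is to post-process the rendered images. The following is only a minimal sketch of salt-and-pepper noise on a grayscale image held as a list of rows with values 0-255; the function name and parameters are illustrative assumptions, not the program referred to above.

```python
import random


def add_salt_and_pepper(pixels, fraction, seed=0):
    """Return a copy of a grayscale image with random black/white pixels.

    pixels   -- list of rows, each a list of ints in 0..255
    fraction -- probability that any given pixel is replaced
    seed     -- fixed seed so the same noisy image can be regenerated
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in pixels]  # copy so the input stays untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < fraction:
                row[x] = rng.choice((0, 255))  # pepper or salt
    return noisy


if __name__ == "__main__":
    blank = [[255] * 8 for _ in range(4)]
    for row in add_salt_and_pepper(blank, 0.1, seed=42):
        print(row)
```

Keeping the seed fixed means the box files generated alongside the clean image can be reused with the noisy copy, since the noise does not move any glyphs.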

stweil · 2019-07-11T11:54:24Z

@jbarth-ubhd has trained a Devanagari model which gives better results in my first test. Maybe he can share it on tessdata_contrib.

Shreeshrii · 2019-07-12T04:55:00Z

@stweil I am curious to know how much improvement was achieved via finetuning.

See related posts in forum - https://groups.google.com/forum/#!msg/tesseract-ocr/NNZ7GOBLB_8/IMqn2IgzAwAJ
and
https://www.iias.asia/the-newsletter/article/naval-kishore-press-digital-hidden-treasure-open-access

stweil · 2019-07-12T15:52:57Z

I only tested a single page where tessdata_best/script/Devanagari achieved 93 % character recognition rate while Jochen's model recognized 98 %.
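For readers who want to reproduce such a comparison, a character recognition rate can be approximated from the edit distance between ground truth and OCR output. This is a simplified sketch, not the exact method used for the figures above or by the ISRI tools; it counts Unicode code points, so one visible Devanagari akshara may count as several characters.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_accuracy(ground_truth, ocr_output):
    """Percentage of ground-truth characters not accounted for by errors."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return 100.0 * (len(ground_truth) - errors) / len(ground_truth)


if __name__ == "__main__":
    print(character_accuracy("abcd", "abxd"))
```

Note that the result can be negative when the OCR output contains far more characters than the ground truth.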

Shreeshrii · 2019-07-13T04:03:22Z

98% accuracy is very impressive!!


Shreeshrii · 2019-07-22T03:58:46Z

> I only tested a single page where tessdata_best/script/Devanagari achieved 93 % character recognition rate while Jochen's model recognized 98 %.

@stweil, Did you run any additional tests on Jochen's model?

stweil · 2019-07-22T05:28:41Z

No, I didn't.

lokeshh · 2020-03-08T07:09:53Z

@Shreeshrii @stweil Where can I find Jochen's model with 98% accuracy?

Shreeshrii mentioned this issue Aug 5, 2017

Unused function PrepareDistortedPix() tesseract-ocr/tesseract#1052

Closed
