Best Traineddata Feedback - Devanagari Script #69

Open
Shreeshrii opened this issue Aug 4, 2017 · 15 comments

@Shreeshrii
Contributor

The Devanagari script best traineddata includes the complete English alphabet. The wordlist also contains a complete English wordlist, including words in ALL CAPS.

@theraysmith What is the logic for this? It would be useful to have the digits 0-9 in Latin script as part of the Devanagari traineddata, but why full English support? I think this reduces accuracy...

@theraysmith
Contributor

For Latin, I have ~4500 fonts to train with.
For Devanagari ~50, and for Kannada 15.
With a theory that poor accuracy on test data and over-fitting on training data was caused by the lack of fonts, I tried mixing the training data with English, thinking that English is often mixed in anyway, and some of the font diversity might generalize to the other script. The overall effect was slightly positive, so I left it that way.
It didn't resolve the issue with poor accuracy on test data, and even though a lot of languages are affected, I think there are multiple problems at work: with Hebrew/Arabic it's points, and with (at least some of) the Indic scripts, it looks like it might be older styles of typography.

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 4, 2017

I will list fonts with older orthography later.

I hope you already have the Devanagari fonts listed at https://github.com/tesseract-ocr/tesseract/wiki/Fonts

There are also a large number of fonts available from http://ildc.in/Hindi/GIST/hindi_cd_2/windows/index.htm
It is part of a larger zipped file.

Is there a method for creating box/tiff pairs from scanned images? We could use it to create more variations for training.

@Shreeshrii
Contributor Author

Fonts with older/northern orthography

Santipur OT
Siddhanta-Calcutta
Sahadeva
Uttara

Some older orthography alphabet images are at
https://github.com/Shreeshrii/tess4eval_deva/tree/master/images
with corresponding ground truth at
https://github.com/Shreeshrii/tess4eval_deva/tree/master/gt

Also, more variations of fonts can be created via exposures -3, -2, -1, 0, 1 and by the additional program to add more noise to the images (you had mentioned adding the feature to text2image).
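The exposure idea above can be sketched as a small helper that builds one `text2image` command line per exposure level. This is only an illustration under assumptions: the flag names (`--text`, `--font`, `--fonts_dir`, `--exposure`, `--outputbase`) are the ones I understand the Tesseract training tool to accept and should be checked against `text2image --help`; the file names, font directory, and the function `text2image_commands` are hypothetical.

```python
import shlex


def text2image_commands(text_file, font, fonts_dir, out_prefix,
                        exposures=(-3, -2, -1, 0, 1)):
    """Build one text2image argument list per exposure level.

    Nothing is executed here; the caller decides whether to run the
    commands (for example with subprocess.run).
    """
    commands = []
    for exposure in exposures:
        # Encode the exposure in the output name so the variants do not
        # overwrite each other.
        outputbase = f"{out_prefix}.exp{exposure}"
        commands.append([
            "text2image",
            f"--text={text_file}",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
            f"--exposure={exposure}",
            f"--outputbase={outputbase}",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical inputs, shown only to illustrate the call.
    for cmd in text2image_commands("deva.training_text", "Sahadeva",
                                   "/usr/share/fonts", "deva.Sahadeva"):
        print(shlex.join(cmd))
```

Printing the commands first, rather than running them directly, makes it easy to review the font and exposure combinations before generating any images.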


If I want to experiment further for Devanagari training -

  • how many lines of text do you suggest to use for finetuning?

  • how many iterations? Does it need to go to 0.01%?

  • Should the training text be same for all fonts or should I use different text but same number of lines for each?

  • Do the wordlists get generated from the training data? Can the word dawgs be replaced by u

@Shreeshrii
Contributor Author

@theraysmith

Just saw https://github.com/tesseract-ocr/tesseract/blob/2633fef0b6ac9b616eae3d457bf796076eb8f43c/training/tesstrain_utils.sh#L215

common_args+=" --outputbase=${outbase} --max_pages=3"

Does that mean that the recommended training_text size for fine-tuning or replacing a layer is about 150-200 lines?

@Shreeshrii
Contributor Author

For an example image with a mix of Hindi and Sanskrit text and a lot of punctuation (quotes, hyphens, etc.), I found that the Devanagari traineddata gave the best result, though a few conjunct consonants are NOT being recognized.

eg.

ह्र
ह्न
घ्र

Image and ground truth are attached.

vallabhamahaganapatitrishatinamavali0
vallabhamahAgaNapatitrishatinAmAvalI0-gt.txt
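One way to check for missed conjuncts like the three listed above is to count how often each appears in the ground truth and in the OCR output. The sketch below is a rough illustration, not part of any Tesseract tooling: the function name `conjunct_counts` is hypothetical, and it assumes both texts are plain Unicode strings in which each conjunct is encoded as consonant + virama (U+094D) + consonant.

```python
import unicodedata

# The three conjuncts from the comment, written as escapes so the code
# point sequence is unambiguous: consonant + virama (U+094D) + consonant.
CONJUNCTS = [
    "\u0939\u094d\u0930",  # ha + virama + ra
    "\u0939\u094d\u0928",  # ha + virama + na
    "\u0918\u094d\u0930",  # gha + virama + ra
]


def conjunct_counts(ground_truth, ocr_output, conjuncts=CONJUNCTS):
    """Return {conjunct: (count in ground truth, count in OCR output)}."""
    # Normalize both sides the same way so equivalent encodings compare equal.
    gt = unicodedata.normalize("NFC", ground_truth)
    ocr = unicodedata.normalize("NFC", ocr_output)
    return {c: (gt.count(c), ocr.count(c)) for c in conjuncts}
```

A conjunct whose second count is lower than its first was dropped or misread somewhere. The counts do not say where, so they are a screening aid rather than an alignment.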

@Shreeshrii
Contributor Author

Some more testing of Devanagari script traineddata vs san and hin.

In most cases Devanagari does better, but in samples with a large font size (48 px), accuracy drops to 50% or lower.

See attached reports, run using https://github.com/eddieantonio/isri-ocr-evaluation-tools which supports utf-8 text.

ALL-accuracy-imageshin.txt

ALL-deva-imageshin-rpt.txt

ALL-deva-imagessan-rpt.txt
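For readers who want to reproduce a character-level accuracy figure without the evaluation tools, here is a minimal sketch. It assumes accuracy is defined as (n - edit distance) / n over Unicode code points, where n is the length of the ground truth; the ISRI tools may count errors differently, so the numbers need not match the attached reports exactly. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_output):
    """Percentage of ground-truth characters recovered, under the
    assumption stated above. Can be negative for very noisy output."""
    n = len(ground_truth)
    if n == 0:
        raise ValueError("ground truth must not be empty")
    return 100.0 * (n - edit_distance(ground_truth, ocr_output)) / n
```

Because Devanagari syllables span several code points (consonant, virama, vowel signs), one misread conjunct can count as more than one error under this definition.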

@Shreeshrii
Contributor Author

best/mar does better than best/Devanagari for Marathi text.

ALL-deva-imagesmar-rpt.txt

@ghost

ghost commented Jun 24, 2018

@Shreeshrii "and by the additional program to add more noise to the images"
How to add more noise to the generated images?
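The thread does not identify the program referred to, so the following is a generic technique rather than that tool: a minimal salt-and-pepper noise sketch using only the standard library. It assumes the image is already available as a list of rows of 8-bit grayscale values (0 = black, 255 = white); loading and saving with an imaging library is left out, and the function name and the `amount` parameter are hypothetical.

```python
import random


def add_salt_and_pepper(pixels, amount=0.02, seed=None):
    """Return a copy of a grayscale image with random black/white specks.

    pixels: list of rows, each a list of ints in 0..255.
    amount: approximate fraction of pixels to overwrite (half black,
            half white).
    seed:   optional seed so a noisy variant can be regenerated.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in pixels]  # leave the input untouched
    for row in noisy:
        for x in range(len(row)):
            r = rng.random()
            if r < amount / 2:
                row[x] = 0      # pepper: black speck
            elif r < amount:
                row[x] = 255    # salt: white speck
    return noisy
```

Using a fixed seed per variant keeps the generated training images reproducible, which helps when comparing models trained on the same noisy set.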

@stweil
Contributor

stweil commented Jul 11, 2019

@jbarth-ubhd has trained a Devanagari model which gives better results in my first test. Maybe he can share it on tessdata_contrib.

@Shreeshrii
Contributor Author

@stweil I am curious to know how much improvement was achieved via finetuning.

See related posts in forum - https://groups.google.com/forum/#!msg/tesseract-ocr/NNZ7GOBLB_8/IMqn2IgzAwAJ
and
https://www.iias.asia/the-newsletter/article/naval-kishore-press-digital-hidden-treasure-open-access

@stweil
Contributor

stweil commented Jul 12, 2019

I only tested a single page where tessdata_best/script/Devanagari achieved 93 % character recognition rate while Jochen's model recognized 98 %.

@Shreeshrii
Contributor Author

Shreeshrii commented Jul 13, 2019 via email

@Shreeshrii
Contributor Author

"I only tested a single page where tessdata_best/script/Devanagari achieved 93 % character recognition rate while Jochen's model recognized 98 %."

@stweil, did you run any additional tests on Jochen's model?

@stweil
Contributor

stweil commented Jul 22, 2019

No, I didn't.

@lokeshh

lokeshh commented Mar 8, 2020

@Shreeshrii @stweil Where can I find Jochen's model with 98% accuracy?
