Best Traineddata Feedback - Devanagari Script #69
Comments
For Latin, I have ~4500 fonts to train with. |
I will list fonts with older orthography later. I hope you already have the Devanagari fonts listed at https://github.com/tesseract-ocr/tesseract/wiki/Fonts There are also a large number of fonts available from http://ildc.in/Hindi/GIST/hindi_cd_2/windows/index.htm Is there a method for creating box/tiff pairs from scanned images? We could use it to create more variations for training. |
Fonts with older/northern orthography: Santipur OT
Some older orthography alphabet images are at
Also, more variations of fonts can be created via exposures -3, -2, -1, 0, 1 and by the additional program to add more noise to the images (you had mentioned adding the feature to text2image).
If I want to experiment further for Devanagari training -
|
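The exposure variations mentioned above could be scripted roughly as follows. This is only a sketch: it builds one `text2image` command line per exposure level and prints it rather than running it. The font name, training text file, output prefix, and fonts directory are placeholders chosen for illustration, not values from this thread.

```python
import shlex


def text2image_commands(font, text_path, out_prefix, fonts_dir,
                        exposures=(-3, -2, -1, 0, 1)):
    """Build one text2image argument list per exposure level."""
    commands = []
    for exposure in exposures:
        # Put the exposure in the output name so the runs do not overwrite each other.
        outputbase = f"{out_prefix}.exp{exposure}"
        commands.append([
            "text2image",
            f"--text={text_path}",
            f"--outputbase={outputbase}",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
            f"--exposure={exposure}",
        ])
    return commands


if __name__ == "__main__":
    # All four arguments below are hypothetical placeholders.
    for cmd in text2image_commands("Santipur OT", "hin.training_text",
                                   "hin.Santipur_OT", "/usr/share/fonts"):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or passed to `subprocess.run` once the paths are adjusted to the local setup.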
Does that mean that the recommended training_text size for fine-tuning or replacing a layer is about 150-200 lines? |
For an example image with a mix of Hindi and Sanskrit text and a lot of punctuation (quotes, hyphens, etc.), I found that the Devanagari traineddata gave the best result, though there are a few conjunct consonants that are NOT being recognized. eg.
Image and ground truth are attached. |
Some more testing of the Devanagari script traineddata vs san and hin. In most cases Devanagari does better, but in samples with a large font size (48 px), accuracy drops to 50% or lower. See the attached reports, run using https://github.com/eddieantonio/isri-ocr-evaluation-tools which supports UTF-8 text. |
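For readers who want a quick sanity check without the evaluation tools, character accuracy can be approximated as one minus the edit distance divided by the ground-truth length. The sketch below is an assumption-laden stand-in, not the ISRI tools' implementation: it counts Unicode code points, so a Devanagari conjunct such as स्त counts as three characters (consonant, virama, consonant).

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def character_accuracy(truth, ocr):
    """Fraction of ground-truth characters not accounted for by edit errors."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return 1 - edit_distance(truth, ocr) / len(truth)


if __name__ == "__main__":
    # OCR output that drops the virama of the conjunct in the ground truth.
    print(character_accuracy("नमस्ते", "नमसते"))
```

Note that the value can go negative when the OCR output contains many more characters than the ground truth.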
best/mar does better than best/Devanagari for Marathi text. |
@Shreeshrii |
@jbarth-ubhd has trained a Devanagari model which gives better results in my first test. Maybe he can share it on tessdata_contrib. |
@stweil I am curious to know how much improvement was achieved via finetuning. See related posts in forum - https://groups.google.com/forum/#!msg/tesseract-ocr/NNZ7GOBLB_8/IMqn2IgzAwAJ |
I only tested a single page where tessdata_best/script/Devanagari achieved 93 % character recognition rate while Jochen's model recognized 98 %. |
98% accuracy is very impressive!!
|
@stweil, Did you run any additional tests on Jochen's model? |
No, I didn't. |
@Shreeshrii @stweil Where can I find Jochen's model with 98% accuracy? |
The Devanagari script best traineddata contains the complete English alphabet. The wordlist also contains a complete English wordlist, including words in ALL CAPS.
@theraysmith What is the logic for this? It is useful to have the digits 0-9 in Latin script as part of the Devanagari traineddata, but why full English support? I think this reduces accuracy...
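One way to quantify how much English is in the wordlist is to classify each word by script. The sketch below assumes the wordlist has already been extracted to a plain text file with one word per line (for example by unpacking the traineddata with `combine_tessdata -u` and converting the word dawg with `dawg2wordlist`). The file name `Devanagari.wordlist` is a placeholder.

```python
import unicodedata
from collections import Counter


def word_script(word):
    """Return 'Latin', 'Devanagari', 'Other', 'Mixed', or 'NoLetters' for a word."""
    scripts = set()
    for ch in word:
        # Count letters and combining marks (virama, vowel signs); skip digits and punctuation.
        if not (ch.isalpha() or unicodedata.category(ch).startswith("M")):
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("DEVANAGARI"):
            scripts.add("Devanagari")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
        else:
            scripts.add("Other")
    if not scripts:
        return "NoLetters"
    return scripts.pop() if len(scripts) == 1 else "Mixed"


def count_scripts(words):
    """Tally words by script."""
    return Counter(word_script(w) for w in words)


if __name__ == "__main__":
    # Hypothetical file name; adjust to wherever the wordlist was extracted.
    with open("Devanagari.wordlist", encoding="utf-8") as f:
        print(count_scripts(line.strip() for line in f if line.strip()))
```

A large Latin count compared with the Devanagari count would support the observation above. It says nothing by itself about the effect on accuracy.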