Added best traineddatas for 4.00 alpha #62

amitdo · 2017-08-01T07:27:50Z

https://github.com/tesseract-ocr/tessdata/tree/3a94ddd47be0

@theraysmith
,
How to present those 'best' files to our users?
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

Do you plan to push more updates to the best directory and/or to the root dir in the next few weeks?

stweil · 2017-08-01T10:21:49Z

The new files include two files for German Fraktur: best/Fraktur.traineddata and best/frk.traineddata. According to my first tests, both are better than the old deu_frak.traineddata and much better than the old frk.traineddata. There is not a clear winner for the two new files: in some cases -l Fraktur gives better results, in some other cases -l frk is better. Even a 3.05 based Fraktur model still is better for some words, but generally the new LSTM based models win the challenge.

Ray, it would be interesting to know the training differences of the two new Fraktur traineddata files. Did they use different fonts / training material / dictionaries?

amitdo · 2017-08-01T12:31:39Z

Related comment from Ray:
tesseract-ocr/tesseract#995 (comment)

2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU only slightly slower for English. Way faster for most non-latin languages, while being <5% worse than "best" Only "best" will be retrainable, as "fast" will be integer.

amitdo · 2017-08-01T13:50:22Z

My guess is that the upper case traineddata files are for 'one script multi langs'.

theraysmith · 2017-08-01T17:38:44Z

I'm currently working on the training documentation, before committing more code, so as not to leave training broken for more than maybe an hour or so. Here's a quick bullet list of what's going on: - Initial capitals indicate the one model for all langs in that script, so eg Latin is all latin-based languages except vie, which has its own Vietnamese. Most of the script models include English training data as well as the script, but not for Cyrillic, as that would have a major ambiguity problem. Devanagari is hin+san+mar+nep+eng, and Fraktur is basically a combination of all the latin-based languages that have an 'old' variant, etc... I would be interested to hear more feedback on the Script models as Stefan already provided for Fraktur. - The tessdata directory doesn't have to be called tessdata any more, so I was thinking of a structuring that allowed maybe best, fast and legacy as separate directories or repos. - I noticed git complain about the size of Latin.traineddata (~100MB), but didn't yet follow the pointer to git large data. - The current code can run the 'best' models, and the existing models, but incremental and fine tuning training will be tied to 'best' with a future commit/push. (Due to a switch to ADAM and the move of the unicharset/recoder). - Fine tuning/incremental training will not be possible from the 'fast' models, as they are 8-bit integer. It will be possible to convert a tuned best to integer to make it faster, but some of the speed in 'fast' will be from the smaller model. - It will be possible to add new characters by fine tuning! I got that working yesterday, and just need to finish updating the documentation.

…

On Tue, Aug 1, 2017 at 6:50 AM, Amit D. ***@***.***> wrote: My guess is that the upper case traineddata files are for 'one script multi lang'. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#62 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AL056YqGx17BmwGUzEaK5AFDE67fqr_rks5sTy0ggaJpZM4OpU_-> .

-- Ray.

Shreeshrii · 2017-08-01T17:45:58Z

Ray, Please see Devanagari feedback at #66 #64 ShreeDevi

…

____________________________________________________________ भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Aug 1, 2017 at 11:08 PM, theraysmith <notifications@github.com> wrote:

I'm currently working on the training documentation, before committing more code, so as not to leave training broken for more than maybe an hour or so. Here's a quick bullet list of what's going on: - Initial capitals indicate the one model for all langs in that script, so eg Latin is all latin-based languages except vie, which has its own Vietnamese. Most of the script models include English training data as well as the script, but not for Cyrillic, as that would have a major ambiguity problem. Devanagari is hin+san+mar+nep+eng, and Fraktur is basically a combination of all the latin-based languages that have an 'old' variant, etc... I would be interested to hear more feedback on the Script models as Stefan already provided for Fraktur. - The tessdata directory doesn't have to be called tessdata any more, so I was thinking of a structuring that allowed maybe best, fast and legacy as separate directories or repos. - I noticed git complain about the size of Latin.traineddata (~100MB), but didn't yet follow the pointer to git large data. - The current code can run the 'best' models, and the existing models, but incremental and fine tuning training will be tied to 'best' with a future commit/push. (Due to a switch to ADAM and the move of the unicharset/recoder). - Fine tuning/incremental training will not be possible from the 'fast' models, as they are 8-bit integer. It will be possible to convert a tuned best to integer to make it faster, but some of the speed in 'fast' will be from the smaller model. - It will be possible to add new characters by fine tuning! I got that working yesterday, and just need to finish updating the documentation. On Tue, Aug 1, 2017 at 6:50 AM, Amit D. ***@***.***> wrote: > My guess is that the upper case traineddata files are for 'one script > multi lang'. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#62# issuecomment-319375598>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/ AL056YqGx17BmwGUzEaK5AFDE67fqr_rks5sTy0ggaJpZM4OpU_-> > . > -- Ray. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#62 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE2_o7elr2M-RcaG2eiGMykqVylg0uQ1ks5sT2KmgaJpZM4OpU_-> .

amitdo · 2017-08-01T20:04:10Z

New traineddata files:
Arabic.traineddata
Armenian.traineddata
Bengali.traineddata
Canadian_Aboriginal.traineddata
Cherokee.traineddata
Cyrillic.traineddata
Devanagari.traineddata
Ethiopic.traineddata
Fraktur.traineddata
Georgian.traineddata
Greek.traineddata
Gujarati.traineddata
Gurmukhi.traineddata
HanS.traineddata
HanS_vert.traineddata
HanT.traineddata
HanT_vert.traineddata
Hangul.traineddata
Hangul_vert.traineddata
Hebrew.traineddata
Japanese.traineddata
Japanese_vert.traineddata
Kannada.traineddata
Khmer.traineddata
Lao.traineddata
Latin.traineddata
Malayalam.traineddata
Myanmar.traineddata
Oriya.traineddata
Sinhala.traineddata
Syriac.traineddata
Tamil.traineddata
Telugu.traineddata
Thaana.traineddata
Thai.traineddata
Tibetan.traineddata
Vietnamese.traineddata
bre.traineddata
chi_sim_vert.traineddata
chi_tra_vert.traineddata
cos.traineddata
div.traineddata
fao.traineddata
fil.traineddata
fry.traineddata
gla.traineddata
hye.traineddata
jpn_vert.traineddata
kor_vert.traineddata
kur_ara.traineddata
ltz.traineddata
mon.traineddata
mri.traineddata
oci.traineddata
que.traineddata
snd.traineddata
sun.traineddata
tat.traineddata
ton.traineddata
yor.traineddata

stweil · 2017-08-02T04:41:44Z

It will be possible to add new characters by fine tuning!

That's great! Then I can add missing characters (like paragraph for Fraktur) myself. Thank you, Ray.

stweil · 2017-08-02T16:03:13Z

Ray, issue #65 lists two regressions for Fraktur (missing §, ß/B confusion in word list).

theraysmith · 2017-08-03T01:09:12Z

FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words. Since I spotted the edits to the deu/frk wordlists before overwriting them, I will put the deleted words in the bad_words lists, so my next run of training will not contain them. Looks like I also need to add § to the desired_characters. I have not yet committed the new wordlists, desired_characters etc, since I discovered a bug. The RTL languages have their wordlists reversed, which doesn't make sense. They should be plain text readable by someone who knows the language, and the reversal should be done before the words are converted to dawgs. I have the required change in the code already, but haven't yet run the synthetic data generation.

…

On Wed, Aug 2, 2017 at 9:03 AM, Stefan Weil ***@***.***> wrote: Ray, issue #65 <#65> lists two regressions for Fraktur (missing §, ß/B confusion in word list). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#62 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AL056QPzkmLa31xAVUTDXnVOGOAZAEWZks5sUJ3DgaJpZM4OpU_-> .

-- Ray.

stweil · 2017-08-04T09:19:28Z

The new files can be installed locally in tessdata/best and used like that: tesseract ... -l best/eng, so we can preserve the current directory structure (also when fast will be added), and there is no need to rename best/eng.traineddata to best_eng.traineddata in local installations.

I assume that older versions of Tesseract work with hierarchies of languages, too.
That offers new possibilities: the rather lengthy list of languages could be organized in folders for example for latin based languages, indic languages etc.

Of course tesseract --list-langs should be improved to search recursively for language files.

Shreeshrii · 2017-08-04T10:29:02Z

used like that: tesseract ... -l best/eng

That is great.

I was using --tessdata-dir ../../../tessdata/best

but this is much easier :-)

Shreeshrii · 2017-08-04T10:36:31Z

FYI: The wordlists are generated files, so it isn't a good idea to modify
them, as the modifications will likely get overwritten in a future training.

@theraysmith

The training wiki changes say that new traineddata can be built by providing wordlists. Here you mention that they are generated.

Can you explain, whether user provided wordlists override the ones in traineddata and how it would impact recognition.

I haven't tried training with new code yet.

PS. Hope you have seen language specific feedback provided under issues in tessdata.

amitdo · 2017-08-04T11:47:44Z

https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf
Page 8 T-LSTM Training

amitdo · 2017-12-15T17:39:46Z

http://usir.salford.ac.uk/44370/1/PID4978585.pdf
ICDAR2017 Competition on Recognition of Early Indian Printed Documents – REID2017

Shreeshrii · 2018-05-25T08:21:02Z

@theraysmith commented on Aug 3, 2017

I have the required change in the code already, but haven't yet run the synthetic data generation.

I will put the deleted words in the bad_words lists, so my next run of training will not contain them.

@theraysmith @jbreiden Can you confirm that the traineddata files in Github repo are the result of this improved training?

stweil · 2018-05-25T08:39:48Z

They aren't, because they were added in July 2017 – that is before that comment.

Shreeshrii · 2018-05-25T09:03:29Z

What about tessdata_fast?

Initial import to github (on behalf of Ray)
Jeff Breidenbach committed on Sep 15, 2017

stweil · 2018-05-25T16:44:27Z

tessdata_fast changed the LSTM model, but not the word list and other components. I just looked for B/ß confusions. While deu.traineddata looks good (no B/ß confusions), frk.traineddata contains lots of them, for example auBer instead of außer. frk.traineddata also contains lots of words which typically are not printed in Fraktur. Neither eBay nor PCMCIA are words which I would expect in old books or newspapers.

ghost · 2018-06-11T13:30:01Z

@theraysmith can you update the Langdata/ara

kmprerna · 2019-04-22T09:23:27Z

New traineddata files:
Arabic.traineddata
Armenian.traineddata
Bengali.traineddata
Canadian_Aboriginal.traineddata
Cherokee.traineddata
Cyrillic.traineddata
Devanagari.traineddata
Ethiopic.traineddata
Fraktur.traineddata
Georgian.traineddata
Greek.traineddata
Gujarati.traineddata
Gurmukhi.traineddata
HanS.traineddata
HanS_vert.traineddata
HanT.traineddata
HanT_vert.traineddata
Hangul.traineddata
Hangul_vert.traineddata
Hebrew.traineddata
Japanese.traineddata
Japanese_vert.traineddata
Kannada.traineddata
Khmer.traineddata
Lao.traineddata
Latin.traineddata
Malayalam.traineddata
Myanmar.traineddata
Oriya.traineddata
Sinhala.traineddata
Syriac.traineddata
Tamil.traineddata
Telugu.traineddata
Thaana.traineddata
Thai.traineddata
Tibetan.traineddata
Vietnamese.traineddata
bre.traineddata
chi_sim_vert.traineddata
chi_tra_vert.traineddata
cos.traineddata
div.traineddata
fao.traineddata
fil.traineddata
fry.traineddata
gla.traineddata
hye.traineddata
jpn_vert.traineddata
kor_vert.traineddata
kur_ara.traineddata
ltz.traineddata
mon.traineddata
mri.traineddata
oci.traineddata
que.traineddata
snd.traineddata
sun.traineddata
tat.traineddata
ton.traineddata
yor.traineddata

from where we can download these trained data for better aaccuracy.

Shreeshrii · 2019-04-22T09:47:26Z

https://github.com/tesseract-ocr/tessdata_best

https://github.com/tesseract-ocr/tessdata_best/tree/master/script

kmprerna · 2019-04-22T11:19:46Z

When I'm using this trained data for hindi text based image, it's taking long time to extract text and not giving 100% accurate result. So how to reduce the response time.

stweil mentioned this issue Aug 1, 2017

Best traineddata feedback - Fraktur #65

Open

amitdo mentioned this issue Aug 1, 2017

Suggest 'deva' for Devanagari tesseract-ocr/langdata#41

Closed

Shreeshrii mentioned this issue Jan 16, 2018

Fine tuning training for a mix language tessdata tesseract-ocr/tessdata_best#15

Closed

amitdo mentioned this issue May 25, 2018

Fix some wrong German words (confusion B / ß) tesseract-ocr/langdata#54

Merged

amitdo mentioned this issue May 31, 2018

Russian language #100

Open

amitdo mentioned this issue Jul 6, 2018

Geresh and Gershayim are not included tesseract-ocr/langdata#130

Open

zdenop added the question label Oct 15, 2018

amitdo mentioned this issue Oct 12, 2022

How to improve multilingual ocr outcome? tesseract-ocr/tesseract#3919

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added best traineddatas for 4.00 alpha #62

Added best traineddatas for 4.00 alpha #62

amitdo commented Aug 1, 2017

stweil commented Aug 1, 2017

amitdo commented Aug 1, 2017 •

edited

Loading

amitdo commented Aug 1, 2017 •

edited

Loading

theraysmith commented Aug 1, 2017 via email

Shreeshrii commented Aug 1, 2017 via email

amitdo commented Aug 1, 2017 •

edited

Loading

stweil commented Aug 2, 2017

stweil commented Aug 2, 2017

theraysmith commented Aug 3, 2017 via email

stweil commented Aug 4, 2017

Shreeshrii commented Aug 4, 2017

Shreeshrii commented Aug 4, 2017

amitdo commented Aug 4, 2017

amitdo commented Dec 15, 2017 •

edited

Loading

Shreeshrii commented May 25, 2018

stweil commented May 25, 2018

Shreeshrii commented May 25, 2018

stweil commented May 25, 2018

ghost commented Jun 11, 2018

kmprerna commented Apr 22, 2019

Shreeshrii commented Apr 22, 2019

kmprerna commented Apr 22, 2019

Added best traineddatas for 4.00 alpha #62

Added best traineddatas for 4.00 alpha #62

Comments

amitdo commented Aug 1, 2017

stweil commented Aug 1, 2017

amitdo commented Aug 1, 2017 • edited Loading

amitdo commented Aug 1, 2017 • edited Loading

theraysmith commented Aug 1, 2017 via email

Shreeshrii commented Aug 1, 2017 via email

amitdo commented Aug 1, 2017 • edited Loading

stweil commented Aug 2, 2017

stweil commented Aug 2, 2017

theraysmith commented Aug 3, 2017 via email

stweil commented Aug 4, 2017

Shreeshrii commented Aug 4, 2017

Shreeshrii commented Aug 4, 2017

amitdo commented Aug 4, 2017

amitdo commented Dec 15, 2017 • edited Loading

Shreeshrii commented May 25, 2018

stweil commented May 25, 2018

Shreeshrii commented May 25, 2018

stweil commented May 25, 2018

ghost commented Jun 11, 2018

kmprerna commented Apr 22, 2019

Shreeshrii commented Apr 22, 2019

kmprerna commented Apr 22, 2019

amitdo commented Aug 1, 2017 •

edited

Loading

amitdo commented Aug 1, 2017 •

edited

Loading

amitdo commented Aug 1, 2017 •

edited

Loading

amitdo commented Dec 15, 2017 •

edited

Loading