user pattern/dict does not work at all #960

Closed
wosiu opened this issue May 30, 2017 · 19 comments

Comments

@wosiu

wosiu commented May 30, 2017

User patterns do not work for me. I have tried versions 3.05.00 and 4.00.00alpha.
My file date.user-patterns contains one line:
2014-\d\d-\d\d
The picture is a single line containing a date, like: 2014-03-19
I run: tesseract img.jpg stdout --user-patterns date.user-patterns -psm 8
and the output is "mum-w", which obviously does not match the pattern.
Character whitelisting helps a bit, but the format from the pattern is not preserved and accuracy is poor.
I also tried some other examples; they do not work either.
Many people have the same problem. Links are aggregated under this one:
https://stackoverflow.com/questions/34560697/tesseract-ocr-user-patterns
See also #403.
Should we assume that this feature does not work at all? Is there any official comment on this?
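
To make the mismatch above concrete, here is a minimal Python sketch, not Tesseract code. It translates a user-pattern line into a regular expression under an assumed, ASCII-only reading of the escapes described in the dict/trie.h comment (\d digit, \a lower case, \A upper case, \c letter, \n digit or letter, \p punctuation, \* repetition of the preceding unit). The names `ESCAPES`, `user_pattern_to_regex`, and `matches_pattern` are hypothetical and exist only for this illustration; Tesseract itself works on its unicharset and only biases recognition, so this checks the intended format and nothing more.

```python
import re

# Assumed mapping of user-pattern escapes to regex classes (ASCII approximation
# of the character classes described in the dict/trie.h comment).
ESCAPES = {
    "d": "[0-9]",        # digit
    "a": "[a-z]",        # lower-case letter
    "A": "[A-Z]",        # upper-case letter
    "c": "[A-Za-z]",     # alphabetic character
    "n": "[0-9A-Za-z]",  # digit or letter
    "p": r"[^\w\s]",     # rough stand-in for punctuation
}


def user_pattern_to_regex(pattern):
    """Translate one user-pattern line into a compiled regular expression."""
    units = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "*":
                # \* repeats the preceding unit zero or more times.
                if not units:
                    raise ValueError("\\* must follow a character or class")
                units[-1] += "*"
            elif nxt in ESCAPES:
                units.append(ESCAPES[nxt])
            else:
                # Any other escaped character is taken literally.
                units.append(re.escape(nxt))
            i += 2
        else:
            units.append(re.escape(ch))
            i += 1
    return re.compile("".join(units))


def matches_pattern(pattern, text):
    """Return True if the whole text conforms to the user pattern."""
    return user_pattern_to_regex(pattern).fullmatch(text) is not None
```

With this sketch, `matches_pattern(r"2014-\d\d-\d\d", "2014-03-19")` is true and `matches_pattern(r"2014-\d\d-\d\d", "mum-w")` is false, which is the discrepancy reported here.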

@wosiu wosiu changed the title from "user pattern does not work at all" to "user pattern/dict does not work at all" May 30, 2017
@wosiu
Author

wosiu commented May 30, 2017

Same problem with user dictionary:
tesseract H3.png stdout --user-patterns date.user-patterns --psm 8 --user-words date.user-words -c language_model_penalty_non_dict_word=9999999999999999999 --oem 0
I tried different language_model_penalty_non_dict_word values with no luck.
Related: #297, which is closed, so I assume the feature doesn't work. I think it would be better for users if those flags were removed from the command line and configuration, because they are misleading as long as they don't affect the engine.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 3, 2017

Tested user-words option with 3.05.01 on windows (using binaries by @stweil)

Works ok. See attached test image.

bazaar config file as used (uses system dictionary + user words)

load_system_dawg     T
load_freq_dawg       T
user_words_suffix    user-words
user_patterns_suffix user-patterns

eng.user-words as used

Online
the
quick
brown
fox
jumped

image used for recognition

[attached image: test]

Output without user-words - notice Dnline instead of Online

tesseract test.png stdout
shared Guruvayoor Dnline Friends's post

Output with user-words - Online recognized correctly

tesseract test.png stdout  bazaar
shared Guruvayoor Online Friends's post

So Online from eng.user-words was used, when using the bazaar config file, and led to improved accuracy.
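
The setup above can be reproduced with a few lines of Python. This is only a sketch under assumptions: it assumes the config file lives in `<tessdata>/configs/bazaar` and the word list in `<tessdata>/eng.user-words` (the layout mentioned later in this thread), and the function name `write_user_words_setup` is hypothetical. It only writes the two files; it does not run Tesseract.

```python
from pathlib import Path

# Config contents copied from the bazaar file shown above.
BAZAAR_CONFIG = (
    "load_system_dawg     T\n"
    "load_freq_dawg       T\n"
    "user_words_suffix    user-words\n"
    "user_patterns_suffix user-patterns\n"
)


def write_user_words_setup(tessdata_dir, words, lang="eng", config_name="bazaar"):
    """Write the config file and the <lang>.user-words file (assumed layout)."""
    tessdata = Path(tessdata_dir)
    configs = tessdata / "configs"
    configs.mkdir(parents=True, exist_ok=True)

    config_path = configs / config_name
    config_path.write_text(BAZAAR_CONFIG, encoding="utf-8")

    # One word per line, matching the eng.user-words listing above.
    words_path = tessdata / f"{lang}.user-words"
    words_path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return config_path, words_path
```

After calling it with the six words listed above, the recognition step is the same command as before: `tesseract test.png stdout bazaar`.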

@Shreeshrii
Collaborator

I tested for user-patterns just now with versions 3.02 and 3.05.01, both for windows so that I didn't have to worry about correct versions of leptonica. The test image is attached.

The user-patterns option makes no difference to the output in either version. So, if this feature ever worked, it must have been before 3.02.

However, just by resizing the image to 200%, the dates are correctly recognized.

[attached images: date, date-small]

@Shreeshrii
Collaborator

@zdenop @amitdo @stweil Have you used user-patterns option? If so, with which version?

@stweil
Contributor

stweil commented Jun 3, 2017

No, sorry, I never used that option. Nevertheless I also have a scenario where working user patterns would help.

@amitdo
Collaborator

amitdo commented Jun 3, 2017

No, sorry, I never used that option.

Same answer.

@Shreeshrii
Collaborator

I also have a scenario where working user patterns would help.

@stweil Interesting project :-)

https://groups.google.com/forum/#!searchin/tesseract-ocr/user$20patterns$20%7Csort:date/tesseract-ocr/S9CIK3jOMWw/u7dnVDASFLgJ

The ability to use user patterns was added by Tesseract 3.01, and now has a little documentation. See the comment in dict/trie.h:

http://code.google.com/p/tesseract-ocr/source/browse/tags/release-3.01/dict/trie.h

So it broke somewhere between 3.01 and 3.02...

@zdenop
Contributor

zdenop commented Jun 4, 2017

I did not use it either.
But as far as I understand, "user patterns" just help to extend the Tesseract dictionary.
And as is known, putting a word in the dictionary does not mean Tesseract will recognize it (and, the other way around, disabling dictionaries will not cause 0% recognition). So I do not know if the feature is working at all, but I would not expect a significant effect on the result from it.

@stweil
Contributor

stweil commented Jun 4, 2017

User patterns are documented in doc/tesseract.1.asc and in dict/trie.h.

@vidiecan

With 4.0 the problem might be that the Dict class is instantiated twice

tesseract::Dict::Dict(tesseract::CCUtil * ccutil)
tesseract::Classify::Classify()
tesseract::Wordrec::Wordrec()
tesseract::Tesseract::Tesseract()
tesseract::TessBaseAPI::Init(...)

and then here

tesseract::Dict::Dict(tesseract::CCUtil * ccutil)
tesseract::LSTMRecognizer::LoadDictionary(const char * lang, tesseract::TessdataManager * mgr)
tesseract::LSTMRecognizer::Load(const char * lang, tesseract::TessdataManager * mgr)
tesseract::Tesseract::init_tesseract_lang_data(...)
tesseract::Tesseract::init_tesseract_internal(...)
tesseract::Tesseract::init_tesseract(...)
tesseract::TessBaseAPI::Init(...)

and both initialise
https://github.com/tesseract-ocr/tesseract/blob/master/dict/dict.cpp#L43

The real problem is that variables are set between these calls, so the LSTM Dict does not get the values of the user-defined variables.

@asmwarrior

asmwarrior commented Sep 28, 2017

Does this issue only happen with the command-line executable? I mean, can I work around this issue by writing a C++ source file that calls the API directly? Thanks.

@wosiu
Author

wosiu commented Jan 18, 2018

@asmwarrior Answering your question: Both command line and API are affected.
Character whitelisting works for 3.05 but does not work in LSTM mode (version 4) at all.
@vidiecan Have you tried fixing the whitelisting issue for 4.0 LSTM? Your previous comment on this sounds reasonable.

@Shreeshrii
Collaborator

Please also see comment by Ray at #403 (comment)

Don't think it has been addressed yet.

@stweil Is this something you can fix?

@Shreeshrii
Collaborator

@vidiecan you mentioned earlier that 'With 4.0 the problem might be that the Dict class is instantiated twice'.

Do you have a suggested patch to fix this issue?

@amitdo
Collaborator

amitdo commented May 2, 2018

#1127, #1128

@nusynergi

Any update on this issue?
I am running Tesseract 4.00.00 Alpha on Linux via Tess4J 3.3.1.
I am using the following Java code in Tess4J to try to use the bazaar file and, through it, the user_patterns_suffix setting:

TessAPI1.TessBaseAPIReadConfigFile(handle, tessdatafolder+"/configs/bazaar", 0)

I am sure it is finding this file because if I change the name of 'bazaar' it throws a warning saying file is not found.

The contents of the bazaar file are the standard ones:

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns

I populate the eng.user-patterns file in the tessdata folder with the standard default values and also add my own to account for values I need to capture correctly from a page:

1-\d\d\d-GOOG-411
www.\n\\\*.com
\A\A\d\d\A\A\A
ML\d\d\A\A\A
\A\A\d\d\d\d\d\d\d\d

However, I do not see any change in the results. I know it is supposed to influence the results rather than force them, but the text looks so clearly incorrect that there must be an issue.

The last time I did a build from source was around a month ago.

Any help is greatly appreciated.

@Necklaces

We tried using strace on tesseract 4.0.0-beta.4-26-gfd49 and it seems that the user-patterns and user-words files only get opened in legacy mode (using --oem 0).
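
For anyone who wants to repeat this check, here is a small Python sketch that assembles the legacy-mode command line using only flags that already appear in this thread (`--user-patterns`, `--user-words`, `--psm`, `--oem 0`). The function names `build_legacy_command` and `run_if_available` are hypothetical, and whether the pattern and word files are actually read still has to be confirmed separately, for example with strace as described above.

```python
import shutil
import subprocess


def build_legacy_command(image, patterns_file=None, words_file=None, psm=8):
    """Build a tesseract argv list that selects the legacy engine (--oem 0)."""
    cmd = ["tesseract", image, "stdout"]
    if patterns_file:
        cmd += ["--user-patterns", patterns_file]
    if words_file:
        cmd += ["--user-words", words_file]
    cmd += ["--psm", str(psm), "--oem", "0"]
    return cmd


def run_if_available(cmd):
    """Run the command and return its stdout, or None if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

Building the argument list separately from running it makes it easy to print the exact command, or to prefix it with strace, before comparing results between `--oem 0` and the LSTM engine.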

@stweil
Contributor

stweil commented Sep 14, 2018

So does this work when Tesseract 4 is used with --oem 0? Then it's not a regression (Tesseract 4 can then replace Tesseract 3), but a missing feature for LSTM mode.

@zdenop
Contributor

zdenop commented Oct 13, 2018

Closed as duplicate of #403
