trying to add user words/patterns again: #2324

Closed


bertsky
Contributor

@bertsky bertsky commented Mar 14, 2019

  • replace (shallow/copy) Dict::LoadLSTM
    with (full/original) Dict::Load
    in LSTMRecognizer::LoadDictionary
  • pass the member ParamsVectors of the
    Tesseract instance into LSTMRecognizer by:
    • extending its Load method with params ptr
    • extending its LoadDictionary likewise
    • after constructing inner CCUtil and Dict
      with default params, overwrite these
      with the true params
      (via new ParamUtils::ResetFromParams)
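
The last step of the list above (overwriting default parameters with the instance's real ones) could look roughly like the following. This is only a sketch under simplifying assumptions: `ParamsSketch` and `ResetFromParamsSketch` are hypothetical stand-ins, not Tesseract's actual `ParamsVectors` or the `ParamUtils::ResetFromParams` proposed in this PR, and every parameter is modeled as a plain string.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-in for a parameter collection:
// every parameter is just a name mapped to a string value.
struct ParamsSketch {
  std::map<std::string, std::string> values;
};

// Overwrite each parameter in `target` that also exists in `source`
// with the value from `source`. Parameters known only to `source` are
// ignored, so `target` keeps its own set of names.
// Returns the number of parameters that were overwritten.
int ResetFromParamsSketch(const ParamsSketch& source, ParamsSketch* target) {
  assert(target != nullptr);
  int overwritten = 0;
  for (auto& entry : target->values) {
    const auto it = source.values.find(entry.first);
    if (it != source.values.end()) {
      entry.second = it->second;
      ++overwritten;
    }
  }
  return overwritten;
}
```

In this model, the inner dictionary would first be constructed with defaults and then receive the outer values in a single call such as `ResetFromParamsSketch(outer_params, &inner_params)`.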

@bertsky
Contributor Author

bertsky commented Mar 14, 2019

This fixes #403 and #960, but you need to set lstm_use_matrix=1 (so that the LSTMRecognizer loads an LM) to see the effect. Also, the user patterns example does not give me those patterns exclusively, only more of them than before (but that is another story).

@bertsky
Contributor Author

bertsky commented Mar 14, 2019

What is the meaning of those strange Windows build errors?

@Shreeshrii
Collaborator

@bertsky
The unicharset and dawgs are different for legacy/base Tesseract and LSTM. See the component listing of the eng.traineddata file from the tessdata repo, which contains both sets.

$ combine_tessdata -d eng.traineddata
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
1:unicharset:size=7477, offset=192
2:unicharambigs:size=1047, offset=7669
3:inttemp:size=976552, offset=8716
4:pffmtable:size=844, offset=985268
5:normproto:size=13408, offset=986112
6:punc-dawg:size=4322, offset=999520
7:word-dawg:size=1082890, offset=1003842
8:number-dawg:size=6426, offset=2086732
9:freq-dawg:size=1410, offset=2093158
13:shapetable:size=63346, offset=2094568
14:bigram-dawg:size=16109842, offset=2157914
17:lstm:size=1487588, offset=18267756
18:lstm-punc-dawg:size=4322, offset=19755344
19:lstm-word-dawg:size=3694794, offset=19759666
20:lstm-number-dawg:size=4738, offset=23454460
21:lstm-unicharset:size=6360, offset=23459198
22:lstm-recoder:size=1012, offset=23465558
23:version:size=80, offset=23466570

Do the changes use the LSTM ones in LSTMRecognizer::LoadDictionary?
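
To make the distinction in the listing above concrete, here is a small sketch that pulls the component name out of one listing line and checks whether it carries the `lstm` prefix. Both helpers (`ComponentName`, `IsLstmComponent`) are hypothetical and not part of Tesseract, and treating the name prefix as the deciding criterion is an assumption made for illustration only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: extract the component name from a listing line
// such as "19:lstm-word-dawg:size=3694794, offset=19759666".
// Returns an empty string if the line has fewer than two ':' separators.
std::string ComponentName(const std::string& line) {
  const std::size_t first = line.find(':');
  if (first == std::string::npos) return "";
  const std::size_t second = line.find(':', first + 1);
  if (second == std::string::npos) return "";
  return line.substr(first + 1, second - first - 1);
}

// Hypothetical helper: true when the component name starts with "lstm"
// (for example "lstm-word-dawg" or "lstm-unicharset").
// Assumption: the prefix alone marks an LSTM-specific component.
bool IsLstmComponent(const std::string& name) {
  return name.compare(0, 4, "lstm") == 0;
}
```

Note that the "Version string" header line also contains colons, so a caller would have to skip it before applying `ComponentName` to the numbered entries.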

@bertsky
Contributor Author

bertsky commented Mar 15, 2019

Oh, now I see. Then probably this approach is wrong altogether.

So I guess this should only be about adding user_words_file and user_patterns_file after all (unfortunately, nobody answered this).

But the problem remains that these settings only enter the member params of Tesseract – not of LSTMRecognizer's Dict, and not GlobalParams.

So can someone at least answer if those 2 parameters are meant to be shared between pre-LSTM and LSTM processors?

@bertsky
Contributor Author

bertsky commented Mar 15, 2019

> But the problem remains that these settings only enter the member params of Tesseract – not of LSTMRecognizer's Dict, and not GlobalParams.

One way to achieve that would be to simply make them global again. But there is this statement in params.h:

// TODO(daria): remove GlobalParams() when all global Tesseract
// parameters are converted to members.

Should we still feel bound by that mission, or can global params be used for shared interests?

@Shreeshrii
Collaborator

Shreeshrii commented Mar 15, 2019

> if those 2 parameters are meant to be shared between pre-LSTM and LSTM processors?

If you mean, user_words_file and user_patterns_file, my guess would be yes.

IMO, for user_words to be useful rather than just a hint, recognition should return those user words exclusively, or there should at least be a config option to limit the results to them.

@Shreeshrii
Collaborator

// TODO(daria): remove GlobalParams() when all global Tesseract
// parameters are converted to members.

This is a comment from 8 years ago, most probably from Google's internal code.

@amitdo
Collaborator

amitdo commented Mar 15, 2019

Please don't make the parameters global.

https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes#tesseract-release-notes-oct-21-2011---v301

Tesseract release notes Oct 21 2011 - V3.01

Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads.) with the minor exception that some control parameters are still global and affect all threads.

@bertsky
Contributor Author

bertsky commented Mar 15, 2019

Thanks @amitdo for that hint! Then we will need a solution along the lines of ResetFromParams above.

Perhaps this is also a problem with the semantics of the CLI options --user-words and --user-patterns: unlike their config/traineddata counterparts, it is unclear what scope they should have (all sublangs, non-LSTM and LSTM).

@bertsky
Contributor Author

bertsky commented Mar 15, 2019

Currently they override langdata and config settings for the outermost Tesseract instance / top-level language only. If that is intentional, then my best guess would be that:

  1. each sublang should keep its own user words/patterns (with only the outermost manipulable via CLI/API) – another argument against GlobalParams btw
  2. each sublang should pass those settings to its LSTMRecognizer's dict_, should such exist
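
The two points above can be sketched as follows. This is a toy model under stated assumptions, not the mechanism eventually used: `DictSketch`, `LstmRecognizerSketch`, `LangInstanceSketch` and `PropagateUserFiles` are all hypothetical names, and the real classes hold far more state.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the dictionary inside an LSTM recognizer.
struct DictSketch {
  std::string user_words_file;
  std::string user_patterns_file;
};

// Hypothetical recognizer; `dict` stays null when no language model is loaded.
struct LstmRecognizerSketch {
  std::unique_ptr<DictSketch> dict;
};

// Hypothetical per-language instance. Each one keeps its own settings
// (point 1) and may own an LSTM recognizer and further sublanguages.
struct LangInstanceSketch {
  std::string user_words_file;
  std::string user_patterns_file;
  std::unique_ptr<LstmRecognizerSketch> lstm;  // null for legacy-only
  std::vector<std::unique_ptr<LangInstanceSketch>> sub_langs;
};

// Copy each instance's own settings into its recognizer's dictionary,
// if such a dictionary exists (point 2), then recurse into sublanguages.
// Returns the number of dictionaries that were updated.
int PropagateUserFiles(LangInstanceSketch* lang) {
  assert(lang != nullptr);
  int updated = 0;
  if (lang->lstm && lang->lstm->dict) {
    lang->lstm->dict->user_words_file = lang->user_words_file;
    lang->lstm->dict->user_patterns_file = lang->user_patterns_file;
    ++updated;
  }
  for (auto& sub : lang->sub_langs) {
    updated += PropagateUserFiles(sub.get());
  }
  return updated;
}
```

Nothing is shared between instances here, so a CLI or API override applied to the outermost instance would not leak into the sublanguages.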

I still need a good transfer mechanism, though.

@bertsky
Contributor Author

bertsky commented Mar 15, 2019

I have found a way. Cancelling in favour of #2328.

@bertsky bertsky closed this Mar 15, 2019
@bertsky bertsky deleted the lstm-with-user-patterns branch March 15, 2019 16:12