tesseract ... lstm.train can produce defective *.lstmf files #2741

Closed
adam-funk opened this issue Oct 29, 2019 · 8 comments

@adam-funk

tess-4c-examples.zip
Each of the attached lstmf files consists of just 4 null bytes. Even when mixed in with a batch of valid lstmf files, they cause the lstmtraining command to fail. They were generated by a script that loops over the *.tif files (produced, along with matching *.box files, by another script) and runs:
tesseract 20190930-125338-000000007.tif 20190930-125338-000000007 lstm.train

I admit that these examples of image files are atrocious for OCR, but I'm doing experiments trying to train tesseract to deal with text superimposed on images. I'm not surprised that it rejects some of the tif/box combinations I've produced, but I don't think it should produce defective files that crash a later stage of training.

I generated 90000 matching *.box and *.tif files; the tesseract lstm.train command produced 89676 *.lstmf files, many of which were only 4 bytes long. After I ran find -name '*lstmf' -size 4c -delete, which left me with 26320 *.lstmf files, lstmtraining worked.
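
The cleanup step above can also be sketched in Python. This is a minimal sketch, not part of the original report: the helper names `is_defective_lstmf` and `remove_defective_lstmf` are hypothetical, and it assumes a defective file is exactly four null bytes, as described in this issue. Unlike the size-only `find` test, it also checks the content, so a 4-byte file with other data is left alone.

```python
from __future__ import annotations

from pathlib import Path

# Signature of the defective files described in this issue: four null bytes.
DEFECTIVE_CONTENT = b"\x00" * 4


def is_defective_lstmf(path: Path) -> bool:
    """Return True if the file consists of exactly four null bytes."""
    # Compare sizes first so that large, valid files are never read in full.
    if path.stat().st_size != len(DEFECTIVE_CONTENT):
        return False
    return path.read_bytes() == DEFECTIVE_CONTENT


def remove_defective_lstmf(root: Path, dry_run: bool = True) -> list[Path]:
    """Find defective *.lstmf files under root; delete them unless dry_run."""
    defective = sorted(
        p for p in root.rglob("*.lstmf") if p.is_file() and is_defective_lstmf(p)
    )
    if not dry_run:
        for p in defective:
            p.unlink()
    return defective
```

Calling `remove_defective_lstmf(Path("."))` only lists the candidates; passing `dry_run=False` deletes them.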

@wrznr

wrznr commented Nov 1, 2019

This is a Tesseract-related issue. Please report it in the corresponding repository. (tesstrain is simply an ease-of-use wrapper for training Tesseract.)

@stweil
Contributor

stweil commented Nov 1, 2019

I can transfer the issue to Tesseract, so no need to create a new issue there.

@stweil stweil transferred this issue from tesseract-ocr/tesstrain Nov 1, 2019
@stweil
Contributor

stweil commented Nov 1, 2019

I can confirm that even the latest code produces unusable lstmf files for this example. It should either fail with an error message or produce working lstmf files.

@stweil stweil added the bug label Nov 1, 2019
@stweil stweil added this to the 5.0.0 milestone Nov 1, 2019
@stweil
Contributor

stweil commented Nov 1, 2019

This problem will occur with any image where Tesseract does not detect text.
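
To see which images are affected, the outputs of a batch can be grouped per image. The following is a rough sketch under stated assumptions: the function name `classify_training_outputs` is hypothetical, each `NAME.tif` is assumed to have been run with output base `NAME` in the same directory (as in the command quoted in the report), and a file of exactly 4 bytes is treated as defective.

```python
from __future__ import annotations

from pathlib import Path

# Size of the defective files reported in this issue (assumption: exactly 4 bytes).
DEFECTIVE_SIZE = 4


def classify_training_outputs(directory: Path) -> dict[str, list[str]]:
    """Group image stems by the state of their matching .lstmf file."""
    result: dict[str, list[str]] = {"usable": [], "defective": [], "missing": []}
    for tif in sorted(directory.glob("*.tif")):
        lstmf = tif.with_suffix(".lstmf")
        if not lstmf.exists():
            # No output at all was written for this image.
            result["missing"].append(tif.stem)
        elif lstmf.stat().st_size == DEFECTIVE_SIZE:
            # Output exists but has the 4-byte size of the defective files.
            result["defective"].append(tif.stem)
        else:
            result["usable"].append(tif.stem)
    return result
```

The "defective" list then names the images in which, per the comment above, no text was detected.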

@adam-funk
Author

adam-funk commented Nov 1, 2019

Thanks for confirming that. I'm not surprised that it can't detect text in some of the horrible examples I'm using, and I now have a workaround (using find ... -delete to clean up before running the next command) for it. If the documentation warns users about this, I'm sorry I didn't see that. If it doesn't, it would be helpful to include it.

stweil added a commit to stweil/tesseract that referenced this issue Nov 1, 2019
If Tesseract cannot find text in the input image, it should not write
an empty lstmf file. This problem was reported in issue tesseract-ocr#2741.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Nov 1, 2019
@stweil
Contributor

stweil commented Nov 1, 2019

Pull request #2744 fixes this issue. Tesseract no longer creates an lstmf file which just contains four 0 bytes. It now prints a message and fails for this case.
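
A generation loop can be written so that it copes with both behaviours, the old one (a 4-byte file is written) and the fixed one (the command fails). This is only a sketch: `build_command`, `run_tesseract` and `make_lstmf` are hypothetical names, the command line is the one quoted in the original report, and it is an assumption that "fails" means a non-zero exit status.

```python
from __future__ import annotations

import subprocess
from pathlib import Path
from typing import Callable, Sequence


def build_command(tif: Path) -> list[str]:
    """Build the command from the report: tesseract IMAGE OUTPUTBASE lstm.train."""
    return ["tesseract", str(tif), str(tif.with_suffix("")), "lstm.train"]


def run_tesseract(cmd: Sequence[str]) -> int:
    """Run the command and return its exit status."""
    return subprocess.run(cmd, check=False).returncode


def make_lstmf(
    tif: Path, runner: Callable[[Sequence[str]], int] = run_tesseract
) -> Path | None:
    """Return the path of a usable .lstmf file, or None if none was produced."""
    lstmf = tif.with_suffix(".lstmf")
    if runner(build_command(tif)) != 0:
        # Assumed behaviour after the fix: the command reports failure.
        return None
    if not lstmf.exists() or lstmf.stat().st_size == 4:
        # Behaviour before the fix: a 4-byte file may have been written.
        lstmf.unlink(missing_ok=True)
        return None
    return lstmf
```

The `runner` parameter exists only so the logic can be exercised without Tesseract installed; in normal use `make_lstmf(Path("image.tif"))` calls the real command.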

pull bot pushed a commit to shakir-abdo/tesseract that referenced this issue Nov 2, 2019
@adam-funk
Author

Thanks!

stweil added a commit to stweil/tesseract that referenced this issue Nov 8, 2019
Commit 94d0f77 tried to fix issue tesseract-ocr#2741
but created a new problem.

This commit should fix both old and new issue.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil stweil mentioned this issue Nov 8, 2019
@stweil
Contributor

stweil commented Nov 8, 2019

The previous fix created a new problem. Pull request #2751 addresses this.

zdenop pushed a commit that referenced this issue Nov 11, 2019
If Tesseract cannot find text in the input image, it should not write
an empty lstmf file. This problem was reported in issue #2741.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
zdenop added a commit that referenced this issue Nov 11, 2019
Signed-off-by: Stefan Weil <sw@weilnetz.de>

# Conflicts:
#	src/ccmain/linerec.cpp
zdenop pushed a commit that referenced this issue Nov 11, 2019
Commit 94d0f77 tried to fix issue #2741
but created a new problem.

This commit should fix both old and new issue.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@amitdo amitdo closed this as completed Aug 16, 2021