tesseract ... lstm.train can produce defective *.lstmf files #2741

adam-funk · 2019-10-29T13:37:46Z

tess-4c-examples.zip
The attached lstmf files contain 4 bytes of nulls and, even when mixed in with a batch of valid lstmf files, cause the lstmtraining command to fail. They were generated with a script looping through the *.tif files generated by another script (along with matching *.box files):
tesseract 20190930-125338-000000007.tif 20190930-125338-000000007 lstm.train

I admit that these examples of image files are atrocious for OCR, but I'm doing experiments trying to train tesseract to deal with text superimposed on images. I'm not surprised that it rejects some of the tif/box combinations I've produced, but I don't think it should produce defective files that crash a later stage of training.

I generated 90000 matching *.box and *.tif files; the tesseract lstm.train command produced 89676 *.lstmf files, many of which were 4 bytes long. After I ran find -name '*lstmf' -size 4c -delete, which left me with 26320 *.lstmf files, lstmtraining worked.
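For anyone who prefers to check the content as well as the size before deleting, the cleanup above can be sketched in Python. This is only a sketch: the function names and the dry_run parameter are made up for illustration, and it assumes (as in this report) that a defective file is exactly four null bytes.

```python
from pathlib import Path
from typing import List

# Assumption taken from this report: a defective file is exactly four null bytes.
DEFECTIVE_CONTENT = b"\x00" * 4


def is_defective_lstmf(path: Path) -> bool:
    """Return True if the file consists of exactly four null bytes."""
    if path.stat().st_size != len(DEFECTIVE_CONTENT):
        return False
    return path.read_bytes() == DEFECTIVE_CONTENT


def remove_defective_lstmf(root: Path, dry_run: bool = True) -> List[Path]:
    """Find defective *.lstmf files below root; delete them unless dry_run is set."""
    defective = sorted(
        p for p in root.rglob("*.lstmf") if p.is_file() and is_defective_lstmf(p)
    )
    if not dry_run:
        for path in defective:
            path.unlink()
    return defective
```

Calling remove_defective_lstmf(Path("."), dry_run=True) first only lists the candidates; passing dry_run=False deletes them. Unlike find -size 4c, this leaves alone any 4-byte file whose content is not all nulls.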

wrznr · 2019-11-01T07:55:19Z

This is a Tesseract-related issue. Please report it in the corresponding repository. (tesstrain is simply an ease-of-use wrapper for training Tesseract.)

stweil · 2019-11-01T08:00:23Z

I can transfer the issue to Tesseract, so no need to create a new issue there.

stweil · 2019-11-01T08:23:22Z

I can confirm that even the latest code produces unusable lstmf files for this example. It should either fail with an error message or produce working lstmf files.

stweil · 2019-11-01T13:32:30Z

This problem will occur with any image where Tesseract does not detect text.

adam-funk · 2019-11-01T13:58:33Z

Thanks for confirming that. I'm not surprised that it can't detect text in some of the horrible examples I'm using, and I now have a workaround (using find ... -delete to clean up before running the next command) for it. If the documentation warns users about this, I'm sorry I didn't see that. If it doesn't, it would be helpful to include it.

If Tesseract cannot find text in the input image, it should not write an empty lstmf file. This problem was reported in issue tesseract-ocr#2741. Signed-off-by: Stefan Weil <sw@weilnetz.de>

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil · 2019-11-01T20:56:28Z

Pull request #2744 fixes this issue. Tesseract no longer creates an lstmf file which just contains four 0 bytes. It now prints a message and fails for this case.
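A batch script can cope with both behaviours (a failing exit status after the fix, a 4-byte file before it). The following is a rough Python sketch, not part of Tesseract or tesstrain: the names generate_lstmf and default_runner are hypothetical, and it assumes that the command tesseract IMAGE BASE lstm.train writes its output to BASE.lstmf.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence, Tuple


def default_runner(cmd: Sequence[str]) -> int:
    """Run the command and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def generate_lstmf(
    images: Sequence[Path],
    run: Callable[[Sequence[str]], int] = default_runner,
) -> Tuple[List[Path], List[Path]]:
    """Run 'tesseract IMAGE BASE lstm.train' for each image.

    Returns (usable lstmf files, images that produced nothing usable).
    """
    usable: List[Path] = []
    failed: List[Path] = []
    for image in images:
        base = image.with_suffix("")  # e.g. page.tif -> page
        status = run(["tesseract", str(image), str(base), "lstm.train"])
        lstmf = Path(str(base) + ".lstmf")  # assumed output file name
        # Builds with the fix report failure through the exit status;
        # older builds may leave a file that holds only four null bytes.
        if status != 0 or not lstmf.is_file() or lstmf.stat().st_size <= 4:
            failed.append(image)
        else:
            usable.append(lstmf)
    return usable, failed
```

The usable list can then be written to the list file passed to lstmtraining, while the failed list shows which tif/box pairs were rejected.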

Fix issue tesseract-ocr#2741

adam-funk · 2019-11-04T13:53:21Z

Thanks!

Commit 94d0f77 tried to fix issue tesseract-ocr#2741 but created a new problem. This commit should fix both old and new issue. Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil · 2019-11-08T16:15:01Z

The previous fix created a new problem. Pull request #2751 addresses this.

stweil transferred this issue from tesseract-ocr/tesstrain Nov 1, 2019

stweil added the bug label Nov 1, 2019

stweil added this to the 5.0.0 milestone Nov 1, 2019

stweil added a commit to stweil/tesseract that referenced this issue Nov 1, 2019

Fail if no valid lstmf file was written (fix issue tesseract-ocr#2741)

a306cd7

Signed-off-by: Stefan Weil <sw@weilnetz.de>

pull bot pushed a commit to shakir-abdo/tesseract that referenced this issue Nov 2, 2019

Merge pull request tesseract-ocr#2744 from stweil/master

ceea079

Fix issue tesseract-ocr#2741

stweil added a commit to stweil/tesseract that referenced this issue Nov 8, 2019

Fix issue tesseract-ocr#2748

ac46b28

Commit 94d0f77 tried to fix issue tesseract-ocr#2741 but created a new problem. This commit should fix both old and new issue. Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil mentioned this issue Nov 8, 2019

Fix issue #2748 #2751

Merged

zdenop pushed a commit that referenced this issue Nov 11, 2019

Don't create an empty lstmf file

185d237

If Tesseract cannot find text in the input image, it should not write an empty lstmf file. This problem was reported in issue #2741. Signed-off-by: Stefan Weil <sw@weilnetz.de>

zdenop added a commit that referenced this issue Nov 11, 2019

Fail if no valid lstmf file was written (fix issue #2741)

975c626

Signed-off-by: Stefan Weil <sw@weilnetz.de> # Conflicts: # src/ccmain/linerec.cpp

zdenop pushed a commit that referenced this issue Nov 11, 2019

Fix issue #2748

eaf1f69

Commit 94d0f77 tried to fix issue #2741 but created a new problem. This commit should fix both old and new issue. Signed-off-by: Stefan Weil <sw@weilnetz.de>

amitdo added the training label May 18, 2020

amitdo closed this as completed Aug 16, 2021
