tesseract ... lstm.train can produce defective *.lstmf files #2741
Comments
This is a Tesseract-related issue. Please report it in the corresponding repository.
I can transfer the issue to Tesseract, so no need to create a new issue there.
I can confirm that even the latest code produces unusable lstmf files for this example. It should either fail with an error message or produce working lstmf files.
This problem will occur with any image where Tesseract does not detect text.
Thanks for confirming that. I'm not surprised that it can't detect text in some of the horrible examples I'm using, and I now have a workaround (using
If Tesseract cannot find text in the input image, it should not write an empty lstmf file. This problem was reported in issue tesseract-ocr#2741. Signed-off-by: Stefan Weil <sw@weilnetz.de>
Pull request #2744 fixes this issue. Tesseract no longer creates an lstmf file which just contains four zero bytes. It now prints a message and fails for this case.
Thanks!
Commit 94d0f77 tried to fix issue tesseract-ocr#2741 but created a new problem. This commit should fix both the old and the new issue. Signed-off-by: Stefan Weil <sw@weilnetz.de>
The previous fix created a new problem. Pull request #2751 addresses this.
If Tesseract cannot find text in the input image, it should not write an empty lstmf file. This problem was reported in issue #2741. Signed-off-by: Stefan Weil <sw@weilnetz.de>
Conflicts: src/ccmain/linerec.cpp
tess-4c-examples.zip
The attached lstmf files each contain 4 null bytes and, even when mixed in with a batch of valid lstmf files, cause the lstmtraining command to fail. They were generated with a script looping through the *.tif files generated by another script (along with matching *.box files):
tesseract 20190930-125338-000000007.tif 20190930-125338-000000007 lstm.train
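For anyone scripting the same loop, here is a minimal Python sketch of a wrapper that runs the command above for one image and rejects the result if it looks defective. The function name `run_lstm_train` and the injectable `run` parameter are made up for illustration. It assumes the output is written to `<base>.lstmf` next to the image and that a failing run shows up as a non-zero exit status; neither assumption is confirmed in this thread, so adjust as needed.

```python
import subprocess
from pathlib import Path

# Size of the defective files described in this issue (four zero bytes).
DEFECTIVE_SIZE = 4


def run_lstm_train(tif_path, run=subprocess.run):
    """Run `tesseract <tif> <base> lstm.train` for a single image.

    Returns the path of the generated .lstmf file, or None if the run
    failed or the output is missing or no larger than 4 bytes.
    `run` is injectable so the function can be exercised without Tesseract.
    """
    tif_path = Path(tif_path)
    base = str(tif_path.with_suffix(""))  # strip the .tif extension
    result = run(["tesseract", str(tif_path), base, "lstm.train"])

    lstmf = Path(base + ".lstmf")  # assumed output location
    if result.returncode != 0:     # assumed failure signal
        return None
    if not lstmf.is_file() or lstmf.stat().st_size <= DEFECTIVE_SIZE:
        return None
    return lstmf
```

Calling this for every `*.tif` and keeping only the non-`None` results should give a file list that is safe to pass on, whether or not the installed Tesseract already refuses to write empty files.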
I admit that these examples of image files are atrocious for OCR, but I'm doing experiments trying to train tesseract to deal with text superimposed on images. I'm not surprised that it rejects some of the tif/box combinations I've produced, but I don't think it should produce defective files that crash a later stage of training.
I generated 90000 matching *.box and *.tif files; the tesseract lstm.train command produced 89676 *.lstmf files, many of which were 4 bytes long. I then ran
find -name '*lstmf' -size 4c -delete
which left me with 26320 *.lstmf files, and after that lstmtraining worked.
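A slightly more cautious variant of that cleanup, sketched in Python: it only treats a file as defective if it is exactly 4 bytes long and those bytes are all zero, and it defaults to a dry run. The function names and the `dry_run` flag are hypothetical, and unlike the find pattern it matches `*.lstmf` (with the dot).

```python
from pathlib import Path

# The defective files reported here consist of exactly four zero bytes.
EMPTY_LSTMF = b"\x00" * 4


def find_defective_lstmf(root):
    """Return all *.lstmf files under root that are exactly four zero bytes."""
    defective = []
    for path in sorted(Path(root).rglob("*.lstmf")):
        # Check the size first so large valid files are never read.
        if path.stat().st_size == len(EMPTY_LSTMF) and path.read_bytes() == EMPTY_LSTMF:
            defective.append(path)
    return defective


def delete_defective_lstmf(root, dry_run=True):
    """List defective files under root; delete them only if dry_run is False."""
    defective = find_defective_lstmf(root)
    if not dry_run:
        for path in defective:
            path.unlink()
    return defective
```

Running `delete_defective_lstmf(".")` first shows what would be removed; passing `dry_run=False` actually deletes the files.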