Fix some unittests with locale de_DE.UTF-8 #2437

stweil · 2019-05-15T20:49:53Z

The unittest failed with LANG=de_DE.UTF-8:

$ unittest/apiexample_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EuroText
[ RUN      ] EuroText.FastLatinOCR
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../../src/ccutil/unicharset.h, line 874

Signed-off-by: Stefan Weil <sw@weilnetz.de>


zdenop · 2019-05-15T21:29:16Z

This pull request introduces 1 alert when merging 0dcc889 into 4b397c7 - view on LGTM.com

new alerts:

1 for FIXME comment

Comment posted by LGTM.com

stweil · 2019-05-16T04:50:31Z

I added the FIXME because there follows a snprintf statement which formats double or float values. The result will depend on the locale settings, so that needs a fix, too. But first I have to find a test case which triggers that code.
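To illustrate the concern: `snprintf` honours the `LC_NUMERIC` category of the global C locale, so with de_DE.UTF-8 active a `%f` or `%g` conversion writes `0,5` rather than `0.5`. A minimal sketch of a locale-independent alternative follows. The helper name `format_float_classic` is hypothetical and not a function in Tesseract; it formats through a stream imbued with the classic "C" locale, which may or may not be how the PR ends up fixing `save_to_string`.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Tesseract): formats a value with '.' as
// the decimal separator regardless of the global locale.
std::string format_float_classic(double value) {
  std::ostringstream stream;
  stream.imbue(std::locale::classic());  // force the "C" locale for this stream
  stream << value;                       // default formatting, precision 6
  return stream.str();
}
```

Because the locale is attached to the stream rather than changed globally, other threads and other library users keep whatever locale they selected.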

amitdo · 2019-05-16T05:48:54Z

src/ccutil/unicharset.cpp

@@ -706,6 +709,7 @@ bool UNICHARSET::save_to_string(STRING *str) const {
              this->get_script_from_script_id(this->get_script(id)),
              this->get_other_case(id));
    } else {
+      // FIXME


You should clarify what needs to be fixed here

... by adding a comment after the FIXME

I should have done that, yes. But instead of fixing the FIXME comment, I prefer to fix the code, hopefully today.

save_to_string is fixed now, too.

amitdo · 2019-05-16T05:51:00Z

src/ccutil/unicharset.cpp

@@ -815,41 +819,64 @@ bool UNICHARSET::load_via_fgets(
    float advance = 0.0f;
    float advance_sd = 0.0f;
    // TODO(eger): check that this default it ok
-    // after enabling BiDi iterator for Arabic+Cube.
+    // after enabling BiDi iterator for Arabic.


Maybe it should be Arabic+LSTM ?

Maybe. The more important point is that this TODO is either still open, or the comment should be removed.

The unittest failed with LANG=de_DE.UTF-8:

$ unittest/baseapi_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 12 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 10 tests from TesseractTest
[ RUN      ] TesseractTest.ArraySizeTest
[       OK ] TesseractTest.ArraySizeTest (0 ms)
[ RUN      ] TesseractTest.BasicTesseractTest
[       OK ] TesseractTest.BasicTesseractTest (1251 ms)
[ RUN      ] TesseractTest.IteratesParagraphsEvenIfNotDetected
[       OK ] TesseractTest.IteratesParagraphsEvenIfNotDetected (347 ms)
[ RUN      ] TesseractTest.HOCRWorksWithoutSetInputName
[       OK ] TesseractTest.HOCRWorksWithoutSetInputName (403 ms)
[ RUN      ] TesseractTest.HOCRContainsBaseline
[       OK ] TesseractTest.HOCRContainsBaseline (389 ms)
[ RUN      ] TesseractTest.RickSnyderNotFuckSnyder
[       OK ] TesseractTest.RickSnyderNotFuckSnyder (346 ms)
[ RUN      ] TesseractTest.AdaptToWordStrTest
Trying to adapt "136 " to "1 3 6"
Trying to adapt "256 " to "2 5 6"
Trying to adapt "410 " to "4 1 0"
Trying to adapt "432 " to "4 3 2"
Trying to adapt "540 " to "5 4 0"
Trying to adapt "692 " to "6 9 2"
Trying to adapt "779 " to "7 7 9"
Trying to adapt "793 " to "7 9 3"
Trying to adapt "808 " to "8 0 8"
Trying to adapt "815 " to "8 1 5"
Trying to adapt "12 " to "1 2"
Trying to adapt "12 " to "1 2"
[       OK ] TesseractTest.AdaptToWordStrTest (788 ms)
[ RUN      ] TesseractTest.BasicLSTMTest
[       OK ] TesseractTest.BasicLSTMTest (4525 ms)
[ RUN      ] TesseractTest.LSTMGeometryTest
[       OK ] TesseractTest.LSTMGeometryTest (615 ms)
[ RUN      ] TesseractTest.InitConfigOnlyTest
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.232621 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.231864 in normproto file is not in unichar set.
[...]
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.233915 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.221755 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar ? in normproto file is not in unichar set.
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
[INFO] Lang eng took 327ms in regular init
[INFO] Lang chi_tra took 1422ms in regular init
Abort trap: 6

TesseractTest.InitConfigOnlyTest is fixed by using std::istringstream instead of sscanf.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
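The errors in the log are consistent with how `sscanf` behaves under a locale whose decimal separator is ',': a `%f` conversion stops at the '.', so the remainder of the number is left for the next field and later tokens are read out of position. The commit message names `std::istringstream` as the replacement. A minimal sketch of that technique is shown below; the helper `parse_name_and_value` and its two-field input format are assumptions for illustration and are not taken from the Tesseract sources.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper (not Tesseract code): parses "<name> <value>" from one
// line, always treating '.' as the decimal separator.
bool parse_name_and_value(const std::string& line, std::string* name,
                          float* value) {
  std::istringstream stream(line);
  stream.imbue(std::locale::classic());  // ignore the global locale
  stream >> *name >> *value;
  return !stream.fail();  // false if either field could not be read
}
```

The stream carries its own locale, so the result does not depend on `LANG` or on any earlier `setlocale` call made by the application.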

zdenop · 2019-05-16T09:32:42Z

@stweil : is this PR ready for merge or do you plan to add something to it?

That function writes float values which must always use '.' as the decimal separator, no matter what the current locale setting is.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

zdenop · 2019-05-16T09:40:46Z

This pull request introduces 1 alert when merging 36ed6da into 4b397c7 - view on LGTM.com

new alerts:

1 for FIXME comment

Comment posted by LGTM.com

The latest code passed all unittests with locale de_DE.UTF-8 and has fixed the locale issues which were reported on GitHub. Therefore the assertions can be removed. Any remaining locale issue will be fixed when it is identified.

To help find such remaining issues, debug code now uses the user's locale settings instead of the default "C" locale for all executables which use TessBaseAPI.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
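The commit message does not show how debug builds pick up the user's locale. One way to do it in standard C++ is sketched here, assuming a hypothetical helper `use_user_locale` that is not part of Tesseract. Constructing `std::locale("")` selects the locale named by the environment and throws `std::runtime_error` if that locale is not installed.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper (not Tesseract code): switches the global C++ locale to
// the one named by the user's environment (LANG, LC_ALL, ...).
// Returns false and leaves the global locale unchanged if that locale is not
// available on the system.
bool use_user_locale() {
  try {
    std::locale::global(std::locale(""));
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Running the tests with such a call enabled exposes any remaining code path that formats or parses numbers through the global locale instead of an explicitly imbued stream.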

stweil · 2019-05-16T12:09:23Z

is this PR ready for merge or do you plan to add something to it?

Now I think that it is ready for merging. The last commit is the one we were waiting for: it removes the assertions which check the locale.

zdenop · 2019-05-16T15:02:21Z

Thanks

ghost assigned stweil May 15, 2019

ghost added the review label May 15, 2019

stweil mentioned this pull request May 16, 2019

tesseract failed loading non-english language.traineddata #1250

Closed

amitdo reviewed May 16, 2019


stweil force-pushed the locale-fix branch from 5839d44 to 36ed6da May 16, 2019 09:05

stweil changed the title ~~Fix apiexample_test with locale de_DE.UTF-8~~ Fix some unittests with locale de_DE.UTF-8 May 16, 2019

Fix UNICHARSET::save_to_string for locale de_DE.UTF-8

77f9bad


zdenop merged commit b124a5f into tesseract-ocr:master May 16, 2019

ghost removed the review label May 16, 2019

stweil deleted the locale-fix branch May 16, 2019 15:08

amitdo added the locale label Mar 21, 2021
