Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some unittests with locale de_DE.UTF-8 #2437

Merged
merged 4 commits into from
May 16, 2019

Conversation

stweil
Copy link
Contributor

@stweil stweil commented May 15, 2019

The unittest failed with LANG=de_DE.UTF-8:

$ unittest/apiexample_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EuroText
[ RUN      ] EuroText.FastLatinOCR
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../../src/ccutil/unicharset.h, line 874

Signed-off-by: Stefan Weil sw@weilnetz.de

The unittest failed with LANG=de_DE.UTF-8:

    $ unittest/apiexample_test
    Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
    [==========] Running 4 tests from 2 test suites.
    [----------] Global test environment set-up.
    [----------] 1 test from EuroText
    [ RUN      ] EuroText.FastLatinOCR
    contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../../src/ccutil/unicharset.h, line 874

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@ghost ghost assigned stweil May 15, 2019
@ghost ghost added the review label May 15, 2019
@zdenop
Copy link
Contributor

zdenop commented May 15, 2019

This pull request introduces 1 alert when merging 0dcc889 into 4b397c7 - view on LGTM.com

new alerts:

  • 1 for FIXME comment

Comment posted by LGTM.com

@stweil
Copy link
Contributor Author

stweil commented May 16, 2019

I added the FIXME because there follows a snprintf statement which formats double or float values. The result will depend on the locale settings, so that needs a fix, too. But first I have to find a test case which triggers that code.

@@ -706,6 +709,7 @@ bool UNICHARSET::save_to_string(STRING *str) const {
this->get_script_from_script_id(this->get_script(id)),
this->get_other_case(id));
} else {
// FIXME
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should clarify what is needed to be fixed here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... by adding a comment after the FIXME

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have done that, yes. But instead of fixing the FIXME comment, I prefer to fix the code, hopefully today.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save_to_string is fixed now, too.

@@ -815,41 +819,64 @@ bool UNICHARSET::load_via_fgets(
float advance = 0.0f;
float advance_sd = 0.0f;
// TODO(eger): check that this default it ok
// after enabling BiDi iterator for Arabic+Cube.
// after enabling BiDi iterator for Arabic.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it should be Arabic+LSTM ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. The more important point is that this TODO is either still open, or the comment should be removed.

The unittest failed with LANG=de_DE.UTF-8:

    $ unittest/baseapi_test
    Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
    [==========] Running 12 tests from 2 test suites.
    [----------] Global test environment set-up.
    [----------] 10 tests from TesseractTest
    [ RUN      ] TesseractTest.ArraySizeTest
    [       OK ] TesseractTest.ArraySizeTest (0 ms)
    [ RUN      ] TesseractTest.BasicTesseractTest
    [       OK ] TesseractTest.BasicTesseractTest (1251 ms)
    [ RUN      ] TesseractTest.IteratesParagraphsEvenIfNotDetected
    [       OK ] TesseractTest.IteratesParagraphsEvenIfNotDetected (347 ms)
    [ RUN      ] TesseractTest.HOCRWorksWithoutSetInputName
    [       OK ] TesseractTest.HOCRWorksWithoutSetInputName (403 ms)
    [ RUN      ] TesseractTest.HOCRContainsBaseline
    [       OK ] TesseractTest.HOCRContainsBaseline (389 ms)
    [ RUN      ] TesseractTest.RickSnyderNotFuckSnyder
    [       OK ] TesseractTest.RickSnyderNotFuckSnyder (346 ms)
    [ RUN      ] TesseractTest.AdaptToWordStrTest
    Trying to adapt "136
    " to "1 3 6"
    Trying to adapt "256
    " to "2 5 6"
    Trying to adapt "410
    " to "4 1 0"
    Trying to adapt "432
    " to "4 3 2"
    Trying to adapt "540
    " to "5 4 0"
    Trying to adapt "692
    " to "6 9 2"
    Trying to adapt "779
    " to "7 7 9"
    Trying to adapt "793
    " to "7 9 3"
    Trying to adapt "808
    " to "8 0 8"
    Trying to adapt "815
    " to "8 1 5"
    Trying to adapt "12
    " to "1 2"
    Trying to adapt "12
    " to "1 2"
    [       OK ] TesseractTest.AdaptToWordStrTest (788 ms)
    [ RUN      ] TesseractTest.BasicLSTMTest
    [       OK ] TesseractTest.BasicLSTMTest (4525 ms)
    [ RUN      ] TesseractTest.LSTMGeometryTest
    [       OK ] TesseractTest.LSTMGeometryTest (615 ms)
    [ RUN      ] TesseractTest.InitConfigOnlyTest
    Error: unichar ? in normproto file is not in unichar set.
    Error: unichar 0.232621 in normproto file is not in unichar set.
    Error: unichar 0.000400 in normproto file is not in unichar set.
    Error: unichar 0.231864 in normproto file is not in unichar set.
    [...]
    Error: unichar ? in normproto file is not in unichar set.
    Error: unichar 0.233915 in normproto file is not in unichar set.
    Error: unichar 0.000400 in normproto file is not in unichar set.
    Error: unichar 0.221755 in normproto file is not in unichar set.
    Error: unichar 0.000400 in normproto file is not in unichar set.
    Error: unichar ? in normproto file is not in unichar set.
    baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
    baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
    [INFO]  Lang eng took 327ms in regular init
    [INFO]  Lang chi_tra took 1422ms in regular init
    Abort trap: 6

TesseractTest.InitConfigOnlyTest is fixed by using std::istringstream
instead of sscanf.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil stweil changed the title Fix apiexample_test with locale de_DE.UTF-8 Fix some unittests with locale de_DE.UTF-8 May 16, 2019
@zdenop
Copy link
Contributor

zdenop commented May 16, 2019

@stweil : is this PR ready for merge or do you plan to add something to it?

That function writes float values which must always use '.' as the
decimal separator, no matter what the current locale setting is.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@zdenop
Copy link
Contributor

zdenop commented May 16, 2019

This pull request introduces 1 alert when merging 36ed6da into 4b397c7 - view on LGTM.com

new alerts:

  • 1 for FIXME comment

Comment posted by LGTM.com

The latest code passed all unittests with locale de_DE.UTF-8
and has fixed the locale issues which were reported on GitHub.
Therefore the assertions can be removed.

Any remaining locale issue will be fixed when it is identified.
To help finding such remaining isses, debug code now uses the
user's locale settings instead of the default "C" locale for all
executables which use TessBaseAPI.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Copy link
Contributor Author

stweil commented May 16, 2019

is this PR ready for merge or do you plan to add something to it?

Now I think that it is ready for merging. The last commit is the one we were waiting for: it removes the assertions which check the locale.

@zdenop zdenop merged commit b124a5f into tesseract-ocr:master May 16, 2019
@ghost ghost removed the review label May 16, 2019
@zdenop
Copy link
Contributor

zdenop commented May 16, 2019

Thanks

@stweil stweil deleted the locale-fix branch May 16, 2019 15:08
@amitdo amitdo added the locale label Mar 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants