Character confusion fix suggestion #3144

Open
EucliTs0 opened this issue Oct 30, 2020 · 44 comments

Comments

@EucliTs0

Environment

Hello,
We use Tesseract heavily in our platform, and the issue we hit most often is the following:
For example, if the image contains the sequence "2032BA065", the output is "2032BA0O65".
This happens with other characters too, for example B -> B8 and 5 -> 5S. After some investigation and debugging, we came up with a fix that corrects all such cases (at least in our dataset).

It happens at two very close time steps (t, t+1) on the characters. The confidence probabilities of the two confused characters are very close to each other at time steps t and t+1, whereas for unambiguous characters the confidence is close to 1.0 at each time step. Unfortunately, Tesseract doesn't filter out this kind of duplication between confused characters. To fix this, let P(t) and P(t+1) be the probabilities of the recognized characters at consecutive time steps t and t+1, respectively.

D(t+1) = P(t+1) / (P(t) + P(t+1)),
where D(t+1) defines the confusion metric: if D(t+1) < threshold, then we stop and ignore the confused character.
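
As a minimal sketch of the metric as written above (this is not the Tesseract patch itself), assume `p_prev` and `p_curr` stand for P(t) and P(t+1). The function names and the default threshold parameter are hypothetical and only serve to illustrate the arithmetic:

```cpp
// D(t+1) = P(t+1) / (P(t) + P(t+1)).
// Hypothetical helper names; nothing here is part of Tesseract's API.
inline float ConfusionRatio(float p_prev, float p_curr) {
  const float sum = p_prev + p_curr;
  // Guard against division by zero when both probabilities vanish.
  return sum > 0.0f ? p_curr / sum : 0.0f;
}

// True if the character at step t+1 would be ignored as a confused duplicate.
// The default of 0.88 is the experimentally chosen value from this issue.
inline bool IsConfusedDuplicate(float p_prev, float p_curr,
                                float threshold = 0.88f) {
  return ConfusionRatio(p_prev, p_curr) < threshold;
}
```

For instance, P(t) = 0.5 and P(t+1) = 0.45 give D of roughly 0.47, which is below 0.88, so the second character would be ignored. With P(t) = 0.01 and P(t+1) = 0.99, D is about 0.99 and the character is kept.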

In src/lstm/recodebeam.cpp, between lines 907 and 908, we add:

Suggested Fix:

if (prev != nullptr and code > 0 and code != 139 and prev->code !=139 and prev->code > 0)
      {
        const float sum_proba_prev_current = std::max(outputs[code], outputs[prev->code]) + std::min(outputs[code], outputs[prev->code]);

        const float ratio_scores = outputs[code] / sum_proba_prev_current;
        if (ratio_scores < 0.88f) break;
      }

The threshold 0.88 was set experimentally, but I hope this helps address the issue in upcoming versions and that it generalizes well.

Unfortunately, I cannot provide any documents because we work on sensitive data.

Thank you.

@stweil
Contributor

stweil commented Oct 30, 2020

Do you want to send a pull request with the suggested fix?

@stweil
Contributor

stweil commented Oct 30, 2020

Why do you check code > 0 and code != 139?

@stweil
Contributor

stweil commented Oct 30, 2020

Related issues: #884, #1011, #1060, #1063, #1362, #1465, #2738.

@EucliTs0
Author

Do you want to send a pull request with the suggested fix?

I could create a PR, yes, but the threshold might not be universal.

@EucliTs0
Author

Why do you check code > 0 and code != 139?

Just want to avoid empty space and null char

@stweil
Contributor

stweil commented Oct 30, 2020

Would code != null_char_ also work instead of code > 0? Where does this magic number 139 (empty space?) come from?

@stweil
Contributor

stweil commented Oct 30, 2020

I could create a PR yes, but the threshold might not be universal.

Which other values beside 0.88 did you test? Would, for example, 0.75 or 0.9 also work fine?

@EucliTs0
Author

I could create a PR yes, but the threshold might not be universal.

Which other values beside 0.88 did you test? Would, for example, 0.75 or 0.9 also work fine?

Yes, we tested other values too, from 0.7 to 0.9, and found that 0.88 behaves best.

@EucliTs0
Author

Would code != null_char_ also work instead of code > 0? Where does this magic number 139 (empty space?) come from?

In our case, code = 0 corresponds to empty (or space).
I printed the debug output for part of a string, and we get label=0 between characters:

DECODED CHARACTER LSTM 4: 4, label=63
DECODED CHARACTER LSTM 5:  , label=0
DECODED CHARACTER LSTM 6: A, label=1

For us, 139 is the null char.
Does the null_char_ variable always have the same code mapping?

@amitdo
Collaborator

amitdo commented Oct 30, 2020

I believe it will be a different number in other traineddata files.
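
A sketch of the same guard without hard-coded numbers, assuming the null and space codes of the loaded model are passed in. Inside `RecodeBeamSearch` the null code is available as the `null_char_` member; the helper name and the separate `space_code` parameter are hypothetical:

```cpp
// Returns true if both codes are real characters, i.e. neither is the null
// code nor the space code of the currently loaded model.
// Hypothetical helper for illustration; not part of Tesseract.
inline bool BothAreRealChars(int prev_code, int code,
                             int null_code, int space_code) {
  const auto is_real = [null_code, space_code](int c) {
    return c != null_code && c != space_code;
  };
  return is_real(prev_code) && is_real(code);
}
```

With the values reported above (null = 139, space = 0), `BothAreRealChars(1, 63, 139, 0)` is true, while the same call for a model whose null code differs only needs a different argument rather than a code change.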

@stweil
Contributor

stweil commented Oct 30, 2020

That's why I was asking.

@stweil
Contributor

stweil commented Oct 31, 2020

@EucliTs0, which language(s) / script(s) did you use in your tests? Did you use fast or best traineddata?

I have just run a test on the TIFF files from test/testing and used this conditional:

      if (prev != nullptr && code != null_char_ && prev->code != null_char_) {

This fixed several confusions, all similar to this one:

-“I’'ve never forgotten that mo-
+“I've never forgotten that mo-

I would have expected “I’ve never forgotten that mo-.

Internally Tesseract has two preferred choices, with ' ranking lower than ’:

    <span class='ocrx_cinfo' id='choice_1_119_13' title='x_confs 75.604965'>’</span>
    <span class='ocrx_cinfo' id='choice_1_119_14' title='x_confs 74.249809'>&#39;</span>

So the new code picked the wrong choice.

@amitdo
Collaborator

amitdo commented Oct 31, 2020

@EucliTs0
Author

EucliTs0 commented Nov 2, 2020

@EucliTs0, which language(s) / script(s) did you use in your tests? Did you use fast or best traineddata?


We use the best traineddata, French language.

@EucliTs0
Author

EucliTs0 commented Nov 2, 2020

https://en.wikipedia.org/wiki/Apostrophe

So, both apostrophes should be considered as OK in tesseract's output, right?

@stweil
Contributor

stweil commented Nov 2, 2020

' is not wrong, but ’ is better and also detected in other lines without any confusion.

If there is a confusion with two alternatives of similar confidence, I'd normally take the one with higher confidence, even if it is only slightly higher (unless there are other rules like for example a dictionary which suggest to take the second alternative).
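
The rule described above can be sketched as follows, assuming a hypothetical `Choice` struct that holds a candidate and its confidence (this is not a Tesseract type). Dictionary rules and other overrides are deliberately left out:

```cpp
#include <string>

// Hypothetical container for one alternative and its confidence.
struct Choice {
  std::string text;
  float confidence;
};

// Keeps the alternative with the higher confidence, even if the margin is
// small. On a tie the first alternative wins.
inline Choice PickMoreConfident(const Choice& a, const Choice& b) {
  return b.confidence > a.confidence ? b : a;
}
```

Applied to the two choices from the hOCR output earlier in this thread (75.604965 for ’ and 74.249809 for '), this would select ’.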

@EucliTs0
Author

EucliTs0 commented Nov 3, 2020

Just to clarify, the suggested fix removes one of the confused characters, but the one that remains is not necessarily the correct one (as in the example with the apostrophe).

One question: could you please point me to the exact code block where the null_char_ mapping happens? Thanks.

@amitdo
Collaborator

amitdo commented Nov 3, 2020

One question: could you please point me to the exact code block where the null_char_ mapping happens? Thanks.

if (!fp->DeSerialize(&null_char_)) return false;

@mb0

mb0 commented Feb 22, 2021

I hope it is OK for me to chime in and point out that this issue has affected many users for some years now. Even if the proposed fix does not choose the best candidate, it is still very much an improvement over the current situation. Could someone experienced in C++ and Tesseract please open a pull request to get the process started and the change reviewed?

@TheSeiko

TheSeiko commented Apr 16, 2021

@stweil related to your question.
"TheSeiko, do you have example images which still show this issue? We need them to test a bug fix which was suggested in #3144".

I've already posted some images to #1060. Now I've collected more images with double characters. I'm posting them below.
I've marked the double characters bold.

All are tested with
C:\Tesseract-OCR20201127>tesseract --version
tesseract v5.0.0-alpha.20201127
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0

on Windows 10 64bit

example call:
C://Tesseract-OCR20201127/tesseract D:\var\ocrvideoreader\images\tmp\20210416211033138_1618341916852_bottom.png stdout --dpi 400 --oem 1 --psm 6 -l deu+lat

@TheSeiko

US-Paläontologen haben eine
Tyrannosaurus rex-,Zhlung" gemacht.
20210416210716551_1618580096723_bottom

@TheSeiko

Online-Vortragsreihe

Beginn ist um 9.30 Uhr.
Die Teilnahme ist
kostenlos. Anmeldungen
sind per E-Mail an:
frauenbuero@magq.linz.at
erforderlich.
20210416210859703_1618393387569_main

@TheSeiko

US-Präsident Biden schlug Kremilchef Putin einen
Gipfel zur Deeskalation in einem Drittland vor.
20210416211033138_1618341916852_bottom

@TheSeiko

Österreich

In einem derzeitigen Gesetzesentwurf
werden Razzien im Behördenbereich beinahe
verunmöjglicht.

Nach einem Treffen mit Experten ist
Justizministerin Zadic bereit, entsprechende
Änderungen am Entwurf vorzunehmen.
20210416211335774_1618273347138_main

@TheSeiko

Shaquille ONeal
Sportskanone auf der
Suche nach neuem
Team!

Unser „Shagq“ ist sehr
menschenbezogen,
intelligent und brav.
20210416211528904_1617921093632_main

@TheSeiko

Service

Im April auf
www.ibkinfo.at:
Innsbruck zu Fuf$ und am
Radl erkunden sowie
Neues zum Rad-
Masterplan.
20210416211658610_1617575639408_right

@TheSeiko

Fußball
OFB-Legionáar Philipp Lienhart trifft beim
2:0-Sieg von Freiburg gegen Augsburg.
20210416211825294_1616479218850_bottom

@TheSeiko

Politik .
Die SPO kritisiert das ,,|chaotische" Corona-
Management der Regierung scharf.
20210416212930684_1595777062447_bottom

@TheSeiko

Kurzfilmfestival

Eine hochkarätige Aus-
wahl meist dystopischer
Filme, zusammengestellt
von ProgrammerlInnen
aus Cannes, Locarno,
Sarajevo und mehr.
20210416213215294_1585967118329_right

@TheSeiko

Smartphone
Huawei stellt sein neues Smartphone
PA40 Pro vor.
20210416213456817_1585732285462_bottom

@TheSeiko

Wien
Die Eröffnung der „MQ Libelle**"^** wird
auf den 25. August verschoben.
20210416213618799_1585651941180_bottom

@TheSeiko

Ungarn/Üsterreich
Lebenslang für die vier Hauptangeklagten
nach dem A^4-Flüchtlingsdrama.
20210416214054404_1561026929372_bottom

@TheSeiko

Auf Galaxy S10 folgt S20

Samsung sortiert seine Galaxy-S-Serie
offenbar komplett neu. Das behauptet der
Tech-Blog „SsamMobile“. Demnach wird das
neue Smartphone nicht Galaxy S11, sondern
Galaxy S20 heißen. Womóglich möchte sich
Samsung vom iPhone 11 abgrenzen.
20210416214837543_1577892112949_main

@EucliTs0
Author

EucliTs0 commented Apr 17, 2021

From the results above, the character confusion is not fixed, right? Do you also have cases where it is fixed? Just to mention again, the fix addresses the duplication, but it does not guarantee that you get the correct character, although most of the time you do.

@TheSeiko

TheSeiko commented Apr 23, 2021

@EucliTs0 I've only extracted images where one character becomes two characters. I didn't keep an exact list of where the result was different before. But yes, there were some images that produced two characters before and returned only one with the latest version.

@EucliTs0
Author

@TheSeiko Perhaps in your case you need to modify the threshold.

@woodjohndavid

Hi EucliTs0:

We have been experiencing the same behavior as you, with extra characters showing up in the Tesseract output stream. I am experimenting with the most recent master branch code, and I think that the line numbers in the source may be somewhat different from the version you are working with. So could you please do me the favor of providing the method name where you are putting your fix, and attaching the full recodebeam.cpp file so I can find it and try it out myself?

Thanks,

Dave

@EucliTs0
Author

EucliTs0 commented Jun 2, 2021

@woodjohndavid

Hello @woodjohndavid,

We use the latest stable version, Tesseract 4.1.1 (https://github.com/tesseract-ocr/tesseract/tree/4.1.1). We added this block inside void RecodeBeamSearch::ContinueContext in src/lstm/recodebeam.cpp.

I cannot attach the .cpp file because that file type is not supported here, so I will add it as plain text.

recodebeam.odt

@bertsky
Contributor

bertsky commented Jun 2, 2021

We use the latest stable version, Tesseract 4.1.1 (https://github.com/tesseract-ocr/tesseract/tree/4.1.1). We added this block inside void RecodeBeamSearch::ContinueContext in src/lstm/recodebeam.cpp.

I cannot attach the .cpp file because that file type is not supported here, so I will add it as plain text.

recodebeam.odt

@EucliTs0 thank you for trying to make Tesseract better!

Since AFAICT no one is working on this long-standing issue, any hint to track down the actual cause is welcome. But please use Github facilities (or at least a diff/patch) for sharing next time!

Here's your change in a reusable way:

diff --git a/src/lstm/recodebeam.cpp b/src/lstm/recodebeam.cpp
index 1c840569..bb34cd7a 100644
--- a/src/lstm/recodebeam.cpp
+++ b/src/lstm/recodebeam.cpp
@@ -615,6 +615,14 @@ void RecodeBeamSearch::ContinueContext(const RecodeNode* prev, int index,
       if (prev != nullptr && prev->code == code && !is_simple_text_) continue;
       float cert = NetworkIO::ProbToCertainty(outputs[code]) + cert_offset;
       if (cert < kMinCertainty && code != null_char_) continue;
+
+      if (prev != nullptr and code > 0 and code != 139 and prev->code !=139 and prev->code > 0)
+      {
+        const float sum_proba_prev_current = std::max(outputs[code], outputs[prev->code]) + std::min(outputs[code], outputs[prev->code]);
+        const float ratio_scores = outputs[code] / sum_proba_prev_current;
+        if (ratio_scores < 0.88f) break;
+      }
+
       full_code.Set(length, code);
       int unichar_id = recoder_.DecodeUnichar(full_code);
       // Map the null char to INVALID.

I have not tried it yet, but (in addition to @stweil's comments), a few problems stand out:

  • What do you take the max(a, b) + min(a, b) for? What's that other than an obscurism for a + b?
  • Why do you simply break out of the character hypotheses loop, instead of just continuing with valid choices? This could easily hide any good hypotheses further in the charset.
  • Foremost, why do you take the current timestep's probability outputs at the previous timestep's hypothesis prev->code in the beam? That's a totally different thing from what you described above. Your description says you want to relate the probability at step t to that of step t+1, which is clearly not the case here. (Not that I understand why you wanted to do that. But if what you do here does help even a little, we might get closer to understanding the problem.)
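
To illustrate the second point, here is a toy sketch (not Tesseract code) of a loop over candidate codes. The function name and the `rejected` flags are hypothetical; they stand in for whatever filter is applied per candidate:

```cpp
#include <vector>

// Toy model of a loop over candidate codes. rejected[i] marks candidates
// that fail some filter. With stop_at_first_reject the loop behaves like the
// `break` in the patch; otherwise it behaves like `continue`.
inline std::vector<int> AcceptedCodes(const std::vector<bool>& rejected,
                                      bool stop_at_first_reject) {
  std::vector<int> accepted;
  for (int code = 0; code < static_cast<int>(rejected.size()); ++code) {
    if (rejected[code]) {
      if (stop_at_first_reject) break;  // drops every later candidate too
      continue;                         // skips only this candidate
    }
    accepted.push_back(code);
  }
  return accepted;
}
```

With candidates 0, 1, 2 where only candidate 1 is rejected, the `break` variant accepts just code 0, while the `continue` variant accepts codes 0 and 2.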

@woodjohndavid

Hi EucliTs0:

Thanks for the information. That will help me try out your fix in the context of the latest master version and see how it goes. I will report back on this thread with my results and any suggestions I might come up with.

Regards,

Dave

@EucliTs0
Author

EucliTs0 commented Jun 3, 2021

Hi @bertsky

  • For your first comment, I think just a + b would be sufficient.
  • We break out because at that point we have found a duplication, and we just want to ignore the duplicated character. That does not necessarily mean that we ignore the 'bad' one rather than the 'good' one. If you continue instead, I think you will end up keeping some of these duplications (we tried it and saw that in many cases the issue was not resolved).
  • For your last comment, we can consider the current outputs as t+1 and the previous ones as t.

@woodjohndavid

Hi EucliTs0:

Please see my latest post in #3477.

If you like, you can try the solution I have proposed and see if it works in your situation. I did try out the fix that you have used, but it didn't work consistently in our case. I guess it depends on the specific mix of characters that are encountered.

@woodjohndavid

I have just created pull request #4211, which I consider to be an improved solution for diplopia.

I encourage everyone on this thread to try it out and test it with as broad a range of cases as possible.

Note that there are some new configuration values that, as things stand, can only be set in code. These configuration values are:

bool kRemoveDiplopia - if true, enables diplopia removal functionality. If false, my changes have no effect
int kMaxDiplopiaGap - maximum number of timesteps apart to be considered diplopia, default 2

Obviously if my diplopia change is of value, then these configuration items should be made into settings.
