
LSTM Engine Diplopia Issue and Inaccurate HOCR Character Level Box Dimensions #3477

Open
woodjohndavid opened this issue Jun 29, 2021 · 22 comments


@woodjohndavid

Environment: Tesseract Latest Master from GitHub, Ubuntu 20.04.2

User References: @bertsky @stweil

Background

The problem named Diplopia (courtesy of @bertsky) consists of more than one character appearing in the LSTM output character stream for the same physical area of the original image.

I encountered this issue early on in my use of Tesseract, and reported it in the earlier thread #2738. It has also been reported by many others. I then attempted to implement a workaround outside of the Tesseract code itself, using the HOCR output format character level box dimensions to try to identify overlapping characters. This was unsuccessful because, as it turns out, the character level box dimensions are inaccurate for LSTM generally, and are in fact guaranteed to be inaccurate when diplopia occurs.

So I then downloaded the latest Tesseract Master code and embarked on an expedition to try to understand how it works and see if I could come up with a fix for diplopia. The rest of this post documents the key results of my investigation.

Initial Diplopia Fix

I have just now created Pull Request #3476 which I hope is an adequate fix for most diplopia cases. See the PR for more details.

This fix generally follows the current style of the RecodeBeamSearch, which attempts to assemble the character level output stream from the lower level LSTM NetworkIO matrix output. This matrix output delivers a set of entries for each timestep in the LSTM process, each entry consisting of a potential matching character and a likelihood score (key) in the range from 0.0 to 1.0.

There is nothing in the current matrix output that identifies the physical location of the possible match in the source image. Consequently, my fix attempts to identify possible diplopia by looking for two matrix output entries in a given timestep which have what could be called a 'meaningful' score, that is, a score that is high enough to indicate it is likely a 'real' match. If two such entries are found in the same timestep, then the fix tries to prevent any beam from subsequently containing both.
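The heuristic described above can be sketched as follows. This is a minimal illustration, not the actual PR code: the threshold value, the data layout, and the use of `None` for the CTC null are all assumptions made for the sketch.

```python
# Hypothetical sketch of the diplopia heuristic: the LSTM emits
# (character, score) candidates per timestep; if two distinct real
# characters both score above a "meaningful" threshold in the same
# timestep, they are competing for the same physical area.
MEANINGFUL_SCORE = 0.25  # illustrative threshold, not Tesseract's actual value

def find_competing_pairs(timesteps):
    """timesteps: list of lists of (char, score) tuples, one list per timestep.
    Returns (timestep index, characters) for timesteps with >= 2 strong matches."""
    conflicts = []
    for t, entries in enumerate(timesteps):
        strong = [ch for ch, score in entries
                  if ch is not None and score >= MEANINGFUL_SCORE]  # None = CTC null
        if len(strong) >= 2:
            conflicts.append((t, strong))
    return conflicts

steps = [
    [("a", 0.6), (None, 0.3)],               # one clear match
    [("m", 0.5), ("n", 0.4), ("r", 0.05)],   # two strong candidates -> potential diplopia
]
print(find_competing_pairs(steps))  # [(1, ['m', 'n'])]
```

A fix along these lines would then prevent any beam from containing both characters of a flagged pair.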

Inaccurate LSTM HOCR Character Level Box Dimensions

I had originally tried to use the HOCR dimensions as a workaround to fix diplopia, but found them inaccurate. I then pursued the diplopia fix above separately from this issue, but I have looked at how these dimensions are created and am of the opinion that the current implementation cannot ever be successful. What it does now consists of three sequential stages:

  1. In the initial image segmenting (before either legacy or LSTM engine is called for recognition) the image is divided into words and blobs within words. I assume (but have not verified) that for the legacy engine each blob is handled individually to try to identify which single character it may match. For the LSTM engine, the blobs are re-assembled back into a word level image, and then handed to the engine for recognition. In both cases, the original blob dimensions are saved for later.
  2. During its recognition process, the LSTM engine processes the word level image in a series of so-called timesteps. These in fact traverse the image from left to right, a fixed number of pixels per timestep. During the RecodeBeamSearch processing that assembles the output character stream, there is a process which attempts to calculate the character dimensions from the known overall dimensions of the word image plus the known number of timesteps. At the currently defined timestep size, the granularity is too coarse, so this process can never be accurate. I have experimented with reducing the size of the timesteps, which does result in some improvement in character box accuracy, but at the expense of extra processing time and reduced recognition accuracy. Perhaps retraining would be necessary if the timestep size were reduced.
  3. The final stage is the assembly of the HOCR output. If the legacy engine was used, then the original blob dimensions from step 1 are used, which is as good as it can get in that case. If the LSTM engine was used, then there is code which tries to decide whether the original blob dimensions or the calculated ones from step 2 are 'better'. In general, if the number of characters found in a word by LSTM is the same as the original blob count, then it uses the blob dimensions and ignores the calculated ones. When diplopia occurs, this mechanism is completely unsuccessful.
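To illustrate the granularity problem in stage 2, here is a rough sketch (a hypothetical helper, not Tesseract code) of the best a timestep-based box estimate can deliver:

```python
# With only the word image's width and the timestep count available, a
# character spanning timesteps [t0, t1) can at best be located to the
# nearest timestep-sized stripe of pixels.
def char_box_x(word_left, word_width, n_timesteps, t0, t1):
    """Estimate the (left, right) pixel x-range of a character from timesteps."""
    px_per_step = word_width / n_timesteps
    return (round(word_left + t0 * px_per_step),
            round(word_left + t1 * px_per_step))

# A 400 px wide word over 50 timesteps gives 8 px granularity: any error in
# the character's true edges smaller than 8 px is invisible to this method.
print(char_box_x(100, 400, 50, 3, 7))  # (124, 156)
```

Shrinking the timestep raises the resolution of this estimate, which matches the observation above that smaller timesteps improve box accuracy at the cost of speed and recognition quality.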

Long Term Solution to LSTM Diplopia and Character Box Dimensions

So, as it turns out, these issues are in fact related, or at least their solutions are. What both of them really need is the precise physical image location of the character match being attempted. If the character box dimensions were accurate, then diplopia could be solved either during the RecodeBeamSearch or after the LSTM engine has finished. It would have to be determined how much physical overlap constitutes diplopia, but that could be a simple configuration setting.

As I see it, therefore, the LSTM matrix processing using the NetworkIO interface needs to add to its return values (in addition to the possible character and the likelihood score) the starting pixel location of the possible match and the horizontal size of the potential match image from the training data. Once that is done, the rest should be relatively straightforward.
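If accurate per-character boxes were available, the configurable overlap test could look something like this sketch (the threshold value and the box representation are purely illustrative):

```python
# Flag a pair of character boxes as diplopia when their horizontal overlap
# exceeds a configurable fraction of the narrower box.
def is_diplopia(box_a, box_b, min_overlap=0.5):
    """box = (left, right) in pixels. Returns True if the overlap is large
    enough that both boxes likely describe the same physical character."""
    left = max(box_a[0], box_b[0])
    right = min(box_a[1], box_b[1])
    overlap = max(0, right - left)
    narrower = min(box_a[1] - box_a[0], box_b[1] - box_b[0])
    return overlap / narrower >= min_overlap

print(is_diplopia((10, 50), (15, 55)))  # True: 35 px shared of a 40 px box
print(is_diplopia((10, 50), (48, 90)))  # False: only 2 px of overlap
```

The `min_overlap` fraction would be the easy configuration setting mentioned above.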

Having said that, I have spent a fair bit of time trying to understand the matrix operations, but so far have not figured out how to accomplish the above suggestion. It MUST be the case that somewhere down in there that location information can be retrieved, and I intend to keep looking. But if anybody can give me some hints, it would be appreciated.

@wollmers

wollmers commented Jul 2, 2021

It should be possible to filter cases of diplopia (for testing) if ground truth is available:

  • inserted characters
  • overlapping bounding boxes

This is a typical one (punctuation at end of line):

similarity 0.925925925925926 < 0.99 : ONB_aze_18950706_1.jpg_tl_119.gt.txt
$grt_line: möglich iſt, daß die „Oppoſition“ bei der Vertrauens⸗
$ocr_line: möglich iſt, dab "Die „Oppoſition“ bei Der Vertrauens⸗—

Or difficult shape/separation:

H 18 0 52 96 0
M 18 0 97 92 0
A 102 4 176 94 0
R 181 7 244 96 0
P 242 6 288 96 0
O 289 7 343 94 0
C 349 7 406 89 0
R 420 0 481 96 0
A 415 5 546 91 0
T 531 9 621 96 0
E 628 7 692 96 0
S 701 7 762 95 0
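For the ground-truth filtering suggested above, overlapping boxes in Tesseract's makebox output (format: char left bottom right top page) can be detected with a small script like this sketch (the tolerance value is an assumption):

```python
# Parse makebox-format lines and report consecutive character pairs whose
# boxes overlap horizontally by more than a small tolerance.
def overlapping_boxes(box_text, tolerance=2):
    boxes = []
    for line in box_text.strip().splitlines():
        ch, left, bottom, right, top, page = line.split()
        boxes.append((ch, int(left), int(right)))
    pairs = []
    for (c1, l1, r1), (c2, l2, r2) in zip(boxes, boxes[1:]):
        if l2 < r1 - tolerance:  # next box starts well before current one ends
            pairs.append((c1, c2))
    return pairs

# The R/A/T boxes from the example above overlap noticeably:
sample = """R 420 0 481 96 0
A 415 5 546 91 0
T 531 9 621 96 0"""
print(overlapping_boxes(sample))  # [('R', 'A'), ('A', 'T')]
```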

[image]

@nagadomi

I have just now created Pull Request #3476 which I hope is an adequate fix for most diplopia cases

What is the status of this pull request?

In my personal opinion, as far as I have tried, this change is worthwhile: it has no adverse effects and fixes many of the diplopia issues.
However, I am concerned about the impact on the training process and about changing the default behavior.
So I think it would be better to add a configuration variable that is off by default, and use it only if people want to use it (if this change is unacceptable in its current state).

Also,
I recently trained a jpn_vert model with my own rendered training data, and the diplopia issue has become less frequent.
I think the difference between the training data and real-world data is also a cause of the diplopia issue, since text2image generates binarized images and the rendered font is bolder than typical printing (for Japanese text).

@user123-source

user123-source commented Nov 24, 2021

The problem seems to come from characters that are joined, but the right coordinate seems to be more reliable. In my situation I just look at the right coordinate to correct the left coordinate.
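That workaround can be sketched like this (a hypothetical helper; it assumes boxes arrive in reading order and that the right coordinates are trustworthy, as described above):

```python
# Trust the right coordinates and clamp each box's left coordinate so it
# does not start before the previous character ends.
def fix_left_coords(boxes):
    """boxes: list of (left, right) in reading order; returns corrected list."""
    fixed = []
    prev_right = None
    for left, right in boxes:
        if prev_right is not None and left < prev_right:
            left = prev_right
        fixed.append((left, right))
        prev_right = right
    return fixed

print(fix_left_coords([(0, 30), (20, 60), (58, 90)]))
# [(0, 30), (30, 60), (60, 90)]
```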

@exander77

I am not sure if this is related, but my symbol boxes are completely off in a lot of cases. I wanted to implement bold font recognition by measuring the pixel density inside a symbol box for each letter, but when symbol boxes are completely off, this is a no-go.
[image]
On the left is the source image that was sent to Tesseract, with line, word, and symbol boxes. On the right is my own thresholded image with symbol boxes.
[image]
This is a zoomed-out version, so you can see how the boxes are all over the place.

The boxes are sometimes gigantic, contain more than one character (so one character is inside multiple boxes, and they overlap), or, in one instance, a box spans vertically over two lines. This looks to me like some kind of bug, because some boxes are really crazy.
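The bold-detection idea can be sketched as follows (a toy example on a binary image stored as a 2D list; real use would require the tight, correct symbol boxes that are missing here):

```python
# Measure the fraction of "ink" pixels (1 = ink) inside a symbol box.
# Bold glyphs have a noticeably higher ink density than regular ones,
# but only if the box fits the glyph tightly.
def ink_density(image, box):
    """box = (left, top, right, bottom), exclusive right/bottom."""
    l, t, r, b = box
    ink = sum(image[y][x] for y in range(t, b) for x in range(l, r))
    return ink / ((r - l) * (b - t))

img = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
]
print(ink_density(img, (0, 0, 2, 2)))  # 1.0  (heavy, bold-like region)
print(ink_density(img, (2, 0, 4, 2)))  # 0.25 (light region)
```

A box that is too large dilutes the density toward zero, and an overlapping box mixes two glyphs, which is why inaccurate boxes make this approach a no-go.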

@exander77

exander77 commented May 8, 2022

With the pull request above: #3476

[image]

Some boxes still have a weird height for some reason (they span above and below the letter), but the result is roughly 10x better. I support merging this pull request.

@exander77

exander77 commented May 8, 2022

I have filtered out the boxes that are mostly correct (some correct boxes with ý etc. are there as well). It seems like diplopia causes the box of one of the symbols to be vertically broken (space above and below the letter).

[image]

@woodjohndavid Does this give you enough info to improve it, or should I look into your pull request myself?

Below is a clean test image (Czech language; process with -l ces).
[test image: tmp2]

@exander77

Adding a version with lines for completeness.

[image]

@exander77

exander77 commented May 8, 2022

Hmm, now I have built plain main without the patch; it may have been improved by other patches, and that pull request actually does nothing for me.

@exander77

Can anybody point me to the code that causes these diplopia cases?

@exander77

@woodjohndavid Can you point me to the code where the LSTM engine returns its values?

@woodjohndavid

@exander77 Glad you are interested in addressing this issue. I would encourage you to re-read the opening entry of this thread, which, among other things, explains my understanding of why the bounding boxes are inaccurate. This is related to the diplopia issue in that, I believe, the ultimate fix for both lies in the same area.

The code you are looking for is found in recodebeam.cpp. Method ComputeTopN is the start where the LSTM engine incoming results are first processed.

@exander77

@woodjohndavid Yes, I am interested. I can't get the hang of how the results are obtained from the network. Where is the info about the position and width of the character available? I see that each output contains a number of floats, which are passed to ComputeTopN. I have no idea what is going on there.

@wollmers

@exander77

Interesting example. The text recognition with CTC/LSTM seems very accurate and has no diplopia with your sample. It's a character bounding box problem (with some influence from a not-so-perfect training model for ces).

I tried to apply my script, hacked together for #3599 and #3787. Find the results here: https://github.com/wollmers/ocr-bbox-gt/tree/main/data/issue_3477.

Read #3599 (comment) for an explanation how it works.

What I didn't take into account are the 3 different fonts (bold, regular, italic) in your example. If we want to measure statistically the "best" width of a character, we must do this per font. This is a chicken-and-egg problem: we need correct bounding boxes to identify the font, but we also need the font identified to get correct bounding boxes. Also, even with --oem 0, not all bounding boxes are correct. An improved post-correction would take the bounding boxes from --oem 0, calculate as many character features as possible, and then identify the font. Per font, the "best" metrics per character can be learned and outliers corrected. Once the metrics of a character are known, outliers (height, width) can be detected.

I still have a solution for font classification, which needs good-quality bounding boxes to measure features like width, height, aspect ratio, density, vertical position, ascender, and descender. Thanks to your input I got new ideas for improvements.
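The per-character metric learning suggested above could be prototyped like this (a sketch: the relative tolerance is an assumption, and ideally the observations would first be grouped per font, as noted):

```python
# Collect the widths observed for each character, then flag boxes whose
# width deviates too far from that character's median.
from statistics import median

def width_outliers(observations, rel_tol=0.4):
    """observations: list of (char, width). Returns indices of outliers."""
    by_char = {}
    for ch, w in observations:
        by_char.setdefault(ch, []).append(w)
    med = {ch: median(ws) for ch, ws in by_char.items()}
    return [i for i, (ch, w) in enumerate(observations)
            if abs(w - med[ch]) > rel_tol * med[ch]]

obs = [("a", 20), ("a", 22), ("a", 21), ("a", 60), ("m", 35)]
print(width_outliers(obs))  # [3] -- the 60 px wide "a" is suspicious
```

The same pattern extends to height, aspect ratio, and density once reliable boxes exist to measure them from.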

@exander77

@wollmers Yes, diplopia is not an issue for me; each character appears only once in the output stream. The problem is that a character wrongly appears in two bounding boxes, or the bounding boxes are generally inaccurate.

Also, I think there actually is a recognition problem as well.

[image]

Word: vodikm)
Word: jch Word: zontu)

The word vodíkových is split into vodikm) and jch, and I am not really sure why. It seems like part of the v got merged into the o and recognized as m? How did that happen? The italic o there is clearly readable and easily recognizable.

The word iontů followed by ) is mangled into zontu), most likely because the accents were cropped before recognition? It is easy to confuse i and z when you crop the accent over i. The accent is also cropped from ů, creating u.

Also, the : in bazický: is not part of the word box. This happens inside ComputeWordBounds.

Interestingly, what I wanted to do was identify bold text, and I was unable to do so because the bounding boxes are not correct.

If there is a solution for font classification, I would have a use for it.

I am now compiling with the PRs you mentioned (and the diplopia one as well) to see how that behaves.

@exander77

exander77 commented May 10, 2022

With the 3 PRs:

Word: hydroxylových
Word: iontů

So the recognition is actually significantly improved with these PRs.

Strangely, now the whole ý: is outside of bazický:

[image]

@exander77

exander77 commented May 10, 2022

Also, the symbol <| on the right side of the page is now correctly separated.

@wollmers

@exander77

Which version are you using? With Tesseract 5.1.0 on Intel Mac I get:

$ tesseract 3477.jpg 3477.psm6 -l ces  --psm 6 \
	--tessdata-dir  /usr/local/share/tessdata makebox hocr txt pdf

$ cat 3477.psm6.txt
Aljoša, Aljoška v. Alexej

alka, -v ž. (2. mn. alek) severský mořský pták S
z příbuzenstva racků; z001. rod Alca a Plautus:
a. malá; a. velká ,

alkajský příd. liter. a. verš řecký n. latinský
verš nazvaný podle básníka Alkaia; a-á strofa
složená z těchto veršů

alkalický příd. (z arab. zákl.) chem. jevicí vlast.
nosti zásad; zásaditý, bazický: a. roztok
mající menší koncentraci vodíkových iontů
(větší koncentraci hydroxylových iontů) než

$ diff 3477.psm6.gt.txt 3477.psm6.txt
4c4
< z příbuzenstva racků; zool. rod Alca a Plautus:
---
> z příbuzenstva racků; z001. rod Alca a Plautus:
11c11
< alkalický příd. (z arab. zákl.) chem. jevící vlast-
---
> alkalický příd. (z arab. zákl.) chem. jevicí vlast.

# 5 character errors (without the left pointing triangle symbol):

Word mismatches:
"jevící"     1
  "jevicí"     1 (1.0000)
"vlast-¶"     1
  "vlast.¶"     1 (1.0000)
"zool."     1
  "z001."     1 (1.0000)

With --oem 0 (legacy method)

$ tesseract 3477.jpg 3477.oem0.psm6.ces -l t5data/ces --oem 0 --psm 6 \
	--tessdata-dir  /usr/local/share/tessdata makebox hocr txt pdf

I get nice bounding boxes (a few misinterpretations like speckles, a split háček, and overlaps due to italics/kerning):

[box overlay image: 3477.oem0.psm6.ces]

@exander77

@wollmers I built the head of master (f36c0d019be59cae3b96da0d89d870dbe83e9714) that I checked out a few days ago.

With legacy:
[image]

$  diff 3477.psm6.txt 3477.oem0.psm6.ces.txt 
1c1
< Aljoša, AÍioška v. Aléxej
---
> Alio's'a, AÍioška v. Alexej
3,5c3,5
< alka, -v ž. (2. mn. alék) severský mořský pták <
< z přibuzenstva racků; zool. rod Alca a Plautus:
< a. malá; a. velká P
---
> alka, -_v ž. (2. mn. alek) severský mořský ptákd
> z příbuzenstva racků; mol. rod Alca a Plautus:
> &. malá; a. velká, .
7,8c7,8
< alkajský příd. liter. a. verš řecký n. latinský
< verš nazvaný podle básníka Alkaia; a-á strofa
---
> alkaiský přid. liter. &. verš řecký n. latinský
> verš nazvaný podle básnika Alkaia; a-a'. strofa
11,14c11,14
< alkalický příd. (z arab. zákl.) chem. jevící vlast-
< nosti zásad; zásaditý, bazický: a. roztok
< mající menší koncentraci vodikových iontů
< (větší: koncentraci hydroxylových wontů) než
---
> alkalický přid. (z arab. Em.) chem. jevíci vlast-
> nosti zásad; zásaditý, bazický: &. roztok
> mající menší koncentraci vodíkových iontů
> (větši koncentraci hydroxylových iqntů) než

@exander77

tmp2.jpg.zip

Attached as a zip in case the image gets altered.

@wollmers

With the 3 PRs:

Word: hydroxylových Word: iontů

So the recognition is actually significantly improved with these PRs.

There are still the typical bbox errors:

[screenshot, 2022-05-10]

My observation is that the number of boxes is correct (each recognised character has a box), but many boxes have wrong positions and/or widths.

@exander77

@wollmers Yes, it is by no means perfect. Legacy is far superior to this, except for the : that is outside the word without a box.

@p12tic

p12tic commented May 15, 2022

FYI I've updated #3787 with some bug fixes that I found since the initial implementation.
