HOCR output always sets textangle 180 and omits baseline info if Tesseract is compiled with --disable-legacy #4010

Closed
robertknight opened this issue Jan 30, 2023 · 2 comments

robertknight commented Jan 30, 2023

Basic Information

tesseract 5.3.0-19-ga3b9ac, compiled with --disable-legacy

Operating System

macOS 13 Ventura

Compiler

clang 14.0

Current Behavior

When Tesseract is compiled with --disable-legacy, hOCR output reports each line as being upside-down (textangle 180) and omits baseline information.

Steps to reproduce:

./configure --disable-legacy
./tesseract some-image.jpg output hocr

In the generated output.hocr file, ocr_line entries look like this:

<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; textangle 180; x_size 34; x_descenders 8; x_ascenders 9">

Expected Behavior

If orientation information isn't available I'd expect the image to always be treated as if it were page-up. So entries should look like this:

<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; baseline 0 -8; x_size 34; x_descenders 8; x_ascenders 9">

Suggested Fix

Internally, it looks like the issue is that:

  1. ColumnFinder::text_rotation_ is initialized to a null vector. When the legacy engine is disabled, the ColumnFinder::CorrectOrientation function does not get called, and so this vector remains null.
  2. This null vector gets propagated to PageIterator::Orientation, which does not handle this case correctly: the check `if (up_in_image.y() > 0.0F) {` fails for a null vector, so the orientation falls through to ORIENTATION_PAGE_DOWN.
  3. The hOCR renderer then maps this orientation value to textangle 180 and omits baseline info.

Two fixes I tested locally were to change the initialization of ColumnFinder::text_rotation_ to match the norotation value in ColumnFinder::CorrectOrientation, or to change the logic in PageIterator::Orientation to handle null rotation vectors by mapping them to ORIENTATION_PAGE_UP. I'm happy to submit a PR, but I'm not sure which approach is preferred.
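
For illustration, the failure mode and the second workaround can be reproduced in isolation. This is only a sketch under assumptions: `Vec2`, `Rotate`, `OrientationUnchecked` and `OrientationGuarded` are hypothetical names, and a single rotation vector stands in for whatever rotations Tesseract actually applies inside PageIterator::Orientation.

```cpp
// Simplified stand-in for the orientation logic described above.
// All names here are hypothetical; this is not Tesseract's API.
enum class Orientation { PageUp, PageRight, PageDown, PageLeft };

struct Vec2 {
  float x;
  float y;
};

// Rotate v by a rotation stored as (cos, sin) in rot.
// A null rot collapses every input to (0, 0).
Vec2 Rotate(Vec2 v, Vec2 rot) {
  return {v.x * rot.x - v.y * rot.y, v.y * rot.x + v.x * rot.y};
}

// Same shape as the check quoted in step 2: when x == 0, anything
// that does not have y > 0 falls through to PageDown.
Orientation OrientationUnchecked(Vec2 rotation) {
  Vec2 up = Rotate({0.0F, 1.0F}, rotation);
  if (up.x == 0.0F) {
    return up.y > 0.0F ? Orientation::PageUp : Orientation::PageDown;
  }
  return up.x > 0.0F ? Orientation::PageRight : Orientation::PageLeft;
}

// One possible guard: treat a null rotation as "no rotation".
Orientation OrientationGuarded(Vec2 rotation) {
  if (rotation.x == 0.0F && rotation.y == 0.0F) {
    return Orientation::PageUp;
  }
  return OrientationUnchecked(rotation);
}
```

With a null rotation the unchecked version reports the page as upside-down, while the guarded version reports it as upright; a genuine 180 degree rotation such as `{-1.0F, 0.0F}` is still reported as upside-down by both.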

robertknight added a commit to robertknight/tesseract-wasm that referenced this issue Jan 30, 2023
hOCR output was missing `baseline` information for `ocr_line` entries and
incorrectly reporting every line as being upside-down. This was happening due to
tesseract-ocr/tesseract#4010. Work around this issue
by handling missing rotation information better in `PageIterator::Orientation`,
by assuming the page is facing up.

amitdo (Collaborator) commented Mar 26, 2023

#3997 seems related to this issue.

amitdo (Collaborator) commented Mar 26, 2023

> When the legacy engine is disabled, the ColumnFinder::CorrectOrientation function does not get called

#ifndef DISABLED_LEGACY_ENGINE
if (equ_detect_) {
  equ_detect_->LabelSpecialText(to_block);
}
BLOBNBOX_CLIST osd_blobs;
// osd_orientation is the number of 90 degree rotations to make the
// characters upright. (See tesseract/osdetect.h for precise definition.)
// We want the text lines horizontal, (vertical text indicates vertical
// textlines) which may conflict (eg vertically written CJK).
int osd_orientation = 0;
bool vertical_text =
    textord_tabfind_force_vertical_text || pageseg_mode == PSM_SINGLE_BLOCK_VERT_TEXT;
if (!vertical_text && textord_tabfind_vertical_text && PSM_ORIENTATION_ENABLED(pageseg_mode)) {
  vertical_text = finder->IsVerticallyAlignedText(textord_tabfind_vertical_text_ratio, to_block,
                                                  &osd_blobs);
}
if (PSM_OSD_ENABLED(pageseg_mode) && osd_tess != nullptr && osr != nullptr) {
  std::vector<int> osd_scripts;
  if (osd_tess != this) {
    // We are running osd as part of layout analysis, so constrain the
    // scripts to those allowed by *this.
    AddAllScriptsConverted(unicharset, osd_tess->unicharset, &osd_scripts);
    for (auto &lang : sub_langs_) {
      AddAllScriptsConverted(lang->unicharset, osd_tess->unicharset, &osd_scripts);
    }
  }
  os_detect_blobs(&osd_scripts, &osd_blobs, osr, osd_tess);
  if (pageseg_mode == PSM_OSD_ONLY) {
    delete finder;
    return nullptr;
  }
  osd_orientation = osr->best_result.orientation_id;
  double osd_score = osr->orientations[osd_orientation];
  double osd_margin = min_orientation_margin * 2;
  for (int i = 0; i < 4; ++i) {
    if (i != osd_orientation && osd_score - osr->orientations[i] < osd_margin) {
      osd_margin = osd_score - osr->orientations[i];
    }
  }
  int best_script_id = osr->best_result.script_id;
  const char *best_script_str = osd_tess->unicharset.get_script_from_script_id(best_script_id);
  bool cjk = best_script_id == osd_tess->unicharset.han_sid() ||
             best_script_id == osd_tess->unicharset.hiragana_sid() ||
             best_script_id == osd_tess->unicharset.katakana_sid() ||
             strcmp("Japanese", best_script_str) == 0 ||
             strcmp("Korean", best_script_str) == 0 || strcmp("Hangul", best_script_str) == 0;
  if (cjk) {
    finder->set_cjk_script(true);
  }
  if (osd_margin < min_orientation_margin) {
    // The margin is weak.
    if (!cjk && !vertical_text && osd_orientation == 2) {
      // upside down latin text is improbable with such a weak margin.
      tprintf(
          "OSD: Weak margin (%.2f), horiz textlines, not CJK: "
          "Don't rotate.\n",
          osd_margin);
      osd_orientation = 0;
    } else {
      tprintf(
          "OSD: Weak margin (%.2f) for %d blob text block, "
          "but using orientation anyway: %d\n",
          osd_margin, osd_blobs.length(), osd_orientation);
    }
  }
}
osd_blobs.shallow_clear();
finder->CorrectOrientation(to_block, vertical_text, osd_orientation);
#endif // ndef DISABLED_LEGACY_ENGINE

I think we can fix the issue by enabling some parts of the code in this block instead of disabling the whole block of code when the legacy engine is disabled.
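
As a rough sketch of that idea (not the actual commit), the guard could be narrowed so that only the detection step depends on the legacy engine while the orientation correction always runs. Everything below is a hypothetical stand-in: `FakeFinder`, `DetectOrientationWithLegacyEngine` and `SetupOrientation` are invented for illustration and do not exist in Tesseract.

```cpp
// Hypothetical stand-ins; none of these names are Tesseract's real API.
struct FakeFinder {
  bool corrected = false;
  int orientation = -1;
  void CorrectOrientation(bool /*vertical_text*/, int osd_orientation) {
    corrected = true;
    orientation = osd_orientation;
  }
};

#ifndef DISABLED_LEGACY_ENGINE
// Placeholder for the legacy-only orientation and script detection.
int DetectOrientationWithLegacyEngine() { return 0; }
#endif

void SetupOrientation(FakeFinder &finder, bool vertical_text) {
  int osd_orientation = 0;  // default: characters are already upright
#ifndef DISABLED_LEGACY_ENGINE
  // Only the detection step is compiled out with the legacy engine.
  osd_orientation = DetectOrientationWithLegacyEngine();
#endif
  // Runs in both build configurations, so the rotation state is always set.
  finder.CorrectOrientation(vertical_text, osd_orientation);
}
```

Compiled with or without `-DDISABLED_LEGACY_ENGINE`, the stand-in `CorrectOrientation` is called, which is the property the issue says is missing from the current code.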

amitdo added a commit to amitdo/tesseract that referenced this issue Mar 28, 2023
Enable some code blocks that were wrongly disabled when the legacy engine is disabled at compile time.
amitdo added a commit that referenced this issue Mar 28, 2023
Enable some code blocks that were wrongly disabled when the legacy engine is disabled at compile time.
amitdo closed this as completed Mar 28, 2023