Commit

Merge pull request tesseract-ocr#111 from tesseract-ocr/ryanfb-update-lat

Update Latin langdata
zdenop committed Feb 21, 2018
2 parents 3e88c57 + dfcd1bd commit d3b1a0d
Showing 11 changed files with 1,109,104 additions and 499,712 deletions.
4 changes: 0 additions & 4 deletions lat/desired_characters

This file was deleted.

25 changes: 25 additions & 0 deletions lat/lat.config
@@ -0,0 +1,25 @@
# Tesseract Latin training - http://ryanfb.github.io/latinocr/
# Build from the https://github.com/ryanfb/latinocr-lat/ repository
# commit: b6885bca0fa755fbed2bbb36d3f5cebf866a15e0

# New segsearch produces better results
enable_new_segsearch 1

# Increase penalty for incorrect punctuation, important as
# diacritics can easily be misrecognised as punctuation
language_model_penalty_punc 0.35

# Increase minimum linesize. This minimises cases of accents
# being incorrectly recognised as separate lines.
textord_min_linesize 2.25

# Also helps to ensure that accents aren't incorrectly recognised
# as separate lines
textord_occupancy_threshold 0.7

# Helps to ensure rows don't overlap
textord_excess_blobsize 0.6

# Disable rare, variant, macron characters
# (can be enabled with tessedit_char_unblacklist)
tessedit_char_blacklist ĀāĒēĪīŌōŪū
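
The config above is a plain list of `name value` pairs with `#` comments. As a rough illustration only (not part of this commit), the sketch below parses such a file in Python and builds a `-c` override that re-enables some macron characters, as the last comment in the file suggests. The helper names `parse_tesseract_config` and `unblacklist_args` are hypothetical, and the sample text is a shortened copy of the file above. The `-c name=value` flag is Tesseract's usual way to set a variable on the command line, but how `tessedit_char_unblacklist` interacts with the blacklist in a given Tesseract version is an assumption here.

```python
def parse_tesseract_config(text: str) -> dict:
    """Parse 'name value' lines, skipping blanks and '#' comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace: name, then the rest as value.
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        params[name] = value
    return params


def unblacklist_args(chars: str) -> list:
    """Hypothetical helper: build a '-c' override for the given characters."""
    return ["-c", "tessedit_char_unblacklist=" + chars]


# Shortened copy of lat/lat.config from the diff above.
LAT_CONFIG = """\
# New segsearch produces better results
enable_new_segsearch 1

language_model_penalty_punc 0.35
textord_min_linesize 2.25
textord_occupancy_threshold 0.7
textord_excess_blobsize 0.6

# Disable rare, variant, macron characters
tessedit_char_blacklist ĀāĒēĪīŌōŪū
"""

params = parse_tesseract_config(LAT_CONFIG)

# Assumed invocation shape: tesseract <image> <outputbase> -l lat -c name=value
# "page.png" and "out" are placeholder file names.
command = ["tesseract", "page.png", "out", "-l", "lat"] + unblacklist_args("Āā")

if __name__ == "__main__":
    for name, value in params.items():
        print(name, "=", value)
    print(" ".join(command))
```

The command is only assembled as a list, not executed, so the sketch runs without Tesseract installed.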
