Commit

Document some more config options for tesseract
Clarify also the name(s) of the generated OCR result file(s):
Tesseract does not create a file named outbase.txt by default.

Fix also a sentence in the language section.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil committed Oct 5, 2018
1 parent e03ee93 commit 383dcf7
Showing 1 changed file with 17 additions and 4 deletions.
21 changes: 17 additions & 4 deletions doc/tesseract.1.asc
@@ -34,7 +34,9 @@ IN/OUT ARGUMENTS

'outputbase'::
The basename of the output file (to which the appropriate extension
-will be appended). By default the output will be named 'outbase.txt'.
+will be appended). By default the output will be a text file
+with `.txt` added to the basename unless there are one or more
+'configfile' options which explicitly specify the desired output.

'stdout'::
Instruction to sent output data to standard output
@@ -88,8 +90,19 @@ OPTIONS
contains a list of variables and their values, one per line, with a
space separating variable from value. Interesting config files
include: +
-* hocr - Output in hOCR format instead of as a text file.
-* pdf - Output in pdf instead of a text file.
+* `hocr` - Output in hOCR format (file extension `.hocr`).
+* `pdf` - Output PDF (file extension `.pdf`).
+* `tsv` - Output TSV (file extension `.tsv`).
+* `txt` - Output plain text (file extension `.txt`).
+* `get.images` - Write images.
+* `logfile` - Write debug file `tesseract.log`.
+* `lstm.train` - Used for LSTM training.
+* `makebox` - Output box file.
+* `quiet` - Write debug file to /dev/null.
+
+It is possible to select several config files, for example
+`tesseract image.png demo hocr pdf txt` will create three output files
+`demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.

*Nota Bene:* The options `-l lang` and `--psm N` must occur
before any 'configfile'.
@@ -122,7 +135,7 @@ LANGUAGES

The currently available traineddata files for tesseract 4.0
for the following languages are in
-(in https://github.com/tesseract-ocr/tessdata_fast):
+https://github.com/tesseract-ocr/tessdata_fast:

*afr* (Afrikaans),
*amh* (Amharic),
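The output-naming and option-ordering rules documented in this commit can be illustrated with a small Python sketch. It is not part of the commit or of Tesseract itself: the helper names `build_command` and `expected_outputs` are hypothetical, and the extension table only covers the four output config files whose extensions the diff states explicitly (`hocr`, `pdf`, `tsv`, `txt`). Other config files such as `makebox` may also affect which files are written; that is deliberately left out here.

```python
# Extensions copied from the config file list in the diff above.
# Only the four configs with a documented extension are modelled.
OUTPUT_EXTENSIONS = {
    "hocr": ".hocr",
    "pdf": ".pdf",
    "tsv": ".tsv",
    "txt": ".txt",
}


def build_command(image, outputbase, configs=(), lang=None, psm=None):
    """Build a tesseract argument list (hypothetical helper).

    Follows the Nota Bene in the man page: `-l lang` and `--psm N`
    are placed before any config file name.
    """
    cmd = ["tesseract", image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    cmd += list(configs)  # config files always come last
    return cmd


def expected_outputs(outputbase, configs=()):
    """Predict output file names (simplified assumption, see lead-in).

    If no config file selects an output format, the documented default
    is a text file with `.txt` appended to the basename.
    """
    extensions = [OUTPUT_EXTENSIONS[c] for c in configs if c in OUTPUT_EXTENSIONS]
    if not extensions:
        extensions = [".txt"]
    return [outputbase + ext for ext in extensions]


if __name__ == "__main__":
    configs = ["hocr", "pdf", "txt"]
    print(" ".join(build_command("image.png", "demo", configs, lang="eng", psm=3)))
    print(expected_outputs("demo", configs))
```

The argument list returned by `build_command` could be passed to `subprocess.run` on a machine where Tesseract is installed. The sketch does not handle the special `stdout` output base, which writes no file at all.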
