Version 0.0.4, June 2024
txt2epub
is a command-line utility for Linux,
for converting one
or more plain text files into an EPUB document. It will insert the
standard author/title meta-data, generate a table of contents,
and can include a cover image.
Limited formatting is possible using Markdown-style text markup,
or full XHTML if required.
This utility is intended as a relatively quick way to convert books provided as plain text into a format that can be handled more easily by e-readers. Although most portable reading devices and software can handle plain text perfectly well, the lack of meta-data or a cover image makes collections of such documents unwieldy.
Although it is not its main function,
txt2epub
can be used with just a plain text editor
to produce a commercial-quality EPUB novel,
that will pass most publishers' validation checks. However,
books that have complex formatting, or embedded images,
need a more sophisticated approach.
One of the design goals of txt2epub
is to produce "clean"
documents, free of software-specific stylesheets and formatting. The EPUBs it
creates will not specify fonts, absolute text sizes, colours, margins, or
layout. The way the text is
rendered is thus completely under the control of the reader. Its output should
thus be acceptably readable on screens of different sizes.
txt2epub\ -o\ dickens\_great\_expectations.epub \
--author\ "Dickens,\ Charles"\ --title\ "Great\ Expectations" \
--cover-image ge.jpg\
chapter01.txt chapter02.txt chapter03.txt\ ...
Convert files chapter01.txt
, etc., into an
EPUB document, setting the
author and title meta-data appropriately. Each file will receive an
entry in the table of contents. The image ge.jpg
will
form the book cover.
The only external dependencies are on the standard
linux zip
utility, and the PCRE regular expression parsing library. Both
should be available in the repositories of most Linux distributions.
For RHEL/Fedora: yum install zip pcre-devel
.
txt2epub
will probably build and run on other
Linux-like systems, but this has not been tested.
The usual:
$ make
$ sudo make install
txt2epub
may be found in the binary repositories of some Linux distributions.
While installing from a repository will usually be quicker than building
from source, repositories are often less up-to-date than the source.
Unless it is disabled (--ignore-markdown
), txt2epub
processes a small set of markdown-type formatting markers:
#This is a heading
##This is a subheading
###This is a subsubheading
This is _italic_. This is *bold*
A line that ends in two spaces (which may not be visible at all in a text editor) is terminated with a line-break. This is a simple way to include pre-formatted text.
These markdown constructions are turned into basic (X)HTML tags (not style classes).
Note that Markdown-style markup cannot span lines. A very long italic
passage, for example, must be rendered as a single line, or the
italic marker repeated on subsequent lines.
txt2epub
does not support Markdown list constructs, but
these could be created using (X)HTML tags, if desired.
Any text files supplied to
txt2epub
are just inserted into the body of an XHTML
document, as this is the presentation format that EPUB specifies.
It follows that files can contain XHTML markup if necessary.
XHTML is a stricter standard than HTML, and many ordinary HTML files will not
pass muster -- a typical problem is not closing tags, or using
the wrong letter case in tags.
Providing XHTML can be useful for heavily formatted pages such as a
title page or dedication. However, the files provided should
not be full XHTML files --
txt2epub
will write the proper XHTML header and footer.
Instead, just provide the body text.
Many commercial e-readers use existing HTML rendering software to display text; in such cases you might get away with using ordinary HTML. However, it isn't strictly correct, and won't pass a publisher's validation checks.
txt2epub
does not write a contents page in the document.
However, it does write an NCX table of contents, which most e-readers
will be able to display at any point whilst reading a book.
This form of table of contents also enables the next-chapter/previous-chapter
controls on readers that have them.
By default, the entries in the table of contents will be taken from the input filename, after removing any extension. So a nicely-formatted table of contents will require that the filenames are as a reader should see them, with capital letters where appropriate and spaces between words.
An alternative approach to generating a table of contents is to ensure that the first line of each file is a chapter heading, and use the `--first-lines` switch. This will also format the first line as a heading (specifically, it will embed it in an H1 tag).
E-book text files tend to be formatted in one of four ways:
- One very long line per paragraph, with a blank line between each paragraph
- One very long line per paragraph, with no blank lines between paragraphs
- Variable-length lines with blank lines to indicate paragraph breaks
- Variable-length lines with no blanks; paragraph breaks are indicated by white-space intended lines
No special effort is required to handle the first type. The second
type will be formatted by most readers as a solid block of uninterrupted
text, which is not pleasant to read. The --extra-para
switch might help here, by inserting a paragraph break after each
input line.
Files of the third type present no problem.
txt2epub
attempts to handle the fourth type by treating any
line that starts with three or more whitespace characters as a paragraph
break. Because some files that are formatted as variable-length lines
end up with spaces at the start of each line, this behaviour can be
turned off using --ignore-indent
.
txt2epub
will read from standard input (stdin) if
a minus sign (-) is used for the filename. It will be necessary
to specify the EPUB filename (-o
) in such a case.
The switch --cover-image
can be used to provide the EPUB
document with a leading cover page. This image is presented as a single-page
without any annotation, at the start of the book. EPUB guidelines
suggest that a cover image should be 590 pixels wide by 750 high. No
check is made that the image meets this guideline -- it is simply
copied into the EPUB. An error message will be shown if the image file
does not exist, but the EPUB will still be created.
The EPUB specification states that images files must be in JPEG, GIF, SVG, or PNG formats. No checks are made that this rule is being followed -- `txt2epub` will install images of any type but, as with wrongly sized images, EPUB viewers vary in their willingness to display them.
txt2epub
has no built-in support for splitting long
text files into sections or chapters. There are many ways in which
this might be done, and Linux already has useful utilities for doing it.
Consider, for example, a long file called fred.txt
, that
is divided into sections headed by "Chapter 1", "Chapter 2", etc.
This can be split into chapters like this:
csplit -f chapter_ -b %02d.txt fred.txt /Chapter.*/ {*}
This command will create the files chapter\_00.txt
,
chapter\_01.txt
, etc. These chapters can then be
assembled into an EPUB like this:
txt2epub -a "Fred Blogs" -t "My Life as a Dog" -f -o blogs.epub\ chapter*.txt
(being careful about the use of the filename wildcard, as discussed above.)
The -f
switch instructs txt2epub
to use the
first line of each file as a chapter heading, both in formatting and
in the table of contents. This works here because the use of csplit
ensures
that every file (with the possible exception of the first) begins
with the specified pattern.
EPUB text is required to be formatted as UTF-8. Plain ASCII works fine, as it is a subset of UTF-8. 8-bit extended ASCII variants will display with varying degrees of ugliness, depending on how many extended characters are used. A typical symptom of encoding mismatches of this sort is to see double-quotes rendered as upside-down question marks, or similar punctuation errors.
txt2epub
assumes that all text input is in UTF-8 or ASCII format.
If this assumption causes problems, the iconv
utility
may be used to pre-process the text and fix the encoding. Unfortunately,
if you receive a text document that has been converted from
Microsoft Word or some other proprietary word processor, it can often
be quite difficult to guess what the character encoding is. Consequently,
some trial-and-error may be needed.
txt2epub
can not decode PDF documents, but reasonable results
may sometimes be obtained by using it to process the output of
pdftotext -layout -nopgbrk
. The -layout
switch tells
pdftotext
to
attempt to preserve page layout; this is usually impossible, but it does
mean that you will usually get blank lines between paragraphs. These
are needed for txt2epub
to identify paragraph breaks.
The -nopgbrk
switch prevents page break (ctrl-L) characters
being written into the text. These don't usually cause problems in
EPUB viewers -- in fact, they are usually ignored. But, strictly speaking,
they are illegal in UTF-8 XML.
Documents converted from print sources often have page numbers and
other unhelpful text embedded in the document body. Most of this is
difficult to remove, but
txt2epub
will attempt to remove page numbers, if the
--remove-pagenum
switch is specified. A page number is
taken to be any line that consists of white space, followed by
digits. Unfortunately, while (for example) a single line containing
"23" will be removed, "Page 23" won't. Documents with this kind
of detritus may need more sophisticated pre-processing.
By default txt2epub
writes plain paragraph tags to delineate paragraphs
in the output. Ebook readers usually render this formatting with a
blank line between paragraphs. Using the --para-indent
switch will
make the utility output a <style>
header to set the paragraph
separation to a plain left indent, which can be help when reading on a
small screen. In general txt2epub
does not try to control formatting,
in the hope that viewer software will be sufficiently flexible as to
allow the user to choose preferences. This -- paragraph separation --
is an area when viewer software tends to fall short.
txt2epub
presently does not remove unnecessary byte-order
markers and similar encoding detritus from text files.
Users should be wary of using constructions like "book*.txt" to include lists of files. While Linux shells usually present files in alphanumeric order, subtleties like locale and collation settings can modify this. It may be safer to list the files explicitly.
Files created by this utility will not necessarily pass the validation
in
epubcheck
because the "mimetype" file is not always the first in the archive. This
can be fixed if it causes problems with readers; so far none have been
reported.
txt2epub
does not write a "guide" section in the
NCX table-of-contents. This is optional and, so far as I know, no EPUB
reader takes much notice of it.
More detailed command-line usage information is avaialble in the manual:
man txt2epub
.
txt2epub
is maintained by Kevin Boone and other contributors, and
distributed under the terms of the GNU Public Licence, version 3.0.
Essentially, you may do whatever you like with it, provided the original
authors are acknowledged, and you accept the risks involved in its use.
0.0.4, June 2024
Tidied up Makefile to work better with Gentoo. Fixed an error where later versions of gcc enforce 3-argument open() in certain usages
0.0.3, May 2023
Fixed a nasty bug where space indents were being processed in the first line of a file, causing the header to be split between paras
0.0.2, May 2023
Added --para-indent
feature (contributed by KenH2000)