pdfsandwich: A tool to make "sandwich" OCR pdf files

© 2010-2014 Tobias Elze
Home

sandwichWhat is pdfsandwich?

pdfsandwich generates "sandwich" OCR pdf files, i.e. pdf files which contain only images (no text) will be processed by optical character recognition (OCR) and the text will be added to each page invisibly "behind" the images.

pdfsandwich is a command line tool which is supposed to be useful to OCR scanned books or journals. It is able to recognize the page layout even for multicolumn text.

Essentially, pdfsandwich is a wrapper script which calls the following binaries: unpaper (since version 0.0.9), convert, gs, hocr2pdf (for tesseract prior to version 3.03), and tesseract. It is known to run on Unix systems and has been tested on Linux and MacOS X. It supports parallel processing on multiprocessor systems.

While pdfsandwich works with any version of tesseract from version 3.0 on, tesseract 3.03 or later is recommended for best performance. By default, pdfsandwich runs unpaper to enhance the readability of scanned pages and to improve OCR. For instance, slightly rotated pages are automatically straightened and dark edges removed. For optimally scanned pdf files, this can be switched off by option -nopreproc to speed up processing.

What's new?

Latest version is 0.1.3 (August 10, 2014).

Since version 0.0.9 pdfsandwich optionally preprocesses scanned pdfs by unpaper.

Since version 0.0.5 pdfsandwich uses tesseract instead of cuneiform for OCR.

Download and Installation

Linux

Ubuntu

deb packages for Ubuntu (12.04 or later) are available for Download on the project website.

At the moment of this writing, these dep packages are not yet available over the standard Ubuntu packages sources (like universe). You have two options to install the deb package: First, if you can create a local deb package repository on your computer (recommended). Second, if you are too lazy to generate a local repository, there is the following quick and dirty way: Download the deb file, e.g. pdfsandwich_0.1.0_amd64.deb, to some local directory, and execute the following commands in this directory:

sudo dpkg -i pdfsandwich_0.1.0_amd64.deb  # There will be an error message. Ignore it and proceed!
sudo apt-get -fy install

Replace pdfsandwich_0.1.0_amd64.deb with the name of your downloaded deb file. After the first command, there will be an error message related to missing dependencies. Ignore it, this will be fixed by the second command.

Gentoo

A Gentoo ebuild is available.

MacOS X

A Macports Portfile exists in the package sources, respectively can be generated by

make Portfile
It is unfortunately untested for versions later than 0.0.3.

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:

svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

./configure
make
make install

Usage

pdfsandwich is a command line utility. If you have a scanned pdf file, for instance this one: alice.pdf (which is the first chapter of a novel you might have heard of), invoke pdfsandwich like this:

pdfsandwich alice.pdf

This will generate a file alice_ocr.pdf which looks like the orginal file, but the recognized text will be placed behind the scanned images. You can make full text searches now or select text areas.

For some pdf files, pdfsandwich produces much larger files after OCR processing. In this case, it might help to call pdfsandwich again on the already OCR'ed file.

Preprocessing

pdfsandwich provides a number of preprocessing procedures to enhance the quality of the scanned pages before text recognition. Most OCR specific preprocessing options are provided via the program unpaper, such as layout optimization, the removal of dark edges, and the straightening of skewed scans (deskewing). Let's illustrate some preprocessing options and their impact on OCR with a scanned page from the book Inquiries into Human Faculty by Francis Galton (1907). The original pdf (galton.pdf) looks like this:

galton_orig

Without any preprocessing, text recognition of page 117 is impaired by skew. The beginning of p. 117 with the option -nopreproc reads:

Visionaries 1 l7 elaborate and silver coloured ligree ornaments patterns ; of gold carpets and silver in bril lian ower-stan t tint s ds, are etc. not ; uncommon.

Obviously, tesseract is unable to appropriately separate the lines, and OCR breaks down. Standard preprocessing, i.e. calling pdfsandwich without any options, is able to remove black edges:

galton_standard

Although the scanned image looks nicer, the problem with the skewed left-hand side is not yet solved, text recognition is similarly disastrous. However, we can tell pdfsandwich explicitly about the layout of the page:

pdfsandwich -layout double galton.pdf

This tells unpaper to search for two separate sub-pages on the page and to apply preprocessing procedures to each of them separately. The output pdf looks considerably better now:

galton_double

Particularly the deskewing has a tremendous impact on text recognition of p. 117:

Visionaries
I I7
and silver ligree ornaments ; gold and silver ower-stands, etc. ;
elaborate coloured patterns of carpets in brilliant tints are not
uncommon.
Another peculiarity resides in the extreme restlessness of
my visual objects. It is often very difficult to keep them still,
as well as from changing in character. They will rapidly oscil-
late or else rotate to a most perplexing degree, and when the
characters change at the same time a critical examination is
almost impossible. When the process is in full activity,l feel
as if I were a mere spectator at a diorama of a very eccentric
kind, and was in no way concerned with the getting up of the
performance.
When a. succession of images has been passing, I sometimes
alez ermz'ne to introduce an object, say a watch. Very often it is
next to impossible to succeed. There is an evident struggle.
The watch, pure and simple, will not come; but some hybrid
structure appears something round, perhaps but it lapses into
a warming-pan or other unexpected object.
This practice has brought to my mind very clearly the dis-
tinction between at least one form of automatism of the brain
and volition; but the strength of the former is enormous, for
the visual objects, when in full career of the change, are impera-
tive in their refusal to be interfered with.
[...]

The layout specification, which is switched off by default, allows the options single for each pdf page containing a single scanned page, and double, as shown above. Note that in certain circumstances rigid preprocessing options can substantially deteriorate results. For instance, if the wrong layout specifications are provided, whole sub-pages might get filtered out, or figures might be considered visual noise and disappear. If the scan quality is good, it is advisable to completely switch off preprocessing by the -nopreproc option. Note that simple deskewing of a rotated page (without sub-pages) can be obtained by convert, without any involvement of unpaper. A useful convert option for this is, for instance:

pdfsandwich -coo "-deskew 40%" scanned_file.pdf

Both convert and unpaper provide a large number of potentially useful preprocessing options which can be specified by -coo and -unpo, respectively. You might want to have a look at the convert online manual or at the man page of unpaper.

Options

pdfsandwich supports the following command line options:

  -convert       -convert filename : name of convert binary (default: convert)
  -coo           -coo options : additional convert options; make sure to quote;
                  e.g. -coo "-normalize -black-threshold 75%"
                  call convert --help or man convert for all convert options
  -debug         keep all temporary files in /tmp (for debugging)
  -enforcehocr2pdf       use hocr2pdf even if tesseract >= 3.03
  -first_page    -first_page number : number of page to start OCR from (default: 1)
  -grayfilter            enable unpaper's gray filter; further options can be set by -unpo
  -gs            -gs filename : name of gs binary (default: gs)
  -hocr2pdf      -hocr2pdf filename : name of hocr2pdf binary (default: hocr2pdf);
                  ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
  -hoo           -hoo options : additional hocr2pdf options; make sure to quote
  -identify      -identify filename : name of identify binary (default: identify)
  -last_page     -last_page number : number of page up to which to process OCR (default: number of pages in inputfile)
  -lang          -lang language : language of the text; option to tesseract (defaut: eng)
                  e.g: eng, deu, deu-frak, fra, rus, swe, spa, ita, ...
                  see option -list_langs;
                  Multiple languages may be specified, separated by plus characters.
  -layout        -layout { single | double | none } : layout of the scanned pages; requires unpaper
                  single: one page per sheet
                  double: two pages per sheet
                  none: no auto-layout (default)
  -list_langs    list currently available languages and exit;
                  in case of custom binaries of tesseract, place this after the -tesseract option
  -maxpixels     -maxpixels NUM : maximal number of pixels allowed for input file
                  if (resolution/72)^2 *width*height > maxpixels then scale page of input file down
                  prior to OCR so that page size in pixels corresponds to maxpixels;
                  default: 17415167 (A3 @ 300 dpi)
  -noimage       do not place the image over the text (requires hocr2pdf; ignored without -enforcehocr2pdf option)
  -nopreproc     do not preprocess with unpaper
  -nthreads      -nthreads number : number of parallel threads (default: guessed number of CPUs; if guessing fails: 1)
  -o             -o filename : output file; default: inputfile_ocr.pdf
  -pagesize      -pagesize { original | NUMxNUM } : set page size of output pdf
                  original: same as input file (default)
                  NUMxNUM: width x height in pixel (e.g. for A4: -pagesize 595x842)
  -resolution    -resolution NUM : resolution (dpi) used for OCR (default: 300)
  -rgb           use RGB color space for images (default: black and white);
                  use with care: causes problems with some color spaces
  -sloppy_text   sloppily place text, group words, do not draw single glyphs;
                  ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
  -tesseract     -tesseract filename : name of tesseract binary (default: tesseract)
  -tesso         -tesso options : additional tesseract options; make sure to quote
  -unpaper       -unpaper filename : name of unpaper binary (default: unpaper)
  -unpo          -unpo options : additional unpaper options; make sure to quote
  -quiet         suppress output
  -verbose       produce more output
  -version       print version and quit
  -help  Display this list of options
  --help  Display this list of options

Text size, resolution, and page size

PDF is a document format optimized for printing. It specifies its page size in units of the the paper on which the file is supposed to be printed, such as A4 or letter. OCR, however, is an image processing operation which requires a digital image as a raster of pixels. Therefore, we need to rasterize each page of the PDF with a resolution which yields character sizes suitable for tesseract. For text sizes which are conveniently readable on a printed A4 or letter page, the default resolution of pdfsandwich, 300 dpi, is a reasonable choice. More specifically, it is recommended that the x-height, that is the height of the lower case letter x, should be around 20 pixels, but definitely not smaller than 10 pixels (see tesseract FAQ).

Sometimes, a scanner software generates PDFs with unreasonably large page size settings, which you typically notice when you need to zoom the pages to very low percentages in your PDF reader to be able to read the page content properly. If such a huge PDF page would be rasterized with 300 dpi, very large digital images would be the result which would slow down tesseract and would require large amounts of memory to be processed. As in most cases, such huge page sizes are errors of the scanner software, the default settings of pdfsandwich cause such pages to be scaled down to around page size A3 prior to OCR, generate the sandwich pdf, and finally scale the output pdf up to its original paper size (or any output paper size set by option -pagesize). If you know for sure that the very large pages of your input file are intended, for instance in cases of scanned posters, you can increase the parameter -maxpixels to prevent pdfsandwich from scaling down the page size prior to OCR.

Languages

Specifying the language(s) of the scanned document may considerably improve text recognition, as language specific dictionaries can be applied during the OCR process. pdfsandwich allows to specify languages by the -lang option followed by the language abbreviation. Multiple languages in one document can be specified separated by plus characters. If your document contains parts in English, French, and German, for instance, call pdfsandwich like this:

pdfsandwich -lang eng+fra+deu multilingual_document.pdf

Via Tesseract, numerous language packagess available - have a look at the tesseract download page for a complete list. Here is an incomplete selection of supported languages and their abbreviations:

ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), chr (Cherokee), dan (Danish), deu (German), ell (Greek), eng (English), enm (Old English), epo (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croation), hun (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese).

Note that the respective tesseract language package needs to be installed on your system to be usable by pdfsandwich. This option lists the languages which are available on your system:

pdfsandwich -list_langs

Feedback

For further questions or comments, please contact me: sourceforge [at] tobias-elze.de.

Tobias.