pdfsandwich: A tool to make "sandwich" OCR pdf files

What is pdfsandwich?

pdfsandwich generates "sandwich" OCR pdf files, i.e. pdf files which contain only images (no text) will be processed by optical character recognition (OCR) and the text will be added to each page invisibly "behind" the images.

pdfsandwich is a command line tool which is supposed to be useful to OCR scanned books or journals. It is able to recognize the page layout even for multicolumn text.

Essentially, pdfsandwich is a wrapper script which calls the following binaries: unpaper (since version 0.0.9), convert, gs, hocr2pdf (for tesseract prior to version 3.03), and tesseract. It is known to run on Unix systems and has been tested on Linux and MacOS X. It supports parallel processing on multiprocessor systems.

While pdfsandwich works with any version of tesseract from version 3.0 on, tesseract 3.03 or later is recommended for best performance. By default, pdfsandwich runs unpaper to enhance the readability of scanned pages and to improve OCR. For instance, slightly rotated pages are automatically straightened and dark edges removed. For optimally scanned pdf files, this can be switched off by option -nopreproc to speed up processing.

What's new?

Latest version is 0.1.7 (August 10, 2018).

Note: If you use Tesseract 4 or later, it is highly recommended to use pdfsandwich 0.1.7 or later, as Tesseract may freeze when called in multiple threads.

Since version 0.1.5 pdfsandwich uses pdfinfo and pdfunite instead of ghostscript for most operations. Ghostscript is now optional only needed for resizing pdf pages, if the respective command line option is given.

Since version 0.0.9 pdfsandwich optionally preprocesses scanned pdfs by unpaper.

Since version 0.0.5 pdfsandwich uses tesseract instead of cuneiform for OCR.

Download and Installation

Linux

Debian/Ubuntu

Debian and Ubuntu provide pdfsandwich through their standard repositories, although not always the latest versions. Independent of this, I maintain pdfsandwich deb packages which are available for Download on the project website. If you prefer to install the latest version, download the respective deb file, e.g. pdfsandwich_0.1.7_amd64.deb to some local directory, and either use your preferred graphical package manager or execute the following commands in this directory:

sudo dpkg -i pdfsandwich_0.1.7_amd64.deb  # If there are error messages due to missing dependencies, ignore them and proceed.
sudo apt-get -fy install

Other Distributions

Several other Linux distributions ship pdfsandwich through their standard repositories, such as Arch or Gentoo. An (incomplete) list of pdfsandwich ports can be found on repology.org.

Gentoo

MacOS

pdfsandwich is available through Homebrew.

FreeBSD/OpenBSD

Ports are available for FreeBSD and OpenBSD.

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:

svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

./configure
make
make install

Usage

pdfsandwich is a command line utility. If you have a scanned pdf file, for instance this one: alice.pdf (which is the first chapter of a novel you might have heard of), invoke pdfsandwich like this:

pdfsandwich alice.pdf

This will generate a file alice_ocr.pdf which looks like the orginal file, but the recognized text will be placed behind the scanned images. You can make full text searches now or select text areas.

For some pdf files, pdfsandwich produces much larger files after OCR processing. In this case, it might help to call pdfsandwich again on the already OCR'ed file.

Preprocessing

pdfsandwich provides a number of preprocessing procedures to enhance the quality of the scanned pages before text recognition. Most OCR specific preprocessing options are provided via the program unpaper, such as layout optimization, the removal of dark edges, and the straightening of skewed scans (deskewing). Let's illustrate some preprocessing options and their impact on OCR with a scanned page from the book Inquiries into Human Faculty by Francis Galton (1907). The original pdf (galton.pdf) looks like this:

Without any preprocessing, text recognition of page 117 is impaired by skew. The beginning of p. 117 with the option -nopreproc reads:

Visionaries 1 l7 elaborate and silver coloured ligree ornaments patterns ; of gold carpets and silver in bril lian ower-stan t tint s ds, are etc. not ; uncommon.

Obviously, tesseract is unable to appropriately separate the lines, and OCR breaks down. Standard preprocessing, i.e. calling pdfsandwich without any options, is able to remove black edges:

Although the scanned image looks nicer, the problem with the skewed left-hand side is not yet solved, text recognition is similarly disastrous. However, we can tell pdfsandwich explicitly about the layout of the page:

pdfsandwich -layout double galton.pdf

This tells unpaper to search for two separate sub-pages on the page and to apply preprocessing procedures to each of them separately. The output pdf looks considerably better now:

Particularly the deskewing has a tremendous impact on text recognition of p. 117:

Visionaries
I I7
and silver ligree ornaments ; gold and silver ower-stands, etc. ;
elaborate coloured patterns of carpets in brilliant tints are not
uncommon.
Another peculiarity resides in the extreme restlessness of
my visual objects. It is often very difficult to keep them still,
as well as from changing in character. They will rapidly oscil-
late or else rotate to a most perplexing degree, and when the
characters change at the same time a critical examination is
almost impossible. When the process is in full activity,l feel
as if I were a mere spectator at a diorama of a very eccentric
kind, and was in no way concerned with the getting up of the
performance.
When a. succession of images has been passing, I sometimes
alez ermz'ne to introduce an object, say a watch. Very often it is
next to impossible to succeed. There is an evident struggle.
The watch, pure and simple, will not come; but some hybrid
structure appears something round, perhaps but it lapses into
a warming-pan or other unexpected object.
This practice has brought to my mind very clearly the dis-
tinction between at least one form of automatism of the brain
and volition; but the strength of the former is enormous, for
the visual objects, when in full career of the change, are impera-
tive in their refusal to be interfered with.
[...]

The layout specification, which is switched off by default, allows the options single for each pdf page containing a single scanned page, and double, as shown above. Note that in certain circumstances rigid preprocessing options can substantially deteriorate results. For instance, if the wrong layout specifications are provided, whole sub-pages might get filtered out, or figures might be considered visual noise and disappear. If the scan quality is good, it is advisable to completely switch off preprocessing by the -nopreproc option. Note that simple deskewing of a rotated page (without sub-pages) can be obtained by convert, without any involvement of unpaper. A useful convert option for this is, for instance:

pdfsandwich -coo "-deskew 40%" scanned_file.pdf

Both convert and unpaper provide a large number of potentially useful preprocessing options which can be specified by -coo and -unpo, respectively. You might want to have a look at the convert online manual or at the man page of unpaper.

Options

pdfsandwich supports the following command line options:

  -convert 	 -convert filename : name of convert binary (default: convert)
  -coo 		 -coo options : additional convert options; make sure to quote;
		  e.g. -coo "-normalize -black-threshold 75%"
		  call convert --help or man convert for all convert options
  -debug 	 keep all temporary files in /tmp (for debugging)
  -enforcehocr2pdf 	 use hocr2pdf even if tesseract >= 3.03
  -first_page 	 -first_page number : number of page to start OCR from (default: 1)
  -grayfilter 	 enable unpaper's gray filter; further options can be set by -unpo
  -gray          use grayscale for images (default: black and white);
                  will be overridden by use of rgb
  -gs 		 -gs filename : name of gs binary (default: gs); optional, only required for resizing
  -hocr2pdf 	 -hocr2pdf filename : name of hocr2pdf binary (default: hocr2pdf);
		  ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
  -hoo 		 -hoo options : additional hocr2pdf options; make sure to quote
  -identify 	 -identify filename : name of identify binary (default: identify)
  -last_page 	 -last_page number : number of page up to which to process OCR (default: number of pages in inputfile)
  -lang 	 -lang language : language of the text; option to tesseract (defaut: eng)
		  e.g: eng, deu, deu-frak, fra, rus, swe, spa, ita, ...
		  see option -list_langs;
		  Multiple languages may be specified, separated by plus characters.
  -layout 	 -layout { single | double | none } : layout of the scanned pages; requires unpaper
		  single: one page per sheet
		  double: two pages per sheet
		  none: no auto-layout (default)
  -list_langs 	 list currently available languages and exit;
		  in case of custom binaries of tesseract, place this after the -tesseract option
  -maxpixels 	 -maxpixels NUM : maximal number of pixels allowed for input file
		  if (resolution/72)^2 *width*height > maxpixels then scale page of input file down
		  prior to OCR so that page size in pixels corresponds to maxpixels;
		  default: 17415167 (A3 @ 300 dpi)
  -noimage 	 do not place the image over the text (requires hocr2pdf; ignored without -enforcehocr2pdf option)
  -nopreproc 	 do not preprocess with unpaper
  -nthreads 	 -nthreads number : number of parallel threads (default: guessed number of CPUs; if guessing fails: 1)
  -o 		 -o filename : output file; default: inputfile_ocr.pdf (if extension is different
		  from .pdf, original extension is kept)
  -omp_thread_limit      -omp_thread_limit number : number of threads tesseract may use for each page (default: 1)
  -pagesize 	 -pagesize { original | NUMxNUM } : set page size of output pdf (requires ghostscript)
		  original: same as input file (default)
		  NUMxNUM: width x height in pixel (e.g. for A4: -pagesize 595x842)
  -pdfinfo 	 -pdfinfo filename : name of pdfinfo binary (default: pdfinfo)
  -pdfunite 	 -pdfunite filename : name of pdfunite binary (default: pdfunite)
  -resolution 	 -resolution NUM : resolution (dpi) used for OCR (default: 300)
  -rgb 		 use RGB color space for images (default: black and white);
		  use with care: causes problems with some color spaces
  -sloppy_text 	 sloppily place text, group words, do not draw single glyphs;
		  ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
  -tesseract 	 -tesseract filename : name of tesseract binary (default: tesseract)
  -tesso 	 -tesso options : additional tesseract options; make sure to quote
  -unpaper 	 -unpaper filename : name of unpaper binary (default: unpaper)
  -unpo 	 -unpo options : additional unpaper options; make sure to quote
  -quiet 	 suppress output
  -verbose 	 produce more output
  -version 	 print version and quit
  -help  Display this list of options
  --help  Display this list of options

Text size, resolution, and page size

PDF is a document format optimized for printing. It specifies its page size in units of the the paper on which the file is supposed to be printed, such as A4 or letter. OCR, however, is an image processing operation which requires a digital image as a raster of pixels. Therefore, we need to rasterize each page of the PDF with a resolution which yields character sizes suitable for tesseract. For text sizes which are conveniently readable on a printed A4 or letter page, the default resolution of pdfsandwich, 300 dpi, is a reasonable choice. More specifically, it is recommended that the x-height, that is the height of the lower case letter x, should be around 20 pixels, but definitely not smaller than 10 pixels (see tesseract FAQ).

Sometimes, a scanner software generates PDFs with unreasonably large page size settings, which you typically notice when you need to zoom the pages to very low percentages in your PDF reader to be able to read the page content properly. If such a huge PDF page would be rasterized with 300 dpi, very large digital images would be the result which would slow down tesseract and would require large amounts of memory to be processed. As in most cases, such huge page sizes are errors of the scanner software, the default settings of pdfsandwich cause such pages to be scaled down to around page size A3 prior to OCR and then generate the sandwich pdf. If you know for sure that the very large pages of your input file are intended, for instance in cases of scanned posters, you can increase the parameter -maxpixels to prevent pdfsandwich from scaling down the page size prior to OCR.

Languages

Specifying the language(s) of the scanned document may considerably improve text recognition, as language specific dictionaries can be applied during the OCR process. pdfsandwich allows to specify languages by the -lang option followed by the language abbreviation. Multiple languages in one document can be specified separated by plus characters. If your document contains parts in English, French, and German, for instance, call pdfsandwich like this:

pdfsandwich -lang eng+fra+deu multilingual_document.pdf

Via Tesseract, numerous language packagess available - have a look at the tesseract download page for a complete list. Here is an incomplete selection of supported languages and their abbreviations:

ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), chr (Cherokee), dan (Danish), deu (German), ell (Greek), eng (English), enm (Old English), epo (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croation), hun (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese).

Note that the respective tesseract language package needs to be installed on your system to be usable by pdfsandwich. This option lists the languages which are available on your system:

pdfsandwich -list_langs

Feedback

For further questions or comments, please contact me: sourceforge [at] tobias-elze.de.

Tobias.