pdfsandwich: A tool to make "sandwich" OCR pdf files

© 2010-2012 Tobias Elze

What is pdfsandwich?

pdfsandwich generates "sandwich" OCR pdf files: pdf files which contain only images (no text) are processed by optical character recognition (OCR), and the recognized text is added to each page invisibly "behind" the images.

pdfsandwich is a command line tool intended for OCR of scanned books or journals. It recognizes the page layout, even for multicolumn text.

Essentially, pdfsandwich is a wrapper script which calls the following binaries: convert, gs, hocr2pdf, and tesseract. It is known to run on Unix systems and has been tested on Linux and MacOS X. It supports parallel processing on multiprocessor systems.
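
Because pdfsandwich only wraps these four programs, a quick check that they are all on your PATH can save a failed run. The following is a minimal sketch, not part of pdfsandwich; the function name missing_deps is made up for illustration. If a binary is installed under a different name, it can be passed to pdfsandwich with the -convert, -gs, -hocr2pdf, or -tesseract option instead.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with pdfsandwich):
# print the name of every given program that is not found on PATH.
missing_deps() {
  for bin in "$@"; do
    command -v "$bin" >/dev/null 2>&1 || echo "$bin"
  done
}

# The four binaries pdfsandwich calls, under their default names.
missing=$(missing_deps convert gs hocr2pdf tesseract)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "pdfsandwich dependencies not found:" $missing >&2
fi
```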

What's new?

Since version 0.0.5 pdfsandwich uses tesseract instead of cuneiform for OCR.

The latest version is 0.0.6.

Download and Installation

Linux

Ubuntu

deb packages for Ubuntu (12.04 or newer) are available for download on the project website.

Gentoo

A Gentoo ebuild is available.

MacOS X

A Macports Portfile for version 0.0.3 has been submitted and is awaiting approval. However, there seem to be problems with cuneiform on MacOS 10.6. (I still have 10.5, and here everything works fine.)

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as a .tar.bz2 package from the download area on the project website or check them out with Subversion:

svn co https://pdfsandwich.svn.sourceforge.net/svnroot/pdfsandwich/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

./configure
make
make install

Usage

pdfsandwich is a command line utility. If you have a scanned pdf file, for instance this one: alice.pdf (which is the first chapter of a popular classical novel), invoke pdfsandwich like this:

pdfsandwich alice.pdf

This will generate a file alice_ocr.pdf which looks like the original file, but the recognized text will be placed behind the scanned images. You can now run full-text searches or select text areas.
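
To process a whole directory of scans, the single call above can be wrapped in a loop. This is only a sketch: ocr_name and ocr_all are hypothetical helper names, and the output name is passed explicitly with -o rather than relying on the default.

```shell
#!/bin/sh
# Hypothetical helper: mirror the default naming, alice.pdf -> alice_ocr.pdf
ocr_name() {
  printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: OCR every PDF in directory $1 that has no
# *_ocr.pdf counterpart yet.
ocr_all() {
  for f in "$1"/*.pdf; do
    [ -e "$f" ] || continue                  # directory contains no PDFs
    case $f in *_ocr.pdf) continue ;; esac   # skip results of earlier runs
    out=$(ocr_name "$f")
    [ -e "$out" ] || pdfsandwich -o "$out" "$f"
  done
}
```

For example, ocr_all scans would turn scans/book.pdf into scans/book_ocr.pdf and leave files that have already been processed alone.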

For some pdf files, pdfsandwich produces much larger files after OCR processing. In this case, it might help to call pdfsandwich again on the already OCR'ed file.
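
One way to decide whether a second pass is worth trying is to compare file sizes. The sketch below is an illustration under assumptions: the helper names and the default factor of 3 are arbitrary choices, not values taken from pdfsandwich.

```shell
#!/bin/sh
# Size of a file in bytes (tr strips the padding some wc versions add).
size_bytes() {
  wc -c < "$1" | tr -d ' '
}

# Hypothetical check: succeed if file $2 is more than FACTOR times
# as large as file $1. FACTOR is the optional third argument (default 3).
grew_too_much() {
  before=$(size_bytes "$1")
  after=$(size_bytes "$2")
  factor=${3:-3}
  [ "$after" -gt $((before * factor)) ]
}
```

With these helpers, a second pass could be triggered only when needed:

pdfsandwich alice.pdf
grew_too_much alice.pdf alice_ocr.pdf && pdfsandwich -o alice_ocr2.pdf alice_ocr.pdf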

Options

The following command line options exist:

  -convert       -convert filename : name of convert binary (default: convert)
  -coo           -coo options : additional convert options; make sure to quote;
                  e.g. -coo "-normalize -black-threshold 75%"
                  call convert --help or man convert for all convert options
  -first_page    -first_page number : number of page to start OCR from (default: 1)
  -gs            -gs filename : name of gs binary (default: gs)
  -hocr2pdf      -hocr2pdf filename : name of hocr2pdf binary (default: hocr2pdf)
  -last_page     -last_page number : number of page up to which to process OCR (default: number of pages in inputfile)
  -lang          -lang language : language of the text; option to tesseract (default: eng)
                  e.g.: eng, deu, deu-frak, fra, rus, swe, spa, ita...
                  see http://code.google.com/p/tesseract-ocr/downloads/list
  -noimage       do not place the image over the text
  -nthreads      -nthreads number : number of parallel threads (default: guessed number of CPUs; if guessing fails: 1)
  -o             -o filename : output file; default: inputfile_ocr.pdf
  -resolution    -resolution NUMxNUM : resolution used for OCR (default: 300x300)
  -rgb           use RGB color space for images (default: black and white);
                  use with care: causes problems with some color spaces
  -sloppy_text   sloppily place text, group words, do not draw single glyphs
  -tesseract     -tesseract filename : name of tesseract binary (default: tesseract)
  -tesso         -tesso options : additional tesseract options; make sure to quote
  -quiet         suppress output
  -verbose       produce more output
  -version       print version and quit
  -help  Display this list of options
  --help  Display this list of options
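
As an illustration of how several of these options combine, the following sketch OCRs only a page range in a given language. The function name sandwich_range, the DRY_RUN variable, and the output file naming are assumptions made for this example; only the pdfsandwich options themselves come from the list above.

```shell
#!/bin/sh
# Hypothetical helper: OCR only pages FIRST..LAST of INPUT.
# Usage: sandwich_range INPUT FIRST LAST [LANG]
# Set DRY_RUN=1 to print the command instead of running it.
sandwich_range() {
  input=$1 first=$2 last=$3 lang=${4:-eng}
  out="${input%.pdf}_p${first}-${last}_ocr.pdf"
  ${DRY_RUN:+echo} pdfsandwich -lang "$lang" \
      -first_page "$first" -last_page "$last" -o "$out" "$input"
}
```

With DRY_RUN=1 set, sandwich_range book.pdf 3 10 deu prints the command it would run:

pdfsandwich -lang deu -first_page 3 -last_page 10 -o book_p3-10_ocr.pdf book.pdf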

Feedback

For further questions or comments, please contact me: sourceforge [at] tobias-elze.de.

Tobias.


The sandwich image on this page is a modified version from Fritz Saalfeld's McRib image, licensed under Creative Commons Attribution ShareAlike 2.5.