pdfsandwich: A tool to make "sandwich" OCR pdf files
© 2010-2012 Tobias Elze
pdfsandwich generates "sandwich" OCR pdf files: pdf files which contain only images (no text) are processed by optical character recognition (OCR), and the recognized text is added to each page invisibly "behind" the images.
pdfsandwich is a command line tool intended for OCR of scanned books or journals. It is able to recognize the page layout even for multicolumn text.
Essentially, pdfsandwich is a wrapper script which calls the following binaries: convert, gs, hocr2pdf, and tesseract. It is known to run on Unix systems and has been tested on Linux and MacOS X. It supports parallel processing on multiprocessor systems.
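Since pdfsandwich only orchestrates these external programs, it can be worth checking that all four are on the PATH before starting a long OCR run. The following Python sketch is not part of pdfsandwich. The helper name missing_binaries is made up for illustration, and it assumes the binaries are installed under their default names.

```python
import shutil

# Default names of the external programs pdfsandwich calls (taken from the text above).
REQUIRED_BINARIES = ["convert", "gs", "hocr2pdf", "tesseract"]


def missing_binaries(names=REQUIRED_BINARIES, which=shutil.which):
    """Return the names that cannot be found on the PATH.

    `which` is injectable (hypothetical convenience) so the check can be
    tried out without the tools being installed.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required binaries found.")
```

If a binary is installed under a different name, the -convert, -gs, -hocr2pdf and -tesseract options listed further down can point pdfsandwich at it.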
Since version 0.0.5 pdfsandwich uses tesseract instead of cuneiform for OCR.
The latest version is 0.0.6.
deb packages for Ubuntu (12.04 or newer) are available for download on the project website.
A Gentoo ebuild is available.
A Macports Portfile for version 0.0.3 has been submitted and is awaiting approval. However, there seem to be problems with cuneiform on MacOS 10.6. (I still have 10.5, and here everything works fine.)
pdfsandwich is open source software (license: GPL). You can download the sources either as a .tar.bz2 package from the download area on the project website or check them out with Subversion:
svn co https://pdfsandwich.svn.sourceforge.net/svnroot/pdfsandwich/src pdfsandwich
If OCaml is installed on your system, you can compile and install as follows:
./configure
make
make install
pdfsandwich is a command line utility. If you have a scanned pdf file, for instance alice.pdf (the first chapter of a popular classic novel), invoke pdfsandwich like this:
pdfsandwich alice.pdf
This will generate a file alice_ocr.pdf which looks like the original file, but the recognized text will be placed behind the scanned images. You can now run full-text searches or select text areas.
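When pdfsandwich is called from a script, it helps to know in advance where the result will end up. The Python sketch below mirrors the naming shown above (alice.pdf becomes alice_ocr.pdf). Both function names are hypothetical, and the assumption that the output is written next to the input file with the same extension is mine, not something the text above states.

```python
import os
import subprocess


def default_output_name(input_pdf):
    """Mirror the documented default name: inputfile_ocr.pdf.

    Assumption: the "_ocr" suffix is inserted before the extension and the
    directory part of the path is kept unchanged.
    """
    root, ext = os.path.splitext(input_pdf)
    return root + "_ocr" + ext


def run_pdfsandwich(input_pdf, binary="pdfsandwich"):
    """Run pdfsandwich on one file and return the expected output path.

    Raises subprocess.CalledProcessError if pdfsandwich exits with an error.
    """
    subprocess.run([binary, input_pdf], check=True)
    return default_output_name(input_pdf)
```

With pdfsandwich installed, run_pdfsandwich("alice.pdf") would run the same command as above and return "alice_ocr.pdf".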
For some pdf files, pdfsandwich produces much larger files after OCR processing. In this case, it might help to call pdfsandwich again on the already OCR'ed file.
The following command line options exist:
-convert filename : name of convert binary (default: convert)
-coo options : additional convert options; make sure to quote; e.g. -coo "-normalize -black-threshold 75%"; call convert --help or man convert for all convert options
-first_page number : number of page to start OCR from (default: 1)
-gs filename : name of gs binary (default: gs)
-hocr2pdf filename : name of hocr2pdf binary (default: hocr2pdf)
-last_page number : number of page up to which to process OCR (default: number of pages in inputfile)
-lang language : language of the text; option to tesseract (default: eng); e.g.: eng, deu, deu-frak, fra, rus, swe, spa, ita...; see http://code.google.com/p/tesseract-ocr/downloads/list
-noimage : do not place the image over the text
-nthreads number : number of parallel threads (default: guessed number of CPUs; if guessing fails: 1)
-o filename : output file (default: inputfile_ocr.pdf)
-resolution NUMxNUM : resolution used for OCR (default: 300x300)
-rgb : use RGB color space for images (default: black and white); use with care: causes problems with some color spaces
-sloppy_text : sloppily place text, group words, do not draw single glyphs
-tesseract filename : name of tesseract binary (default: tesseract)
-tesso options : additional tesseract options; make sure to quote
-quiet : suppress output
-verbose : produce more output
-version : print version and quit
-help : display this list of options
--help : display this list of options
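To show how several of these options combine, here is a small Python sketch that assembles a pdfsandwich argument list. It uses only options from the list above. The function name build_command and its parameters are hypothetical, and placing the options before the input file is an assumption about the expected argument order.

```python
import shlex


def build_command(input_pdf, lang=None, first_page=None, last_page=None,
                  nthreads=None, output=None, convert_options=None,
                  binary="pdfsandwich"):
    """Assemble an argument list from a subset of the documented options."""
    cmd = [binary]
    if lang is not None:
        cmd += ["-lang", lang]
    if first_page is not None:
        cmd += ["-first_page", str(first_page)]
    if last_page is not None:
        cmd += ["-last_page", str(last_page)]
    if nthreads is not None:
        cmd += ["-nthreads", str(nthreads)]
    if convert_options is not None:
        # Passed as ONE argument, which is what quoting achieves in a shell.
        cmd += ["-coo", convert_options]
    if output is not None:
        cmd += ["-o", output]
    cmd.append(input_pdf)  # assumption: input file goes last
    return cmd


# German text, pages 2 to 5 only, with the convert options from the -coo example.
cmd = build_command("alice.pdf", lang="deu", first_page=2, last_page=5,
                    convert_options="-normalize -black-threshold 75%")
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems altogether, because the -coo value stays a single argument.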
For further questions or comments, please contact me: sourceforge [at] tobias-elze.de.
The sandwich image on this page is a modified version from Fritz Saalfeld's McRib image, licensed under Creative Commons Attribution ShareAlike 2.5.