pdfsandwich: A tool to make "sandwich" OCR pdf files
© 2010-2014 Tobias Elze
pdfsandwich generates "sandwich" OCR pdf files: pdf files that contain only images (no text) are processed by optical character recognition (OCR), and the recognized text is added to each page invisibly "behind" the images.
pdfsandwich is a command line tool intended for OCR of scanned books or journals. It is able to recognize the page layout, even for multicolumn text.
Essentially, pdfsandwich is a wrapper script which calls the following binaries: unpaper (since version 0.0.9), convert, gs, hocr2pdf (for tesseract prior to version 3.03), and tesseract. It is known to run on Unix systems and has been tested on Linux and MacOS X. It supports parallel processing on multiprocessor systems.
While pdfsandwich works with any version of tesseract from version 3.0 on, tesseract 3.03 or later is recommended for best performance. By default, pdfsandwich runs unpaper to enhance the readability of scanned pages and to improve OCR. For instance, slightly rotated pages are automatically straightened and dark edges removed. For optimally scanned pdf files, this can be switched off by option -nopreproc to speed up processing.
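As a minimal sketch of the choice described above, the following script adds -nopreproc only when preprocessing is not wanted. The file name clean_scan.pdf and the variable names are hypothetical; the pdfsandwich call is skipped if the tool or the file is missing.

```shell
#!/bin/sh
# Hypothetical input file; replace with your own scan.
input="clean_scan.pdf"

# "yes" for skewed or dark-edged scans, "no" for clean, well-aligned scans.
preprocess="no"

# -nopreproc skips the unpaper step, which speeds up processing.
if [ "$preprocess" = "no" ]; then
    opts="-nopreproc"
else
    opts=""
fi

echo "pdfsandwich $opts $input"

# Run only if pdfsandwich is installed and the input file exists.
# $opts is left unquoted on purpose so that an empty value adds no argument.
if command -v pdfsandwich >/dev/null 2>&1 && [ -f "$input" ]; then
    pdfsandwich $opts "$input"
fi
```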
Latest version is 0.1.1 (July 14, 2014).
Since version 0.0.9 pdfsandwich optionally preprocesses scanned pdfs by unpaper.
Since version 0.0.5 pdfsandwich uses tesseract instead of cuneiform for OCR.
deb packages for Ubuntu (12.04 or later) are available for download on the project website.
At the time of this writing, these deb packages are not yet available from the standard Ubuntu package sources (like universe). You have two options to install the deb package: First, you can create a local deb package repository on your computer (recommended). Second, if you are too lazy to generate a local repository, there is the following quick and dirty way: Download the deb file, e.g. pdfsandwich_0.1.0_amd64.deb, to some local directory, and execute the following commands in this directory:
sudo dpkg -i pdfsandwich_0.1.0_amd64.deb # There will be an error message. Ignore it and proceed!
sudo apt-get -fy install
Replace pdfsandwich_0.1.0_amd64.deb with the name of your downloaded deb file. After the first command, there will be an error message related to missing dependencies. Ignore it; the second command installs the missing dependencies.
A Gentoo ebuild is available.
A MacPorts Portfile exists in the package sources, or can be generated by
make Portfile
It is unfortunately untested for versions later than 0.0.3.
pdfsandwich is open source software (license: GPL). You can download the sources either as a .tar.bz2 package from the download area on the project website or check them out with Subversion:
svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich
If OCaml is installed on your system, you can compile and install as follows:
./configure
make
make install
pdfsandwich is a command line utility. If you have a scanned pdf file, for instance this one: alice.pdf (which is the first chapter of a popular classical novel), invoke pdfsandwich like this:
pdfsandwich alice.pdf
This will generate a file alice_ocr.pdf which looks like the original file, but the recognized text will be placed behind the scanned images. You can now run full text searches or select text areas.
For some pdf files, pdfsandwich produces much larger files after OCR processing. In this case, it might help to call pdfsandwich again on the already OCR'ed file.
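The second pass mentioned above could look like the following sketch. It assumes the first pass produced alice_ocr.pdf; the output name alice_ocr_pass2.pdf is a hypothetical choice, passed explicitly with -o. The byte counts printed by wc let you check whether the second pass actually reduced the size.

```shell
#!/bin/sh
# Output of the first pdfsandwich run (see the alice.pdf example).
first_pass="alice_ocr.pdf"

# Hypothetical name for the second pass, set explicitly with -o.
second_pass="${first_pass%.pdf}_pass2.pdf"

echo "pdfsandwich -o $second_pass $first_pass"

# Run only if pdfsandwich is installed and the first-pass file exists.
if command -v pdfsandwich >/dev/null 2>&1 && [ -f "$first_pass" ]; then
    pdfsandwich -o "$second_pass" "$first_pass"
    # Print both file sizes in bytes for comparison.
    wc -c "$first_pass" "$second_pass"
fi
```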
The following command line options exist:
-convert filename : name of convert binary (default: convert)
-coo options : additional convert options; make sure to quote; e.g. -coo "-normalize -black-threshold 75%"; call convert --help or man convert for all convert options
-debug : keep all temporary files in /tmp (for debugging)
-enforcehocr2pdf : use hocr2pdf even if tesseract >= 3.03
-first_page number : number of page to start OCR from (default: 1)
-gs filename : name of gs binary (default: gs)
-hocr2pdf filename : name of hocr2pdf binary (default: hocr2pdf); ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
-hoo options : additional hocr2pdf options; make sure to quote
-last_page number : number of page up to which to process OCR (default: number of pages in inputfile)
-lang language : language of the text; option to tesseract (default: eng); e.g. eng, deu, deu-frak, fra, rus, swe, spa, ita, ...; see http://code.google.com/p/tesseract-ocr/downloads/list or LANGUAGE section of man page. Multiple languages may be specified, separated by plus characters.
-noimage : do not place the image over the text (requires hocr2pdf)
-nopreproc : do not preprocess with unpaper
-nthreads number : number of parallel threads (default: guessed number of CPUs; if guessing fails: 1)
-o filename : output file; default: inputfile_ocr.pdf
-resolution NUMxNUM : resolution used for OCR (default: 300x300)
-rgb : use RGB color space for images (default: black and white); use with care: causes problems with some color spaces
-sloppy_text : sloppily place text, group words, do not draw single glyphs; ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
-tesseract filename : name of tesseract binary (default: tesseract)
-tesso options : additional tesseract options; make sure to quote
-unpaper filename : name of unpaper binary (default: unpaper)
-unpo options : additional unpaper options; make sure to quote
-quiet : suppress output
-verbose : produce more output
-version : print version and quit
-help : Display this list of options
--help : Display this list of options
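To illustrate how several of these options combine, here is a sketch that processes pages 2 to 10 of a German/English scan with four threads. The file names scan.pdf and scan_pages_2-10.pdf are hypothetical. The point to note is the quoting: the whole convert option string after -coo has to reach pdfsandwich as a single argument.

```shell
#!/bin/sh
# Hypothetical file names; replace with your own.
input="scan.pdf"
output="scan_pages_2-10.pdf"

# Build the argument list. The quoted string after -coo stays one argument,
# which is what "make sure to quote" in the option list refers to.
set -- -lang deu+eng \
       -first_page 2 -last_page 10 \
       -nthreads 4 \
       -coo "-normalize -black-threshold 75%" \
       -o "$output" "$input"

nargs=$#          # number of arguments that will be passed
coo_arg=${10}     # the convert options, kept together as one argument

echo "pdfsandwich $*"

# Run only if pdfsandwich is installed and the input file exists.
if command -v pdfsandwich >/dev/null 2>&1 && [ -f "$input" ]; then
    pdfsandwich "$@"
fi
```

The same quoting rule applies to -hoo, -tesso and -unpo.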
Via Tesseract, there are currently language packs available for at least the following languages:
ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu (German), ell (Greek), eng (English), enm (Middle English), epo (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Middle French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese).
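Since -lang accepts several language codes joined by plus characters, a small helper can build that string from a list of codes. The sketch below is an illustration only: join_langs is a hypothetical helper, scan.pdf is a hypothetical file name, and the check of installed language packs assumes a tesseract build that supports --list-langs.

```shell
#!/bin/sh
# Hypothetical helper: join language codes with "+" as -lang expects.
join_langs() {
    old_ifs=$IFS
    IFS=+
    joined="$*"      # "$*" joins the arguments with the first character of IFS
    IFS=$old_ifs
    printf '%s\n' "$joined"
}

langs=$(join_langs deu eng fra)

# Show which language packs are installed, if tesseract is available.
# Assumption: this tesseract version supports --list-langs.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs 2>&1 || true
fi

# The resulting pdfsandwich call for a hypothetical input file.
echo "pdfsandwich -lang $langs scan.pdf"
```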
For further questions or comments, please contact me: sourceforge [at] tobias-elze.de.