Convert a PDF to images:
$ convert -density 300 /path/to/my/document.pdf -depth 8 -strip -background white -alpha off file.tiff

There are several things going on here:

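  • -density 300 sets the resolution at which the PDF is rasterized (300 DPI), and -depth 8 sets the bit depth to 8 bits per channel. OCR works best with high-resolution images; if you leave -density out, ImageMagick falls back to a much lower default resolution (72 DPI) and you're likely to get garbled results.
  • -strip removes embedded profiles and comments, and -background white -alpha off turns off the alpha channel and makes the background white. Tesseract is rather picky about this kind of thing.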

The resulting file, file.tiff in the example above, should be a multi-page TIFF file. These files are large: for a 15-page PDF, you can expect the resulting TIFF to be around 300 MB.
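If you have several PDFs to convert, the command above can be wrapped in a small shell function. This is only a sketch: the function name pdf_to_tiff and the DRY_RUN variable are made up for illustration, and the output file is assumed to sit next to the input with a .tiff extension.

```shell
#!/bin/sh
# Hypothetical helper (not part of ImageMagick): convert one PDF to a
# multi-page TIFF using the same options as the command above.
# Set DRY_RUN=1 to print the command instead of running it.
pdf_to_tiff() {
    pdf=$1
    tiff="${pdf%.pdf}.tiff"   # document.pdf -> document.tiff
    # Build the command as the function's positional parameters.
    set -- convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off "$tiff"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"             # show what would be run
    else
        "$@"                  # run the conversion
    fi
}
```

Running DRY_RUN=1 pdf_to_tiff /path/to/my/document.pdf prints the full convert command without creating the (large) TIFF, which is a cheap way to check paths before starting a long batch.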

Tesseract

Once you have a TIFF representation of your document, you can use Tesseract to (attempt to) extract plain text. The basic syntax is:

Extract text from a TIFF image with Tesseract OCR:
$ tesseract file.tiff output

This tells Tesseract to perform OCR on file.tiff and put the resulting text in output.txt. The second argument is an output base name; Tesseract appends the .txt extension itself. If your TIFF file contains multiple pages, Tesseract will append each page's text to the output file in sequence.
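Because Tesseract treats its second argument as a base name and adds .txt on its own, it can be convenient to derive that base name from the TIFF file. The following is a minimal sketch; tiff_to_text and DRY_RUN are hypothetical names chosen for this example, and it assumes the input file name has an extension to strip.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): OCR a TIFF and name the
# text file after it, e.g. file.tiff -> file.txt.
# Set DRY_RUN=1 to print the command instead of running it.
tiff_to_text() {
    tiff=$1
    base="${tiff%.*}"         # strip the last extension: file.tiff -> file
    # Tesseract appends ".txt" to the base name itself.
    set -- tesseract "$tiff" "$base"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@" && echo "Text written to $base.txt"
    fi
}
```

For example, DRY_RUN=1 tiff_to_text file.tiff prints the Tesseract command that would produce file.txt.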

By default, Tesseract assumes that your documents are in English. If you are working with documents in another language, use the "-l" flag. For example:

Extract text from a non-English language document:
$ tesseract file.tiff output -l [lan]

[lan] should be a language code, typically three letters (for example, deu for German or fra for French). See the LANGUAGES section in the Tesseract documentation for a list of supported languages.
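A language only works if its data files are installed. Tesseract can print the installed languages with --list-langs, so one way to check before running a long OCR job is sketched below. The function name lang_installed is hypothetical, and the sketch assumes the list contains one language code per line (any header line is ignored because only whole-line matches count).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the language code given as $1 appears
# as a whole line in the language list read from standard input.
lang_installed() {
    # -x: match whole lines only, -F: treat the code as a fixed string,
    # -q: print nothing, just set the exit status.
    grep -qxF -- "$1"
}

# Example use (assumes tesseract is on your PATH; 2>&1 is there because
# some versions may print the list on standard error):
#   tesseract --list-langs 2>&1 | lang_installed deu \
#       && tesseract file.tiff output -l deu
```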