...

    $ convert -density 300 /path/to/my/document.pdf -depth 8 -strip -background white -alpha off file.tiff

There are several things going on here:
- -density 300 sets the resolution (300 DPI) at which the PDF is rasterized, and -depth 8 sets the bit depth to 8 bits per channel. OCR works best with high-resolution images; if you leave -density out, you're likely to get garbled results.
- -strip removes embedded profiles and comments, while -background white and -alpha off drop any alpha channel and make the background white. Tesseract is rather picky about this kind of thing.
The resulting file, file.tiff in the example above, should be a multi-page TIFF file. For a 15-page PDF, you can expect the resulting TIFF to be around 300MB.
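
As a rough sanity check on that size, the uncompressed pixel data can be estimated from the page size, resolution, and bit depth. The sketch below is only an approximation under assumptions that are not stated in the text above: US Letter pages (8.5 x 11 inches), three 8-bit colour channels, and no TIFF compression. The function name estimate_tiff_mb is made up for this example.

```shell
#!/bin/sh
# Rough upper bound on the size of an uncompressed multi-page TIFF,
# in MB (10^6 bytes).
# Assumptions (illustrative only): US Letter pages (8.5 x 11 in),
# three 8-bit colour channels, no compression.
estimate_tiff_mb() {
    pages=$1
    dpi=$2
    width=$(( dpi * 85 / 10 ))    # pixels across 8.5 inches
    height=$(( dpi * 11 ))        # pixels down 11 inches
    bytes=$(( width * height * 3 * pages ))   # 3 bytes per pixel
    echo $(( bytes / 1000000 ))
}

estimate_tiff_mb 15 300   # prints 378
```

Under these assumptions a 15-page document at 300 DPI comes out in the same range as the figure quoted above; greyscale pages or a compressed TIFF would be smaller.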
Tesseract
Once you have a TIFF representation of your document, you can use Tesseract to (attempt to) extract plain text. The basic syntax is:

    $ tesseract file.tiff output

This tells Tesseract to perform OCR on file.tiff and put the resulting text in output.txt. Tesseract adds the .txt extension to the output name itself, so pass output rather than output.txt (otherwise you end up with output.txt.txt). If your TIFF file contains multiple pages, Tesseract appends the text of each page to the output file in sequence.
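
Because Tesseract treats its second argument as an output base name and appends .txt itself, a small wrapper can make the call less error-prone. The following is a minimal sketch, not part of Tesseract: the helper names output_base and ocr_to_text are hypothetical.

```shell
#!/bin/sh
# Strip a trailing ".txt" from the requested file name, since Tesseract
# appends ".txt" to the output base name on its own.
output_base() {
    printf '%s\n' "${1%.txt}"
}

# Hypothetical wrapper: ocr_to_text IMAGE TEXTFILE
# Runs OCR on IMAGE and leaves the text in TEXTFILE.
ocr_to_text() {
    tesseract "$1" "$(output_base "$2")"
}

# Example (requires tesseract to be installed):
#   ocr_to_text file.tiff output.txt    # writes output.txt
```

With this wrapper, passing either output or output.txt as the second argument results in the same output.txt file.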
By default, Tesseract assumes that your documents are in English. If you are working with documents in another language, use the "-l" flag. For example:

    $ tesseract -l [lan] file.tiff output

[lan] should be a language code, typically three letters (for example, fra for French or deu for German). As before, Tesseract adds the .txt extension to the output name itself. See the LANGUAGES section in the Tesseract documentation for a list of supported languages.
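
Tesseract can only use a language whose data files are installed, and `tesseract --list-langs` prints the ones it can find. The sketch below checks for a language before running OCR. The helper names lang_in_list and ocr_with_lang are hypothetical, and the redirection of standard error is there on the assumption that some Tesseract versions print the language list to stderr rather than stdout.

```shell
#!/bin/sh
# Succeeds if the language code $1 appears as a whole line on standard
# input. Intended to be fed the output of `tesseract --list-langs`.
lang_in_list() {
    grep -qxF -- "$1"
}

# Hypothetical wrapper: ocr_with_lang LANG IMAGE OUTPUT_BASE
ocr_with_lang() {
    # 2>&1: assumed precaution, in case the list is written to stderr.
    if tesseract --list-langs 2>&1 | lang_in_list "$1"; then
        tesseract -l "$1" "$2" "$3"
    else
        echo "language data for '$1' is not installed" >&2
        return 1
    fi
}

# Example (requires tesseract and the French language data):
#   ocr_with_lang fra file.tiff output
```

Matching whole lines keeps the heading line of the listing, or a longer code such as chi_sim, from being mistaken for a shorter code.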