...

If text isn't already embedded in the PDF, then you'll need to use OCR to extract the text. Tesseract is an excellent open-source engine for OCR, but it can't read PDFs on its own. So we'll need to do this in two steps:

  1. Convert the PDF into images: one image per page;
  2. Use OCR to extract text from those images.
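
As a rough sketch of the OCR step, the Python below runs Tesseract on each page image and joins the results in page order. It assumes the `tesseract` command is installed and on your PATH, and that the page images are named like `page-0.png`, `page-1.png`, and so on. Those file names, and the helper names `page_number`, `ocr_command`, and `ocr_pages`, are illustrative assumptions, not something this page prescribes.

```python
import re
import subprocess
from pathlib import Path


def page_number(image: Path) -> int:
    """Return the trailing number in a name like 'page-12.png' (assumed naming)."""
    match = re.search(r"(\d+)$", image.stem)
    if match is None:
        raise ValueError(f"no page number in {image.name}")
    return int(match.group(1))


def ocr_command(image: Path) -> list[str]:
    """Build the Tesseract call for one image; 'stdout' as the output base prints the text."""
    return ["tesseract", str(image), "stdout"]


def ocr_pages(image_dir: Path, pattern: str = "page-*.png") -> str:
    """OCR every page image in numeric page order and join the text."""
    # Sort numerically so that page-10 comes after page-2, not before it.
    images = sorted(image_dir.glob(pattern), key=page_number)
    pages = []
    for image in images:
        result = subprocess.run(
            ocr_command(image), capture_output=True, text=True, check=True
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

Calling `ocr_pages(Path("pages"))` would return the text of every page as one string, which you can then write to a single text file.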

Convert PDF to images

A PDF is a jumble of instructions for how to render a document on a screen or page. Although it may contain images, a PDF is not itself an image, and therefore we can't perform OCR on it directly. To convert PDFs to images, we use ImageMagick's convert command.
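
Here is a minimal sketch of that conversion, driven from Python. It assumes ImageMagick's `convert` is on your PATH (ImageMagick 7 also provides it as `magick`) and that Ghostscript is installed, since ImageMagick relies on it to read PDFs. The 300 DPI density, the `page-%03d.png` naming, and the helper names are assumptions chosen for illustration, so adjust them to suit your scans.

```python
import subprocess
from pathlib import Path


def convert_command(pdf: Path, out_dir: Path, density: int = 300) -> list[str]:
    """Build the ImageMagick call that renders each PDF page to a PNG."""
    # -density sets the rendering resolution (DPI); it must precede the input file.
    # %03d in the output name yields zero-padded page numbers: page-000.png, page-001.png, ...
    return [
        "convert",
        "-density", str(density),
        str(pdf),
        str(out_dir / "page-%03d.png"),
    ]


def pdf_to_images(pdf: Path, out_dir: Path, density: int = 300) -> list[Path]:
    """Run the conversion and return the page images in page order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(convert_command(pdf, out_dir, density), check=True)
    # Zero-padded names sort correctly with a plain lexical sort.
    return sorted(out_dir.glob("page-*.png"))
```

A higher density generally gives Tesseract more detail to work with, at the cost of larger images and slower conversion.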

...