
In this tutorial we will explore how to extract plain text from PDFs, including Optical Character Recognition (OCR). OCR is a machine-learning technique used to transform images that contain text (e.g. a scan of a document) into actual text content. For a quick introduction to the mechanics of OCR, see the readings for this module.

Before You Begin

Be sure to install all of the software required for this module.
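
As a quick check that the installation worked, the sketch below reports which of the three commands used in this tutorial are not on your PATH. The function name missing_tools is made up for this example, and it only tests that the commands exist, not that they work.

```shell
#!/bin/bash
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# The three commands used in this tutorial.
MISSING=$(missing_tools pdftotext convert tesseract)
if [ -n "$MISSING" ]; then
    echo "Not found on PATH:" $MISSING
else
    echo "All required commands were found."
fi
```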

Is the text already there?

Many PDFs already have plain text embedded in them, either because they were born-digital (i.e. created from a word processing document) or because OCR was already performed on them (e.g. JSTOR does this for all of the articles in their database). You can usually tell whether or not text is embedded in the PDF by attempting to select a short passage with your mouse. If you can select words and phrases on a page, then there is embedded text present in the document.

To extract embedded text from a PDF, we can use a command-line application called pdftotext (part of the Xpdf package, and also shipped with Poppler). From the terminal, execute the following command:

Extract Embedded Text using pdftotext
$ pdftotext /path/to/my/document.pdf myoutputfile.txt

This will create a new file called "myoutputfile.txt" in your current working directory. If you open it, you should see the text that pdftotext was able to extract from your PDF document. Remember, this is not OCR: we're just extracting text that is already embedded in the PDF file.
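
If the output file turns out to be empty or nearly empty, the PDF probably has no embedded text. One way to check this without opening the file is to count its words. This is a minimal sketch: the function name has_enough_words and the threshold of 5 words are assumptions chosen for illustration.

```shell
#!/bin/bash
# Succeed (exit status 0) when FILE contains at least MIN_WORDS words.
# Usage: has_enough_words FILE [MIN_WORDS]   (MIN_WORDS defaults to 5)
has_enough_words() {
    local file="$1" min_words="${2:-5}"
    local count
    # Some versions of wc pad the count with spaces, so strip whitespace.
    count=$(wc -w < "$file" | tr -d '[:space:]')
    [ "$count" -ge "$min_words" ]
}

# Only run the check if the file from the example above exists.
if [ -f myoutputfile.txt ]; then
    if has_enough_words myoutputfile.txt 5; then
        echo "Embedded text found."
    else
        echo "Little or no embedded text; OCR is probably needed."
    fi
fi
```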

Nope. OCR it is.

If text isn't already embedded in the PDF, then you'll need to use OCR to extract the text. Tesseract is an excellent open-source OCR engine, but it can't read PDFs on its own, so we'll need to do this in two steps:

  1. Convert the PDF into images;
  2. Use OCR to extract text from those images.

Convert PDF to images

A PDF is a jumble of instructions for how to render a document on a screen or page. Although it may contain images, a PDF is not itself an image, and therefore we can't perform OCR on it directly. To convert PDFs to images, we use ImageMagick's convert command (in ImageMagick 7, the same tool is also available as magick).

The basic syntax to convert a PDF to images is:

Convert a PDF to Images
$ convert -density 300 /path/to/my/document.pdf -depth 8 -strip -background white -alpha off file.tiff

There are several things going on here:

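  • "-density 300" sets the resolution at which the PDF is rendered (300 dots per inch), and "-depth 8" sets the bit depth of each color channel. OCR works best with high-resolution images; if you leave out -density, ImageMagick falls back to its default of 72 dpi, and you're likely to get garbled results.
  • "-strip" removes embedded profiles and comments, and "-background white -alpha off" sets a white background and turns off the alpha (transparency) channel. Tesseract is rather picky about this kind of thing.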

The resulting file, file.tiff in the example above, should be a multi-page TIFF file. For a 15-page PDF, you can expect the resulting TIFF to be around 300MB. 
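
The size figure above can be sanity-checked with some arithmetic: pixels per page, times bytes per pixel, times the number of pages. The sketch below assumes US Letter pages (612 x 792 points, at 72 points per inch), 8-bit RGB (3 bytes per pixel), and no compression. None of those assumptions come from the example above, and the function name estimate_tiff_bytes is made up for illustration.

```shell
#!/bin/bash
# Estimate the uncompressed size, in bytes, of a multi-page TIFF.
# Arguments: number of pages, page width and height in points (1/72 inch),
# density in pixels per inch, and bytes per pixel (3 for 8-bit RGB,
# 1 for 8-bit grayscale).
estimate_tiff_bytes() {
    local pages="$1" width_pt="$2" height_pt="$3" density="$4" bytes_per_pixel="$5"
    local width_px=$(( width_pt * density / 72 ))
    local height_px=$(( height_pt * density / 72 ))
    echo $(( pages * width_px * height_px * bytes_per_pixel ))
}

# 15 US Letter pages at 300 dpi, 8-bit RGB:
# 2550 x 3300 pixels x 3 bytes x 15 pages.
BYTES=$(estimate_tiff_bytes 15 612 792 300 3)
echo "$(( BYTES / 1000000 )) MB"   # prints "378 MB"
```

Under these assumptions the estimate lands in the same range as the figure above. Grayscale output, smaller pages, or TIFF compression would bring it down.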

OCR with Tesseract

Once you have a TIFF representation of your document, you can use Tesseract to (attempt to) extract plain text. The basic syntax is:

Extract text from a TIFF image with Tesseract OCR
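$ tesseract file.tiff output

This tells Tesseract to perform OCR on file.tiff, and put the resulting text in output.txt. Note that Tesseract adds the ".txt" extension itself: if you pass "output.txt" as the output name, you will end up with a file called "output.txt.txt". If your TIFF file contains multiple pages, Tesseract will sequentially append pages to your output file.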

By default, Tesseract assumes that your documents are in English. If you are working with documents in another language, use the "-l" flag. For example:

Extract text from a non-English language document
$ tesseract file.tiff output -l [lan]

[lan] should be a three-letter language code. See the LANGUAGES section in the Tesseract documentation for a list of supported languages.
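
Recent versions of Tesseract can also print the languages that are installed on your machine with "tesseract --list-langs". The sketch below checks whether a given code appears in that list. The function name lang_in_list is made up for this example; it reads the list from standard input, so it does not call Tesseract itself.

```shell
#!/bin/bash
# Read a language list on standard input (one code per line, as printed by
# "tesseract --list-langs") and succeed if the given code is on it.
lang_in_list() {
    # -x matches whole lines only, -F treats the code as a literal string.
    grep -qxF -- "$1"
}

# Typical use, assuming tesseract is installed. Older versions print the
# list on standard error, hence the 2>&1:
#   tesseract --list-langs 2>&1 | lang_in_list deu && echo "deu is installed"
```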

Advanced: Bulk Extraction

Extracting text one file at a time is a bit arduous. We can use a bash script to automate the steps above. The script below will attempt to extract text from a whole directory full of PDFs. It will first attempt to use pdftotext, and if that fails will attempt OCR with Tesseract.

Sample shell script to extract text from a directory of PDF files
#!/bin/bash
BPATH=$1            # Path to directory containing PDFs.
OPATH=$2            # Path to output directory.
OCR_LANG=${3:-eng}  # See man tesseract > LANGUAGES. Defaults to eng.
                    #  (Not called LANG, which would override the locale.)
MIN_WORDS=5         # Number of words required to accept pdftotext result.
if [ $# -lt 2 ]; then
    echo "Usage: $0 <pdf directory> <output directory> [language]" >&2
    exit 1
fi
# If the output path does not exist, attempt to create it.
if [ ! -d "$OPATH" ]; then
    mkdir -p "$OPATH"
fi
for FILEPATH in "${BPATH%/}"/*.pdf; do
    [ -e "$FILEPATH" ] || continue   # No PDFs matched the pattern.
    # Extracts plain text content from a PDF.
    #
    # First, attempts to extract embedded text with pdftotext. If that fails,
    #  converts the PDF to TIFF and attempts to perform OCR with Tesseract.
    #
    # Path to text file to be created, without its extension.
    #  E.g. ./myfile for ./myfile.txt
    OUTBASE="${OPATH%/}/$(basename "$FILEPATH" .pdf)"
    OUTFILE="$OUTBASE.txt"
    touch "$OUTFILE"    # The text file will be created regardless of whether
                        #  text is successfully extracted.
    # First attempt to use pdftotext to extract embedded text.
    echo -n "Attempting pdftotext extraction..."
    pdftotext "$FILEPATH" "$OUTFILE"
    FILESIZE=$(wc -w < "$OUTFILE")
    echo "extracted $FILESIZE words."
    # If that fails, try Tesseract.
    if [[ $FILESIZE -lt $MIN_WORDS ]]; then
        echo -n "Attempting OCR extraction..."
        # Use ImageMagick to convert the PDF to a high-res multi-page TIFF.
        convert -density 300 "$FILEPATH" -depth 8 -strip -background white \
                -alpha off ./temp.tiff > /dev/null 2>&1
        # Then use Tesseract to perform OCR on the TIFF. Tesseract adds
        #  ".txt" to the output name itself, so pass the base name.
        tesseract ./temp.tiff "$OUTBASE" -l "$OCR_LANG" > /dev/null 2>&1
        # We don't need the intermediate TIFF file, so discard it.
        rm -f ./temp.tiff
        FILESIZE=$(wc -w < "$OUTFILE")
        echo "extracted $FILESIZE words."
    fi
done

To use this script:

  • Save the code above in a file called "extract_text.sh"
  • In the terminal, go to the directory where you saved extract_text.sh. For example:
    • cd /Users/me/myscripts
  • Make the script executable with the following command:
    • chmod +x extract_text.sh
  • Put all of your PDFs in a single directory. 
  • Assuming that you put all of your PDFs in /Users/me/mypdfs, and you want to write text output to the directory /Users/me/myplaintext, call extract_text.sh as follows:
    • ./extract_text.sh /Users/me/mypdfs/ /Users/me/myplaintext/
  • If you are working with German-language texts, you can add a language code at the end. For example:
    • ./extract_text.sh /Users/me/mypdfs/ /Users/me/myplaintext/ deu
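
After a bulk run, it can be useful to see which output files came out empty or nearly empty, since those PDFs may need a closer look (a different language code, or a better scan). The sketch below lists them. The function name list_short_outputs and the default threshold of 5 words are assumptions for this example.

```shell
#!/bin/bash
# Print the path of every .txt file in DIR that has fewer than MIN_WORDS words.
# Usage: list_short_outputs DIR [MIN_WORDS]   (MIN_WORDS defaults to 5)
list_short_outputs() {
    local dir="${1%/}" min_words="${2:-5}"
    local file count
    for file in "$dir"/*.txt; do
        [ -e "$file" ] || continue      # Directory contains no .txt files.
        # Some versions of wc pad the count with spaces, so strip whitespace.
        count=$(wc -w < "$file" | tr -d '[:space:]')
        if [ "$count" -lt "$min_words" ]; then
            echo "$file"
        fi
    done
}

# Example, using the output directory from the steps above:
#   list_short_outputs /Users/me/myplaintext/ 5
```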

For OSX El Capitan

El Capitan introduced a feature called "System Integrity Protection" which makes it difficult to use pdftotext. If you want to skip that part and use only Tesseract, try the code below (courtesy of Christoph).

#!/bin/bash
BPATH=$1            # Path to directory containing PDFs.
OPATH=$2            # Path to output directory.
OCR_LANG=${3:-eng}  # See man tesseract > LANGUAGES. Defaults to eng.
                    #  (Not called LANG, which would override the locale.)
if [ $# -lt 2 ]; then
    echo "Usage: $0 <pdf directory> <output directory> [language]" >&2
    exit 1
fi
# If the output path does not exist, attempt to create it.
if [ ! -d "$OPATH" ]; then
    mkdir -p "$OPATH"
fi
for FILEPATH in "${BPATH%/}"/*.pdf; do
    [ -e "$FILEPATH" ] || continue   # No PDFs matched the pattern.
    # Extracts plain text content from a PDF using OCR only.
    #
    # The pdftotext step is skipped: every PDF is converted to TIFF and
    #  passed to Tesseract.
    #
    # Path to text file to be created, without its extension.
    #  E.g. ./myfile for ./myfile.txt
    OUTBASE="${OPATH%/}/$(basename "$FILEPATH" .pdf)"
    OUTFILE="$OUTBASE.txt"
    touch "$OUTFILE"    # The text file will be created regardless of whether
                        #  text is successfully extracted.
    echo -n "Attempting OCR extraction..."
    # Use ImageMagick to convert the PDF to a high-res multi-page TIFF.
    convert -density 300 "$FILEPATH" -depth 8 -strip -background white \
            -alpha off ./temp.tiff > /dev/null 2>&1
    # Then use Tesseract to perform OCR on the TIFF. Tesseract adds
    #  ".txt" to the output name itself, so pass the base name.
    tesseract ./temp.tiff "$OUTBASE" -l "$OCR_LANG" > /dev/null 2>&1
    # We don't need the intermediate TIFF file, so discard it.
    rm -f ./temp.tiff
    FILESIZE=$(wc -w < "$OUTFILE")
    echo "extracted $FILESIZE words."
done