In this tutorial we will explore how to extract plain text from PDFs, including Optical Character Recognition (OCR). OCR is a machine-learning technique used to transform images that contain text (e.g. a scan of a document) into actual text content. For a quick introduction to the mechanics of OCR, see the readings for this module.
...

    $ tesseract file.tiff output -l [lan]   # writes output.txt (Tesseract adds ".txt" itself)

[lan] should be a three-letter language code. See the LANGUAGES section in the Tesseract documentation for a list of supported languages.
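Before running OCR, it can help to confirm that the language you need is actually installed. The sketch below assumes that your Tesseract build supports the `--list-langs` option and prints one language code per line; `lang_available` is a hypothetical helper name, not part of Tesseract.

```bash
#!/bin/bash
# Hypothetical helper (not part of Tesseract): succeeds if the language code
# given as $1 appears as a complete line on standard input.
lang_available() {
    grep -qxF -- "$1"
}

# Intended use (requires Tesseract; 2>&1 is included in case your version
# prints the language list to standard error):
#   tesseract --list-langs 2>&1 | lang_available deu && echo "German is installed"
```

Matching whole lines avoids false positives such as `deu` matching `deu_frak`.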
Advanced: Bulk Extraction
Extracting text one file at a time is tedious, so we can use a bash script to automate the steps above. The script below attempts to extract text from every PDF in a directory. It first tries pdftotext; if that yields fewer than five words, it falls back to OCR with Tesseract.

    #!/bin/bash
    BPATH=$1            # Path to directory containing PDFs.
    OPATH=$2            # Path to output directory.
    OCR_LANG=${3:-eng}  # See man tesseract > LANGUAGES. Defaults to eng.
                        # (Not named LANG, which is the shell's locale variable.)
    MIN_WORDS=5         # Number of words required to accept pdftotext result.

    # If the output path does not exist, attempt to create it.
    if [ ! -d "$OPATH" ]; then
        mkdir -p "$OPATH"
    fi

    for FILEPATH in "$BPATH"/*.pdf; do
        [ -e "$FILEPATH" ] || continue  # Skip the literal pattern if no PDFs match.

        # Extracts plain text content from a PDF.
        #
        # First, attempts to extract embedded text with pdftotext. If that fails,
        # converts the PDF to TIFF and attempts to perform OCR with Tesseract.

        # Path to text file to be created, e.g. ./myfile.txt. OUTBASE is the same
        # path without the extension, because Tesseract appends ".txt" itself.
        OUTBASE="$OPATH/$(basename "$FILEPATH" .pdf)"
        OUTFILE="$OUTBASE.txt"
        touch "$OUTFILE"    # The text file will be created regardless of whether
                            # text is successfully extracted.

        # First attempt to use pdftotext to extract embedded text.
        echo -n "Attempting pdftotext extraction..."
        pdftotext "$FILEPATH" "$OUTFILE"
        FILESIZE=$(wc -w < "$OUTFILE")
        echo "extracted $FILESIZE words."

        # If that fails, try Tesseract.
        if [[ $FILESIZE -lt $MIN_WORDS ]]; then
            echo -n "Attempting OCR extraction..."

            # Use ImageMagick to convert the PDF to a high-res multi-page TIFF.
            convert -density 300 "$FILEPATH" -depth 8 -strip -background white \
                -alpha off ./temp.tiff > /dev/null 2>&1

            # Then use Tesseract to perform OCR on the TIFF.
            tesseract ./temp.tiff "$OUTBASE" -l "$OCR_LANG" > /dev/null 2>&1

            # We don't need the intermediate TIFF file, so discard it.
            rm -f ./temp.tiff

            FILESIZE=$(wc -w < "$OUTFILE")
            echo "extracted $FILESIZE words."
        fi
    done

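The decision to fall back to OCR rests on a simple word count. As a minimal sketch, that check could be isolated in a small function so it is easy to try on its own; the name `needs_ocr` and its two arguments are illustrative assumptions, not part of the script above.

```bash
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) when the text file $1 holds
# fewer than $2 words, i.e. when the OCR fallback should run.
needs_ocr() {
    local textfile=$1 min_words=$2 words
    words=$(wc -w < "$textfile")
    (( words < min_words ))
}

# Example with a throwaway file containing three words:
#   printf 'only three words\n' > sample.txt
#   needs_ocr sample.txt 5 && echo "fall back to OCR"
```

A threshold above zero is useful because scanned PDFs sometimes carry a few stray embedded words (a stamp or page label) that would otherwise count as a successful extraction.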
To use this script:
- Save the code above in a file called "extract_text.sh"
- In the terminal, go to the directory where you saved extract_text.sh. For example:
- cd /Users/me/myscripts
- Make the script executable with the following command:
- chmod +x extract_text.sh
- Put all of your PDFs in a single directory.
- Assuming that you put all of your PDFs in /Users/me/mypdfs, and you want to write text output to the directory /Users/me/myplaintext, call extract_text.sh as follows:
- ./extract_text.sh /Users/me/mypdfs/ /Users/me/myplaintext/
- If you are working with German-language texts, you can add a language code at the end. For example:
- ./extract_text.sh /Users/me/mypdfs/ /Users/me/myplaintext/ deu
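After a bulk run, it is worth checking which output files ended up with little or no text. The following is a small sketch for that purpose; `report_word_counts` is a hypothetical helper name, and it assumes the output files end in `.txt`.

```bash
#!/bin/bash
# Hypothetical helper: prints "<word count> <file name>" for every .txt file
# in the directory $1, so files with little or no text stand out.
report_word_counts() {
    local dir=$1 file
    for file in "$dir"/*.txt; do
        [ -e "$file" ] || continue  # Skip the literal pattern if no files match.
        printf '%d %s\n' "$(wc -w < "$file")" "$(basename "$file")"
    done
}

# Example: list the ten shortest results first.
#   report_word_counts /Users/me/myplaintext | sort -n | head
```

Files that report a count of zero are good candidates for a second look, for example with a different language code.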
For OS X El Capitan
El Capitan introduced a feature called "System Integrity Protection" which makes it difficult to use pdftotext. If you want to skip that part and use only Tesseract, try the code below (courtesy of Christoph).

    #!/bin/bash
    BPATH=$1            # Path to directory containing PDFs.
    OPATH=$2            # Path to output directory.
    OCR_LANG=${3:-eng}  # See man tesseract > LANGUAGES. Defaults to eng.
                        # (Not named LANG, which is the shell's locale variable.)

    # If the output path does not exist, attempt to create it.
    if [ ! -d "$OPATH" ]; then
        mkdir -p "$OPATH"
    fi

    for FILEPATH in "$BPATH"/*.pdf; do
        [ -e "$FILEPATH" ] || continue  # Skip the literal pattern if no PDFs match.

        # Extracts plain text content from a PDF using OCR only: converts the
        # PDF to TIFF and performs OCR with Tesseract. The pdftotext step is
        # skipped entirely.

        # Path to text file to be created, e.g. ./myfile.txt. OUTBASE is the same
        # path without the extension, because Tesseract appends ".txt" itself.
        OUTBASE="$OPATH/$(basename "$FILEPATH" .pdf)"
        OUTFILE="$OUTBASE.txt"
        touch "$OUTFILE"    # The text file will be created regardless of whether
                            # text is successfully extracted.

        echo -n "Attempting OCR extraction..."

        # Use ImageMagick to convert the PDF to a high-res multi-page TIFF.
        convert -density 300 "$FILEPATH" -depth 8 -strip -background white \
            -alpha off ./temp.tiff > /dev/null 2>&1

        # Then use Tesseract to perform OCR on the TIFF.
        tesseract ./temp.tiff "$OUTBASE" -l "$OCR_LANG" > /dev/null 2>&1

        # We don't need the intermediate TIFF file, so discard it.
        rm -f ./temp.tiff

        FILESIZE=$(wc -w < "$OUTFILE")
        echo "extracted $FILESIZE words."
    done

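Instead of keeping two separate scripts, one option is to check at run time whether pdftotext can be found and pick the strategy accordingly. The sketch below only shows that check; `have_tool` is a hypothetical helper name, and wiring it into the extraction loop is left as an exercise.

```bash
#!/bin/bash
# Hypothetical helper: succeeds if the command named $1 can be found on PATH.
have_tool() {
    command -v "$1" > /dev/null 2>&1
}

# Sketch of how a single script could choose its strategy at run time.
if have_tool pdftotext; then
    echo "pdftotext found: try embedded text first, then fall back to OCR."
else
    echo "pdftotext not found: using OCR only."
fi
```

The same check can be applied to `convert` and `tesseract` at the top of a script, so that a missing tool is reported once rather than failing silently for every file.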