The Giles Ecosystem is a distributed system for running OCR on images and extracting images and text from PDF files.
Components
The core components of the Giles Ecosystem are located in the following repositories:
- Giles: https://github.com/diging/giles-eco-giles-web (this repository)
- Nepomuk: https://github.com/diging/giles-eco-nepomuk (file storage)
- Cepheus: https://github.com/diging/giles-eco-cepheus (image extraction from PDF files)
- Andromeda: https://github.com/diging/giles-eco-andromeda (text extraction from PDF files)
- Cassiopeia: https://github.com/diging/giles-eco-cassiopeia (OCR using Tesseract)
Dependencies
The system depends on the following software:
- Apache Tomcat 8
- Apache Kafka
- Apache Zookeeper
- MySQL (or PostgreSQL)
- Tesseract OCR (https://github.com/tesseract-ocr/)
- Digilib
Documentation
The Giles Ecosystem documentation (in progress) is available on the Giles Ecosystem Home page.
Running the Giles Ecosystem
A Docker Compose file for running the Giles Ecosystem in several Docker containers is available at: https://github.com/diging/giles-eco-docker
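
As a rough illustration of what starting the containers might look like, here is a minimal Python sketch that checks for the required command-line tools and then clones the Docker repository and starts the stack. Only the repository URL comes from this README. The local directory name, the location of the compose file in the repository root, and the use of the `docker compose` (Compose v2) command are assumptions, so check the giles-eco-docker repository for the actual setup steps before relying on it.

```python
import shutil
import subprocess
import sys

# URL taken from this README; everything else below is an assumption.
REPO_URL = "https://github.com/diging/giles-eco-docker"
CHECKOUT_DIR = "giles-eco-docker"  # hypothetical local directory name


def missing_tools(tools=("git", "docker")):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def build_commands(checkout_dir=CHECKOUT_DIR):
    """Return (command, working directory) pairs; None means the current directory."""
    return [
        (["git", "clone", REPO_URL, checkout_dir], None),
        # Assumes a compose file with a default name in the repository root
        # and the Docker Compose v2 plugin ("docker compose").
        (["docker", "compose", "up", "-d"], checkout_dir),
    ]


def main():
    """Check prerequisites, then run each command and stop on the first failure."""
    missing = missing_tools()
    if missing:
        sys.exit("Missing required tools: " + ", ".join(missing))
    for command, cwd in build_commands():
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `main()` performs the clone and starts the containers in the background. Calling `build_commands()` on its own only returns the commands, which is useful for reviewing them first.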