The Giles Ecosystem is a distributed version of Giles. It is designed to handle a high volume of requests by distributing work through Apache Kafka.
Currently, there are four applications in the Giles Ecosystem:
- A Giles Head: https://github.com/diging/giles-eco-giles-web
- Nepomuk: https://github.com/diging/giles-eco-nepomuk
- Cepheus: https://github.com/diging/giles-eco-cepheus
- Cassiopeia: https://github.com/diging/giles-eco-cassiopeia
Giles Head
A Giles Head looks and behaves the same way as the full version of Giles. However, instead of extracting images and running OCR itself, a Giles Head publishes requests (for extraction, OCR, and so on) to Apache Kafka, and other components process them. The main responsibility of a Giles Head is to provide a stable API and user interface and to coordinate the file processing workflow.
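The request hand-off described above can be pictured with a minimal sketch. It is not the Giles Head implementation: the topic names, the message fields, and the `InMemoryBroker` class are all assumptions made for illustration, and an in-memory list stands in for Apache Kafka so the example runs without a broker.

```python
import json
import uuid
from collections import defaultdict

# Hypothetical topic names; the real Giles Ecosystem topics may differ.
TOPIC_IMAGE_EXTRACTION = "example.requests.image_extraction"
TOPIC_OCR = "example.requests.ocr"


class InMemoryBroker:
    """Stand-in for Apache Kafka: keeps an append-only list of messages per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        # Serialize to a JSON string, as a message might be on a real topic.
        self.topics[topic].append(json.dumps(message))


def submit_request(broker, topic, document_id, file_url):
    """Build a request message (fields are illustrative) and publish it.

    Returns the generated request id so the caller can match a later
    completion message to this request.
    """
    request_id = str(uuid.uuid4())
    broker.publish(topic, {
        "requestId": request_id,
        "documentId": document_id,
        "fileUrl": file_url,
        "status": "NEW",
    })
    return request_id


broker = InMemoryBroker()
request_id = submit_request(
    broker, TOPIC_IMAGE_EXTRACTION, "doc-1", "http://localhost/files/doc-1.pdf"
)
```

Returning the request id is one way a coordinating component could later correlate completion messages with the requests it issued.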
Nepomuk
Nepomuk is the storage system of the Giles Ecosystem. Its main responsibility is to store files and provide an API to retrieve stored files. It listens for storage requests in Apache Kafka and sends a storage complete request once a file has been stored.
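The listen, store, and confirm cycle might look roughly like the following sketch. The topic names, message fields, and the dictionary used as a file store are hypothetical and are not taken from Nepomuk; deques stand in for Kafka topics.

```python
import json
from collections import deque

# Hypothetical topic names and message fields, chosen for illustration only.
STORAGE_REQUEST_TOPIC = "example.requests.storage"
STORAGE_COMPLETE_TOPIC = "example.requests.storage.complete"


def handle_storage_request(raw_message, store):
    """Store the content carried by a request and return the completion message."""
    request = json.loads(raw_message)
    file_id = "file-{}".format(len(store) + 1)
    store[file_id] = request["content"]
    return {
        "requestId": request["requestId"],
        "fileId": file_id,
        "status": "COMPLETE",
    }


def run_once(topics, store):
    """Drain pending storage requests and publish one completion per request."""
    pending = topics[STORAGE_REQUEST_TOPIC]
    while pending:
        completion = handle_storage_request(pending.popleft(), store)
        topics[STORAGE_COMPLETE_TOPIC].append(json.dumps(completion))


# Deques stand in for Kafka topics so the sketch runs without a broker.
topics = {STORAGE_REQUEST_TOPIC: deque(), STORAGE_COMPLETE_TOPIC: deque()}
store = {}
topics[STORAGE_REQUEST_TOPIC].append(
    json.dumps({"requestId": "req-1", "content": "hello"})
)
run_once(topics, store)
```

The completion message carries the original request id, so whichever component issued the request can tell which of its requests has finished.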
Cepheus
Cepheus is an app that extracts images and text from PDFs. A PDF file submitted to Cepheus for image extraction is turned into a series of images according to Cepheus's configuration (image format and DPI can be specified when setting up Cepheus). When a PDF file is submitted for text extraction, Cepheus attempts to extract both the complete text embedded in the PDF and the text of each individual page. Note that Cepheus does not run OCR on submitted files.
Cepheus listens for image and text extraction requests in Apache Kafka and submits image and text extraction complete requests.
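To make the effect of the DPI setting concrete, here is a small worked sketch. The `ExtractionConfig` class, its default values, and the file naming scheme are hypothetical and do not reflect Cepheus's actual configuration keys. The only fixed fact used is that PDF page sizes are expressed in points, with 72 points per inch by default.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72  # PDF user space defaults to 72 points per inch


@dataclass(frozen=True)
class ExtractionConfig:
    """Hypothetical settings object; Cepheus's real configuration may differ."""
    image_format: str = "tiff"
    dpi: int = 300


def rendered_size(width_pt, height_pt, config):
    """Return the (width, height) in pixels of a page rendered at config.dpi."""
    scale = config.dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def image_name(document_name, page_number, config):
    """Build a per-page file name such as 'report.3.tiff' (illustrative scheme)."""
    return "{}.{}.{}".format(document_name, page_number, config.image_format)


config = ExtractionConfig(image_format="tiff", dpi=300)
letter = rendered_size(612, 792, config)  # US Letter is 612 x 792 points
print(letter)  # (2550, 3300)
```

Doubling the DPI doubles each pixel dimension and so quadruples the pixel count per page, which is worth keeping in mind when choosing a value for downstream OCR.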
Cassiopeia
Cassiopeia is an app that runs OCR on images using Tesseract. It listens for OCR requests in Apache Kafka and sends OCR complete requests after successful processing.
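For readers unfamiliar with Tesseract, the sketch below shows one way to run it on a single image through its command-line interface. This is an assumption made for illustration: Cassiopeia may call Tesseract through a library binding instead, and the function names and the `page-1.tiff` path are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, language="eng"):
    """Return the argument list for a Tesseract CLI call that prints text to stdout."""
    # Passing 'stdout' as the output base makes the CLI write to standard output.
    return ["tesseract", image_path, "stdout", "-l", language]


def run_ocr(image_path, language="eng"):
    """Run Tesseract on one image and return the recognized text.

    Raises RuntimeError if the tesseract binary is not on PATH, and
    subprocess.CalledProcessError if Tesseract exits with an error.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


command = build_tesseract_command("page-1.tiff")
```

Because `check=True` raises an error on failure, a caller could send a completion message only when `run_ocr` returns normally, which matches the "after successful processing" behavior described above.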