A Document Recognition System for Early Modern Latin
- Creation: 2006
Open for research.
Large-scale digitization of manuscripts is facilitated by high-accuracy optical character recognition (OCR) engines. Our work focuses on using these tools to digitize Latin texts. Many Latin texts, especially early modern ones, make heavy use of special characters such as ligatures and accented abbreviations. Current OCR engines are inadequate for this purpose for two reasons. First, their built-in training sets do not include all of these special characters. Second, post-processing of OCR output relies on data and methods specific to the target language, and most current systems do not implement error-correction tools for Latin. This abstract outlines the development of a document recognition system for medieval and early modern Latin texts. We first evaluate the performance of the open-source OCR framework Gamera on these manuscripts. We then incorporate language-modeling functions to sharpen the character recognition output.
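The abstract does not say which language-modeling functions are used. One common approach is sketched below: a character bigram model that rescores the recognizer's per-glyph candidates. Everything in the sketch is an assumption made for illustration. The function names (`train_bigram_model`, `decode`), the toy corpus, and the candidate probabilities are hypothetical, and none of it is Gamera's API or the authors' method. The example supposes the recognizer cannot decide whether a glyph is "f" or "s", a confusion that a long s could plausibly cause.

```python
import math
from collections import Counter


def train_bigram_model(corpus, alpha=1.0):
    """Return a function giving add-alpha smoothed log P(cur | prev) over characters."""
    # Normalize whitespace and pad with spaces so word boundaries are modeled.
    text = " " + " ".join(corpus.split()) + " "
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])
    vocab_size = len(set(text)) + 1  # one extra slot for unseen characters

    def log_prob(prev, cur):
        numerator = pair_counts[(prev, cur)] + alpha
        denominator = prev_counts[prev] + alpha * vocab_size
        return math.log(numerator / denominator)

    return log_prob


def decode(candidates, log_prob, lm_weight=1.0):
    """Viterbi search over per-glyph candidate lists [(char, ocr_prob), ...].

    Each ocr_prob must be greater than zero. With lm_weight=0 the language
    model is ignored and the recognizer's top choice wins at every position.
    """
    # best maps the last character of a path to (score, decoded string so far).
    best = {" ": (0.0, "")}  # every word is assumed to start after a space
    for options in candidates:
        new_best = {}
        for ch, ocr_prob in options:
            score, path = max(
                ((s + lm_weight * log_prob(prev, ch), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[ch] = (score + math.log(ocr_prob), path + ch)
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy training text (hypothetical); a real system would need a large Latin corpus.
corpus = "gallia est omnis divisa in partes tres quarum unam incolunt belgae"
model = train_bigram_model(corpus)

# Hypothetical recognizer output for one word: the middle glyph is ambiguous.
candidates = [
    [("e", 0.9)],
    [("f", 0.55), ("s", 0.45)],
    [("t", 0.9)],
]

print(decode(candidates, model, lm_weight=0.0))  # eft (recognizer alone)
print(decode(candidates, model))                 # est (with the bigram model)
```

In this toy case the recognizer slightly prefers "f", but the bigrams "es" and "st" occur in the training text while "ef" and "ft" do not, so the combined score selects "est". A practical system for early modern Latin would also have to model ligatures and abbreviation marks, which this sketch ignores.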