A Document Recognition System for Early Modern Latin
Dates
- Creation: 2006
Creator
- Reddy, Sravana (Person)
- Crane, Gregory (Person)
Access
Open for research.
Language of Materials
English
Source
PB
Description
Large-scale digitization of manuscripts is facilitated by high-accuracy optical character recognition (OCR) engines. The focus of our work is on using these tools to digitize Latin texts. Many Latin texts, especially early modern ones, make heavy use of special characters such as ligatures and accented abbreviations. Current OCR engines are inadequate for our purpose for two reasons. First, their built-in training sets do not include all of these special characters. Second, post-processing of OCR output relies on data and methods specific to the target language, and most current systems do not implement error-correction tools for Latin. This abstract outlines the development of a document recognition system for medieval and early modern Latin texts. We first evaluate the performance of the open-source OCR framework Gamera on these manuscripts. We then incorporate language modeling functions to sharpen the character recognition output.
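The abstract does not say which language modeling functions were used. As an illustration only, the sketch below shows one common way such post-processing can work: a character bigram model, trained on a Latin word list, re-ranks the alternative readings an OCR engine proposes for a word. The function names, the add-one smoothing, the `lm_weight` parameter, and the tiny word list are all assumptions made for this example. None of it is taken from Gamera or from the authors' system.

```python
import math
from collections import Counter


def train_bigram_model(corpus_words):
    """Count character bigrams, with ^ and $ marking word boundaries."""
    bigrams = Counter()
    unigrams = Counter()
    for word in corpus_words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    alphabet = {ch for w in corpus_words for ch in w} | {"^", "$"}
    return bigrams, unigrams, len(alphabet)


def log_prob(word, model):
    """Log-probability of a word under the bigram model (add-one smoothing)."""
    bigrams, unigrams, vocab_size = model
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return total


def pick_best(candidates, model, lm_weight=1.0):
    """Choose among (word, ocr_confidence) pairs, confidence in (0, 1].

    The score combines the OCR confidence with the language model score.
    """
    def score(candidate):
        word, confidence = candidate
        return math.log(confidence) + lm_weight * log_prob(word, model)

    return max(candidates, key=score)[0]


# Hypothetical training data: a real system would use a large Latin corpus.
latin_words = ["que", "quod", "quam", "atque", "neque", "quia"]
model = train_bigram_model(latin_words)

# Hypothetical OCR alternatives for one word image: "u" misread as "n".
best = pick_best([("qne", 0.55), ("que", 0.45)], model)
```

In this toy case the language model outweighs the slightly higher OCR confidence of the misreading, because the sequence "qn" never occurs in the training words. A practical system would also have to handle the ligatures and abbreviation marks the abstract mentions, which this sketch ignores.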
Subject
- Perseus Project (Organization)
Repository Details
Part of the Tufts Archival Research Center Repository
35 Professors Row
Tisch Library Building
Tufts University
Medford Massachusetts 02155 United States
617-627-3737
archives@tufts.edu