A Document Recognition System for Early Modern Latin

 Digital Image
A Document Recognition System for Early Modern Latin, 2006

Dates

  • Creation: 2006

Creator

Access

Open for research.

Language of Materials

English

Source

PB

Description

Large-scale digitization of manuscripts is facilitated by high-accuracy optical character recognition (OCR) engines. The focus of our work is on using these tools to digitize Latin texts. Many Latin texts, especially early modern ones, make heavy use of special characters such as ligatures and accented abbreviations. Current OCR engines are inadequate for our purpose for two reasons. First, their built-in training sets do not include all of these special characters. Second, post-processing of OCR output relies on data and methods specific to the target language, and most current systems do not implement error-correction tools for Latin. This abstract outlines the development of a document recognition system for medieval and early modern Latin texts. We first evaluate the performance of the open source OCR framework, Gamera, on these manuscripts. We then incorporate language modeling functions to sharpen the character recognition output.
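
As an illustration of how language modeling can sharpen character recognition output, the following is a minimal sketch and not the system described in the abstract. It assumes the OCR stage returns, for each glyph, a short list of candidate characters with confidence scores, and it rescores those candidates with a smoothed character-bigram model. The function names (train_bigram_model, decode), the toy word list, and the confidence values are all hypothetical; nothing here uses Gamera's API.

```python
import math


def train_bigram_model(words, alpha=1.0):
    """Build an add-alpha smoothed character-bigram model.

    '^' and '$' mark the start and end of a word. Returns a function
    logp(a, b) giving the log-probability of character b following a.
    """
    counts = {}
    alphabet = {"$"}
    for word in words:
        padded = "^" + word + "$"
        alphabet.update(word)
        for a, b in zip(padded, padded[1:]):
            row = counts.setdefault(a, {})
            row[b] = row.get(b, 0) + 1
    vocab_size = len(alphabet)

    def logp(a, b):
        row = counts.get(a, {})
        total = sum(row.values())
        return math.log((row.get(b, 0) + alpha) / (total + alpha * vocab_size))

    return logp


def decode(candidates, logp):
    """Viterbi search over per-glyph OCR candidates.

    candidates: one list per glyph, each holding (char, ocr_confidence)
    pairs. Returns the string that maximizes the combined OCR and
    language-model score.
    """
    # best maps the last character of a partial reading to (score, reading).
    best = {"^": (0.0, "")}
    for position in candidates:
        step = {}
        for ch, prob in position:
            score, path = max(
                (s + logp(prev, ch) + math.log(prob), p)
                for prev, (s, p) in best.items()
            )
            step[ch] = (score, path + ch)
        best = step
    # Account for the end-of-word transition before picking the winner.
    _, path = max((s + logp(last, "$"), p) for last, (s, p) in best.items())
    return path


# Toy training data (illustrative only, not from the original project).
latin_words = ["que", "quod", "quam", "atque", "quia", "aqua"]
logp = train_bigram_model(latin_words)

# Hypothetical OCR output: the classifier slightly prefers 'g' over 'q'.
ocr_candidates = [
    [("g", 0.55), ("q", 0.45)],
    [("u", 0.90), ("n", 0.10)],
    [("e", 0.80), ("c", 0.20)],
]

greedy = "".join(max(pos, key=lambda c: c[1])[0] for pos in ocr_candidates)
corrected = decode(ocr_candidates, logp)
print(greedy, corrected)  # prints "gue que"
```

In this toy case, taking the top OCR candidate for each glyph yields "gue", while the bigram model favors "que" because "q" is far more likely than "g" at the start of a word in the training list. A real system would train on a large Latin corpus and would need to include ligatures and abbreviation marks in the model's alphabet.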

Subject

Repository Details

Part of the Tufts Archival Research Center Repository

Contact:
35 Professors Row
Tisch Library Building
Tufts University
Medford Massachusetts 02155 United States
617-627-3737