This work proposes several approaches for generating correspondences between real scanned
books and their transcriptions, which may contain text modifications and layout variations, while also taking OCR
errors into account. Our approaches for aligning the manuscript with the transcription are based on
weighted finite state transducers (WFSTs). In particular, we propose adapted WFSTs that represent the transcription
to be aligned with the OCR lattices. The character-level alignment model includes edit rules that permit insertion,
deletion, and substitution. These edit operations allow the transcription model to cope with OCR segmentation
and recognition errors, and also to align against different text editions. We implemented
an alignment model with a hyphenation model so that it can adapt to a non-hyphenated transcription. Our models
also handle Fraktur ligatures, which are typically found in historical Fraktur documents. We evaluated our
approach on Fraktur documents from the "Wanderungen durch die Mark Brandenburg" volumes (1862-1889) and
observed the performance of those models under OCR errors. We compare the performance of our model across
scenarios in which no correspondence information is available at the (i) word, (ii) line, (iii) sentence,
or (iv) page level.
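
To make the role of the edit operations concrete, the following is a minimal sketch of character-level alignment with insertion, deletion, and substitution. It is not the WFST-based method described above: it swaps in a plain dynamic-programming edit distance with unit costs and works on a single OCR string rather than an OCR lattice. The function name `align` and the operation labels are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def align(ocr: str, ref: str) -> List[Tuple[str, str, str]]:
    """Align an OCR string with a reference transcription.

    Assumption: unit cost for every edit (the paper's weights are not given).
    Returns (op, ocr_char, ref_char) triples where op is one of:
      "match" - characters agree
      "sub"   - OCR character replaced by a different reference character
      "del"   - OCR character with no counterpart in the reference
      "ins"   - reference character missing from the OCR output
    """
    n, m = len(ocr), len(ref)
    # cost[i][j] = minimum number of edits turning ocr[:i] into ref[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ocr[i - 1] != ref[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover one optimal path.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ocr[i - 1] != ref[j - 1])):
            op = "match" if ocr[i - 1] == ref[j - 1] else "sub"
            ops.append((op, ocr[i - 1], ref[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ocr[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", ref[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    # Hypothetical OCR confusion of "k" with "l".
    for step in align("Marl", "Mark"):
        print(step)
```

In a WFST formulation, each of these operations would correspond to a weighted arc, and the best alignment would be found by composing the transcription transducer with the OCR lattice and taking the lowest-cost path; the table-based version here only illustrates the edit operations themselves.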