Study of style effects on OCR errors in the MEDLINE database
17 January 2005
Penny Garrison, Diane L. Davis, Tim L. Andersen, Elisa H. Barney Smith
Proceedings Volume 5676, Document Recognition and Retrieval XII; (2005) https://doi.org/10.1117/12.589408
Event: Electronic Imaging 2005, San Jose, California, United States
Abstract
The National Library of Medicine has developed a system for the automatic extraction of data from scanned journal articles to populate the MEDLINE database. Although the 5-engine OCR system used in this process performs well overall, it still makes character recognition errors that must be corrected for the process to achieve the requisite accuracy. The correction process sends each word containing a character with less than 100% confidence (as determined automatically by the OCR engine) to a human operator, who must then manually verify the word or correct the error. The majority of these errors occur in the affiliate information zone, where the characters are in italics or small fonts. Therefore, only affiliate information data is used in this research. This paper examines the correlation between OCR errors and various character attributes in the MEDLINE database, such as font size, italics, and bold. The motivation for this research is that, if a correlation between the types of characters and the types of errors exists, it should be possible to use this information to improve operator productivity by increasing the probability that the correct word option is presented to the human editor. Using a categorizing program and confusion matrices, we have determined that this correlation exists, in particular for characters with diacritics.
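As a rough illustration of the kind of analysis the abstract describes, the following Python sketch tallies a separate confusion matrix for each character style and flags words for operator review when any character falls below full confidence. It is not the authors' categorizing program: the function names, the style labels, and the sample data are all hypothetical and chosen only to show the idea.

```python
from collections import Counter, defaultdict


def confusion_by_style(samples):
    """Tally (true_char, ocr_char) pairs separately for each style label."""
    matrices = defaultdict(Counter)
    for true_char, ocr_char, style in samples:
        matrices[style][(true_char, ocr_char)] += 1
    return matrices


def error_rate(matrix):
    """Fraction of tallied characters whose OCR output differs from the truth."""
    total = sum(matrix.values())
    wrong = sum(n for (true_char, ocr_char), n in matrix.items()
                if true_char != ocr_char)
    return wrong / total if total else 0.0


def needs_review(char_confidences, threshold=1.0):
    """A word goes to the operator if any character is below the threshold."""
    return any(c < threshold for c in char_confidences)


# Hypothetical samples: (ground-truth character, OCR output, style label).
samples = [
    ("e", "e", "plain"),
    ("e", "e", "plain"),
    ("a", "a", "plain"),
    ("é", "e", "diacritic"),
    ("é", "é", "diacritic"),
    ("ü", "u", "diacritic"),
    ("l", "1", "italic"),
    ("l", "l", "italic"),
]

matrices = confusion_by_style(samples)
rates = {style: error_rate(m) for style, m in matrices.items()}
```

Comparing the per-style error rates in `rates`, or inspecting which off-diagonal pairs dominate each matrix, is one simple way to check whether a style attribute is associated with particular substitutions. A real study would need far more data and a significance test before drawing conclusions.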
© (2005) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Penny Garrison, Diane L. Davis, Tim L. Andersen, and Elisa H. Barney Smith "Study of style effects on OCR errors in the MEDLINE database", Proc. SPIE 5676, Document Recognition and Retrieval XII, (17 January 2005); https://doi.org/10.1117/12.589408
KEYWORDS
Optical character recognition

Databases

Matrices

Algorithm development

Error analysis

Associative arrays

Mars
