7 March 1996 Extraction of text-related features for condensing image documents
Dan S. Bloomberg, Francine R. Chen
Proceedings Volume 2660, Document Recognition III; (1996) https://doi.org/10.1117/12.234726
Event: Electronic Imaging: Science and Technology, 1996, San Jose, CA, United States
Abstract
A system has been built that selects excerpts from a scanned document for presentation as a summary, without using character recognition. The method relies on the idea that the most significant sentences in a document contain words that are both specific to the document and have a relatively high frequency of occurrence within it. Accordingly, and entirely within the image domain, each page image is deskewed and the text regions are found and extracted as a set of textblocks. Blocks with font size near the median for the document are selected and then placed in reading order. The textlines and words are segmented, and the words are placed into equivalence classes of similar shape. The sentences are identified by finding baselines for each line of text and analyzing the size and location of the connected components relative to the baseline. Scores can then be given to each word, depending on its shape and frequency of occurrence, and to each sentence, depending on the scores for the words in the sentence. Other salient features, such as textblocks that have a large font or are likely to contain an abstract, can also be used to select image parts that are likely to be thematically relevant. The method has been applied to a variety of documents, including articles scanned from magazines and technical journals.
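The block selection and scoring steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the image processing has already been done, so each word arrives with a hypothetical shape equivalence class ID and a bounding-box width. The names `Word`, `select_body_blocks`, and `score_sentences`, the `tolerance` and `min_width_px` parameters, the width cutoff used as a stand-in for "shape", and the plain sum used for sentence scores are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Word:
    shape_class: int  # hypothetical ID of the word's shape equivalence class
    width_px: int     # hypothetical bounding-box width of the word image


def select_body_blocks(block_font_sizes, tolerance=0.15):
    """Return indices of textblocks whose font size is near the document median.

    `tolerance` is an assumed relative threshold, not a value from the paper.
    """
    med = median(block_font_sizes)
    return [
        i for i, size in enumerate(block_font_sizes)
        if abs(size - med) <= tolerance * med
    ]


def score_sentences(sentences, min_width_px=40):
    """Score each sentence as the sum of its word scores.

    A word's score is the number of times its shape class occurs in the
    document. Narrow words are assumed to be short function words and
    score zero; this width cutoff is an illustrative proxy for "shape".
    """
    counts = Counter(w.shape_class for s in sentences for w in s)

    def word_score(word):
        if word.width_px < min_width_px:
            return 0
        return counts[word.shape_class]

    return [sum(word_score(w) for w in s) for s in sentences]


def top_sentences(sentences, k=1):
    """Return indices of the k highest-scoring sentences, in reading order."""
    scores = score_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Returning the selected indices in reading order keeps an excerpt readable when more than one sentence is chosen. A real system would need a more careful notion of shape similarity and of which word classes are specific to the document.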
© (1996) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Dan S. Bloomberg and Francine R. Chen "Extraction of text-related features for condensing image documents", Proc. SPIE 2660, Document Recognition III, (7 March 1996); https://doi.org/10.1117/12.234726
CITATIONS
Cited by 14 scholarly publications.
KEYWORDS
Image segmentation

Optical character recognition

Image processing

Halftones

Image analysis

Feature extraction

Statistical analysis

