The paper considers the classification methods for business documents images data extracted after recognition. The peculiarities of the recognized text analysis are pointed out. The identification mechanism for the recognized words is described. The advantages and disadvantages of the Levenshtein distance are listed. Other string distance metrics are considered: Jaro–Winkler similarity, multiset metric, Most Frequent K Characters (MFKC) metric. The standard Levenshtein distance is compared with other string distance metrics. A modification of the Levenshtein distance is proposed, which is aimed at the peculiarities of recognized characters. The paper provides the experimental results illustrating the proposed distance application.
This article is focused on methods of search for falsifications in scanned copies of business documents. This task arises from a comparison of two copies of business documents signed by two parties. The comparison should be performed to detect possible changes made by one of the parties. This problem is relevant, for instance, in the banking sector when signing agreements on paper. The method of partial search for matching flexible documents, where text attributes may be changed, and unintentional modifications of non-essential words may be made is considered. The method of comparison of two scanned images based on the recognition and analysis of N-grams word sequences is proposed. The proposed method has been tested on private dataset. The proposed method has demonstrated high quality and reliability of the search for differences in two samples of one agreement-type document.
In this work the methods of comparison of digitized copies of administrative documents were considered. This problem arises, for example, when comparing two copies of documents signed by two parties in order to find possible modifications made by one party, in the banking sector at the conclusion of contracts in paper form. The proposed method of document image comparison is based on a combination of several ways of image comparison of words that are descriptors of text feature points. Testing was conducted on public Payslip Dataset (French). The results showed the high quality and the reliability of finding differences in two images that are versions of the same document.
This paper considers problems regarding the development of stochastic models consistent with the results of character image recognition in video stream. Assumptions about their structure and properties are formulated for the constructed models. The description of the model components defines the Dirichlet distribution and its generalizations. The parameters of these distributions are determined using statistical estimation methods. The Akaike information criterion is used to rank models. The verification of the agreement of the proposed theoretical distributions to the sample data is carried out.
This paper discusses a task of document recognition on a sequence of video frames. In order to optimize the processing speed an estimation is performed of stability of recognition results obtained from several video frames. Considering identity document (Russian internal passport) recognition on a mobile device it is shown that significant decrease is possible of the number of observations necessary for obtaining precise recognition result.
In this paper the problem statement is given to compare the digitized pages of the official papers. Such problem appears during the comparison of two customer copies signed at different times between two parties with a view to find the possible modifications introduced on the one hand. This problem is a practically significant in the banking sector during the conclusion of contracts in a paper format. The method of comparison based on the recognition, which consists in the comparison of two bag-of-words, which are the recognition result of the master and test pages, is suggested. The described experiments were conducted using the OCR Tesseract and the siamese neural network. The advantages of the suggested method are the steady operation of the comparison algorithm and the high exacting precision, and one of the disadvantages is the dependence on the chosen OCR.
In this paper we consider a task of improving optical character recognition (OCR) results of document fields on low-quality and average-quality images using N-gram models. Cyrillic fields of Russian Federation internal passport are analyzed as an example. Two approaches are presented: the first one is based on hypothesis of dependence of a symbol from two adjacent symbols and the second is based on calculation of marginal distributions and Bayesian networks computation. A comparison of the algorithms and experimental results within a real document OCR system are presented, it's showed that the document field OCR accuracy can be improved by more than 6% for low-quality images.
Access to the requested content is limited to institutions that have purchased or subscribe to SPIE eBooks.
You are receiving this notice because your organization may not have SPIE eBooks access.*
*Shibboleth/Open Athens users─please
sign in
to access your institution's subscriptions.
To obtain this item, you may purchase the complete book in print or electronic format on
SPIE.org.
INSTITUTIONAL Select your institution to access the SPIE Digital Library.
PERSONAL Sign in with your SPIE account to access your personal subscriptions or to use specific features such as save to my library, sign up for alerts, save searches, etc.