Authors of short papers such as letters or editorials often express complementary opinions, and sometimes contradictory
ones, on related work in previously published articles. The MEDLINE® citations for such short papers are required to
list bibliographic data on these "commented on" articles in a "CON" field. The challenge is to automatically identify the
CON articles referred to by the author of the short paper (called a "Comment-in" or CIN paper). Our approach is to use
support vector machines (SVM) to first classify a paper as either a CIN or a regular full-length article (which is exempt
from this requirement), and then to extract from the CIN paper the bibliographic data of the CON articles. This paper
addresses the first part of the problem, identifying CIN articles. We implement and compare the performance of
two types of SVM, one with a linear kernel function and the other with a radial basis function (RBF) kernel. Input
feature vectors for the SVMs are created by combining four types of features based on statistics of words in the article
title, words that suggest the article type (letter, correspondence, editorial), size of body text, and cue phrases.
Experiments conducted on a set of online biomedical articles show that the SVM with a linear kernel function yields a
significantly lower false negative error rate than the one with an RBF kernel. Our experiments also show that the SVM
with a linear kernel function achieves significantly higher accuracy, and lower false positive and false negative error
rates, when its input feature vectors combine all four types of features than when they use any single type.
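The combination of the four feature types into one input vector can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word lists, the cue phrases, the body-size scaling, the `feature_vector` helper, and the sample letter are all assumptions, since the abstract does not specify them. The title features are simplified here to binary indicators rather than the word statistics the abstract mentions.

```python
import re

# Hypothetical vocabularies; the abstract does not list the actual words or phrases.
TITLE_WORDS = ("reply", "response", "comment", "re")
TYPE_WORDS = ("letter", "correspondence", "editorial")
CUE_PHRASES = ("in their article", "we read with interest", "the article by")


def tokens(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def feature_vector(title, type_label, body):
    """Concatenate four feature groups into one flat list of floats."""
    title_tokens = set(tokens(title))
    type_tokens = set(tokens(type_label))
    body_lower = body.lower()

    # Group 1: presence of selected words in the article title.
    title_feats = [1.0 if w in title_tokens else 0.0 for w in TITLE_WORDS]
    # Group 2: words that suggest the article type.
    type_feats = [1.0 if w in type_tokens else 0.0 for w in TYPE_WORDS]
    # Group 3: body size as a word count in thousands (an arbitrary scaling).
    size_feats = [len(tokens(body)) / 1000.0]
    # Group 4: number of occurrences of each cue phrase in the body.
    cue_feats = [float(body_lower.count(p)) for p in CUE_PHRASES]

    return title_feats + type_feats + size_feats + cue_feats


# Made-up example input, for illustration only.
vec = feature_vector(
    title="Reply to Smith",
    type_label="Letter to the Editor",
    body="We read with interest the article by Smith and colleagues.",
)
print(len(vec), vec)
```

A vector built this way could be passed to any SVM implementation. Dropping one of the four groups from the concatenation gives the single-type and partial-combination variants that the experiments compare.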

There is a strong demand for automated tools that extract pertinent information from the biomedical
literature, a rich, complex, and rapidly growing resource that is increasingly accessed via the web. This paper
presents a hybrid method based on contextual and statistical information to automatically identify two MEDLINE
citation terms, NIH grant numbers and databank accession numbers, in HTML-formatted online biomedical
documents. Their detection is challenging due to many variations and inconsistencies in their format (although
recommended formats exist), and also because of their similarity to other technical or biological terms. Our proposed
method first extracts potential candidates for these terms using a rule-based method. The candidates are then scored,
and the final candidates are submitted to a human operator for verification. The confidence score for each term is
calculated from statistical, morphological, and contextual information. Experiments conducted on more than ten
thousand HTML-formatted online biomedical documents show that most NIH grant numbers and databank accession
numbers can be successfully identified by the proposed method, with recall rates of 99.8% and 99.6%, respectively.
However, owing to the high false alarm rate, the proposed method yields F-measures of 86.6% and 87.9% for NIH
grant numbers and databank accession numbers, respectively.
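The gap between the recall and F-measure figures can be made concrete with a short calculation. The sketch below assumes that the reported F-measure is the balanced harmonic mean of precision and recall; the abstract does not say which variant was used, so the precision values it prints are derived estimates, not reported results. The function names are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def implied_precision(f, recall):
    """Solve F = 2PR / (P + R) for P, given F and R."""
    return f * recall / (2.0 * recall - f)


# (recall, F-measure) pairs as reported in the abstract.
reported = {
    "NIH grant numbers": (0.998, 0.866),
    "databank accession numbers": (0.996, 0.879),
}

for term, (recall, f) in reported.items():
    p = implied_precision(f, recall)
    print(f"{term}: implied precision = {p:.3f}")
```

Under that assumption, the implied precision is roughly 0.76 for NIH grant numbers and 0.79 for databank accession numbers, which is consistent with the high false alarm rate and with the decision to send the final candidates to a human operator.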

MEDLINE® is the premier bibliographic online database of the National Library of Medicine, containing approximately
14 million citations and abstracts from over 4,800 biomedical journals. This paper presents an automated method based
on support vector machines to identify a "comment-on" list, which is a field in a MEDLINE citation denoting previously
published articles commented on by a given article. For comparative study, we also introduce another method based on
scoring functions that estimate the significance of each sentence in a given article. Preliminary experiments conducted
on HTML-formatted online biomedical documents collected from 24 different journal titles show that the support vector
machine with a polynomial kernel function performs best in terms of recall and F-measure.
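For readers unfamiliar with the kernel functions involved, here is a minimal sketch of the polynomial kernel alongside the linear and RBF kernels. The abstract names only the polynomial kernel, so the other two are included for contrast and are not confirmed as the alternatives tested. The parameter values (`degree`, `gamma`, `coef0`) and the sample vectors are assumptions for illustration.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """K(x, y) = <x, y>"""
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """K(x, y) = (gamma * <x, y> + coef0) ** degree"""
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two made-up feature vectors.
x = [1.0, 0.0, 2.0]
y = [0.0, 1.0, 2.0]

print(linear_kernel(x, y))
print(polynomial_kernel(x, y))
print(rbf_kernel(x, y))
```

The polynomial kernel adds interactions between features up to the chosen degree, which a linear kernel cannot represent. That is one possible reason it could perform differently on sentence-level features, although the abstract does not give an explanation.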