Computing Reviews
NEWPAR: an automatic feature selection and weighting schema for category ranking
Ruiz-Rico F., Vicedo J., Rubio-Sánchez M. Document engineering (Proceedings of the 2006 ACM Symposium on Document Engineering, Amsterdam, The Netherlands, Oct 10-13, 2006) 128-137. 2006. Type: Proceedings
Date Reviewed: Jan 12 2007

The classification of plain-text documents is an ongoing challenge in information research. This paper proposes an original mixture of existing ideas for the categorization of plain-text documents.

Document classification is usually done by associating each document with a vector of weights, computed from terms that appear in the document. A training set is used to establish a set of vectors, each of which is a prototype for a particular category. Documents are assigned to categories based on the closeness of the document’s weight vector to the prototype vector of a category. It is possible to assign a document to more than one category. NEWPAR, the technique described in the paper, is distinctive in part because it uses only certain n-grams from the text, namely nouns or nouns preceded by adjectives; verbs in particular are discarded. N-grams that match category descriptors, or that appear in titles, are given greater weight.
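To make the general scheme concrete, the following is a minimal sketch of expression extraction followed by prototype-based category ranking. It is not the paper's implementation. The tag names ("NOUN", "ADJ"), the function names, the use of raw counts as weights, and the choice of cosine similarity as the closeness measure are all assumptions made for illustration; the input is assumed to be already part-of-speech tagged, and the extra weighting of title and descriptor matches is left out.

```python
import math
from collections import Counter, defaultdict


def extract_expressions(tagged_tokens):
    """Keep nouns, joined with a directly preceding adjective if there is one.

    tagged_tokens is a list of (token, tag) pairs. The tag set is hypothetical.
    Verbs and all other tags are discarded.
    """
    expressions = []
    for i, (token, tag) in enumerate(tagged_tokens):
        if tag != "NOUN":
            continue
        if i > 0 and tagged_tokens[i - 1][1] == "ADJ":
            expressions.append(f"{tagged_tokens[i - 1][0]} {token}")
        else:
            expressions.append(token)
    return expressions


def build_prototypes(training):
    """Sum expression counts per category to form one prototype vector each.

    training is a list of (expressions, category) pairs.
    """
    prototypes = defaultdict(Counter)
    for expressions, category in training:
        prototypes[category].update(expressions)
    return prototypes


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_categories(expressions, prototypes):
    """Return (category, score) pairs, closest prototype first."""
    doc_vector = Counter(expressions)
    scores = {cat: cosine(doc_vector, proto) for cat, proto in prototypes.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the result is a ranking rather than a single label, a document can be assigned to more than one category by keeping every category whose score passes a chosen cutoff.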

Measures such as term frequency and document frequency, instead of being taken over the whole corpus, are used within each category to select the most discriminating expressions. The category frequency, which measures the number of categories in which an expression occurs, is also used to discriminate among categories.
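The three statistics mentioned here can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, each training item is assumed to be an (expressions, category) pair, and the review does not say how the paper combines these counts into a final weight, so no combined formula is shown.

```python
from collections import Counter, defaultdict


def category_statistics(corpus):
    """Compute per-category term and document frequency, plus category frequency.

    corpus is an iterable of (expressions, category) pairs. A document that
    belongs to several categories can simply appear once per category.
    """
    tf = defaultdict(Counter)  # tf[cat][expr]: occurrences of expr within cat
    df = defaultdict(Counter)  # df[cat][expr]: documents in cat containing expr
    for expressions, category in corpus:
        tf[category].update(expressions)
        df[category].update(set(expressions))

    # cf[expr]: number of categories in which expr occurs at least once
    cf = Counter()
    for category in tf:
        cf.update(tf[category].keys())
    return tf, df, cf


def single_category_expressions(cf):
    """Expressions seen in exactly one category (an illustrative filter only)."""
    return sorted(expr for expr, count in cf.items() if count == 1)
```

An expression with a low category frequency occurs in few categories, which is why it helps separate them; the filter above is the extreme case of that idea.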

The paper includes results from experiments in which NEWPAR was applied to existing data sets. While in isolated cases NEWPAR is outperformed by one of the other algorithms to which it is compared, NEWPAR with the simple sum of weights criterion is shown to perform well in all cases.

The paper is clearly written, and can be read by anyone who has a basic understanding of support vector methods. One issue that is not addressed is that of the overhead for expression extraction: since this relies on stemming and part-of-speech tagging, it may be substantial.

Reviewer:  J. P. E. Hodgson Review #: CR133792
Indexing Methods (H.3.1 ... )
Induction (I.2.6 ... )
Information Filtering (H.3.3 ... )
Information Search And Retrieval (H.3.3)