The classification of plain-text documents is an ongoing challenge in information retrieval research. This paper proposes an original mixture of existing ideas for the categorization of plain-text documents.
Document classification is usually done by associating each document with a vector of weights computed from the terms that appear in the document. A training set is used to establish a set of vectors, each of which serves as a prototype for a particular category. A document is assigned to a category when its weight vector is close enough to that category's prototype vector; a document may be assigned to more than one category. NEWPAR, the technique described in the paper, is distinctive in part because it uses only certain n-grams from the text, namely nouns, or nouns preceded by adjectives; verbs in particular are discarded. N-grams that match category descriptors, or that appear in titles, are given greater weight.
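The prototype-based scheme described above can be sketched as follows. This is a minimal illustration, not NEWPAR's actual implementation: it assumes tokens arrive already part-of-speech tagged, uses a plain centroid as the prototype and cosine similarity as the closeness measure, and all function names and the threshold value are hypothetical.

```python
import math
from collections import Counter

def select_expressions(tagged, title_terms=frozenset(), boost=2.0):
    """Keep only nouns, or adjective+noun bigrams (verbs and other
    parts of speech are discarded); expressions appearing in the
    title get extra weight. `tagged` is a list of (word, pos) pairs
    produced by any POS tagger."""
    weights = Counter()
    for i, (word, pos) in enumerate(tagged):
        if pos == "NOUN":
            expr = word
            if i > 0 and tagged[i - 1][1] == "ADJ":
                expr = tagged[i - 1][0] + " " + word
            weights[expr] += boost if expr in title_terms else 1.0
    return weights

def prototype(vectors):
    """Average the weight vectors of a category's training documents
    into a centroid -- one simple choice of prototype."""
    proto = Counter()
    for v in vectors:
        for term, w in v.items():
            proto[term] += w / len(vectors)
    return proto

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc_vec, prototypes, threshold=0.2):
    """Assign the document to every category whose prototype is close
    enough; multiple assignments are allowed."""
    return [c for c, p in prototypes.items()
            if cosine(doc_vec, p) >= threshold]
```

The centroid and the fixed threshold are the simplest options; the paper's actual weighting and decision criteria differ in detail.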
Measures such as term frequency and document frequency, instead of being taken over the whole corpus, are used within each category to select the most discriminating expressions. The category frequency, which measures the number of categories in which an expression occurs, is also used to discriminate among categories.
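The within-category selection can be illustrated in the same spirit. Again a hypothetical sketch rather than the paper's code: it weights a term by its frequency inside one category, discounted by an inverse-category-frequency factor (analogous to idf, but taken over categories rather than documents), so that expressions occurring in many categories are penalized.

```python
import math
from collections import Counter

def category_frequency(category_terms):
    """For each expression, count the number of categories whose
    training text contains it; expressions confined to few categories
    discriminate better."""
    cf = Counter()
    for terms in category_terms.values():
        for t in set(terms):
            cf[t] += 1
    return cf

def discriminating_weight(term, category, category_terms):
    """Term frequency taken within a single category, multiplied by
    log(N / cf), where N is the number of categories and cf the
    term's category frequency."""
    cf = category_frequency(category_terms)
    if not cf.get(term):
        return 0.0
    tf = category_terms[category].count(term)
    return tf * math.log(len(category_terms) / cf[term])
```

A term that occurs in every category receives weight zero, which captures the intuition in the text: frequency measures taken per category, combined with category frequency, pick out the most discriminating expressions.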
The paper includes results from experiments in which NEWPAR was applied to existing data sets. While in isolated cases NEWPAR is outperformed by one of the other algorithms to which it is compared, NEWPAR with the simple sum of weights criterion is shown to perform well in all cases.
The paper is clearly written, and can be read by anyone who has a basic understanding of vector-space methods. One issue that is not addressed is the overhead of expression extraction: since this relies on stemming and part-of-speech tagging, it may be substantial.