The classification of plain-text documents is an ongoing challenge in information retrieval research. This paper proposes an original mixture of existing ideas for the categorization of plain-text documents.
Document classification is usually done by associating each document with a vector of weights computed from the terms that appear in the document. A training set is used to establish a set of vectors, each of which serves as a prototype for a particular category. A document is assigned to a category when its weight vector is close enough to that category's prototype vector; a document may be assigned to more than one category. NEWPAR, the technique described in the paper, is distinctive in part because it uses only certain n-grams from the text, namely nouns, or nouns preceded by adjectives; verbs in particular are discarded. N-grams that match category descriptors, or that appear in titles, are given greater weight.
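The prototype-based scheme described above can be sketched as follows. This is a minimal illustration, not NEWPAR's actual implementation: it assumes tokens arrive already part-of-speech tagged, uses a plain centroid as the prototype and cosine similarity as the closeness measure, and all function names and the threshold value are hypothetical.

```python
import math
from collections import Counter

def select_expressions(tagged, title_terms=frozenset(), boost=2.0):
    """Keep only nouns, or adjective+noun bigrams (verbs and other
    parts of speech are discarded); expressions appearing in the
    title get extra weight. `tagged` is a list of (word, pos) pairs
    produced by any POS tagger."""
    weights = Counter()
    for i, (word, pos) in enumerate(tagged):
        if pos == "NOUN":
            expr = word
            if i > 0 and tagged[i - 1][1] == "ADJ":
                expr = tagged[i - 1][0] + " " + word
            weights[expr] += boost if expr in title_terms else 1.0
    return weights

def prototype(vectors):
    """Average the weight vectors of a category's training documents
    into a centroid -- one simple choice of prototype."""
    proto = Counter()
    for v in vectors:
        for term, w in v.items():
            proto[term] += w / len(vectors)
    return proto

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc_vec, prototypes, threshold=0.2):
    """Assign the document to every category whose prototype is close
    enough; multiple assignments are allowed."""
    return [c for c, p in prototypes.items()
            if cosine(doc_vec, p) >= threshold]
```

The centroid and the fixed threshold are the simplest options; the paper's actual weighting and decision criteria differ in detail.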
Measures such as term frequency and document frequency, instead of being taken over the whole corpus, are used within each category to select the most discriminating expressions. The category frequency, which measures the number of categories in which an expression occurs, is also used to discriminate among categories.
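The within-category selection can be illustrated in the same spirit. Again a hypothetical sketch rather than the paper's code: it weights a term by its frequency inside one category, discounted by an inverse-category-frequency factor (analogous to idf, but taken over categories rather than documents), so that expressions occurring in many categories are penalized.

```python
import math
from collections import Counter

def category_frequency(category_terms):
    """For each expression, count the number of categories whose
    training text contains it; expressions confined to few categories
    discriminate better."""
    cf = Counter()
    for terms in category_terms.values():
        for t in set(terms):
            cf[t] += 1
    return cf

def discriminating_weight(term, category, category_terms):
    """Term frequency taken within a single category, multiplied by
    log(N / cf), where N is the number of categories and cf the
    term's category frequency."""
    cf = category_frequency(category_terms)
    if not cf.get(term):
        return 0.0
    tf = category_terms[category].count(term)
    return tf * math.log(len(category_terms) / cf[term])
```

A term that occurs in every category receives weight zero, which captures the intuition in the text: frequency measures taken per category, combined with category frequency, pick out the most discriminating expressions.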
The paper includes results from experiments in which NEWPAR was applied to existing data sets. While in isolated cases NEWPAR is outperformed by one of the other algorithms to which it is compared, NEWPAR with the simple sum of weights criterion is shown to perform well in all cases.
The paper is clearly written, and can be read by anyone who has a basic understanding of vector-space methods. One issue that is not addressed is the overhead of expression extraction: since this relies on stemming and part-of-speech tagging, it may be substantial.