Computing Reviews
Automated categorization in the international patent classification
Fall C., Törcsvári A., Benzineb K., Karetka G. ACM SIGIR Forum 37(1): 10-25, 2003. Type: Article
Date Reviewed: Nov 25 2003

The availability of standard benchmarks (also known as test collections) is a key factor in the progress of disciplines such as information retrieval, which are heavily based on the experimental method. Text categorization (TC), the subfield of information retrieval concerned with automatically building text classifiers from a training set of preclassified documents, is no exception; one may observe that the explosion of TC research in the mid-1990s closely followed the appearance of TC benchmarks, such as Reuters-21578 and OHSUMED.

This paper announces the availability of WIPO-alpha, a new benchmark from the World Intellectual Property Organization (WIPO) for patent classification (the task of automatically classifying patent descriptions under a taxonomy of patent classes), and discusses TC experiments performed on this benchmark using several off-the-shelf TC packages. WIPO-alpha contains about 75,000 documents, classified under a subset of the International Patent Classification (IPC) taxonomy consisting of about 100 broad categories and 450 finer-grained ones.

The reported experiments say nothing novel about the comparative performance of different TC systems; for example, the finding that support vector machines tend to outperform the other classification methods merely confirms what is already well known in TC. Nevertheless, the discussion of the newly available test collection is interesting and worthwhile. Patent classification is an important application of TC because classification accuracy is critical in this setting. It is also a hard task: patent applicants often try to disguise the lack of novelty underlying their claimed inventions by using nonstandard language, which puts text analysis software under added strain. The paper also reveals some nonstandard aspects of patent classification that TC research had not previously considered, such as the fact that a document may have a primary category and several secondary categories; this may call for new definitions of what accuracy means.
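
To make the point about primary and secondary categories concrete, here is a minimal sketch of how accuracy could be defined in more than one way for such documents. It is an illustration only, not the measures or code from the paper under review: the function names, the choice of three variants, and the toy data (with placeholder category labels) are all assumptions made for this example.

```python
from typing import Iterable, Sequence, Set


def top_prediction_hit(ranked: Sequence[str], primary: str) -> bool:
    """Strict variant: the top-ranked guess must equal the primary category."""
    return bool(ranked) and ranked[0] == primary


def top_k_hit(ranked: Sequence[str], primary: str, k: int = 3) -> bool:
    """Relaxed variant: the primary category appears among the first k guesses."""
    return primary in ranked[:k]


def any_category_hit(ranked: Sequence[str], primary: str, secondary: Set[str]) -> bool:
    """Lenient variant: the top guess matches the primary or any secondary category."""
    return bool(ranked) and ranked[0] in ({primary} | set(secondary))


def accuracy(hits: Iterable[bool]) -> float:
    """Fraction of documents counted as correctly classified."""
    hits = list(hits)
    return sum(hits) / len(hits) if hits else 0.0


# Hypothetical toy data: (ranked predictions, primary category, secondary categories).
docs = [
    (["A01", "B02", "C03"], "A01", {"B02"}),
    (["B02", "A01", "C03"], "A01", {"B02"}),
    (["C03", "D04", "E05"], "A01", set()),
    (["D04", "C03", "A01"], "A01", {"F06"}),
]

strict = accuracy(top_prediction_hit(r, p) for r, p, _ in docs)
top3 = accuracy(top_k_hit(r, p, k=3) for r, p, _ in docs)
lenient = accuracy(any_category_hit(r, p, s) for r, p, s in docs)

print(f"strict={strict:.2f} top3={top3:.2f} lenient={lenient:.2f}")
```

On this toy data the three variants give 0.25, 0.75, and 0.50 respectively, which shows how strongly the reported accuracy of a patent classifier can depend on how the secondary categories are treated.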

The availability of this new benchmark is likely to encourage research into patent classification, and this paper will be an important reference for the field.

Reviewer: F. Sebastiani
Review #: CR128657 (0404-0477)
Information Search And Retrieval (H.3.3)
Content Analysis And Indexing (H.3.1)
