The extraction of unknown words (such as compound nouns and the names of people and organizations) is crucial to many areas of Chinese language processing, including machine translation and information retrieval. Studies of unknown word extraction in Chinese are dominated by statistical and rule-based methods. Statistical methods extract unknown words by exploring the associative relationships among the characters in a string, while rule-based methods investigate grammatical rules for unknown-word construction.
The main contribution of this paper is a criterion, which the authors call “accessor variety,” for recognizing unknown words. The paper assumes that any meaningful and widely used Chinese character string can be regarded as a word. Such a string is likely to occur in many different language environments, which are identified by the predecessors and successors of the string; a string that occurs in many environments therefore has high “accessor variety.” In addition, a set of rules called adhesive-judge rules is employed to filter out spurious extracted words whose accessors are not true words, but rather “adhesive characters,” such as tense markers.
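To make the idea concrete, the following is a minimal sketch of how such a score might be computed, not the authors' implementation. It assumes that a string's accessor variety is the smaller of its number of distinct preceding characters and its number of distinct following characters, and that all sentence boundaries count as a single accessor. The function name, the `BOUNDARY` marker, the `max_len` parameter, and the toy sentences are hypothetical; the adhesive-judge rules are not modeled.

```python
from collections import defaultdict

BOUNDARY = "<S>"  # hypothetical marker standing in for a sentence start or end


def accessor_variety(sentences, max_len=4):
    """Score every substring of up to max_len characters.

    Assumption: the score is min(distinct predecessors, distinct successors),
    with all sentence boundaries treated as one shared accessor.
    """
    left = defaultdict(set)   # substring -> set of characters seen just before it
    right = defaultdict(set)  # substring -> set of characters seen just after it
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                left[s].add(sent[i - 1] if i > 0 else BOUNDARY)
                right[s].add(sent[j] if j < n else BOUNDARY)
    return {s: min(len(left[s]), len(right[s])) for s in left}


# Toy corpus: "北京" appears in three different contexts.
sentences = ["我喜欢北京", "他去北京了", "北京很大"]
av = accessor_variety(sentences)
for candidate in ("北京", "北", "京"):
    print(candidate, av[candidate])
```

In this toy corpus the two-character string scores higher than either of its characters alone, because each single character is always bound to the same neighbor on one side. A threshold on the score would then select word candidates.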
As the paper claims, the accessor variety-based method distinguishes itself from previous work by its simplicity, while maintaining comparable performance. It also does not depend on first solving Chinese word segmentation, which is itself a difficult problem. The paper would be stronger, however, had it explained an apparent inconsistency in the partial recall figures reported in Tables 7, 8, and 9.
Those who are interested in unknown word extraction in Chinese will find that this paper offers a new perspective on the topic.