With the availability of huge amounts of electronically accessible text, real-life natural language processing (NLP) has become a major application area of machine learning (ML). However, the best-studied ML algorithms were originally designed for instances represented by flat, vector-like data configurations such as database records, while NLP usually deals with hierarchical, tree-like structures such as syntactic trees. A standard way to cope with this discrepancy is to map such complex items to vectors, discarding much of their internal structure, and apply a classical ML algorithm to these vectors; the similarity function that implicitly defines such a mapping is called a kernel. Another option is to look for ML algorithms that directly make use of the complex input structures.
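The first strategy can be illustrated with a minimal sketch (not any kernel used in the paper): a syntactic tree is flattened into a feature vector by counting the grammar productions it contains, after which any vector-based learner applies. The nested-tuple tree format and the choice of production counts as features are illustrative assumptions.

```python
from collections import Counter

def tree_features(tree):
    """Count CFG productions in a nested-tuple tree: (label, child, ...).

    Leaves are plain strings; internal nodes are tuples whose first
    element is the node label. The returned Counter is the flat
    feature vector a classical learner would consume.
    """
    counts = Counter()

    def walk(node):
        if isinstance(node, tuple):  # internal node
            label, children = node[0], node[1:]
            # Right-hand side of the production: child labels (or words).
            rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
            counts[(label, rhs)] += 1
            for c in children:
                walk(c)

    walk(tree)
    return counts

# "the cat sleeps" as (S (NP the cat) (VP sleeps))
t = ("S", ("NP", "the", "cat"), ("VP", "sleeps"))
features = tree_features(t)  # e.g. {('S', ('NP', 'VP')): 1, ...}
```

Note what the counting discards: two trees with the same multiset of productions map to the same vector even if the productions are arranged differently, which is exactly the loss of structure the review describes.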
The paper experimentally compares two supervised ML algorithms implementing these two strategies: the voted perceptron (VP) and the recursive neural network (RNN). VP is a state-of-the-art kernel-based algorithm, similar in quality to a support vector machine (SVM) but faster, while RNN is one of the few known ML algorithms whose decision making directly follows the structure of the input tree. The authors report on their experiments in wide-coverage syntactic disambiguation, a standard NLP task, and claim that the RNN outperforms the VP in most cases in terms of the quality of the results--a conclusion one would naturally expect.
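The second, structure-driven strategy can be sketched as follows: in the spirit of a recursive neural network, a vector for each tree node is computed from its children's vectors, so the computation itself follows the tree shape rather than a flattened encoding. The two-dimensional embeddings, the sum-then-tanh composition, and the toy vocabulary are illustrative assumptions, not the authors' model, which would use learned weight matrices.

```python
import math

DIM = 2
# Toy word embeddings (illustrative values).
EMBED = {"the": [0.1, 0.0], "cat": [0.0, 0.3], "sleeps": [0.2, 0.2]}

def compose(vectors):
    """Combine child vectors into a parent vector.

    Here: element-wise sum followed by tanh squashing. A real RNN
    would apply a learned weight matrix before the nonlinearity.
    """
    summed = [sum(v[i] for v in vectors) for i in range(DIM)]
    return [math.tanh(x) for x in summed]

def encode(node):
    """Recursively encode a nested-tuple tree (label, child, ...)."""
    if isinstance(node, str):      # leaf: look up the word embedding
        return EMBED[node]
    _, *children = node            # internal node: recurse on children
    return compose([encode(c) for c in children])

t = ("S", ("NP", "the", "cat"), ("VP", "sleeps"))
root_vector = encode(t)  # fixed-size vector for the whole tree
```

The point of the sketch is that no flattening step ever occurs: the order and nesting of subtrees determine the order and nesting of the vector compositions.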
However, analysis of their data shows that this is not always so, even where the authors misleadingly claim otherwise (as seems to be the case in Table 4). In addition, their results do not appear to be statistically significant enough to decisively prove the superiority of one method over the other. Though the RNN does show better results on average, the margin is very small. In spite of the authors' enthusiasm, their data show little or no significant superiority of structure-driven methods (at least the RNN) over simpler kernel-based ones. I found this fact--and not the claimed superiority of the RNN--quite surprising and instructive. Given the similar quality of the results, other properties of the two methods come into play. The RNN trains much faster than the VP, which is an advantage in wide-coverage NLP applications. On the other hand, the authors report that its learning curve is sharply peaked, which is a disadvantage for an ML method, especially one applied to NLP with its great diversity of data. It may be promising to look for a solution to this problem, or to try structure-driven methods other than the RNN.
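Why a small average margin need not be significant can be shown with a toy exact sign test on paired scores. The accuracy pairs below are made-up illustrative numbers, not the paper's data: one system wins on four of five test sets, yet the test cannot reject chance.

```python
from math import comb

def sign_test_p(scores_a, scores_b):
    """Two-sided exact sign test on paired scores (ties dropped).

    Under the null hypothesis that neither system is better, each
    non-tied pair is a fair coin flip; the p-value is the binomial
    probability of a split at least this lopsided.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)
    k = min(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

rnn_acc = [0.82, 0.79, 0.84, 0.81, 0.80]  # illustrative scores
vp_acc  = [0.81, 0.80, 0.83, 0.80, 0.79]  # illustrative scores
p = sign_test_p(rnn_acc, vp_acc)  # well above the usual 0.05 threshold
```

With only a handful of comparisons, even a consistent small edge leaves the null hypothesis standing, which is the reviewer's point about the paper's margins.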
Although the paper delivers less than its somewhat grand title suggests, it may be of interest to NLP and (to a lesser degree) ML experts and students. Besides the experimental results, it contains an extensive introduction to kernel-based and structure-based ML methods and a detailed description of both the VP and RNN algorithms. This introduction is, however, rather math-intensive and difficult to follow for a reader without a sufficient background in ML.