This is a very interesting paper on the integration of sub-symbolic and symbolic systems. One of the main features of the system it describes is its ability to learn under both unsupervised and supervised training. The authors have taken an important step in the quest for artificial intelligence: given visual and acoustic inputs, the system learns to correlate what is important in the observed sequence, and then selects an appropriate action for the perceived input signals. This was achieved in real time, with real data, on inexpensive hardware: two personal computers, Web cameras, and a microphone.
The principles are demonstrated in simple real-world scenarios: card games played with a pack of cards showing pictures of objects with different attributes (for instance, color and shape). The system uses Prolog as a high-level formalism to represent objects and relationships, while PROGOL is used for inductive learning, working directly from raw visual and acoustic data (color, shape, and single-word utterances). An attention mechanism based on motion analysis selects key frames, and the objects’ attributes are clustered using unsupervised learning, with the resulting classes denoted by attribute labels. Finally, a supervised learning method, a vector quantization-based nearest neighbor classifier, is applied to the objects’ attributes. Audio signals are processed in a similar fashion, using K-means clustering over the set of utterances. For each utterance, a symbolic data stream is created. In order to relate an object’s attributes to the uttered word, it is necessary to keep track of time. Once an utterance is classified, it is traced back to the corresponding video segment so that audio and visual symbols can be correlated. On the issue of linking perception to action, actions are defined as utterances: the system chooses to play back a video sequence showing a person speaking the word selected according to the currently perceived visual signals.
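To make the pipeline described above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that feature vectors are plain lists of numbers, that K-means (Lloyd's algorithm) builds the codebook, that each input is mapped to the index of its nearest prototype, and that an utterance is paired with the latest key frame at or before its timestamp. The function names (`kmeans`, `nearest`, `align`) and the timestamp convention are hypothetical choices made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vector):
    """Index of the prototype closest to `vector` (the vector's symbol)."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(codebook[i], vector))


def kmeans(vectors, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns k prototypes (the codebook)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(codebook, v)].append(v)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous prototype
                codebook[i] = [sum(col) / len(members)
                               for col in zip(*members)]
    return codebook


def align(utterances, key_frames):
    """Pair each (time, word_symbol) with the visual symbol of the latest
    key frame at or before that time (an assumed alignment rule)."""
    pairs = []
    for t_word, word in utterances:
        earlier = [frame for frame in key_frames if frame[0] <= t_word]
        if earlier:
            _, visual = max(earlier, key=lambda frame: frame[0])
            pairs.append((visual, word))
    return pairs
```

In this sketch, the codebook would be built once from training vectors, after which `nearest` turns each new observation into a discrete symbol. The pairs returned by `align` are the kind of correlated audio-visual symbols that a symbolic learner could then consume.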
The authors should be commended for tackling the difficult issue of symbol grounding. The burning question is how sensory projections can give rise to iconic representations, such that symbols can be attached to them, providing a semantic interpretation of the world. The paper provides a clear answer and highlights its limitations.