Image captioning is an intriguing problem in the field of computer vision: given an input image, come up with suitable concise text that verbalizes that image well. This is currently a hot topic in the context of image understanding, and leverages two advanced fields of artificial intelligence (AI): computer vision (CV) and natural language processing (NLP). This is especially challenging because, in addition to identifying what we can see in a given image, the caption also needs to capture the underlying semantic information, which is a very difficult task in itself. Among the different classes of methodologies to tackle this problem, the most promising results come from deep neural network (DNN)-based image captioning, which is in fact the focus of this survey.
The authors describe three major classes of DNN-based image captioning methods: retrieval-based, template-based, and end-to-end learning-based techniques. The first class is addressed only briefly (see Section 2), as it is rarely used anymore. The second is adequately described (Section 3), albeit with somewhat less focus than the end-to-end DNN-based techniques (Section 4), which mainly combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to generate fluent caption text for an image.
In Section 3, the authors describe the template-based methods as a two-stage pipeline: the first stage involves object detection and classification, and the second stage leverages language models (LMs) to express the detected objects in appropriate sentences. Here, the “appropriateness” of a sentence depends on “the attributes of these objects” and their relationships with the environment. For finding the initial words, CNN, region-based CNN (R-CNN), SPP-net (an improved version of R-CNN), and fast R-CNN (which is “mapped to the feature map regions of the last layer of the CNN,” leading to faster performance) are discussed. As for the LM step, the authors discuss both basic LMs and NN-based LMs, which principally leverage RNNs, bidirectional RNNs, long short-term memory (LSTM) networks, and gated recurrent units. Toward the end of this section, the authors nicely summarize the limitations and drawbacks of the template-based methods for generating captions, despite their high flexibility and the improved grammatical accuracy of their sentences. The authors’ final comment in this section nicely motivates readers to go on to learn about the end-to-end models covered in Section 4: “Although LMs based on neural networks improve this problem, further developing of image captioning is still possible by an end-to-end model.”
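To make the two-stage structure concrete, the following is a minimal sketch of a template-based pipeline. The detector is a stub standing in for an R-CNN-family model, and the sentence template is hand-written; all function names and the returned detections are illustrative assumptions, not the survey's actual system.

```python
# Stage 1 (stubbed): a detector returns (label, attribute) pairs.
# A real pipeline would run an R-CNN-style network over region proposals.
def detect_objects(image):
    return [("dog", "brown"), ("ball", "red")]

# Stage 2: slot the detected objects into a fixed sentence template.
# Real template-based systems choose wording with a language model instead.
def fill_template(detections):
    phrases = [f"a {attr} {label}" for label, attr in detections]
    if len(phrases) == 1:
        return f"There is {phrases[0]}."
    return "There is " + ", ".join(phrases[:-1]) + " and " + phrases[-1] + "."

caption = fill_template(detect_objects(None))
print(caption)  # There is a brown dog and a red ball.
```

The rigidity of the hand-written template in stage 2 is exactly the limitation the authors point out: grammaticality is easy to guarantee, but fluency and variety are not.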
In general, “the end-to-end learning model combines [deep CNN, DCNN] with deep RNN, and it makes use of the images and their corresponding captions to train the joint model directly.” The authors cover various models in depth, namely the neural image caption (NIC) model, attention mechanism-based models (and several variants thereof), and dense captioning-based methods.
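The decoding side of such a joint model can be sketched with a toy example. Here the DCNN encoder is stubbed as a fixed feature, and the learned RNN decoder is replaced by a hypothetical next-word probability table; in a real end-to-end model (for example, NIC), both parts are trained jointly on image-caption pairs, and the decoder conditions on the image feature at every step.

```python
# Stand-in for a DCNN image encoder (illustrative only).
def encode_image(image):
    return "beach-scene"

# Hypothetical next-word distributions: prev word -> [(word, prob), ...].
# A trained RNN decoder would produce these conditioned on the image feature.
NEXT_WORD = {
    "<start>": [("a", 0.9), ("the", 0.1)],
    "a": [("dog", 0.6), ("cat", 0.4)],
    "dog": [("runs", 0.7), ("<end>", 0.3)],
    "runs": [("<end>", 1.0)],
}

def greedy_decode(feature, max_len=10):
    """Greedy search: pick the most likely next word at every step."""
    word, caption = "<start>", []
    for _ in range(max_len):
        word = max(NEXT_WORD[word], key=lambda wp: wp[1])[0]
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption)

print(greedy_decode(encode_image(None)))  # a dog runs
```

Greedy search is the simplest decoding strategy; practical systems usually use beam search over the same step-wise distributions.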
The authors do not end by merely discussing the different techniques; rather, in Section 5, they thoroughly discuss “the automatic evaluation criteria for evaluating the quality and performance of [image captioning].” In particular, the authors cover both the bilingual evaluation understudy (BLEU) and METEOR metrics, which were originally developed to evaluate machine translation. They further discuss recall-oriented understudy for gisting evaluation (ROUGE), which is used for automatic summary evaluation. They also describe consensus-based image description evaluation (CIDEr) and SPICE, which are customized for captioning.
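As a small worked example of one of these metrics, the following sketches BLEU's modified unigram precision, the core idea behind BLEU: each candidate word is credited only up to its maximum count in any reference ("clipping"). Full BLEU additionally combines higher-order n-gram precisions with a brevity penalty; this fragment covers only the clipped unigram step.

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clipped unigram precision of a candidate against reference captions."""
    cand_counts = Counter(candidate.split())
    # For each word, find its maximum count over all references.
    max_ref = Counter()
    for ref in references:
        for word, cnt in Counter(ref.split()).items():
            max_ref[word] = max(max_ref[word], cnt)
    # Clip each candidate count by that maximum before summing.
    clipped = sum(min(cnt, max_ref[word]) for word, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())

p = modified_unigram_precision("the the the cat", ["the cat sat on the mat"])
print(p)  # 0.75: "the" is clipped at 2, "cat" matches once -> 3/4
```

The clipping is what prevents a degenerate caption that repeats a common reference word from scoring highly, which is why BLEU transferred naturally from machine translation to captioning.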
Finally, in Section 6, the authors discuss challenges and future developments in image captioning after nicely presenting some interesting and useful experimental results, with the goal of comparing and analyzing the models quantitatively.