Bringing emotions and sentiment to virtual (human-centric) environments such as social media further improves decision-making, learning, and communication. This task is rather complex and requires an interdisciplinary approach spanning computer science (CS), psychology, the social sciences, and cognitive science. This book focuses on the CS part and specifically concentrates on “novel methods for text-based sentiment analysis.” As an application, the authors show how these methods can be employed “to improve multimodal polarity detection and emotion recognition.” They introduce a framework for sentiment and emotion detection in opinionated videos (such as product reviews). The proposed framework combines three modalities: text (on which the authors focus the most and where they deliver their most significant contribution), video, and audio.
The book is structured in eight chapters and contains a rich list of references and an index. Chapter 1 introduces the topic, discusses the research challenges, briefly introduces the proposed framework, and summarizes the contributions of the authors presented in the book.
Chapters 2 and 3 provide the necessary background on the topics discussed in the book. The authors ground their work in the broader topic of affective computing, and discuss and compare terms such as subjectivity versus objectivity, sentiment (and polarity), and emotions. In chapter 3, they provide background on the methods, summarize available datasets in the domain, and review related recent works done using these datasets.
In summarizing the existing state of the art, these chapters are of particular interest to new doctoral students starting to work on the topic, or to practitioners who want a more in-depth understanding of the methods they apply when developing multimodal sentiment analysis applications. It is always challenging to summarize such a broad domain in a way that keeps the book accessible to various audiences. In this respect, the authors had to decide which methods and concepts to include (especially in chapter 2) and which to exclude, which means that the topics discussed are not always at the same level of abstraction or depth. Also, the terminology in chapter 2 is not always as clear as it could be. For example, I am not convinced that principal component analysis (PCA) is a feature selection method; it is a method of dimensionality reduction and feature transformation. Nor am I convinced that bagging, a method for training ensembles of models, should be discussed together with bootstrapping when talking about model validation. Nevertheless, these chapters do provide a comprehensive overview of the selected domain.
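To make the PCA distinction concrete, here is a minimal NumPy sketch (toy data and variable names are my own, not from the book): PCA produces new coordinates that are linear combinations of all input features, whereas feature selection keeps a subset of the original columns.

```python
import numpy as np

# Toy data: 50 samples, 3 features, with feature 2 correlated to feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=50)

# PCA via SVD: center the data, decompose, project onto the top-k components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_pca = Xc @ Vt[:k].T  # new coordinates: mixtures of ALL original features

# Feature *selection* instead keeps original columns, e.g. the k with
# the highest variance.
keep = np.argsort(X.var(axis=0))[-k:]
X_sel = X[:, keep]  # a subset of the raw features, untransformed

print(X_pca.shape, X_sel.shape)  # both reduce to k columns, by different means
```

Both outputs have k columns, but only `X_sel` preserves the original, interpretable features; `X_pca` has transformed them, which is why PCA is better described as dimensionality reduction than as feature selection.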
Chapters 4, 5, and 6 provide the authors’ contribution to textual sentiment and emotion analysis. In chapter 4, the authors propose a method of extracting concepts from text, which they evaluate on a small manually created dataset. In chapter 5, they describe their contribution toward creating a dictionary of terms (concepts) containing information on their polarity and emotion (EmoSenticSpace, which is publicly available, although the authors do not provide a link to download it). All the steps were evaluated, and the utility of the created EmoSenticSpace was tested on various tasks. In chapter 6, the authors propose a set of rules to analyze the sentiment of a sentence based on its structure (parse tree), as well as a neural network (NN)-based approach. These were combined and evaluated on three datasets, outperforming the referenced state of the art.
These three chapters contain (in my eyes) most of the authors’ contributions: three chapters are dedicated to textual analysis, and only one to the other modalities and their combination with text. Although the chapters logically follow each other, they can also be read separately as standalone texts.
Chapter 7 provides the authors’ contribution to multimodal sentiment analysis. The main contribution lies in evaluating various feature fusion strategies (early fusion of features versus late fusion of the trained models’ outputs, that is, decision-level fusion). The authors not only use existing feature extraction approaches for video and audio, but also propose a novel method of visual modality processing using convolutional neural networks (CNNs) in combination with recurrent neural networks (RNNs). Again, all these methods are thoroughly evaluated and compared with the performances reported in existing works. The only problem is that the selected state-of-the-art methods are from older publication years (between 2010 and 2014), which is also true for the previous chapters on textual analysis. Perhaps this is because the book summarizes the authors’ work in the domain over the past five-plus years, so it is natural that some of their older contributions would serve as a comparison.
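The early-versus-late fusion contrast can be sketched in a few lines of NumPy (this is my own illustrative toy, not the authors’ pipeline; the feature dimensions and the linear “classifiers” are stand-ins for real trained models):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Hypothetical per-modality feature matrices for n video clips.
text_feats  = rng.normal(size=(n, 8))
audio_feats = rng.normal(size=(n, 4))
video_feats = rng.normal(size=(n, 6))

def score(X, w):
    """Stand-in 'classifier': a linear score squashed to a probability."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

# Early (feature-level) fusion: concatenate modalities, apply ONE model.
fused = np.hstack([text_feats, audio_feats, video_feats])
early_pred = score(fused, rng.normal(size=fused.shape[1])) > 0.5

# Late (decision-level) fusion: one model per modality, then combine
# the per-modality decisions, e.g. by averaging their probabilities.
probs = [
    score(text_feats,  rng.normal(size=8)),
    score(audio_feats, rng.normal(size=4)),
    score(video_feats, rng.normal(size=6)),
]
late_pred = np.mean(probs, axis=0) > 0.5
```

Early fusion lets one model learn cross-modal interactions but requires aligned features; late fusion keeps the per-modality models independent and combines only their decisions.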
Chapter 8 summarizes the authors’ contributions, enumerates the limitations of the current work, and outlines the possibilities for future work.
Overall, I consider the book a useful resource for various audiences interested in the topic of multimodal sentiment analysis. It offers a thorough review of the state of the art and important domain concepts, and includes considerable contributions by the authors toward various aspects of the discussed topics. The book attests to the continuous shift of state-of-the-art approaches toward deep learning, but it also shows the utility of combining more classical approaches (for example, syntactic or semantic rules for text, or handcrafted features for machine learning) with deep learning ones, which can sometimes bring better performance or better efficiency.