Computing Reviews

Environmental audio scene and sound event recognition for autonomous surveillance: a survey and comparative studies
Chandrakala S., Jayalakshmi S.  ACM Computing Surveys 52(3): 1-34, 2019. Type: Article
Date Reviewed: 11/11/21

As my colleagues and I define in a previous paper, “environmental sound recognition (AESR) is a relatively new discipline of computer science destined to extend the field of speech-based applications, or the study of music sounds, by exploring the vast range of environmental non-speech sounds” [1]. In this paper, Chandrakala and Jayalakshmi address computational auditory scene analysis (CASA), a complex area of AESR concerned with recognizing combinations of sound sources by computational means that simulate human auditory perception. More specifically, the paper investigates environmental audio scene recognition (EASR) and sound event recognition (SER). EASR refers to the recognition of indoor or outdoor acoustic scenes, for example, cafes versus crowded or silent streets, forest landscapes, or the countryside; SER targets specific acoustic events within an audio environment, for example, a dog barking, a child crying, or gunshots.

The paper is a dense presentation of the state-of-the-art methodologies applied in the two areas. It mainly explores the types of features extracted from acoustic data and the feature space modeling approaches used in recognition systems; the specific databases needed for testing and evaluating such systems; and the most relevant studies. The section dedicated to the features used for EASR and SER tasks reviews traditional approaches, such as Mel-analysis-driven and linear-prediction-based features, as well as more refined auditory image-based features, for example, spectrogram-based representations or features tuned through learning approaches with the goal of delivering lower-dimensional, enhanced representations of the feature set. Two sections cover the simple and hybrid modeling methods and their corresponding feature sets, grouped into generative model-based, discriminative, deep learning, and hybrid approaches. The rest of the paper reviews “some of the publicly available datasets commonly used in audio scene recognition” and relevant systems and studies on EASR and SER, along with comparative performance diagrams.

The paper includes many references, which will be very useful to researchers. However, information on the preprocessing framework is missing, for example, a discussion of the framing and sub-framing steps. Another omission is the fusion of knowledge from multiple devices, an important topic because such systems are usually based on networks of sensors. Feature fusion alone may not be enough to cope with this issue.

This valuable survey brings together the latest knowledge and trends in the emerging field of AESR, systematizing the great diversity of already existing computational approaches.

[1] Segarceanu, S.; Suciu, G.; Gavat, I. Environmental acoustics modelling techniques for forest monitoring. Advances in Science, Technology and Engineering Systems Journal (ASTESJ) 6 (3): 15-26, 2021.
Reviewer: Svetlana Segarceanu
Review #: CR147383

Reproduction in whole or in part without permission is prohibited.   Copyright 2021 ComputingReviews.com™