Enabling the automatic human-level (or better) detection and classification of audio events and sound environments would be a clear plus for artificial intelligence (AI)-based applications such as robotics and social signal processing. Typical machine learning approaches to such analysis problems rely on the prior extraction of descriptive features from raw data before semantic analysis; audio-specific feature proposals abound, from frame-based mel-frequency cepstral coefficients (MFCCs) to recurrence quantification analysis (RQA) data.
This paper provides experimental evidence that accuracy gains can be expected from both aggregating short-time features and separating the event detection and classification tasks. First, a new framework for the automatic frequency-domain-based recognition of environmental sounds and a new single-channel noise reduction algorithm are introduced and used in four experiments. Experiment 1 focuses on RQA and suggests that the RQA+MFCC combination performs better than existing related approaches for scene classification on the D-CASE2013, “in-house,” and Rouen datasets. Experiment 2 reaches similar conclusions regarding aggregation for event classification. Experiment 3 addresses segmentation issues, where the goal is to detect events independently of their class. Finally, experiment 4 looks at joint detection and classification; here, aggregating some features (RQA) helped, as did noise reduction, whereas others (derivative statistics) did not help much.
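To make the aggregation idea concrete: short-time features are computed per frame, then collapsed over the frame axis into a single fixed-length clip descriptor via summary statistics. The sketch below is illustrative only, not the authors' pipeline; it uses crude log band energies as a stand-in for MFCC or RQA features, and the function names and parameters are assumptions.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    # Split a 1-D signal into overlapping short-time frames.
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def log_band_energies(frames, n_bands=13):
    # Crude per-frame spectral features (a stand-in for MFCCs):
    # power spectrum split into n_bands contiguous bands, log-summed.
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(spec, n_bands, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

def aggregate(feats):
    # Collapse the frame axis into clip-level statistics:
    # per-band mean, standard deviation, and mean absolute
    # frame-to-frame derivative (a simple "derivative statistic").
    delta = np.diff(feats, axis=0)
    return np.concatenate([feats.mean(0), feats.std(0), np.abs(delta).mean(0)])

rng = np.random.default_rng(0)
clip = rng.standard_normal(16000)          # 1 s of noise at 16 kHz, toy input
frame_feats = log_band_energies(frame_signal(clip))  # (n_frames, 13)
descriptor = aggregate(frame_feats)        # fixed-length vector, len 39
```

Whatever the frame count, the aggregated descriptor has a fixed length, which is what lets a standard classifier consume variable-duration clips.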
Overall, this rather technical paper provides some experimental motivation for additional research focusing on independent segmentation, detection, and classification of environmental sounds, in particular using the promising approach of feature aggregation. It will be of interest to researchers and advanced graduate students well versed in audio semantic analysis techniques.