The paper promises tagging (i.e., labeling, which in the audio context is referred to as “tagging”) of environmental audio using features learned by a deep network. A multi-label approach trained with a deep network is proposed. Training starts from a set of weak labels, and unlabeled data is also assigned labels.
Asymmetric deep denoising auto-encoders (asyDAEs, weights not tied between encoder and decoder) and symmetric deep denoising auto-encoders (syDAEs, weights tied between encoder and decoder) are proposed for feature learning. Reconstruction error is smaller with the syDAE. Deep neural networks (DNNs) are trained with chunk-level rather than frame-level labels. Is it at all possible to get ground truth at the frame level? Even with manual annotation, different annotators will disagree, especially at the frame level. The compression does appear good, given the small reconstruction error, but that error is part of the training criterion itself, so a small value is not surprising.
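To make the tied/untied distinction concrete, here is a minimal NumPy sketch of a single denoising auto-encoder layer (illustrative names and a one-layer toy setup of my own, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_reconstruct(x_noisy, W_enc, b_enc, b_dec, W_dec=None):
    """One denoising auto-encoder layer (toy sketch).

    Symmetric (tied) variant: the decoder reuses W_enc.T (pass W_dec=None).
    Asymmetric (untied) variant: the decoder has its own W_dec.
    """
    h = sigmoid(W_enc @ x_noisy + b_enc)          # encode the corrupted input
    W_out = W_enc.T if W_dec is None else W_dec   # tied vs. untied weights
    return sigmoid(W_out @ h + b_dec)             # decode toward the clean input

d, k = 8, 4                                       # input and hidden sizes (arbitrary)
x = rng.random(d)
x_noisy = x + 0.1 * rng.standard_normal(d)        # additive corruption
W_enc = 0.1 * rng.standard_normal((k, d))
W_dec = 0.1 * rng.standard_normal((d, k))
b_enc, b_dec = np.zeros(k), np.zeros(d)

x_sy = dae_reconstruct(x_noisy, W_enc, b_enc, b_dec)          # syDAE-style
x_asy = dae_reconstruct(x_noisy, W_enc, b_enc, b_dec, W_dec)  # asyDAE-style
```

The tied variant halves the decoder's free parameters, which is one plausible reason the syDAE reconstructs with less error here.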
The idea of deep principal component analysis (PCA) is nice. However, how does it differ from nonlinearly transforming the data and then applying standard PCA?
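For concreteness, the baseline I have in mind (a fixed nonlinearity followed by standard PCA) can be sketched as follows; this is my own illustrative code, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 6))      # 100 samples, 6 features (toy data)

def pca_scores(X, n_components):
    """Project centered data onto its top principal directions (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# "nonlinear transform, then standard PCA": the comparison the review asks about
Z = np.tanh(X)                         # any fixed elementwise nonlinearity
Y = pca_scores(Z, n_components=2)
```

The question is whether the learned, layer-wise transform in deep PCA buys anything beyond such a fixed nonlinearity before a single PCA step.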
Why is this robust? I do not see a significant improvement in equal error rate (EER) in Table 3; many of the techniques are close, so only a statistical significance test (e.g., a t-test) would be convincing. The proposed approach does seem to fare better on male speech.
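For reference, a minimal sketch of how EER is typically computed from detection scores (my own illustrative code, not the paper's evaluation script):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate
    equals the false-rejection rate, found by sweeping the threshold."""
    order = np.argsort(scores)[::-1]          # sort scores high to low
    labels = np.asarray(labels)[order]
    pos = labels.sum()
    neg = len(labels) - pos
    tp = np.cumsum(labels)                    # true positives at each threshold
    fp = np.cumsum(1 - labels)                # false positives at each threshold
    frr = 1 - tp / pos                        # missed positives (false rejection)
    far = fp / neg                            # accepted negatives (false acceptance)
    i = np.argmin(np.abs(far - frr))          # closest crossing point
    return (far[i] + frr[i]) / 2.0

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
eer = equal_error_rate(scores, labels)        # 1/3 for this toy example
```

With EERs this close across systems, per-condition confidence intervals or a paired significance test would strengthen the robustness claim.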
The formula for the F-measure is wrong; recall should be in the denominator.
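For reference, the standard F-measure, with recall appearing in the denominator:

```python
def f_measure(precision, recall):
    """F1 = 2 * P * R / (P + R); note that recall appears in the denominator."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# e.g. P = 0.8, R = 0.5  ->  F1 = 0.8 / 1.3 ~= 0.615
```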
Finally, the paper falls short in one respect: the claim of robustness rests on empirical studies alone, with no effort to understand what the syDAEs/asyDAEs are actually learning. One spectrogram shows that the noise is smoothed, but any low-pass filter would do that. I would particularly have liked an interpretation of what the DAE-DNN learns for the different types of sound. Also, in Table 3, the lowest EER is sometimes highlighted incorrectly.