The paper promises tagging (i.e., labeling, which in the audio context is referred to as “tagging”) of environmental audio using features learned by a deep network. A multi-label approach trained with a deep network is proposed. Training starts from a set of weak labels, and unlabeled data is also assigned labels.
Asymmetric deep denoising auto-encoders (asyDAEs, weights not tied between encoder and decoder) and symmetric deep denoising auto-encoders (syDAEs, weights tied between encoder and decoder) are proposed for feature learning. Reconstruction error is smaller with the syDAE. Deep neural networks (DNNs) are trained with chunk-level rather than frame-level labels. Is it at all possible to get ground truth at the frame level? Even with manual annotation, different annotators will disagree, especially at the frame level. The compression does appear good, given the small reconstruction error, but that error is part of the training criterion itself, so a small value is not surprising.
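To make the tied/untied distinction concrete, here is a minimal NumPy sketch of a single denoising auto-encoder layer (illustrative names and a one-layer toy setup of my own, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_reconstruct(x_noisy, W_enc, b_enc, b_dec, W_dec=None):
    """One denoising auto-encoder layer (toy sketch).

    Symmetric (tied) variant: the decoder reuses W_enc.T (pass W_dec=None).
    Asymmetric (untied) variant: the decoder has its own W_dec.
    """
    h = sigmoid(W_enc @ x_noisy + b_enc)          # encode the corrupted input
    W_out = W_enc.T if W_dec is None else W_dec   # tied vs. untied weights
    return sigmoid(W_out @ h + b_dec)             # decode toward the clean input

d, k = 8, 4                                       # input and hidden sizes (arbitrary)
x = rng.random(d)
x_noisy = x + 0.1 * rng.standard_normal(d)        # additive corruption
W_enc = 0.1 * rng.standard_normal((k, d))
W_dec = 0.1 * rng.standard_normal((d, k))
b_enc, b_dec = np.zeros(k), np.zeros(d)

x_sy = dae_reconstruct(x_noisy, W_enc, b_enc, b_dec)          # syDAE-style
x_asy = dae_reconstruct(x_noisy, W_enc, b_enc, b_dec, W_dec)  # asyDAE-style
```

The tied variant halves the decoder's free parameters, which is one plausible reason the syDAE reconstructs with less error here.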
The idea of deep principal component analysis (PCA) is nice. However, how does it differ from nonlinearly transforming the data and then applying standard PCA?
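For concreteness, the baseline I have in mind (a fixed nonlinearity followed by standard PCA) can be sketched as follows; this is my own illustrative code, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 6))      # 100 samples, 6 features (toy data)

def pca_scores(X, n_components):
    """Project centered data onto its top principal directions (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# "nonlinear transform, then standard PCA": the comparison the review asks about
Z = np.tanh(X)                         # any fixed elementwise nonlinearity
Y = pca_scores(Z, n_components=2)
```

The question is whether the learned, layer-wise transform in deep PCA buys anything beyond such a fixed nonlinearity before a single PCA step.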
Why is this robust? I do not see a significant improvement in equal error rate (EER) in Table 3; many of the techniques are close, so only a statistical significance test (e.g., a t-test) would be convincing. The proposed approach does seem to fare better on male speech.
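For reference, a minimal sketch of how EER is typically computed from detection scores (my own illustrative code, not the paper's evaluation script):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate
    equals the false-rejection rate, found by sweeping the threshold."""
    order = np.argsort(scores)[::-1]          # sort scores high to low
    labels = np.asarray(labels)[order]
    pos = labels.sum()
    neg = len(labels) - pos
    tp = np.cumsum(labels)                    # true positives at each threshold
    fp = np.cumsum(1 - labels)                # false positives at each threshold
    frr = 1 - tp / pos                        # missed positives (false rejection)
    far = fp / neg                            # accepted negatives (false acceptance)
    i = np.argmin(np.abs(far - frr))          # closest crossing point
    return (far[i] + frr[i]) / 2.0

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
eer = equal_error_rate(scores, labels)        # 1/3 for this toy example
```

With EERs this close across systems, per-condition confidence intervals or a paired significance test would strengthen the robustness claim.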
The formula for the F-measure is wrong; recall should be in the denominator.
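For reference, the standard F-measure, with recall appearing in the denominator:

```python
def f_measure(precision, recall):
    """F1 = 2 * P * R / (P + R); note that recall appears in the denominator."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# e.g. P = 0.8, R = 0.5  ->  F1 = 0.8 / 1.3 ~= 0.615
```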
Finally, the paper falls short in one respect: the claim of robustness rests on empirical studies alone, with no effort to understand what the syDAEs/asyDAEs are actually learning. One spectrogram shows that the noise is smoothed, but any low-pass filter would do that. I would particularly have liked an interpretation of what the DAE-DNN learns for the different types of sound. Also, in Table 3, the lowest EER is sometimes highlighted incorrectly.