Authors
Stepan Shishkin, Fraunhofer Institute for Digital Media Technology IDMT
Danilo Hollosi, University of Oldenburg, Dept. of Medical Physics and Acoustics
Simon Doclo, University of Oldenburg, Dept. of Medical Physics and Acoustics, Fraunhofer Institute for Digital Media Technology IDMT
Stefan Goetze, The University of Sheffield, Dept. of Computer Science
Read the full paper
What is this paper about?
Machine learning methods are in principle capable of detecting and distinguishing between different sounds. This task is called Sound Event Classification and usually relies on audio training material annotated by humans: annotators have to listen to example sounds and label the audio material so that the acoustic classifier can be trained to distinguish the target sound(s) from all other sounds.
Since annotating large amounts of training data is very time-consuming, Active Learning offers a way for the technical classification system and the human annotator to collaborate when only limited human effort can be spent. In Active Learning, the classifier is repeatedly retrained on an initially very small but steadily growing amount of labelled data. Trained on the initial labels, the learning method decides which of the yet unlabelled sound examples it can classify with high confidence and which ones it is unsure about. The latter are presented to the human annotator in the next labelling step. This human-machine collaboration leads to quickly increasing decision reliability even for small amounts of annotated data. The paper describes how the confidence can be estimated using Monte-Carlo Dropout.
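To make this selection step concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' implementation): dropout is kept active at inference time, several stochastic forward passes are averaged, and the predictive entropy of the averaged class probabilities serves as the uncertainty score used to pick samples for annotation. The model architecture, feature dimension, number of passes and batch size are all illustrative assumptions.

```python
# A minimal sketch of uncertainty-based sample selection with
# Monte-Carlo Dropout; layer sizes and hyperparameters are placeholders.
import torch
import torch.nn as nn

class SoundEventClassifier(nn.Module):
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.ReLU(),
            nn.Dropout(p=0.5),  # kept active at inference for MC Dropout
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_uncertainty(model, x, n_passes: int = 20):
    """Run several stochastic forward passes and return, per sample, the
    predictive entropy of the mean class distribution (higher = less confident)."""
    model.train()  # keeps dropout enabled; safe here, as no gradient step is taken
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_passes)]
        )                               # shape: (n_passes, batch, n_classes)
    mean_probs = probs.mean(dim=0)      # shape: (batch, n_classes)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy

# One Active Learning selection step: hand the least confident examples
# from the unlabelled pool to the human annotator for labelling.
model = SoundEventClassifier(n_features=64, n_classes=10)
unlabelled_pool = torch.randn(500, 64)          # stand-in for audio features
uncertainty = mc_dropout_uncertainty(model, unlabelled_pool)
to_annotate = uncertainty.topk(k=16).indices    # indices to present to the annotator
```

After the annotator labels the selected examples, they are moved into the labelled set, the classifier is retrained, and the selection step is repeated.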
Why is the research important and/or novel?
Humans constantly perceive sound events and have an astonishing ability to detect, identify and classify the sounds around them, e.g. to decide which ones are important and which are less important.
Technical sound event detection and classification systems can be used to automatically detect audio events of interest. Such acoustic events may come from a security context, e.g. breaking glass, baby cries or emergency vehicle sirens; from a health monitoring context, e.g. shouts for help or hand washing, which is relevant in the current COVID context; or from wildlife monitoring, where microphones are placed to detect rare animals by their sounds. A further context is machine monitoring, e.g. for predictive maintenance, where the target is to predict when a machine will fail.
Often, unusual sounds are of particular importance. Experienced machinery maintenance staff are often able to predict the failure of technical systems such as motors or gears long in advance just by listening for and identifying unusual sounds the machinery produces. Learning such unusual sounds is particularly difficult since example sound events for training humans or technical systems are rare. Active Learning addresses situations in which training samples are available in general, but only a few of them can be labelled by a human annotator. The described Active Learning approach is therefore highly relevant in many of the mentioned real-world applications.
Anything else that you would like to highlight about the paper?
The paper is the result of a collaboration between the Fraunhofer Institute for Digital Media Technology IDMT, The University of Oldenburg, and The University of Sheffield.
It won the Best Student Paper Award at the DCASE2021 Workshop on Detection and Classification of Acoustic Scenes and Events.