The Department of Computer Science-led MAP project has designed and developed a series of large-scale self-supervised music models, which use machine learning to generate and analyse music without relying on explicit human-labelled training data. Through self-supervised learning, the MAP models learn structure and patterns from large-scale unlabelled music data, and have demonstrated promising results on a wide range of music information retrieval tasks including genre classification, chord detection, beat tracking and more.
The project team released the MAP models (namely, Music2Vec and MERT) on HuggingFace in mid-March 2023. Since the release, the models have garnered significant attention from the community and amassed over 30,000 downloads within a month.
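For readers who want to experiment with the released checkpoints, the sketch below shows one way to load a MERT model from HuggingFace with the transformers library and extract a clip-level embedding from raw audio. The model identifier "m-a-p/MERT-v1-95M", the preprocessing steps, and the embedding size are assumptions based on the public model cards, not details from this announcement.

```python
# Minimal sketch: load a released MAP checkpoint from HuggingFace and
# turn a raw audio clip into a fixed-size embedding.
# The model id "m-a-p/MERT-v1-95M" and the 768-dim output are assumptions
# taken from the public model card; the released checkpoints may differ.
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

MODEL_ID = "m-a-p/MERT-v1-95M"  # assumed checkpoint name

model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load an audio file, downmix to mono, and resample to the model's rate.
waveform, sr = torchaudio.load("example.wav")
waveform = waveform.mean(dim=0)
waveform = torchaudio.functional.resample(waveform, sr, processor.sampling_rate)

inputs = processor(waveform.numpy(),
                   sampling_rate=processor.sampling_rate,
                   return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Average the last hidden layer over time to get one vector per clip,
# which can then feed a lightweight downstream classifier.
clip_embedding = outputs.last_hidden_state.mean(dim=1)
print(clip_embedding.shape)  # e.g. torch.Size([1, 768])
```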
The MAP project was initiated by Dr. Chenghua Lin and his PhD student Yizhi Li from the Department of Computer Science, with collaborators including Jie Fu & Ge Zhang from Beijing Academy of Artificial Intelligence (BAAI), Roger Dannenberg & Ruibin Yuan from Carnegie Mellon University (CMU), Yike Guo from Hong Kong University of Science and Technology (HKUST), and Emmanouil Benetos & Yinghao Ma from Queen Mary University of London (QMUL).
Dr. Chenghua Lin, Deputy Director of Research and REF Lead from the Department of Computer Science, said: “The MAP project is a significant step towards making high-performance computational music research affordable to academic, industrial, and consumer-grade applications. It aims to tackle the whole ecosystem of music AI, from retrieval and understanding to generation.”
The impressive performance of the MAP models demonstrates the capacity of deep learning algorithms, powered by large-scale datasets, to capture the domain knowledge behind music without expensive annotation by human experts. The MAP models can interpret information from music, ranging from high-level semantic attributes such as genre and emotion to low-level acoustic characteristics such as beat and vocal elements.
This proficiency in general music understanding is achieved through masked language modelling on vast quantities of diverse music data. More details of the models and results can be found in the paper preprint. The team has also provided an interactive demo that showcases the utility of the MAP models across a wide range of music applications.
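As a rough illustration of the idea (and not the MAP/MERT training code itself), masked modelling on audio typically turns the signal into a sequence of frames or discrete tokens, hides a random subset of positions, and trains the network to predict what was hidden. The toy sketch below captures only that objective; the vocabulary size, masking ratio, and architecture are illustrative assumptions.

```python
# Toy illustration of a masked-prediction objective on a token sequence.
# This is NOT the MAP/MERT training code; it only sketches the general
# idea of masked modelling: hide random positions, predict them back.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024   # e.g. size of a discrete audio codebook (assumed)
MASK_ID = 0         # reserved id used to blank out masked positions
SEQ_LEN, DIM = 128, 256

embed = nn.Embedding(VOCAB_SIZE, DIM)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(DIM, VOCAB_SIZE)

tokens = torch.randint(1, VOCAB_SIZE, (1, SEQ_LEN))  # stand-in for discretised audio
mask = torch.rand(1, SEQ_LEN) < 0.15                 # mask roughly 15% of positions
corrupted = tokens.masked_fill(mask, MASK_ID)

logits = head(encoder(embed(corrupted)))             # predict a token at every position
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # score only masked ones
loss.backward()
print(float(loss))
```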
To enable a more comprehensive evaluation of pre-trained music models, which is crucial for research on music representation learning, the MAP team will also release general music understanding evaluation toolkits, protocols, and leaderboards in the upcoming Marble Benchmark. The benchmark will include a large set of music understanding tasks and can be expanded to support new frontiers of computational music research.
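Benchmarks of this kind commonly evaluate a frozen pre-trained model by training a small "probe" on top of its embeddings for each downstream task. The sketch below illustrates that generic protocol with scikit-learn on a hypothetical set of pre-computed clip embeddings and genre labels; it is not the Marble Benchmark code.

```python
# Generic linear-probe protocol often used to benchmark frozen
# pre-trained music models: keep the backbone fixed, train only a small
# classifier on its embeddings for each downstream task (here, genre).
# The embeddings and labels below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Pretend these were produced by a frozen MAP model (see loading sketch above):
# one 768-dim embedding per clip and an integer genre label per clip.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))
labels = rng.integers(0, 10, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```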
In the next stage, the team aims to develop music audio generation models, targeting not only unconditional music generation but also generation controlled through multimodal interfaces, including text and image inputs.