SingVERSE: a Diverse Real-world Benchmark for Singing Voice Enhancement

Shaohan Jiang☆1, Junan Zhang☆1, Yunjia Zhang☆1, Jing Yang2, Fan Fan2, Zhizheng Wu1

☆ Equal contribution. Name order is randomly generated.

1The Chinese University of Hong Kong, Shenzhen
2Central Media Technology Institute, Huawei

Abstract

Progress in singing voice enhancement is limited by the lack of realistic evaluation data. To address this gap, we introduce SingVERSE, the first real-world benchmark for singing voice enhancement, covering diverse acoustic scenarios and providing paired, studio-quality clean references. Using SingVERSE, we conduct a comprehensive evaluation of state-of-the-art models and uncover a consistent trade-off between perceptual quality and intelligibility. Finally, we show that training on in-domain singing data substantially improves enhancement performance without degrading speech capabilities, establishing a simple yet effective path forward. This work offers the community a foundational benchmark together with critical insights to guide future advances in this underexplored domain.

Dataset Overview

[Charts: utterance count distribution; duration distribution (seconds)]

Audio Samples


Recordings were captured in diverse and challenging real-world environments.


Both professional-grade and common consumer devices were used for recording.

Dataset Statistics

All Scenarios

                    Professional    Non-Professional
Utterances          1,847           2,082
Average Duration    8.0 s           8.6 s
Total Duration      14,792 s        17,853 s
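
The reported average durations can be cross-checked against the utterance counts and total durations. The short Python sketch below does that arithmetic and also sums the two groups. It assumes that the first value in each pair belongs to the Professional group and the second to the Non-Professional group, following the order of the legend; the variable and function names are illustrative only and are not part of any SingVERSE tooling.

```python
# Figures copied from the Dataset Statistics panel ("All Scenarios").
# Assumption: the first value of each pair is "Professional" and the
# second is "Non-Professional", following the order of the legend.
stats = {
    "Professional": {"utterances": 1847, "total_seconds": 14792},
    "Non-Professional": {"utterances": 2082, "total_seconds": 17853},
}


def average_duration(group):
    """Mean utterance length in seconds for one group."""
    return group["total_seconds"] / group["utterances"]


# Round to one decimal place, matching the precision shown on the page.
averages = {name: round(average_duration(g), 1) for name, g in stats.items()}

# Combined size of both groups.
total_utterances = sum(g["utterances"] for g in stats.values())
total_seconds = sum(g["total_seconds"] for g in stats.values())
total_hours = total_seconds / 3600

print(averages)
print(total_utterances, total_seconds, round(total_hours, 2))
```

Under that assumption, the computed averages round to the 8.0 s and 8.6 s shown in the statistics, and the two groups together come to 3,929 utterances and 32,645 seconds, or roughly 9.07 hours of audio.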