28 October 2024
ManFolkSyn: an end-to-end model for Mandarin folk song singing voice synthesis
Weizhao Zhang, Yadong Zuo, Hongwu Yang
Proceedings Volume 13404, Fifth International Conference on Control, Robotics, and Intelligent System (CCRIS 2024); 1340414 (2024) https://doi.org/10.1117/12.3050029
Event: Fifth International Conference on Control, Robotics, and Intelligent System (2024), 2024, Macau, China
Abstract
Singing voice synthesis (SVS) has advanced to the point where it can produce high-quality synthetic voices from input text and musical scores. However, current SVS research focuses predominantly on pop songs, with little attention given to other musical genres. This paper presents an end-to-end folk song singing voice synthesis model to address this gap. First, we constructed a Mandarin folk song singing voice dataset named Folk107, which comprises 107 Mandarin folk songs and nursery rhymes recorded in professional settings, with a total duration of approximately three hours. Second, we developed a fully end-to-end model for Mandarin folk singing voice synthesis, named ManFolkSyn. Finally, we conducted both SVS and singing voice conversion (SVC) experiments. In the SVS experiments, the mean opinion scores (MOS) for two singers exceeded 3.60, while in the SVC experiments, similarity scores surpassed 4.00. These results demonstrate the utility of the dataset and the effectiveness of the proposed model.
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Weizhao Zhang, Yadong Zuo, and Hongwu Yang "ManFolkSyn: an end-to-end model for Mandarin folk song singing voice synthesis", Proc. SPIE 13404, Fifth International Conference on Control, Robotics, and Intelligent System (CCRIS 2024), 1340414 (28 October 2024); https://doi.org/10.1117/12.3050029
KEYWORDS
Digital signal processing
Education and training
Data modeling
Acoustics
Molybdenum
Scalable video coding
Sampling rates