SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models

Abstract

Discrete representations have shown advantages in speech generation tasks, where discrete tokens are obtained by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, directly applying speech SSL models to singing generation runs into a domain gap between speech and singing. Moreover, singing generation requires a more refined representation than ordinary speech. To address these challenges, we introduce SingOMD, a novel method for extracting singing-oriented multi-resolution discrete representations from speech SSL models. Specifically, we first adapt the features from speech SSL models through a resynthesis task, and incorporate multiple resolutions through a resampling module to better serve singing generation. These adapted multi-resolution features are then discretized via clustering. Extensive experiments demonstrate the robustness, efficiency, and effectiveness of these representations in singing vocoders and singing voice synthesis.
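The last two stages described above (resampling features to several resolutions, then discretizing each by clustering) can be illustrated with a minimal sketch. This is not the SingOMD implementation: it assumes frame-level features are already available as a `(T, D)` array, uses plain average pooling as a stand-in for the resampling module, uses a small hand-written k-means for the clustering step, and omits the resynthesis-based adaptation entirely. All function names, the resampling factors, and the cluster count are hypothetical.

```python
import numpy as np


def resample_features(feats, factor):
    """Downsample (T, D) features by average-pooling every `factor` frames.

    Assumption: average pooling stands in for the resampling module.
    """
    T, D = feats.shape
    T_trim = (T // factor) * factor  # drop trailing frames that do not fill a window
    return feats[:T_trim].reshape(T_trim // factor, factor, D).mean(axis=1)


def assign(feats, centroids):
    """Return the index of the nearest centroid for each frame."""
    dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans(feats, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, D) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(feats, centroids)
        for j in range(k):
            members = feats[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def multires_tokens(feats, factors=(1, 2, 4), k=8):
    """Map each (hypothetical) resampling factor to a discrete token sequence."""
    tokens = {}
    for factor in factors:
        resampled = resample_features(feats, factor)
        centroids = kmeans(resampled, k)
        tokens[factor] = assign(resampled, centroids)
    return tokens


if __name__ == "__main__":
    # Random features stand in for adapted SSL features.
    feats = np.random.default_rng(1).normal(size=(100, 16))
    for factor, toks in multires_tokens(feats).items():
        print(factor, toks.shape)
```

Each resolution gets its own codebook here, so token IDs are only comparable within a resolution; whether codebooks are shared across resolutions is a design choice this sketch does not settle.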

Overall workflow


Demo for Singing Resynthesis

Demo for Singing Voice Synthesis