Multi-Singer Emotional Singing Voice Synthesizer that Controls Emotional Intensity
We propose a multi-singer emotional singing voice synthesizer, Muse-SVS, that expresses emotion at various intensity levels by controlling subtle changes in pitch, energy, and phoneme duration while accurately following the score. To control multiple style attributes while avoiding the loss of fidelity and expressiveness caused by interference between attributes, Muse-SVS represents all attributes and their relations together with a joint embedding in a unified embedding space. Muse-SVS can express emotional intensity levels not included in the training data, including even stronger emotions than those in the training data, through embedding interpolation and extrapolation. We also propose a statistical pitch predictor to express pitch variance according to emotional intensity, and a context-aware residual duration predictor to prevent the accumulation of variance in phoneme duration, which is crucial for synchronization with instrumental parts. In addition, we propose a novel ASPP-Transformer, which combines atrous spatial pyramid pooling (ASPP) and a Transformer, to improve fidelity and expressiveness by referring to broad contexts. In experiments, Muse-SVS exhibited improved fidelity, expressiveness, and synchronization performance compared with the baseline models. The visualization results show that Muse-SVS effectively expresses the variance in pitch, energy, and phoneme duration according to emotional intensity. To the best of our knowledge, Muse-SVS is the first neural SVS capable of controlling emotional intensity.
[Figure] Model Overall Structure | Variance Adaptor Structure
Please listen to the samples, focusing on expressiveness and fidelity.
(Samples at other intensity levels are available here.)
Samples of a female singer (Happy)

| Model | Neutral | Happy 1.0 |
|---|---|---|
| MuSE-SVS (proposed) | (audio) | (audio) |
| MSME-VISinger (VISinger Demo) | (audio) | (audio) |
| MSME-FFTSinger | (audio) | (audio) |
| GT | (audio) | (audio) |
Samples of a male singer (Sad)

| Model | Neutral | Sad 1.0 |
|---|---|---|
| MuSE-SVS (proposed) | (audio) | (audio) |
| MSME-VISinger (VISinger Demo) | (audio) | (audio) |
| MSME-FFTSinger | (audio) | (audio) |
| GT | (audio) | (audio) |
The Statistical Pitch Predictor estimates the distribution of F0 frequencies at the phoneme level.
Please listen to the samples, focusing on the vibrato and expressiveness, which closely match the ground truth (G.T.).
The audio samples correspond to Figure 3 in the paper. See the paper for more details of the figure.
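As a rough illustration of the idea (a minimal sketch with hypothetical names, not the paper's implementation): a deterministic predictor outputs a single F0 value per phoneme, while a statistical predictor models a distribution per phoneme and samples from it, so a larger predicted variance (e.g. under stronger emotion) produces more pitch variation.

```python
import numpy as np

def deterministic_pitch(mean_f0):
    # Conventional predictor: always returns the predicted mean F0.
    return np.asarray(mean_f0, dtype=float)

def statistical_pitch(mean_f0, std_f0, rng=None):
    # Sketch: model phoneme-level F0 as a Gaussian and sample from it,
    # so pitch variance (larger std under stronger emotion) shows up
    # in the synthesized contour instead of collapsing to the mean.
    rng = np.random.default_rng(rng)
    mean = np.asarray(mean_f0, dtype=float)
    std = np.asarray(std_f0, dtype=float)
    return rng.normal(mean, std)

# Per-phoneme predicted mean F0 (Hz) and emotion-dependent std (toy values).
means = [220.0, 246.9, 261.6]
stds = [5.0, 12.0, 8.0]   # larger std = more expressive variation
f0 = statistical_pitch(means, stds, rng=0)
```

With all standard deviations set to zero, the statistical predictor degenerates to the deterministic one.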
| Model | Neutral | Sad 1.0 |
|---|---|---|
| G.T. | (audio) | (audio) |
| Deterministic Pitch Predictor (conventional) | (audio) | (audio) |
| Statistical Pitch Predictor (proposed) | (audio) | (audio) |
This figure and the audio samples demonstrate that the context-aware residual duration predictor (CRDP) exhibits significantly lower synchronization error than the baseline predictors on a long song.
Please listen to the sample, focusing on the synchronization between the synthesized voice and the MIDI accompaniment.
[Figure 7 of the paper] See the paper for more details of the figure.
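The accumulation problem above can be sketched as follows. This is our own toy illustration, not the paper's CRDP: durations are predicted as score duration plus a residual, and the running total is re-anchored to the score timeline so per-phoneme errors cannot pile up over a long song.

```python
def naive_durations(pred):
    # Baseline: use predicted durations directly, so small per-phoneme
    # errors accumulate and the voice drifts away from the accompaniment.
    return list(map(float, pred))

def residual_durations(score, residual, max_total_dev=0.05):
    # Hypothetical sketch (names and clamping rule are ours): each
    # duration is the score duration plus a predicted residual, and the
    # cumulative sung time is kept within +/- max_total_dev seconds of
    # the cumulative score time, so deviation cannot accumulate.
    out, t_sung, t_score = [], 0.0, 0.0
    for s, r in zip(score, residual):
        t_score += s
        d = s + r
        lo = (t_score - max_total_dev) - t_sung
        hi = (t_score + max_total_dev) - t_sung
        d = min(max(d, lo), hi)
        t_sung += d
        out.append(d)
    return out

score = [0.5] * 100      # 100 notes of 0.5 s each (50 s total)
residual = [0.02] * 100  # model overshoots every note by 20 ms
drift_naive = abs(sum(naive_durations([s + r for s, r in zip(score, residual)])) - 50.0)
drift_crdp = abs(sum(residual_durations(score, residual)) - 50.0)
```

The naive predictor drifts by about 2 seconds over the song, while the residual scheme stays within the allowed deviation of the score timeline.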
This figure and the audio samples demonstrate that the ASPP-Transformer refers to broader contexts than an ordinary Transformer, producing a more stable mel-spectrogram and vibrato, which leads to improved fidelity and expressiveness.
Please listen to the samples, focusing on fidelity and expressiveness.
[Figure 5 of the paper] See the paper for more details of the figure.
| ASPP-Transformer | Transformer |
|---|---|
| (audio) | (audio) |
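The "broad context" idea behind ASPP can be sketched in one dimension. This is a toy illustration under our own assumptions, not the Muse-SVS module: the same small kernel is applied at several dilation (atrous) rates in parallel, so the layer sees both local and distant frames without extra parameters.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    # 1-D convolution "with holes": taps are `dilation` samples apart,
    # so the receptive field grows with the rate at no parameter cost.
    k = len(kernel)
    pad = (k // 2) * dilation
    xp = np.pad(x, pad)
    return np.array([
        sum(kernel[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

def aspp_1d(x, kernel, rates=(1, 2, 4, 8)):
    # Sketch of atrous spatial pyramid pooling: run the same kernel at
    # several dilation rates in parallel and average the branches,
    # mixing local detail with broad context.
    return np.mean([dilated_conv1d(x, kernel, r) for r in rates], axis=0)

x = np.sin(np.linspace(0, 4 * np.pi, 64))   # toy frame-level feature
y = aspp_1d(x, kernel=[0.25, 0.5, 0.25])    # multi-scale smoothed view
```

In the actual model the pooled branches feed a Transformer block; here the average simply shows how one set of weights covers multiple context widths.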
Muse-SVS can synthesize singing voices at emotional intensities not included in the training data
via emotion embedding interpolation and extrapolation.
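The interpolation/extrapolation can be sketched as linear blending in the embedding space (a minimal illustration with toy vectors; the real embeddings are learned): intensity 0 gives the neutral embedding, 1 the full emotion embedding, 0.5 an interpolated point, and 1.7 an extrapolation beyond any training example.

```python
import numpy as np

def emotion_embedding(neutral, emotion, intensity):
    # Linear interpolation (0 <= intensity <= 1) or extrapolation
    # (intensity > 1) between the neutral and emotion embeddings.
    neutral = np.asarray(neutral, dtype=float)
    emotion = np.asarray(emotion, dtype=float)
    return neutral + intensity * (emotion - neutral)

neutral = np.zeros(4)                           # toy neutral embedding
happy = np.array([1.0, 0.5, -0.5, 2.0])         # toy "Happy 1.0" embedding
half = emotion_embedding(neutral, happy, 0.5)   # interpolation
strong = emotion_embedding(neutral, happy, 1.7) # extrapolation
```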
Please listen to the samples, focusing on expressiveness and fidelity.
[Figure 8(b) of the paper] See the paper for more details of the figure.
Embedding Space
| Emotional Intensity | Audio | Included in the training data |
|---|---|---|
| Neutral | (audio) | O |
| Happy 0.3 | (audio) | O |
| Happy 0.5 (Interpolation) | (audio) | X |
| Happy 0.7 | (audio) | O |
| Happy 1.0 | (audio) | O |
| Happy 1.7 (Extrapolation) | (audio) | X |