MaskedSpeech: Context-aware Speech Synthesis with Masking Strategy
Abstract: Many speech synthesis systems consider only the information within each sentence and ignore contextual semantic and acoustic features, which makes them inadequate for generating highly expressive paragraph-level speech. In this paper, a context-aware speech synthesis system named MaskedSpeech is proposed, which exploits both contextual semantic and acoustic features. Inspired by the masking strategy in speech editing research, the acoustic features of the current sentence are masked out, concatenated with those of the contextual speech, and used as additional model input. Furthermore, cross-utterance coarse-grained and fine-grained semantic features are employed to improve prosody generation. The model is trained to reconstruct the masked acoustic features with the augmentation of both the contextual semantic and acoustic features. Experimental results demonstrate that MaskedSpeech significantly outperforms the baseline systems in terms of naturalness and expressiveness.
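The masking-and-concatenation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of mel-spectrogram frames, and the mask value of zero are all assumptions for the sake of the example.

```python
import numpy as np

def build_masked_input(context_mel: np.ndarray,
                       current_mel: np.ndarray,
                       mask_value: float = 0.0) -> np.ndarray:
    """Mask out the current sentence's acoustic features and
    concatenate them with the contextual speech features along
    the time axis (frames x mel bins). Hypothetical helper."""
    masked_current = np.full_like(current_mel, mask_value)
    return np.concatenate([context_mel, masked_current], axis=0)

# Toy shapes: 50 context frames, 30 current frames, 80 mel bins.
context = np.random.randn(50, 80).astype(np.float32)
current = np.random.randn(30, 80).astype(np.float32)
model_input = build_masked_input(context, current)
print(model_input.shape)  # (80, 80): context frames + masked current frames
```

The model then learns to reconstruct the masked region, so at synthesis time the contextual frames condition the prosody of the generated sentence.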
Speech Demo
1. Mandarin demos
In each demo, the two sentences are separated by "**": the first is a natural recording and the second is the synthesized speech for the current sentence. The text of the current sentence is shown in red.