MaskedSpeech: Context-aware Speech Synthesis with Masking Strategy

Abstract: Many speech synthesis systems consider only the information within each sentence and ignore contextual semantic and acoustic features, which makes them inadequate for generating highly expressive paragraph-level speech. This paper proposes MaskedSpeech, a context-aware speech synthesis system that considers both contextual semantic and acoustic features. Inspired by the masking strategy used in speech editing research, the acoustic features of the current sentence are masked out, concatenated with those of the contextual speech, and used as additional model input. In addition, cross-utterance coarse-grained and fine-grained semantic features are employed to improve prosody generation. The model is trained to reconstruct the masked acoustic features, aided by both the contextual semantic and the contextual acoustic features. Experimental results demonstrate that MaskedSpeech significantly outperforms the baseline systems in terms of naturalness and expressiveness.
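The masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the zero mask value, the L1 reconstruction loss, the 80-bin mel shape, and the choice to place the context before the current sentence are all assumptions made for illustration.

```python
import numpy as np

def build_masked_input(context_mel, current_mel, mask_value=0.0):
    """Concatenate context frames with masked-out current-sentence frames.

    Both inputs are (frames, n_mels) arrays. Returns the model input and a
    boolean vector that is True for the masked (current-sentence) frames.
    The mask value of 0.0 is an assumption, not taken from the paper.
    """
    masked_current = np.full_like(current_mel, mask_value)
    model_input = np.concatenate([context_mel, masked_current], axis=0)
    mask = np.concatenate([
        np.zeros(len(context_mel), dtype=bool),
        np.ones(len(current_mel), dtype=bool),
    ])
    return model_input, mask

def masked_l1_loss(predicted, target, mask):
    """Mean absolute error over the masked frames only (assumed loss)."""
    return float(np.abs(predicted[mask] - target[mask]).mean())

# Toy data standing in for mel-spectrograms (hypothetical shapes).
rng = np.random.default_rng(0)
context = rng.normal(size=(120, 80))   # contextual speech frames
current = rng.normal(size=(90, 80))    # current sentence frames
model_input, mask = build_masked_input(context, current)
target = np.concatenate([context, current], axis=0)
```

In this sketch the context frames pass through unchanged, so a model receiving `model_input` can condition on the contextual acoustics while being scored only on the frames it has to reconstruct.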
Speech Demo

1. Mandarin demos

In each sample, the two sentences are separated by "**": the first sentence is a natural recording, and the second is the current sentence, which is synthesized. The texts of the current sentences are in red.

1. 不过既然你们痴心妄想的打算在炎城立足。**那我或许便是得教教你们,炎城的一些规矩!
FastSpeech2 model PBE-based model MaskedSpeech w/o PBE Random MaskedSpeech
2. 他能够感受到,**浑身的细胞,都是在这一刻发出了抗议的声音,
FastSpeech2 model PBE-based model MaskedSpeech w/o PBE Random MaskedSpeech
3. 然后便是凝在了药池内的一株黑色的灵药,**那灵药的模样,很像是一条小蛇盘踞。
FastSpeech2 model PBE-based model MaskedSpeech w/o PBE Random MaskedSpeech
4. 有了赤参,他的修炼应该能快一些。**离族比只有半年时间了。
FastSpeech2 model PBE-based model MaskedSpeech w/o PBE Random MaskedSpeech
5. 盯着那地上狼狈的岳山。这就败了?**这位号称炎城顶尖强者的血狼帮帮主岳山,
FastSpeech2 model PBE-based model MaskedSpeech w/o PBE Random MaskedSpeech
6. 所以,若是万金商会不插手的话。**林家与鬼刀门交手,恐怕会付出不小的代价。
FastSpeech2 model PBE-based model MaskedSpeech w/o PBE Random MaskedSpeech
7. 即便只是最低级的。切,真没见识,**那刀形灵宝虽强,但另外一件,却丝毫不会比它弱,
FastSpeech2 model PBE-based model MaskedSpeech w/o PBE Random MaskedSpeech