April 23, 2024


The Internet Generation

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Video-to-Speech (VTS) synthesis is the task of reconstructing speech signals from silent video by exploiting their bi-modal correspondences. A recent study published on arXiv.org proposes a novel multi-speaker VTS system, Voice Conversion-based Video-To-Speech (VCVTS).

Video editing. Image credit: TheArkow via Pixabay, free license

Previous methods map cropped lips directly to speech, which makes the representations learned by the model hard to interpret. The paper instead provides a more legible mapping from lips to speech. First, lips are converted to intermediate phoneme-like acoustic units. Then, the spoken content is accurately restored from those units. The system can also generate high-quality speech with flexible control of the speaker identity.
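The two-step mapping can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy codebook, the per-frame scores, and the one-dimensional speaker embedding are all hypothetical, and a real system would obtain the scores from a trained lip-reading network and pass the result to a trained decoder.

```python
from typing import List, Sequence


def predict_unit_indices(frame_scores: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring acoustic-unit index for every video frame."""
    return [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]


def build_decoder_inputs(indices: Sequence[int],
                         codebook: Sequence[Sequence[float]],
                         speaker_embedding: Sequence[float]) -> List[List[float]]:
    """Look up each unit vector and append the speaker embedding to it."""
    return [list(codebook[i]) + list(speaker_embedding) for i in indices]


# Toy values, all hypothetical: 3 acoustic units with 2-dimensional vectors.
codebook = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
# Assumed per-frame scores over the 3 units for 3 video frames.
frame_scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
# Assumed 1-dimensional speaker embedding that selects the target voice.
speaker_embedding = [0.9]

indices = predict_unit_indices(frame_scores)
decoder_inputs = build_decoder_inputs(indices, codebook, speaker_embedding)
print(indices)  # [1, 0, 2]
```

In this sketch, swapping `speaker_embedding` changes only the conditioning part of each decoder input, while the unit indices that carry the spoken content stay the same.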

Quantitative and qualitative results show that state-of-the-art performance can be achieved under both constrained and unconstrained conditions.

Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units. The Lip2Ind network can then substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations to effectively control the speaker identity of generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained vocabulary and open vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility and speaker similarity. Our demo page is released here: this https URL
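The quantization step named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the codebook and frame features are made-up toy values, the helper names are hypothetical, and the contrastive predictive coding objective used to train the encoder is omitted entirely. The sketch shows only how continuous frame features become a discrete index sequence by nearest-codeword lookup.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each frame, the index of the nearest codebook vector."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


# Toy codebook of 3 acoustic units (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
# Toy continuous features for 4 audio frames (hypothetical values).
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [0.1, 0.9]]

unit_indices = quantize(frames, codebook)
print(unit_indices)  # [0, 1, 2, 2]
```

An index sequence of this kind is what a lip-reading network such as Lip2Ind would be trained to predict from video alone, so that it can stand in for the audio content encoder at inference time.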

Research paper: Wang, D., Yang, S., Su, D., Liu, X., Yu, D., and Meng, H., “VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion”, 2022. Link: https://arxiv.org/abs/2202.09081