
Can this system run inference with only an audio prompt and text input? #121

Closed
orantake opened this issue May 13, 2023 · 5 comments
Labels
question Further information is requested

Comments

@orantake

Hi, since I get reasonable results on the LibriTTS dataset, I want this model to work on zero-shot tasks.

In the inference code, I think it needs both a text prompt and an audio prompt when giving prompts.

Could it work with only an audio prompt and text input?

@chenjiasheng
Collaborator

If you want to run inference with a target text but without an audio prompt, the answer is yes. In this case the model will generate speech with a random speaker and style.
If you want to run inference with a text-less audio prompt, I think the answer is no, because the model expects the audio prompt's transcription at the head of the text. You should transcribe the audio prompt into text in advance, either by hand or with an ASR engine.
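The requirement above can be sketched in a few lines. This is an illustrative example with hypothetical helper names, not the repo's actual API: the audio prompt must already be transcribed, and its transcription is prepended to the target text before it is fed to the model.

```python
def prepare_prompt(prompt_transcription: str, target_text: str) -> str:
    """Build the full text input: the audio prompt's transcription
    followed by the text to synthesize, as the model expects.
    (Hypothetical helper; the real code works on phoneme sequences.)"""
    return f"{prompt_transcription.strip()} {target_text.strip()}"

# The transcription could come from hand labeling or any ASR engine.
full_text = prepare_prompt("hello world", "this is a test")
```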

@chenjiasheng chenjiasheng added the question Further information is requested label May 14, 2023
@orantake
Author

Is this training system identical to the original paper? If not, I want to know the reason for the differences in training the AR system.

@chenjiasheng
Collaborator

chenjiasheng commented May 14, 2023 via email

@orantake
Author

Then this system should also work with (text input & audio prompt), I think, which is what MS applied to make the demo wave files on their official site. I know that the structure is identical.

@chenjiasheng
Collaborator

First, let's clarify the terms: by "text input" do you mean the target text only, which is "the phoneme sequence for synthesis", not including the prompt text, which is "the phoneme sequence of the enrolled recording"?

This is what the paper says in Section 4.2.1; take a look:

During inference, given an enrolled recording, we
should concatenate the phoneme sequence of the enrolled recording and the phoneme sequence for
synthesis together. Meanwhile, the acoustic token sequence of the enrolled recording is used as the
prefix in AR decoding, as formulated in equation 1. We will study the superiority of this setting in
the experiment.
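The quoted inference procedure can be sketched as follows. This is an illustrative snippet with assumed names, not the repo's actual code: the phoneme sequences are concatenated to form the text condition, and the enrolled recording's acoustic tokens become the prefix for AR decoding.

```python
from typing import List, Tuple

def prepare_inference_inputs(
    prompt_phonemes: List[str],         # phonemes of the enrolled recording
    target_phonemes: List[str],         # phonemes of the text to synthesize
    prompt_acoustic_tokens: List[int],  # codec tokens of the enrolled recording
) -> Tuple[List[str], List[int]]:
    """Per Section 4.2.1: concatenate the two phoneme sequences, and use
    the enrolled recording's acoustic tokens as the AR decoding prefix."""
    phoneme_sequence = prompt_phonemes + target_phonemes
    ar_prefix = list(prompt_acoustic_tokens)  # decoding continues after these
    return phoneme_sequence, ar_prefix
```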

However, I believe it only takes a few lines of code change to satisfy your textless prompt requirements.
