
Can this system run inference with only an audio prompt and text input? #121

Closed
orantake opened this issue May 13, 2023 · 5 comments
Labels
question Further information is requested

Comments

@orantake

Hi, since I get reasonable results on the LibriTTS dataset, I want this model to work on zero-shot tasks.

In the inference code, I think it needs both a text prompt and an audio prompt when giving prompts.

Could it work with only an audio prompt and text input?

@chenjiasheng
Collaborator

If you want to run inference with a target text but without an audio prompt, the answer is yes. In this case the model will generate speech with a random speaker and style.
If you want to run inference with a text-less audio prompt, I think the answer is no, because the model expects the audio prompt's transcription at the head of the text. You should transcribe the audio prompt into text in advance, either by hand or with an ASR engine.
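The requirement above can be sketched in a few lines. This is an illustrative example with hypothetical helper names, not the repo's actual API: the audio prompt must already be transcribed, and its transcription is prepended to the target text before it is fed to the model.

```python
def prepare_prompt(prompt_transcription: str, target_text: str) -> str:
    """Build the full text input: the audio prompt's transcription
    followed by the text to synthesize, as the model expects.
    (Hypothetical helper; the real code works on phoneme sequences.)"""
    return f"{prompt_transcription.strip()} {target_text.strip()}"

# The transcription could come from hand labeling or any ASR engine.
full_text = prepare_prompt("hello world", "this is a test")
```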

@chenjiasheng chenjiasheng added the question Further information is requested label May 14, 2023
@orantake
Author

Is this training system identical to the original paper? If not, I want to know the reason for the differences in training the AR system.

@chenjiasheng
Collaborator

chenjiasheng commented May 14, 2023 via email

@orantake
Author

Then this system should also work with (text input & audio prompt), I think, which is what MS applied to make the demo wave files on their official site. I know that the structure is identical.

@chenjiasheng
Collaborator

First, let's clarify the terms: by "text input" do you mean the target text only, which is "the phoneme sequence for synthesis", not including the prompt text, which is "the phoneme sequence of the enrolled recording"?

This is what the paper says in Section 4.2.1; take a look:

During inference, given an enrolled recording, we
should concatenate the phoneme sequence of the enrolled recording and the phoneme sequence for
synthesis together. Meanwhile, the acoustic token sequence of the enrolled recording is used as the
prefix in AR decoding, as formulated in equation 1. We will study the superiority of this setting in
the experiment.
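The quoted inference procedure can be sketched as follows. This is an illustrative snippet with assumed names, not the repo's actual code: the phoneme sequences are concatenated to form the text condition, and the enrolled recording's acoustic tokens become the prefix for AR decoding.

```python
from typing import List, Tuple

def prepare_inference_inputs(
    prompt_phonemes: List[str],         # phonemes of the enrolled recording
    target_phonemes: List[str],         # phonemes of the text to synthesize
    prompt_acoustic_tokens: List[int],  # codec tokens of the enrolled recording
) -> Tuple[List[str], List[int]]:
    """Per Section 4.2.1: concatenate the two phoneme sequences, and use
    the enrolled recording's acoustic tokens as the AR decoding prefix."""
    phoneme_sequence = prompt_phonemes + target_phonemes
    ar_prefix = list(prompt_acoustic_tokens)  # decoding continues after these
    return phoneme_sequence, ar_prefix
```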

However, I believe it only takes a few lines of code change to satisfy your textless prompt requirements.
