subhankar-ghosh committed on
Updated Feb version
- app.py +6 -8
- requirements.txt +2 -1
app.py
CHANGED
@@ -35,17 +35,15 @@ It employs a two-stage pipeline architecture: a language model generates discret
 high-fidelity audio using a neural audio codec: [NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps). \
 The model is a text-to-speech model that generates speech in 5 different English speakers - Sofia, Aria, Jason, Leo, \
 [John Van Stan](https://librivox.org/reader/9017?primary_key=9017&search_category=reader&search_page=1&search_form=get_results&search_order=alpha). \
-Each speakers can speak
+Each speakers can speak nine different languages (En, Es, De, Fr, Vi, It, Zh, Hi, Ja). The model predicts discrete audio codec tokens autoregressively using \
 a transformer encoder-decoder architecture. It employs multi-codebook prediction (typically 8 codebooks) with optional local transformer refinement \
 for high-quality audio generation, and leverages techniques like attention priors, classifier-free guidance (CFG), and Group Relative Policy \
 Optimization (GRPO) for improved alignment.
 
 ### Key Features of the model
-
-- **Multilingual Support** — Synthesizes natural speech in English, French, Spanish, German, French, Vietnamese, Italian, and Mandarin
+- **Multilingual Support** — Synthesizes natural speech in English, French, Spanish, German, French, Vietnamese, Italian, Mandarin, Hindi and Japanese.
 - **Expressive Voices** — Multiple voice options with emotional tones and gender variations including 4 proprietary voices and 1 public voice
-- **Text Normalization** — Built-in text normalization for handling numbers, abbreviations, and special characters for all languages
+- **Text Normalization** — Built-in text normalization for handling numbers, abbreviations, and special characters for all languages.
-
 ### Resources
 - 🤗 **MagpieTTS Weights**: [nvidia/magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m)
 - 🤗 **NanoCodec Weights**: [nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps)
@@ -53,11 +51,11 @@ Optimization (GRPO) for improved alignment.
 
 Note about the demo:
 - Text normalization takes time to load, so if you want faster generation select "Do not apply TN."
-- Text normalization works for En, Es, De, Fr, It, Zh,
+- Text normalization works for En, Es, De, Fr, It, Zh, Vi, Hi, Ja. Added support for Vi, Hi, Ja text normalization in V26.02 version of the checkpoint.
 - Text normalization is required for numbers to be processed and spoken.
 - When the model runs on ZeroGPU Hardware, expect slower generation because the model checkpoint is loaded everytime.
 - As the model's speakers are all native English speakers, expect accented speech in the other languages.
-- The current model
+- The current model can generate long-form speech (more than 20 seconds) in English. However, longer generations can cause timeout due to Huggingface timeout limit.
 - Loan words are not supported at this time. For example, English characters in Mandarin will lead to unexpected results.
 - For the enterprise offering, see the [MagpieTTS NIM](https://build.nvidia.com/nvidia/magpie-tts-multilingual) which includes additional native voices in the supported languages, emotional speech capabilities, and optimized batch and latency inference pipeline.
 """
@@ -130,7 +128,7 @@ demo = gr.Interface(
 fn=demo_tts,
 inputs=[gr.Textbox(label="Text to synthesize"),
 gr.Dropdown(
-choices=["en", "de", "es", "fr", "it", "vi", "zh"],
+choices=["en", "de", "es", "fr", "it", "vi", "zh", "hi", "ja"],
 label="Target Language",
 info="Select the target language for the speech to be synthesized in",
 ),
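The app.py change touches the language list in three places: the description, the text-normalization note, and the dropdown `choices`. As a minimal sketch, and not the Space's actual code, the codes could be kept in one mapping so those places cannot drift apart. The names `LANGUAGES`, `dropdown_choices`, and `validate_language` are hypothetical, and the display names are assumed from the language list in the description.

```python
# Hypothetical helper, not part of app.py: a single source of truth for the
# language codes offered by the demo after this commit.
LANGUAGES = {
    "en": "English",
    "de": "German",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "vi": "Vietnamese",
    "zh": "Mandarin",
    "hi": "Hindi",     # added in this commit
    "ja": "Japanese",  # added in this commit
}


def dropdown_choices():
    """Return the codes in display order, e.g. for a dropdown's `choices`."""
    return list(LANGUAGES)


def validate_language(code):
    """Normalize a user-supplied code and reject anything unsupported."""
    normalized = code.strip().lower()
    if normalized not in LANGUAGES:
        supported = ", ".join(LANGUAGES)
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return normalized
```

In the demo, `dropdown_choices()` would be passed as `choices=` to `gr.Dropdown`, and `validate_language` could guard the synthesis function against codes that arrive from outside the UI.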
requirements.txt
CHANGED
@@ -1,4 +1,5 @@
-nemo_toolkit[tts]@git+https://github.com/NVIDIA/NeMo.git@
+nemo_toolkit[tts]@git+https://github.com/NVIDIA/NeMo.git@b4bc9f5aea0eb68bebbff6df83472cb4740248d9
+git+https://github.com/NVIDIA/NeMo-text-processing.git@0153962265c77dbd43eab5584450954d9a4f8af0
 spaces
 gradio
 kaldialign
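The new requirements.txt pins both git dependencies to full commit hashes, which keeps rebuilds of the Space reproducible. A small check along these lines could flag git requirements that are not pinned this way. It is an illustrative sketch only: the function name `unpinned_git_requirements` and the rule "a pin is a trailing 40-character hex SHA" are assumptions, not something the commit defines.

```python
import re

# Assumption for this sketch: a git requirement is "pinned" only when it ends
# in "@" followed by a full 40-character lowercase hex commit SHA.
_PINNED_GIT = re.compile(r"git\+https?://\S+@[0-9a-f]{40}$")


def unpinned_git_requirements(text):
    """Return the git-based requirement lines that lack a full commit SHA."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "git+" in line and not _PINNED_GIT.search(line):
            unpinned.append(line)
    return unpinned


NEW_REQUIREMENTS = """\
nemo_toolkit[tts]@git+https://github.com/NVIDIA/NeMo.git@b4bc9f5aea0eb68bebbff6df83472cb4740248d9
git+https://github.com/NVIDIA/NeMo-text-processing.git@0153962265c77dbd43eab5584450954d9a4f8af0
spaces
gradio
kaldialign
"""

print(unpinned_git_requirements(NEW_REQUIREMENTS))
```

Plain package names such as `spaces` and `gradio` are ignored by this check; only lines containing `git+` are inspected.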