Spaces: nvidia/magpie_tts_multilingual_demo

subhankar-ghosh committed on Mar 3

Commit 23322e7 (unverified) · 1 parent: 0dc8871

Updated Feb version

Files changed (2)

app.py +6 -8
requirements.txt +2 -1

app.py CHANGED

@@ -35,17 +35,15 @@ It employs a two-stage pipeline architecture: a language model generates discret
 high-fidelity audio using a neural audio codec: [NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps). \
 The model is a text-to-speech model that generates speech in 5 different English speakers - Sofia, Aria, Jason, Leo, \
 [John Van Stan](https://librivox.org/reader/9017?primary_key=9017&search_category=reader&search_page=1&search_form=get_results&search_order=alpha). \
-Each speakers can speak seven different languages (En, Es, De, Fr, Vi, It, Zh). The model predicts discrete audio codec tokens autoregressively using \
+Each speakers can speak nine different languages (En, Es, De, Fr, Vi, It, Zh, Hi, Ja). The model predicts discrete audio codec tokens autoregressively using \
 a transformer encoder-decoder architecture. It employs multi-codebook prediction (typically 8 codebooks) with optional local transformer refinement \
 for high-quality audio generation, and leverages techniques like attention priors, classifier-free guidance (CFG), and Group Relative Policy \
 Optimization (GRPO) for improved alignment.
 ### Key Features of the model
-- **Multilingual Support** — Synthesizes natural speech in English, French, Spanish, German, French, Vietnamese, Italian, and Mandarin
+- **Multilingual Support** — Synthesizes natural speech in English, French, Spanish, German, French, Vietnamese, Italian, Mandarin, Hindi and Japanese.
 - **Expressive Voices** — Multiple voice options with emotional tones and gender variations including 4 proprietary voices and 1 public voice
-- **Text Normalization** — Built-in text normalization for handling numbers, abbreviations, and special characters for all languages except Vietnamese
+- **Text Normalization** — Built-in text normalization for handling numbers, abbreviations, and special characters for all languages.
 ### Resources
 - 🤗 **MagpieTTS Weights**: [nvidia/magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m)
 - 🤗 **NanoCodec Weights**: [nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps)
@@ -53,11 +51,11 @@ Optimization (GRPO) for improved alignment.
 Note about the demo:
 - Text normalization takes time to load, so if you want faster generation select "Do not apply TN."
-- Text normalization works for En, Es, De, Fr, It, Zh, but not for Vi.
+- Text normalization works for En, Es, De, Fr, It, Zh, Vi, Hi, Ja. Added support for Vi, Hi, Ja text normalization in V26.02 version of the checkpoint.
 - Text normalization is required for numbers to be processed and spoken.
 - When the model runs on ZeroGPU Hardware, expect slower generation because the model checkpoint is loaded everytime.
 - As the model's speakers are all native English speakers, expect accented speech in the other languages.
-- The current model is limited to generations of up to 20 seconds. Generations past 20s will be truncated.
+- The current model can generate long-form speech (more than 20 seconds) in English. However, longer generations can cause timeout due to Huggingface timeout limit.
 - Loan words are not supported at this time. For example, English characters in Mandarin will lead to unexpected results.
 - For the enterprise offering, see the [MagpieTTS NIM](https://build.nvidia.com/nvidia/magpie-tts-multilingual) which includes additional native voices in the supported languages, emotional speech capabilities, and optimized batch and latency inference pipeline.
 """
@@ -130,7 +128,7 @@ demo = gr.Interface(
     fn=demo_tts,
     inputs=[gr.Textbox(label="Text to synthesize"),
             gr.Dropdown(
-                choices=["en", "de", "es", "fr", "it", "vi", "zh"],
+                choices=["en", "de", "es", "fr", "it", "vi", "zh", "hi", "ja"],
                 label="Target Language",
                 info="Select the target language for the speech to be synthesized in",
                 ),
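
The last app.py hunk extends the dropdown's `choices` from seven language codes to nine. As a rough illustration of what that change amounts to, the sketch below compares the two lists and maps each code to the language name used in the "Multilingual Support" bullet. The constants and helper names (`OLD_CHOICES`, `NEW_CHOICES`, `LANGUAGE_NAMES`, `added_codes`, `validate_language`) are hypothetical and do not appear in app.py; only the code lists themselves are taken from the diff.

```python
# Language codes copied from the diff. Everything else in this block
# (constant names, helper functions) is hypothetical and is not part of app.py.
OLD_CHOICES = ["en", "de", "es", "fr", "it", "vi", "zh"]
NEW_CHOICES = ["en", "de", "es", "fr", "it", "vi", "zh", "hi", "ja"]

# Display names follow the "Multilingual Support" bullet in the description.
LANGUAGE_NAMES = {
    "en": "English",
    "de": "German",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "vi": "Vietnamese",
    "zh": "Mandarin",
    "hi": "Hindi",
    "ja": "Japanese",
}


def added_codes(old, new):
    """Return the codes present in `new` but not in `old`, in `new`'s order."""
    old_set = set(old)
    return [code for code in new if code not in old_set]


def validate_language(code, choices=NEW_CHOICES):
    """Return the display name for `code`, or raise if it is not offered."""
    if code not in choices:
        raise ValueError(f"unsupported language code: {code!r}")
    return LANGUAGE_NAMES[code]


if __name__ == "__main__":
    for code in added_codes(OLD_CHOICES, NEW_CHOICES):
        print(code, validate_language(code))
```

A check like `validate_language` is only one possible guard; a Gradio dropdown already restricts input to its `choices`, so such a helper would mainly matter if the synthesis function were called from outside the UI.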

requirements.txt CHANGED

@@ -1,4 +1,5 @@
-nemo_toolkit[tts]@git+https://github.com/NVIDIA/NeMo.git@72e2bb0f9904711ca54eb54dc72efcf9fe52752f
+nemo_toolkit[tts]@git+https://github.com/NVIDIA/NeMo.git@b4bc9f5aea0eb68bebbff6df83472cb4740248d9
+git+https://github.com/NVIDIA/NeMo-text-processing.git@0153962265c77dbd43eab5584450954d9a4f8af0
 spaces
 gradio
 kaldialign
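
The requirements.txt change moves the NeMo pin to a new commit and adds a second Git-pinned dependency, NeMo-text-processing. A minimal sketch for reading such pins out of a requirements file follows, assuming every pinned line has the shape `[name@]git+https://….git@<hex commit>` seen in this diff. The regular expression and the `parse_pin` helper are hypothetical, written only for this illustration, and do not cover pip's full requirement syntax.

```python
import re

# Matches only the two shapes seen in this diff:
#   name[extra]@git+https://host/org/repo.git@<commit>
#   git+https://host/org/repo.git@<commit>
# This is an illustrative pattern, not pip's full requirement grammar.
PIN_PATTERN = re.compile(
    r"^(?:(?P<name>[^@\s]+)@)?"
    r"git\+(?P<url>https://\S+?\.git)"
    r"@(?P<rev>[0-9a-f]{7,40})$"
)


def parse_pin(line):
    """Return (name, url, rev) for a Git-pinned line, or None otherwise."""
    match = PIN_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group("name"), match.group("url"), match.group("rev")


if __name__ == "__main__":
    new_requirements = [
        "nemo_toolkit[tts]@git+https://github.com/NVIDIA/NeMo.git@b4bc9f5aea0eb68bebbff6df83472cb4740248d9",
        "git+https://github.com/NVIDIA/NeMo-text-processing.git@0153962265c77dbd43eab5584450954d9a4f8af0",
        "spaces",
        "gradio",
        "kaldialign",
    ]
    for line in new_requirements:
        print(line, "->", parse_pin(line))
```

Unpinned entries such as `spaces` return `None`, which makes it easy to list which dependencies are locked to a specific commit and which float.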