subhankar-ghosh committed
Commit 23322e7 · unverified · 1 Parent(s): 0dc8871

Updated Feb version

Files changed (2)
  1. app.py +6 -8
  2. requirements.txt +2 -1
app.py CHANGED
@@ -35,17 +35,15 @@ It employs a two-stage pipeline architecture: a language model generates discret
  high-fidelity audio using a neural audio codec: [NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps). \
  The model is a text-to-speech model that generates speech in 5 different English speakers - Sofia, Aria, Jason, Leo, \
  [John Van Stan](https://librivox.org/reader/9017?primary_key=9017&search_category=reader&search_page=1&search_form=get_results&search_order=alpha). \
- Each speakers can speak seven different languages (En, Es, De, Fr, Vi, It, Zh). The model predicts discrete audio codec tokens autoregressively using \
+ Each speakers can speak nine different languages (En, Es, De, Fr, Vi, It, Zh, Hi, Ja). The model predicts discrete audio codec tokens autoregressively using \
  a transformer encoder-decoder architecture. It employs multi-codebook prediction (typically 8 codebooks) with optional local transformer refinement \
  for high-quality audio generation, and leverages techniques like attention priors, classifier-free guidance (CFG), and Group Relative Policy \
  Optimization (GRPO) for improved alignment.
 
  ### Key Features of the model
-
- - **Multilingual Support** — Synthesizes natural speech in English, French, Spanish, German, French, Vietnamese, Italian, and Mandarin
+ - **Multilingual Support** — Synthesizes natural speech in English, French, Spanish, German, French, Vietnamese, Italian, Mandarin, Hindi and Japanese.
  - **Expressive Voices** — Multiple voice options with emotional tones and gender variations including 4 proprietary voices and 1 public voice
- - **Text Normalization** — Built-in text normalization for handling numbers, abbreviations, and special characters for all languages except Vietnamese
-
+ - **Text Normalization** — Built-in text normalization for handling numbers, abbreviations, and special characters for all languages.
  ### Resources
  - 🤗 **MagpieTTS Weights**: [nvidia/magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m)
  - 🤗 **NanoCodec Weights**: [nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps)
@@ -53,11 +51,11 @@ Optimization (GRPO) for improved alignment.
 
  Note about the demo:
  - Text normalization takes time to load, so if you want faster generation select "Do not apply TN."
- - Text normalization works for En, Es, De, Fr, It, Zh, but not for Vi.
+ - Text normalization works for En, Es, De, Fr, It, Zh, Vi, Hi, Ja. Added support for Vi, Hi, Ja text normalization in V26.02 version of the checkpoint.
  - Text normalization is required for numbers to be processed and spoken.
  - When the model runs on ZeroGPU Hardware, expect slower generation because the model checkpoint is loaded everytime.
  - As the model's speakers are all native English speakers, expect accented speech in the other languages.
- - The current model is limited to generations of up to 20 seconds. Generations past 20s will be truncated.
+ - The current model can generate long-form speech (more than 20 seconds) in English. However, longer generations can cause timeout due to Huggingface timeout limit.
  - Loan words are not supported at this time. For example, English characters in Mandarin will lead to unexpected results.
  - For the enterprise offering, see the [MagpieTTS NIM](https://build.nvidia.com/nvidia/magpie-tts-multilingual) which includes additional native voices in the supported languages, emotional speech capabilities, and optimized batch and latency inference pipeline.
  """
@@ -130,7 +128,7 @@ demo = gr.Interface(
  fn=demo_tts,
  inputs=[gr.Textbox(label="Text to synthesize"),
  gr.Dropdown(
- choices=["en", "de", "es", "fr", "it", "vi", "zh"],
+ choices=["en", "de", "es", "fr", "it", "vi", "zh", "hi", "ja"],
  label="Target Language",
  info="Select the target language for the speech to be synthesized in",
  ),
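
Note on the app.py change: this commit updates the language list in three separate places (the description text, the text normalization note, and the dropdown `choices`). A minimal sketch of how the codes and names could be kept in one place is shown below. The names `SUPPORTED_LANGUAGES`, `DROPDOWN_CHOICES`, and `validate_language` are hypothetical and do not appear in app.py; the code-to-name pairing is inferred from the description text in the diff.

```python
# Hypothetical helper, not part of app.py: a single mapping from the
# dropdown's language codes to the language names used in the description.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "de": "German",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "vi": "Vietnamese",
    "zh": "Mandarin",
    "hi": "Hindi",     # added in this commit
    "ja": "Japanese",  # added in this commit
}

# Same order as the `choices` list in the updated gr.Dropdown.
DROPDOWN_CHOICES = list(SUPPORTED_LANGUAGES)


def validate_language(code: str) -> str:
    """Return the normalized language code, or raise if it is unsupported."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(DROPDOWN_CHOICES)
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return normalized
```

With a mapping like this, the dropdown could take `choices=DROPDOWN_CHOICES`, and a handler could call `validate_language` on its language argument before synthesis, so a future language addition would touch one dictionary rather than several strings.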
requirements.txt CHANGED
@@ -1,4 +1,5 @@
- nemo_toolkit[tts]@git+https://github.com/NVIDIA/NeMo.git@72e2bb0f9904711ca54eb54dc72efcf9fe52752f
+ nemo_toolkit[tts]@git+https://github.com/NVIDIA/NeMo.git@b4bc9f5aea0eb68bebbff6df83472cb4740248d9
+ git+https://github.com/NVIDIA/NeMo-text-processing.git@0153962265c77dbd43eab5584450954d9a4f8af0
  spaces
  gradio
  kaldialign