VibeVoice TTS

Highly expressive, long-form, multi-speaker conversational audio generation

Features

Open Source, TTS

System Requirements

16GB RAM recommended. 18GB+ storage recommended.
macOS 15+: M-series chips required.
Windows 10/11 64-bit: NVIDIA GPU with 6GB+ VRAM required.
Note: On NVIDIA GPUs, update to a recent driver version before running.

Introduction

Only the 1.5B model is downloaded by default. Selecting the 7B model starts an additional download of approximately 32GB of model files. Larger models need significantly more VRAM, so choose the model that matches your hardware.

1. What is VibeVoice?

VibeVoice is a frontier open-source Text-to-Speech (TTS) framework developed by the Microsoft research team. Unlike traditional "single-speaker, short-text" narration tools on the market, VibeVoice is an audio content creation engine specifically engineered to generate high-quality, long-form, multi-speaker conversational audio (such as AI podcasts, audiobooks, drama scripts, and multi-turn dialogues).

2. What Technologies are Under the Hood?

VibeVoice's architecture combines three components:

  • LLM Backbone: It leverages a Large Language Model (specifically the Qwen2.5 1.5B size in its public release) to comprehend the textual context, emotional dynamics, and narrative flow just like a human reader.
  • Next-Token Diffusion Framework: By integrating a lightweight diffusion-based decoding head, it predicts high-fidelity acoustic features, rendering highly realistic and detailed voice outputs.
  • Ultra-low Frame Rate Tokenizers: It introduces continuous semantic and acoustic speech tokenizers operating at a mere 7.5 Hz, significantly reducing computational overhead when processing massive context lengths.

3. Product Characteristics (Pros & Cons)

✨ Key Advantages (Pros):

  • Seamless Multi-Speaker Support: It can host up to 4 distinct speakers in a single audio episode. It automatically manages natural turn-taking and speaker consistency with high emotional expressiveness, capturing real conversational nuances such as sighs, sudden emotional changes, or even spontaneous humming.
  • Massive Long-form Generation: Breaking free from the typical limits of traditional TTS, VibeVoice handles a context length of up to 64K tokens, enabling it to synthesize a continuous, coherent audio piece lasting up to 90 minutes in a single pass.
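
The 7.5 Hz frame rate, the 90-minute limit, and the 64K context figure can be related with simple arithmetic. The sketch below is only a back-of-the-envelope check: it assumes that "64K" means 65,536 tokens and that each audio frame occupies one position in the context window, neither of which is stated on this page. The function name audio_frames is illustrative.

```python
FRAME_RATE_HZ = 7.5          # tokenizer frame rate stated on this page
MAX_MINUTES = 90             # longest single-pass audio stated on this page
CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" means 65,536 tokens


def audio_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames needed for `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


frames = audio_frames(MAX_MINUTES)
# Assumption: one frame takes one context position.
remaining = CONTEXT_TOKENS - frames

print(f"{frames} frames for {MAX_MINUTES} minutes of audio")
print(f"{remaining} context positions left for the script and voice prompts")
```

Under these assumptions, 90 minutes of audio needs 40,500 frames, which fits inside the context window with room left for the text script.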

⚠️ Potential Drawbacks (Cons):

  • Hardware Demands: Smooth local generation requires a GPU with sufficient VRAM (6GB+ on Windows, per the system requirements above). Running larger weights, such as the 7B variant, raises that requirement steeply for everyday consumers.
  • LLM "Hallucinations": Because the model relies on an LLM backbone without a rigid, traditional text normalization pipeline, it may occasionally misread rare punctuation, skip words in fast passages, or unpredictably produce faint background music or unintended breath noises.
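
One possible way to reduce punctuation misreads is to clean the script before generation. The sketch below is not part of VibeVoice: normalize_script and PUNCT_MAP are hypothetical names, and the choice of which characters to replace is an assumption about what may cause trouble, not a documented list.

```python
import re
import unicodedata

# Hypothetical mapping from typographic punctuation to plain ASCII.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "-",    # en dash and em dash
    "\u2026": "...",                 # ellipsis character
})


def normalize_script(text: str) -> str:
    """Replace unusual punctuation and tidy whitespace, keeping line breaks."""
    text = text.translate(PUNCT_MAP)
    # NFKC folds compatibility characters, e.g. non-breaking spaces.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


print(normalize_script("\u201cWait\u2026\u201d she said \u2014  really?"))
```

Line breaks are preserved because multi-speaker scripts are usually written one turn per line; only the characters inside each line are changed.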