PilotTTS

Voice cloning with text-tag control over 11 emotions, 4 paralinguistic sounds (like laughter/breathing), and 14 Chinese dialects

Features

Open Source, TTS

System Requirements

16GB RAM recommended. 20GB+ storage recommended.
macOS 15+: M-series chips required.
Windows 10/11 64-bit: NVIDIA GPU with 6GB+ VRAM required.
Note: On NVIDIA GPUs, update to a recent driver.
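The requirements above can be turned into a quick pre-install check. The following is a minimal sketch, not part of PilotTTS: the function names and thresholds-as-constants are hypothetical, it assumes `nvidia-smi` is on the PATH for the VRAM query, and it skips the RAM recommendation because the Python standard library has no portable way to read total memory.

```python
import platform
import shutil
import subprocess

MIN_VRAM_GB = 6           # Windows requirement quoted above
RECOMMENDED_DISK_GB = 20  # storage recommendation quoted above


def nvidia_vram_gb():
    """Total VRAM of the first NVIDIA GPU in GB, or None if it cannot be read."""
    cmd = ["nvidia-smi", "--query-gpu=memory.total",
           "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True, timeout=10).stdout
        return int(out.splitlines()[0].strip()) / 1024  # value is in MiB
    except (OSError, subprocess.SubprocessError, ValueError, IndexError):
        return None


def check_requirements(system, machine, mac_major, vram_gb, free_disk_gb):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if system == "Darwin":
        if machine != "arm64":
            problems.append("macOS needs an M-series chip")
        if mac_major < 15:
            problems.append("macOS 15 or later is required")
    elif system == "Windows":
        if vram_gb is None:
            problems.append("no NVIDIA GPU detected")
        elif vram_gb < MIN_VRAM_GB:
            problems.append(f"GPU has {vram_gb:.1f} GB VRAM, need {MIN_VRAM_GB}+")
    else:
        problems.append(f"{system} is not listed as supported")
    if free_disk_gb < RECOMMENDED_DISK_GB:
        problems.append(
            f"only {free_disk_gb:.0f} GB free, {RECOMMENDED_DISK_GB}+ recommended")
    return problems


if __name__ == "__main__":
    system = platform.system()
    mac_ver = platform.mac_ver()[0]  # empty string on non-macOS systems
    mac_major = int(mac_ver.split(".")[0]) if mac_ver else 0
    free_gb = shutil.disk_usage(".").free / 1024**3
    vram = nvidia_vram_gb() if system == "Windows" else None
    for problem in check_requirements(system, platform.machine(),
                                      mac_major, vram, free_gb):
        print("WARNING:", problem)
```

Keeping `check_requirements` free of system calls makes it easy to test against made-up machines. Note that `platform.machine()` reports `x86_64` on an M-series Mac when Python itself runs under Rosetta.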

Introduction

1. Background & Team

PilotTTS is an open-source, lightweight autoregressive text-to-speech (TTS) system released in late May 2026 by the Amap Voice Team (AMAPVOICE), part of AutoNavi (Amap), an Alibaba company. The team built it to meet the demand in navigation and in-car systems for high-fidelity, emotionally expressive voice assistance across many dialects, aiming for a production-ready speech synthesis system that is both fast and easy to use.

2. Key Features & Product Characteristics

For content creators and general users, PilotTTS has the following strengths and limitations:

  • Advanced Emotion & Paralinguistic Control (Pros): Instead of leaving the model to guess the tone, users insert text tags to steer generation. It supports 11 emotion categories (e.g., Happy, Sad, Serious, Concern) and lets users place 4 paralinguistic sounds (LAUGH, BREATH, CRY, COUGH) at exact positions in the text, which brings the output closer to a human voice actor.
  • Cross-Dialect Voice Cloning (Pros): It supports 14 Chinese dialects (e.g., Sichuanese, Cantonese, Northeastern). It also supports cross-dialect synthesis: given a short clip of a speaker using only Mandarin, the model can clone that voice and have it speak a local dialect, which suits social media content.
  • State-of-the-Art Zero-Shot Cloning (Pros): From a reference sample of just 3 to 5 seconds, it reproduces a voice with high similarity and high content accuracy, with few skipped or mispronounced words.
  • Limitations for General Users (Cons): The project currently prioritizes English and Chinese (including Chinese dialects), so its support for other languages is weak. Local acceleration also relies on NVIDIA GPUs: it runs smoothly on budget cards (8GB VRAM), but inference on non-NVIDIA hardware or on CPU alone is very slow.
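To make the text-tag control described above concrete, here is a small sketch of how tagged input could be split into plain text and control events. The sound names and the four emotion names come from the list above; the `[NAME]` bracket syntax and the `split_tagged_text` helper are assumptions made for illustration, since the page does not show PilotTTS's actual tag format.

```python
import re

# Names taken from the feature list; the [NAME] syntax is an assumption.
PARALINGUISTIC = {"LAUGH", "BREATH", "CRY", "COUGH"}
# Only four of the 11 emotion categories are named in the text.
EMOTIONS = {"HAPPY", "SAD", "SERIOUS", "CONCERN"}

TAG_RE = re.compile(r"\[([A-Za-z]+)\]")


def split_tagged_text(text):
    """Split text into ("text", str), ("emotion", NAME) and ("sound", NAME) items."""
    segments = []
    pos = 0
    for match in TAG_RE.finditer(text):
        name = match.group(1).upper()
        if name in EMOTIONS:
            kind = "emotion"
        elif name in PARALINGUISTIC:
            kind = "sound"
        else:
            raise ValueError(f"unknown tag: {match.group(0)}")
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append((kind, name))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


if __name__ == "__main__":
    for item in split_tagged_text("[HAPPY] We made it! [LAUGH] Turn left ahead."):
        print(item)
```

Rejecting unknown tags up front is a deliberate choice in this sketch: a misspelled tag fails loudly instead of being read aloud as literal text.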

3. Target Scenarios

PilotTTS is suited to smart travel and navigation systems, audiobook and podcast production, anime and game NPC voice acting, short-video editing for social media, and enterprise-level interactive digital humans.

4. Underlying Technology

The core innovation of PilotTTS is a "minimalist modular recipe" combined with "rigorous data engineering." Rather than scaling up parameter count, it combines well-established open-source components:

  • LLM Backbone: Utilizes Alibaba's lightweight Qwen3-0.6B (only 600 million parameters).
  • Audio Feature Extractor: Employs Meta's w2v-bert-2.0.
  • Speech Generation Backend: Integrates CosyVoice3's Conditional Flow Matching (CFM) decoder and Vocoder.

A Q-Former-based conditioning mechanism decouples speaker identity from dynamic speaking style. The team also released a fully open-source, multi-stage data processing pipeline, and reports that with careful data filtering, a model trained on just 200K hours can outperform systems trained on millions of hours.
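The general idea behind Q-Former-style conditioning can be sketched in a few lines: a fixed set of query vectors cross-attends to a variable-length sequence of audio features and returns a fixed number of conditioning vectors. The code below is a simplified illustration of that mechanism only. It uses a single attention head with no learned projections, the names `cross_attend`, `queries`, and `features` are hypothetical, and it is not PilotTTS's implementation, whose details the page does not describe.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Scaled dot-product cross-attention.

    queries:  N_q vectors of size d (learned parameters in a real model)
    features: T vectors of size d (e.g. frame-level audio features)
    returns:  N_q vectors of size d, regardless of T
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        weights = softmax(scores)
        outputs.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)]
        )
    return outputs


if __name__ == "__main__":
    queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    short_clip = [[0.1 * t, 0.2, 0.3, 0.4] for t in range(3)]
    long_clip = [[0.1 * t, 0.2, 0.3, 0.4] for t in range(50)]
    # Both clips are summarized into the same number of vectors.
    print(len(cross_attend(queries, short_clip)), len(cross_attend(queries, long_clip)))
```

The property that matters for conditioning is the fixed output size: a 3-second and a 30-second reference clip both reduce to the same number of vectors, so the downstream model sees a constant-shape input.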