Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion

The Evolution of Text-to-Speech Technology: A Deep Dive into Fish Audio’s S2-Pro

In the realm of Text-to-Speech (TTS), a notable shift is occurring from traditional modular pipelines to integrated Large Audio Models (LAMs). Fish Audio’s recent introduction of S2-Pro, the flagship model within the Fish Speech ecosystem, signifies a move towards open architectures capable of producing high-fidelity, multi-speaker synthesis with sub-150ms latency. This release sets the stage for zero-shot voice cloning and precise emotional control through a Dual-Auto-Regressive (AR) approach.

Unveiling the Dual-AR Framework and RVQ Architecture

The distinctive feature of Fish Audio S2-Pro lies in its hierarchical Dual-AR architecture. Conventional TTS models often grapple with the challenge of balancing sequence length and acoustic detail. S2-Pro tackles this dilemma by dividing the generation process into two specialized stages: a ‘Slow AR’ model and a ‘Fast AR’ model.

  • The Slow AR Model (4B Parameters): This component focuses on the time-axis, handling linguistic input processing and semantic token generation. With a substantial parameter count of around 4 billion, the Slow AR model captures long-range dependencies, prosody, and speech structural nuances.
  • The Fast AR Model (400M Parameters): Operating on the acoustic dimension, this component predicts the residual codebooks for each semantic token. Its smaller size and faster processing ensure efficient generation of high-frequency audio details like timbre, breathiness, and texture.
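The division of labor between the two stages can be sketched as a toy generation loop. Random "models" stand in for the real networks here, and `NUM_CODEBOOKS` and `VOCAB` are illustrative constants, not S2-Pro's actual sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

NUM_CODEBOOKS = 4   # RVQ depth filled in by the fast model (illustrative)
VOCAB = 32          # toy token vocabulary

def slow_ar_step(text_tokens, semantic_history):
    """'Slow' model: emits one semantic token per time step, conditioned on
    the text and all previous semantic tokens (long-range structure)."""
    return int(rng.integers(VOCAB))

def fast_ar_fill(semantic_token):
    """'Fast' model: given one semantic token, autoregressively predicts the
    residual codebook tokens that add fine acoustic detail."""
    residuals = [int(rng.integers(VOCAB)) for _ in range(NUM_CODEBOOKS - 1)]
    return [semantic_token] + residuals

def generate(text_tokens, steps=5):
    frames, semantic_history = [], []
    for _ in range(steps):
        s = slow_ar_step(text_tokens, semantic_history)
        semantic_history.append(s)
        frames.append(fast_ar_fill(s))   # one full RVQ frame per time step
    return frames

frames = generate(text_tokens=[1, 2, 3], steps=5)
```

The key point the sketch captures is that the large model runs once per time step while the small model runs once per codebook, so most of the per-frame work lands on the cheap component.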

The system relies on Residual Vector Quantization (RVQ), where raw audio is compressed into discrete tokens across multiple layers or codebooks. This approach allows the model to reconstruct high-fidelity 44.1kHz audio while maintaining a manageable token count for the Transformer architecture.
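A minimal sketch of RVQ itself, using random codebooks in place of learned ones, shows how each layer quantizes the residual left by the previous one:

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Each layer quantizes the residual left by the previous layer,
    yielding one discrete token per codebook."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest codeword
        tokens.append(idx)
        residual = residual - cb[idx]   # pass the leftover detail to the next layer
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruction is simply the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

dim, size, layers = 8, 64, 4   # toy dimensions, far smaller than a real codec
codebooks = [rng.normal(size=(size, dim)) for _ in range(layers)]
x = rng.normal(size=dim)

tokens = rvq_encode(x, codebooks)
x_hat = rvq_decode(tokens, codebooks)
```

Because each additional codebook only refines what the earlier ones missed, deeper RVQ stacks trade a modest increase in token count for progressively finer acoustic detail.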

Emotional Control through In-Context Learning and Inline Tags

Fish Audio S2-Pro achieves what its developers call ‘absurdly controllable emotion’ through two key mechanisms: zero-shot in-context learning and natural-language inline control.

In-Context Learning (ICL): Unlike older TTS systems that required explicit fine-tuning to mimic specific voices, S2-Pro leverages the Transformer’s in-context learning capability. By providing a reference audio clip (ideally 10-30 seconds long), the model extracts the speaker’s identity and emotional state, treating the reference as a prefix in its context window. This allows for seamless continuation of the ‘sequence’ in the same voice and style.

Inline Control Tags: Developers can incorporate dynamic emotional transitions within a single generation pass by inserting natural language tags directly into the text prompt. For instance, using tags like [whisper] or [laugh] allows the model to adjust pitch, intensity, and rhythm in real-time without the need for separate emotional embeddings.
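In practice, the two mechanisms combine in a single request: a reference clip for the voice plus tagged text for the emotion. The sketch below assembles such a request; the field names and structure are purely illustrative, not Fish Audio's actual API:

```python
import json

def build_tts_request(text, reference_audio_b64, fmt="wav"):
    """Pair a reference clip (for zero-shot in-context cloning) with text
    containing inline emotion tags.
    NOTE: the payload schema here is a hypothetical illustration."""
    return {
        "text": text,
        "references": [{"audio": reference_audio_b64}],
        "format": fmt,
    }

payload = build_tts_request(
    "[whisper] Don't wake the baby. [laugh] Too late!",
    reference_audio_b64="<base64-encoded 10-30s clip>",
)
print(json.dumps(payload, indent=2))
```

Note that the tags live inside the text itself, so a single generation pass can move from a whisper to a laugh without separate emotional embeddings or multiple calls.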

Performance Benchmarks and SGLang Integration

When integrating TTS into real-time applications, a critical factor is ‘Time to First Audio’ (TTFA). Fish Audio S2-Pro is optimized for sub-150ms latency, with impressive benchmarks on NVIDIA H200 hardware achieving around 100ms.
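TTFA is straightforward to measure on any streaming TTS interface: it is the wall-clock time from issuing the request to receiving the first audio chunk. The sketch below uses a simulated generator in place of a real engine:

```python
import time

def stream_tts(chunks):
    """Toy streaming generator standing in for a real TTS engine."""
    for c in chunks:
        time.sleep(0.01)   # simulated per-chunk generation time
        yield c

def time_to_first_audio(stream):
    """TTFA = elapsed time from request to the first audio chunk, in ms."""
    t0 = time.perf_counter()
    first = next(stream)
    return (time.perf_counter() - t0) * 1000, first

ttfa_ms, chunk = time_to_first_audio(stream_tts([b"\x00" * 480] * 10))
```

The metric deliberately ignores total generation time: for conversational use, what matters is how quickly playback can begin, since later chunks stream in while earlier ones play.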

  • SGLang and RadixAttention: S2-Pro is designed to work seamlessly with SGLang, a high-performance serving framework. It utilizes RadixAttention for efficient Key-Value (KV) cache management, particularly beneficial in scenarios where the same voice prompt is used repeatedly. RadixAttention caches the prefix’s KV states, eliminating the need for re-computation and reducing prefill time.
  • Multi-Speaker Single-Pass Generation: The architecture supports multiple speaker identities within the same context window, enabling the generation of complex dialogues or multi-character narrations in a single inference call. This eliminates latency associated with switching models or reloading weights for different speakers.
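The prefix-reuse idea behind RadixAttention can be illustrated with a toy cache. A real implementation stores actual KV tensors in a radix tree; here a dictionary of token prefixes and a compute counter are enough to show why repeated voice prompts get cheap:

```python
class PrefixKVCache:
    """Toy model of radix-style prefix caching: 'KV states' for a shared
    prompt prefix are computed once and reused across requests."""

    def __init__(self):
        self.cache = {}      # token-prefix tuple -> cached marker
        self.computes = 0    # total tokens 'prefilled' so far

    def prefill(self, tokens):
        tokens = tuple(tokens)
        # Find the longest already-cached prefix of this request.
        best = 0
        for k in range(len(tokens), 0, -1):
            if tokens[:k] in self.cache:
                best = k
                break
        # Only the uncached suffix needs computation.
        new = len(tokens) - best
        self.computes += new
        for k in range(best + 1, len(tokens) + 1):
            self.cache[tokens[:k]] = True
        return new

voice_prompt = list(range(100))   # tokens of the shared reference-audio prefix
cache = PrefixKVCache()
first = cache.prefill(voice_prompt + [200, 201])   # cold: whole prompt computed
second = cache.prefill(voice_prompt + [300, 301])  # warm: only the new suffix
```

On the second request the 100-token voice prefix is a cache hit, so only the two novel tokens are prefilled, which is exactly the effect that shrinks prefill time when one voice prompt serves many generations.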

Technical Implementation and Data Scaling

The Fish Speech repository offers a Python-based implementation using PyTorch. The model’s training dataset comprises over 300,000 hours of diverse multilingual audio, enabling robust performance across languages and the handling of ‘non-verbal’ vocalizations. Training proceeds in two stages:

  • VQ-GAN Training: Involves training the quantizer to map audio into a discrete latent space.
  • LLM Training: Focuses on training the Dual-AR transformers to predict latent tokens based on text and acoustic prefixes.
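The first stage can be caricatured as k-means-style codebook learning: repeatedly pulling each nearest codeword toward the frames it quantizes drives down the reconstruction error of the discrete latent. This is a toy NumPy sketch, not the actual VQ-GAN objective, which also includes adversarial and perceptual losses:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_vqgan_step(audio, codebook, lr=0.1):
    """Toy stage-1 update: assign each frame to its nearest codeword, then
    nudge that codeword toward the frame. Returns mean reconstruction error."""
    idx = np.argmin(
        np.linalg.norm(codebook[None] - audio[:, None], axis=-1), axis=1
    )
    for i, frame in zip(idx, audio):
        codebook[i] += lr * (frame - codebook[i])
    recon = codebook[idx]
    return float(np.mean((audio - recon) ** 2))

codebook = rng.normal(size=(16, 4))   # toy codebook: 16 codewords, 4 dims
audio = rng.normal(size=(32, 4))      # toy batch of 'audio frames'
losses = [train_vqgan_step(audio, codebook) for _ in range(20)]
```

Stage 2 then freezes the quantizer and trains the Dual-AR transformers as ordinary next-token predictors over these discrete codes, conditioned on text and acoustic prefixes.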

The VQ-GAN in S2-Pro is finely tuned to minimize artifacts during decoding, ensuring that even at high compression ratios, the reconstructed audio remains ‘transparent’ to the human ear.

Key Takeaways

From the Dual-AR architecture that optimizes both detail and speed to the sub-150ms latency engineered for real-time applications, Fish Audio’s S2-Pro offers a range of innovative features:

  • Dual-AR Architecture (Slow/Fast): S2-Pro divides tasks between a 4B parameter ‘Slow AR’ model and a 400M parameter ‘Fast AR’ model, enhancing both detail and speed.
  • Sub-150ms Latency: Tailored for real-time conversational AI, the model achieves a Time-to-First-Audio (TTFA) of approximately 100ms, ideal for interactive applications.
  • Hierarchical RVQ Encoding: Through Residual Vector Quantization, the system compresses audio into discrete tokens across multiple layers, reconstructing complex vocal textures efficiently.
  • Zero-Shot In-Context Learning: Developers can clone voices and emotional states by providing reference clips, eliminating the need for extensive fine-tuning.
  • RadixAttention & SGLang Integration: Leveraging RadixAttention for efficient cache management and supporting multi-speaker generation in a single pass, S2-Pro is optimized for production environments.

For further details, check out the Model Card and Repo. Stay updated by following us on Twitter and joining our ML SubReddit and Newsletter. You can also connect with us on Telegram for more interactive discussions.
