MisoTTS Text-to-Speech Model Launched by Miso Labs

Miso Labs has introduced MisoTTS, an 8-billion-parameter text-to-speech model designed to generate expressive speech from text and audio context. Released on June 4, 2026, the model uses residual vector quantization (RVQ) to widen the range of sounds it can represent without adding parameters, according to a report by Marktechpost.

What is MisoTTS and How Does It Work?

MisoTTS is an 8B-parameter text-to-speech model that uses RVQ to broaden its audio vocabulary. Inspired by the Sesame CSM architecture, it pairs a Llama 3.2-style backbone with a smaller audio decoder. The model generates Mimi audio codes conditioned on both the text and the preceding audio, which lets it respond to a speaker’s tone. The aim is to reduce the “uncanny valley” effect commonly associated with TTS models.

How Does Residual Vector Quantization Improve MisoTTS?

Residual vector quantization (RVQ) addresses the vocabulary size problem by letting MisoTTS emit a vector of indices for each audio token instead of a single token index. Each audio token comprises 32 codebook indices, so the number of distinct tokens multiplies with every added codebook while the parameter count stays the same. The result is an addressable vocabulary of approximately 10¹⁰⁵ tokens.
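
The arithmetic behind that figure can be checked in a few lines of Python. The sketch below assumes each of the 32 codebooks holds 2,048 entries, the size used by the Mimi codec; the report does not state the codebook size, so treat that number as an assumption.

```python
import math

NUM_CODEBOOKS = 32     # from the article: indices per audio token
CODEBOOK_SIZE = 2048   # assumption: entries per codebook, not stated in the article


def addressable_vocabulary(codebook_size: int, num_codebooks: int) -> int:
    """Number of distinct index tuples a single RVQ audio token can take."""
    return codebook_size ** num_codebooks


vocab = addressable_vocabulary(CODEBOOK_SIZE, NUM_CODEBOOKS)

# Entries the codebooks actually hold, summed over all codebooks.
stored_entries = CODEBOOK_SIZE * NUM_CODEBOOKS

# Express the vocabulary as mantissa x 10^exponent for readability.
log10_vocab = NUM_CODEBOOKS * math.log10(CODEBOOK_SIZE)
exponent = math.floor(log10_vocab)
mantissa = 10 ** (log10_vocab - exponent)

print(f"stored codebook entries: {stored_entries}")
print(f"addressable tokens: about {mantissa:.1f} x 10^{exponent}")
```

Under this assumption the count is 2048³² = 2³⁵², a 106-digit number on the order of 10¹⁰⁵, which is consistent with the figure quoted above. The codebooks themselves hold only 65,536 entries in total.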

What is the Architecture of MisoTTS?

MisoTTS uses a two-transformer architecture: a 7.7B-parameter backbone and a 300M-parameter decoder. The backbone predicts the first codebook index of each audio token, and the decoder then predicts the remaining indices autoregressively over depth, that is, one codebook at a time with each prediction conditioned on the indices already chosen. This split lets MisoTTS condition on both text and audio context, which improves its conversational capabilities.
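
The control flow of this two-stage scheme can be sketched in Python. This is a toy illustration, not the MisoTTS code or API: `backbone_step` and `decoder_step` are hypothetical stand-ins that return random indices where the real transformers would sample from predicted distributions, and the codebook size of 2,048 is an assumption.

```python
import random
from typing import List, Sequence

NUM_CODEBOOKS = 32     # from the article: indices per audio token
CODEBOOK_SIZE = 2048   # assumption: not stated in the article


def backbone_step(text_tokens: Sequence[int],
                  prior_frames: Sequence[Sequence[int]],
                  rng: random.Random) -> int:
    """Hypothetical stand-in for the large backbone.

    A real backbone would attend over the text and all prior audio frames;
    this stub ignores them and returns a random index for codebook 0.
    """
    return rng.randrange(CODEBOOK_SIZE)


def decoder_step(partial_frame: Sequence[int], rng: random.Random) -> int:
    """Hypothetical stand-in for the small decoder.

    A real decoder would condition on the indices already chosen for this
    frame; this stub returns a random index for the next codebook.
    """
    return rng.randrange(CODEBOOK_SIZE)


def generate_frame(text_tokens: Sequence[int],
                   prior_frames: Sequence[Sequence[int]],
                   rng: random.Random) -> List[int]:
    # Stage 1: the backbone picks the first codebook index.
    frame = [backbone_step(text_tokens, prior_frames, rng)]
    # Stage 2: the decoder fills in the rest, one codebook at a time (depth).
    while len(frame) < NUM_CODEBOOKS:
        frame.append(decoder_step(frame, rng))
    return frame


def generate(text_tokens: Sequence[int], num_frames: int, seed: int = 0) -> List[List[int]]:
    rng = random.Random(seed)
    frames: List[List[int]] = []
    for _ in range(num_frames):
        # Autoregressive over time: each new frame sees all earlier frames.
        frames.append(generate_frame(text_tokens, frames, rng))
    return frames


frames = generate(text_tokens=[101, 102, 103], num_frames=3)
print(len(frames), "frames of", len(frames[0]), "indices each")
```

The design point the sketch shows is that the expensive model runs once per audio frame, while the cheap model runs 31 times per frame to complete it.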

What Are the Strengths and Challenges of MisoTTS?

The MisoTTS model offers several strengths, including open weights available under a modified MIT license, the ability to condition on audio context, and documentation of its architecture and mathematics. However, it also presents challenges, such as its requirement for a capable CUDA GPU and the need for third-party testing to verify latency and quality claims.

Frequently Asked Questions

What is the vocabulary size problem in TTS models?

The vocabulary size problem in TTS models arises because standard transformers rely on a fixed vocabulary of discrete tokens, which is insufficient to cover the variability of human speech. MisoTTS addresses this by using RVQ to expand its audio vocabulary without increasing parameter requirements.
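
A toy example makes the mechanism concrete. The Python sketch below runs residual quantization on a 2-D vector using hand-built grid codebooks whose spacing shrinks by a factor of three at each stage. Real codecs such as Mimi learn their codebooks from data, so the grids, the four-stage depth, and all function names here are illustrative assumptions.

```python
from itertools import product
from typing import List, Tuple

Vector = Tuple[float, float]


def make_codebooks(num_stages: int, base_step: float = 1.0) -> List[List[Vector]]:
    """Toy codebooks: stage s is a 3x3 grid with spacing base_step / 3**s."""
    books = []
    for s in range(num_stages):
        step = base_step / (3 ** s)
        books.append([(a * step, b * step) for a, b in product((-1, 0, 1), repeat=2)])
    return books


def nearest(codebook: List[Vector], v: Vector) -> int:
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: (codebook[i][0] - v[0]) ** 2 + (codebook[i][1] - v[1]) ** 2)


def rvq_encode(v: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Quantize v stage by stage; each stage encodes what the previous ones missed."""
    residual = v
    indices = []
    for book in codebooks:
        i = nearest(book, residual)
        indices.append(i)
        residual = (residual[0] - book[i][0], residual[1] - book[i][1])
    return indices


def rvq_decode(indices: List[int], codebooks: List[List[Vector]]) -> Vector:
    """Reconstruct by summing the chosen codeword from every stage."""
    x = sum(book[i][0] for i, book in zip(indices, codebooks))
    y = sum(book[i][1] for i, book in zip(indices, codebooks))
    return (x, y)


codebooks = make_codebooks(num_stages=4)
target = (0.83, -0.41)
codes = rvq_encode(target, codebooks)
approx = rvq_decode(codes, codebooks)
error = max(abs(approx[0] - target[0]), abs(approx[1] - target[1]))

stored = sum(len(book) for book in codebooks)   # codewords actually stored
combos = len(codebooks[0]) ** len(codebooks)    # distinct index vectors
print(f"codes={codes} stored={stored} combos={combos} max_error={error:.4f}")
```

Four codebooks of nine entries store 36 codewords but address 6,561 distinct points, and each extra stage multiplies that count by nine while adding only nine more entries.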

Are the MisoTTS weights open?

Yes. MisoTTS is released with open weights under a modified MIT license, so developers and researchers can download and use the model subject to that license's terms.

How does MisoTTS compare in latency to other models?

Miso Labs reports a latency of 110 milliseconds for MisoTTS, compared with 700 milliseconds for ElevenLabs and 300 milliseconds for Sesame. By those figures, MisoTTS latency is roughly 6.4 times lower than ElevenLabs and 2.7 times lower than Sesame, which would benefit real-time applications. The numbers come from the vendor and still await third-party verification.