Advanced Voice Settings: Generation Speed, Stability, Style, and Similarity

When fine-tuning your AI voice settings, four key controls influence how your voice agent sounds: Generation Speed, Stability, Style, and Similarity. A fifth option, Speaker Boost, trades a little speed for added clarity. Each setting plays a distinct role in shaping the final voice output, balancing speed, accuracy, and naturalness.

Generation Speed

This setting determines how quickly the AI generates and plays back the voice. A higher value shortens response time, but at the cost of lower resolution and less accurate pronunciation and tone. A lower value takes longer to process but produces a more refined, natural-sounding voice.

Stability

The stability slider controls how consistent the voice sounds across different responses.

Higher Stability: The voice maintains a consistent tone and inflection, making it sound more robotic and controlled. This setting is useful for professional or serious use cases where predictability is important.
Lower Stability: The voice becomes more expressive, allowing for greater emotional range and variation. However, setting it too low can lead to erratic intonations, making the voice sound unpredictable or overly dramatic.

If you’re looking for a more lifelike performance, experiment with lower stability values. If consistency is your goal, keep stability higher.

Style

Every AI-generated voice starts with a baseline model that is then tuned to match the intended speaker. The style setting adjusts how closely the final voice sticks to the original AI base versus fully transforming into the target speaker’s unique vocal characteristics.

A lower style setting keeps the voice closer to the AI’s default, so it sounds more like a generic, well-rounded voice.
A higher style setting exaggerates the nuances of the trained speaker, making the voice more distinctive and personalized.

This setting doesn’t change the raw sound of the voice so much as it enhances the subtleties, such as pacing, rhythm, and expressive details, that make a voice feel unique.

Similarity

This setting determines how closely the AI sticks to the original voice within each individual generation. While stability controls how uniform the voice is across different responses, similarity ensures that within a single message, the voice remains consistent in tone and pronunciation.

Higher Similarity: Keeps the voice steady and uniform within a single response.
Lower Similarity: Allows for more fluctuation and natural variation in the way words are spoken.

For the best balance, most users find that a stability of ~50 and similarity of ~75 provide the most natural yet consistent performance. However, depending on your needs, tweaking these values can help you fine-tune the voice for a more robotic or expressive feel.
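
To make the suggested starting point concrete, here is a minimal sketch of how these values could be stored and sanity-checked before use. The VoiceSettings class, its field names, and the 0 to 100 scale are all assumptions made for illustration (the scale is inferred from the ~50 and ~75 values above); this is not an actual product API.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the sliders described above (assumed 0-100 scale)."""
    stability: int = 50          # suggested starting point from the text
    similarity: int = 75         # suggested starting point from the text
    style: int = 0               # placeholder; the text gives no recommended value
    speaker_boost: bool = False  # placeholder; the text gives no recommended value

    def __post_init__(self):
        # Reject values outside the assumed slider range.
        for name in ("stability", "similarity", "style"):
            value = getattr(self, name)
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be between 0 and 100, got {value}")


balanced = VoiceSettings()                # the suggested starting point
expressive = VoiceSettings(stability=30)  # lower stability for more emotional range
print(asdict(balanced))
```

Keeping the values in one immutable object makes it easy to name a few presets and compare them side by side.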

Speaker Boost

Speaker Boost enhances clarity by reducing noise and improving articulation, making the voice easier to understand. However, this process requires additional computation, which can slightly slow down real-time performance.

Finding Your Ideal Settings

AI voice generation is non-deterministic, meaning the same settings won’t always produce identical results. Instead, think of these sliders as guides to help shape the AI’s vocal output. A bit of experimentation goes a long way in finding the perfect balance for your specific use case.

Try out different values and generate multiple samples to see what works best!
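
Because results vary from run to run, one way to organize this experimentation is to generate several samples for each combination of values and then listen to them together. The sketch below assumes a synthesize callable that you supply yourself; the sweep function, its parameters, and fake_synthesize are hypothetical stand-ins and not part of any real API.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def sweep(
    synthesize: Callable[[str, int, int], bytes],
    text: str,
    stabilities: List[int],
    similarities: List[int],
    samples_per_setting: int = 3,
) -> Dict[Tuple[int, int], List[bytes]]:
    """Generate several samples for every (stability, similarity) pair."""
    results: Dict[Tuple[int, int], List[bytes]] = {}
    for stability, similarity in product(stabilities, similarities):
        # Repeat each setting, since identical settings can sound different.
        results[(stability, similarity)] = [
            synthesize(text, stability, similarity)
            for _ in range(samples_per_setting)
        ]
    return results


def fake_synthesize(text: str, stability: int, similarity: int) -> bytes:
    # Stand-in for a real text-to-speech call; returns a label instead of audio.
    return f"{text}|stab={stability}|sim={similarity}".encode()


samples = sweep(fake_synthesize, "Hello!", [30, 50, 70], [60, 75])
print(len(samples), "setting pairs generated")
```

In practice you would replace fake_synthesize with your own generation call and save each result to a file named after its settings.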