Hyper-Realistic Text-to-Speech: Comparing Tortoise and Bark for Voice Synthesis

If you're creating voice-enabled products, this guide will help you choose between AI models Bark and Tortoise TTS.

Text-to-speech (TTS) technology has seen rapid advances thanks to recent improvements in deep learning and generative modeling. Two models leading the pack are Bark and Tortoise TTS. Both leverage cutting-edge techniques like transformers and diffusion models to synthesize amazingly natural-sounding speech from text.

For engineers and researchers building speech-enabled products, choosing the right TTS model is now a complex endeavor given the capabilities of these new systems. While Bark and Tortoise have similar end goals, their underlying approaches differ significantly.

This article will dive deep into how Bark and Tortoise work under the hood, their respective strengths and weaknesses, and when each one is the superior choice. Whether you're developing a voice assistant, synthesizing audiobook narration, or exploring new generative frontiers in audio, understanding these models is key to success.

Subscribe or follow me on Twitter for more content like this!

By the end, you'll clearly understand which model aligns best with your needs and constraints when bringing next-gen TTS into your products. You'll also learn about some other text-to-audio models you can check out. Let's get started!

Use cases and capabilities

Let's take a high-level look at what each model can do before we get into a more detailed comparison.

All about Bark

Bark is a text-to-audio generative model created by Suno AI. It utilizes a transformer architecture to generate high-quality, realistic audio from text prompts.

Some key capabilities of Bark:

  • It can synthesize natural, human-like speech in multiple languages. This makes it suitable for voice assistant applications, audiobook narration, and more.

  • Beyond just speech, Bark can also generate music, sound effects, and other audio. This flexibility enables creative use cases like producing customized audio for videos, games, or interactive apps.

  • The model supports generating laughs, sighs, and other non-verbal sounds to make speech more natural and human-sounding. I find these really compelling: the imperfections make the speech sound much more real. Check out an example here (scroll down to "pizza.webm").

  • Bark allows control over tone, pitch, speaker identity and other attributes through text prompts. This level of control is useful for developing distinct voice personas.

  • It requires no additional data annotation beyond text transcripts. The model learns directly from text-audio pairs.

In summary, Bark is a powerful generative model capable of synthesizing high-quality speech and diverse audio entirely from text. Its flexibility enables a range of potential applications from voice assistants to audio production tools.

Note: you can use Bark to produce non-speech sounds like sound effects. This is similar to another model called AudioLDM, which we have a guide for here.

Bark's inputs and outputs

Here's a breakdown of the inputs and outputs for the Bark model implemented by Suno on Replicate.com, using data from the API spec page.

Inputs:

  • prompt (string): The input prompt that provides the initial context for the generation. The default value is "Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."

  • history_prompt (string): A choice for audio cloning history. This allows you to choose from a list of predefined speaker IDs in various languages (e.g., en_speaker_0, es_speaker_1, fr_speaker_2, etc.). This history helps the model understand the voice style to generate audio in.

  • custom_history_prompt (file): If provided, this .npz file overrides the previous history_prompt setting. You can provide your own history choice for audio cloning.

  • text_temp (number): Generation temperature for the text generation process. A higher value (e.g., 1.0) makes the output more diverse, while a lower value (e.g., 0.0) makes it more conservative. The default value is 0.7.

  • waveform_temp (number): Generation temperature for the waveform generation process. Similar to text_temp, this parameter affects the diversity of audio generation. The default value is 0.7.

  • output_full (boolean): If set to true, the model returns the full generation as a .npz file, which can be used as a history prompt for future generations.
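To get a feel for what the two temperature settings do, here is a minimal sketch of temperature-scaled sampling probabilities. This is the standard technique for generative models; that Bark applies it in exactly this form internally is an assumption on my part, and the scores below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # A temperature of exactly 0.0 is usually treated as "always pick the top score".
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for temperature in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At low temperatures nearly all of the probability lands on the top-scoring candidate, which is why low values sound more conservative. Higher values spread probability across more candidates, which gives more varied (and sometimes less stable) audio.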

Outputs:

The model's output structure is described by the following JSON schema:

{
  "type": "object",
  "title": "ModelOutput",
  "required": [
    "audio_out"
  ],
  "properties": {
    "audio_out": {
      "type": "string",
      "title": "Audio Out",
      "format": "uri"
    },
    "prompt_npz": {
      "type": "string",
      "title": "Prompt Npz",
      "format": "uri"
    }
  }
}

Some additional details you may find helpful:

  • audio_out (string): A URI that points to the generated audio file. This is the primary output of the model: the audio generated from the text prompt.

  • prompt_npz (string): A URI that points to the .npz file representing the prompt used for generating the audio. This can be useful for keeping track of the input context that led to the audio generation.

In summary, the Bark model takes input prompts, history choices, and generation temperature settings to produce audio output. The output includes a link to the generated audio file and a link to the .npz file representing the prompt.
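Here is a rough sketch of how you might assemble a Bark request and read its response in Python. The field names and defaults come from the spec above. The helper names (build_bark_input, parse_bark_output), the non-negative temperature check, and the example.com URL are my own placeholders, not part of any official client.

```python
def build_bark_input(prompt, history_prompt=None, text_temp=0.7,
                     waveform_temp=0.7, output_full=False):
    """Assemble the input dict using the field names from the spec above."""
    for name, value in (("text_temp", text_temp), ("waveform_temp", waveform_temp)):
        if value < 0:
            # Assumption: negative temperatures are never meaningful.
            raise ValueError(f"{name} must be non-negative, got {value}")
    payload = {
        "prompt": prompt,
        "text_temp": text_temp,
        "waveform_temp": waveform_temp,
        "output_full": output_full,
    }
    if history_prompt is not None:
        payload["history_prompt"] = history_prompt  # e.g. "en_speaker_0"
    return payload


def parse_bark_output(output):
    """Return (audio_uri, prompt_npz_uri); only audio_out is required by the schema."""
    if "audio_out" not in output:
        raise KeyError("audio_out is required by the output schema")
    return output["audio_out"], output.get("prompt_npz")


payload = build_bark_input("Hello! [laughs] I like pizza.", history_prompt="en_speaker_0")
print(payload)

# Placeholder response shaped like the schema; the URL is not a real output.
audio_uri, npz_uri = parse_bark_output({"audio_out": "https://example.com/audio.wav"})
print(audio_uri, npz_uri)
```

The payload would be passed as the input to whichever client you use to call the hosted model. Treating prompt_npz as optional matters because it only appears in the schema's properties, not in its required list.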

All about Tortoise TTS

Tortoise TTS is a text-to-speech model optimized for exceptionally realistic and natural-sounding voice synthesis. It was created by James Betker.

Key capabilities of Tortoise TTS:

  • It excels at cloning voices using just short audio samples of a target speaker. This makes it easy to produce speech in many distinct voices.

  • The quality of the synthesized voices is extremely high, nearly indistinguishable from human speakers. This makes Tortoise great for use cases like audiobook narration.

  • Tortoise supports fine-grained control of speech characteristics like tone, emotion, and pacing through priming text. This flexibility helps bring voices to life.

  • The model efficiently leverages smaller datasets by training an autoencoder for voice compression. Less data is needed relative to other TTS models.

  • Tortoise focuses specifically on speech synthesis. While it lacks flexibility for music or sound effects, it provides unparalleled realism for voice.

In summary, Tortoise TTS is an exceptionally high-fidelity text-to-speech model optimized for cloning voices and narrating long-form speech content like books or articles. The quality and control it provides over voice synthesis make Tortoise suitable for a range of applications from virtual assistants to audiobook creation. You can even use Tortoise to create voice clones of celebrities like Barack Obama, Donald Trump, Walter White, Tony Stark, and more!

Tortoise TTS's inputs and outputs

Here's an overview of the inputs and outputs for the Tortoise model, again looking at the implementation on Replicate, this time by creator afiaka87.

Inputs:

  • text (string): The text input that you want the model to generate speech for. The default value is "The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them." I am not sure why that's the default... but it is!

  • voice_a (string): Selects the primary voice to use for speech generation. You can choose from a list of predefined voices (e.g., angie, deniro, halle, etc.), or use special values like random, custom_voice, or disabled. The default value is random, which selects a random voice.

  • custom_voice (file): If provided, this file should contain an mp3 audio of a speaker that you want to use as a custom voice. The audio should be at least 15 seconds long and only contain one speaker. This parameter overrides the voice_a setting.

  • voice_b and voice_c (strings, optional): These parameters allow you to create a new voice by averaging the latents of the selected voices. You can choose from predefined voices or use disabled to disable voice mixing. These parameters are optional.

  • preset (string): Specifies a voice preset to use for generation. The preset determines the quality and speed of the generated speech. Allowed values are ultra_fast, fast, standard, and high_quality. The default value is fast.

  • seed (integer): A random seed that can be used to reproduce results. The default value is 0.

  • cvvp_amount (number): Determines how much the CVVP (Contrastive Voice-Voice Pretraining) model should influence the output. Increasing this value can reduce the likelihood of multiple speakers in the generated speech. The default value is 0 (disabled).
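The parameters above can be collected into a request like the sketch below. The field names, allowed presets, and defaults for voice_a, preset, seed, and cvvp_amount come from the list. The helper name build_tortoise_input, the "disabled" defaults for voice_b and voice_c, and setting voice_a to custom_voice whenever a file is supplied are my assumptions, so check them against the API spec before relying on them.

```python
ALLOWED_PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def build_tortoise_input(text, voice_a="random", voice_b="disabled",
                         voice_c="disabled", preset="fast", seed=0,
                         cvvp_amount=0.0, custom_voice=None):
    """Assemble a Tortoise input dict, validating the preset name."""
    if preset not in ALLOWED_PRESETS:
        raise ValueError(f"preset must be one of {ALLOWED_PRESETS}, got {preset!r}")
    payload = {
        "text": text,
        "voice_a": voice_a,
        "voice_b": voice_b,   # assumption: "disabled" means no voice mixing
        "voice_c": voice_c,
        "preset": preset,
        "seed": seed,
        "cvvp_amount": cvvp_amount,
    }
    if custom_voice is not None:
        # custom_voice overrides voice_a; selecting the special "custom_voice"
        # value here as well is an assumption about how the two interact.
        payload["custom_voice"] = custom_voice
        payload["voice_a"] = "custom_voice"
    return payload


request = build_tortoise_input("Chapter one. It was a quiet morning.",
                               voice_a="angie", preset="standard")
print(request)
```

Validating the preset locally is a small convenience: a typo fails immediately instead of after a round trip to the hosted model.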

Output:

The model's output structure is described by the following JSON schema:

{
  "type": "string",
  "title": "Output",
  "format": "uri"
}

The output is a URI (Uniform Resource Identifier) that points to the generated speech audio file. This audio file represents the synthesized speech based on the provided input text and voice settings.

In summary, the Tortoise TTS model takes input text, voice selections, preset options, and other parameters to generate speech. The output is a URI pointing to the audio file containing the generated speech.
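One practical wrinkle: Tortoise returns a bare URI string, while Bark returns an object with an audio_out field. If your product calls both, a small normalizing helper keeps the rest of your code simple. This is a sketch with hypothetical helper names and placeholder URLs, using only the Python standard library.

```python
import posixpath
from urllib.parse import urlparse


def audio_uri_from_output(output):
    """Return the audio URI from either model's output shape."""
    if isinstance(output, str):       # Tortoise: the output is the URI itself
        uri = output
    elif isinstance(output, dict):    # Bark: the URI lives under "audio_out"
        uri = output["audio_out"]
    else:
        raise TypeError(f"unexpected output type: {type(output).__name__}")
    if not urlparse(uri).scheme:
        raise ValueError(f"not an absolute URI: {uri!r}")
    return uri


def filename_from_uri(uri, fallback="output.wav"):
    """Pick a local filename from the last path segment of the URI."""
    name = posixpath.basename(urlparse(uri).path)
    return name or fallback


# Placeholder outputs; these URLs are illustrative only.
tortoise_output = "https://example.com/files/speech.mp3"
bark_output = {"audio_out": "https://example.com/files/audio.wav"}

for output in (tortoise_output, bark_output):
    uri = audio_uri_from_output(output)
    print(uri, filename_from_uri(uri))
```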

Comparing Bark and Tortoise TTS

Now that we've seen what kind of inputs and outputs the models work with, let's take a comparative look across a number of different dimensions:

  • Architecture

  • Voice customization

  • Supported languages and accents

  • Output quality

By the end of the article, you'll understand when to use Bark and when to use Tortoise. We'll also look at some other models you may want to check out, so you can find the proper fit for your use case.

Model Architecture

The architecture used by a TTS model impacts what it can generate and how well it performs. Understanding these technical differences helps interpret the strengths and limitations of building products with them.

The key differences between Bark and Tortoise:

  • Bark uses a flexible transformer architecture that can generate diverse sounds like music. But it requires huge amounts of varied training data.

  • Tortoise uses custom components focused specifically on reproducing human voices realistically. But this specialization makes it harder to expand to other sounds.

Looking closer, Bark employs a transformer architecture similar to GPT-3, as described in the README. It embeds text into abstract tokens without phonemes. A second transformer converts these into audio codec tokens to synthesize the waveform. Transformers leverage self-attention to model relationships in data, enabling generative capabilities. This provides flexibility in sounds like music but needs lots of data for high fidelity.

Tortoise pairs an autoregressive transformer, which predicts compressed speech tokens from the input text, with an autoencoder that handles audio compression. It then decodes the compressed audio using a diffusion model, as described in the paper.

This specialized configuration targets voice realism. The autoencoder clones voices efficiently. The diffusion model gives Tortoise exceptional quality. The tradeoff is less flexibility than Bark.

These architectural differences have implications for product possibilities. Bark offers flexibility for apps with diverse audio needs. Tortoise prioritizes voice quality for use cases like audiobooks. Understanding these strengths and weaknesses helps pick the right model for your needs.

Voice Customization

The ability to customize and control the synthesized voice is important for some applications. You'll need to decide how important it is for yours because Bark and Tortoise take different approaches to enabling voice control.

Bark has a limited set of built-in voice presets, but no straightforward way for end users to clone new voices. As described in the documentation, Bark supports 100+ speaker presets across languages. These allow controlling attributes like tone, pitch, and emotion. However, adding new custom voices requires advanced technical skills.

In contrast, Tortoise excels at cloning voices using just short audio samples. Its autoencoder compression enables efficiently capturing speaker characteristics. As explained in the Tortoise source code repo, users can clone voices by providing a few audio clips of a target speaker.

For simple voice assistant applications with a limited set of voices, Bark's presets may suffice. But for product ideas requiring extensive voice cloning of arbitrary speakers, Tortoise is likely the better choice despite additional complexity.

Supported Languages and Accents

Bark and Tortoise take different approaches to supporting multiple languages and accents, with implications for product localization and access.

Bark supports many languages relatively well out of the box, as listed in the documentation:

- English (en)
- German (de)  
- Spanish (es)
- French (fr)
- Hindi (hi)
- Italian (it) 
- Japanese (ja)
- Korean (ko)
- Polish (pl)
- Portuguese (pt) 
- Russian (ru)
- Turkish (tr)
- Chinese, simplified (zh)

Supported Bark languages

Bark handles code-switching and accents smoothly, automatically detecting the language from the text prompt.
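If you want to pin a voice per language rather than rely on auto-detection, the language codes above combine with the history_prompt naming pattern (en_speaker_0, es_speaker_1, and so on). Here is a small sketch. The helper name speaker_preset is hypothetical, and the 0 to 9 index range is my assumption about how many presets each language has.

```python
BARK_LANGUAGES = {
    "en": "English",
    "de": "German",
    "es": "Spanish",
    "fr": "French",
    "hi": "Hindi",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pl": "Polish",
    "pt": "Portuguese",
    "ru": "Russian",
    "tr": "Turkish",
    "zh": "Chinese, simplified",
}


def speaker_preset(lang_code, index=0):
    """Build a history_prompt ID such as 'es_speaker_1' for a supported language."""
    if lang_code not in BARK_LANGUAGES:
        raise ValueError(f"unsupported language code: {lang_code!r}")
    if not 0 <= index <= 9:
        # Assumption: ten presets per language, numbered 0 through 9.
        raise ValueError("speaker index must be between 0 and 9")
    return f"{lang_code}_speaker_{index}"


print(speaker_preset("es", 1))
print(speaker_preset("ja"))
```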

In contrast, Tortoise was trained mostly on English data. As explained in the paper, it lacks diversity in supported languages and accents. Non-English speech would require collecting additional training data and retraining the models.

This gives Bark an advantage for products aimed at global markets or supporting multilingual users. Bark's built-in multilingual support reduces the effort required for localization. Tortoise would involve more work to expand beyond English.

For products highly optimized for a single language like English audiobooks, Tortoise provides superior quality. But Bark is generally a better choice if easily supporting many languages and accents is critical.

Output Quality

While both models produce excellent results, Tortoise TTS edges out Bark in default audio quality right out of the box. However, Bark can match Tortoise given sufficient tuning and prompt engineering.

As noted in the documentation, Bark's audio quality is very good, but some creative prompting is needed to achieve the best results. Guiding the model with brackets, capitalization, speaker tags, and other markup can improve fidelity. You may also need to post-process some audio if super-high-quality sound is important to your use case.
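To make the prompting tricks concrete, here is a sketch of small helpers for composing a marked-up prompt. The [laughs] cue comes from Bark's default prompt shown earlier. The bracketed uppercase speaker tag format is my assumption about how speaker tags are written, and all three helper names are made up for this example.

```python
def tag(sound):
    """Wrap a non-verbal cue in brackets, e.g. 'laughs' becomes '[laughs]'."""
    return f"[{sound}]"


def emphasize(text):
    """Capitalize a word or phrase to nudge the model toward stressing it."""
    return text.upper()


def speaker_line(speaker, text):
    """Prefix a line with a bracketed speaker tag (the tag format is an assumption)."""
    return f"[{speaker.upper()}] {text}"


prompt = " ".join([
    "I", emphasize("really"), "like pizza.", tag("laughs"),
    "But I also have other interests.",
])
print(prompt)  # I REALLY like pizza. [laughs] But I also have other interests.
print(speaker_line("woman", prompt))
```

Keeping the markup in helpers like these makes it easier to experiment: you can swap cues or drop the emphasis and regenerate without hand-editing strings.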

In contrast, Tortoise offers exceptional audio quality without any prompt tuning needed. Synthesized voices are extremely close to human speech. The samples sound virtually indistinguishable from real people, with only a few artifacts.

This difference highlights Tortoise's focus specifically on optimizing voice reproduction. The diffusion model and conditioning workflow deliver consistently high-quality results that few other TTS systems match.

However, Bark's flexibility as a general audio model means it can likely match Tortoise's quality given enough experimentation with prompts. This prompt tuning requires more effort and skill. I haven't spent enough time with it to pull this off, but you may be able to.

In summary, Tortoise exceeds Bark in default out-of-the-box output quality. But Bark can achieve equivalent quality with sufficient prompting expertise, at the cost of additional effort.

Building Startups with Bark and Tortoise

Both Bark and Tortoise enable creation of a wide range of speech-focused products. What kind of products could you build with these tools? Here are some example startup ideas that play to the strengths of each model:

Bark

  • Voice assistant service supporting multiple languages - Bark's built-in multilingual capabilities make it easy to build assistants for global markets.

  • Interactive audio games - Bark's flexibility allows generating sound effects, background music, and dialogue on the fly.

  • Foreign language learning apps - Combine Bark's speech synthesis with language education tools.

Tortoise

  • Hyper-realistic voice cloning service - Clone customer voices or famous voices using Tortoise's exceptional quality.

  • Synthetic voice actors - Use Tortoise to easily craft distinct voices for animation/video content.

  • Automated audiobook creation - Produce audiobooks from ebooks leveraging Tortoise's strengths at long-form narration.

  • Personalized guided meditations - Generate customized meditation audio in your own voice with Tortoise.

Both

  • Custom text-to-speech APIs - Offer TTS as a service focused on quality and voice control.

  • Text-to-podcast service - Automatically convert blog posts into podcast episodes with either model's speech generation skills.

  • Text-based adventure games - Immerse players with reactive voice narration and effects.

  • Accessibility tools - Enable those with disabilities to convert text to realistic speech.

The quality and voice control of Bark and Tortoise open up many new product possibilities spanning entertainment, education, accessibility, and productivity. What are you going to build with these tools?

Comparing Bark and Tortoise to Alternative TTS Models

While this article has focused on Bark and Tortoise TTS, there are a few other leading audio models worth considering:

  • AudioLDM is a latent diffusion model created by haoheliu focused on generating high-quality audio from text transcripts. It produces very realistic voices and speech like Bark and Tortoise. However, it lacks the flexibility of Bark for non-speech audio and the exceptional voice cloning capabilities of Tortoise.

  • Whisper is a speech recognition model created by OpenAI. It specializes in transcribing audio to text, rather than text-to-audio generation. This makes it complementary to the other models.

  • Free VC by jagilley is an audio-to-audio voice conversion model. It can alter and convert voices while retaining the same speech patterns and style. This can be used along with TTS models like Tortoise that support voice cloning.

While Bark and Tortoise are good choices, these alternative models can provide complementary capabilities like speech-to-text, easier voice cloning, and voice style transfer to consider when building voice-enabled products. They might be a better fit for your product, depending on its needs.

Note: If you're shopping around for the right model, you can also describe your project here and get a recommended set of models based on their similarity to your exact use case.

Here is a comparison of Bark, Tortoise TTS, AudioLDM, Whisper, and Free VC across different use cases and product applicability:

  • For voice assistant applications, Bark is likely the best option given its built-in multilingual capabilities and flexibility in generating sound effects and music in addition to speech. Tortoise TTS and AudioLDM also work well for voice assistants focused just on speech realism.

  • For audiobooks and long-form speech synthesis, Tortoise TTS is likely the best choice given its exceptional voice cloning capabilities and natural prosody. Bark and AudioLDM can also work well but may require more tuning.

  • For transcribing speech to text, Whisper is purpose-built for this use case and would be the recommended model.

  • For voice cloning and conversion, Free VC specializes in this capability, making it the best fit. Tortoise TTS also enables voice cloning via samples.

The table below summarizes the various use cases I discussed above.

Model | Best Use Cases | Key Strengths
--- | --- | ---
Bark | Voice assistants, audio generation | Flexibility, multilingual
Tortoise TTS | Audiobooks, voice cloning | Natural prosody, voice cloning
AudioLDM | Voice assistants | High-quality speech
Whisper | Transcription | Accuracy, flexibility
Free VC | Voice conversion | Retains speech style

Each model has strengths that make it best suited for certain use cases and products, though there is also overlap in capabilities across models. Experiment to find the right one!
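If you want these recommendations in code, for example to route requests in a prototype, a plain lookup table is enough. This sketch simply encodes the comparison above, with the first entry in each list being the top pick. The use case keys and the recommend helper are my own naming.

```python
RECOMMENDATIONS = {
    "voice assistant": ["Bark", "Tortoise TTS", "AudioLDM"],
    "audiobook": ["Tortoise TTS", "Bark", "AudioLDM"],
    "transcription": ["Whisper"],
    "voice conversion": ["Free VC", "Tortoise TTS"],
}


def recommend(use_case):
    """Return candidate models for a use case, best fit first."""
    key = use_case.strip().lower()
    if key not in RECOMMENDATIONS:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise KeyError(f"unknown use case {use_case!r}; try one of: {known}")
    return RECOMMENDATIONS[key]


for use_case in ("Voice assistant", "audiobook", "Transcription"):
    print(use_case, "->", recommend(use_case))
```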

Conclusion

Text-to-speech technology has advanced rapidly, providing startups many options for building voice-enabled products. While Bark and Tortoise are good choices, alternatives like AudioLDM, Whisper, and Free VC provide complementary capabilities to consider.

The key is picking the right model for your specific use case and constraints. For multi-language voice assistants, Bark is likely the top contender. Tortoise excels at hyper-realistic audiobook narration and voice cloning. And there are other applications for which either model could be used.

I hope this guide provides a solid foundation for choosing the right text-to-speech model for your next product. Let me know if you have any other questions!

I'm always happy to help interpret the landscape of generative AI models to build amazing new applications. Thanks for reading.
