Suggestions for making TTS voices sound better?

I am working on a series of “fanimations” using Cepstral TTS voices as the actors. Latest episode (so you can see what I mean) is below:

Is there anything I can do to process the voices and make them sound better?

How do you mean “better”? More natural? Is there a particular reason why you want to use Cepstral TTS rather than recording a real voice?

More natural would be great, but I’d be happy with “generally clearer” at this point.

This is the latest series I’ve worked on, and on previous series I’ve tried going the recorded voice direction. However, you then have folks sending you stuff in various formats, and trying to get the different voices leveled is a hassle - not to mention noise filtering and such… and it’s hard to complain if you’re asking folks to volunteer their time to do so. And if I paid the voice actors to get a professional result, then the musicians would (rightfully) expect to get paid, and then I’d suddenly have a substantial budget…

So I wanted to see how far I could get with generated voices. I took a look at the Vocaloid series, but it’s hard to get a speaking voice out of it that doesn’t sound like it’s talking in sing-song.


You’ve got something of a battle on your hands.
To my ear, the most obtrusive shortcoming of the speech you are using is the unnatural rhythm. This could be improved with a lot of editing in Audacity, but that would be difficult and very time-consuming.

Some text readers, such as the TextAssist program that was packaged with SoundBlaster cards, allowed quite detailed shaping of pitch and timing by using a phonetic language combined with special “tags”. I believe this was originally (or at least in part) developed by Microsoft, and has subsequently been superseded by Microsoft Reader text-to-speech. If you are on Windows, it may be worth looking into that.

Even with the best of current technology it is difficult, if indeed possible, to synthesize completely natural speech, so it may ultimately be best to try to find a couple of voice artists who are sufficiently enthusiastic about your productions to work with you on the voice-overs. On the other hand, a totally synthesized voice could add an interesting flavour to the productions, but I think that you need to find a way of improving the rhythm of the speech.

Yes, I have done this on my movies, but I try to use live actors when they are available. First, try not to use the “standards” like MS Sam or Mike. There is a cool site that has amazing TTS voices, and you can use Audacity to copy them from the website: just run Audacity in the background while accessing the TTS voices on the site. You don’t have to sign up; just play with the tools at hand. Now, as far as making a machine voice appear more human, let’s take this phrase: “I like the color blue.”

If I want feeling in it, I need to stress the verb (LIKE) and stretch the object (BLUE). I select the word LIKE and Amplify it, then I select the word BLUE and slow the tempo without changing the pitch. It is all a matter of psychology. People tend to talk faster when they lie. They tend to say words with special meaning louder, and their voices go higher when they are frightened.
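
For anyone who would rather script those two edits than click through them, here is a rough Python sketch of the idea. It is only an illustration under assumptions: the function names, the test tone, and the sample positions standing in for the words are all made up, and the stretch uses a crude overlap-add method that I am swapping in for simplicity. It is not how Audacity's effects work, and on real speech it will leave audible artifacts.

```python
import math

def amplify_region(samples, start, end, gain_db):
    """Return a copy with samples[start:end] boosted by gain_db, clipped to [-1, 1]."""
    gain = 10 ** (gain_db / 20.0)  # decibels to linear gain
    out = list(samples)
    for i in range(start, min(end, len(out))):
        out[i] = max(-1.0, min(1.0, out[i] * gain))
    return out

def stretch_region(samples, factor, frame=512):
    """Crude overlap-add stretch: factor > 1 is slower/longer, pitch roughly kept.

    Frames are read from the input at a small hop and written to the output
    at a fixed hop, so the audio inside each frame plays at its original rate.
    Edges are handled crudely (the very start and end may fade to silence).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    hop_out = frame // 2
    hop_in = hop_out / factor
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame) for n in range(frame)]
    target_len = int(len(samples) * factor)
    out = [0.0] * (target_len + 2 * frame)
    weight = [0.0] * len(out)
    k = 0
    while True:
        src = int(round(k * hop_in))
        if src + frame > len(samples):
            break
        dst = k * hop_out
        for n in range(frame):
            out[dst + n] += samples[src + n] * window[n]
            weight[dst + n] += window[n]
        k += 1
    # Divide by the summed window weight so overlapping frames do not get louder.
    return [o / w if w > 1e-9 else 0.0
            for o, w in zip(out[:target_len], weight[:target_len])]

# Hypothetical stand-in for a recorded phrase: one second of a 220 Hz tone.
rate = 8000
tone = [0.3 * math.sin(2 * math.pi * 220 * t / rate) for t in range(rate)]

louder = amplify_region(tone, 2000, 4000, 6.0)   # pretend this span is "LIKE"
slower = stretch_region(tone[4000:8000], 1.5)    # pretend this span is "BLUE"
```

In practice you would find the start and end of each word by eye in the waveform, the same way you would select it in Audacity.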

I used the technique to make a robot in my videos sing and recite poems. His voice was MS Mary, and I knocked the pitch up 35% higher. People are always asking who does his voice, so I think that says it all. If you stop by you can see Waffles the robot in many of my movies. Waffles first appeared in this movie.
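
If your tool asks for semitones rather than a percentage, the conversion is simple arithmetic. This little Python sketch assumes that "35% higher" means the frequency is multiplied by 1.35; the function names are just mine for illustration.

```python
import math

def percent_to_ratio(percent):
    """Turn a percent pitch change into a frequency ratio (35 -> 1.35)."""
    return 1.0 + percent / 100.0

def ratio_to_semitones(ratio):
    """There are 12 semitones per octave, and an octave is a ratio of 2."""
    return 12.0 * math.log2(ratio)

ratio = percent_to_ratio(35)
semitones = ratio_to_semitones(ratio)  # about 5.2 semitones up
print(f"{ratio:.2f}x frequency is about {semitones:.1f} semitones up")
```

So a 35% raise is a bit more than five semitones, roughly a musical fourth.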

Appreciate the tips. While I’m not sure the Gizmos voices are that much better of a base (maybe I selected the wrong ones?), the tips for how to play with the recorded voices make a lot of sense. I’ll see what I can do.


Some of the Gizmos voices have accents and just sound different from all those standards. AT&T Labs are working on natural-sounding voices. I really wish there were a wizard for this in Audacity, but it is too complicated for any machine to grasp the meaning behind a sentence and show the proper emotion.

Good Luck!