Making text-to-speech sound natural

I’m making an animation where the voice will be made with text-to-speech. I already got the accentuation (or whatever it’s called in English) and most other things right.
The problem is that when the voice doesn’t move between pitches quickly, the roughness and stuttering (or whatever the word is) of the text-to-speech can be heard, so it’s obvious that it’s not natural.

The “COULD IT BE” and “THAT IT WAS HER AGAIN” attachments were made in exactly the same way: the same text-to-speech and the same pitch, formant and tempo changes. But one has quickly changing pitches (more than can be heard) and the other doesn’t. One sounds natural and the other has a robotic voice texture. I tried giving clips like the second one more pitch movement by jumping to the same pitch an octave higher and so on, but that ruined the accentuation, since it sounds less disappointed and more happy.

So that is all I want: to smooth out the texture of the voice somehow, so it sounds as natural as the first attachment. I tried searching for it in many ways. I also tried speeding up the voice until the roughness can’t be heard anymore and then slowing it back down, but it just returns to the same roughness with worse quality.
It obviously has to do with the waveform looking like the teeth of a saw, as if half the samples had been removed, but I don’t know what to do about it.
Using Windows 7 and Audacity 3.0.2

I don’t think Audacity is going to be much help. Not that we can’t make emphasis and tonal corrections, but you will need to do it word by word. It’s a retirement project.

There was one forum poster who claimed he was going to correct his audiobook reading word by word. He’s probably still doing one book years later.

There are two solutions: write big checks to the companies who have natural-sounding Text to Speech, or contact one of the many human posters on the forum getting set up for Voice-Over Production. They may do it for you “for the experience.”


Well, I did the pitch, formant and tempo changes at a finer level than word by word, and it’s not a book’s length I need to fix, so any suggestions would be nice.

Any way of upsampling a sound? Like with Lánczos resampling or something?

And as I wrote, simply increasing the pitch gets rid of the rough sound, but then the pitch is not at the correct level. So isn’t there a way to decrease the pitch but keep the sample rate/resonation rate somehow?
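
For the upsampling question, here is a minimal sketch of Lánczos interpolation in plain Python, only to show what that kind of resampling does. The function names (`lanczos_kernel`, `lanczos_upsample`), the integer `factor` and the window size `a=3` are choices made for this illustration, not anything Audacity exposes. Keep in mind that upsampling only interpolates between the samples that already exist, so it may well leave the roughness as it is.

```python
import math


def lanczos_kernel(x, a=3):
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, else 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def lanczos_upsample(samples, factor, a=3):
    """Upsample a list of floats by an integer factor (illustrative only)."""
    n = len(samples)
    out = []
    for j in range(n * factor):
        t = j / factor            # output position, in input-sample units
        base = math.floor(t)
        acc = 0.0
        # Sum the 2*a nearest input samples, weighted by the kernel.
        for i in range(base - a + 1, base + a + 1):
            k = min(max(i, 0), n - 1)   # clamp at the edges of the clip
            acc += samples[k] * lanczos_kernel(t - i, a)
        out.append(acc)
    return out


if __name__ == "__main__":
    clip = [0.0, 1.0, 0.0, -1.0, 0.0]
    print(lanczos_upsample(clip, 4))
```

Every `factor`-th output value lands back on an original sample; the values in between are smooth guesses, not recovered detail.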

That sounds like non-English text-to-speech trying to speak English.

There are far better sounding text-to-speech apps.
Even the one in Google Translate sounds better.

[ Up to you to check if your text-to-speech voice can legally be used in your project ].

Yes, it’s “non-English text-to-speech trying to speak English”, but there are reasons for that, and the lame pronunciation is not a problem for me (since it will be subtitled anyway). The robotic sound is. (And as far as I know, it’s legal to use it.)

It might have sounded smoother at the start, but I had to change the voice from female to male, flatten the pitch, then manually adjust the pitch, formant and tempo. So if I used a different text-to-speech, it might turn out the same anyway. Plus it sounds good enough in the first attachment, which was made the same way, which is why I thought I could just improve it without starting over.

I tried simply duplicating the track and time-shifting the copy so that the two kind of fill each other’s holes, and it does sound better. The problem is that it’s obvious there are two tracks on top of each other, since they still overlap in places.
I tried “Auto Duck”, but it seems to do the exact opposite: I need one track to get quieter when the other is louder, not to have one track mimic the loudness of the other (that’s what ducking seems to do).

Any way to reverse-duck?
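
To make the “reverse duck” idea concrete, here is a rough sketch of it in plain Python, outside Audacity. It is not an Audacity effect: the names `envelope` and `complementary_mix` and the `window` length are assumptions made for this example. Each track is weighted by its own share of the combined loudness, so whichever copy is louder at a given moment dominates and the other is pushed down.

```python
def envelope(samples, window):
    """Rough loudness: moving average of absolute sample values."""
    env = []
    acc = 0.0
    for i, s in enumerate(samples):
        acc += abs(s)
        if i >= window:
            acc -= abs(samples[i - window])   # drop the sample leaving the window
        env.append(acc / min(i + 1, window))
    return env


def complementary_mix(a, b, window=256):
    """Mix two tracks so each gets quieter where the other is louder."""
    n = min(len(a), len(b))
    env_a = envelope(a[:n], window)
    env_b = envelope(b[:n], window)
    out = []
    for i in range(n):
        total = env_a[i] + env_b[i]
        if total == 0.0:
            out.append(0.0)                   # both tracks silent here
            continue
        w_a = env_a[i] / total                # share of loudness held by track a
        out.append(w_a * a[i] + (1.0 - w_a) * b[i])
    return out


if __name__ == "__main__":
    track = [1.0] * 4 + [0.0] * 4
    shifted = [0.0] * 4 + [1.0] * 4           # time-shifted copy filling the gap
    print(complementary_mix(track, shifted, window=1))
```

A longer `window` makes the gain change more slowly, which may hide the switching between the two copies better; the right value would have to be found by ear.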

I had to change the voice from female to male

That’s one of the famous failure jobs. How did you do that? Did you use a voice manager, or try to use Effect > Change Pitch?

Not everything changes between Male and Female voices. That’s not an easy job.


I used Rovee for that, and Kerovee to flatten the pitch. It sounds good enough for most of the samples. It’s only 2-3 sentences that sound robotic like that, and I would rather not start them over, so I can either improve them from this state or leave them as they are.

I assume you are using Sliding Stretch…