I’m working on a project that involves synthesized speech, and I’m producing this speech with Festival. Like a lot of computer generated speech, the result sounds kind of harsh, with abrupt breaks between different sounds. I’m wondering if there’s a method for smoothing this kind of thing out?
An analogy: If you have a photograph with harsh artifacts from a bad digital camera or a bad image compression, you can sometimes correct this by softening the picture in your image editor, then resharpening it. Is there a similar process I can do to a wav file?
Using SoX, I’m already adding a small reverb and a lowpass filter to the output from Festival, but this doesn’t quite solve the problem.
I’ve uploaded an example voice from Festival here to make it easier for people to help. It is the “Alan” voice saying “This is a computer-synthesized voice”.
I thought Effect > Compressor helped quite a lot. You might want to try Truncate Silence (in 1.3 Beta) too, especially if you have overly long breaks between sentences. I didn’t feel artefacts were much of a problem in my example. Have you got an example smaller than 1 MB you can upload? Have you done Analyze > Plot Spectrum on the words that sound more abrupt?
Thanks for your suggestions. I’ll play around with the dynamics, and see if that helps.
Here are a couple of sample files. 1.wav is the raw output from the Festival voice I am using. 2.wav is after some small reverb and a lowpass filter (“sox 1.wav 2.wav reverb 20 20 lowpass -1 2500”), but it’s still not good enough. The voice is saying “This is some spoken text. It could use some smoothing out in my opinion.”
The voice is basically Festival’s “voice_kal_diphone”, but I’ve changed the durations and pauses to slow it down.
Truncate silence is not an option. The project needs to generate text-to-speech audio files of algorithmically written bedtime stories. That’s why I’m slowing the voice down, and adding pauses. That’s also why “smoothness” is so important.
An analogy: suppose you have a font of handwritten letters, and you want them to look like real handwriting when you print. But the font only gives you 26 letters, and some of them don’t join up with the others. What you’d really need is a 26x26x26-sized font, with each letter drawn in combination with every letter that can come before or after it, so that when you print, everything lines up. But then you’d have to pick the correct glyph from the font based on the letters on either side of it. Expensive to build, a pain to use, but it’s the only way if you don’t want those gaps and glitches. Most people say “good enough” and live with it.
It’s the same with speech: you need sound samples that start and finish with every other sound sample, so that they can be pieced together smoothly with no gaps. You could cut out the gaps manually and edit the joins to be smoother, but is it really worth the trouble?
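If you did want to smooth those joins in post, one common trick is a short equal-power crossfade at each splice point instead of an abrupt butt join. A minimal numpy sketch (the function name and parameters are my own invention, not anything from Festival or SoX):

```python
import numpy as np

def crossfade_join(a, b, sample_rate=16000, fade_ms=10):
    """Join two mono audio clips with a short equal-power crossfade,
    softening the click you get from an abrupt splice."""
    n = int(sample_rate * fade_ms / 1000)  # overlap length in samples
    n = min(n, len(a), len(b))
    t = np.linspace(0.0, np.pi / 2, n)
    fade_out = np.cos(t)   # tail of the first clip ramps down
    fade_in = np.sin(t)    # head of the second clip ramps up
    overlap = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], overlap, b[n:]])
```

A 5–20 ms fade is usually enough to hide a discontinuity without audibly blurring the consonants on either side.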
Many years ago I produced a performance piece that involved singing computers. This was using a text to speech engine, not Festival, but I imagine that many of the same principles apply. For singing, the pitch and duration of words are obviously of great importance. I tried many methods of post-processing the speech output, but there was little that could make much improvement - filtering, adding reverb, dynamic processing and so on would make some improvement to some aspect of the sound, but at the expense of another. For example, I could make the voice “smoother” but it would also become less clear.
To produce the best results involved making adjustments within the speech engine before generating, rather than trying to fix things up after generating.
The speech engine that I was using was the one from Creative Labs. It doesn’t have the best-sounding voices, but it provided easy access to the speech generation at the phoneme level. To generate slow speech (slowly sung words), it was necessary to stretch certain phonemes, but not others. For example, if saying “This is some spoken text”, rather than slowing the whole thing down (t-h-i-s…i-s…s-o-m-e…s-p-o-k-e-n…t-e-x-t), it would primarily be the vowel sounds that would have their durations increased (theeeis eees soeeerm spowwwwken tehxt). Much of this could be done within the text-to-speech interface, but in some instances it required “tempo stretching” of very short sections of sound.
The process was very labour intensive, but I was happy with the final result. I’ve only played very briefly with Festival, so I’m not familiar with how to manipulate individual phonemes, but I would guess that it is possible. In Audacity 1.3.12 you can use “Change Tempo” or “Sliding Time Scale/Pitch Shift”. The latter effect is higher quality, but is a bit buggy in the current version - I would recommend installing the Nightly version of Audacity if you use this effect extensively. Minor adjustments to the length of a vowel sound can be made by copying a few cycles of the waveform and repeating them.
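That cycle-copying trick can be roughly automated if you know the pitch period of the vowel. A sketch, assuming a mono numpy array; the function name is made up, and in practice you’d get the period from the waveform or a pitch tracker:

```python
import numpy as np

def lengthen_vowel(samples, period, extra_cycles):
    """Lengthen a roughly periodic vowel segment by repeating its
    last full pitch cycle a few extra times.
    `period` is the pitch period in samples (assumed known)."""
    cycle = samples[-period:]  # one pitch period from the end
    return np.concatenate([samples] + [cycle] * extra_cycles)
```

Because the repeated chunk is a whole pitch cycle, the joins land at matching points in the waveform and don’t click the way an arbitrary copy-paste would.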
If you have a lot of text-to-speech to produce, it may be more time efficient to edit the voice data (the original recordings that are used for generating the speech) or create your own voice data set.
Producing high-quality text-to-speech is not easy and can be extremely time-consuming. (The computer song that I wrote was about one minute long and took about two weeks to produce.)
Steve, all the slowing down I’ve done is inside Festival, not after. Unfortunately, I can’t massage the audio by hand after it’s made, since everything needs to be completely automated in this application. The user sets some parameters, clicks a button, and it outputs a unique, finished story as an mp3 file.
I’ve more or less solved the problem by adding some background noise: a slowly changing pink noise, like wind or waves, with the frequency range of the voice cut out of it. Oddly enough, adding this noise makes the voice somewhat easier to understand. Perhaps it’s a kind of heavy-handed dithering?
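For anyone curious how such a noise bed can be made, here is one way to sketch it with numpy: shape white noise to a 1/f (“pink-ish”) spectrum in the frequency domain, then zero out the voice band before transforming back. The function name is made up, and the 300–3400 Hz band is my assumption (roughly the telephone speech band), not a figure from the post above:

```python
import numpy as np

def masked_pink_noise(n_samples, sample_rate=16000,
                      notch_lo=300.0, notch_hi=3400.0, seed=0):
    """Generate pink-ish noise with the main voice band removed,
    suitable as a background bed under synthesized speech."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)
    # 1/sqrt(f) amplitude shaping gives a 1/f power spectrum (pink noise)
    shaping = np.ones_like(freqs)
    shaping[1:] = 1.0 / np.sqrt(freqs[1:])
    spectrum *= shaping
    # cut the voice band out so the noise sits "around" the speech
    spectrum[(freqs >= notch_lo) & (freqs <= notch_hi)] = 0.0
    noise = np.fft.irfft(spectrum, n=n_samples)
    return noise / np.max(np.abs(noise))  # normalize to +/- 1
```

Mixing this under the voice at a low level leaves the speech frequencies untouched while filling in the silences, which may be why it makes the abrupt joins less noticeable.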