I have a question: is it possible in Audacity to analyze one audio track and use its characteristics (volume, speed, frequencies, etc.) as settings to be applied to a second audio track?
Both audio tracks contain only speech (one person); no music or other sound sources.
I am using YouTube's automatic transcript function to generate a transcript for my audio.
Now, I have one video for which the generated transcript has a high degree of correctness.
The second video has a low degree of correctness, every second word is transcribed incorrectly (even entire sentences).
This is due to the quality of the speech (it is a little muffled).
I have managed to increase the intelligibility of the speech by using a high-pass filter (and playing around with the settings) to make it a little clearer.
Also I managed to cut out a lot of background noise (hissing) by using the noise filter.
But it is still nowhere near the intelligibility of the speech in the first audio track.
I understand that the quality of the original recording is the most important factor, and that you can’t turn low-quality audio into high-quality audio.
But is it possible to at least increase the intelligibility a little and get it as close as possible to the example audio track (the one that I know YouTube has no problems with)?
I don’t care whether the end result sounds good or not; the important bit is that Google’s algorithm can extract and transcribe the speech.
The problem is that you need to know exactly what it is that makes transcribing easier for Google. If it is just a simple matter of frequency range, then I would have expected that the Google software would already be applying filters as necessary for the best transcription results. Similarly, if the audio level is important then I would expect their software to automatically normalize the audio to the optimum level. On the other hand, if it is such issues as regional accent and enunciation, then changing those is beyond the capabilities of audio editors.
I posted an old feature request for a link between spectrum analysis and the equalizer tool. It seems like a natural fit, but it’s very hard to do.
Analyze > Plot Spectrum the original work and “remember the settings.” Analyze the second work, subtract the two sets of readings, and apply the difference in Effect > Equalization. In effect, Effect > Equalization becomes completely automatic. It doesn’t even have to display; it can work in the background.
There are a number of jobs that become trivial with those links. See that analysis? Flatten it out and automatically apply the correction via Effect > Equalization.
So far it’s only a feature request.
You can do it manually. Analyze > Plot Spectrum the original and note the lumps and bumps in the pattern.
Then analyze the second work and note those as well. Then subtract all the corresponding points and apply each correction in Effect > Equalization. Now you see why I want to do it automatically. You actually don’t have to do all those points: you can make the analysis sloppy and use only 20 or 30 points instead of hundreds.
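If you want to script that measurement step outside of Audacity, here is a minimal NumPy sketch of the idea. All the function names, the band count, and the frequency range are my own illustrative assumptions, not anything built into Audacity: average the spectra of the good and bad recordings, then read off the dB difference at 20–30 log-spaced frequency points.

```python
import numpy as np

def average_spectrum_db(signal, rate, n_fft=4096):
    """Average magnitude spectrum of a mono signal, in dB."""
    n_frames = len(signal) // n_fft
    frames = signal[:n_frames * n_fft].reshape(n_frames, n_fft)
    # Window each frame, take the FFT, and average the magnitudes
    mags = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).mean(axis=0)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / rate)
    return freqs, 20 * np.log10(mags + 1e-12)

def band_corrections(good, bad, rate, n_bands=25):
    """dB boost/cut to apply to 'bad' at n_bands log-spaced frequencies."""
    freqs, good_db = average_spectrum_db(good, rate)
    _, bad_db = average_spectrum_db(bad, rate)
    centers = np.geomspace(100, 8000, n_bands)  # speech-relevant range
    idx = np.minimum(np.searchsorted(freqs, centers), len(freqs) - 1)
    return centers, good_db[idx] - bad_db[idx]
```

The resulting (frequency, dB) pairs are exactly the "20 or 30 points" you would then type into the equalizer by hand.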
Your ear is a lot better at digging meaning out of ratty sounds than the software – so far. It’s even better if you can see people talking, hence the popularity of video conferencing and its practical necessity in training.
Sometimes you don’t need speech at all. Tests have been done where people listen to intentional rubbish and transcribe it into printed words. Your head just makes it all up. This drives forensics people crazy.
The short version is: keep doing what you’re doing until it starts working. There is a practical limit to how ratty speech can be. You might try the brute-force telephone filter: Effect > Equalization > Select Curve > Telephone.
I was hoping for it to be automatic.
I guess I will have to do it manually then.
It’s a shame there is no “make audio profile” and “apply to entire track” feature like the one in the noise filter.
I wouldn’t say that there are no possibilities within Audacity. I can only speak for Nyquist, which may be a little too slow for such purposes. There are several approaches to this problem; the most natural would be to employ convolution, I guess.
The principle is to construct an automatic filter based on the “ideal” audio. By convolving the input signal with the derived impulse response, you get an output signal with the same spectral colouration as the original. Several VST plug-ins can do this job already.
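As a rough NumPy illustration of that convolution idea (under the simplifying assumption that a magnitude-only, linear-phase matching filter is good enough; this is not any particular VST's algorithm, and all names are made up): derive the filter's impulse response from the ratio of the two spectra, then convolve it with the target.

```python
import numpy as np

def matching_impulse_response(ref, target, n_taps=512):
    """FIR whose magnitude response approximates |REF| / |TARGET|."""
    ref_mag = np.abs(np.fft.rfft(ref, n_taps))
    tgt_mag = np.abs(np.fft.rfft(target, n_taps))
    gain = ref_mag / np.maximum(tgt_mag, 1e-9)
    gain = np.clip(gain, 0.1, 10.0)      # limit boosts/cuts to +/-20 dB
    ir = np.fft.irfft(gain, n_taps)      # zero-phase (circular) response
    ir = np.roll(ir, n_taps // 2)        # shift to make it causal
    return ir * np.hanning(n_taps)       # taper to reduce edge ripple

def apply_coloration(target, ir):
    """Convolve the target with the matching filter."""
    return np.convolve(target, ir, mode="same")
```

In practice the two spectra would be averaged over many frames and heavily smoothed first; otherwise the filter chases spectral details of the particular words spoken rather than the overall voice colour.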
Unfortunately, speech is just about the worst audio to clone (otherwise thousands of people would speak like the US president in their YouTube videos). The spectrum can change from person to person and from mood to mood.
Perhaps the simplest way would be to start with a multi-band EQ in third-octaves that stores the differences between the prototype audio and ordinary white noise. These settings could afterwards be applied to the target sound.
However, the target sound still has to be prepared manually: the pitch (fundamental frequency) and speed must be similar.
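The third-octave idea could be sketched in NumPy like this (the band edges, the normalization against a flat white-noise reference, and all the names are illustrative assumptions): measure how the prototype deviates from flat in standard 1/3-octave bands, then apply those per-band gains to the target in the frequency domain.

```python
import numpy as np

# Standard 1/3-octave centers around 1 kHz, roughly 50 Hz .. 5 kHz
THIRD_OCTAVE_CENTERS = 1000.0 * 2.0 ** (np.arange(-13, 8) / 3.0)

def third_octave_gains(prototype, rate):
    """Per-band level of the prototype, relative to its own average
    (white noise would give roughly 1.0 in every band)."""
    spec = np.abs(np.fft.rfft(prototype))
    freqs = np.fft.rfftfreq(len(prototype), 1.0 / rate)
    gains = []
    for fc in THIRD_OCTAVE_CENTERS:
        band = (freqs >= fc / 2 ** (1 / 6)) & (freqs < fc * 2 ** (1 / 6))
        gains.append(spec[band].mean() if band.any() else 1.0)
    gains = np.array(gains)
    return gains / gains.mean()

def apply_band_gains(target, rate, gains):
    """Scale each 1/3-octave band of the target; bins outside
    all bands are left unchanged."""
    spec = np.fft.rfft(target)
    freqs = np.fft.rfftfreq(len(target), 1.0 / rate)
    for fc, g in zip(THIRD_OCTAVE_CENTERS, gains):
        band = (freqs >= fc / 2 ** (1 / 6)) & (freqs < fc * 2 ** (1 / 6))
        spec[band] *= g
    return np.fft.irfft(spec, len(target))
```

This only transfers the coarse spectral envelope; as noted above, pitch and speed still have to be matched by hand first.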
Right, which brings us back to the original question: “This voice is perfect and that voice isn’t. Make them match.”
Not so far. You can dig all around that and pile up handy tools, but you can’t ever exactly get there.
And yes, totally. Voices are really rough. See: “Make me into an announcer.” It also has roots in “I recorded my voice damaged; can you undamage it?”