Help With Vocal Audio Clip

I’ve been looking for help removing room or hall ‘echo’ and have learned that it’s virtually impossible, apparently. But listening to a sample on one post I realise my track is very different, so perhaps there’s hope for it. Maybe it’s not such a severe problem.
Can anyone help me with how best to fix this track so the speaker’s voice is more tolerable and understandable?

I’m afraid there’s no hope. This is the classic case of hall echo, and as far as I know there’s nothing you can do about it.

There are apparently some paid, proprietary tools that claim to be able to “de-reverb” a recording. Perhaps someone else can chime in if they know what they are and if they work.

– Bill

Ah, bad luck, eh. It sounded so different from the clip on the other forum, and it’s so unlike the traditional ‘hello… hello… hello…’ echo, that I thought maybe it was something a tad different, giving some hope. I entertained thoughts of something like (being totally ignorant of the whole discipline, you understand) the higher frequencies getting reflected back and making the voice more ‘brilliant’ (I’ve heard of ‘acoustically brilliant’ rooms) but amenable to correction by limiting the higher frequencies…

something like that.

Hummm. Nothing, eh? And my sample is the original pure, classic ‘hall echo’? Right, well, that’s something I’ve learned and gained then.

Thanks for checking it out for me.

I’ll maybe get onto Chalmers if I can, and he might put up a better-quality recording of the same lecture that he made somewhere, some time.

:slight_smile:

You can at least improve it a bit:

  • split to mono (Track drop-down menu)
  • Noise Removal: get the noise profile from the first track (exclusively selected)
  • delete or mute the first track
  • select the second track and apply Noise Removal
    (-18 dB Reduction, -13 dB Sensitivity, 400 Hz Smoothing, or just experiment)

I’m sure you’ll now endure listening to it for 2 minutes longer :wink:
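
For the curious, here is a rough Python sketch of the kind of spectral gating a noise-removal effect performs. The parameter names mirror the dialog (reduction, sensitivity), but this is only an illustration of the general technique, not Audacity’s actual algorithm:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(x, fs, profile, reduction_db=18.0, sensitivity_db=13.0,
                  nperseg=2048):
    # estimate a per-frequency noise floor from the profile selection
    _, _, P = stft(profile, fs, nperseg=nperseg)
    floor = np.abs(P).mean(axis=1, keepdims=True)
    # gate: bins close to the floor get attenuated by `reduction_db`
    _, _, Z = stft(x, fs, nperseg=nperseg)
    threshold = floor * 10 ** (sensitivity_db / 20)
    gain = np.where(np.abs(Z) > threshold, 1.0, 10 ** (-reduction_db / 20))
    _, y = istft(Z * gain, fs, nperseg=nperseg)
    return y
```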

Hall echo has a philosophy problem. Echo is technically the performer’s own voice arriving at the microphone slightly late, having bounced off the walls and ceiling of the hall. So the job of Echo Removal® is to remove the performer from himself.

[Booming kettle drum sound.]

You can mess with Noise Removal and you might get some relief.

https://forum.audacityteam.org/t/mitigate-overdose-of-reverb/35125/1

Koz

Dueling posts.
Koz

I saw this DeVerberate plugin recently (the free-trial version has spoiler silences inserted randomly).

IMO it doesn’t really work. Their before-and-after examples show it in the best possible light (short-duration notes and reduction of predictable, computer-generated reverb), and it still sounds unimpressive.

Surprises me. I would have thought it the first remedial function to have been worked out - and I’d have thought it fairly easy to work out, too.

Because of the predictability of it. Surely? The time delay will be a constant? So you know when to look. And then the sound is equally predictable, surely? Some kind of degradation of the original sound? You’d think mere ‘pattern matching’ would find it. And where it exists in isolation it can be removed, and where it exists on top of a sound they’re summed, so it can be subtracted?

And I’m totally wrong. Of course. I realise that… I have great respect for all professionals in whatever field of endeavour and I know they put a lot of work, a lot of intelligence into things. If it’s not being done it must be that it can’t be done. Yet. At least by them.

So where am I wrong in my thinking? A total misunderstanding of how audio processing is done? A total misunderstanding of the whole thing? Very possible. Technically I know nothing.

Or could it be that essentially I’ve got it right but the problem is the computational intensity of the task? I.e. the theory is okay but the practice is simply not feasible unless you’ve got a supercomputer?

I would like to know. I remember a few American movies with our hero spies amplifying and separating out the suspect’s voice from a distance of a hundred metres or more, etc… etc… busy, crowded street…

I enjoy my feelings of superiority when I can pontificate to everyone on how impossible such a thing is for this reason or that…

:slight_smile:

Not impossible, but it’s going to be computationally demanding…

Thanks for that. Very interesting.

Looks to me like the problem, one way or another, must have had some very high-powered thinkers and machines applied to it, and I’d be very surprised if it hasn’t been cracked to a large extent.

Just how computationally intensive that ‘crack’ is I don’t know, and I didn’t see any mention of it in those pages.

I imagine that algorithm will filter down to us before too long.

We wouldn’t need total fidelity, either, would we? Which might help. I mean I just want to hear the words clearly. I don’t care too much about the timbre of his voice, the exact character of it, resonance, frequency envelope, whatever goes into defining a voice.

So if a crude algorithm with fewer computational requirements returned a version that perhaps massively distorted the voice - moved it to the upper register perhaps, making a woman’s voice out of a man’s - I wouldn’t care, I’d be satisfied.

That wouldn’t happen in a finished product, of course, because it’d be trivial, I guess, to shift that voice back down a bit, and therefore that’s what would happen, I suppose. But the finished product may not sound like the speaker at all; that’s what I mean. And yet it would still serve a purpose, still be a very useful algorithm for many functions.

I’ll give you a simple example why it is so complicated:
Imagine a simple addition:
30 + 20 = 50
If you know one of the two arguments, it’s quite simple to get the other:
50 - 20 = 30
“Blind” means that you do not have the second argument, so every decomposition is equally possible:
50 = 50 + 0
50 = 49 + 1
50 = 48 + 2
… and so on.

The value “30” would be our initial signal, and we want to re-create it from the output signal and an unknown, “blind” addition (or convolution).

There are only assumptions we can make about the initial signal and the rest.
The “rest” means a linear transformation (the convolution), a non-linear one (e.g. positive samples being treated differently from negative ones), and additional noise.
Thus at least four signals are summed/convolved/multiplied into a single output, and an algorithm has to estimate each of these portions.
The results will be better if you can put some constraints on it, e.g.:

  • the noise sits at -60 dB
  • the input is at around -6 dB
  • it decays at 30 dB per second
  • no non-linear amplification is involved (e.g. tube amplification)
  • the beginning of the signal is pure (no reverb)
  • the last part of the signal is only reverb
  • and so on.

Put simply, convolution is trivial and de-convolution is not.
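
That addition example translates directly into a few lines of numpy. A toy demonstration (the “room” response here is made up): the forward step is one line, and with the room known the inverse is easy too, but the blind inverse has no unique answer.

```python
import numpy as np

rng = np.random.default_rng(0)
dry = rng.standard_normal(1000)          # "30": the signal we want back
room = np.array([1.0, 0.0, 0.5, 0.25])   # "20": the (unknown) room response

wet = np.convolve(dry, room)             # "50": the forward step is trivial

# If we *know* the room, deconvolution by spectral division recovers dry:
n = len(wet)
recovered = np.fft.irfft(np.fft.rfft(wet, n) / np.fft.rfft(room, n), n)
print(np.allclose(recovered[:1000], dry))  # True

# "Blind" means we do not know `room`. Then, exactly like
# 50 = 50 + 0 = 49 + 1 = 48 + 2 = ..., infinitely many (dry, room)
# pairs produce the same wet signal, and extra assumptions (noise
# floor, decay rate, linearity, ...) are the only way to choose.
```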

Another way of looking at “why it is so complicated”:

“Reverberation” is caused by lots of reflections of a sound from the surfaces in a room. Hard surfaces tend to be more reflective for sound. Large rooms tend to have a “longer” reverberation due to the time it takes for the sound to cross the space and bounce back.

When a person speaks in a reverberant location, their voice gets mixed up with hundreds of reflections of their voice. Note that the reflections are still “their voice”. What you are asking of the algorithm, is a way to remove “their voice” from “their voice” to leave a pristine version of “their voice”.

Yes, I saw that way of putting it some time ago, when it was first posted I guess.

I don’t think it is quite accurate. It is an amusing dramatisation of the situation with much truth in it but as it stands it is more or less a conundrum pointing to an impossibility.

Which in fact is not what we have, as that link covering astronomy and earthquake-prediction analysis indicated.

The returned echoes are of course additive to the source signal.

Just as recording a signal over itself is additive. An addition like that, I suppose, would be trivial to edit out.

Editing out the echoed addition is currently not trivial, but that’s what we’re trying to do - not trying to remove the source signal at all, despite the witty phrase.

The returned echo would have a signature all its own. Certain frequencies would be returned more strongly than others.

The dynamics clearly are very different.

And that raises the question of intelligibility. I think. If I’ve got the right word. What I mean is: what parts of a speaker’s voice do we most respond to in order to extract the meaning, the words, what’s being said?

It has been shown that English can be written without vowels and still be found intelligible, and I even saw quite recently a demonstration of English written with only the first and last letters of each word kept in place and all the letters between them shuffled into random order - and it was quite intelligible.

Well I’d surmise there’s some kind of parallel in the audio world. That speech stripped of many things would still be quite intelligible. That there’s a ‘core’, an ‘essence’ that the ear listens for to catch the meaning.

If that core can be found and isolated it may be a much simpler job to remove the echoed ‘core’ from it.

Of course, as I think I said once before, or meant to say if I didn’t, I’m speaking from the point of view of making clear the words regardless of any effect on the ‘quality’ or ‘personal nature’ of the voice. For my purposes and what underlies all my comments it is immaterial if the finished result sounds nothing at all like the original speaker.

Whereas in this particular field there’d be many wishing to clean up recordings and retain the original voice in all its unique characteristics. Especially if the speaker had a famous voice, or an especially ‘good’ speaking voice - or was a singer, say.

Yep. All very valid.

But not what I’m concerning myself with. I’m just on about getting a clear rendition of the words, the same as those astronomers want to get a clear picture of that portion of sky and those earthquake people (seismologists) were only interested in uncovering the location of the source of their sound.

And in there you have the problem - how can you tell what belongs to the original voice, and what belongs to the echoes? Both are complex and continuously changing sounds with very similar characteristics, added together to form a new, complex and continuously changing sound.

One method that can be used is to look at the frequency spectrum at a particular point in time and project how it would decay if it were an impulse. Then compare the projected spectrum with the actual spectrum at a later point in time, and reduce the level of each frequency band in the actual spectrum by an amount equivalent to the estimated projection. This method has been used with some success in a program called “PostFish”, which is unfortunately only available as source code and has not been maintained in years (http://svn.xiph.org/trunk/postfish/README). “Deverberation” is certainly not “impossible”, but it is certainly not easy either.
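
As a rough illustration of that decay-projection idea (a sketch of the general technique, not PostFish’s actual code; the decay rate and the 0.1 floor are arbitrary starting points to tune by ear):

```python
import numpy as np
from scipy.signal import stft, istft

def suppress_reverb_tail(x, fs, decay_db_per_s=30.0, nperseg=1024):
    f, t, Z = stft(x, fs, nperseg=nperseg)
    hop_s = (nperseg // 2) / fs                       # 50% overlap by default
    per_frame = 10 ** (-decay_db_per_s * hop_s / 20)  # magnitude decay per hop
    mag, phase = np.abs(Z), np.angle(Z)
    tail = np.zeros(mag.shape[0])                     # projected reverb floor
    out = np.empty_like(mag)
    for i in range(mag.shape[1]):
        # subtract the projected tail, keeping at least 10% of each bin
        out[:, i] = np.maximum(mag[:, i] - tail, 0.1 * mag[:, i])
        # whatever energy was present this frame feeds the next projection
        tail = np.maximum(tail, mag[:, i]) * per_frame
    _, y = istft(out * np.exp(1j * phase), fs, nperseg=nperseg)
    return y
```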

Can’t a good line on the ‘shape’ of the echo be got from sampling the end of a ‘period’ when the echo is all you have?

On my particular recording there are many different sounds to look at: short words and long words/phrases, perhaps even exclamations, perhaps even furniture sounds, and periods of silence here and there.

That’s the context I’m thinking/talking about. Is such a context helpful at all?

Thanks for the link to Postfish. I’ll see if I can download it to my Ubuntu box tomorrow, or download the source and compile perhaps… if it gets too technical I’ll be lost… many years since I did any C or in fact any real programming at all.

I’m thinking the returned echo will always have a certain characteristic - certain frequencies bouncing back very strongly.

If those frequencies can be identified they could be reduced in volume all across the recording.

Like the notion of identifying only the parts of the word sound that make that word intelligible, it might be possible to identify only the parts of the echo that are of real nuisance value.

Doesn’t work like that? Can’t be done?

Can’t a good line on the ‘shape’ of the echo be got from sampling the end of a ‘period’ when the echo is all you have?

It’s “echoes are all you have.” The echo at the end is a grouping of all the echoes in the last spoken sentence.

Someone posited that if you walked out just before the lecture and clapped once, you could generate a profile of the room (full of students) and get echo data that way. It could also be argued that you could do this after the lecture in the empty hall and get close, the walls and ceiling not having moved.

Koz
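
If you did have such a clap recording, the problem stops being “blind”: the clap approximates the room’s impulse response, and ordinary deconvolution becomes possible. A minimal Wiener-style sketch, assuming `speech` and `clap` are numpy arrays captured in the same hall at the same sample rate (the `noise_floor` constant is a guess to tune by ear):

```python
import numpy as np

def wiener_deconvolve(speech, clap, noise_floor=1e-2):
    n = len(speech) + len(clap) - 1
    Y = np.fft.rfft(speech, n)   # reverberant speech spectrum
    H = np.fft.rfft(clap, n)     # estimate of the room transfer function
    # Wiener filter: conj(H) / (|H|^2 + k) avoids dividing by near-zero bins
    G = np.conj(H) / (np.abs(H) ** 2 + noise_floor)
    return np.fft.irfft(Y * G, n)[:len(speech)]
```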

I’m thinking the returned echo will always have a certain characteristic - certain frequencies bouncing back very strongly.

You can process a small room that way. That’s “honking”, or the talking-into-a-wine-glass effect, where the echo has a strong pitch personality: the walls of the room ring or resonate. Bigger venues tend not to do that; they have the half-second echo thing.

You can get partial correction of honk with a simple Analyze > Plot Spectrum. Screen-grab that and generate an equalization curve that flattens out some of the humps.
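
The same hump-hunting can be scripted: estimate the long-term spectrum and flag the resonant peaks worth notching with EQ. A sketch, with the `find_peaks` threshold as a guess to tune:

```python
import numpy as np
from scipy.signal import welch, find_peaks

def find_room_resonances(x, fs):
    f, psd = welch(x, fs, nperseg=8192)          # long-term average spectrum
    db = 10 * np.log10(psd + 1e-20)
    # peaks standing ~6 dB above their surroundings suggest room resonances
    peaks, props = find_peaks(db, prominence=6)
    return [(f[p], props["prominences"][i]) for i, p in enumerate(peaks)]
```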

We had a sound room two companies ago that had little or no soundproofing, but many good recordings were made in there. Its secret was that no two walls were parallel. The ceiling was off, too. Echoes would just drop dead. I believe it was originally constructed for band practice.

My office, on the other hand, would sustain a clap for several hours.

Koz

I think this might be a ‘wine glass effect’ situation.

I should have posted a link a long time ago.

https://www.youtube.com/watch?v=DIBT6E2GtjA

You can counter the room resonance (the “wineglass” effect) to some degree using the Equalization effect. http://manual.audacityteam.org/o/man/equalization.html

The reverberation can also be reduced using a “trick” that Robert J H mentioned right at the start of this topic - using the Noise Removal effect with a profile taken from the whole recording (not just the “silences”). Effectively this is doing the “impossible” task of “removing the voice from the voice”, but the extreme setting of the Sensitivity slider causes it to mis-track the amplitude, so it affects the reverb more than the direct voice.

These are the Noise Removal settings that I used:
[attachment: windowplus-Noise Removal-000.png]
This is the Eq curve that I used:
[attachment: fullwindow-Equalization-000.png]
Finally, I Normalized to -1 dB.

This is the result (note that the “clean-up” has made distortion that was present in the original audio much more obvious - not much we can do about that).

I did the Noise Removal first, then the Eq. You may get better results if you spend more time than I did tweaking the settings.
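
For completeness, the same chain in Python, reusing the `spectral_gate` sketch from earlier in the thread. The settings are illustrative rather than the exact ones used above; the point is that the profile is the whole recording, which is what makes the sustained reverb gate harder than the speech transients:

```python
import numpy as np

def normalize(x, peak_db=-1.0):
    # scale so the absolute peak sits at peak_db (here -1 dB)
    return x * (10 ** (peak_db / 20) / np.max(np.abs(x)))

def cleanup(x, fs):
    # 1. noise removal, with the *entire* recording as the profile
    y = spectral_gate(x, fs, profile=x, reduction_db=30, sensitivity_db=12)
    # 2. (EQ step omitted - notch the room resonances found earlier)
    # 3. normalize to -1 dB
    return normalize(y)
```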