Additional approach to Peak Pruning

Hi Steve, I saw in the sources that you are using Peak Pruning as described by Tolonen and Karjalainen. But have you tried the variation of it described in “Multipitch Analysis and Tracking for Automatic Music Transcription” by Richard Baumgartner?
It seems promising and may be worth testing.
If you succeed in implementing this algorithm, we can use it as a second “pitch spectrogram” that can be displayed from the menu.

Also, for noise reduction, there seems to be a good algorithm called the “observer-based beamforming algorithm”, as described by Richard Baumgartner.

I’m asking all of this because we may get better pitch identification, and also find the right path to building an accurate sound-pattern algorithm, if we can actually identify the sources of the sound waves. I was reading some articles on how to measure the distance from a given sound to the human ear in order to better identify its source. It seems that if we can tell different sound sources apart by calculating their distances, we can isolate them better once they are identified. For example, let’s say two different people are speaking, one in front of the other, and I’m listening to both of them. Perhaps which sound belongs to which person can be worked out by computing the distance between them. I mean, sound sources positioned at different distances may have a distinguishable pattern on the spectrogram. If we can actually measure the distance for the XX frequency related to the XX person (assuming we can identify it), then we can isolate the other sound simply by measuring the distance of its source.
Person A is 10 meters from me. He says “Ahhh”… the sound travels at a given speed and reaches my ears. When the 1st sound wave arrives, it has YYY samples of sound at the corresponding frequency and volume. So say we can measure the 1st sound wave as containing 100 samples (represented here as air molecules that vibrate the bones inside my ear). Then, milliseconds later, comes its 2nd sound wave at XXX frequency, YYY volume, etc. It will also produce some sort of signature that is the same as that of the 1st sound wave, right?

But… say we also have a Person B, located a few meters behind Person A, who also says “Ahhh”.
Person B is 15 meters from me. When he says “Ahhh”, his sound wave reaches my ears and produces the same kind of signature as described above. But his samples are different from Person A’s, and yet they can be verified, right?

So, if we can produce a pattern for each sound wave, considering (frequency, volume, distance from the source), we can isolate the other sound simply by selecting the frequencies related to Person A and the frequencies related to Person B over a given period of time.
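
To put rough numbers on the two-speaker example, here is a minimal sketch of the arithmetic. It assumes a speed of sound of 343 m/s, a 44100 Hz sample rate and free-field inverse-distance level loss, none of which are stated in the post, and the function names are made up for illustration. Note that a recording only reveals these delays if the moment each sound was emitted is known, which is normally not the case.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at about 20 °C
SAMPLE_RATE = 44100     # Hz; assumed recording sample rate


def arrival_delay_seconds(distance_m):
    """Time for sound to travel distance_m metres."""
    return distance_m / SPEED_OF_SOUND


def arrival_delay_samples(distance_m, sample_rate=SAMPLE_RATE):
    """The same travel time, expressed in whole samples."""
    return round(arrival_delay_seconds(distance_m) * sample_rate)


def level_difference_db(near_m, far_m):
    """Level drop of the far source relative to the near one, assuming
    equal loudness at the source and free-field (1/distance) spreading."""
    return 20.0 * math.log10(far_m / near_m)


if __name__ == "__main__":
    for name, distance in (("Person A", 10.0), ("Person B", 15.0)):
        ms = arrival_delay_seconds(distance) * 1000.0
        print(f"{name}: {ms:.2f} ms, {arrival_delay_samples(distance)} samples")
    print(f"Level difference: {level_difference_db(10.0, 15.0):.2f} dB")
```

Under these assumptions the two voices arrive roughly 15 ms apart and differ by about 3.5 dB, which is smaller than the ordinary loudness variation between two speakers.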

Can such a thing be done?

My impression is that the discussion describes a lab experiment that ignores many real-world problems.

I think the only part of that which would work is having two different sound sources arriving at a very good stereo microphone system from different directions. It should be possible to get directional information from the arrival of the waves, given that it’s only going to give approximate information: it fails completely at 0 degrees and 180 degrees (in a doughnut shape) and is substandard at other angles. I think that’s the only value in the proposal. If my sister and I are both speaking and she’s further away than I am, the system will not be able to derive any meaningful information. The quality of voice (very different) and distance information may be the same/ambiguous. That’s given that neither of us nor the environment is moving.
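
As a rough illustration of getting direction from the arrival of the waves at a stereo pair, here is a minimal sketch that cross-correlates the two channels to find the delay between them and converts that delay to an angle. The 0.2 m microphone spacing, the 48000 Hz sample rate, the 343 m/s speed of sound and the function names are all assumptions for the example, and the test signal is synthetic noise, not a real recording.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at about 20 °C


def estimate_delay_samples(left, right):
    """Lag in samples by which `right` trails `left` (negative if it leads),
    taken from the peak of the full cross-correlation."""
    corr = np.correlate(right, left, mode="full")
    return int(np.argmax(corr)) - (len(left) - 1)


def arrival_angle_degrees(delay_samples, sample_rate, mic_spacing_m):
    """Angle away from straight ahead (broadside), positive toward the
    left microphone, for a distant source. Front and back give the same
    delay, so this value alone cannot tell them apart."""
    path_difference = SPEED_OF_SOUND * delay_samples / sample_rate
    ratio = np.clip(path_difference / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))


def demo():
    """Synthetic check: white noise reaching the right channel 5 samples late."""
    rng = np.random.default_rng(0)
    source = rng.standard_normal(2000)
    true_delay = 5
    left = source
    right = np.concatenate([np.zeros(true_delay), source[:-true_delay]])
    delay = estimate_delay_samples(left, right)
    angle = arrival_angle_degrees(delay, sample_rate=48000, mic_spacing_m=0.2)
    return delay, angle


if __name__ == "__main__":
    print(demo())
```

With clean synthetic noise the delay is recovered exactly. With two overlapping voices, room reflections or a moving camera, the correlation peak becomes much less reliable.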

Say you have a nice stereo microphone system on top of a camcorder. I elect not to use sticks (tripod) but to go hand-held. The nice microphone array on top of the camera is right next to my warm-charcoal coloured, all-cotton Carhartt baseball cap --but only on one side. The other side is in free air screwing up all the proximity, arrival and frequency information. And everything is moving by the way.

Most microphone systems are not stereo, so that kills many of the propositions. Recording for news gathering under unfavorable conditions is noisy, messy, dangerous, unstable, and mono. These are precisely the conditions under which software is expected to help. “I recorded my wedding and the microphone was next to my dog who panted through the whole thing. Can you help?”

No, probably not.

The holy grail of sound processing is software that directly knows and recognizes human voices under all conditions rather than trying to do it with secondary information. Not many of those around.


Oh, one more item. Even though my sister and I are speaking from different distances with very different voices, our sibilants are the same. SSS and FFF sound the same between us, so you can’t use those to figure out who’s speaking. Once you lose those, the rest of speech becomes rubbish. Koz

I’ve seen and heard some very impressive “lab” demonstrations of spatial analysis of sound sources, but nothing so far that can get close to the ability of human hearing.

Probably not.