Sample Set Alignment and Difference

I would like to analyze the difference between two .wav files. The sample sets should be very similar. They will have some differences in harmonic content and so on but it can be assumed that the two sample sets will not be wildly different. So I have written a libsndfile program to do a basic sample-for-sample subtraction and then write the difference between each sample to a third wav file. So I can see and even listen to the difference.
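To give an idea of what I mean, here is a stripped-down sketch of that kind of sample-for-sample subtraction using libsndfile (the file names are placeholders, it assumes two mono files at the same sample rate, and error handling is minimal):

```c
/* Sketch only: subtract b.wav from a.wav frame by frame and write diff.wav.
   Assumes two mono files at the same sample rate. */
#include <stdio.h>
#include <sndfile.h>

int main(void)
{
    SF_INFO ia = {0}, ib = {0};
    SNDFILE *fa = sf_open("a.wav", SFM_READ, &ia);
    SNDFILE *fb = sf_open("b.wav", SFM_READ, &ib);
    if (!fa || !fb) { fprintf(stderr, "open failed\n"); return 1; }
    if (ia.channels != 1 || ib.channels != 1 || ia.samplerate != ib.samplerate) {
        fprintf(stderr, "expecting two mono files at the same rate\n");
        return 1;
    }

    SF_INFO io = {0};
    io.samplerate = ia.samplerate;
    io.channels = 1;
    io.format = SF_FORMAT_WAV | SF_FORMAT_FLOAT;
    SNDFILE *fo = sf_open("diff.wav", SFM_WRITE, &io);
    if (!fo) { fprintf(stderr, "cannot create diff.wav\n"); return 1; }

    float xa, xb;
    while (sf_readf_float(fa, &xa, 1) == 1 && sf_readf_float(fb, &xb, 1) == 1) {
        float d = xa - xb;              /* the "difference" sample */
        sf_writef_float(fo, &d, 1);
    }

    sf_close(fa); sf_close(fb); sf_close(fo);
    return 0;
}
```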

But it is clear that this little program is deficient in a number of ways and it occurs to me that other people may have already written similar code. So I will try to describe what I think I need to do to compare two sample sets for “differences” and if it sounds like anything that Audacity already does or if there is a library that already does what I need, can you please direct me?

So the way the two sample sets are generated is to use PC software and a Fast Track USB external sound card to send a “stimulus” (a sequence of tones) out to the Device Under Test (DUT) and then simultaneously record the output. Then the DUT is changed in some presumably subtle way and the test is performed again. The two recorded files are the two sample sets to be compared.

The first problem is that even though the PC software starts recording as soon as the stimulus starts, the latency between recordings is not constant, meaning the samples in each set are not aligned. Ideally I would just use hardware with a fixed number of cycles between writing a sample and reading one, so that each time I run a test the recordings are aligned in time to the nearest half clock cycle. But clearly my Fast Track USB is not up to doing this; the offset is at least several milliseconds. So to align the samples I was going to implement an alignment algorithm. Specifically, I was thinking about dividing each sample by the previous sample to get a “relative change” value. Then I would maybe log() and truncate those values to give a quantized set that can be compared efficiently. The algorithm would simply compare and shift to produce a set of scores for offsets of plus and minus a certain number of samples, say 1024, which depending on the sample rate should equate to 20 ms or so. The index of the score indicating the minimum amount of change is the offset that is considered “aligned”.
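For what it’s worth, here is a rough sketch of the shift-and-score idea on plain float buffers. It uses a sum-of-squared-differences score rather than the quantized log-ratio values described above, and best_offset is just an illustrative name, but the structure is the same:

```c
/* Sketch: find the integer offset (within +/- max_lag samples) that minimises
   a sum-of-squared-differences score between buffers a and b.
   A positive result means b lags a by that many samples. */
#include <float.h>
#include <stddef.h>

long best_offset(const float *a, const float *b, size_t n, long max_lag)
{
    long best = 0;
    double best_score = DBL_MAX;

    for (long lag = -max_lag; lag <= max_lag; lag++) {
        double score = 0.0;
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            long j = (long)i + lag;          /* index into b for this lag */
            if (j < 0 || (size_t)j >= n)
                continue;
            double d = (double)a[i] - (double)b[j];
            score += d * d;
            count++;
        }
        if (count > 0) {
            score /= (double)count;          /* normalise for the shrinking overlap */
            if (score < best_score) {
                best_score = score;
                best = lag;
            }
        }
    }
    return best;
}
```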

The next problem is that even with the samples aligned, it does not mean the actual audio is aligned; it can still be off by as much as one half of a sample period. I’m not sure about this (I normally write networking software, so DSP is out of my range), but I think I need to do some kind of re-sampling with interpolation: create a buffer 16 times the size of the sample set and generate 15 interpolated values between each pair of original values, then run the same alignment algorithm again to re-align the up-sampled sample sets.
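Something like this crude linear-interpolation up-sampler is what I have in mind, although I gather a proper band-limited (sinc-based) resampler would be more accurate; treat it as a sketch only:

```c
/* Sketch: naive 16x (or any factor) up-sampling by linear interpolation.
   out must have room for (n - 1) * factor + 1 samples.
   A band-limited (sinc) interpolator would be more accurate. */
#include <stddef.h>

void upsample_linear(const float *in, size_t n, float *out, int factor)
{
    if (n == 0)
        return;
    size_t k = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        for (int step = 0; step < factor; step++) {
            float t = (float)step / (float)factor;     /* fractional position */
            out[k++] = in[i] + t * (in[i + 1] - in[i]);
        }
    }
    out[k] = in[n - 1];                                /* keep the last original sample */
}
```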

Then I can perform a simple difference computation, reduce the interpolated values back to the desired sample rate, and write the .wav difference file.
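The last step would then be roughly this (again only a sketch; strictly the decimation should be preceded by a low-pass filter, which is omitted here):

```c
/* Sketch: subtract two aligned up-sampled buffers and decimate back to the
   original rate by keeping every factor-th difference value.
   (A real implementation would low-pass filter before decimating.) */
#include <stddef.h>

size_t diff_and_decimate(const float *a, const float *b, size_t n,
                         int factor, float *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i += (size_t)factor)
        out[k++] = a[i] - b[i];
    return k;                 /* number of output samples written */
}
```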

So is there any code that already does this type of thing?

Or, more generally, can you recommend a method for sending a “stimulus” out through a “device” and then recording the output in a way that repeated recordings can thereafter be compared, so that I can see and listen to the “differences” attributable only to changes to the device?

What feature do you want to examine precisely?
I am asking because a simple subtraction may not tell you very much about the difference between the two files (especially with regard to the harmonic content).
All it does is remove the content that is 100% correlated.
If the two tracks were aligned correctly, you could simply use the Voice Removal effect (simple mode). The preliminary step would be to import the two files and then (if they are mono) make them into a stereo track (just choose “Make Stereo Track” from the track drop-down menu of the upper track).
The result after the voice removal is the difference between the two channels, scaled by 0.5.
You can split the new stereo track (both channels are identical) and amplify one of them by 6 dB to get the real difference.
You can also try this plug-in: https://forum.audacityteam.org/t/karaoke-rotation-panning-more/30112/1
The preliminary steps are the same but you’ll end up with the rest of the first wave file in the left channel and the rest of the second file in the right channel.
What’s more, there’s a control that lets you shift one of the two channels by an exact number of samples (“delay”).
But we now have to face the fact that this offset is not known in the first place.
You’ve already proposed a method that works with the slope of the signal.
In Nyquist, that’s really easy,

(slope s)

returns this value (multiplied by the sound’s sample rate). s is the global variable that holds the input sound passed from Audacity, either mono or stereo.
For the log you can either use ‘s-log’ or convert directly to dB with ‘linear-to-db’.
Quantization is also available via ‘quantize’.
However, I don’t want to turn this into a Nyquist tutorial, but it may be that we end up with some code that you can try directly from the Nyquist Prompt.
I am not sure how you want to find the exact alignment from the values obtained so far.
The question is how much one signal differs from the other. You’re lost if any phase shifts or the like are introduced. We are currently assuming that the recording starts, in both cases, with a slope that is practically identical.
There are several approaches that seem to work better.
Firstly, I would work with the integral of the absolute sample values of a little portion of the two sounds. This gives a wavy, ascending line.
You can compare the two lines with a least-squares calculation. This gives you two values:
A, the y-axis crossing, and B, the slope, i.e. Left-Channel = A + B * Right-Channel.
If the two are aligned correctly, you should get the values 0 and 1.
This can be done repeatedly with different delays until you’ve found the minimum (or until the correlation has its maximum).
We can shorten this process by remapping the values so that the signal value, instead of the timeline, serves as the x-axis.
In this case, the A coefficient will tell us how much latency offset we have.
That’s like shifting our ascending line left/right instead of up/down until the error is minimal.
(That’s why we must integrate the signal; the recursive method would work with the raw signals.)
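Just to illustrate the least-squares part outside of Nyquist, here is a rough C sketch (plain arrays, hypothetical helper name): it integrates the absolute values of two short excerpts and fits integral_a = A + B * integral_b, so for well-aligned signals of equal level A should come out near 0 and B near 1.

```c
/* Sketch: integrate |a| and |b| over a short window, then least-squares fit
   integral_a = A + B * integral_b.  For well-aligned, equal-level signals
   A should be near 0 and B near 1. */
#include <math.h>
#include <stddef.h>

void integral_fit(const float *a, const float *b, size_t n,
                  double *A, double *B)
{
    double ia = 0.0, ib = 0.0;                  /* running integrals */
    double sx = 0, sy = 0, sxx = 0, sxy = 0;    /* least-squares sums */

    if (n == 0) { *A = 0.0; *B = 0.0; return; }

    for (size_t i = 0; i < n; i++) {
        ia += fabs((double)a[i]);
        ib += fabs((double)b[i]);
        sx  += ib;                              /* x = integral of |b| */
        sy  += ia;                              /* y = integral of |a| */
        sxx += ib * ib;
        sxy += ib * ia;
    }
    double denom = (double)n * sxx - sx * sx;
    *B = (denom != 0.0) ? ((double)n * sxy - sx * sy) / denom : 0.0;
    *A = (sy - *B * sx) / (double)n;
}
```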
The final algorithm really depends on how similar the two signals are and whether there is noise at the beginning or pure digital silence.