Hey guys. Newbie here.
Sorry that my first post is a request for help but I’m almost literally going insane over this.
I’ve been programming a sound format that achieves 32:9 compression using linear prediction. [Each frame packs 16 16-bit samples - 256 bits - into an 8-bit header plus 64 bits of nibbles, i.e. 72 bits, hence the 9:32 ratio.]
It’s basically a challenge to myself to see whether such compression can be achieved with little loss in the signal.
The general idea for the format is: [beware really verbose descriptions]
//! These types just make the code easier to understand
//! ... for me, anyway =P
typedef unsigned char u8; //! Unsigned 8-bit
typedef unsigned short u16; //! Unsigned 16-bit
typedef unsigned int u32; //! Unsigned 32-bit
typedef unsigned long long u64; //! Unsigned 64-bit
//! Sound type
typedef struct {
    u32   numBlk;       //! Number of sample blocks
    float pCoef[16][2]; //! Global prediction coefficients [16 pairs]

    //! Each sample block is composed of 32/8 [4] frames of sample data.
    //! Each frame, in turn, is composed of 64/4 [16] sample nibbles and
    //! an 8-bit header:
    //!   Bit0-3: Coefficient selection
    //!   Bit4-7: Shift scale
    //! Basically, the 'coefficient selection' chooses the pair of
    //! coefficients [from the header] that best predicts the data in
    //! the frame, while the 'shift scale' provides error correction
    //! scaling. [the data is in the range of -32768..+32767]
    //! Since there are only 16 coefficient pairs, prediction will never
    //! be exact, so there's error correction.
    //! Each nibble is a value [-8..+7] that is multiplied by the shift
    //! scale [which is really 2^n] in order to more closely match the
    //! original sound data.
    //! The data is packed in a nifty way so that it is properly aligned
    //! [if you remember, a frame has an 8-bit header] to a DWORD.
    struct {
        u8  frameHeader[32/8]; //! 32 = bits in a DWORD. Four frames.
        u64 frameData  [32/8]; //! Ditto.
    } block[];
} sound_t;
#include <math.h> //! for powf()

//! Decoding the sample data
void Decode(sound_t *snd, float *dest) {
    //! storage for the last two samples [for prediction]
    float old[2] = {0.f,0.f};
    //! decode each block
    for(u32 blk = 0; blk < snd->numBlk; blk++) {
        //! decode each frame
        for(int frm = 0; frm < 32/8; frm++) {
            //! read frame header
            u32 frameHead  = snd->block[blk].frameHeader[frm];
            u32 coefSelect = frameHead & 0xF;
            float errorScale = powf(2.f, (float)(frameHead>>4));
            //! read frame data
            u64 frameData = snd->block[blk].frameData[frm];
            //! make a shortcut to the coefficients
            //! [makes code slightly cleaner]
            float *coef = snd->pCoef[coefSelect];
            //! decode each sample
            for(int i = 0; i < 64/4; i++) {
                //! find predicted data
                float predict = coef[0]*old[0] + coef[1]*old[1];
                //! find the sample correction data
                //! [sign-extend the nibble from 0..15 to -8..+7 first]
                int nib = (int)(frameData & 0xF); if(nib & 8) nib -= 16;
                float corNib = (float)nib * errorScale;
                //! find the corrected sample
                float sample = predict + corNib;
                //! output this sample
                *dest++ = sample;
                //! shuffle the data
                old[1] = old[0];
                old[0] = sample;
                //! shift out the old nibble to get the next sample
                frameData >>= 4;
            }
        }
    }
}
I’ve read a bit of ‘Numerical Recipes in C’, in particular the linear prediction section, but it never covers what I’m trying to do.
It talks about how to find the prediction coefficients for a given set of data, but it never mentions how to reduce these predictors to, say, 16 or so in such a way that the error is minimized.
Oops, sorry that I didn’t see the reply - I didn’t receive an e-mail notification for some reason [will check into this in a bit].
Yeah, I’ve read the Wiki articles, read at least another six articles found on Google, but I still can’t figure it out.
I think the main issue is that, while I’m quite good at math, I haven’t fully studied DSP-related mathematics, which hinders my ability to actually code this. [I can fully understand the part about calculating the coefficients for a sample frame, but I can’t get past that to calculating optimized coefficients.]
I’ll ask on the developers’ list like you suggested, though. Cheers.
Well, I’ve been reading some more and have come up with three solutions:
Subset/feature selection
Apparently, this is the more ‘mathematically correct’ way of doing it. It’s also very computationally expensive and WAY past me.
k-d trees
In image palette quantization, you can use an octree to reduce the number of colours. Since I’m working with two coefficients, I’m thinking of using a quadtree to do the same thing. The problem is that it’s also relatively expensive, especially since I’ll be dealing with literally thousands of coefficient pairs.
Median cut
I haven’t read much about this, but it may be possible to adapt the general cube-based approach (for RGB images) into a square to quantize the coefficient pairs (rough sketch below).
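For the median-cut route, here’s a rough sketch of the kind of thing I’m imagining - the RGB-cube version adapted to 2D coefficient pairs (all the names here are made up, and a real version would probably want error-weighted splits):

#include <stdlib.h>

//! A 2D coefficient pair to be quantized
typedef struct { float c[2]; } coef_t;

//! qsort comparator; 'axis' selects which component to sort on
static int axis;
static int CmpCoef(const void *a, const void *b) {
    float d = ((const coef_t*)a)->c[axis] - ((const coef_t*)b)->c[axis];
    return (d > 0) - (d < 0);
}

//! Recursively split pts[0..n) into 'boxes' boxes, writing one
//! average pair per box to *out [advanced as we go]
void MedianCut(coef_t *pts, int n, int boxes, coef_t **out) {
    if(boxes <= 1 || n <= 1) {
        //! Leaf: emit the average of this box
        coef_t avg = {{0.f,0.f}};
        for(int i=0;i<n;i++) avg.c[0] += pts[i].c[0], avg.c[1] += pts[i].c[1];
        if(n) avg.c[0] /= n, avg.c[1] /= n;
        *(*out)++ = avg;
        return;
    }
    //! Find the axis with the largest extent
    float min[2] = {pts[0].c[0], pts[0].c[1]}, max[2] = {min[0], min[1]};
    for(int i=1;i<n;i++) for(int k=0;k<2;k++) {
        if(pts[i].c[k] < min[k]) min[k] = pts[i].c[k];
        if(pts[i].c[k] > max[k]) max[k] = pts[i].c[k];
    }
    axis = (max[1]-min[1] > max[0]-min[0]) ? 1 : 0;
    //! Sort along that axis and split at the median
    qsort(pts, n, sizeof(coef_t), CmpCoef);
    MedianCut(pts,       n/2,     boxes/2,         out);
    MedianCut(pts + n/2, n - n/2, boxes - boxes/2, out);
}

Calling it as coef_t table[16], *p = table; MedianCut(allPairs, pairCount, 16, &p); would leave the 16 ‘palette’ pairs in table [assuming at least 16 input pairs].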
Okay, I’ve found a mathematically correct, better way.
Instead of quantizing the data using the previous methods, I’ll do the following:
#include <float.h> //! for DBL_MAX

//! Prediction order [a pair of coefficients per vector here]
#define ORDER 2

//! Calculate dot product of two vectors
double DotProduct(const double *v0, const double *v1) {
    double dot = 0.0;
    for(int i=0;i<ORDER;i++) dot += v0[i] * v1[i];
    return dot;
}

//! Merge n vectors into m, taking into account their relation
void Merge(int srcCnt, int dstCnt, double src[][ORDER], double dst[][ORDER]) {
    //! For each vector...
    int vecUse[dstCnt];
    for(int i=0;i<dstCnt;i++) vecUse[i] = 0;
    for(int i=0;i<srcCnt;i++) {
        //! ... find destination vector with smallest distance
        int bestPos = 0;
        double bestDist = DBL_MAX;
        for(int j=0;j<dstCnt;j++) {
            //! Take dot product
            double dot = DotProduct(src[i], dst[j]);
            //! Is it more related than the current best?
            if(dot < bestDist) bestPos = j, bestDist = dot;
        }
        //! Accumulate target data
        for(int j=0;j<ORDER;j++) dst[bestPos][j] += src[i][j];
        //! Increase usage count
        vecUse[bestPos]++;
    }
    //! Take average of accumulated data
    for(int i=0;i<dstCnt;i++) {
        //! Have anything?
        int cnt = vecUse[i];
        if(!cnt) {
            //! No data, clear
            for(int j=0;j<ORDER;j++) dst[i][j] = 0.0;
            //! Next
            continue;
        }
        //! Divide each element
        for(int j=0;j<ORDER;j++) dst[i][j] /= (double)cnt;
    }
}
I’ve tried this, but it doesn’t seem to quite work.
I think the main issue is that the linear prediction coefficients I end up with are not normalized in any way (I’ve even had coefficients being over 500.0).
Does anyone else pick up anything wrong with this?
I think if you ask on a DSP-math specialized Usenet group like comp.dsp you will get more (and hopefully better) answers than here or on the Audacity developers’ list, where obviously nobody knows the answer to your question either. Searching the internet for “Matlab” or “Octave” (math software) together with “LPC” (linear predictive coding) will probably give better results than searching for audio-related topics. Because practical audio recording and mixing doesn’t have much to do with math, most people in audio forums can’t answer advanced math problems in a really helpful way.
Thanks for the links - for some reason, it didn’t cross my mind to go to DSP-math groups >_<
I’ll give it a shot and, if successful, I’ll post the results here in case there are others like me looking for a solution to this.
On a semi-related note: I wonder if it’d be at all useful to standardize this format so that it can be carried in container files like WAV. (At the moment, the format is optimized for size and so is extremely basic, with no support for anything other than looping and multiple channels, since its intended purpose is internet-transmission-friendly compression for use on a computer.)
I once had a paper (from ten years ago or something) that mathematically described why it is not possible with LPC to achieve better compression than some specific amount (sorry, I’ve forgotten the number): the more the compression, the bigger the error, and there is a certain limit where the error outweighs the compression. Unfortunately I can’t find that paper anymore (the file was probably deleted a long time ago).
There are some explanations in the Praat docs about the limits of LPC encoding:
The Praat source code is freely available, so maybe you’ll find more examples in the code.
LPC was originally developed for realtime speech encoding, where “predictive” means that the predicted data is “future” data that does not yet exist at the time the predicted value is computed. Of course this looks different if you use LPC on audio data that already exists in memory or on disk, but even then it is not possible to completely eliminate the error, which makes LPC pretty useless for encoding lossless formats like WAV or AIFF because the LPC error will reduce the audio quality with every copy generation.
Because LPC is a lossy codec (and I’m an electrician who builds low-noise microphone preamplifiers), I was never really much interested in that topic, so I’m sorry again that I cannot help very much here.
Well, in most LPC codecs, the signal is predicted with no error correction (because it’s normally implemented for speech, and you rarely need error correction for speech to be recognizable). I account for this in my code (using 4-bit nibbles as the correction data and a 2^n scalar for each sample frame).
Also, WAV isn’t necessarily lossless - it can store aLaw/muLaw, GSM, multiple ADPCM types (2-5 bits), MP3, and others (obsolete as they may be) with the extensible format flag.
What I ended up doing was:
-Use recursion instead of matrix inversion to solve the system of linear equations [O(n^2) vs O(n^3)] (see the sketch after this list)
-Calculate linear prediction coefficients for every frame [frame size is now 256 samples] as well as correlation data [this is important]
-To merge the coefficients, find the destination [final] coefficient vector with the best Tanimoto distance and perform a weighted average (where the weight is the correlation)
-To compress each frame, do a brute-force search for the best prediction model, data scale [shift], and error correction that results in the smallest RMS error [this takes VERY long]
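The O(n^2) recursion over the autocorrelation lags is essentially the classic Levinson-Durbin method; here’s a minimal sketch of that plus a Tanimoto similarity for the merging step (function names are just illustrative, the Tanimoto shown is the continuous-vector form, and real code should guard against the prediction error hitting zero):

//! Autocorrelation of x[0..n) for lags 0..order
void Autocorrelate(const float *x, int n, double *r, int order) {
    for(int lag = 0; lag <= order; lag++) {
        double sum = 0.0;
        for(int i = lag; i < n; i++) sum += (double)x[i] * x[i-lag];
        r[lag] = sum;
    }
}

//! Solve the LPC normal equations in O(n^2) via Levinson-Durbin.
//! lpc[0..order) predicts x[i] ~= Sum(lpc[j] * x[i-1-j]).
void LevinsonDurbin(const double *r, double *lpc, int order) {
    double err = r[0]; //! assumed nonzero
    for(int i = 0; i < order; i++) {
        //! Reflection coefficient for this stage
        double k = r[i+1];
        for(int j = 0; j < i; j++) k -= lpc[j] * r[i-j];
        k /= err;
        //! Symmetric in-place update of the earlier coefficients
        for(int j = 0; j < i/2; j++) {
            double t = lpc[j];
            lpc[j]     -= k * lpc[i-1-j];
            lpc[i-1-j] -= k * t;
        }
        if(i & 1) lpc[i/2] -= k * lpc[i/2];
        lpc[i] = k;
        //! Prediction error shrinks every stage
        err *= 1.0 - k*k;
    }
}

//! Tanimoto similarity between two coefficient vectors
//! [1.0 = identical; smaller = less related]
double Tanimoto(const double *a, const double *b, int n) {
    double ab = 0.0, aa = 0.0, bb = 0.0;
    for(int i = 0; i < n; i++)
        ab += a[i]*b[i], aa += a[i]*a[i], bb += b[i]*b[i];
    return ab / (aa + bb - ab);
}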
For a 90,000 sample waveform, the coefficient calculation [and merging] took ~4ms, and frame compression took 2-5ms [variance is explained by the search routine cutting off a candidate if its error exceeds the best one].
The samples that result in the best S/N ratio are predictable, sine wave-ish samples, but by adjusting the prediction order, you can compress other stuff with minimal noise (like Rhodes pianos).
I can probably optimize the routine a bit more to get more speed, but so far, I’m happy with it (even if it is slower than MP3 at the best settings =P).
I’ve modified the format slightly, and now it not only sounds better, but has better compression (26.67% [frame size = 15] or 28.57% [frame size = 7]).
What I ended up doing was scrapping the ‘multiple prediction models’ idea, using a single model for all the data, and increasing the order to 24 just because I can and it sounds best (it doesn’t affect the actual sample compression at all - just increases the header size).
The compressed data is broken up into frames [no blocks, this time] of 7 or 15 samples [depending on how much quality you want].
There’s a 4-bit ‘shift’ value [2^n scalar for…] and n nybbles [where n is 7 or 15]. This results in each frame fitting in 32 or 64 bits (rough decode sketch below).
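Decoding one 64-bit [n = 15] frame now looks something like this (sketch only - it reuses the u64 typedef from my first post, and the nibble order and history handling are simplified):

//! Decode one 64-bit frame: a 4-bit shift followed by 15 signed
//! correction nibbles. hist[] holds the previous 'order' output
//! samples, most recent first.
void DecodeFrame64(u64 frame, const float *coef, int order,
                   float *hist, float *dest) {
    float scale = (float)(1u << (frame & 0xF)); //! 2^shift
    frame >>= 4;
    for(int i = 0; i < 15; i++) {
        //! Predict from history
        float predict = 0.f;
        for(int j = 0; j < order; j++) predict += coef[j] * hist[j];
        //! Sign-extend the 4-bit nibble to -8..+7 and scale it
        int nib = (int)(frame & 0xF); if(nib & 8) nib -= 16;
        float sample = predict + (float)nib * scale;
        //! Output and shuffle history
        *dest++ = sample;
        for(int j = order-1; j > 0; j--) hist[j] = hist[j-1];
        hist[0] = sample;
        frame >>= 4;
    }
}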
Of course, you can’t really judge by just reading numbers so I’ve uploaded a few demonstrations here.
From what I’ve noticed, unpredictable samples [like bells, saw leads, etc.] tend to sound the worst, while others [like pianos, etc.] sound the best.
EDIT:
Oops, looks like I uploaded a wrong sample for the bells… heh. xP