To me, the dynamic range is defined by the audio’s bit depth, e.g. 0 dB to -96 dB for 16 bit. The dynamic range of a specific audio file is therefore the highest RMS value present minus the lowest representable, non-zero RMS value (zero itself would be -inf dB).
For example:
24 bit with 6 dB of headroom:
-6 dB minus -144.5 dB = 138.5 dB Dynamic Range.
Note: those are actually peak values rather than RMS, just for simplicity’s sake.
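To make the arithmetic concrete, here is a minimal sketch using the usual rule of thumb of roughly 6 dB per bit; the function name and the 6 dB headroom figure are just illustrative:

import math

def dynamic_range_db(bit_depth, headroom_db=0.0):
    # Lowest representable level relative to full scale, about -6.02 dB per bit
    floor_db = -20 * math.log10(2) * bit_depth   # ~ -96.3 dB (16 bit), ~ -144.5 dB (24 bit)
    peak_db = -headroom_db                       # loudest peak actually present
    return peak_db - floor_db

print(round(dynamic_range_db(24, headroom_db=6.0), 1))   # 138.5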
However, what you are really interested in is a kind of signal-to-noise ratio.
The contrast tool would spit out this measure if you select the useful signal and then pure noise. This might give you a difference of, say, 40 dB.
Or you do it with foreground and background audio, in which case the difference will be smaller.
This measurement is misleading because the foreground most likely has quiet passages in between.
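In code, that comparison boils down to the difference of two RMS levels in dB. A minimal sketch, assuming the two selections are already available as float arrays in the range [-1, 1] (the names are mine, not taken from any particular tool):

import numpy as np

def rms_db(samples):
    # RMS level of a selection, in dB relative to full scale
    rms = np.sqrt(np.mean(np.asarray(samples, dtype=np.float64) ** 2))
    return 20 * np.log10(rms) if rms > 0 else float("-inf")

def contrast_db(signal_selection, noise_selection):
    # e.g. ~40 dB when the signal selection sits 40 dB above the noise selection
    return rms_db(signal_selection) - rms_db(noise_selection)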
I think that the only sensible measurement is a histogram of all RMS values. You would simply count how many times a certain RMS value occurs in the signal, measured 100 times per second or so.
Something like:
0 dB    -
-10 dB  |
-20 dB  |||
-30 dB  ||
-40 dB  |||
-50 dB  |||
-60 dB  ||
-70 dB  ||||
-80 dB  ||||||
-inf dB ||||||||| (the silent parts)
The local peaks should now represent the different layers, namely foreground, background, noise, silence (and dither noise) and perhaps more.
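As a rough sketch of how such a histogram could be computed: measure RMS in short windows (10 ms, i.e. about 100 measurements per second) and count the dB levels. The window length, bin width and the -100 dB silence floor are assumed values, not prescribed:

import numpy as np

def rms_histogram(samples, sample_rate, window_s=0.01, bin_width_db=10.0):
    # samples are assumed to be floats in [-1, 1]
    win = int(sample_rate * window_s)
    n = len(samples) // win
    frames = np.asarray(samples, dtype=np.float64)[:n * win].reshape(n, win)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Digital silence (RMS == 0) is placed on an assumed -100 dB floor
    db = np.where(rms > 0, 20 * np.log10(np.maximum(rms, 1e-12)), -100.0)
    db = np.clip(db, -100.0, 0.0)
    bins = np.arange(-100.0, 0.0 + bin_width_db, bin_width_db)
    counts, edges = np.histogram(db, bins=bins)
    return counts, edges

# Crude text plot like the one above, scaled so the longest bar is 50 characters:
# counts, edges = rms_histogram(samples, 44100)
# for lo, c in zip(edges[:-1], counts):
#     print(f"{lo:6.0f} dB {'|' * int(50 * c / max(counts.max(), 1))}")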
This is a statistical approach that is often taken to refine blind source separation, e.g. separating voice from ambient noise. It assumes that pure voice has a different distribution from other audio.
To make this curve comparable, you should provide different samples, such as pure noise, pure environment noise, instrumental music, narrated texts, and of course the files you actually want to analyze.
By the way, those tactics could also be employed for an “ideal”, automatic dynamic compression for audio books and the like.