Is it possible to create an audio map of volume?
Let’s say sound is defined as 33 dB or more… anything less is not sound.
The length of an element/step in the map would be 0.2 s.
0 for no sound, 1 for sound.
000001111100000 would mean that the file is 3 s long: 1 s of silence, 1 s of sound, and 1 s of silence.
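For anyone trying this outside Audacity, the idea can be sketched in Python (an illustrative outline only, not from this thread; it assumes 16-bit mono PCM samples as plain integers, a threshold given relative to full scale, and one map element per 0.2 s block):

```python
# Sketch: turn a list of 16-bit PCM samples into a 0/1 "volume map".
# Assumptions (not from the thread): mono audio, threshold in dB relative
# to full scale (32768), one map element per 0.2 s block, peak detection.

def volume_map(samples, sample_rate, threshold_db=-33.0, block_sec=0.2):
    threshold = 32768 * 10 ** (threshold_db / 20.0)  # dBFS -> linear amplitude
    step = int(block_sec * sample_rate)              # samples per map element
    bits = []
    for start in range(0, len(samples), step):
        block = samples[start:start + step]
        peak = max(abs(s) for s in block)            # peak level in this block
        bits.append(1 if peak > threshold else 0)
    return "".join(str(b) for b in bits)

# Example: 0.6 s of silence then 0.2 s of loud signal, at 10 samples/second
print(volume_map([0] * 6 + [20000, -20000], sample_rate=10))  # -> "0001"
```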
Is there an actual sound job behind this? That’s how Noise Gate works. It either passes sound or it doesn’t. There’s little or no middle-ground processing.
Audacity is a sound or show manager. I don’t know of any way to make Noise Gate tell us the state of what it’s doing.
Another problem with your tool is deciding what to call “almost sounds.” Noise Gate has settings to be intentionally sloppy so it can make firm decisions on messy sound or sound that spends a lot of time right in the middle.
Koz
I need to generate this file to detect sentence and word positions in an audio file, as part of a player app. It would be off-topic to describe why I need it.
I don’t think Audacity has any tools which can do that.
Koz
I’ve found out I have to write my own program for what I want to do.
If you have wild sound, also remember to “look ahead” to make sure the gate is already open when the first sound arrives. If you don’t do that, you can have errors where the program chops off the first sounds of each word. The early broadcast gates had that problem.
That’s in addition to being able to manage quiet sounds and decide whether they’re good or not.
This isn’t easy.
Koz
OK guys, I am writing this in Delphi. I have a library which can read data from an audio file and process it, but I am not an expert. I created a sample file using a mic and my voice. At the beginning there is almost zero volume, but the program (reading the first 8 bits of the 16-bit data) detects crazy values… So I found out that these values do not represent the volume level. So what do they represent? After some searching I found this question:
https://dsp.stackexchange.com/questions/55436/error-reading-a-pcm-file
He says he has a WAV file as input, with the same frequency and sampling. He uses some calculations which I can repeat… And now I get some interesting values (up, down):
(2205,4)
(11025,26)
(3675,16)
(490,1)
(3675,2)
(1225,4)
etc.
This is in my head:
Are these values what I need to get? Is it the level of the volume? Why does the guy use resampling? Is it only to fit on the diagram? Maybe I should calculate the difference
result := up - down;
and divide by two to get the volume level?
result := (up - down) div 2;
Then the result for the value (11025,26) is 5499.5, but that seems pretty high when I expect silence… Maybe divide it by 1000? To get dB?
What do you think? It is pretty hard to find a forum where anyone could help me with these calculations.
Yet I have found that the “up” value is always one of these when I process my voice…
315, 525, 2250, 2450, 3150, 3675, 4410, 7350, 11025
which rather look like frequencies…
And the “down” value is always different, but the maximum is 121.
If I understand the question correctly, then you can do that in Audacity, but to do it entirely in Audacity you have to use “Nyquist” (Nyquist - Audacity Manual)
If you post an example file that you want to analyze in this way, I’ll help you write the necessary code.
Steve:
The audio files which I would like to process are listed on this page:
https://www.mechon-mamre.org/p/pt/ptmp3prq.htm
For example:
https://www.mechon-mamre.org/mp3/t0103.mp3
I don’t know how to use that programming language.
One more note:
I’m using Audacity 2.1.2 on Windows XP.
This may be run in the Nyquist Prompt: (https://manual.audacityteam.org/man/nyquist_prompt.html)
;version 4
;type analyze
;; For one data point every 0.2 seconds, we need to look at the
;; peak level each 0.2 seconds (= 0.2 x original sample rate).
(setf step (truncate (* 0.2 *sound-srate*)))
;; Convert the "threshold" of -33 dB to linear-scale
(setf threshold (db-to-linear -33))
(let ((data (snd-avg *track* step step op-peak))
      (output ()))
  (setf *track* nil)
  ;; Fetch "data" samples one at a time.
  (do ((val (snd-fetch data) (snd-fetch data)))
      ;; Stop when we run out of samples, and print the list of values.
      ((not val) (print (reverse output)))
    (if (> val threshold)
        (push 1 output)
        (push 0 output))))
Yet I have found that the “up” value is always one of these when I process my voice…
315, 525, 2250, 2450, 3150, 3675, 4410, 7350, 11025
which rather look like frequencies…
And the “down” value is always different, but the maximum is 121.
Each sample value represents the positive or negative* amplitude of a wave at one instant in time. [u]Here[/u] is an easy introduction to how digital audio works. The frequency information (which you shouldn’t need) has to be derived from that amplitude data and the sample rate, using FFT.
The positive & negative peaks are a rough indication of volume/loudness, or you can calculate an average (or moving average) of the positive values, or of the absolute values,** or you can calculate RMS. Perceived loudness (such as the [u]EBU R 128 standard[/u]) takes frequency and the [u]equal-loudness curves[/u] into account, but if you’re only working with voice, RMS, one of the average methods, or just finding the peaks (over some short time window) should work.
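Those measures can be sketched in Python (illustrative only; it assumes the samples are already scaled to ±1.0 floats, which is not how raw 16-bit WAV data arrives):

```python
import math

# Three simple loudness measures for a list of samples scaled to +/-1.0.

def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)

def mean_abs(samples):
    """Average of the absolute values (a simple average would be ~0)."""
    return sum(abs(s) for s in samples) / len(samples)

def rms(samples):
    """Root mean square: sqrt of the average squared sample."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# A full-scale sine wave: peak = 1.0, mean_abs ~ 0.64, rms ~ 0.71.
sine = [math.sin(2 * math.pi * i / 100) for i in range(100)]
print(round(peak(sine), 2), round(mean_abs(sine), 2), round(rms(sine), 2))
```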
Then the result for the value (11025,26) is 5499.5, but that seems pretty high when I expect silence… Maybe divide it by 1000? To get dB?
“Digital silence” is a series of zeros. (A single zero can simply be a waveform zero-crossing.) If you record with a microphone there will be acoustic & electrical noise so you’ll never get “pure digital silence”.
Decibels are calculated as 20 × log(A/Aref), where A is the amplitude. With digital integer formats, the 0 dB reference is the highest you can “count” with a given number of bits. So if you have a 16-bit file with peaks of −32,768 or +32,767, you have 0 dB peaks. Analog-to-digital converters, digital-to-analog converters, “regular” WAV files, and CDs are all integer based, so 0 dB is the “digital maximum” and digital dB values are usually negative. (Everything is scaled by the drivers, so a 0 dB 8-bit file plays just as loud as a 0 dB 16-bit file.) With floating-point audio, the 0 dB reference is 1.0 and for all practical purposes there is no upper (or lower) limit.
The 0dB reference for SPL (acoustic sound pressure level) is approximately the quietest sound that humans can hear so dB SPL levels are positive.
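The formula can be checked with a quick Python sketch (illustrative only; it takes 32768 as the 16-bit full-scale reference):

```python
import math

def to_dbfs(amplitude, full_scale=32768):
    # dB = 20 * log10(A / Aref); 0 dB at full scale, negative below it.
    return 20 * math.log10(amplitude / full_scale)

print(round(to_dbfs(32768), 1))  # full scale -> 0.0 dB
print(round(to_dbfs(16384), 1))  # half scale -> about -6.0 dB
print(round(to_dbfs(733), 1))    # roughly the -33 dB threshold from the script
```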
* Just to keep things “interesting”, 8-bit WAV files use unsigned values with silence biased at 128.
** A simple average won’t work because you have a waveform that’s positive half the time and negative half the time with an average of zero.
I did not see that a new page had been added. Sorry for the late answer.
Now tested; the result is 0.022 …
error: unbound variable - STEP
if continued: try evaluating symbol again
Function: #<FSubr-LET: #b024570>
Arguments:
((DATA (SND-AVG TRACK STEP STEP OP-PEAK)) (OUTPUT NIL))
(SETF TRACK NIL)
(DO ((VAL (SND-FETCH DATA) (SND-FETCH DATA))) ((NOT VAL) (PRINT (REVERSE OUTPUT))) (IF (> VAL THRESHOLD) (PUSH 1 OUTPUT) (PUSH 0 OUTPUT)))
1>
But on a different file it worked fine. However, I would like to increase the level a bit to accept more noise.
Is it possible to write the data to file?
Thank you
DVDdoug:
Wow, now I understand correctly what the wiki article on PCM means by amplitude…
I thought the value could be anywhere on the curve, but it is the top and bottom (peak) of the wave.
I have done some tests in Delphi with a file “silence plus”. There are 8 cycles and different versions of the file after I reduced the signal:
w1 and w2 are words read from the file. With the library I use I have a problem with the description, because in the demo they write that they use a block instead of a chunk. There is a method
procedure TForm1.AudioProcessor1GetData(Sender: TComponent; var Buffer: Pointer; var NBlockBytes: Cardinal);
and NBlockBytes is 7056.
The AudioProcessor1GetData procedure is executed several times. It does this:
AudioProcessor1.Input.GetData(Buffer, NBlockBytes);
B16 := Buffer;                   // B16 is of type PBuffer16 (an array of 16-bit words)
end_ := (NBlockBytes div 4) - 1; // two 16-bit samples per loop pass, so divide by 4, not 2
for i := 0 to end_ do
begin
  move(B16[i*2], w1, 2);   // copy one 16-bit sample (two bytes) to the word w1
  move(B16[i*2+1], w2, 2); // copy the next 16-bit sample to w2
end;
I use Windows XP, so I expect the data to be little-endian and I don’t need to swap bytes.
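As a cross-check on the byte handling, this is how 16-bit little-endian signed samples can be decoded in Python (an illustrative sketch, not the Delphi library’s API):

```python
import struct

def decode_pcm16le(raw):
    """Decode raw bytes as 16-bit little-endian signed PCM samples."""
    count = len(raw) // 2
    # "<" = little-endian, "h" = signed 16-bit integer
    return list(struct.unpack("<%dh" % count, raw[:count * 2]))

# Bytes 0x00 0x01 little-endian = 0x0100 = 256; 0xFF 0x7F = 32767
print(decode_pcm16le(bytes([0x00, 0x01, 0xFF, 0x7F])))  # -> [256, 32767]
```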
The file that you sent is mono. The code is written for mono tracks only.
The “threshold” level is set here:
;; Convert the "threshold" of -33 dB to linear-scale
(setf threshold (db-to-linear -33))
To change the threshold level to, say, -18 dB, change it to this (the line starting with a semi-colon is just a comment - the second line is the relevant code)
;; Convert the "threshold" from dB to linear-scale
(setf threshold (db-to-linear -18))
Yes, it’s possible, but it’s much easier to just run the code with the “Debug” button, then (manually) copy and paste from the debug window into a file.
I did not mean writing the code to a file, but writing the results to a file.
In Delphi I have this:
var
  list: TStringList;
begin
  list := TStringList.Create;
  list.LoadFromFile(file1);
  list.SaveToFile(file2); // save the results
  list.Free;
end;
Lots of material to read, but I am tired right now; I cannot manage to read it all at the moment.
Is the result in seconds? The value is still very low. I also see only the normalized measure on the y-axis; I could not find out how to switch it.
Genesis 1:3 (Berek gimel) is at 0.699, or 0.7 s, but at the -33 dB threshold the result is 0.022, so it makes no sense to me.
It would be fine if you multiplied by 10 and rounded; then the result would be 6. But then it would be better to convert it to a 4-byte integer and write it to a file, because you need to go through the whole file.
But let me ask you one more thing. Would it be possible to use Nyquist to detect the second silence after “Berek gimel”? There is an introduction at the beginning of the file which I would like to remove. It seems that within the first 3 seconds there are two areas of silence… that is the definition of the part I would like to remove. Once the introduction is removed, I would then like to detect an array of values where each element is the beginning of a word. Would this be possible? I think it would be harder, because there are two noise levels: one for the beginning of a sentence and one for the beginning of a word. Also, the gap between one word and the next can be as short as 0.025-0.029 s (listen to “WeHan”, which is two words: we-han). I think the operation could take a lot of time, because you would need to scan the whole file in small windows of, say, 0.025 s, but with small steps like 0.006 s so that the windows overlap when detecting the area. You cannot use a step of 0.2 s, because you could miss the area (it would not fit).
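The overlapping-window idea can be sketched in Python (illustrative only; samples are assumed scaled to ±1.0, and the window/step sizes are just the values guessed above, not tuned for these files):

```python
def silence_regions(samples, sample_rate, threshold=0.02,
                    win_sec=0.025, hop_sec=0.006):
    """Return (start_time, end_time) pairs for runs of quiet windows.

    The hop is smaller than the window, so the windows overlap and a
    short gap between words cannot fall between two measurement points.
    """
    win = max(1, int(win_sec * sample_rate))  # window length in samples
    hop = max(1, int(hop_sec * sample_rate))  # step between windows
    regions, start = [], None
    for pos in range(0, len(samples) - win + 1, hop):
        quiet = max(abs(s) for s in samples[pos:pos + win]) < threshold
        if quiet and start is None:
            start = pos                       # a silent run begins
        elif not quiet and start is not None:
            regions.append((start / sample_rate, pos / sample_rate))
            start = None                      # the silent run ends
    if start is not None:                     # file ends while still quiet
        regions.append((start / sample_rate, len(samples) / sample_rate))
    return regions

# 1 s of "sound", 0.5 s of silence, 1 s of "sound" at 1000 samples/second:
sig = [0.5] * 1000 + [0.0] * 500 + [0.5] * 1000
print(silence_regions(sig, 1000))  # detects one quiet region, roughly 1.0-1.5 s
```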
I don’t know what you mean. The result is 1’s and 0’s as you requested.
As described in the code comments (of the Nyquist script that I posted), the selected audio is processed in blocks (“steps”) of 0.2 seconds. If the peak level in a block is greater than the threshold, it outputs a “1”, and if below the threshold it outputs a “0”. If that’s not what you want, then please define exactly what you want.
What I see is a dialog with the text: Nyquist returned value 0.022387