AU File Numbering

Number .au files rationally. Creating .au files with irrational numbering makes disaster recovery confusing, difficult, or impossible.

Koz

Moderator note: this thread was originally in the “Adding Features To Audacity” section. I have moved the gist of this to the Pending Feature Requests page in the Wiki - but I have retained the full thread here on the forum. WC

+1

And also, if possible, add metadata to the .au files so they can be sorted in the case of hard drive failure where the files are recoverable but not their names…

I’ve thought about this too, and I think it’s not a good idea in the long run.

Disclaimer: I am not a developer and I don’t know the code. What follows is my understanding of how Audacity works.

Those little AU files are what give Audacity its speed. Your project is stored as chunks of data no more than 1 MB in size. When you perform an edit, only those chunks involved in the edit need to be re-written to disk; beyond that, only the contents of the AUP file are updated, which is very fast as it is just an XML file that specifies where the chunks are in your project. Consider deleting a few seconds off the start of a one-hour recording. All Audacity has to do is 1) write one new AU file (or two in the case of a stereo track) corresponding to the AU file that had part of its audio deleted, and 2) update the AUP file. If metadata were included in the AU files, all of them would have to be updated, and at that point you have defeated the purpose of using the AU files.
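
To make that concrete, here is a rough C++ sketch of the idea (hypothetical types and names, not Audacity’s actual code): the project is modelled as a small index of references to immutable block files, and an edit rewrites at most the blocks it touches plus that index, never the whole recording.

// Minimal illustration only; BlockRef, ProjectIndex and TrimFromStart are
// invented names, not Audacity internals.
#include <string>
#include <vector>

struct BlockRef {            // stand-in for one blockfile reference in the .aup
   std::string auFileName;   // e.g. "e0002f4b.au"
   long        numSamples;   // samples stored in that block
};

using ProjectIndex = std::vector<BlockRef>;   // stand-in for the .aup contents

// Trim samples off the start of a track: whole leading blocks are simply
// dropped from the index, and at most one block is rewritten as a new,
// shorter .au file; every later block is reused untouched.
void TrimFromStart(ProjectIndex &index, long samplesToDrop,
                   BlockRef (*writeTrimmedBlock)(const BlockRef &, long))
{
   while (samplesToDrop > 0 && !index.empty()) {
      BlockRef &first = index.front();
      if (first.numSamples <= samplesToDrop) {
         samplesToDrop -= first.numSamples;
         index.erase(index.begin());                      // drop the whole block
      } else {
         first = writeTrimmedBlock(first, samplesToDrop); // one new block file
         samplesToDrop = 0;
      }
   }
   // saving the project now means rewriting only the small XML index
}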

One could argue that this metadata update could happen only when closing a project, but that would make closing a project painfully slow as every AU would need to be updated.

Hard drive failure where files but not file names are recoverable is, I think, a rare case. If your work is that important you should be doing regular backups.

On a related note, I’ve been testing 1.3.13-alpha and the automatic crash recovery is much more robust than in 1.3.12 or any earlier version. Knowing the numbering scheme of AU files would not, I think, help with project recovery. Even the recovery tools for 1.2.x (which relied on file names being sequential) would not help if many edits had been done to the project; they were only useful for recovering from a crash during recording. The latest 1.3.13 will recover from a crash during recording, and will sometimes recover all of the recording and other times miss the last few seconds, depending on when the crash occurs.

– Bill

Is there any document explaining how and why the project files mechanism actually works?

I don’t think “rational” numbering would help except in the case of recovering from a crash during recording. If we go back to the sequential system of 1.2.x, then files are sequentially numbered as they are recorded. But once you start editing and Audacity starts creating and deleting AU files, the file names bear less and less relation to their place in the tracks.

If you are concerned about crash recovery you should have a look at this page http://bugzilla.audacityteam.org/show_bug.cgi?id=20

Download the latest nightly from here: http://www.audacity.homerow.net/index.php?dir=mac%2F and try this test.

1) Open a new project and start recording. Let it go for a minute or two.
2) Force Quit Audacity while it is recording.
3) Restart Audacity and let it do the automatic recovery.
What I find is that Audacity recovers all but the last few seconds of the recording.

Similarly, if you do a bunch of edits then Force Quit, all the edits will be recovered.

Not that I know of. Gale?

– Bill

If we go back to the sequential system of 1.2.x, then files are sequentially numbered as they are recorded.

True. And the problem with that is what? Somebody must have thought that technique was deficient.

I’m fascinated with the idea that someone thought, no doubt after a particularly difficult evening at the pub, that the files need to be numbered completely at random. I’m even more fascinated with the idea that some method must be found to avoid duplicating filenames. I can well imagine that after an hour or two of recording, the bookkeeping task alone must be larger than the actual show.

Koz

I’ve already answered this here:
https://forum.audacityteam.org/t/all-those-little-au-files/16946/1

My understanding is that the way crash recovery was implemented depends on random, unique numbering of blockfiles and project folders. I can’t explain why recovery depends on random numbering (I can understand “unique”).

A side effect of the random numbering is that if manual recovery is needed, the blockfiles must be sorted by timestamp and numbered sequentially within that order, to mimic the ordering of files within 1.2. Even if you do that, you are not going to recover anything but a mono, unedited recording without further manual cutting and pasting of blocks. That’s a reduction from what you could recover with blockfiles that had come out of 1.2, but not that much of a reduction, because you would still have problems if the project had been edited.
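
To illustrate that manual step, here is a rough C++17 sketch (a hypothetical helper, not an official Audacity tool) that copies a project’s .au blockfiles to sequentially numbered names ordered by timestamp; as noted above, the result is only meaningful for a mono, unedited recording.

// Hypothetical recovery aid: gather .au blockfiles, sort by timestamp,
// and copy them out under sequential names.  The "b%05d.au" naming is invented.
#include <algorithm>
#include <cstdio>
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

int main(int argc, char *argv[])
{
   if (argc < 2) {
      std::fprintf(stderr, "usage: %s <project_data_folder>\n", argv[0]);
      return 1;
   }

   std::vector<fs::path> blocks;
   for (const auto &entry : fs::recursive_directory_iterator(argv[1]))
      if (entry.is_regular_file() && entry.path().extension() == ".au")
         blocks.push_back(entry.path());

   // oldest block first means earliest audio first; only true for an
   // unedited recording, as noted above
   std::sort(blocks.begin(), blocks.end(),
             [](const fs::path &a, const fs::path &b) {
                return fs::last_write_time(a) < fs::last_write_time(b);
             });

   int n = 1;
   for (const auto &p : blocks) {
      char name[32];
      std::snprintf(name, sizeof(name), "b%05d.au", n++);   // invented naming scheme
      fs::copy_file(p, fs::path(argv[1]) / name);           // copy, leave originals alone
   }
   return 0;
}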

We may be getting a lot closer to reliable recovery in 1.3.13 than we were in previous betas. There may always be a few cases where recovery won’t work despite there being blockfiles available (for example the autosave or .aup file gets wiped out or corrupted in the crash), but is it worth writing a brand new crash recovery utility to cover those cases?

As Bill says, how would it work unless it added/modified metadata for all the blockfiles even when you only edited part of the audio? Wouldn’t that impact performance? The only way I could think it could work would be if the _data folder itself contained a backup copy of the autosave or .aup file, or (for manual recovery) contained a special metadata file that told a new recovery utility the correct time sequencing of the blockfiles.
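
Purely as an illustration of that last idea (nothing like this exists in Audacity, and the file name and format are invented), such a sequencing manifest could be as simple as a small text file written into the _data folder whenever the project is saved:

// Hypothetical sketch of the "special metadata file" idea: list the blockfiles
// in playback order so a manual recovery tool would not have to rely on
// timestamps or random names.  File name and format are invented.
#include <fstream>
#include <string>
#include <vector>

void WriteSequenceManifest(const std::string &dataDir,
                           const std::vector<std::string> &blockNamesInOrder)
{
   std::ofstream out(dataDir + "/sequence.txt");   // invented file name
   long position = 0;
   for (const auto &name : blockNamesInOrder)
      out << position++ << '\t' << name << '\n';   // e.g. "0 <tab> e0002f4b.au"
}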




Gale

I don’t think anybody is expecting a rational ordering of files after several hours of editing. However, we get a number of complaints about crashes during long recordings, and anyone opening up the folders would automatically conclude, just from looking at the normal file structure, that the show is irreparably damaged.

OMG!

OK. So show recovery tools depend on irrational numbering of files. How do you keep from repeating a filename – keeping in mind that this is all happening in real time during a show? It still seems that the bookkeeping needed far outweighs any benefit.

I further shouldn’t wonder that a crash is much worse with irrational filenames because of the danger of not only losing the music structure, but the bookkeeping as well.

You might respond that you could always fall back on the creation dates and times. This makes irrational numbering of files irrational.

Koz

Since you asked …

// only determines appropriate filename and subdir balance; does not
// perform maintenance
wxFileName DirManager::MakeBlockFileName()
{
   wxFileName ret;
   wxString baseFileName;

   unsigned int filenum,midnum,topnum,midkey;

   while(1){

      /* blockfiles are divided up into hierarchical directories.
         Each toplevel directory is represented by "e" + two unique
         hexadecimal digits, for a total possible number of 256
         toplevels.  Each toplevel contains up to 256 subdirs named
         "d" + two hex digits.  Each subdir contains 'a number' of
         files.  */

      filenum=0;
      midnum=0;
      topnum=0;

      // first action: if there is no available two-level directory in
      // the available pool, try to make one

      if(dirMidPool.empty()){
         
         // is there a toplevel directory with space for a new subdir?

         if(!dirTopPool.empty()){

            // there's still a toplevel with room for a subdir

            DirHash::iterator i = dirTopPool.begin();
            int newcount        = 0;
            topnum              = i->first;
            

            // search for unused midlevels; linear search adequate
            // add 32 new topnum/midnum dirs full of  prospective filenames to midpool
            for(midnum=0;midnum<256;midnum++){
               midkey=(topnum<<8)+midnum;
               if(BalanceMidAdd(topnum,midkey)){
                  newcount++;
                  if(newcount>=32)break;
               }
            }

            if(dirMidPool.empty()){
               // all the midlevels in this toplevel are in use yet the
               // toplevel claims some are free; this implies multiple
               // internal logic faults, but simply giving up and going
               // into an infinite loop isn't acceptable.  Just in case,
               // for some reason, we get here, dynamite this toplevel so
               // we don't just fail.
               
               // this is 'wrong', but the best we can do given that
               // something else is also wrong.  It will contain the
               // problem so we can keep going without worry.
               dirTopPool.erase(topnum);
               dirTopFull[topnum]=256;
            }
            continue;
         }
      }

      if(dirMidPool.empty()){
         // still empty, thus an absurdly large project; all dirs are
         // full to 256/256/256; keep working, but fall back to 'big
         // filenames' and randomized placement

         filenum = rand();
         midnum  = (int)(256.*rand()/(RAND_MAX+1.));
         topnum  = (int)(256.*rand()/(RAND_MAX+1.));
         midkey=(topnum<<8)+midnum;

            
      }else{
         
         DirHash::iterator i = dirMidPool.begin();
         midkey              = i->first;

         // split the retrieved 16 bit directory key into two 8 bit numbers
         topnum = midkey >> 8;
         midnum = midkey & 0xff;
         filenum = (int)(4096.*rand()/(RAND_MAX+1.));

      }

      baseFileName.Printf(wxT("e%02x%02x%03x"),topnum,midnum,filenum);

      if(blockFileHash.find(baseFileName) == blockFileHash.end()){
         // not in the hash, good. 
         if(AssignFile(ret,baseFileName,TRUE)==FALSE){
            
            // this indicates an on-disk collision, likely due to an
            // orphaned blockfile.  We should try again, but first
            // alert the balancing info there's a phantom file here;
            // if the directory is nearly full of orphans we neither
            // want performance to suffer nor potentially get into an
            // infinite loop if all possible filenames are taken by
            // orphans (unlikely but possible)
            BalanceFileAdd(midkey);
 
         }else break;
      }
   }
   // FIX-ME: Might we get here without midkey having been set?
   BalanceFileAdd(midkey);
   return ret;
}
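
As an aside, reading that Printf format together with the directory comment near the top, a generated base name appears to encode its own location in the e??/d?? tree. A tiny hypothetical decoder, assuming that layout:

// Hypothetical helper, assuming the e<top>/d<mid>/ layout described in the
// comment above: decode a base name such as "e01a307f" back into its directory.
#include <cstdio>
#include <string>

void DecodeBlockName(const std::string &base)
{
   unsigned top = 0, mid = 0, file = 0;
   if (std::sscanf(base.c_str(), "e%2x%2x%3x", &top, &mid, &file) == 3)
      std::printf("%s.au -> toplevel e%02x, subdir d%02x, file %03x\n",
                  base.c_str(), top, mid, file);
}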

The key functions here seem to be BalanceMidAdd and BalanceFileAdd.

int DirManager::BalanceMidAdd(int topnum, int midkey)
{
   // enter the midlevel directory if it doesn't exist

   if(dirMidPool.find(midkey) == dirMidPool.end() &&
         dirMidFull.find(midkey) == dirMidFull.end()){
      dirMidPool[midkey]=0;

      // increment toplevel directory fill
      dirTopPool[topnum]++;
      if(dirTopPool[topnum]>=256){
         // this toplevel is now full; move it to the full hash
         dirTopPool.erase(topnum);
         dirTopFull[topnum]=256;
      }
      return 1;
   }
   return 0;
}

void DirManager::BalanceFileAdd(int midkey)
{
   // increment the midlevel directory usage information
   if(dirMidPool.find(midkey) != dirMidPool.end()){
      dirMidPool[midkey]++;
      if(dirMidPool[midkey]>=256){
         // this middir is now full; move it to the full hash
         dirMidPool.erase(midkey);
         dirMidFull[midkey]=256;
      }
   }else{
      // this case only triggers in absurdly large projects; we still
      // need to track directory fill even if we're over 256/256/256
      dirMidPool[midkey]++;
   }
}

I don’t pretend to understand all of this, but it looks like “pools” of available directory/file names are being maintained, as well as “lists” of directory/file names that are in use. It also looks like a large part of the code is doing error checking - the meat of the code that actually assigns new directory/file names is pretty compact and simple.

Koz, you really should have a look at http://bugzilla.audacityteam.org/show_bug.cgi?id=20 to see the long discussion on resolving the issue of automatic crash recovery. Download a Mac nightly and use it to record something. Force quit while it’s recording, then relaunch, and you will get back nearly everything you recorded (the very last blockfile will not have been written [no way to avoid that] so you may be missing up to six seconds at the end). Edit away to your heart’s content, then force quit, relaunch and recover, and you will get back all your edits.

With the improved crash recovery in 1.3.13 users should never have to look inside the _data folder.

– Bill

I am going to transfer the gist of the FR part of this thread to the Pending FRs in the Wiki - but I will retain the thread in the forum, moved to Audio Processing.

But personally I am minded to agree with Bill - that recovery works so much better in 1.3.13 that users should not in future need to go furtling around with their .au files.

I’ve never yet had a recovery failure in 1.3.12 or 1.3.13, so I am quite confident in the latest recovery process.

WC