Normalize path during export

dudido · October 30, 2014, 4:20pm

Hi,

I frequently use Audacity labels to mark passages, which I then export automatically using File → Export Multiple. The label content is often phonetic transcription, and accordingly contains numerous non-ASCII code points. Unfortunately there are still many pieces of software out there that keep running into issues with non-ASCII filenames. So could you please implement a feature to normalize/harmonize/flatten the filenames on export? For example, a label called

ndràngalíilě n̺áʰ‿n̥t̺ɛ̀shɛ́ɹɛ̀

should be exported as a file

07-ndrangaliile na_ntesheɹe.wav

or similar, not

07-ndràngalíilě n̺áʰ‿n̥t̺ɛ̀shɛ́ɹɛ̀.wav

Replacing each removed element of the string with an underscore could easily create problems with the path length, IMHO. Where filename conflicts occur, they could be circumvented by incrementing file names, e.g.

ndràngalíilě n̺áʰ‿n̥t̺ɛ̀shɛ́ɹɛ̀

becoming

and

ndràngalíilě n̺áʰ‿n̥t̺ɛ̀shɛ́ɹɛ́
(where only the last ` was replaced by ´)

becoming

08-ndrangaliile na_ntesheɹe_01.wav
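
The incrementing idea can be sketched roughly as follows. This is only an illustration in Python, not Audacity code; the function name `unique_names` and the two-digit `_01` suffix format are assumptions modelled on the example above.

```python
def unique_names(names):
    """Return the names in order, appending _01, _02, ... to later duplicates."""
    used = set()
    result = []
    for name in names:
        candidate = name
        counter = 0
        # Keep incrementing until the candidate does not clash with any
        # name already handed out (including earlier suffixed names).
        while candidate in used:
            counter += 1
            candidate = f"{name}_{counter:02d}"
        used.add(candidate)
        result.append(candidate)
    return result


if __name__ == "__main__":
    print(unique_names(["take", "take", "intro", "take"]))
```

A real implementation would probably also have to compare names case-insensitively on file systems that ignore case.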

Thanks for the support, and thanks for creating great software

steve · October 30, 2014, 7:59pm

I guess that “flattening” to ASCII encoding “could” be done, but there is a problem. Audacity is using Unicode and many Unicode characters have no direct equivalent in ASCII. I don’t see that Audacity has any way of knowing which code page you want, short of the considerable complexity of adding mappings for dozens of possibilities and asking the user to pick one. Here is a list of some of the most common:

111 – Greek (AST Premium Exec DOS 5.0)
112 – Turkish (AST Premium Exec DOS 5.0)
113 – Yugoslavian (AST Premium Exec DOS 5.0)
151 – Nafitha Arabic (ADOS)
161 – Arabic (ADOS)
162 – Arabic (ADOS)
163 – Arabic (ADOS)
164 – Arabic (ADOS)
165 – Arabic (ADOS)
367 – US-ASCII (7-bit)
437 – Original IBM PC hardware code page
667 - Polish (Mazovia)
668 - Slavic
708 – Arabic/Middle Eastern (ASMO 708)
709 – Arabic/Middle Eastern (ASMO 449+/BCON V4)
710 – Arabic/Middle Eastern (Transparent Arabic)
711 – Arabic (Nafitha Enhanced)
720 – Arabic/Middle East
737 – Greek
770 - Baltic
771 - Lithuanian/Cyrillic
772 - Lithuanian/Cyrillic
773 - Estonian, Lithuanian and Latvian
774 - Lithuanian
775 – Estonian, Lithuanian and Latvian
776 - Lithuanian (extended CP770)
777 - Accented Lithuanian (old) (extended CP771)
778 - Accented Lithuanian (extended CP775)
790 - Polish (Mazovia)
808 - Cyrillic with euro
813 - ISO 8859-7
819 - ISO 8859-1
848 - Ukrainian with euro
849 - Belarusian with euro
850 – "Multilingual (Latin-1)" (Western European languages)
851 - Greek
852 – "Slavic (Latin-2)" (Central and Eastern European languages)
853 - Turkish (Latin-3)
854 - Spanish
855 – Cyrillic
856 – Hebrew
857 – Turkish
858 – "Multilingual" with euro symbol
859 - "Multilingual" (Latin-9)
860 – Portuguese
861 – Icelandic
862 – Hebrew
863 – French (Quebec French)
864 - Arabic/Middle East
865 – Danish/Norwegian
866 – Cyrillic
867 – Czech (Kamenický), can also apply to Hebrew (based on CP862), (conflictive ID)
868 - Arabic/Middle East/Urdu
869 – Greek
872 - Cyrillic with euro
874 – Thai
881 – Latin 1 (AST Premium Exec DOS 5.0) (conflictive ID)
882 – Latin 2 (AST Premium Exec DOS 5.0) (conflictive ID)
883 – Latin 3 (AST Premium Exec DOS 5.0) (conflictive ID)
884 – Latin 4 (AST Premium Exec DOS 5.0) (conflictive ID)
885 – Latin 5 (AST Premium Exec DOS 5.0) (conflictive ID)
891 - Korean
895 - Czech (Kamenický), (conflictive ID)
900 - Cyrillic
901 - Extension of ISO 8859-13 with euro
902 - ISO Estonian with euro
912 - Extension of ISO 8859-2
913 - ISO 8859-3
914 - ISO 8859-4
915 - Extension of ISO 8859-5
916 - ISO 8859-8
919 - ISO 8859-10
920 - ISO 8859-9
921 - Extension of ISO 8859-13
922 - ISO Estonian
923 - ISO 8859-15
932 - Japanese (DOS/V) (DBCS) (conflictive ID; Windows version is IBM 943)
934 - Korean (DOS/V) (DBCS)
936 - Chinese (DOS/V) (DBCS)
938 - Taiwanese (DOS/V, OS/2)
942 - Japanese SAA (OS/2)
943 - Japanese (Windows CP 932)
944 - Korean SAA (OS/2)
948 - Traditional Chinese SAA (OS/2)
949 – Korean (Unified Hangul / Extended Wansung)
950 – Chinese traditional / Taiwanese / Hong Kong
966 – Saudi Arabian
991 - Polish (Mazovia)
1098 - Farsi
1111 - ISO 8859-2
1116 - Estonian
1117 - Latvian
1118 - Lithuanian
1119 - Lithuanian/Cyrillic
1124 - ISO 8859-5
1125 - Ukrainian
1129 - ISO Vietnamese
1131 - Belarusian
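
To illustrate why the choice of code page matters, here is a small Python sketch (not Audacity code). The helper name `encodable` is hypothetical; `cp437` and `cp852` are the Python codec names for code pages 437 and 852 from the list above.

```python
def encodable(text, codec):
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # U+011B (e with caron) exists in the Latin-2 code page 852,
    # but not in the original IBM PC code page 437.
    for codec in ("cp437", "cp852"):
        print(codec, encodable("\u011b", codec))
```

The same character is representable in one code page and not in another, so a mapping only makes sense once a target code page has been chosen.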

The alternative is to convert to “plain old ASCII”, but then your example:
ndràngalíilě n̺áʰ‿n̥t̺ɛ̀shɛ́ɹɛ̀
comes out as:
ndràngalíil? n?á??n?t???sh???
which could optionally be shortened to:
ndràngalíil_ n_á_n_t_sh_
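
A rough Python sketch of the placeholder approach, for illustration only. The function names are made up, and the exact output depends on whether the label stores accented letters as single precomposed code points or as a base letter plus combining marks. Note that a strict 7-bit ASCII target would also replace à, í and á.

```python
import re


def ascii_with_placeholder(text, placeholder="?"):
    """Replace every code point outside 7-bit ASCII with the placeholder."""
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


def collapse_placeholders(text, placeholder="_"):
    """As above, but shorten each run of placeholders to a single one."""
    replaced = ascii_with_placeholder(text, placeholder)
    return re.sub(re.escape(placeholder) + "+", placeholder, replaced)


if __name__ == "__main__":
    sample = "a\u02b0\u203fn\u0325t"  # a, modifier h, undertie, n, ring below, t
    print(ascii_with_placeholder(sample))
    print(collapse_placeholders(sample))
```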

I don’t think any of those options are ideal.

dudido · October 31, 2014, 12:32pm

I agree with you that those options are not ideal, but probably I should not have mentioned ASCII. But then again I wrote flatten, and my idea was more general than that. Maybe instead of going down to the small repertoire of letters offered by ASCII, only combining marks and non-letter marks should be eliminated, such as:

ndràngalíilě n̺áʰ‿n̥t̺ɛ̀shɛ́ɹɛ̀

becoming

07-ndrangaliile nantɛshɛɹɛ.wav

I believe that would also be a workable solution! I’m not sure what the shared common standard of supported code points across platforms is (probably there is none), but I can see that many applications fail to handle files containing combining marks, non-letter marks, and sometimes other code points for reasons that are not entirely clear. In any case, a solution is needed to prevent issues with automation.
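
One possible way to do this kind of flattening, sketched in Python with the standard `unicodedata` module. This is an assumption about how it could work, not how Audacity does anything: the text is decomposed (NFD), and only ordinary letters, digits and a few separators are kept. The names `strip_marks`, `KEEP_CATEGORIES` and `KEEP_CHARS` are hypothetical, and the choice of categories is tuned to reproduce the example above.

```python
import unicodedata

# Unicode general categories to keep: cased/uncased letters and decimal digits.
# Combining marks (M*), modifier letters (Lm) and punctuation are dropped.
KEEP_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nd"}
KEEP_CHARS = {" ", "-", "_"}


def strip_marks(label):
    """Remove combining marks and non-letter marks, keeping base letters."""
    decomposed = unicodedata.normalize("NFD", label)
    kept = "".join(
        ch for ch in decomposed
        if ch in KEEP_CHARS or unicodedata.category(ch) in KEEP_CATEGORIES
    )
    # Recompose anything that NFD split apart but that was kept whole.
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    label = ("ndra\u0300ngali\u0301ile\u030c "
             "n\u033aa\u0301\u02b0\u203fn\u0325t\u033a"
             "\u025b\u0300sh\u025b\u0301\u0279\u025b\u0300")
    print(strip_marks(label))
```

Letters such as ɛ and ɹ survive this step, so the result is still not ASCII, and two labels that differ only in their diacritics flatten to the same name.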

steve · October 31, 2014, 2:31pm

The common standard is Unicode, which is what Audacity uses. Unfortunately some other applications have not yet adopted this standard, and as far as I can see, that is what is causing the incompatibilities. The real solution is that the applications that have the problems should update to the “computing industry standard”, which is Unicode (Audacity did this a few years ago, but prior to that it had all sorts of problems when users entered characters that had no direct ASCII substitute).

dudido · October 31, 2014, 4:45pm

Yes, but Unicode uptake was and continues to be very slow, so just saying other software should be fixed is way too conservative and not a solution. Even operating systems still sometimes have problems with Unicode. I’m not suggesting that Audacity should drop Unicode support, but that it provide a workaround to save users time. Otherwise users would have to do mass renaming (possibly using other tools), which should not be the solution.

steve · October 31, 2014, 5:43pm

Or users could restrict themselves to using only characters that are accepted by their ‘legacy’ software.

You’re not kidding - Unicode has been around for well over 20 years, which is a very long time in the computer world.
Even Microsoft (who are not renowned for adopting standards quickly) have supported Unicode since Windows XP.

The problem, as I see it (but I’m not a C++ programmer), is that there is no clear way to perform the “translations” that you are asking for.
Taking as an example:

The character “ɛ” (in your “flattened” string) may either be:
Unicode character code U+025b (Epsilon, Latin small letter) from the IPA Extension set,
or
Unicode character code U+0190 (Latin Capital Letter Open E) from the Latin Extended-B set.

The character “ɛ” is in neither the 128-code base set of ASCII nor the 256-code “Extended ASCII” tables (http://www.asciitable.com/), so there is a very good chance that an application that does not support Unicode will choke on that character. Single-byte characters can only be in the range 0 to 255.
Unicode includes thousands of characters that have no direct representation in either the basic or extended ASCII tables, and as far as I’m aware there is no standard way to map arbitrary Unicode characters to valid ASCII characters.
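
For anyone who wants to check which code point a given character actually is, here is a small Python sketch using the standard `unicodedata` module. The helper names `describe` and `fits_single_byte` are made up for this example, and Latin-1 is used only as a stand-in for an 8-bit “extended ASCII” table.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def fits_single_byte(ch, codec="latin-1"):
    """Return True if the character encodes to exactly one byte in codec."""
    try:
        return len(ch.encode(codec)) == 1
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for ch in ("\u025b", "\u0190", "e"):
        print(describe(ch), fits_single_byte(ch))
```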

The usual programming way to convert Unicode strings into “safe” ASCII strings is to replace Unicode characters that do not have an ASCII equivalent with some sort of “dummy” character, such as a “?” or “_” or some other place-holder symbol. So as I described previously,
ndràngalíilě n̺áʰ‿n̥t̺ɛ̀shɛ́ɹɛ̀
becomes
ndràngalíil? n?á??n?t???sh???