Re: Error: reference to invalid character number at line 6

Posted: Thu Aug 17, 2017 3:58 am
by Yarn366
In case any devs see this, I posted a message about this and my proposed fix to audacity-devel about a month and a half ago here: https://sourceforge.net/p/audacity/mail ... /35931978/. I'm still waiting for a response.

Re: Error: reference to invalid character number at line 6

Posted: Thu Aug 17, 2017 8:46 am
by steve
Yarn366 wrote: In case any devs see this, I posted a message about this and my proposed fix to audacity-devel about a month and a half ago here: https://sourceforge.net/p/audacity/mail ... /35931978/. I'm still waiting for a response.
There are some responses to your pull request: https://github.com/audacity/audacity/pull/197

Re: Error: reference to invalid character number at line 6

Posted: Thu Aug 17, 2017 4:55 pm
by Yarn366
steve wrote: There are some responses to your pull request: https://github.com/audacity/audacity/pull/197
Only one of those responses (not counting the "milestone") isn't mine, and I already explained in the post after that why the suggestion in that post would be an unnecessary complication. And it was after the fake milestone was applied to my pull request that I posted to audacity-devel, and apparently nobody noticed that post.

Re: Error: reference to invalid character number at line 6

Posted: Mon Aug 21, 2017 12:03 pm
by steve
Yarn366 wrote: Only one of those responses (not counting the "milestone") isn't mine, and I already explained in the post after that why the suggestion in that post would be an unnecessary complication. And it was after the fake milestone was applied to my pull request that I posted to audacity-devel, and apparently nobody noticed that post.
There seems to be confusion from several angles.

One is that although this bug has been known about for a long time, no-one can find it logged on our bug tracker, and our main bugzilla guy is no longer with us (sadly he died a few weeks ago). So the first thing that we need to do is to get this properly logged on bugzilla. Are you able to provide a small Audacity project to demonstrate the problem? (I'm on Linux so I've never seen this bug first hand).

Secondly there appears to be some confusion about whether your current pull request is intended to be a full fix for the problem, or whether there remains a "much deeper problem" as your Git comment of June 9th suggests. Could you clarify that?

Re: Error: reference to invalid character number at line 6

Posted: Mon Aug 21, 2017 5:18 pm
by waxcylinder
steve wrote:
Yarn366 wrote: Only one of those responses (not counting the "milestone") isn't mine, and I already explained in the post after that why the suggestion in that post would be an unnecessary complication. And it was after the fake milestone was applied to my pull request that I posted to audacity-devel, and apparently nobody noticed that post.
There seems to be confusion from several angles.

One is that although this bug has been known about for a long time, no-one can find it logged on our bug tracker
I don't believe it ever got logged as a bug ...

We do, however, have a long-standing (very long-standing) FAQ about it in the Audacity Manual:
http://manual.audacityteam.org/man/faq_errors.html#nwf

Peter.

Re: Error: reference to invalid character number at line 6

Posted: Sat Sep 23, 2017 7:51 pm
by Yarn366
Sorry for taking so long to reply.
steve wrote: One is that although this bug has been known about for a long time, no-one can find it logged on our bug tracker, and our main bugzilla guy is no longer with us (sadly he died a few weeks ago). So the first thing that we need to do is to get this properly logged on bugzilla. Are you able to provide a small Audacity project to demonstrate the problem? (I'm on Linux so I've never seen this bug first hand).
Test project is attached. The project file is valid and loads fine on any platform. Its title contains "&#x1f3a7;", which represents a supplementary character (the headphone emoji, U+1F3A7). (I could have encoded the character directly and the file still would have been fine, but I chose to use the escape sequence.) However, if you resave it with the Windows version of Audacity, that character becomes "&#xd83c;&#xdfa7;", which is invalid in XML and won't load in any version of Audacity. (To clarify, the problem is with saving supplementary characters as "&#xd###;&#xd###;", not with failing to load files that contain characters encoded in that manner.)
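The failure described above can be reproduced outside Audacity. The following is a minimal sketch, not Audacity code: it splits the headphone emoji (U+1F3A7) into its UTF-16 surrogate pair, writes each half as its own numeric character reference, and feeds the result to Python's expat-based parser. The `<project title=...>` element and the helper names `to_surrogates` and `parses` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def parses(xml_text):
    """Return True if the XML parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


high, low = to_surrogates(0x1F3A7)  # headphone emoji

# One reference for the whole code point: valid XML.
good = '<project title="&#x1f3a7;"/>'
# One reference per UTF-16 code unit: each half is a surrogate, which XML forbids.
bad = f'<project title="&#x{high:04x};&#x{low:04x};"/>'

print(bad)
print("good parses:", parses(good))
print("bad parses:", parses(bad))
```

The second document is rejected because XML does not allow character references in the surrogate range U+D800..U+DFFF, whether or not they appear in pairs.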
steve wrote: Secondly there appears to be some confusion about whether your current pull request is intended to be a full fix for the problem, or whether there remains a "much deeper problem" as your Git comment of June 9th suggests. Could you clarify that?
The fix that I provided should be enough to fix the problem.

The first change that I described in my "much deeper problem" post would make it unnecessary to check the size of wxUChar, but it would likely require major changes to Audacity; thankfully, it's not really necessary for fixing this problem. (And wxString appears to use a 2-byte character type on Windows anyway, so that change probably wouldn't do much good unless the string type is also changed.)

Re: Error: reference to invalid character number at line 6

Posted: Wed Sep 27, 2017 4:46 pm
by steve
Yarn366 wrote: Sorry for taking so long to reply.
No problem, I've only recently returned from my vacation ;)

Thanks for the test project. I had some time today to test it on Windows 10, and test your proposed fix.
I've certainly got enough information about the problem now to log it as a bug, and your work gives a good lead-in to understanding the problem.

I don't personally have in-depth knowledge about wxWidgets XML / Unicode handling, so I don't know that your fix is the "right" way to fix it.

I can see how your fix prevents the problem from occurring, but I have a niggling feeling that there should be a better way to fix this. In particular, I don't understand why or how surrogate pairs are being created. If I'm reading your code correctly, your fix handles the surrogate pairs when they occur, but why / where / how do they occur in the first place? I thought that UTF-16 encoding should only happen when conversion to UTF-16 is explicitly called. :?

Re: Error: reference to invalid character number at line 6

Posted: Wed Sep 27, 2017 5:20 pm
by steve

Re: Error: reference to invalid character number at line 6

Posted: Fri Sep 29, 2017 3:44 am
by Yarn366
I found an article in the official wxWidgets 3.0.2 documentation describing how wxString works:

http://docs.wxwidgets.org/3.0.2/overvie ... g_internal

The section that concerns this issue is "Internal wxString Encoding," which I suggest reading thoroughly. Here are the important bits that I gathered from there:
  • wxString can store strings in UTF-8, UTF-16, or UTF-32 encoding, depending on the platform and compile-time flags.
  • By default, wxString uses UTF-16 under Windows, and either UTF-8 or UTF-32 under Linux and macOS (the article is a bit conflicting here, although it appears to suggest UTF-8 more strongly).
  • When wxString uses UTF-8 encoding, it indexes code points rather than code units (bytes in the case of UTF-8). It also handles encoding and decoding of multi-byte sequences automatically. This means that programs don't have to do anything special in this case; they can just treat each unit as being one character.
  • This is the most important part: When wxString uses UTF-16 encoding, it indexes code units, not code points, and it does absolutely nothing to handle surrogate pairs. This means that programs need to implement this handling themselves, at least when interfacing directly with wxString.
All of this means that Audacity's XML-escape function still needs to handle surrogate pairs (unless, of course, I'm missing something important).
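To make the last point concrete, here is a rough sketch of the idea in Python. It is not Audacity's XML-escape function or the code from the pull request. It models a 2-byte-per-unit string as a list of UTF-16 code units and compares an escaper that treats every unit as a character with one that recombines surrogate pairs first. All function names are hypothetical, and escaping of `&`, `<`, `>` and quotes is left out to keep the example short.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text, as a 2-byte-per-unit string would hold them."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def escape_naive(units):
    """Escape each non-ASCII code unit on its own (this reproduces the bug)."""
    return "".join(chr(u) if u < 0x80 else f"&#x{u:04x};" for u in units)


def escape_pair_aware(units):
    """Escape non-ASCII text, combining each surrogate pair into one code point first."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            code_point = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(f"&#x{code_point:x};")
            i += 2
            continue
        # Note: an unpaired surrogate would still be emitted as an invalid
        # reference here; a real fix has to decide how to handle that case.
        out.append(chr(u) if u < 0x80 else f"&#x{u:04x};")
        i += 1
    return "".join(out)


units = utf16_units("a\U0001F3A7")  # "a" followed by the headphone emoji
print([hex(u) for u in units])      # three code units for two code points
print(escape_naive(units))
print(escape_pair_aware(units))
```

The naive version emits two references in the surrogate range, while the pair-aware version emits a single reference to U+1F3A7.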

Re: Error: reference to invalid character number at line 6

Posted: Fri Sep 29, 2017 9:05 am
by steve
Yarn366 wrote: I found an article in the official wxWidgets 3.0.2 documentation describing how wxString works:
http://docs.wxwidgets.org/3.0.2/overvie ... g_internal
Excellent. That clearly answers my question about why the UTF-16 encoding is happening.

The bit that grabbed me was (emphasis mine):
Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user code has to take care of surrogate pairs himself.
Which, if I understand correctly, is what your patch does.

It would appear that an alternative solution would be to build wxWidgets on Windows with wxUSE_UNICODE_UTF8=1 so that UTF-8 encoding is used on all platforms.
Given Audacity's dependence on XML, and that (presumably) we want to allow all and any printable characters in all and any language, perhaps this would be a better solution (?)
The possible downside that I notice in that documentation is a performance hit when Iterating wxString Characters. Is that likely to be a significant issue for Audacity?

Have you tried building wxWidgets with wxUSE_UNICODE_UTF8=1 ?
I'm not likely to have time to try that 'till next week, but I think it would be worth testing - if nothing else, it could confirm that the problem is what we think it is.
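As background to the performance question, the sketch below illustrates in general terms why indexing a UTF-8 buffer by code point costs more than indexing fixed-width units: finding the n-th character means scanning from the start and counting lead bytes. This is only an illustration of the encoding, not of how wxString is implemented (a real library may cache positions to soften the cost), and `utf8_code_point_at` is a hypothetical helper.

```python
def utf8_code_point_at(data, index):
    """Return the index-th code point of UTF-8 bytes by scanning from the start."""
    if index < 0:
        raise IndexError(index)
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a new code point starts here
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is not None:
        return data[start:].decode("utf-8")  # requested code point was the last one
    raise IndexError(index)


data = "a\U0001F3A7b".encode("utf-8")  # 1 + 4 + 1 bytes, 3 code points
print(len(data))
print(utf8_code_point_at(data, 2))
```

Each lookup walks the bytes before the requested position, so a loop that indexes every character this way does quadratic work unless it uses an iterator or cached offsets.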