Character-Encodings

amix

Post by amix »

Hi,

I am editing HTML right now. I came across several difficulties/problems.

a) It seems, that the character-encoding of the HTML file is only to be specified at the save-dialog (UTF-8, Unicode, ANSI).
a.1) Other encodings (like European) are missing.
a.2) It would be nice if one could set a default-encoding per filetype. I use UTF-8 for all my HTML, though it often happens, that I forget to switch to UTF-8 at saving time.

b) Saving in UTF-8 and then using HTMLTidy with the UTF-8 input-switch (input is UTF-8) causes this to happen:

line 37 column 388 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 37 column 398 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 41 column 4283 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 41 column 4290 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 41 column 5582 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 41 column 5595 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 41 column 5714 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 41 column 5727 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 43 column 87 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 43 column 99 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 52 column 7472 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 52 column 7478 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 52 column 7544 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 52 column 7559 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 54 column 1923 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 54 column 1948 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 54 column 2465 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 54 column 2512 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 54 column 3792 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 54 column 3797 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 54 column 4827 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 54 column 4914 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 55 column 309 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00BB)
line 55 column 626 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+00AB)
line 59 column 4751 - Warning: Warning: replacing invalid UTF-8 bytes (char. code U+0097)
Info: Doctype given is "-//W3C//DTD HTML 4.01 Transitional//EN"
Info: Document content looks like HTML 4.01
50 warnings, 0 errors were found!

Character codes for UTF-8 must be in the range: U+0000 to U+10FFFF.
The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also
allows for the use of five- and six-byte sequences to encode
characters that are outside the range of the Unicode character set;
those five- and six-byte sequences are illegal for the use of
UTF-8 as a transformation of Unicode characters. ISO/IEC 10646
does not allow mapping of unpaired surrogates, nor U+FFFE and U+FFFF
(but it does allow other noncharacters). For more information please refer to
http://www.unicode.org/unicode and http://www.cl.cam.ac.uk/~mgk25/unicode.html

When loading the result into Zeus German Umlauts (which have not been escaped by html-entities) get shown using strange characters.
Also, in Mozilla the final result shows special characters only as the "?" substitute, even when I manually select UTF-8 as the browser's character encoding.

Am I doing something wrong ?
Within the data flow I took great care to let all be UTF-8 (Zeus save, HTML char-encoding definition, Browser encoding)
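
The code points in the Tidy warnings above (U+00BB, U+00AB, U+0097) all fit in a single byte. One possible reading, offered here only as a guess, is that the file on disk contains those characters as single bytes from the Windows "ANSI" code page rather than as UTF-8 sequences. The sketch below compares the two byte forms. The choice of cp1252 as the ANSI code page and the helper name byte_forms are assumptions made for illustration; nothing here is taken from Zeus or Tidy.

```python
# Illustrative sketch: compare how a character is stored in UTF-8
# versus a single-byte "ANSI" code page (cp1252 is an assumption).

def byte_forms(ch: str) -> dict:
    """Return the hex bytes of one character in UTF-8 and in cp1252."""
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "cp1252": ch.encode("cp1252").hex(),
    }

# Right guillemet, left guillemet, em dash.
for ch in ["\u00bb", "\u00ab", "\u2014"]:
    forms = byte_forms(ch)
    print(f"U+{ord(ch):04X}  utf-8={forms['utf-8']}  cp1252={forms['cp1252']}")

# U+00BB  utf-8=c2bb  cp1252=bb
# U+00AB  utf-8=c2ab  cp1252=ab
# U+2014  utf-8=e28094  cp1252=97
```

A lone byte 0xBB, 0xAB or 0x97 is not a valid UTF-8 sequence on its own, which would be consistent with the warnings if the file was in fact written as ANSI. Looking at the file in a hex viewer would confirm or rule this out.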
Guest

Post by Guest »

I am editing HTML right now. I came across several difficulties/problems.
I will try my best to help but be warned I am no Unicode expert :(
It seems, that the character-encoding of the HTML file is only to be specified at the save-dialog (UTF-8, Unicode, ANSI).
a.1) Other encodings (like European) are missing.
I am not 100% sure what this means since I really do not know what European UTF encoding is :(

But what is probably going on is your European UTF file is in fact a UTF-8 file that contains double byte characters to express the European characters. This is a problem for Zeus since it is NOT a true Unicode editor :(

The Zeus editor engine dates back to the time before Unicode, which basically means it is an 8-bit ASCII editor with only limited Unicode and UTF-8 support. Zeus is very much a single-byte character set, code page based editor.

If you need true 100% Unicode/UTF-8, UTF-16 support then Zeus is not the editor to be using :(
a.2) It would be nice if one could set a default-encoding per filetype. I use UTF-8 for all my HTML, though it often happens, that I forget to switch to UTF-8 at saving time.

I will look to add this feature.
When loading the result into Zeus German Umlauts (which have not been escaped by html-entities) get shown using strange characters.

These might be the double byte characters I mentioned earlier, but they may also be a language character set issue. The Zeus 3.94 version only uses the default Windows character set, and it might be this that is causing the characters to display incorrectly :( But for the next release the character set will be configurable :)
Am I doing something wrong ?

Within the data flow I took great care to let all be UTF-8 (Zeus save, HTML char-encoding definition, Browser encoding)

Was the UTF-8 file created 100% in Zeus? If it was, then I am not sure why HTMLTidy is rejecting it.

Even if the file was not created in Zeus, Zeus should still read and write UTF-8 files with double-byte characters. I was under the impression that in cases like these the double-byte characters will not get displayed, but they should be correctly maintained by the read and write :?

Could you send a very short example file as an attachment to jussij@zeusedit.com, with the subject set to Zeus UTF8, and I will try to see what is going wrong.

Cheers Jussi
Guest

Post by Guest »

I am not 100% sure what this means since I really do not know what European UTF encoding is :(
Sorry, I was not precise. I mean ISO-8859-1 and ISO-8859-15 encodings.
Not European UTF-8. Such a thing does not exist, afaik.
When loading the result into Zeus German Umlauts (which have not been escaped by html-entities) get shown using strange characters.

These might be the double byte characters I mentioned earlier, but they may also be a language character set issue. The Zeus 3.94 version only uses the default Windows character set, and it might be this that is causing the characters to display incorrectly :( But for the next release the character set will be configurable :)
Sorry, again a mistake on my side. I was too quick in writing to the bulletin board and too eager to make everything UTF-8.

BTW: Why not make load and save go through char-encoding processors (translation tables)? These could be external, configurable modules. That way Zeus would stay flexible, and you could choose the encoding used within the editor (in this case leave it as it is, since you say Zeus is from the pre-Unicode era).
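
To make the translation-table idea concrete, here is a minimal sketch of a transcoding step that could sit between the file on disk and the editor buffer. It is written in Python purely for illustration; the function name transcode is hypothetical and this says nothing about how Zeus is or would be implemented.

```python
import codecs

def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert raw bytes from one named encoding to another.

    codecs.lookup raises LookupError for an unknown encoding name,
    so a misconfigured "module" fails early instead of corrupting text.
    """
    codecs.lookup(source)
    codecs.lookup(target)
    return data.decode(source).encode(target)

# On load: ISO-8859-1 file bytes become UTF-8 for the editor.
on_disk = b"Gr\xfc\xdfe"                      # "Gruesse" with u-umlaut and sharp s
in_editor = transcode(on_disk, "iso-8859-1", "utf-8")

# On save: the reverse direction restores the original bytes.
back_on_disk = transcode(in_editor, "utf-8", "iso-8859-1")
print(in_editor, back_on_disk == on_disk)
```

The encoding pair would be the configurable part, for example one pair per file type. Characters that do not exist in the target encoding make encode raise UnicodeEncodeError, so a real implementation would need a policy for that case.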

I took the text I am editing from the web and am now adapting it to my needs. It already had all the html-entities (umlauts) right, but I had configured output-utf8: yes in my tidy.cfg, which, of course, replaced all entities with their UTF-8 counterparts, and this is what showed up as strange characters in Zeus. I removed that config option and now all umlauts stay as entities.
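
The difference between the two forms can be shown in a few lines of Python. This is only an illustration of entities versus raw UTF-8 bytes using the standard library, not a reproduction of what Tidy does internally.

```python
import html

entity_text = "Gr&uuml;&szlig;e"        # pure ASCII, safe in any 8-bit editor

# Resolving the entities yields real characters ...
text = html.unescape(entity_text)

# ... which take two bytes each once written out as UTF-8.
utf8_bytes = text.encode("utf-8")

# Going the other way: anything outside ASCII becomes a numeric
# character reference, so the file stays 7-bit clean.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(utf8_bytes)    # b'Gr\xc3\xbc\xc3\x9fe'
print(ascii_bytes)   # b'Gr&#252;&#223;e'
```

An editor that treats every byte as one character would show the two UTF-8 bytes of each umlaut as two unrelated symbols, which matches the strange characters described above.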

However, the main problem still exists (Tidy's warnings about illegal chars).
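
One way to narrow this down independently of Tidy would be to scan the saved file for bytes that do not decode as UTF-8. The following is a rough sketch; the function name find_invalid_utf8 is made up for this example, and the columns it reports are byte offsets within the line, which need not match Tidy's column numbers.

```python
def find_invalid_utf8(data: bytes) -> list:
    """Return (line, byte_column, byte_value) for each undecodable byte."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                                   # the rest is valid
        except UnicodeDecodeError as exc:
            bad = pos + exc.start                   # absolute offset of bad byte
            line = data.count(b"\n", 0, bad) + 1    # 1-based line number
            line_start = data.rfind(b"\n", 0, bad) + 1
            problems.append((line, bad - line_start + 1, data[bad]))
            pos = bad + 1                           # resume after the bad byte
    return problems

# Small in-memory sample; for a real file, read it with open(path, "rb").
sample = b"ok\nab\xbbcd\n\x97"
for line, col, value in find_invalid_utf8(sample):
    print(f"line {line} column {col}: byte 0x{value:02X}")
```

If this reports single bytes such as 0xBB or 0xAB, the file was most likely not written as UTF-8, whatever the save dialog was set to.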