
Re: [JDEV] Character Encodings and Languages thread



> Sorry, but this is nonsense. We are *only* talking about CDATA here,
> that is 8 bit bytes. The XML parser simply pulls these out without
> interpretation (apart from escaped characters) and gives them to you.
> (Well, at least that's what my XML parser does!). I am free to
> interpret those bytes in any way I choose. The encoding is only relevant
> to the rendering software, it has nothing to do with the parser at
> all. Implementing it is essentially trivial. 

Take a look at http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing

It sure doesn't sound or look trivial. I couldn't expect any XML
parser to attempt that at the start of every <![CDATA[ block, and a
parser might have trouble finding the end of the CDATA block depending on
how the bits are laid out (a non-8-bit encoding might accidentally
be bit-identical to the 8-bit ]]>, which would be disastrous).
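
To make that collision concrete, here is a minimal sketch in Python. The two
code points are arbitrary ones picked for illustration (they are not from any
real message), and contains_cdata_end is a hypothetical helper, not part of
any XML parser. It shows that the UTF-16 bytes of ordinary CJK text can
contain the byte sequence a byte-oriented scanner would read as "]]>", while
the UTF-8 bytes of the same text cannot.

```python
# "]]>" as an 8-bit parser would look for it: bytes 0x5D 0x5D 0x3E.
CDATA_END = b"]]>"

# Two CJK code points chosen only because of their byte layout (assumption:
# any text containing U+5D5D followed by U+3Exx behaves the same way).
text = "\u5d5d\u3e00"

utf16_bytes = text.encode("utf-16-be")  # 5D 5D 3E 00
utf8_bytes = text.encode("utf-8")       # every byte is >= 0x80


def contains_cdata_end(data: bytes) -> bool:
    """Return True if a naive byte scan would see the CDATA terminator."""
    return CDATA_END in data


print(contains_cdata_end(utf16_bytes))
print(contains_cdata_end(utf8_bytes))
```

The UTF-8 case is safe because UTF-8 never uses bytes below 0x80 inside a
multi-byte character, so ASCII markup bytes can only come from real ASCII
characters.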

As others have mentioned, mixing character encodings is a path that hasn't
fared well in other places and likely a bad idea here.

> This doesn't help at all. What I want to be able to do is to
> communicate with my friends in Korea in Korean, my friends in Japan in
> Japanese and use English here. I want to do this over a single message
> stream. Not everyone will have UTF-8 support on their machines.

Well, if they don't have UTF-8 they don't have XML, since every XML
parser is required to understand it.
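
As a rough illustration, here is a sketch using Python's standard
xml.etree.ElementTree. The <message> element name and the sample text are
made up for the example and are not the actual protocol. A single
UTF-8 document carries English, Korean and Japanese inside one CDATA block,
and the parser hands it back as characters without any per-block guessing.

```python
import xml.etree.ElementTree as ET

# Hypothetical message element; the content mixes three languages.
original = "Hello, 안녕하세요, こんにちは"
xml_bytes = ("<message><![CDATA[" + original + "]]></message>").encode("utf-8")

# With no encoding declaration, the document is treated as UTF-8.
root = ET.fromstring(xml_bytes)

print(root.tag)
print(root.text == original)
```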

What I'm really wondering here is, this is just character encodings,
correct?  I mean, all the same characters are still there in ANY
encoding (unless it's a severely restricted one), they are just encoded
differently.  If it's all UTF-8 you can still do all of the Korean,
Japanese, English, etc. characters just fine, they are just encoded in
UTF-8.  It's just a language and font display issue at that point.  So
what is the problem with using the XML-forced UTF-8 encoding?
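
A small Python sketch of what I mean, with sample strings invented for the
example and EUC-KR and Shift_JIS assumed as the legacy encodings someone
might otherwise use. The bytes differ from one encoding to the next, but the
characters that come back out are the same, and UTF-8 alone can hold all
three languages in one stream.

```python
korean = "안녕하세요"
japanese = "こんにちは"
english = "Hello"

# Same characters, different byte sequences per encoding.
korean_legacy = korean.encode("euc_kr")
korean_utf8 = korean.encode("utf-8")
japanese_legacy = japanese.encode("shift_jis")
japanese_utf8 = japanese.encode("utf-8")

print(korean_legacy != korean_utf8)
print(korean_legacy.decode("euc_kr") == korean_utf8.decode("utf-8"))
print(japanese_legacy.decode("shift_jis") == japanese_utf8.decode("utf-8"))

# One UTF-8 stream carrying all three languages, round-tripped losslessly.
mixed = " / ".join([english, korean, japanese])
mixed_utf8 = mixed.encode("utf-8")
print(mixed_utf8.decode("utf-8") == mixed)
```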

Jer