[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[JDEV] Character Encodings and Languages thread



[note: I'm not sure I fully understand character encodings yet, so pardon
any major blunders I'm making and correct me :]

I've read up on this thread regarding character encodings, and I think
we're going to have to take a different approach... 

The problem is that we are dealing with an XML document here, the
conversation between the client and server is just a normal streaming XML
document.  I don't think it's possible to change character encodings on
the fly within the document, it would be like changing byte-order randomly
in any protocol stream.  This would cause problems implementing clients
and most importantly, there aren't any XML parsers that would support this
type of thing.  So adding an encoding="" to each message or tag wouldn't
be feasible.

But, we can still support the required international charset functionality
I believe.  When each client connects to the server, it identifies in the
opening <?xml ?> tag it's character encoding.  By default it's UTF-8, but
the server will and should support a range of other common encodings for
clients to specify.  The *entire stream* is then encoded in what was
specified.  The server is normally going to be sending back the default
UTF-8 encoding.  I'm not sure what it would take to support this, but
there might be value in adding a server option to change the default
outgoing encoding, so that servers that are primarily international can
use the most common encoding for data sent back to those clients. 

What this means is that the server is going to have to translate
internally between different encodings.  This is where things start to get
a little fuzzy for me... Is there a library out there for doing this sort
of thing, is it not common in other software?

The lowest common denominator is going to win here... if a US client on a
US server is using UTF-8 in it's connection, and they receive a message
from an international server in a different encoding, the US server will
have to translate it to UTF-8 for that client.  But an international
server talking to an international server with both of their clients
connected via the same alternate encoding would be able to use that
encoding everywhere.  Hopefully this arrangement can satisfy most/all of
the requirements, since I can't see any other way of handling it besides
just mandating once particular encoding(which seems like a bad thing). 


Now, on to the language issue... thankfully, it's already answered for us
and part of XML:
	http://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag

More comments/suggestions

Jer