[rt-devel] Re: RT 2.1.56 (wrong charset)

Sat Jan 4 19:03:52 EST 2003

--- Jesse Vincent <jesse at bestpractical.com> wrote:
> > > First is that the charset information should be 
> > > sent in HTTP header, 
> 
> And, actually, it is:
> 
> Content-Type: text/html; charset=utf-8

Aha, the situation is more complicated: the strings that 
Stefan Fischer sent are neither Unicode nor ISO Latin-1.

Unfortunately, I have no server at hand to check this quickly, 
but I suspect the strings went through these steps:

1) 
lib/RT/I18N/de.po is encoded in Latin-1.

2)
Then it goes through lib/RT/I18N.pm and is presented as would-be Unicode. 
I'm not sure whether it really produces Unicode at this stage. 

3)
Then it goes through HTML::Entities (as requested by default_escape_flags => 'h'), 
and all non-ASCII characters are replaced with entities: 
&Auml; for A-umlaut, etc. 
At this stage, the behaviour of HTML::Entities depends on the Perl version 
(Stefan, what's yours?). 

If it's 5.6, it treats each non-ASCII byte (remember, in UTF-8 these 
characters come as two bytes?) as a separate non-ASCII character, and 
produces two HTML entities for each Unicode character. 

In 5.8, each non-ASCII Unicode character (two bytes) is 
replaced with a single HTML entity. In HTML::Entities, named entities 
are defined for Latin-1 characters only. This means Cyrillic (Russian) 
characters would be replaced with (one or two?) numeric entities. 
Some browsers will survive that (if it's still one entity per character), 
but it's definitely the wrong way. 
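
To make the difference between the two suspected behaviours concrete, here is 
a rough sketch in Python (not RT's Perl code, and not the actual 
HTML::Entities implementation). The helper entity_encode is hypothetical, 
and limiting named entities to code points below 256 is my assumption, made 
only to mirror the "Latin-1 only" guess above. 

```python
from html.entities import codepoint2name

def entity_encode(chars):
    """Replace every non-ASCII character with a named or numeric entity."""
    out = []
    for ch in chars:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                           # plain ASCII passes through
        elif cp < 256 and cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])  # named Latin-1 entity
        else:
            out.append("&#%d;" % cp)                 # numeric fallback
    return "".join(out)

text = "\u00c4"                     # A-umlaut as one Unicode character
utf8_bytes = text.encode("utf-8")   # two bytes: 0xC3 0x84

# Byte-wise view (the suspected 5.6 case): every byte is taken as a character.
bytewise = entity_encode(utf8_bytes.decode("latin-1"))

# Character-wise view (the suspected 5.8 case): one entity per character.
charwise = entity_encode(text)

# A Cyrillic letter has no Latin-1 name, so it falls back to a numeric entity.
cyrillic = entity_encode("\u0416")

print(bytewise, charwise, cyrillic)
```

In the byte-wise case the single A-umlaut comes out as two unrelated 
entities, which would explain strings that are neither Unicode nor Latin-1. 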

The right way would be to avoid entity-encoding non-ASCII characters 
altogether and send the plain text, with the correct charset in the HTTP header. 
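
As an illustration of that approach (again a Python sketch under my own 
assumptions, not a patch for RT): escape only the characters that are 
significant in markup, leave everything else alone, and encode the result in 
the charset that the header announces. The function name render and the 
sample string are made up for the example. 

```python
from html import escape

def render(body_text):
    # Escape only &, <, > and quotes; non-ASCII characters stay as they are.
    safe = escape(body_text, quote=True)
    # Encode the body in the same charset that the header declares.
    payload = safe.encode("utf-8")
    headers = {"Content-Type": "text/html; charset=utf-8"}
    return headers, payload

headers, payload = render("\u00c4nderung <offen> \u0416")
print(headers["Content-Type"])
print(payload.decode("utf-8"))
```

The markup characters are still escaped, so the page stays safe, while the 
umlaut and the Cyrillic letter travel as plain UTF-8 bytes. 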

With regards, 

Stan