[rt-devel] Re: RT 2.1.56 (wrong charset)

Sat Jan 4 19:52:08 EST 2003

In addition, in Stefan's message, each Latin1 letter
was replaced with 4 characters, not 2. This means that
HTML::Entities had already received a 4-byte sequence for
each symbol. That points to either a double Latin1->UTF-8
conversion, or UTF-16 unexpectedly appearing somewhere.
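
To illustrate the first possibility, here is a minimal sketch in Python
(not the Perl code RT actually runs; the function name double_convert is
made up for this example). It encodes a Latin1 letter to UTF-8, then
mistakes the resulting bytes for Latin1 and encodes them again, which
gives 4 bytes per letter:

```python
def double_convert(text):
    """Encode to UTF-8, misread the bytes as Latin1, encode to UTF-8 again."""
    once = text.encode("utf-8")
    # The mistake: UTF-8 bytes are treated as if they were Latin1 characters.
    misread = once.decode("latin-1")
    return misread.encode("utf-8")


a_umlaut = "\u00e4"                 # one Latin1 letter (a-umlaut)
once = a_umlaut.encode("utf-8")     # correct conversion: 2 bytes
twice = double_convert(a_umlaut)    # double conversion: 4 bytes

print(len(once), once)
print(len(twice), twice)
```

Escaping each of those 4 bytes separately would give 4 entities per
letter, which would match what Stefan's message shows.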

--- Stanislav Sinyagin <ssinyagin at yahoo.com> wrote:
> 1) 
> lib/RT/I18N/de.po is encoded in Latin1.
> 
> 2)
> Then it goes through lib/RT/I18N.pm and is presented as wanna-be Unicode. 
> I'm not sure at this stage if it really produces unicode. 
> 
> 3)
> Then it goes through HTML::Entities (as told by default_escape_flags => 'h'), 
> and all non-ascii characters are replaced with entities: 
> &auml; for a-umlaut etc. 
> At this stage, HTML::Entities depends on Perl version (Stefan, what's yours?). 
> 
> If it's 5.6, it treats each non-ascii byte (remember, Unicode 
> symbols come as two-byte sequences?) as a non-ascii character, and 
> produces two HTML entities for each Unicode symbol. 
> 
> In 5.8, each non-ascii Unicode symbol (two bytes) is 
> replaced with one HTML entity. In HTML::Entities, they are defined 
> for Latin1 symbols only. This means Cyrillic (Russian) symbols would 
> be replaced with (one or two?) numeric entities. 
> Some browsers will survive that (if it's still one entity), 
> but it's definitely the wrong way.