[rt-users] Bad characters in names loaded from LDAP (AD)

Mon Oct 10 23:41:00 EDT 2016

On 10 Oct 2016, at 16:26, Jan Burian wrote:

> Hi all,
>
> we have RT 4.4.0 on CentOS 7 and Perl v5.22.1. And we are starting to
> use RT in production.
>
> We configured RT to authenticate users via LDAP
> (RT::Authen::ExternalAuth::LDAP). Our LDAP server is MS AD (Win 2008 R2).
[...]
> Authentication is working fine. Users can log in, if the user doesn't
> exist in RT the account is autocreated. All the configured attributes
> are transferred.

This is a strong sign that the LDAP part is working correctly. If the 
LDAP server (AD) and client (Perl's Net::LDAP module) were using 
mismatched encodings, it would most likely show up as authentication 
failures, because different 8-bit encodings assign the same (logical) 
non-ASCII characters to different byte values in the 0x80-0xff range.
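
To make that concrete, here is a minimal sketch in Python (not Perl, and 
not anything RT or Net::LDAP actually runs). The sample name is 
hypothetical, and cp1250 is just one example of an 8-bit encoding. It 
only shows that the same logical string produces different high bytes 
depending on the encoding:

```python
# Hypothetical sample name with two non-ASCII letters.
name = "Jiří"

# The same logical string, serialized under two different encodings.
utf8_bytes = name.encode("utf-8")
cp1250_bytes = name.encode("cp1250")


def high_bytes(data):
    """Return the bytes that fall in the 0x80-0xff range, as hex strings."""
    return [format(b, "02x") for b in data if b >= 0x80]


print(high_bytes(utf8_bytes))    # two high bytes per accented letter
print(high_bytes(cp1250_bytes))  # one high byte per accented letter

# A byte-wise comparison fails even though the text is logically the same,
# which is why a real mismatch would tend to break logins outright.
print(utf8_bytes == cp1250_bytes)
```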

Fortunately, it is somewhere between arcane and impossible to make 
Net::LDAP use anything other than UTF-8. There's *probably* some way to 
make it do T.61 for ancient-history compatibility, but that's mostly 
pointless.

[...]
> We had similar problem with Moodle. When we configured Moodle against
> Active Directory and set cp1250 encoding, then it was doing exactly
> same thing. After we changed encoding for LDAP connector to utf-8 then
> the names was corrected.

Which makes sense: LDAPv3 by default uses UTF-8 and you have a modern 
system with a mature LDAP client. I know of no way to configure a CentOS 
7/Perl 5.22 system such that the LDAP interaction with an AD LDAP server 
talking UTF-8 would be the source of this sort of encoding conflict. I'm 
mildly surprised that anything talking LDAPv3 can be made to use cp1250 
encoding, but I suppose Microsoft makes their own rules to go along with 
their own unique code pages.
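
For what it's worth, the Moodle symptom is easy to reproduce outside of 
LDAP entirely. The following is a rough Python sketch, assuming a 
hypothetical sample name and assuming the connector simply decoded UTF-8 
bytes as cp1250 (I have not looked at what Moodle really does 
internally):

```python
name = "Jiří"  # hypothetical sample name

# What an LDAPv3 server would put on the wire for this string.
wire_bytes = name.encode("utf-8")

# A client told to expect cp1250 reads each byte as its own character.
misread = wire_bytes.decode("cp1250")

# A client told to expect UTF-8 gets the original name back.
correct = wire_bytes.decode("utf-8")

print(repr(misread))
print(repr(correct))
```

The misread string is longer than the original because each accented 
letter arrives as two bytes and each byte becomes a separate character.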

[...]
> Also I read that MS AD in LDAP protocol version 3 returns any string to
> LDAP client in utf-8 encoding.
> I really don't know where could be a problem.

The most likely place is in your database. I'm guessing that you are 
using MySQL, which defaults to latin1 encoding. When you store a UTF-8 
string in a latin1 table, each byte of a multi-byte character is treated 
as a separate latin1 character, so one character turns into 2 or 3, but 
the right bits are still there. This issue has come up a few times on 
this list over the past decade and I think Best Practical has documented 
how to safely convert an RT database with that sort of problem from 
latin1 to utf8. It is probably worth looking through their docs 
(possibly one of the UPGRADING* files?) and the RT Wiki for a solution. 
I expect it could be done with a binary dump of the database, altering 
any latin1 tables to use utf8, and a re-import of the binary dump. I'm 
not enough of a MySQL expert to detail that process (I generally use 
Postgres where possible).
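
The "right bits are still there" part can be illustrated without a 
database at all. This is only a Python simulation under stated 
assumptions, not a conversion procedure for RT: the helper names are 
made up, the sample strings are hypothetical, and Python's latin-1 codec 
stands in for MySQL's latin1 (which, as far as I know, is closer to 
Windows-1252) because it maps every byte value and so round-trips 
cleanly:

```python
def store_as_latin1(text):
    """Simulate UTF-8 bytes being interpreted as latin1 characters."""
    return text.encode("utf-8").decode("latin-1")


def recover(garbled):
    """Undo the misinterpretation: back to bytes, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


# One sample with 2-byte characters, one with a 3-byte character.
for sample in ("Jiří", "€"):
    garbled = store_as_latin1(sample)
    # Original length, garbled length, and whether recovery is lossless.
    print(len(sample), len(garbled), recover(garbled) == sample)
```

Each accented letter becomes 2 characters and the euro sign becomes 3, 
yet the original text comes back intact, which is why a careful binary 
dump and re-import can repair the data rather than lose it.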