[rt-users] utf8 and accents.

Mon Aug 11 09:54:30 EDT 2008

Curtis Bruneau wrote:
> Ruslan Zakirov wrote:
>   
>> On Sat, Aug 9, 2008 at 12:20 AM, Curtis Bruneau <curtisb at vianet.ca> wrote:
>>   
>>     
>>> I need some suggestions. I have come to the conclusion that none of the
>>> utf8 collations handle French properly, not the way latin1 does anyway.
>>> All accented variants of a letter are treated as the same character:
>>> although they are distinct in binary, they cannot coexist under a unique
>>> index, and sorting and queries using any variant treat them as equal too.
>>>
>>> So I'm in a bit of a bind. If I were to use RT with a case-sensitive
>>> collation like utf8_bin, would the application behave as expected? I know
>>> search would be much stricter and possibly confusing to the end user.
>>>     
>>>       
>> utf8_bin is a good choice. You're free to use a binary collation. Maybe
>> the utf8_general_ci collation would be better for you. Any collation is OK
>> as long as you know how to deal with it in MySQL.
>>
>>
>>   
>>     
> OK, just wondering; I'll give it a try. I was more curious whether any
> string-type clauses would still work internally, since binary collations
> are sensitive to everything, including case. I'm guessing that's all fine,
> because I think Postgres stores its stuff as binary_cs and relies on the OS
> to do collations (something like that; other Postgres DBs around here seem
> to be case sensitive).
>   
>>> My other option would be to continue to use latin1, is there any way to
>>> accomplish this using the latest code base? It's probably not
>>> configurable and I don't want to have to manage diffs for the possible
>>> changes, unless it is fairly minimal to do..
>>>     
>>>       
>> No, we won't return to that: it's totally wrong and has consequences,
>> as it actually violates the purpose of the setting. RT was storing
>> UTF-8 encoded data in a latin1 column, so collations worked absolutely
>> incorrectly for everything, even latin1, and were close to binary.
>>
>> At this point I can suggest you either move to a binary collation or
>> create a new one and send it to the MySQL team for inclusion.
>>
>>   
>>     
> Understood, I didn't like that idea either. Oddly enough,
> latin1_swedish_ci (the latin1 default) isn't supposed to be accent
> sensitive, while latin1_general_ci is, but my old database (MySQL 4.1)
> seems to be indexing accented characters and treating them as separate.
> The collation isn't specified, so I'm assuming swedish, but it's behaving
> like general; perhaps the old version respected the differences. I'm
> basically trying to get the same behavior as before (perhaps if swedish
> had been enforced before, I wouldn't be in this position). Regardless,
> this isn't really an issue with RT.
>   
>>> The issue in question -> http://bugs.mysql.com/bug.php?id=34130
>>>
>>> They said it's on the 'todo' list. MSSQL handles this with ci_ai, ci_as,
>>> cs_ai and cs_as collations, where case and accent sensitivity can each
>>> be turned on or off. Hopefully they do come around to it.
>>>
>>> Character differences for MySQL: http://www.collation-charts.org/mysql60/
>>>
>>>
>>> Curtis
>>>     
>>>       
> Thanks again for your time. I'm really excited to launch 3.8.x; compared
> to 3.4.x, our users are loving it, especially the reporting and all that.
> Curtis.
I have a question that's probably obvious. If I go ahead with utf8_bin,
any variation of case in incoming email addresses will be regarded as
distinct, right? I can see this causing many issues. I may just get rid of
my accented emails and possibly merge the tickets, or just delete the users,
as they aren't valid emails anyway. I don't think I could pad the emails
enough to get the users to match; looking through my data, emails come in
with all kinds of different casing.
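
To make the concern concrete, here is a minimal Python sketch (not RT or
MySQL code) that uses byte-for-byte comparison as a stand-in for utf8_bin.
The function names and the example addresses are made up for illustration,
and strip_accents only roughly approximates what an accent-insensitive
collation does when comparing strings.

```python
import unicodedata


def binary_equal(a: str, b: str) -> bool:
    """Compare byte for byte, roughly what a binary collation does."""
    return a.encode("utf-8") == b.encode("utf-8")


def fold_case(address: str) -> str:
    """Hypothetical clean-up step: trim whitespace and lowercase the address."""
    return address.strip().lower()


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition (an approximation only)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    stored = "Remy.Dupont@Example.com"      # hypothetical existing user
    incoming = "remy.dupont@example.com"    # same person, different case
    accented = "rémy.dupont@example.com"    # accented variant

    # Under a binary comparison, case and accent variants are all distinct.
    print(binary_equal(stored, incoming))
    print(binary_equal(incoming, accented))

    # Folding case before comparing makes the case variants match again.
    print(binary_equal(fold_case(stored), fold_case(incoming)))

    # Stripping accents as well makes the accented variant match too.
    print(binary_equal(strip_accents(fold_case(accented)), fold_case(stored)))
```

So with a binary collation, any matching on email would presumably need
something like the case folding above applied before lookup, or the existing
rows cleaned up first.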

Curtis