[rt-users] RT saves data in quoted-printable, why???
Alex Vandiver
alexmv at bestpractical.com
Thu Mar 5 14:31:56 EST 2015
On Thu, 5 Mar 2015 18:50:57 +0100 Palle Girgensohn
> * Ensure that all data containing non-ASCII is quoted-printable
> encoded for PostgreSQL, instead of merely all data not claiming to be
> "text/plain"
>
> Why is this? It seems it is doing more harm than good -- it makes it
> harder to use indexes and makes it harder to use it from the tables
> directly? PostgreSQL is very capable of storing any kind of character
> set? We use UTF-8 for everything, this fix breaks a lot of things for
> us.
The commit in question is 3a9c38ed, which changes:
} elsif ( !$RT::Handle->BinarySafeBLOBs
&& $MIMEType !~ /text\/plain/gi
&& !Encode::is_utf8( $Body, 1 ) ) {
$ContentEncoding = 'quoted-printable';
}
...to:
} elsif ( !$RT::Handle->BinarySafeBLOBs
&& $Body =~ /\P{ASCII}/
&& !Encode::is_utf8( $Body, 1 ) ) {
$ContentEncoding = 'quoted-printable';
}
That is, any data which claimed to be "text/plain" would blindly be
attempted to be shoved into the database; this includes content which
contained _invalid_ UTF-8 byte sequences. The commit was in RT 4.0; RT
4.2 is much better about transcoding to real UTF-8, or marking the part
as "application/octet-stream" if it cannot be decoded. In that sense,
this logic is now less necessary. However, the commit was absolutely
necessary at the time to not _lose_ data. Erring on the side of data
preservation, at the expense of possibly-unnecessary encoding, seems
like not an unreasonable choice.
That being said, the Encode::is_utf8() check attempts to verify that we
only quoted-printable encode data whose bytes are not currently a
correctly-encoded UTF-8 byte stream. The bug now lies there, as after
the "utf8-reckoning" branch (2620658..af9fe7c), the $Body argument is
guaranteed to contain bytes, not characters. Per the Encode::is_utf8()
documentation:
[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If
CHECK is true, also checks whether STRING contains well-formed UTF-8.
Returns true if successful, false otherwise.
The UTF8 flag is almost certainly off on $Body (it _may_ be on, and
still contain only codepoints < 128, but this is unlikely), so the
well-formed-ness of the string (which is the relevant thing we wish to
confirm) is never checked.
I _believe_ that this code may be replaced by (untested):
} elsif ( !$RT::Handle->BinarySafeBLOBs
&& !eval { Encode::decode( "UTF-8", $Body, Encode::FB_CROAK); 1 } ) {
$ContentEncoding = 'quoted-printable';
}
I may push a branch later that makes this change, and see what breaks.
All of that aside, note that "it makes it harder to use indexes" makes
little sense -- there are no indexes on Content. The full-text index
uses its own tsvector, which it constructs after decoding the Content,
so the transfer encoding of the Content column is irrelevant to that.
- Alex
More information about the rt-users
mailing list