[rt-users] RT saves data in quoted-printable, why???

Thu Mar 5 14:31:56 EST 2015

On Thu, 5 Mar 2015 18:50:57 +0100 Palle Girgensohn
> * Ensure that all data containing non-ASCII is quoted-printable
> encoded for PostgreSQL, instead of merely all data not claiming to be
>    "text/plain"
> 
> Why is this? It seems it is doing more harm than good -- it makes it
> harder to use indexes and makes it harder to use it from the tables
> directly? PostgreSQL is very capable of storing any kind of character
> set? We use UTF-8 for everything, this fix breaks a lot of things for
> us.

The commit in question is 3a9c38ed, which changes:

    } elsif (    !$RT::Handle->BinarySafeBLOBs
              && $MIMEType !~ /text\/plain/gi
              && !Encode::is_utf8( $Body, 1 ) ) {
          $ContentEncoding = 'quoted-printable';
    }

...to:

    } elsif (    !$RT::Handle->BinarySafeBLOBs
              && $Body =~ /\P{ASCII}/
              && !Encode::is_utf8( $Body, 1 ) ) {
          $ContentEncoding = 'quoted-printable';
    }

That is, any data which claimed to be "text/plain" would blindly be
attempted to be shoved into the database; this includes content which
contained _invalid_ UTF-8 byte sequences.  The commit was in RT 4.0; RT
4.2 is much better about transcoding to real UTF-8, or marking the part
as "application/octet-stream" if it cannot be decoded.  In that sense,
this logic is now less necessary.  However, the commit was absolutely
necessary at the time to not _lose_ data.  Erring on the side of data
preservation, at the expense of possibly-unnecessary encoding, seems
like not an unreasonable choice.

That being said, the Encode::is_utf8() check attempts to verify that we
only quoted-printable encode data whose bytes are not currently a
correctly-encoded UTF-8 byte stream.  The bug now lies there, as after
the "utf8-reckoning" branch (2620658..af9fe7c), the $Body argument is
guaranteed to contain bytes, not characters.  Per the Encode::is_utf8()
documentation:

    [INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If
    CHECK is true, also checks whether STRING contains well-formed UTF-8.
    Returns true if successful, false otherwise.

The UTF8 flag is almost certainly off on $Body (it _may_ be on, and
still contain only codepoints < 128, but this is unlikely), so the
well-formed-ness of the string (which is the relevant thing we wish to
confirm) is never checked.

I _believe_ that this code may be replaced by (untested):

    } elsif (    !$RT::Handle->BinarySafeBLOBs
              && !eval { Encode::decode( "UTF-8", $Body, Encode::FB_CROAK); 1 } ) {
          $ContentEncoding = 'quoted-printable';
    }

I may push a branch later that makes this change, and see what breaks.

All of that aside, note that "it makes it harder to use indexes" makes
little sense -- there are no indexes on Content.  The full-text index
uses its own tsvector, which it constructs after decoding the Content,
so the transfer encoding of the Content column is irrelevant to that.

 - Alex