[Rt-commit] rt branch, 4.4/external-encodings, created. rt-4.4.1-124-g1fafe1f

Alex Vandiver alexmv at bestpractical.com
Wed Oct 19 04:16:13 EDT 2016

The branch, 4.4/external-encodings has been created
        at  1fafe1facc72ed16600c3e017954240416e49b53 (commit)

- Log -----------------------------------------------------------------
commit 619afa714c46e2c3ef731c54d7dbff8baade6eff
Author: Alex Vandiver <alex at chmrr.net>
Date:   Wed Oct 19 01:01:47 2016 -0700

    Apply textual decodings to extenally-stored attachments
    The short-circuit `return` statement in the "external" branch of
    _DecodeLOB, introduced in 686daae6, causes externally-stored
    attachments to skip the logic that decodes the bytes to characters.
    This leads to double-encoded textual content.  This was not
    immediately apparent because only textual content longer than the
    `$ExternalStorageCutoffSize` cutoff is stored exterally, masking the
    problem in most cases.
    Remove the premature `return`.  Since the bug was during decoding, not
    storage, there is no corrective action necessary for attachments that
    were already externalized.
    Based on tests and report from Petr <petr at kle.cz>.

diff --git a/lib/RT/Record.pm b/lib/RT/Record.pm
index 6d24385..65cd862 100644
--- a/lib/RT/Record.pm
+++ b/lib/RT/Record.pm
@@ -867,8 +867,6 @@ sub _DecodeLOB {
             RT->Logger->error( "Failed to load $Digest from external storage: $msg" );
             return ("");
-        return ($Content);
     elsif ( $ContentEncoding && $ContentEncoding ne 'none' ) {
         return ( $self->loc( "Unknown ContentEncoding [_1]", $ContentEncoding ) );
diff --git a/t/externalstorage/encoding.t b/t/externalstorage/encoding.t
new file mode 100644
index 0000000..3b1de6e
--- /dev/null
+++ b/t/externalstorage/encoding.t
@@ -0,0 +1,43 @@
+use strict;
+use warnings;
+use RT;
+use RT::Test::ExternalStorage tests => undef;
+RT->Config->Set( ExternalStorageCutoffSize => 1 );
+my $queue = RT::Test->load_or_create_queue(Name => 'General');
+my $non_english_text = Encode::decode("UTF-8",'Příliš žluťoučký kůň pěl ďábelské ódy');
+my $message = MIME::Entity->build(
+    From     => 'root at localhost',
+    Subject  => 'test',
+    Charset  => 'UTF-8',
+    Encoding => 'quoted-printable',
+    Type     => 'text/plain',
+    Data     => Encode::encode('UTF-8', $non_english_text),
+my $ticket = RT::Ticket->new( RT->SystemUser );
+my ($id) = $ticket->Create(
+    Queue => $queue,
+    Subject => 'test',
+    MIMEObj => $message,
+ok $id, 'created a ticket';
+my @attachments = @{ $ticket->Transactions->First->Attachments->ItemsArrayRef };
+is scalar @attachments, 1, "Found one attachment";
+is $attachments[0]->ContentType, "text/plain", "Found the text part";
+is $attachments[0]->Content, $non_english_text, "Can get the text part content";
+ok !system('sbin/rt-externalize-attachments'), "rt-externalize-attachments ran successfully";
+ at attachments = @{ $ticket->Transactions->First->Attachments->ItemsArrayRef };
+is scalar @attachments, 1, "Found one attachment";
+is $attachments[0]->ContentType, "text/plain", "Found the text part";
+is $attachments[0]->Content, $non_english_text, "Can still get the text part content";
+is $attachments[0]->ContentEncoding, "external", "Content is external";

commit 1fafe1facc72ed16600c3e017954240416e49b53
Author: Alex Vandiver <alex at chmrr.net>
Date:   Wed Oct 19 01:01:54 2016 -0700

    Update POD for _DecodeLOB to be clearer about bytes/characters and character sets

diff --git a/lib/RT/Record.pm b/lib/RT/Record.pm
index 65cd862..f91e506 100644
--- a/lib/RT/Record.pm
+++ b/lib/RT/Record.pm
@@ -811,32 +811,31 @@ sub _EncodeLOB {
 =head2 _DecodeLOB C<ContentType>, C<ContentEncoding>, C<Content>
-Unpacks data stored in the database, which may be base64 or QP encoded
-because of our need to store binary and badly encoded data in columns
-marked as UTF-8.  Databases such as PostgreSQL and Oracle care that you
-are feeding them invalid UTF-8 and will refuse the content.  This function
-handles unpacking the encoded data.
+This function reverses the effects of L</_EncodeLOB>, by unpacking the
+data, provided as bytes (not characters!), from the database.  This
+data may also be Base64 or Quoted-Printable encoded, as given by
+C<Content-Encoding>.  This encoding layer exists because the
+underlying database column is "text", which rejects non-UTF-8 byte
 Alternatively, if the data lives in external storage, it will be read
 (or downloaded) and returned.
-C<_DecodeLOB> returns textual data as a UTF-8 string which has been
-processed by L<Encode>'s PERLQQ filter which will replace the invalid bytes
-with C<\x{HH}> so you can see the invalid byte but won't run into problems
-treating the data as UTF-8 later.
-This is similar to how we filter all data coming in via the web UI in
-L<RT::Interface::Web/DecodeARGS>. This filter should only end up being
-applied to old data from less UTF-8-safe versions of RT.
-If the passed C<ContentType> includes a character set, that will be used
-to decode textual data; the default character set is UTF-8.  This is
-necessary because while we attempt to store textual data as UTF-8, the
-definition of "textual" has migrated over time, and thus we may now need
-to attempt to decode data that was previously not transcoded on insertion.
-Important note: This function expects an octet string and returns a
-character string for non-binary data.
+This function differs in one important way from being the inverse of
+L</_EncodeLOB>: for textual data (as judged via
+L<RT::I18N/IsTextualContentType> applied to the given C<ContentType>),
+C<_DecodeLOB> returns character strings, not bytes.  The character set
+used in decoding is taken from the C<ContentType>, or UTF-8 if not
+provided; however, for all textual content inserted by current code,
+the character set used for storage is always UTF-8.
+This decoding step is done using L<Encode>'s PERLQQ filter, which
+replaces invalid byte sequences with C<\x{HH}>.  This mirrors how data
+from query parameters are parsed in L<RT::Interface::Web/DecodeARGS>.
+Since RT is now strict about the bytes it inserts, substitution
+characters should only be needed for data inserted by older versions
+of RT, or for C<ContentType>s which are now believed to be textual,
+but were not considered so on insertion (and thus not transcoded).


More information about the rt-commit mailing list