[Rt-commit] rt branch, 4.2/serializer-bytes, created. rt-4.2.4-15-g0077837

Mon May 19 12:06:25 EDT 2014

The branch, 4.2/serializer-bytes has been created
        at  0077837fcba735a8115d537439b33fe11ef13a65 (commit)

- Log -----------------------------------------------------------------
commit 0077837fcba735a8115d537439b33fe11ef13a65
Author: Alex Vandiver <alexmv at bestpractical.com>
Date:   Fri May 16 22:54:01 2014 -0400

    Serialize bytes, not characters, for Attachments and OCFV LargeContent
    
    Stop encoding all data as utf-8 before inserting -- it is clearly
    incorrect in the case of binary data.  While it is tempting to instead
    only encode as UTF-8 if it is textual data, this too is incorrect.
    
    As 18c810d0 describes, the character set of textual data is not
    guaranteed to be UTF-8; as such, by having stored characters instead of
    bytes in the serialized form, information has been lost.  There are two
    recovery methods, neither terribly appealing: update to store
    charset="utf-8" on all data on insert, or attempt to guess the original
    encoding and re-encode via that.  The former is distasteful because it
    alters the database upon serializing; the latter is fragile because the
    guessed encoding is not guaranteed to match the original.
    
    Instead, serialize the bytes as they occurred in the database, and
    import them explicitly as bytes.  This does make it possible to insert
    invalid UTF-8 into the database -- but contrary to what 74683a70
    implies, this is not incorrect, as binary data (for instance) is seldom
    UTF-8.  3a9c38ed ensures that anything which contains high-bit
    characters will be QP-encoded.

diff --git a/lib/RT/Record.pm b/lib/RT/Record.pm
index b000209..ff3236b 100644
--- a/lib/RT/Record.pm
+++ b/lib/RT/Record.pm
@@ -2414,10 +2414,17 @@ sub Serialize {
     $store{$_} = $values{lc $_} for @cols;
     $store{id} = $values{id}; # Explicitly necessary in some cases
 
-    # Un-encode things with a ContentEncoding for transfer
+    # Un-apply the _transfer_ encoding, but don't mess with the octets
+    # themselves.  Calling ->Content directly would, in some cases,
+    # decode from some mostly-unknown character set -- which reversing
+    # on the far end would be complicated.
     if ($ca{ContentEncoding} and $ca{ContentType}) {
         my ($content_col) = grep {exists $ca{$_}} qw/LargeContent Content/;
-        $store{$content_col} = $self->$content_col;
+        $store{$content_col} = $self->_DecodeLOB(
+            "application/octet-stream", # Lie so that we get bytes, not characters
+            $self->ContentEncoding,
+            $self->_Value( $content_col, decode_utf8 => 0 )
+        );
         delete $store{ContentEncoding};
     }
     return %store unless $args{UIDs};
@@ -2456,8 +2463,7 @@ sub PreInflate {
         my ($content_col) = grep {exists $ca{$_}} qw/LargeContent Content/;
         if (defined $data->{$content_col}) {
             my ($ContentEncoding, $Content) = $class->_EncodeLOB(
-                Encode::encode("UTF-8",$data->{$content_col},Encode::FB_CROAK),
-                $data->{ContentType},
+                $data->{$content_col}, $data->{ContentType},
             );
             $data->{ContentEncoding} = $ContentEncoding;
             $data->{$content_col} = $Content;

-----------------------------------------------------------------------
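
The byte-versus-character problem described in the commit message can be reproduced outside RT. The following is a minimal sketch, not code from this commit: the payload and variable names are hypothetical, and it uses only the core Encode module. It shows that UTF-8-encoding a value that is already a string of octets changes its contents, while passing the octets through untouched does not.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical attachment payload: four raw octets that are not valid UTF-8.
my $octets = "\xFF\xD8\xFF\xE0";

# Work on a copy so the original scalar is left alone.
my $copy = $octets;

# Treating the octets as characters and encoding them as UTF-8 turns every
# octet above 0x7F into a two-byte sequence, so the stored data changes.
my $reencoded = Encode::encode( "UTF-8", $copy );

printf "original:   %d bytes\n", length $octets;       # 4
printf "re-encoded: %d bytes\n", length $reencoded;    # 8

# Handing the octets over as-is preserves them exactly.
my $passed_through = $octets;
print "pass-through is byte-identical\n" if $passed_through eq $octets;
```

Text in a non-UTF-8 character set is affected in the same way: once it has been stored as re-encoded UTF-8 without a record of the original charset, the original octets can no longer be recovered with certainty.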