[Rt-commit] rt branch, 4.2/serializer-bytes, created. rt-4.2.4-15-g0077837
Alex Vandiver
alexmv at bestpractical.com
Mon May 19 12:06:25 EDT 2014
The branch, 4.2/serializer-bytes has been created
at 0077837fcba735a8115d537439b33fe11ef13a65 (commit)
- Log -----------------------------------------------------------------
commit 0077837fcba735a8115d537439b33fe11ef13a65
Author: Alex Vandiver <alexmv at bestpractical.com>
Date: Fri May 16 22:54:01 2014 -0400
Serialize bytes, not characters, for Attachments and OCFV LargeContent
Stop encoding all data as utf-8 before inserting -- it is clearly
incorrect in the case of binary data. While it is tempting to instead
only encode as UTF-8 if it is textual data, this too is incorrect.
As 18c810d0 describes, the character set of textual data is not
guaranteed to be UTF-8; as such, by having stored characters instead of
bytes in the serialized form, information has been lost. There are two
recovery methods, neither terribly appealing: update to store
charset="utf-8" on all data on insert, or attempt to guess the original
encoding and re-encode via that. The former is distasteful because it
alters the database upon serializing; the latter is fragile because the
guessed encoding is not guaranteed to match the original one.
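
As a rough illustration of the information loss described above (a standalone sketch using only the core Encode module, not RT code; the sample string and the choice of charsets are assumptions made for the example):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical attachment body: "cafe" with an e-acute, stored as
# ISO-8859-1 octets (a single byte, 0xE9, for the accented letter).
my $original = "caf\xE9";

# Character-based serialization: decode using the (here, known) charset,
# then write the resulting characters out as UTF-8.
my $serialized = encode("UTF-8", decode("ISO-8859-1", $original));

# The serialized octets are no longer the octets that were stored.
print "original:   ", unpack("H*", $original),   "\n";
print "serialized: ", unpack("H*", $serialized), "\n";

# An importer that only sees $serialized must guess the source charset
# to get back to octets. Two plausible guesses give different results.
my $chars       = decode("UTF-8", $serialized);
my $guess_right = encode("ISO-8859-1", $chars);
my $guess_wrong = encode("cp437", $chars);

print "right guess restores original: ",
    ($guess_right eq $original ? "yes" : "no"), "\n";
print "wrong guess restores original: ",
    ($guess_wrong eq $original ? "yes" : "no"), "\n";
```

Only the correct guess reproduces the stored bytes, and nothing in the serialized form says which guess is correct.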
Instead, serialize the bytes as they occurred in the database, and
import them explicitly as bytes. This does make it possible to insert
invalid UTF-8 into the database -- but contrary to what 74683a70
implies, this is not incorrect, as binary data (for instance) is seldom
UTF-8. 3a9c38ed ensures that anything which contains high-bit
characters will be QP-encoded.
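
The byte-preserving round trip can be sketched outside of RT roughly as follows. This is an illustration only: it uses MIME::Base64 as a stand-in for the transfer encoding and does not call RT's _DecodeLOB or _EncodeLOB; the payload is a hypothetical example.

```perl
use strict;
use warnings;
use Encode ();
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical binary payload (the 8-byte PNG signature), held in the
# database under a base64 transfer encoding.
my $octets = "\x89PNG\r\n\x1A\n";
my $stored = encode_base64($octets);

# Serialize: undo only the transfer encoding, leaving the octets alone.
my $serialized = decode_base64($stored);

# Import: take the octets as they are, with no character-set step.
my $imported = $serialized;

# For contrast, the old import path encoded the data as UTF-8 first.
# Each high-bit byte is treated as a Latin-1 character and becomes two
# bytes, so the payload is corrupted.
my $mangled = Encode::encode("UTF-8", $serialized);

print "bytes round trip intact: ",
    ($imported eq $octets ? "yes" : "no"), "\n";
print "length after UTF-8 encode: ", length($mangled),
    " (was ", length($octets), ")\n";
```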
diff --git a/lib/RT/Record.pm b/lib/RT/Record.pm
index b000209..ff3236b 100644
--- a/lib/RT/Record.pm
+++ b/lib/RT/Record.pm
@@ -2414,10 +2414,17 @@ sub Serialize {
$store{$_} = $values{lc $_} for @cols;
$store{id} = $values{id}; # Explicitly necessary in some cases
- # Un-encode things with a ContentEncoding for transfer
+ # Un-apply the _transfer_ encoding, but don't mess with the octets
+ # themselves. Calling ->Content directly would, in some cases,
+ # decode from some mostly-unknown character set -- which reversing
+ # on the far end would be complicated.
if ($ca{ContentEncoding} and $ca{ContentType}) {
my ($content_col) = grep {exists $ca{$_}} qw/LargeContent Content/;
- $store{$content_col} = $self->$content_col;
+ $store{$content_col} = $self->_DecodeLOB(
+ "application/octet-stream", # Lie so that we get bytes, not characters
+ $self->ContentEncoding,
+ $self->_Value( $content_col, decode_utf8 => 0 )
+ );
delete $store{ContentEncoding};
}
return %store unless $args{UIDs};
@@ -2456,8 +2463,7 @@ sub PreInflate {
my ($content_col) = grep {exists $ca{$_}} qw/LargeContent Content/;
if (defined $data->{$content_col}) {
my ($ContentEncoding, $Content) = $class->_EncodeLOB(
- Encode::encode("UTF-8",$data->{$content_col},Encode::FB_CROAK),
- $data->{ContentType},
+ $data->{$content_col}, $data->{ContentType},
);
$data->{ContentEncoding} = $ContentEncoding;
$data->{$content_col} = $Content;
-----------------------------------------------------------------------