[Rt-commit] rt branch, 4.0/utf8-reckoning, repushed

Wed Sep 3 13:49:19 EDT 2014

The branch 4.0/utf8-reckoning was deleted and repushed:
       was 9eb1178b7cd31072fbaac0288944e040192c8d69
       now e96002ae4dadf2125e3ead0e1940cb5df9f4c78a

 1:  f2324d1 =  1:  e93c82e Re-indent _EncodeLOB and _DecodeLOB
 2:  1827f6c =  2:  802dc8d Respect the database Content-Type header in decoding textual parts
 3:  8dbaf2c =  3:  968c25c Stop needlessly frobbing utf8 internals
 4:  820c7d9 =  4:  04b5caf Decoding content, and returning characters, is incorrect
 5:  38920a2 !  5:  35eb3bb Stop assuming the data in the database is utf8
    @@ -2,7 +2,7 @@
     
         Stop assuming the data in the database is utf8
         
    -    As noted in 1827f6c, not all content we currently call "textual" was
    +    As noted in 802dc8d, not all content we currently call "textual" was
         always treated as such.  When re-encoding, do not assume that the
         encoding in the database is UTF-8 -- rather, read the Content-Type
         header, and examine the charset stated there.  Convert from that to the
 6:  f2b0db6 =  6:  1945c00 Modernize and condense t/mail/sendmail.t
 7:  1acfacb =  7:  8bc6d50 Always log bytes, not characters
 8:  e9e7e96 =  8:  5e4c0f1 The alluded-to deficiency is not a concern in perl ≥ 5.8.3
 9:  0c028fc !  9:  f497a11 Ensure all MIME::Entity bodies are UTF-8 encoded bytes
    @@ -7,7 +7,7 @@
         and noting their character set.
         
         In the case of Approvals/index.html, there was no need for an explicit
    -    MIME::Entity object; ->Correspond creates on as needed from a "Content"
    +    MIME::Entity object; ->Correspond creates one as needed from a "Content"
         argument.
     
     diff --git a/lib/RT/Action/CreateTickets.pm b/lib/RT/Action/CreateTickets.pm
    @@ -216,7 +216,7 @@
     +        Type    => 'text/plain',
              Charset => 'UTF-8',
     -        Data    => $args{'Content'} || "",
    -+        Data    => Encode::encode( "UTf-8", $args{'Content'} || ""),
    ++        Data    => Encode::encode( "UTF-8", $args{'Content'} || ""),
          );
      
          my ( $Transaction, $Object, $Description ) = $self->Create(
10:  3543a44 ! 10:  3ccd8b0 Ensure all MIME::Entity headers are UTF-8 encoded bytes
    @@ -12,7 +12,7 @@
         
         While the majority of these headers will never have wide characters in
         them, always decoding and encoding ensures the proper disipline to
    -    guarantee that strings with the "UTF-8" flag do not get placed in a
    +    guarantee that strings with the "UTF8" flag do not get placed in a
         header, which can cause double-encoding.
     
     diff --git a/lib/RT/Action/SendEmail.pm b/lib/RT/Action/SendEmail.pm
11:  206e688 = 11:  c0f4e49 Make RT::Action::SendEmail->SetHeader take characters, not bytes
12:  aa3cc45 ! 12:  2206fe5 Add a utility method to check that an input is bytes
    @@ -2,20 +2,20 @@
     
         Add a utility method to check that an input is bytes
         
    -    Note that it is impossible to verify that an input characters; here, we
    -    can only validate if it _could_ be bytes.
    +    Note that it is impossible to verify that an input is characters; here,
    +    we can only validate if it _could_ be bytes.
         
    -    First, any string with the "UTF-8" flag off cannot contain codepoints
    -    above 255, and as such is safe.  Additionally, if the "UTF-8" flag is
    -    on, having no codepoints above 127 means the bytes are unambigious.
    -    Having codepoints above 255 is guaranteedly a sign that the input is not
    -    a byte string.
    +    First, any string with the "UTF8" flag off cannot contain codepoints
    +    above 255, and as such is safe.  Additionally, if the "UTF8" flag is on,
    +    having no codepoints above 127 means the bytes are unambigious.  Having
    +    codepoints above 255 is guaranteedly a sign that the input is not a byte
    +    string.
         
    -    This leaves only the case of a string with the "UTF-8" flag on, and
    -    codepoints above 127 but below 255.  The "UTF-8" flag is a sign that
    -    they were _likely_ touched by character data at some point.  In such
    -    cases we warn, suggesting that the bytes have the UTF-8 flag disabled by
    -    means of utf8::downgrade, if they are indeed bytes.
    +    This leaves only the case of a string with the "UTF8" flag on, and
    +    codepoints above 127 but below 255.  The "UTF8" flag is a sign that they
    +    were _likely_ touched by character data at some point.  In such cases we
    +    warn, suggesting that the bytes have the "UTF8" flag disabled by means
    +    of utf8::downgrade, if they are indeed bytes.
     
     diff --git a/lib/RT/Util.pm b/lib/RT/Util.pm
     --- a/lib/RT/Util.pm
13:  19321eb ! 13:  8e62357 Verify that MIME::Entity bodies are bytes, and remove _utf8_off call
    @@ -6,7 +6,7 @@
         body is indeed bytes, and not characters.
         
         We also remove the _utf8_off call -- because, contrary to what the
    -    comment implies, the presence or absence of the "UTF-8" flag does _not_
    +    comment implies, the presence or absence of the "UTF8" flag does _not_
         determine if a string is "encoded as octets and not as characters"; it
         merely states that the string is capable of holding codepoints > 255.
         If it happens to not contain any, the _utf8_off does nothing.  If it
    @@ -18,7 +18,7 @@
         fixed by a simple _utf8_off, but instead must be fixed by ensuring that
         the body always contains bytes, not wide characters -- as it now does,
         thanks to the prior commits.  The call to RT::Util::assert_bytes serves
    -    as an additional safeguard against backsliding o nthat assumption.
    +    as an additional safeguard against backsliding on that assumption.
     
     diff --git a/lib/RT/I18N.pm b/lib/RT/I18N.pm
     --- a/lib/RT/I18N.pm
14:  5a0cfda = 14:  8140533 Verify that MIME::Entity headers are bytes, and remove _utf8_off call
15:  b865183 ! 15:  2d65e31 Standardize on the stricter Encode::encode("UTF-8", ...) everywhere
    @@ -17,6 +17,19 @@
         dealing with encodings, it should ensure that it does not produce byte
         sequences that are invalid according to official Unicode standards.
     
    +diff --git a/lib/RT/Action/SendEmail.pm b/lib/RT/Action/SendEmail.pm
    +--- a/lib/RT/Action/SendEmail.pm
    ++++ b/lib/RT/Action/SendEmail.pm
    +@@
    +     $self->SetHeader(
    +         Subject =>
    +             RT::Interface::Email::AddSubjectTag(
    +-                Encode::decode_utf8( $head->get('Subject') ),
    ++                Encode::decode( "UTF-8", $head->get('Subject') ),
    +                 $self->TicketObj,
    +             ),
    +     );
    +
     diff --git a/lib/RT/Dashboard/Mailer.pm b/lib/RT/Dashboard/Mailer.pm
     --- a/lib/RT/Dashboard/Mailer.pm
     +++ b/lib/RT/Dashboard/Mailer.pm
    @@ -216,7 +229,7 @@
     +             ( Cc => Encode::encode( "UTF-8", $args{'Cc'} ) ) : ()),
              Type    => 'text/plain',
              Charset => 'UTF-8',
    -         Data    => Encode::encode( "UTf-8", $args{'Content'} || ""),
    +         Data    => Encode::encode( "UTF-8", $args{'Content'} || ""),
     
     diff --git a/lib/RT/Tickets.pm b/lib/RT/Tickets.pm
     --- a/lib/RT/Tickets.pm
16:  df88c57 = 16:  ecf4e7c Remove "use utf8" from RT::I18N::fr, making NBSP explicit
17:  a6e3fb5 = 17:  b3c6ae6 Remove remaining cases of "use utf8"
18:  abe35cd = 18:  9fc8d08 Dashboard: decode bytes in query parameters into characters
19:  774a740 = 19:  53dbebc Tests: WWW::Mechanize correctly returns characters now
20:  69dae45 ! 20:  4522c09 _utf8_on in EncodeToMIME is needless and incorrect; remove it
    @@ -4,12 +4,12 @@
         
         66930fd8 switched from an explicit _utf8_off to an explicit _utf8_on, in
         an attempt to switch from splitting on bytes to splitting on characters.
    -    However, the "UTF-8" flag does not magically determine if a string is
    +    However, the "UTF8" flag does not magically determine if a string is
         bytes or characters.  Instead, only consistency in calling convention
         can do so.  All callsites of RT::Interface::Email::EncodeToMIME and
         RT::Action::SendEmail::MIMEEncodeString now pass character strings; all
         that _utf8_on can do is incorrectly "decode" those strings as UTF-8 if
    -    they happen to not have the "UTF-8" flag set.
    +    they happen to not have the "UTF8" flag set.
     
     diff --git a/lib/RT/Interface/Email.pm b/lib/RT/Interface/Email.pm
     --- a/lib/RT/Interface/Email.pm
21:  df961df = 21:  ed465cf Move comment from PreprocessTimeUpdates to DecodeArgs, where it belongs
22:  ed57bcd = 22:  2eca779 Always decode data in %ARGS as UTF-8 in DecodeArgs
23:  aec38ea = 23:  0afbeca Add RT::Util::assert_bytes checks to _EncodeLOB and _DecodeLOB
24:  44f43cf = 24:  a382b50 Update POD and comments to be clearer about characters vs bytes
25:  a502084 = 25:  dbb9efe Remove an unreachable line
26:  ecb655e = 26:  81132aa TSV need not explicitly encode as UTF-8; all output is UTF-8 encoded
27:  3dbae7a = 27:  44ec571 Move "use Encode" calls to one central location
28:  b26af9b = 28:  779dc60 Consistent character/byte hygene allows RT to run with DBD::Pg 3.3.0
29:  83649c6 ! 29:  c2f0fe0 Note that HTTP output still incorrectly relies on is_utf8
    @@ -2,16 +2,16 @@
     
         Note that HTTP output still incorrectly relies on is_utf8
         
    -    Currently, any string which has the "UTF-8" flag is encoded as UTF-8
    +    Currently, any string which has the "UTF8" flag is encoded as UTF-8
         before being sent to the browser.  This requires that any output which
         is binary, or has already been encoded to bytes, _not_ have the flag
         accidentally set.
         
    -    It also requires that all output character strings have the "UTF-8" flag
    +    It also requires that all output character strings have the "UTF8" flag
         enabled; while necessary for codepoints > 255, it is not strictly
         required for codepoints between 127 and 255.  As RT now consistently
         uses Encode::decode() to produce character strings, which sets the
    -    "UTF-8" flag even for characters in that range, this is likely safe.
    +    "UTF8" flag even for characters in that range, this is likely safe.
         
         The most correct fix would be to explicitly flag output that needs to be
         encoded.  However, doing so in a backwards compatible manner is
30:  db93e66 = 30:  302989a Comment the logic for database decode_utf8/is_utf8 checking
31:  89d45e9 = 31:  a07d811 Encode characters on their way out of tests
32:  9eb1178 = 32:  e96002a Stop hiding "Wide character in..." warnings