[Rt-commit] rt branch, 4.2/utf8-reckoning, repushed

Wed Sep 3 13:49:41 EDT 2014

The branch 4.2/utf8-reckoning was deleted and repushed:
       was 948620a5c3d91444e41ac38361a7b3fa81d5c466
       now af9fe7c431b030f3c78cf0729819cc71df8d61a9

 1:  2715890 =  1:  15dde68 Modernize and condense t/mail/sendmail.t and t/mail/sendmail-plaintext.t
 2:  af6491f =  2:  a275a7f Always log bytes, not characters
 3:  b345603 =  3:  18ef9b2 The alluded-to deficiency is not a concern in perl ≥ 5.8.3
 4:  feb718b !  4:  6d9bd63 Ensure all MIME::Entity bodies are UTF-8 encoded bytes
    @@ -7,7 +7,7 @@
         and noting their character set.
         
         In the case of Approvals/index.html, there was no need for an explicit
    -    MIME::Entity object; ->Correspond creates on as needed from a "Content"
    +    MIME::Entity object; ->Correspond creates one as needed from a "Content"
         argument.
     
     diff --git a/lib/RT/Action/CreateTickets.pm b/lib/RT/Action/CreateTickets.pm
 5:  6dbe1b1 !  5:  41d084f Ensure all MIME::Entity headers are UTF-8 encoded bytes
    @@ -12,7 +12,7 @@
         
         While the majority of these headers will never have wide characters in
         them, always decoding and encoding ensures the proper disipline to
    -    guarantee that strings with the "UTF-8" flag do not get placed in a
    +    guarantee that strings with the "UTF8" flag do not get placed in a
         header, which can cause double-encoding.
     
     diff --git a/lib/RT/Action/SendEmail.pm b/lib/RT/Action/SendEmail.pm
 6:  a122628 =  6:  12c2671 Make RT::Action::SendEmail->SetHeader take characters, not bytes
 7:  2fcc445 !  7:  a21eb81 Add a utility method to check that an input is bytes
    @@ -2,20 +2,20 @@
     
         Add a utility method to check that an input is bytes
         
    -    Note that it is impossible to verify that an input characters; here, we
    -    can only validate if it _could_ be bytes.
    +    Note that it is impossible to verify that an input is characters; here,
    +    we can only validate if it _could_ be bytes.
         
    -    First, any string with the "UTF-8" flag off cannot contain codepoints
    -    above 255, and as such is safe.  Additionally, if the "UTF-8" flag is
    -    on, having no codepoints above 127 means the bytes are unambigious.
    -    Having codepoints above 255 is guaranteedly a sign that the input is not
    -    a byte string.
    +    First, any string with the "UTF8" flag off cannot contain codepoints
    +    above 255, and as such is safe.  Additionally, if the "UTF8" flag is on,
    +    having no codepoints above 127 means the bytes are unambigious.  Having
    +    codepoints above 255 is guaranteedly a sign that the input is not a byte
    +    string.
         
    -    This leaves only the case of a string with the "UTF-8" flag on, and
    -    codepoints above 127 but below 255.  The "UTF-8" flag is a sign that
    -    they were _likely_ touched by character data at some point.  In such
    -    cases we warn, suggesting that the bytes have the UTF-8 flag disabled by
    -    means of utf8::downgrade, if they are indeed bytes.
    +    This leaves only the case of a string with the "UTF8" flag on, and
    +    codepoints above 127 but below 255.  The "UTF8" flag is a sign that they
    +    were _likely_ touched by character data at some point.  In such cases we
    +    warn, suggesting that the bytes have the "UTF8" flag disabled by means
    +    of utf8::downgrade, if they are indeed bytes.
     
     diff --git a/lib/RT/Util.pm b/lib/RT/Util.pm
     --- a/lib/RT/Util.pm
 8:  0aea559 !  8:  17702cd Verify that MIME::Entity bodies are bytes, and remove _utf8_off call
    @@ -6,7 +6,7 @@
         body is indeed bytes, and not characters.
         
         We also remove the _utf8_off call -- because, contrary to what the
    -    comment implies, the presence or absence of the "UTF-8" flag does _not_
    +    comment implies, the presence or absence of the "UTF8" flag does _not_
         determine if a string is "encoded as octets and not as characters"; it
         merely states that the string is capable of holding codepoints > 255.
         If it happens to not contain any, the _utf8_off does nothing.  If it
    @@ -18,7 +18,7 @@
         fixed by a simple _utf8_off, but instead must be fixed by ensuring that
         the body always contains bytes, not wide characters -- as it now does,
         thanks to the prior commits.  The call to RT::Util::assert_bytes serves
    -    as an additional safeguard against backsliding o nthat assumption.
    +    as an additional safeguard against backsliding on that assumption.
     
     diff --git a/lib/RT/I18N.pm b/lib/RT/I18N.pm
     --- a/lib/RT/I18N.pm
 9:  f1660db =  9:  ba11085 Verify that MIME::Entity headers are bytes, and remove _utf8_off call
10:  41f6ff8 = 10:  1d18663 Standardize on the stricter Encode::encode("UTF-8", ...) everywhere
11:  0b4f458 = 11:  ed0458d Remove "use utf8" from RT::I18N::fr, making NBSP explicit
12:  62668f9 = 12:  7548587 Remove remaining cases of "use utf8"
13:  fe89415 = 13:  fb58e26 Dashboard: decode bytes in query parameters into characters
14:  39c008c = 14:  b2db8fc Tests: WWW::Mechanize correctly returns characters now
15:  52e4290 ! 15:  2be0797 _utf8_on in EncodeToMIME is needless and incorrect; remove it
    @@ -4,12 +4,12 @@
         
         66930fd8 switched from an explicit _utf8_off to an explicit _utf8_on, in
         an attempt to switch from splitting on bytes to splitting on characters.
    -    However, the "UTF-8" flag does not magically determine if a string is
    +    However, the "UTF8" flag does not magically determine if a string is
         bytes or characters.  Instead, only consistency in calling convention
         can do so.  All callsites of RT::Interface::Email::EncodeToMIME and
         RT::Action::SendEmail::MIMEEncodeString now pass character strings; all
         that _utf8_on can do is incorrectly "decode" those strings as UTF-8 if
    -    they happen to not have the "UTF-8" flag set.
    +    they happen to not have the "UTF8" flag set.
     
     diff --git a/lib/RT/Interface/Email.pm b/lib/RT/Interface/Email.pm
     --- a/lib/RT/Interface/Email.pm
16:  c73b596 = 16:  f67c72a Move comment from PreprocessTimeUpdates to DecodeArgs, where it belongs
17:  9bba281 = 17:  3ac9388 Always decode data in %ARGS as UTF-8 in DecodeArgs
18:  5c8dcd5 = 18:  9cc181b Add RT::Util::assert_bytes checks to _EncodeLOB and _DecodeLOB
19:  e6c9339 = 19:  b1af637 Update POD and comments to be clearer about characters vs bytes
20:  82fa2b3 = 20:  701c7dd Remove an unreachable line
21:  44cd960 = 21:  4d70cfb TSV need not explicitly encode as UTF-8; all output is UTF-8 encoded
22:  a4c0582 = 22:  b2e341b Move "use Encode" calls to one central location
23:  40b9dc2 = 23:  d91b416 Consistent character/byte hygene allows RT to run with DBD::Pg 3.3.0
24:  ea0eeed = 24:  89a8568 Note that HTTP output still incorrectly relies on is_utf8
25:  c014818 = 25:  bc8e5e9 Comment the logic for database decode_utf8/is_utf8 checking
26:  8eb5159 = 26:  0a5fd0a Encode characters on their way out of tests
27:  948620a = 27:  af9fe7c Stop hiding "Wide character in..." warnings