[Rt-commit] rt branch, 4.0/utf8-reckoning, repushed
Alex Vandiver
alexmv at bestpractical.com
Wed Sep 3 13:49:19 EDT 2014
The branch 4.0/utf8-reckoning was deleted and repushed:
was 9eb1178b7cd31072fbaac0288944e040192c8d69
now e96002ae4dadf2125e3ead0e1940cb5df9f4c78a
1: f2324d1 = 1: e93c82e Re-indent _EncodeLOB and _DecodeLOB
2: 1827f6c = 2: 802dc8d Respect the database Content-Type header in decoding textual parts
3: 8dbaf2c = 3: 968c25c Stop needlessly frobbing utf8 internals
4: 820c7d9 = 4: 04b5caf Decoding content, and returning characters, is incorrect
5: 38920a2 ! 5: 35eb3bb Stop assuming the data in the database is utf8
@@ -2,7 +2,7 @@
Stop assuming the data in the database is utf8
- As noted in 1827f6c, not all content we currently call "textual" was
+ As noted in 802dc8d, not all content we currently call "textual" was
always treated as such. When re-encoding, do not assume that the
encoding in the database is UTF-8 -- rather, read the Content-Type
header, and examine the charset stated there. Convert from that to the
6: f2b0db6 = 6: 1945c00 Modernize and condense t/mail/sendmail.t
7: 1acfacb = 7: 8bc6d50 Always log bytes, not characters
8: e9e7e96 = 8: 5e4c0f1 The alluded-to deficiency is not a concern in perl ≥ 5.8.3
9: 0c028fc ! 9: f497a11 Ensure all MIME::Entity bodies are UTF-8 encoded bytes
@@ -7,7 +7,7 @@
and noting their character set.
In the case of Approvals/index.html, there was no need for an explicit
- MIME::Entity object; ->Correspond creates on as needed from a "Content"
+ MIME::Entity object; ->Correspond creates one as needed from a "Content"
argument.
diff --git a/lib/RT/Action/CreateTickets.pm b/lib/RT/Action/CreateTickets.pm
@@ -216,7 +216,7 @@
+ Type => 'text/plain',
Charset => 'UTF-8',
- Data => $args{'Content'} || "",
-+ Data => Encode::encode( "UTf-8", $args{'Content'} || ""),
++ Data => Encode::encode( "UTF-8", $args{'Content'} || ""),
);
my ( $Transaction, $Object, $Description ) = $self->Create(
10: 3543a44 ! 10: 3ccd8b0 Ensure all MIME::Entity headers are UTF-8 encoded bytes
@@ -12,7 +12,7 @@
While the majority of these headers will never have wide characters in
them, always decoding and encoding ensures the proper disipline to
- guarantee that strings with the "UTF-8" flag do not get placed in a
+ guarantee that strings with the "UTF8" flag do not get placed in a
header, which can cause double-encoding.
diff --git a/lib/RT/Action/SendEmail.pm b/lib/RT/Action/SendEmail.pm
11: 206e688 = 11: c0f4e49 Make RT::Action::SendEmail->SetHeader take characters, not bytes
12: aa3cc45 ! 12: 2206fe5 Add a utility method to check that an input is bytes
@@ -2,20 +2,20 @@
Add a utility method to check that an input is bytes
- Note that it is impossible to verify that an input characters; here, we
- can only validate if it _could_ be bytes.
+ Note that it is impossible to verify that an input is characters; here,
+ we can only validate if it _could_ be bytes.
- First, any string with the "UTF-8" flag off cannot contain codepoints
- above 255, and as such is safe. Additionally, if the "UTF-8" flag is
- on, having no codepoints above 127 means the bytes are unambigious.
- Having codepoints above 255 is guaranteedly a sign that the input is not
- a byte string.
+ First, any string with the "UTF8" flag off cannot contain codepoints
+ above 255, and as such is safe. Additionally, if the "UTF8" flag is on,
+ having no codepoints above 127 means the bytes are unambigious. Having
+ codepoints above 255 is guaranteedly a sign that the input is not a byte
+ string.
- This leaves only the case of a string with the "UTF-8" flag on, and
- codepoints above 127 but below 255. The "UTF-8" flag is a sign that
- they were _likely_ touched by character data at some point. In such
- cases we warn, suggesting that the bytes have the UTF-8 flag disabled by
- means of utf8::downgrade, if they are indeed bytes.
+ This leaves only the case of a string with the "UTF8" flag on, and
+ codepoints above 127 but below 255. The "UTF8" flag is a sign that they
+ were _likely_ touched by character data at some point. In such cases we
+ warn, suggesting that the bytes have the "UTF8" flag disabled by means
+ of utf8::downgrade, if they are indeed bytes.
diff --git a/lib/RT/Util.pm b/lib/RT/Util.pm
--- a/lib/RT/Util.pm
13: 19321eb ! 13: 8e62357 Verify that MIME::Entity bodies are bytes, and remove _utf8_off call
@@ -6,7 +6,7 @@
body is indeed bytes, and not characters.
We also remove the _utf8_off call -- because, contrary to what the
- comment implies, the presence or absence of the "UTF-8" flag does _not_
+ comment implies, the presence or absence of the "UTF8" flag does _not_
determine if a string is "encoded as octets and not as characters"; it
merely states that the string is capable of holding codepoints > 255.
If it happens to not contain any, the _utf8_off does nothing. If it
@@ -18,7 +18,7 @@
fixed by a simple _utf8_off, but instead must be fixed by ensuring that
the body always contains bytes, not wide characters -- as it now does,
thanks to the prior commits. The call to RT::Util::assert_bytes serves
- as an additional safeguard against backsliding o nthat assumption.
+ as an additional safeguard against backsliding on that assumption.
diff --git a/lib/RT/I18N.pm b/lib/RT/I18N.pm
--- a/lib/RT/I18N.pm
14: 5a0cfda = 14: 8140533 Verify that MIME::Entity headers are bytes, and remove _utf8_off call
15: b865183 ! 15: 2d65e31 Standardize on the stricter Encode::encode("UTF-8", ...) everywhere
@@ -17,6 +17,19 @@
dealing with encodings, it should ensure that it does not produce byte
sequences that are invalid according to official Unicode standards.
+diff --git a/lib/RT/Action/SendEmail.pm b/lib/RT/Action/SendEmail.pm
+--- a/lib/RT/Action/SendEmail.pm
++++ b/lib/RT/Action/SendEmail.pm
+@@
+ $self->SetHeader(
+ Subject =>
+ RT::Interface::Email::AddSubjectTag(
+- Encode::decode_utf8( $head->get('Subject') ),
++ Encode::decode( "UTF-8", $head->get('Subject') ),
+ $self->TicketObj,
+ ),
+ );
+
diff --git a/lib/RT/Dashboard/Mailer.pm b/lib/RT/Dashboard/Mailer.pm
--- a/lib/RT/Dashboard/Mailer.pm
+++ b/lib/RT/Dashboard/Mailer.pm
@@ -216,7 +229,7 @@
+ ( Cc => Encode::encode( "UTF-8", $args{'Cc'} ) ) : ()),
Type => 'text/plain',
Charset => 'UTF-8',
- Data => Encode::encode( "UTf-8", $args{'Content'} || ""),
+ Data => Encode::encode( "UTF-8", $args{'Content'} || ""),
diff --git a/lib/RT/Tickets.pm b/lib/RT/Tickets.pm
--- a/lib/RT/Tickets.pm
16: df88c57 = 16: ecf4e7c Remove "use utf8" from RT::I18N::fr, making NBSP explicit
17: a6e3fb5 = 17: b3c6ae6 Remove remaining cases of "use utf8"
18: abe35cd = 18: 9fc8d08 Dashboard: decode bytes in query parameters into characters
19: 774a740 = 19: 53dbebc Tests: WWW::Mechanize correctly returns characters now
20: 69dae45 ! 20: 4522c09 _utf8_on in EncodeToMIME is needless and incorrect; remove it
@@ -4,12 +4,12 @@
66930fd8 switched from an explicit _utf8_off to an explicit _utf8_on, in
an attempt to switch from splitting on bytes to splitting on characters.
- However, the "UTF-8" flag does not magically determine if a string is
+ However, the "UTF8" flag does not magically determine if a string is
bytes or characters. Instead, only consistency in calling convention
can do so. All callsites of RT::Interface::Email::EncodeToMIME and
RT::Action::SendEmail::MIMEEncodeString now pass character strings; all
that _utf8_on can do is incorrectly "decode" those strings as UTF-8 if
- they happen to not have the "UTF-8" flag set.
+ they happen to not have the "UTF8" flag set.
diff --git a/lib/RT/Interface/Email.pm b/lib/RT/Interface/Email.pm
--- a/lib/RT/Interface/Email.pm
21: df961df = 21: ed465cf Move comment from PreprocessTimeUpdates to DecodeArgs, where it belongs
22: ed57bcd = 22: 2eca779 Always decode data in %ARGS as UTF-8 in DecodeArgs
23: aec38ea = 23: 0afbeca Add RT::Util::assert_bytes checks to _EncodeLOB and _DecodeLOB
24: 44f43cf = 24: a382b50 Update POD and comments to be clearer about characters vs bytes
25: a502084 = 25: dbb9efe Remove an unreachable line
26: ecb655e = 26: 81132aa TSV need not explicitly encode as UTF-8; all output is UTF-8 encoded
27: 3dbae7a = 27: 44ec571 Move "use Encode" calls to one central location
28: b26af9b = 28: 779dc60 Consistent character/byte hygene allows RT to run with DBD::Pg 3.3.0
29: 83649c6 ! 29: c2f0fe0 Note that HTTP output still incorrectly relies on is_utf8
@@ -2,16 +2,16 @@
Note that HTTP output still incorrectly relies on is_utf8
- Currently, any string which has the "UTF-8" flag is encoded as UTF-8
+ Currently, any string which has the "UTF8" flag is encoded as UTF-8
before being sent to the browser. This requires that any output which
is binary, or has already been encoded to bytes, _not_ have the flag
accidentally set.
- It also requires that all output character strings have the "UTF-8" flag
+ It also requires that all output character strings have the "UTF8" flag
enabled; while necessary for codepoints > 255, it is not strictly
required for codepoints between 127 and 255. As RT now consistently
uses Encode::decode() to produce character strings, which sets the
- "UTF-8" flag even for characters in that range, this is likely safe.
+ "UTF8" flag even for characters in that range, this is likely safe.
The most correct fix would be to explicitly flag output that needs to be
encoded. However, doing so in a backwards compatible manner is
30: db93e66 = 30: 302989a Comment the logic for database decode_utf8/is_utf8 checking
31: 89d45e9 = 31: a07d811 Encode characters on their way out of tests
32: 9eb1178 = 32: e96002a Stop hiding "Wide character in..." warnings
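
As a rough illustration of the byte-checking rules described in the message for commit 12 (2206fe5) above, here is a minimal Perl sketch. The helper name classify_bytes and its three return values are hypothetical. This is not the code added to lib/RT/Util.pm; it only restates the cases from the commit message under those assumptions.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (NOT RT's actual RT::Util::assert_bytes).
# Classifies a string as "ok", "suspect", or "not-bytes", following
# the cases laid out in the commit message.
sub classify_bytes {
    my ($string) = @_;

    # "UTF8" flag off: no codepoint can exceed 255, so this could be bytes.
    return "ok" unless utf8::is_utf8($string);

    # Any codepoint above 255 cannot be a byte.
    return "not-bytes" if $string =~ /[^\x00-\xFF]/;

    # Flag on, but nothing above 127: the bytes are unambiguous.
    return "ok" unless $string =~ /[^\x00-\x7F]/;

    # Flag on, with codepoints in 128..255: likely touched by character
    # data at some point; a caller would warn here.
    return "suspect";
}

my $encoded = Encode::encode( "UTF-8", "caf\x{e9}" );   # encoded bytes, flag off
my $wide    = "\x{263A}";                                # codepoint above 255
my $ascii   = "abc";
utf8::upgrade($ascii);                                   # flag on, ASCII only
my $latin   = "caf\x{e9}";
utf8::upgrade($latin);                                   # flag on, contains 0xE9

print "encoded: ", classify_bytes($encoded), "\n";       # ok
print "wide:    ", classify_bytes($wide),    "\n";       # not-bytes
print "ascii:   ", classify_bytes($ascii),   "\n";       # ok
print "latin:   ", classify_bytes($latin),   "\n";       # suspect

# If the string really is bytes, utf8::downgrade clears the flag
# without changing the codepoints it holds.
utf8::downgrade($latin);
print "latin after downgrade: ", classify_bytes($latin), "\n";   # ok
```

The last case shows why the flag alone decides nothing: the same codepoints are classified differently before and after utf8::downgrade, although the string compares equal both times.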