[Rt-commit] rt branch, 4.2/external-html-encoding, created. rt-4.2.9-136-g5c85b33
Alex Vandiver
alexmv at bestpractical.com
Wed Feb 4 16:01:13 EST 2015
The branch, 4.2/external-html-encoding has been created
at 5c85b33da06f3a9543d113462cb15f4bbb21bf30 (commit)
- Log -----------------------------------------------------------------
commit 5c85b33da06f3a9543d113462cb15f4bbb21bf30
Author: Alex Vandiver <alexmv at bestpractical.com>
Date: Wed Feb 4 15:10:25 2015 -0500
Be explicit about encodings to and from the external formatters
HTML::FormatExternal supports input_charset and output_charset
parameters controlling how the input is encoded, and the encoding
that the formatter should attempt to output; otherwise, the formatter
attempts to determine the encoding from the <meta> tag. Many of the snippets
passed to ConvertHTMLToText do not have meta tags -- and, more
importantly, in all cases they have already been transcoded to UTF-8,
making the meta tag possibly incorrect. Fortunately, the input_charset
argument is documented to override the charset found in the <meta> tag,
if any.
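To illustrate why the input charset needs to be stated explicitly, here is a minimal sketch using only the core Encode module. It does not call HTML::FormatExternal; it only mimics what a formatter would see if it read UTF-8 bytes while assuming ISO-8859-1. The sample string is made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# Character string containing U+00E5 (LATIN SMALL LETTER A WITH RING ABOVE).
my $html = "<p>H\x{e5}vard</p>";

# The bytes as they would be handed to an external formatter.
my $bytes = Encode::encode( "UTF-8", $html );

# A reader that assumes ISO-8859-1 treats each UTF-8 byte as its own
# character, so the two-byte sequence for U+00E5 becomes two characters.
my $wrong = Encode::decode( "ISO-8859-1", $bytes );

# A reader that is told the input is UTF-8 recovers the original string.
my $right = Encode::decode( "UTF-8", $bytes );

printf "wrong: %d chars, right: %d chars\n", length($wrong), length($right);
# prints "wrong: 14 chars, right: 13 chars"
```

The extra character in the mis-decoded string is the kind of damage a stale or missing <meta> charset can cause once the content has already been transcoded to UTF-8.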
Not all formatters have the same support, of course. The formatters RT
allows, and their encoding support, are (in preference order):
* w3m supports arbitrary input and output character sets.
* elinks supports arbitrary input and output character sets.
* html2text supports only latin-1 input, and has no control over output
character set.
* links supports arbitrary input and output character sets.
* lynx supports specifying a default input character set, but this does
not override a <meta> tag. It supports arbitrary output character
sets.
Installs with only html2text or lynx are thus likely to suffer from
encoding problems; move them to be lowest on the priority list. They
are still listed because mis-encoded mail is still superior to the
possibly-empty mail that the "core" formatter would send.
Tests need adjustment because EmailOutputEncoding is a suggestion -- if
the content contains characters which the given character set cannot
express, the character set is left as the internal default, UTF-8. In
the case of w3m, an <hr /> tag is rendered as a series of code point
0x2501 ("BOX DRAWINGS HEAVY HORIZONTAL"), which is not in ISO-8859-1; as
such, the text/plain part is sent in UTF-8, not ISO-8859-1.
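The fallback described above can be sketched as follows. This is not RT's actual implementation; encode_with_fallback is a hypothetical helper written for this example, using only the core Encode module. It tries the preferred charset strictly and falls back to UTF-8 when a character cannot be represented.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of RT): returns the charset that was
# actually used, followed by the encoded bytes.
sub encode_with_fallback {
    my ( $preferred, $text ) = @_;

    # Work on a copy: encode() may modify its argument when CHECK is set.
    my $copy  = $text;
    my $bytes = eval { Encode::encode( $preferred, $copy, Encode::FB_CROAK ) };
    return ( $preferred, $bytes ) if defined $bytes;

    # The preferred charset cannot express the text; use UTF-8 instead.
    return ( "UTF-8", Encode::encode( "UTF-8", $text ) );
}

# Latin-1 text fits in ISO-8859-1.
my ($charset_plain) = encode_with_fallback( "ISO-8859-1", "H\x{e5}vard" );

# A rule drawn with U+2501, as w3m renders <hr />, does not.
my ($charset_rule) = encode_with_fallback( "ISO-8859-1", "\x{2501}" x 10 );

print "$charset_plain $charset_rule\n";    # prints "ISO-8859-1 UTF-8"
```

This is why the adjusted test accepts either ISO-8859-1 or UTF-8 for the text/plain part: the charset that ends up on the part depends on what the formatter emitted.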
diff --git a/lib/RT/Interface/Email.pm b/lib/RT/Interface/Email.pm
index 5c7c71e..18fd1d3 100644
--- a/lib/RT/Interface/Email.pm
+++ b/lib/RT/Interface/Email.pm
@@ -1789,9 +1789,9 @@ sub _RecordSendEmailFailure {
=head2 ConvertHTMLToText HTML
-Takes HTML and converts it to plain text. Appropriate for generating a
-plain text part from an HTML part of an email. Returns undef if
-conversion fails.
+Takes HTML characters and converts them to plain text characters.
+Appropriate for generating a plain text part from an HTML part of an
+email. Returns undef if conversion fails.
=cut
@@ -1809,7 +1809,7 @@ sub _HTMLFormatter {
if ($wanted) {
@order = ($wanted, "core");
} else {
- @order = ("w3m", "elinks", "html2text", "links", "lynx", "core");
+ @order = ("w3m", "elinks", "links", "html2text", "lynx", "core");
}
# Always fall back to core, even if it is not listed
for my $prog (@order) {
@@ -1854,15 +1854,19 @@ sub _HTMLFormatter {
RT->Logger->info("Using $prog for HTML -> text conversion");
$formatter = sub {
my $html = shift;
- RT::Util::safe_run_child {
+ my $text = RT::Util::safe_run_child {
local $ENV{PATH} = $path || $ENV{PATH}
|| '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin';
local $ENV{HOME} = $RT::VarPath;
$package->format_string(
- Encode::encode( "UTF-8", $html),
+ Encode::encode( "UTF-8", $html ),
+ input_charset => "UTF-8",
+ output_charset => "UTF-8",
leftmargin => 0, rightmargin => 78
);
};
+ $text = Encode::decode( "UTF-8", $text );
+ return $text;
};
}
RT->Config->Set( HTMLFormatter => $prog );
diff --git a/t/mail/sendmail.t b/t/mail/sendmail.t
index 76975a1..4ef3206 100644
--- a/t/mail/sendmail.t
+++ b/t/mail/sendmail.t
@@ -49,14 +49,28 @@ for my $encoding ('ISO-8859-1', 'UTF-8') {
is(@mail, 1);
like( $mail[0]->head->get('Content-Type'), qr/multipart\/alternative/,
"Its content type is multipart/alternative" );
- like( $mail[0]->parts(0)->head->get('Content-Type'), qr/text\/plain.+?$encoding/,
- "First part's content type is text/plain $encoding" );
+
+ # The text/html part is guaranteed to not have had non-latin-1
+ # characters introduced by the HTML-to-text conversion, so it is
+ # guaranteed to be able to be represented in latin-1
like( $mail[0]->parts(1)->head->get('Content-Type'), qr/text\/html.+?$encoding/,
"Second part's content type is text/html $encoding" );
- my $message_as_string = $mail[0]->parts(0)->bodyhandle->as_string();
+ my $message_as_string = $mail[0]->parts(1)->bodyhandle->as_string();
$message_as_string = Encode::decode($encoding, $message_as_string);
like( $message_as_string , qr/H\x{e5}vard/,
"The message's content contains havard's name in $encoding");
+
+ # The text/plain part may have utf-8 characters in it. Accept either encoding.
+ like( $mail[0]->parts(0)->head->get('Content-Type'), qr/text\/plain.+?(ISO-8859-1|UTF-8)/i,
+ "First part's content type is text/plain (ISO-8859-1 or UTF-8)" );
+
+ # Make sure it checks out in whatever encoding it ended up in
+ $mail[0]->parts(0)->head->get('Content-Type') =~ /text\/plain.+?(ISO-8859-1|UTF-8)/i;
+ my $found = $1 || $encoding;
+ $message_as_string = $mail[0]->parts(0)->bodyhandle->as_string();
+ $message_as_string = Encode::decode($found, $message_as_string);
+ like( $message_as_string , qr/H\x{e5}vard/,
+ "The message's content contains havard's name in $found");
}
{
-----------------------------------------------------------------------