[Rt-commit] rt branch, 4.2/external-html-encoding, created. rt-4.2.9-136-g5c85b33

Alex Vandiver alexmv at bestpractical.com
Wed Feb 4 16:01:13 EST 2015

The branch, 4.2/external-html-encoding has been created
        at  5c85b33da06f3a9543d113462cb15f4bbb21bf30 (commit)

- Log -----------------------------------------------------------------
commit 5c85b33da06f3a9543d113462cb15f4bbb21bf30
Author: Alex Vandiver <alexmv at bestpractical.com>
Date:   Wed Feb 4 15:10:25 2015 -0500

    Be explicit about encodings to and from the external formatters

    HTML::FormatExternal supports input_charset and output_charset
    parameters controlling how the input is encoded, and the encoding
    that the formatter should attempt to output; otherwise, the formatter
    attempts to determine based on the <meta> tag.  Many of the snippets
    passed to ConvertHTMLToText do not have meta tags -- and, more
    importantly, in all cases they have already been transcoded to UTF-8,
    making the meta tag possibly incorrect.  Fortunately, the input_charset
    argument is documented to override the charset found in the <meta> tag,
    if any.
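
The character/byte boundary described above can be sketched roughly as follows. This is a minimal illustration, not RT's code: fake_format and html_to_text are hypothetical names, and fake_format is a stand-in for an external formatter that honours only the charsets it is told about and ignores any <meta> tag.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an external formatter: bytes in, bytes out,
# trusting only the charsets passed as arguments (never the <meta> tag).
sub fake_format {
    my ($bytes, %opt) = @_;
    my $chars = Encode::decode( $opt{input_charset}, $bytes );
    $chars =~ s/<[^>]*>//g;    # crude tag stripping, for illustration only
    return Encode::encode( $opt{output_charset}, $chars );
}

# Characters in, characters out; UTF-8 bytes only at the formatter boundary.
sub html_to_text {
    my ($html) = @_;
    my $bytes = fake_format(
        Encode::encode( "UTF-8", $html ),
        input_charset  => "UTF-8",
        output_charset => "UTF-8",
    );
    return Encode::decode( "UTF-8", $bytes );
}

# The <meta> tag is stale: the content has already been transcoded.
my $text = html_to_text(qq{<meta charset="iso-8859-1"><p>H\x{e5}vard</p>});
print length($text), "\n";    # length in characters, not bytes
```

Because both charsets are stated explicitly, the stale <meta> tag has no influence on how the bytes are interpreted in this sketch.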

    Not all formatters have the same support, of course.  The formatters RT
    allows, and their encoding support, are (in preference order):
     * w3m supports arbitrary input and output character sets.
     * elinks supports arbitrary input and output character sets.
     * html2text supports only latin-1 input, and has no control over output
       character set.
     * links supports arbitrary input and output character sets.
     * lynx supports specifying a default input character set, but this does
       not override a <meta> tag.  It supports arbitrary output character
       sets.

    Installs with only html2text or lynx are thus likely to suffer from
    encoding problems; move them to be lowest on the priority list.  They
    are still listed because mis-encoded mail is still superior to the
    possibly-empty mail that the "core" formatter would send.
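
As a rough sketch of how such a preference order might be applied: pick_formatter and its $available argument are hypothetical, and how programs are actually located on the system is out of scope here.

```perl
use strict;
use warnings;

# Preference order after this change; "core" needs no external program.
my @order = ("w3m", "elinks", "links", "html2text", "lynx", "core");

# Hypothetical helper: $available is a hash ref whose keys are the
# external programs found on this system.
sub pick_formatter {
    my ($available) = @_;
    for my $prog (@order) {
        return $prog if $prog eq "core" or $available->{$prog};
    }
    return "core";    # not reached while "core" is in @order
}

# With only the two weaker formatters installed, one of them is still
# chosen ahead of "core".
my $choice = pick_formatter({ html2text => 1, lynx => 1 });
print "$choice\n";
```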

    Tests need adjustment because EmailOutputEncoding is a suggestion -- if
    the content contains characters which the given character set cannot
    express, the character set is left as the internal default, UTF-8.  In
    the case of w3m, an <hr /> tag is rendered as a series of code point
    0x2501 ("BOX DRAWINGS HEAVY HORIZONTAL"), which is not in ISO-8859-1; as
    such, the text/plain part is sent in UTF-8, not ISO-8859-1.
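
The fallback described above can be illustrated with core Encode alone. This is a sketch under stated assumptions, not RT's implementation: pick_charset is a hypothetical helper that keeps the suggested charset only when every character can be represented in it.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return $preferred if it can express all of $text,
# otherwise fall back to UTF-8.
sub pick_charset {
    my ($text, $preferred) = @_;
    my $copy = $text;    # encode() with a CHECK value may modify its input
    my $ok = eval { Encode::encode( $preferred, $copy, Encode::FB_CROAK ); 1 };
    return $ok ? $preferred : "UTF-8";
}

my $name = "H\x{e5}vard";      # U+00E5 exists in ISO-8859-1
my $rule = "\x{2501}" x 10;    # BOX DRAWINGS HEAVY HORIZONTAL does not

print pick_charset( $name,         "ISO-8859-1" ), "\n";
print pick_charset( "$name\n$rule", "ISO-8859-1" ), "\n";
```

The first call keeps ISO-8859-1; the second falls back to UTF-8 because U+2501 has no ISO-8859-1 mapping.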

diff --git a/lib/RT/Interface/Email.pm b/lib/RT/Interface/Email.pm
index 5c7c71e..18fd1d3 100644
--- a/lib/RT/Interface/Email.pm
+++ b/lib/RT/Interface/Email.pm
@@ -1789,9 +1789,9 @@ sub _RecordSendEmailFailure {
 =head2 ConvertHTMLToText HTML
-Takes HTML and converts it to plain text.  Appropriate for generating a
-plain text part from an HTML part of an email.  Returns undef if
-conversion fails.
+Takes HTML characters and converts it to plain text characters.
+Appropriate for generating a plain text part from an HTML part of an
+email.  Returns undef if conversion fails.
@@ -1809,7 +1809,7 @@ sub _HTMLFormatter {
     if ($wanted) {
         @order = ($wanted, "core");
     } else {
-        @order = ("w3m", "elinks", "html2text", "links", "lynx", "core");
+        @order = ("w3m", "elinks", "links", "html2text", "lynx", "core");
     # Always fall back to core, even if it is not listed
     for my $prog (@order) {
@@ -1854,15 +1854,19 @@ sub _HTMLFormatter {
             RT->Logger->info("Using $prog for HTML -> text conversion");
             $formatter = sub {
                 my $html = shift;
-                RT::Util::safe_run_child {
+                my $text = RT::Util::safe_run_child {
                     local $ENV{PATH} = $path || $ENV{PATH}
                         || '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin';
                     local $ENV{HOME} = $RT::VarPath;
-                        Encode::encode( "UTF-8", $html),
+                        Encode::encode( "UTF-8", $html ),
+                        input_charset => "UTF-8",
+                        output_charset => "UTF-8",
                         leftmargin => 0, rightmargin => 78
+                $text = Encode::decode( "UTF-8", $text );
+                return $text;
         RT->Config->Set( HTMLFormatter => $prog );
diff --git a/t/mail/sendmail.t b/t/mail/sendmail.t
index 76975a1..4ef3206 100644
--- a/t/mail/sendmail.t
+++ b/t/mail/sendmail.t
@@ -49,14 +49,28 @@ for my $encoding ('ISO-8859-1', 'UTF-8') {
     is(@mail, 1);
     like( $mail[0]->head->get('Content-Type'), qr/multipart\/alternative/,
           "Its content type is multipart/alternative" );
-    like( $mail[0]->parts(0)->head->get('Content-Type'), qr/text\/plain.+?$encoding/,
-          "First part's content type is text/plain $encoding" );
+    # The text/html part is guaranteed to not have had non-latin-1
+    # characters introduced by the HTML-to-text conversion, so it is
+    # guaranteed to be able to be represented in latin-1
     like( $mail[0]->parts(1)->head->get('Content-Type'), qr/text\/html.+?$encoding/,
           "Second part's content type is text/html $encoding" );
-    my $message_as_string = $mail[0]->parts(0)->bodyhandle->as_string();
+    my $message_as_string = $mail[0]->parts(1)->bodyhandle->as_string();
     $message_as_string = Encode::decode($encoding, $message_as_string);
     like( $message_as_string , qr/H\x{e5}vard/,
           "The message's content contains havard's name in $encoding");
+    # The text/plain part may have utf-8 characters in it.  Accept either encoding.
+    like( $mail[0]->parts(0)->head->get('Content-Type'), qr/text\/plain.+?(ISO-8859-1|UTF-8)/i,
+          "First part's content type is text/plain (ISO-8859-1 or UTF-8)" );
+    # Make sure it checks out in whatever encoding it ended up in
+    $mail[0]->parts(0)->head->get('Content-Type') =~ /text\/plain.+?(ISO-8859-1|UTF-8)/i;
+    my $found = $1 || $encoding;
+    $message_as_string = $mail[0]->parts(0)->bodyhandle->as_string();
+    $message_as_string = Encode::decode($found, $message_as_string);
+    like( $message_as_string , qr/H\x{e5}vard/,
+          "The message's content contains havard's name in $encoding");

