[rt-users] REG: HTML mails

Eric Goodman ericg at cats.ucsc.edu
Fri May 4 18:35:28 EDT 2001


>Hello,
>
>I've got rt 1.0.7 running, and have also implemented 'stripmime', as a lot
>of mails had HTML content. This works fine if the content is HTML and
>text, but in case a pure HTML mail turns up, the mail has to be read
>separately, and moreover, when we reply to such a mail, the requestor
>doesn't get to see any of his original content.
>
>Is it possible to use an HTML to text converter, like the one referred
>to below:
>
>         http://userpage.fu-berlin.de/~mbayer/tools/html2text.html
>
>Please advise.

Yes, it is possible. I've done this at my site, but my code is still 
so ugly that I didn't want to share it yet.

Each "part" of a MIME message has a name (like "message", "message, 
part 1"), a type ("text", "application"), and a subtype ("plain", 
"html", etc.)

Stripmime works by using MIME::Parser to break the incoming email 
into its component parts, identifying any parts that aren't plain 
text, and making them links. A plaintext message body (named 
"message") comes with a type/subtype of "text/plain". A mixed 
message comes with two parts to the body, "message, part 1" and 
"message, part 2", of type/subtype "text/plain" and "text/html" 
respectively. Stripmime handles both cases well.
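
To make the part naming concrete, here is a minimal sketch of walking a 
parsed message the way the script's dump_entity does. It is not part of 
stripmime: list_parts is a hypothetical helper, the sample message is 
made up, and output_to_core(1) is used only so the example writes no 
files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Parser;

# Hypothetical helper: return one "name: type/subtype" string per leaf part,
# naming parts the same way the script below does.
sub list_parts {
    my ($entity, $name) = @_;
    $name = "message" unless defined $name;

    my @parts = $entity->parts;
    if (@parts) {                          # multipart: recurse into each part
        my @out;
        for my $i (0 .. $#parts) {
            push @out, list_parts($parts[$i], "$name, part " . ($i + 1));
        }
        return @out;
    }
    return ("$name: " . $entity->head->mime_type);   # single part
}

my $parser = MIME::Parser->new;
$parser->output_to_core(1);                # keep bodies in memory (assumption)

# Made-up sample: a mixed plain text + HTML message.
my $raw = join "\n",
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="XX"',
    '',
    '--XX',
    'Content-Type: text/plain',
    '',
    'Hello',
    '--XX',
    'Content-Type: text/html',
    '',
    '<p>Hello</p>',
    '--XX--',
    '';

my $entity = $parser->parse_data($raw);
print "$_\n" for list_parts($entity);
```

For this sample the two leaf parts should come out named "message, part 1" 
and "message, part 2".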

The case you describe is HTML only. For this I think you typically 
see a message body with name "message, part 1" (though I would expect 
you might see just "message") and type/subtype "text/html".
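
The three cases can be summarized as a small decision function. This is 
only a sketch of the logic, not code from the script: classify_part and 
its return values ('inline', 'convert', 'link') are hypothetical names, 
and it assumes plain text arrives as text/plain, the standard plain-text 
MIME type.

```perl
use strict;
use warnings;

# Hypothetical helper: decide how one leaf part should be handled.
#   'inline'  - print the text straight into the ticket
#   'convert' - HTML body; convert to text (and still link the original)
#   'link'    - anything else; only output a link to the saved file
sub classify_part {
    my ($name, $mime_type) = @_;
    my ($type, $subtype) = split m{/}, lc $mime_type;
    $subtype = '' unless defined $subtype;

    return 'inline'  if $type =~ /^(?:text|message)$/ && $subtype ne 'html';
    return 'convert' if $subtype eq 'html'
                     && ($name eq 'message' || $name eq 'message, part 1');
    return 'link';
}

print classify_part('message',         'text/plain'), "\n";   # inline
print classify_part('message, part 1', 'text/html'),  "\n";   # convert
print classify_part('message, part 2', 'text/html'),  "\n";   # link
```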

All I did was add a check for this third case, and if found run the 
HTML through HTML::FormatText (a module that can convert HTML to 
plain text). I made a couple of other modifications to the script 
(that I haven't really reviewed), hence my hesitation to send this to 
the list. I tried to note my mods with "EJG" comments. I expect some 
are missing.

However, in case it is of use, the modified version of the script is 
included below. Note that HTML::FormatText relies on 
HTML::TreeBuilder, and it was a fairly long process to locate and 
install all of the various Perl modules on which those two depend in 
turn.
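
Taken on its own, the conversion step looks roughly like the sketch 
below. html_to_text is a hypothetical wrapper and the sample HTML is 
made up; the margins are the ones used in the script. It assumes 
HTML::TreeBuilder and HTML::FormatText are installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical wrapper: turn an HTML string into wrapped plain text.
sub html_to_text {
    my ($html) = @_;

    # Build a parse tree from the HTML string.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Same margins as the modified stripmime script.
    my $formatter = HTML::FormatText->new(leftmargin => 4, rightmargin => 60);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree explicitly
    return $text;
}

print html_to_text('<html><body><p>Hello, <b>world</b>.</p></body></html>');
```

The script below does the same thing, except that it parses the saved 
file with parse_file rather than a string.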

Hope this helps!

--- Eric


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

#!/usr/bin/perl
use MIME::Parser;
use HTML::FormatText;
use HTML::TreeBuilder;
$now = time();
$basepath = "http://YOUR_SITE/stripmime/$now-$$";
$basefilepath = "/YOUR_HTML_PATH/stripmime/$now-$$";
$outputprog = "/RT_PATH/bin/rt-mailgate @ARGV";

sub dump_entity {
     my ($entity, $checksentry, $name) = @_;
     defined($name) or $name = "message";
     my $IO;

     # EJG: Head appears to be the deliver info.
     # Output the head, if it's the root level head
     # Otherwise, it's just some crappy mime header
     if ($name eq "message") {
        print OUT  $entity->head->original_text."\n";
     }

     # Output the body:
     my @parts = $entity->parts;

     if (@parts) {                     # multipart...
         my $i;
         foreach $i (0 .. $#parts) {       # dump each part...
             dump_entity($parts[$i], 0, ("$name, part ".(1+$i)));
         }
     }
     else {                            # single part...

         # Get MIME type, and display accordingly...
         my ($type, $subtype) = split('/', $entity->head->mime_type);
         my $body = $entity->bodyhandle;

         # If it's text, display it, perhaps
         my $path = $body->path;
         my ($filename) = ($path =~ /\/([^\/]+)$/);

         if ($type =~ /^(text|message)$/ && $subtype ne "html") {
            print OUT "\n>>> Text component $filename:\n" if ($filename !~ /msgauto/);
            if ($IO = $body->open("r")) {
               print OUT $_ while (defined($_ = $IO->getline));
               $IO->close;
               push (@deletetemp, "$basefilepath/$filename");
               $keepgoing = 0;   # was the bareword "false", which is a true value in Perl
            }
         }
         else {
            # EJG: Added case for Apple headers
            if ( ($type eq "application") && ($subtype eq "applefile") ) {
               print OUT "\n>>> $type/$subtype component, $name:\n";
               print OUT "Not relevant, deleted\n";
               push (@deletetemp, "$basefilepath/$filename");
            }
            else {
               # EJG: Added 3rd condition (to avoid ".html.html" files)
               if ($subtype eq "html" && $filename =~ /msgauto/ && $filename !~ /\.html$/ ) {
                  $newfilename = "$filename.html";
                  $renametemp{"$basefilepath/$filename"} = "$basefilepath/$newfilename";
                  $filename = $newfilename;
               }
               # EJG: If the message or the first part of the message is HTML,
               # EJG:    invoke HTML::FormatText to convert it to text.
               if ($subtype eq "html" && ($name eq "message" || $name eq "message, part 1" ) ){
                  my $htmltree = new HTML::TreeBuilder;
                  my $htmlformat = new HTML::FormatText( leftmargin=>4, rightmargin=>60 );
                  # Parse the file where it is now; any ".html" rename happens later, in main
                  $htmltree->parse_file( $path );
                  if ($name eq "message") {
                     print OUT "$sentrystr"."\n";
                  }
                  print OUT "Incoming HTML message detected -- converted to text only.\n";
                  print OUT "\n\n==========================================\n";
                  print OUT $htmlformat->format( $htmltree );
                  print OUT "\n\n==========================================\n";
                  print OUT "Original HTML version available at URL below.\n";
               }
               print OUT "\n>>> $type/$subtype component, $name:\n";
               print OUT "<A HREF=\"$basepath/$filename\">\n";
               print OUT "$basepath/$filename\n";
               print OUT "<\/A>\n";
            }
         }
     }
     1;
}

#------------------------------
#
# main
#
sub main {

     # Create a new MIME parser:
     my $parser = new MIME::Parser;

     # Set the output directory:
     (-d "$basefilepath") or mkdir "$basefilepath",0755 or die "mkdir: $!";
     (-w "$basefilepath") or die "can't write to directory";
     $parser->output_dir($basefilepath);
     open (OUT, "|$outputprog") or die "can't run $outputprog: $!";
    
     $parser->output_prefix("msgauto");


     # Read the MIME message:
     $entity = $parser->read(\*STDIN) or die "couldn't parse MIME stream";

     # Dump it out:
     dump_entity($entity, 1);
     close(OUT);


     # Delete unneeded temporary files
     foreach (@deletetemp) {
        unlink ($_);
     }

     # Rename our temporary files that were renamed (html, etc.)
     foreach (keys %renametemp) {
        rename ($_, $renametemp{$_});
     }

     # Delete our directory, or at least try -- won't delete if it's not empty
     rmdir($basefilepath);
}

&main();

exit(0);




