[rt-users] REG: HTML mails
Eric Goodman
ericg at cats.ucsc.edu
Fri May 4 18:35:28 EDT 2001
>Hello,
>
>I've got rt 1.0.7 running, and have also implemented 'stripmime', as a lot
>of mails had HTML content. This works fine if the content is HTML and
>text, but in case a pure HTML mail turns up, the mail has to be read
>separately, and moreover, when we reply to such a mail, the requestor
>doesn't get to see any of his original content.
>
>Is it possible to use an HTML to Text converter, like the one referred
>below:
>
> http://userpage.fu-berlin.de/~mbayer/tools/html2text.html
>
>Please advise.
Yes, it is possible. I've done this at my site, but my code is still
so ugly that I didn't want to share it yet.
Each "part" of a MIME message has a name (stripmime labels them
"message", "message, part 1", and so on), a type ("text",
"application"), and a subtype ("plain", "html", etc.).
Stripmime works by using MIME::Parser to break the incoming email
into its component parts, identifying any parts that aren't plain
text, saving them to files, and replacing them with links. A
plaintext message body (named "message") comes with a type/subtype
of "text/plain". A mixed message comes with two parts to the body,
"message, part 1" and "message, part 2", of type/subtype "text/plain"
and "text/html" respectively. Stripmime handles both cases well.
The case you describe is HTML only. For this I think you typically
see a message body with name "message, part 1" (though I would expect
you might see just "message") and type/subtype "text/html".
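
To make the three cases concrete, here is a rough sketch of the decision as a standalone function. The function name and the action labels ('inline', 'discard', and so on) are made up for illustration and do not appear in stripmime; the applefile rule mirrors one of my additions in the script further down.

```perl
use strict;
use warnings;

# Decide what to do with one leaf part, given the name assigned while
# walking the message and the part's MIME type and subtype.
# The returned labels are hypothetical, not something stripmime uses.
sub classify_part {
    my ($name, $type, $subtype) = @_;

    # Plain text (or an embedded message): print it inline.
    return 'inline'
        if $type =~ /^(?:text|message)$/ && $subtype ne 'html';

    # Apple resource-fork headers: not useful in a ticket.
    return 'discard'
        if $type eq 'application' && $subtype eq 'applefile';

    # HTML-only body: convert to text, and still link the original.
    return 'convert_and_link'
        if $subtype eq 'html'
        && ($name eq 'message' || $name eq 'message, part 1');

    # Everything else (including the HTML half of a mixed message).
    return 'link';
}

my @examples = (
    ['message',         'text',  'plain'],
    ['message, part 2', 'text',  'html'],
    ['message',         'text',  'html'],
    ['message, part 1', 'text',  'html'],
    ['message, part 3', 'image', 'gif'],
);
for my $e (@examples) {
    printf "%-16s %s/%s => %s\n",
        $e->[0], $e->[1], $e->[2], classify_part(@$e);
}
```

Checking the part name is what keeps the HTML alternative of a mixed message ("message, part 2") from being converted a second time.
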
All I did was add a check for this third case, and if found run the
HTML through HTML::FormatText (a module that can convert html to
plain text). I made a couple of other modifications to the script
(that I haven't really reviewed), hence my hesitation to send this to
the list. I tried to note my mods with "EJG" comments. I expect some
are missing.
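
The conversion step on its own looks roughly like this. It is a minimal sketch that assumes HTML::TreeBuilder and HTML::FormatText are installed; the helper name html_to_text is made up for the example, and it parses a string where the script parses a file on disk. The margins are the ones I use in the script.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Convert an HTML string to wrapped plain text.
# html_to_text is a hypothetical helper, not part of stripmime.
sub html_to_text {
    my ($html) = @_;

    # Build a parse tree from the HTML source.
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    # Render the tree as plain text with the given margins.
    my $formatter = HTML::FormatText->new(
        leftmargin => 4, rightmargin => 60 );
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree
    return $text;
}

print html_to_text(
    '<html><body><p>Hello, <b>world</b>.</p></body></html>');
```
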
However, in case it is of use, the modified version of the script is
included below. Note that HTML::FormatText relies on
HTML::TreeBuilder, and it was a fairly long process to locate and
install all of the various Perl modules on which those two depend in
turn.
Hope this helps!
--- Eric
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
#!/usr/bin/perl
use MIME::Parser;
use HTML::FormatText;
use HTML::TreeBuilder;
$now = time();
$basepath = "http://YOUR_SITE/stripmime/$now-$$";
$basefilepath = "/YOUR_HTML_PATH/stripmime/$now-$$";
$outputprog = "/RT_PATH/bin/rt-mailgate @ARGV";
sub dump_entity {
    my ($entity, $checksentry, $name) = @_;
    defined($name) or $name = "message";
    my $IO;

    # EJG: Head appears to be the deliver info.
    # Output the head, if it's the root level head
    # Otherwise, it's just some crappy mime header
    if ($name eq "message") {
        print OUT $entity->head->original_text."\n";
    }

    # Output the body:
    my @parts = $entity->parts;
    if (@parts) {                     # multipart...
        my $i;
        foreach $i (0 .. $#parts) {   # dump each part...
            dump_entity($parts[$i], 0, ("$name, part ".(1+$i)));
        }
    }
    else {                            # single part...
        # Get MIME type, and display accordingly...
        my ($type, $subtype) = split('/', $entity->head->mime_type);
        my $body = $entity->bodyhandle;

        # If it's text, display it, perhaps
        my $path = $body->path;
        my ($filename) = ($path =~ /\/([^\/]+)$/);
        if ($type =~ /^(text|message)$/ && $subtype ne "html") {
            print OUT "\n>>> Text component $filename:\n"
                if ($filename !~ /msgauto/);
            if ($IO = $body->open("r")) {
                print OUT $_ while (defined($_ = $IO->getline));
                $IO->close;
                push (@deletetemp, "$basefilepath/$filename");
                $keepgoing = 0;
            }
        }
        else {
            # EJG: Added case for Apple headers
            if ( ($type eq "application") && ($subtype eq "applefile") ) {
                print OUT "\n>>> $type/$subtype component, $name:\n";
                print OUT "Not relevant, deleted\n";
                push (@deletetemp, "$basefilepath/$filename");
            }
            else {
                # EJG: Added 3rd condition (to avoid ".html.html" files)
                if ($subtype eq "html" && $filename =~ /msgauto/ &&
                    $filename !~ /\.html$/ ) {
                    $newfilename = "$filename.html";
                    $renametemp{"$basefilepath/$filename"} =
                        "$basefilepath/$newfilename";
                    $filename = $newfilename;
                }
                # EJG: If the message or the first part of the message is HTML,
                # EJG: invoke HTML::FormatText to convert it to text.
                if ($subtype eq "html" &&
                    ($name eq "message" || $name eq "message, part 1")) {
                    my $htmltree = HTML::TreeBuilder->new;
                    my $htmlformat = HTML::FormatText->new(
                        leftmargin => 4, rightmargin => 60 );
                    # Parse the file MIME::Parser wrote; the ".html" rename
                    # queued above does not happen until main().
                    $htmltree->parse_file($path);
                    if ($name eq "message") {
                        # $sentrystr is never set in this script, so this
                        # prints an empty line.
                        print OUT "$sentrystr"."\n";
                    }
                    print OUT "Incoming HTML message detected -- converted to text only.\n";
                    print OUT "\n\n==========================================\n";
                    print OUT $htmlformat->format( $htmltree );
                    print OUT "\n\n==========================================\n";
                    print OUT "Original HTML version available at URL below.\n";
                }
                print OUT "\n>>> $type/$subtype component, $name:\n";
                print OUT "<A HREF=\"$basepath/$filename\">\n";
                print OUT "$basepath/$filename\n";
                print OUT "</A>\n";
            }
        }
    }
    1;
}
#------------------------------
#
# main
#
sub main {
    # Create a new MIME parser:
    my $parser = MIME::Parser->new;

    # Set the output directory:
    (-d "$basefilepath") or mkdir "$basefilepath",0755 or die "mkdir: $!";
    (-w "$basefilepath") or die "can't write to directory";
    $parser->output_dir($basefilepath);

    # Pipe the rewritten message to rt-mailgate:
    open (OUT, "|$outputprog") or die "can't run $outputprog: $!";
    $parser->output_prefix("msgauto");

    # Read the MIME message:
    $entity = $parser->read(\*STDIN) or die "couldn't parse MIME stream";

    # Dump it out:
    dump_entity($entity, 1);
    close(OUT);

    # Delete unneeded temporary files
    foreach (@deletetemp) {
        unlink ($_);
    }

    # Rename our temporary files that were renamed (html, etc.)
    foreach (keys %renametemp) {
        rename ($_, $renametemp{$_});
    }

    # Delete our directory, or at least try -- won't delete if it's not empty
    rmdir($basefilepath);
}

&main();
exit(0);