[Rt-commit] rt branch, 4.0/non-character-scrubber-error, created. rt-4.0.6-254-gb5a67b1
Alex Vandiver
alexmv at bestpractical.com
Thu Oct 25 15:12:12 EDT 2012
The branch, 4.0/non-character-scrubber-error has been created
at b5a67b12cad23f17587ed005057f1e0e27bd3147 (commit)
- Log -----------------------------------------------------------------
commit b5a67b12cad23f17587ed005057f1e0e27bd3147
Author: Alex Vandiver <alexmv at bestpractical.com>
Date: Wed Oct 24 23:24:41 2012 -0400
Remove not-a-character codepoints for safety on perl-5.8.3..perl-5.12.0
Perl between 5.8.3 and 5.12.0 (exclusive) would die with a fatal error
when attempting to match a character class against unicode codepoint
U+FFFF:
$ perl-5.10.1 -wle '$a = "\x{FFFF}"; $a =~ s/[^a]/x/g; print "OK";'
Unicode character 0xffff is illegal at -e line 1.
Malformed UTF-8 character (fatal) at -e line 1.
The first warning is due to embedding the character in the source, and
is not strictly relevant. The second is a fatal error in the regular
expression engine, with a misleading message. The codepoint U+FFFF is
correctly encoded as a UTF-8 character here, but is labelled "not a
character" by the Unicode standard; this means that while software is
free to use these code points for internal use, they should never be
included in text interchange with other programs. It is nonetheless
currently possible to insert into RT's database when provided as its
UTF-8 encoding ("EF BF BF") in email.
The fatal error is particularly destructive in this case because it is
triggered within HTML::Parser, within HTML::Scrubber. This error leaves
HTML::Parser in an inconsistent state, wherein it believes it is still
parsing; this state causes all later calls to that parser to throw the
error "Parse loop not allowed".
And because, since 4024f896, RT stores and reuses the same
HTML::Scrubber object per-process (which caches its own HTML::Parser
object), this means that a U+FFFF codepoint in any content is capable of
causing all future calls to HTML::Scrubber to die, for the remaining
lifetime of the process.
Explicitly strip all 66 non-characters (FFEF, FFFF, 1FFEF, 1FFFF, 2FFEF,
2FFFF, etc, through 10FFFF, as well as FDD0..FDEF) before passing
strings through to HTML::Scrubber. While this is only required to avoid
the error on perl prior to 5.12.0, it also improves correctness on later
perls, which should not be producing the codepoints for text
interchange in the browser.
diff --git a/lib/RT/Interface/Web.pm b/lib/RT/Interface/Web.pm
index 748caa3..5246bf3 100644
--- a/lib/RT/Interface/Web.pm
+++ b/lib/RT/Interface/Web.pm
@@ -3083,6 +3083,16 @@ sub ScrubHTML {
$SCRUBBER = _NewScrubber() unless $SCRUBBER;
$Content = '' if !defined($Content);
+
+ {
+ # Remove invalid Unicode codepoints, which can cause errors in
+ # HTML::Scrubber/HTML::Parser in perl < 5.12 when the regex
+ # engine dies on U+FFFF and other "non-character codepoints."
+ no warnings 'utf8';
+ my @invalid = map { hex($_ . "FFEF"), hex($_ . "FFFF") } 0..10;
+ push @invalid, hex("FDD0") .. hex("FDEF");
+ $Content =~ s/$_//g for map { chr( $_ ) } @invalid;
+ }
return $SCRUBBER->scrub($Content);
}
diff --git a/t/web/scrub.t b/t/web/scrub.t
index 612c6e2..5dab3f9 100644
--- a/t/web/scrub.t
+++ b/t/web/scrub.t
@@ -2,7 +2,7 @@
use strict;
use warnings;
-use RT::Test nodb => 1, tests => 6;
+use RT::Test nodb => 1, tests => 10;
use RT::Interface::Web; # This gets us HTML::Mason::Commands
use Test::LongString;
@@ -40,6 +40,20 @@ use Test::LongString;
is_string(scrub_html($html), $expected, "outlook html");
}
+{
+ no warnings 'utf8';
+ my $html = qq[Some content\N{U+FFFF}here];
+ my $expected = q[Some contenthere];
+ my $got = eval{scrub_html($html)};
+ ok(!$@, "Testing U+FFFF didn't die");
+ is_string($got, $expected, "U+FFFF non-character");
+
+ # This may die _hard_, escaping even the eval.
+ $got = eval{scrub_html("<p>content</p>")};
+ ok(!$@, "Later uninteresting calls also don't die");
+ is_string($got, "<p>content</p>", "Later uninteresting calls return the right value");
+}
+
sub scrub_html {
return HTML::Mason::Commands::ScrubHTML(shift);
}
-----------------------------------------------------------------------
More information about the Rt-commit
mailing list