[Rt-commit] rt branch, 4.0/non-character-scrubber-error, created. rt-4.0.6-254-g977cb83
Alex Vandiver
alexmv at bestpractical.com
Mon Aug 13 22:21:12 EDT 2012
The branch, 4.0/non-character-scrubber-error has been created
at 977cb83c9b7f20d7f3c257b7307c2f59a6c3de77 (commit)
- Log -----------------------------------------------------------------
commit 977cb83c9b7f20d7f3c257b7307c2f59a6c3de77
Author: Alex Vandiver <alexmv at bestpractical.com>
Date: Mon Aug 13 21:59:40 2012 -0400
Remove not-a-character codepoints for safety on perl < 5.12.0
Perl prior to 5.12.0 would die with a fatal error when attempting to
match a character class against unicode codepoint U+FFFF:
$ perl-5.10.1 -wle '$a = "\x{FFFF}"; $a =~ s/[^a]/x/g; print "OK";'
Unicode character 0xffff is illegal at -e line 1.
Malformed UTF-8 character (fatal) at -e line 1.
The first warning is due to embedding the character in the source, and
is not strictly relevant. The second is a fatal error in the regular
expression engine, with a misleading message. The codepoint U+FFFF is
correctly encoded as a UTF-8 character here, but is labelled "not a
character" by the Unicode standard; this means that while software is
free to use these code points for internal use, they should never be
included in text interchange with other programs. It is nonetheless
currently possible to insert it into RT's database when it is provided as
its UTF-8 encoding ("EF BF BF") in email.
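As a quick sanity check of that byte sequence, here is a minimal sketch
that encodes a codepoint with Perl's lax built-in utf8::encode and prints
the resulting octets in hex. The helper name utf8_hex is made up for
illustration and is not part of RT.

```perl
use strict;
use warnings;
no warnings 'utf8';    # older perls warn about non-character codepoints

# Hypothetical helper (not part of RT): return the UTF-8 octets of a
# single codepoint as space-separated uppercase hex.
sub utf8_hex {
    my ($codepoint) = @_;
    my $str = chr($codepoint);
    utf8::encode($str);    # lax encoder: characters -> UTF-8 octets, in place
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $str;
}

print utf8_hex(0xFFFF), "\n";    # prints "EF BF BF"
```

The sequence is a structurally valid three-byte UTF-8 encoding, which is
why the "Malformed UTF-8 character" message is misleading.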
The fatal error is particularly destructive in this case because it is
triggered within HTML::Parser, within HTML::Scrubber. This error leaves
HTML::Parser in an inconsistent state, wherein it believes it is still
parsing; this state causes all later calls to that parser to throw the
error "Parse loop not allowed".
And because, since 4024f896, RT stores and reuses the same
HTML::Scrubber object per-process (which caches its own HTML::Parser
object), this means that a U+FFFF codepoint in any content is capable of
causing all future calls to HTML::Scrubber to die, for the remaining
lifetime of the process.
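The failure mode can be modelled with a toy parser that guards against
re-entrant parsing. This is only a sketch of the mechanism described
above, not HTML::Parser's actual implementation; ToyParser, scrub, and
the "BAD" trigger string are all invented for illustration.

```perl
use strict;
use warnings;

# Toy stand-in for a parser with a re-entrancy guard (NOT HTML::Parser).
package ToyParser;

sub new { return bless { parsing => 0 }, shift }

sub parse {
    my ($self, $text) = @_;
    die "Parse loop not allowed\n" if $self->{parsing};
    $self->{parsing} = 1;
    # A fatal error here skips the reset below, so the flag stays set.
    die "Malformed UTF-8 character (fatal)\n" if $text =~ /BAD/;
    $self->{parsing} = 0;
    return uc $text;
}

package main;

my $CACHED;    # one parser object reused for the life of the process

sub scrub {
    my ($text) = @_;
    $CACHED ||= ToyParser->new;
    my $out = eval { $CACHED->parse($text) };
    return $out if defined $out;
    chomp(my $err = $@);
    return "error: $err";
}

print "$_\n" for scrub("ok"), scrub("BAD"), scrub("ok");
# OK
# error: Malformed UTF-8 character (fatal)
# error: Parse loop not allowed
```

The third call receives harmless input but still fails, because the
cached object never left its "parsing" state.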
Explicitly strip all 66 non-characters (FFFE, FFFF, 1FFFE, 1FFFF, 2FFFE,
2FFFF, etc, through 10FFFF, as well as FDD0..FDEF) before passing
strings through to HTML::Scrubber. While this is only required to avoid
the error on perl prior to 5.12.0, it also improves correctness on later
perls, which should not be producing the codepoints for text
interchange in the browser.
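For reference, a standalone sketch of the stripping step. The Unicode
non-characters are the last two codepoints of each of the 17 planes
(U+nFFFE and U+nFFFF for n = 0..10 hex) plus U+FDD0..U+FDEF, which gives
34 + 32 = 66. The function names noncharacters and strip_noncharacters
are hypothetical and do not exist in RT; like the patch, the sketch
removes one codepoint per substitution rather than using a character
class.

```perl
use strict;
use warnings;
no warnings 'utf8';    # older perls warn about non-character codepoints

# Hypothetical helper: return all 66 Unicode non-character codepoints.
sub noncharacters {
    # The last two codepoints of each plane, 0x0 through 0x10 ...
    my @cp = map { my $plane = $_ << 16; ($plane | 0xFFFE, $plane | 0xFFFF) }
        0x0 .. 0x10;
    # ... plus the contiguous block U+FDD0 .. U+FDEF.
    push @cp, 0xFDD0 .. 0xFDEF;
    return @cp;
}

# Hypothetical helper: return a copy of $text without non-characters.
sub strip_noncharacters {
    my ($text) = @_;
    for my $char (map { chr($_) } noncharacters()) {
        $text =~ s/\Q$char\E//g;
    }
    return $text;
}

my @all = noncharacters();
print scalar(@all), "\n";                                 # prints 66
print strip_noncharacters("a\x{FFFF}b\x{FDD0}c"), "\n";   # prints "abc"
```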
diff --git a/lib/RT/Interface/Web.pm b/lib/RT/Interface/Web.pm
index 748caa3..5246bf3 100644
--- a/lib/RT/Interface/Web.pm
+++ b/lib/RT/Interface/Web.pm
@@ -3083,6 +3083,16 @@ sub ScrubHTML {
$SCRUBBER = _NewScrubber() unless $SCRUBBER;
$Content = '' if !defined($Content);
+
+ {
+ # Remove invalid Unicode codepoints, which can cause errors in
+ # HTML::Scrubber/HTML::Parser in perl < 5.12 when the regex
+ # engine dies on U+FFFF and other "non-character codepoints."
+ no warnings 'utf8';
+ my @invalid = map { ($_ << 16) + 0xFFFE, ($_ << 16) + 0xFFFF } 0x0 .. 0x10;
+ push @invalid, hex("FDD0") .. hex("FDEF");
+ $Content =~ s/$_//g for map { chr( $_ ) } @invalid;
+ }
return $SCRUBBER->scrub($Content);
}
-----------------------------------------------------------------------