[Rt-commit] rt branch, 4.0/non-character-scrubber-error, created. rt-4.0.6-254-gb5a67b1

Thu Oct 25 15:12:12 EDT 2012

The branch, 4.0/non-character-scrubber-error has been created
        at  b5a67b12cad23f17587ed005057f1e0e27bd3147 (commit)

- Log -----------------------------------------------------------------
commit b5a67b12cad23f17587ed005057f1e0e27bd3147
Author: Alex Vandiver <alexmv at bestpractical.com>
Date:   Wed Oct 24 23:24:41 2012 -0400

    Remove not-a-character codepoints for safety on perl-5.8.3..perl-5.12.0
    
    Perl between 5.8.3 and 5.12.0 (exclusive) would die with a fatal error
    when attempting to match a character class against unicode codepoint
    U+FFFF:
    
       $ perl-5.10.1 -wle '$a = "\x{FFFF}"; $a =~ s/[^a]/x/g; print "OK";'
       Unicode character 0xffff is illegal at -e line 1.
       Malformed UTF-8 character (fatal) at -e line 1.
    
    The first warning is due to embedding the character in the source, and
    is not strictly relevant.  The second is a fatal error in the regular
    expression engine, with a misleading message.  The codepoint U+FFFF is
    correctly encoded as a UTF-8 character here, but is labelled "not a
    character" by the Unicode standard; this means that while software is
    free to use these code points for internal use, they should never be
    included in text interchange with other programs.  It is nonetheless
    currently possible to insert into RT's database when provided as its
    UTF-8 encoding ("EF BF BF") in email.
    
    The fatal error is particularly destructive in this case because it is
    triggered within HTML::Parser, within HTML::Scrubber.  This error leaves
    HTML::Parser in an inconsistent state, wherein it believes it is still
    parsing; this state causes all later calls to that parser to throw the
    error "Parse loop not allowed".
    
    And because, since 4024f896, RT stores and reuses the same
    HTML::Scrubber object per-process (which caches its own HTML::Parser
    object), this means that a U+FFFF codepoint in any content is capable of
    causing all future calls to HTML::Scrubber to die, for the remaining
    lifetime of the process.
    
    Explicitly strip all 66 non-characters (FFEF, FFFF, 1FFEF, 1FFFF, 2FFEF,
    2FFFF, etc, through 10FFFF, as well as FDD0..FDEF) before passing
    strings through to HTML::Scrubber.  While this is only required to avoid
    the error on perl prior to 5.12.0, it also improves correctness on later
    perls, which should not be producing the codepoints for text
    interchange in the browser.

diff --git a/lib/RT/Interface/Web.pm b/lib/RT/Interface/Web.pm
index 748caa3..5246bf3 100644
--- a/lib/RT/Interface/Web.pm
+++ b/lib/RT/Interface/Web.pm
@@ -3083,6 +3083,16 @@ sub ScrubHTML {
     $SCRUBBER = _NewScrubber() unless $SCRUBBER;
 
     $Content = '' if !defined($Content);
+
+    {
+        # Remove invalid Unicode codepoints, which can cause errors in
+        # HTML::Scrubber/HTML::Parser in perl < 5.12 when the regex
+        # engine dies on U+FFFF and other "non-character codepoints."
+        no warnings 'utf8';
+        my @invalid = map { hex($_ . "FFEF"), hex($_ . "FFFF") } 0..10;
+        push @invalid, hex("FDD0") .. hex("FDEF");
+        $Content =~ s/$_//g for map { chr( $_ ) } @invalid;
+    }
     return $SCRUBBER->scrub($Content);
 }
 
diff --git a/t/web/scrub.t b/t/web/scrub.t
index 612c6e2..5dab3f9 100644
--- a/t/web/scrub.t
+++ b/t/web/scrub.t
@@ -2,7 +2,7 @@
 use strict;
 use warnings;
 
-use RT::Test nodb => 1, tests => 6;
+use RT::Test nodb => 1, tests => 10;
 use RT::Interface::Web; # This gets us HTML::Mason::Commands
 use Test::LongString;
 
@@ -40,6 +40,20 @@ use Test::LongString;
     is_string(scrub_html($html), $expected, "outlook html");
 }
 
+{
+    no warnings 'utf8';
+    my $html = qq[Some content\N{U+FFFF}here];
+    my $expected = q[Some contenthere];
+    my $got = eval{scrub_html($html)};
+    ok(!$@, "Testing U+FFFF didn't die");
+    is_string($got, $expected, "U+FFFF non-character");
+
+    # This may die _hard_, escaping even the eval.
+    $got = eval{scrub_html("<p>content</p>")};
+    ok(!$@, "Later uninteresting calls also don't die");
+    is_string($got, "<p>content</p>", "Later uninteresting calls return the right value");
+}
+
 sub scrub_html {
     return HTML::Mason::Commands::ScrubHTML(shift);
 }

-----------------------------------------------------------------------