[Rt-commit] rt branch, 4.0/non-character-scrubber-error, created. rt-4.0.6-254-g977cb83

Mon Aug 13 22:21:12 EDT 2012

The branch, 4.0/non-character-scrubber-error has been created
        at  977cb83c9b7f20d7f3c257b7307c2f59a6c3de77 (commit)

- Log -----------------------------------------------------------------
commit 977cb83c9b7f20d7f3c257b7307c2f59a6c3de77
Author: Alex Vandiver <alexmv at bestpractical.com>
Date:   Mon Aug 13 21:59:40 2012 -0400

    Remove not-a-character codepoints for safety on perl < 5.12.0
    
    Perl prior to 5.12.0 would die with a fatal error when attempting to
    match a character class against unicode codepoint U+FFFF:
    
       $ perl-5.10.1 -wle '$a = "\x{FFFF}"; $a =~ s/[^a]/x/g; print "OK";'
       Unicode character 0xffff is illegal at -e line 1.
       Malformed UTF-8 character (fatal) at -e line 1.
    
    The first warning is due to embedding the character in the source, and
    is not strictly relevant.  The second is a fatal error in the regular
    expression engine, with a misleading message.  The codepoint U+FFFF is
    correctly encoded as a UTF-8 character here, but is labelled "not a
    character" by the Unicode standard; this means that while software is
    free to use these code points for internal use, they should never be
    included in text interchange with other programs.  It is nonetheless
    currently possible to insert it into RT's database when it is provided
    as its UTF-8 encoding ("EF BF BF") in email.
    
    The fatal error is particularly destructive in this case because it is
    triggered within HTML::Parser, within HTML::Scrubber.  This error leaves
    HTML::Parser in an inconsistent state, wherein it believes it is still
    parsing; this state causes all later calls to that parser to throw the
    error "Parse loop not allowed".
    
    Furthermore, since 4024f896, RT stores and reuses the same
    HTML::Scrubber object per-process (which caches its own HTML::Parser
    object); as a result, a single U+FFFF codepoint in any content is
    capable of causing all future calls to HTML::Scrubber to die, for the
    remaining lifetime of the process.
    
    Explicitly strip all 66 non-characters (FFFE, FFFF, 1FFFE, 1FFFF, 2FFFE,
    2FFFF, etc, through 10FFFF, as well as FDD0..FDEF) before passing
    strings through to HTML::Scrubber.  While this is only required to avoid
    the error on perl prior to 5.12.0, it also improves correctness on later
    perls, which should not be producing the codepoints for text
    interchange in the browser.

diff --git a/lib/RT/Interface/Web.pm b/lib/RT/Interface/Web.pm
index 748caa3..5246bf3 100644
--- a/lib/RT/Interface/Web.pm
+++ b/lib/RT/Interface/Web.pm
@@ -3083,6 +3083,16 @@ sub ScrubHTML {
     $SCRUBBER = _NewScrubber() unless $SCRUBBER;
 
     $Content = '' if !defined($Content);
+
+    {
+        # Remove invalid Unicode codepoints, which can cause errors in
+        # HTML::Scrubber/HTML::Parser in perl < 5.12 when the regex
+        # engine dies on U+FFFF and other "non-character codepoints."
+        no warnings 'utf8';
+        my @invalid = map { ($_ << 16) | 0xFFFE, ($_ << 16) | 0xFFFF } 0 .. 0x10;
+        push @invalid, hex("FDD0") .. hex("FDEF");
+        $Content =~ s/$_//g for map { chr( $_ ) } @invalid;
+    }
     return $SCRUBBER->scrub($Content);
 }
 

-----------------------------------------------------------------------
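
The set of 66 non-characters described in the log message can be enumerated directly from plane numbers, which avoids building hex strings by concatenation. The following is a minimal sketch, not the code in RT: the names noncharacter_codepoints and strip_noncharacters are hypothetical, and the helper filters by code point rather than by regex substitution. It has only been reasoned about for modern perls and makes no claim about behavior on perl < 5.12.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of RT): list every Unicode non-character.
# Each of the 17 planes (0x00 .. 0x10) ends in two non-characters,
# U+nFFFE and U+nFFFF; U+FDD0..U+FDEF adds another 32.
sub noncharacter_codepoints {
    my @cp = map { ( $_ << 16 ) | 0xFFFE, ( $_ << 16 ) | 0xFFFF } 0 .. 0x10;
    push @cp, 0xFDD0 .. 0xFDEF;
    return @cp;
}

# Hypothetical helper (not part of RT): return a copy of $text with all
# non-characters removed.  Characters are filtered by code point, so no
# character class is ever matched against the string.
sub strip_noncharacters {
    my ($text) = @_;
    return '' unless defined $text;
    my %drop = map { $_ => 1 } noncharacter_codepoints();
    return join '', grep { !$drop{ ord $_ } } split //, $text;
}
```

A caller in the position of ScrubHTML could, under these assumptions, run $Content through strip_noncharacters() before handing it to the scrubber. Iterating over 0 .. 0x10 covers planes 0xA through 0xF, which a decimal 0..10 range used as a hex prefix would skip.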