[Rt-commit] rt branch, 4.0/decode-entities-before-indexing, created. rt-4.0.22-21-g9ced882

Mon Dec 15 14:54:48 EST 2014

The branch, 4.0/decode-entities-before-indexing has been created
        at  9ced88246f8ff368e0877c454fb70b81b654d673 (commit)

- Log -----------------------------------------------------------------
commit 9ced88246f8ff368e0877c454fb70b81b654d673
Author: Kevin Falcone <falcone at bestpractical.com>
Date:   Thu Nov 20 16:58:22 2014 -0500

    Our rich text editor produces entities for indexed characters
    
    Postgres is completely capable of searching for cherché, but the rich
    text editor produced cherch&eacute;, so we handed that to Pg and it
    stored 'cherch', which won't match later.
    
    This change decodes the entities first, so we pass the actual word to
    Pg to be tokenized and stored, improving the odds that a later search
    will find it.
    
    Since RT generates text/plain outgoing mail parts and those parts
    generally contained the decoded word, this may have been hiding this
    failure (the search would find the outgoing mail message rather than
    the reply itself, but would still match the ticket).

diff --git a/sbin/rt-fulltext-indexer.in b/sbin/rt-fulltext-indexer.in
index b90d8da..1045a92 100644
--- a/sbin/rt-fulltext-indexer.in
+++ b/sbin/rt-fulltext-indexer.in
@@ -435,7 +435,10 @@ sub extract_html {
     my $attachment = shift;
     my $text = $attachment->Content;
     return undef unless defined $text && length($text);
-# TODO: html -> text
+# the rich text editor generates html entities for characters
+# but Pg doesn't index them, so decode to something it can index.
+    require HTML::Entities;
+    HTML::Entities::decode_entities($text);
     return \$text;
 }
 

-----------------------------------------------------------------------
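
For readers who want to try the decoding step outside of RT, here is a
minimal standalone sketch of what the patched extract_html does. The helper
name decode_for_index and the sample input are hypothetical and are not part
of rt-fulltext-indexer. It assumes HTML::Entities (from the HTML-Parser
distribution) is installed.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical helper mirroring the patched extract_html: takes raw HTML
# content and returns a reference to the entity-decoded text, or undef
# when there is nothing to index.
sub decode_for_index {
    my $text = shift;
    return undef unless defined $text && length($text);

    # Called in void context, decode_entities rewrites $text in place.
    HTML::Entities::decode_entities($text);
    return \$text;
}

# Sample input as a rich text editor might produce it (assumed).
my $ref = decode_for_index('<p>cherch&eacute;</p>');

binmode STDOUT, ':encoding(UTF-8)';
print $$ref, "\n";    # tags are left as-is; only entities are decoded
```

As in the patch, this only decodes entities. It does not strip markup, so
the surrounding tags are still passed along with the text.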