[Rt-commit] rt branch, 4.0/tsvector-too-long, created. rt-4.0.2-112-g692b5bc

Tue Sep 20 00:41:11 EDT 2011

The branch, 4.0/tsvector-too-long has been created
        at  692b5bcb0d807b6f9c4407dc84108cbc25d1f5cf (commit)

- Log -----------------------------------------------------------------
commit 692b5bcb0d807b6f9c4407dc84108cbc25d1f5cf
Author: Alex Vandiver <alexmv at bestpractical.com>
Date:   Mon Sep 19 23:31:19 2011 -0400

    In Postgres, simply skip attachments whose tsvectors are too large
    
    While PostgreSQL's tsvector format has a number of limitations [1], one
    of them is particularly important:
    
         * The length of a tsvector (lexemes + positions) must be less than
           1 megabyte
    
    Exceeding this limit raises an error that makes the statement fail, and
    thus rt-fulltext-indexer stops there.  Detect that particular failure
    case, and allow indexing to proceed, simply skipping that attachment.
    
    The rest of the limitations, and the reasons they do not need to be
    dealt with similarly, are:
    
         * The length of each lexeme must be less than 2K bytes
    
    This produces a warning from Postgres, but the statement completes.
    
         * The number of lexemes must be less than 2^64
    
    This limit can only be reached long after the 1 megabyte limit above,
    and thus can essentially never be triggered.
    
         * Position values in tsvector must be greater than 0 and no more
           than 16,383
    
    If a lexeme is found later than position 16,383, its position is simply
    capped at 16,383.  This causes information loss, but as RT does not
    present the position of the match in any form, this does not matter.
    
         * No more than 256 positions per lexeme
    
    Only the first 256 locations are stored; later occurrences are ignored.
    Again, as RT does not present the locations of the matches, this
    information loss is irrelevant to the application.
    
    [1] http://www.postgresql.org/docs/9.1/static/textsearch-limitations.html
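
The detection described in the log message can be tried in isolation with the sketch below. It is an illustration, not the indexer's code: MockDBH is a hypothetical stand-in for a DBI handle (assumed to behave as if RaiseError were set), index_attachment is a made-up helper name, and the byte counts in the sample error string are illustrative.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a DBI handle so the sketch runs without a database.
# Assumption: like DBI with RaiseError, do() dies and errstr() holds the message.
package MockDBH;
sub new    { my ($class, %fail) = @_; return bless { fail => \%fail, errstr => undef }, $class }
sub errstr { return $_[0]{errstr} }
sub do {
    my ($self, $query, $attr, $text, $id) = @_;
    $self->{errstr} = $self->{fail}{$id};
    die "DBD error: $self->{errstr}\n" if defined $self->{errstr};
    return 1;
}

package main;

# Returns 1 if indexed, 0 if skipped as too long; dies on any other error.
sub index_attachment {
    my ($dbh, $id, $text) = @_;
    my $query  = "UPDATE Attachments SET ContentIndex = to_tsvector(?) WHERE id = ?";
    my $status = eval { $dbh->do( $query, undef, $text, $id ) };
    return 1 if $status;

    my $err = $dbh->errstr || $@ || 'unknown error';
    if ( $err =~ /string is too long for tsvector/ ) {
        warn "Attachment $id not indexed: tsvector too long\n";
        return 0;
    }
    die "error: $err\n";
}

# Attachment 2 fails with the tsvector error (illustrative byte counts),
# attachment 3 with an unrelated error that must stay fatal.
my $dbh = MockDBH->new(
    2 => 'ERROR:  string is too long for tsvector (1278894 bytes, max 1048575 bytes)',
    3 => 'ERROR:  permission denied',
);
my @results = map { index_attachment( $dbh, $_, "text" ) } 1, 2;
my $fatal   = !defined eval { index_attachment( $dbh, 3, "text" ); 1 };
print "results: @results, fatal: ", ( $fatal ? 1 : 0 ), "\n";
```

Only the one recognised message is downgraded to a warning; every other database error still aborts the run, which matches the intent stated above.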

diff --git a/sbin/rt-fulltext-indexer.in b/sbin/rt-fulltext-indexer.in
index 37fb227..11e2791 100644
--- a/sbin/rt-fulltext-indexer.in
+++ b/sbin/rt-fulltext-indexer.in
@@ -367,9 +367,13 @@ sub process_pg {
         $query = "UPDATE Attachments SET $column = to_tsvector(?) WHERE id = ?";
     }
 
-    my $status = $dbh->do( $query, undef, $$text, $attachment->id );
+    my $status = eval { $dbh->do( $query, undef, $$text, $attachment->id ) };
     unless ( $status ) {
-        die "error: ". $dbh->errstr;
+        if ($dbh->errstr =~ /string is too long for tsvector/) {
+            warn "Attachment @{[$attachment->id]} not indexed, as it contains too many unique words to be indexed";
+        } else {
+            die "error: ". $dbh->errstr;
+        }
     }
 }
 
diff --git a/t/fts/indexed_pg.t b/t/fts/indexed_pg.t
index ea5cad1..c437c1f 100644
--- a/t/fts/indexed_pg.t
+++ b/t/fts/indexed_pg.t
@@ -10,7 +10,7 @@ my ($major, $minor) = $RT::Handle->dbh->get_info(18) =~ /^0*(\d+)\.0*(\d+)/;
 plan skip_all => "Need Pg 8.2 or higher; we have $major.$minor"
     if "$major.$minor" < 8.2;
 
-plan tests => 21;
+plan tests => 36;
 
 RT->Config->Set( FullTextSearch => Enable => 1, Indexed => 1, Column => 'ContentIndex', Table => 'Attachments' );
 
@@ -94,4 +94,26 @@ run_tests(
     "Content LIKE 'pubs'" => { $book->id => 0, $bars->id => 0 },
 );
 
+# Test the "ts_vector too long" skip
+my $content = "";
+$content .= "$_\n" for 1..200_000;
+@tickets = RT::Test->create_tickets(
+    { Queue => $q->id },
+    { Subject => 'Short content', Content => '50' },
+    { Subject => 'Long content',  Content => $content  },
+    { Subject => 'More short',    Content => '50' },
+);
+
+my ($exit_code, $output) = RT::Test->run_and_capture(
+    command => $RT::SbinPath .'/rt-fulltext-indexer'
+);
+like($output, qr/string is too long for tsvector/, "Got a warning for the ticket");
+ok(!$exit_code, "set up index");
+
+# The long content is skipped entirely
+run_tests(
+    "Content LIKE '1'"  => { $tickets[0]->id => 0, $tickets[1]->id => 0, $tickets[2]->id => 0 },
+    "Content LIKE '50'" => { $tickets[0]->id => 1, $tickets[1]->id => 0, $tickets[2]->id => 1 },
+);
+
 @tickets = ();

-----------------------------------------------------------------------
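
The test in the hunk above builds its oversized attachment from the integers 1..200_000. A rough check of why that is enough to cross the 1 megabyte limit, assuming each integer becomes its own lexeme and ignoring position data and per-entry overhead (both assumptions; lexeme_bytes is a hypothetical helper):

```perl
use strict;
use warnings;

# Sum the byte lengths of the decimal integers 1..$n, i.e. the raw lexeme
# text if every number is stored as a distinct lexeme (an assumption).
sub lexeme_bytes {
    my ($n) = @_;
    my $total = 0;
    $total += length($_) for 1 .. $n;
    return $total;
}

my $limit = 1024 * 1024;    # the documented 1 megabyte ceiling
my $bytes = lexeme_bytes(200_000);
printf "%d lexeme bytes vs. limit %d: %s\n",
    $bytes, $limit, ( $bytes >= $limit ? "too long" : "fits" );
```

Under those assumptions the lexeme text alone comes to 1,088,895 bytes, already past the limit before any positions are counted.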