[Rt-commit] rt branch, 4.0/tsvector-too-long, created. rt-4.0.2-112-g692b5bc
Alex Vandiver
alexmv at bestpractical.com
Tue Sep 20 00:41:11 EDT 2011
The branch, 4.0/tsvector-too-long has been created
at 692b5bcb0d807b6f9c4407dc84108cbc25d1f5cf (commit)
- Log -----------------------------------------------------------------
commit 692b5bcb0d807b6f9c4407dc84108cbc25d1f5cf
Author: Alex Vandiver <alexmv at bestpractical.com>
Date: Mon Sep 19 23:31:19 2011 -0400
In Postgres, simply skip attachments whose tsvectors are too large
While PostgreSQL's tsvector format has a number of limitations [1], one
of them is particularly important:
* The length of a tsvector (lexemes + positions) must be less than
1 megabyte
Exceeding this limit raises an error that aborts the statement, and thus
rt-fulltext-indexer stops there. Detect that particular failure case,
skip that attachment, and allow indexing to proceed.
The rest of the limitations, and the reasons they do not need to be
dealt with similarly, are:
* The length of each lexeme must be less than 2K bytes
This produces a warning from Postgres, but the statement completes.
* The number of lexemes must be less than 2^64
The 1 megabyte limit above is reached long before this one, so this
limit can essentially never be triggered.
* Position values in tsvector must be greater than 0 and no more
than 16,383
If a lexeme is found later than position 16,383, it is simply capped at
being position 16,383. This causes information loss, but as RT does not
present the position of the match in any form, this does not matter.
* No more than 256 positions per lexeme
Only the first 256 locations are stored; later occurrences are ignored.
Again, as RT does not present the locations of the matches, this
information loss is irrelevant to the application.
[1] http://www.postgresql.org/docs/9.1/static/textsearch-limitations.html
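
The skip-on-one-error pattern described above can be sketched in isolation
as follows. This is a minimal illustration, not the indexer's actual code:
index_attachment() is a hypothetical stand-in for the $dbh->do(...) call,
and its length check on the raw text is only an assumption used to trigger
the failure (the real limit applies to the size of the resulting tsvector,
not to the input text). Only the matched error string is taken from the
change itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the UPDATE ... to_tsvector(?) statement.
# It dies the way a database handle that raises errors would.
sub index_attachment {
    my ($id, $text) = @_;
    # Assumption for illustration only: treat oversized input as the
    # "too long" case; Postgres actually measures the tsvector itself.
    die "ERROR:  string is too long for tsvector\n"
        if length($text) > 1_048_575;
    die "ERROR:  connection lost\n" if $id < 0;
    return 1;
}

# Returns 1 if the attachment was indexed, 0 if it was skipped.
# Any failure other than the tsvector size limit is rethrown.
sub index_or_skip {
    my ($id, $text) = @_;
    my $ok = eval { index_attachment($id, $text) };
    return 1 if $ok;

    my $err = $@ || 'unknown error';
    if ( $err =~ /string is too long for tsvector/ ) {
        warn "Attachment $id not indexed: tsvector too long\n";
        return 0;
    }
    die "error: $err";
}

# Short, oversized, short: indexing continues past the middle attachment.
my @results = map { index_or_skip(@$_) }
    ( [ 1, 'short' ], [ 2, 'x' x 2_000_000 ], [ 3, 'short' ] );
print "@results\n";
```

Matching on the specific message keeps every other database error fatal,
so only the known, harmless failure is downgraded to a warning.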
diff --git a/sbin/rt-fulltext-indexer.in b/sbin/rt-fulltext-indexer.in
index 37fb227..11e2791 100644
--- a/sbin/rt-fulltext-indexer.in
+++ b/sbin/rt-fulltext-indexer.in
@@ -367,9 +367,13 @@ sub process_pg {
$query = "UPDATE Attachments SET $column = to_tsvector(?) WHERE id = ?";
}
- my $status = $dbh->do( $query, undef, $$text, $attachment->id );
+ my $status = eval { $dbh->do( $query, undef, $$text, $attachment->id ) };
unless ( $status ) {
- die "error: ". $dbh->errstr;
+ if ($dbh->errstr =~ /string is too long for tsvector/) {
+ warn "Attachment @{[$attachment->id]} not indexed, as it contains too many unique words to be indexed";
+ } else {
+ die "error: ". $dbh->errstr;
+ }
}
}
diff --git a/t/fts/indexed_pg.t b/t/fts/indexed_pg.t
index ea5cad1..c437c1f 100644
--- a/t/fts/indexed_pg.t
+++ b/t/fts/indexed_pg.t
@@ -10,7 +10,7 @@ my ($major, $minor) = $RT::Handle->dbh->get_info(18) =~ /^0*(\d+)\.0*(\d+)/;
plan skip_all => "Need Pg 8.2 or higher; we have $major.$minor"
if "$major.$minor" < 8.2;
-plan tests => 21;
+plan tests => 36;
RT->Config->Set( FullTextSearch => Enable => 1, Indexed => 1, Column => 'ContentIndex', Table => 'Attachments' );
@@ -94,4 +94,26 @@ run_tests(
"Content LIKE 'pubs'" => { $book->id => 0, $bars->id => 0 },
);
+# Test the "ts_vector too long" skip
+my $content = "";
+$content .= "$_\n" for 1..200_000;
+@tickets = RT::Test->create_tickets(
+ { Queue => $q->id },
+ { Subject => 'Short content', Content => '50' },
+ { Subject => 'Long content', Content => $content },
+ { Subject => 'More short', Content => '50' },
+);
+
+my ($exit_code, $output) = RT::Test->run_and_capture(
+ command => $RT::SbinPath .'/rt-fulltext-indexer'
+);
+like($output, qr/string is too long for tsvector/, "Got a warning for the ticket");
+ok(!$exit_code, "set up index");
+
+# The long content is skipped entirely
+run_tests(
+ "Content LIKE '1'" => { $tickets[0]->id => 0, $tickets[1]->id => 0, $tickets[2]->id => 0 },
+ "Content LIKE '50'" => { $tickets[0]->id => 1, $tickets[1]->id => 0, $tickets[2]->id => 1 },
+);
+
@tickets = ();
-----------------------------------------------------------------------