[Rt-commit] rt branch, 4.0/pg-fts-invalid-character, created. rt-4.0.5-62-g81df7e2

Alex Vandiver alexmv at bestpractical.com
Wed Feb 15 15:08:43 EST 2012


The branch, 4.0/pg-fts-invalid-character has been created
        at  81df7e2d07c35834b670e0e41adf677cd15affb5 (commit)

- Log -----------------------------------------------------------------
commit 12b0fded547c53c79db4f5a2e2f049b5f397d387
Author: Alex Vandiver <alexmv at bestpractical.com>
Date:   Wed Feb 15 15:01:05 2012 -0500

    With the Pg FTS, catch and skip attachments which contain invalid UTF8 bytes

diff --git a/sbin/rt-fulltext-indexer.in b/sbin/rt-fulltext-indexer.in
index 7e31cac..652fde0 100644
--- a/sbin/rt-fulltext-indexer.in
+++ b/sbin/rt-fulltext-indexer.in
@@ -371,6 +371,8 @@ sub process_pg {
     unless ( $status ) {
         if ($dbh->errstr =~ /string is too long for tsvector/) {
             warn "Attachment @{[$attachment->id]} not indexed, as it contains too many unique words to be indexed";
+        } elsif ($dbh->errstr =~ /invalid byte sequence/) {
+            warn "Attachment @{[$attachment->id]} cannot be indexed, as it contains invalid UTF8 bytes";
         } else {
             die "error: ". $dbh->errstr;
         }

commit 19721b8012776f5ae523e27f07b6dac06ad1dded
Author: Alex Vandiver <alexmv at bestpractical.com>
Date:   Wed Feb 15 15:03:38 2012 -0500

    Strengthen wording about our ability (or lack thereof) to FTS index on Pg

diff --git a/sbin/rt-fulltext-indexer.in b/sbin/rt-fulltext-indexer.in
index 652fde0..d978586 100644
--- a/sbin/rt-fulltext-indexer.in
+++ b/sbin/rt-fulltext-indexer.in
@@ -370,7 +370,7 @@ sub process_pg {
     my $status = eval { $dbh->do( $query, undef, $$text, $attachment->id ) };
     unless ( $status ) {
         if ($dbh->errstr =~ /string is too long for tsvector/) {
-            warn "Attachment @{[$attachment->id]} not indexed, as it contains too many unique words to be indexed";
+            warn "Attachment @{[$attachment->id]} cannot be indexed, as it contains too many unique words";
         } elsif ($dbh->errstr =~ /invalid byte sequence/) {
             warn "Attachment @{[$attachment->id]} cannot be indexed, as it contains invalid UTF8 bytes";
         } else {

commit 81df7e2d07c35834b670e0e41adf677cd15affb5
Author: Alex Vandiver <alexmv at bestpractical.com>
Date:   Wed Feb 15 15:03:45 2012 -0500

    If we fail to index on Pg, ensure that we continue indexing past that point
    
    Previously, failure to index (because of invalid bytes, or too-long
    content) left the content index NULL.  As our check for where to resume
    indexing is based on rows where the index IS NOT NULL, this could lead
    to a pessimal condition where a large number of failures to index in a
    row would prevent forward progress of the indexer.

diff --git a/sbin/rt-fulltext-indexer.in b/sbin/rt-fulltext-indexer.in
index d978586..407afe0 100644
--- a/sbin/rt-fulltext-indexer.in
+++ b/sbin/rt-fulltext-indexer.in
@@ -376,6 +376,11 @@ sub process_pg {
         } else {
             die "error: ". $dbh->errstr;
         }
+
+        # Insert an empty tsvector, so we count this row as "indexed"
+        # for purposes of knowing where to pick up
+        eval { $dbh->do( $query, undef, "", $attachment->id ) }
+            or die "Failed to insert empty tsvector: " . $dbh->errstr;
     }
 }
 

-----------------------------------------------------------------------
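
The last commit message describes how rows left NULL after a failed index attempt could keep the indexer from moving forward. The sketch below simulates that in memory, assuming a resume point defined as the highest attachment id whose index IS NOT NULL. The table layout, the batch limit, and the sub names (make_table, resume_point, run_once) are hypothetical and are not taken from rt-fulltext-indexer; undef stands in for SQL NULL and the empty string for an empty tsvector.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the attachments table: id => index value.
# undef plays the role of SQL NULL; "" plays the role of an empty tsvector.
sub make_table { return { map { $_ => undef } 1 .. 6 } }

# Resume point: the highest id whose index is not NULL (0 if there is none).
sub resume_point {
    my ($table) = @_;
    my $max = 0;
    for my $id ( grep { defined $table->{$_} } keys %$table ) {
        $max = $id if $id > $max;
    }
    return $max;
}

# One indexer run: take up to $limit rows past the resume point.
# Ids present in %$bad fail to index. With $placeholder set, a failure
# stores "" (as the third commit does); otherwise the row stays NULL.
sub run_once {
    my ( $table, $bad, $limit, $placeholder ) = @_;
    my $start = resume_point($table);
    my @batch = grep { $_ > $start } sort { $a <=> $b } keys %$table;
    splice( @batch, $limit ) if @batch > $limit;
    for my $id (@batch) {
        if ( $bad->{$id} ) {
            $table->{$id} = "" if $placeholder;
        }
        else {
            $table->{$id} = "indexed";
        }
    }
    return @batch;
}

# Attachments 1 and 2 cannot be indexed; each run handles two rows.
my %bad = ( 1 => 1, 2 => 1 );

my $stuck = make_table();
for my $run ( 1 .. 3 ) { run_once( $stuck, \%bad, 2, 0 ) }
print "without placeholder, resume point: ", resume_point($stuck), "\n";

my $ok = make_table();
for my $run ( 1 .. 3 ) { run_once( $ok, \%bad, 2, 1 ) }
print "with placeholder, resume point: ", resume_point($ok), "\n";
```

In this simulation the run without the placeholder reports a resume point of 0 after three runs, because every run retries attachments 1 and 2. With the placeholder the resume point reaches 6, and attachments 3 through 6 are indexed.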

