[rt-users] RT 4.0.2 postgresql fulltext - error doing initial indexing

Mon Sep 19 14:37:09 EDT 2011

On Mon, 2011-09-19 at 13:24 +1000, fab junkmail wrote:
> 2011-09-19 02:08:28 UTC ERROR:  string is too long for tsvector
> (3831236 bytes, max 1048575 bytes)
> 2011-09-19 02:08:28 UTC STATEMENT:  UPDATE Attachments SET
> ContentIndex = to_tsvector($1) WHERE id = $2
> 
> 
> I think it is getting to a ticket that has too many unique words so it
> can't index it and it critically fails and stops indexing any further.

You are correct that this is because the content of one of the
attachments contains too many unique words (after removing stopwords and
doing stemming).  This is symptomatic of a pathological case -- for
example, the entirety of "A Tale of Two Cities" (775K) creates a 121K
tsvector and the entire corpus of the King James Bible (4.3M) creates a
160K tsvector.  In contrast, the contents of my /usr/share/dict/words
(916K) produce a 524K tsvector, because there is so little word
repetition.
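
To make the repetition point concrete, here is a rough sketch in Python
(not part of RT, which is Perl).  It only counts distinct lowercase
tokens; unlike to_tsvector it does no stemming or stopword removal, so
treat the ratio as a crude proxy for how large a tsvector might get,
not as a measurement.  The function name and sample strings are made up
for illustration.

```python
import re


def unique_word_ratio(text: str) -> float:
    """Return the fraction of word tokens that are distinct.

    Assumption: tokens are runs of ASCII letters/digits, compared
    case-insensitively.  PostgreSQL's parser and stemmer differ.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Prose repeats itself: 7 distinct words out of 12 tokens.
prose = "it was the best of times it was the worst of times"

# A dictionary-style word list never repeats: 6 distinct out of 6.
word_list = "aardvark abacus abandon abase abate abbey"

print(unique_word_ratio(prose))
print(unique_word_ratio(word_list))
```

The closer the ratio is to 1.0, the less the vocabulary collapses, and
the more of the input ends up in the index.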

It would be interesting to know which text/plain or text/html content in
your database blows so far past this limit (generating a 3.8M tsvector
is impressive).  I suspect the data in question is not actually textual
data.  If you re-run rt-fulltext-indexer with --debug, the last
attachment number it prints will tell you which attachment is the
problematic one.
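
Once you have the attachment's content in hand, a quick heuristic can
hint at whether it is really text.  The Python sketch below is only an
illustration: the function name and the 12-character threshold are my
own assumptions, and it will miss some encodings (a hex dump with spaces
would pass, for instance).

```python
def looks_textual(content: str, max_avg_token_len: float = 12.0) -> bool:
    """Guess whether content is natural-language text.

    Heuristic (an assumption for illustration): prose consists of
    short whitespace-separated tokens, while base64 or similar
    encoded data arrives as long unbroken runs of characters.
    """
    tokens = content.split()
    if not tokens:
        return True  # nothing to index, so nothing to worry about
    average = sum(len(token) for token in tokens) / len(tokens)
    return average <= max_avg_token_len


prose = "It was the best of times, it was the worst of times"

# Fifty lines of base64-looking data, each one long unbroken token.
encoded = "\n".join(["QUJDREVGR0hJSktMTU5PUFFSU1RVVldYWVo0NTY3ODkw" * 2] * 50)

print(looks_textual(prose))
print(looks_textual(encoded))
```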

> I would appreciate some advice on how I can proceed with getting the
> rest of my data indexed. I think any of the following would be
> suitable but I don't know how to implement them (I am not a coder or a
> dba) and could use some help. Options:
> 
> - modify the rt-fulltext-indexer script to truncate strings that are
> "too long for tsvector". or

As pointed out above, large corpora can generate perfectly reasonably
sized tsvectors.  Truncating your input strings before indexing will
yield false negatives in perfectly reasonable text; as such, the change
from the wiki will not be taken into core.

> - modify the rt-fulltext-indexer script to skip tickets that have that
> issue and continue indexing other tickets. or

rt-fulltext-indexer currently iterates over the content of every
attachment and updates the tsvectors one at a time; as such, modifying
it to trap the update with an eval {} block and continue for particular
error cases should be completely feasible.
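
The shape of that change, sketched in Python rather than in the
indexer's own Perl: try/except plays the role of eval {}.  Everything
named here is hypothetical.  In particular, index_attachment is a
stand-in for the real UPDATE, and its size check is a crude simulation
(bytes needed to list each distinct word once), not how PostgreSQL
computes tsvector size.

```python
MAX_TSVECTOR_BYTES = 1048575  # limit quoted in the PostgreSQL error


class TsvectorTooLong(Exception):
    """Stands in for the database error raised by the UPDATE."""


def index_attachment(attachment_id: int, content: str) -> None:
    """Hypothetical stand-in for the UPDATE ... to_tsvector($1) call."""
    # Crude size proxy: bytes needed to list each distinct word once.
    approx_size = sum(len(word) + 1 for word in set(content.split()))
    if approx_size > MAX_TSVECTOR_BYTES:
        raise TsvectorTooLong(
            f"attachment {attachment_id}: string is too long for tsvector"
        )


def index_all(attachments):
    """Index every (id, content) pair, skipping the ones that fail."""
    indexed, skipped = [], []
    for attachment_id, content in attachments:
        try:
            index_attachment(attachment_id, content)
        except TsvectorTooLong as err:
            # Trap only this specific error; anything else still aborts.
            print(f"skipping: {err}")
            skipped.append(attachment_id)
            continue
        indexed.append(attachment_id)
    return indexed, skipped


attachments = [
    (1, "a short, ordinary message"),
    (2, " ".join(str(n) for n in range(200000))),  # ~200K distinct tokens
    (3, "word " * 1000000),                        # long, but one distinct word
]
indexed, skipped = index_all(attachments)
print(indexed, skipped)
```

Catching only the one error class matters: a dropped connection or a
permissions problem should still stop the run rather than be silently
skipped for every remaining row.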

> - find out which ticket is causing the problem (hopefully only one)
> and maybe I can delete it before running the rt-fulltext-indexer
> script. or

As I noted above, I suspect the row in question is not actually textual
data, despite being marked text/plain.  Running with --debug may shed
some light on the contents which are at issue.
 - Alex