[rt-users] RT 4.0.2 postgresql fulltext - error doing initial indexing
fab junkmail
fabjunkmail at gmail.com
Mon Sep 19 22:05:08 EDT 2011
Hi Alex,
Thank you very much for your responses. Notes inline.
On Tue, Sep 20, 2011 at 4:37 AM, Alex Vandiver <alexmv at bestpractical.com> wrote:
> On Mon, 2011-09-19 at 13:24 +1000, fab junkmail wrote:
>> 2011-09-19 02:08:28 UTC ERROR: string is too long for tsvector
>> (3831236 bytes, max 1048575 bytes)
>> 2011-09-19 02:08:28 UTC STATEMENT: UPDATE Attachments SET
>> ContentIndex = to_tsvector($1) WHERE id = $2
>>
>>
>> I think it is reaching a ticket that has too many unique words, so it
>> can't index it, fails critically, and stops indexing any further.
>
> You are correct that this is because the content of one of the
> attachments contains too many unique words (after removing stopwords and
> doing stemming). This is symptomatic of a pathological case -- for
> example, the entirety of "A Tale of Two Cities" (775K) creates a 121K
> tsvector and the entire corpus of the King James Bible (4.3M) creates a
> 160K tsvector. In contrast, the contents of my /usr/share/dict/words
> (916K) produces a 524K tsvector, because there is so little word
> repetition.
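(A note for the archives: the deduplication Alex describes is easy to
see for yourself. Below is a minimal Perl/DBI sketch; the DSN and
credentials are placeholders, not anything from this thread.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Placeholder connection details -- substitute your own RT database.
    my $dbh = DBI->connect('dbi:Pg:dbname=rt4', 'rt_user', 'secret',
                           { RaiseError => 1, AutoCommit => 1 });

    # Repeated words collapse into a single lexeme and stopwords ("the")
    # are dropped, which is why a large but repetitive corpus can still
    # yield a small tsvector.
    my ($tsv) = $dbh->selectrow_array(
        q{SELECT to_tsvector('english', 'the quick quick quick fox')::text}
    );
    print "$tsv\n";    # prints: 'fox':5 'quick':2,3,4
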
>
> Knowing what text/plain or text/html corpus you have in your database
> which is blowing so significantly past this limit (generating a 3.8M
> tsvector is impressive) would be interesting. I suspect the data in
> question is not actually textual data. If you re-run
> rt-fulltext-indexer with --debug, the last attachment number it prints
> will tell you which attachment is the problematic one.
I used the --debug option and was able to find the ticket that caused
the problem. The ticket has a 12MB, 125,000-line .txt attachment that
is a file-system listing with full paths, e.g.:
...
.//Accounts/Backup/New Folder
.//Accounts/Backup/PRM0124.ZIP
.//Accounts/Backup/PRM0206.ZIP
.//Accounts/Backup/PRM0207.ZIP
.//Accounts/Backup/PRM0214.1.ZIP
...
So in many cases it would have counted each whole line as a unique
word. That explains how the tsvector for this attachment got to over
3MB.
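As a sanity check, the default Postgres parser treats a whole path as a
single "file" token, so each distinct line really does become its own
lexeme. A minimal sketch (placeholder connection details again):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=rt4', 'rt_user', 'secret',
                           { RaiseError => 1, AutoCommit => 1 });

    # A whole path comes back as one "file" token, so a listing of
    # 125,000 distinct paths yields on the order of 125,000 lexemes.
    my ($tsv) = $dbh->selectrow_array(
        q{SELECT to_tsvector('english', '/Accounts/Backup/PRM0124.ZIP')::text}
    );
    print "$tsv\n";    # roughly: '/accounts/backup/prm0124.zip':1
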
>
>> I would appreciate some advice on how I can proceed with getting the
>> rest of my data indexed. I think any of the following would be
>> suitable but I don't know how to implement them (I am not a coder or a
>> dba) and could use some help. Options:
>>
>> - modify the rt-fulltext-indexer script to truncate strings that are
>> "too long for tsvector". or
>
> As pointed out above, long corpuses can generate perfectly reasonably
> sized tsvectors. Truncating your input strings before indexing will
> yield false negatives in perfectly reasonable text; as such, the change
> from the wiki will not be taken into core.
>
>> - modify the rt-fulltext-indexer script to skip tickets that have that
>> issue and continue indexing other tickets. or
>
> rt-fulltext-indexer currently iterates every attachment content and
> updates the tsvector one at a time; as such, modifying it to trap the
> update with an eval {} block and continue for particular error cases
> should be completely feasible.
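For the archives, here is roughly what that trap could look like. This
is only a sketch of the pattern Alex describes, not the actual
rt-fulltext-indexer code; $dbh, $id and $content stand in for whatever
the script really uses:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=rt4', 'rt_user', 'secret',
                           { RaiseError => 1, AutoCommit => 1 });

    # Stand-ins for the indexer's per-attachment loop variables.
    my ($id, $content) = (1234, 'attachment body ...');

    my $sth = $dbh->prepare(
        q{UPDATE Attachments SET ContentIndex = to_tsvector(?) WHERE id = ?}
    );

    # Trap the failure so one oversized tsvector doesn't abort the whole
    # run; with AutoCommit on, a failed statement doesn't poison later ones.
    eval { $sth->execute($content, $id) };
    if ($@) {
        warn "skipping attachment $id: $@";
    }
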
>
>> - find out which ticket is causing the problem (hopefully only one)
>> and maybe I can delete it before running the rt-fulltext-indexer
>> script. or
>
> As I noted above, I suspect the row in question is not actually textual
> data, despite being marked text/plain. Running with --debug may shed
> some light on the contents which are at issue.
> - Alex
>
>
I have deleted (using shredder) the problem ticket and I am re-running
"/opt/rt4/sbin/rt-fulltext-indexer-mod --all". It has been running for
a few minutes now, so it looks like it has got past the problem and is
working. If it hits another problem, I now know how to work around it.
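In case it helps the next person: one way to spot likely offenders
before re-running the indexer is to list the largest text attachments.
This is a sketch; Attachments, ContentType and Content are stock RT
column names, but check them against your own schema:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=rt4', 'rt_user', 'secret',
                           { RaiseError => 1, AutoCommit => 1 });

    # Very large text attachments with little word repetition (like my
    # file listing) are the likely tsvector offenders.
    my $rows = $dbh->selectall_arrayref(q{
        SELECT id, ContentType, octet_length(Content) AS bytes
          FROM Attachments
         WHERE ContentType IN ('text/plain', 'text/html')
      ORDER BY octet_length(Content) DESC
         LIMIT 10
    });
    printf "%d  %s  %d bytes\n", @$_ for @$rows;
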
Thanks very much for your help, Alex.
Regards,
Anthony