[rt-users] RT 4.0.2 postgresql fulltext - error doing initial indexing

fab junkmail fabjunkmail at gmail.com
Sun Sep 18 23:24:55 EDT 2011


Hi,

I am testing upgrading from RT 3.6.7 on mysql to RT 4.0.2 on
postgresql 8.4.8 (debian) and enabling full text search. The database
has more that 30000 tickets.

I have been able to convert the database to postgresql and update RT
to 4.0.2. I have done the set up for the full text search using
defaults except for selecting GIN index instead of the default.

When running "/opt/rt4/sbin/rt-fulltext-indexer --all" the indexer
runs for hours with the occasional minor warnings as follows:

[Thu Sep 15 01:56:27 2011] [warning]: NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
(/opt/rt4/sbin/rt-fulltext-indexer:370)

I think that is ok.

However the indexer eventually fails/stops with this error:

[Mon Sep 19 02:08:28 2011] [warning]: DBD::Pg::db do failed: ERROR:
string is too long for tsvector (3831236 bytes, max 1048575 bytes) at
/opt/rt4/sbin/rt-fulltext-indexer line 370.
(/opt/rt4/sbin/rt-fulltext-indexer:370)
[Mon Sep 19 02:08:28 2011] [crit]: DBD::Pg::db do failed: ERROR:
string is too long for tsvector (3831236 bytes, max 1048575 bytes) at
/opt/rt4/sbin/rt-fulltext-indexer line 370.
(/opt/rt4/sbin/../lib/RT.pm:340)
DBD::Pg::db do failed: ERROR:  string is too long for tsvector
(3831236 bytes, max 1048575 bytes) at
/opt/rt4/sbin/rt-fulltext-indexer line 370.

related postgresql log:

2011-09-19 02:08:28 UTC ERROR:  string is too long for tsvector
(3831236 bytes, max 1048575 bytes)
2011-09-19 02:08:28 UTC STATEMENT:  UPDATE Attachments SET
ContentIndex = to_tsvector($1) WHERE id = $2


I think it is getting to a ticket that has too many unique words so it
can't index it and it critically fails and stops indexing any further.

If I run "/opt/rt4/sbin/rt-fulltext-indexer --all" again, it stops
with that error again quite quickly (after about a minute).

I tested doing some content searches for a particular word (using full
text search) and not all tickets that contain that word are found.
With full text turned off (content search still enabled) all tickets
with that word are found.

So I think the indexer has not been able to complete indexing all
tickets as it is getting a problem with one ticket that halts the
indexing.

I notice on an older wiki page for PostgreSQLFullText for before RT
had fulltext indexing built in
http://requesttracker.wikia.com/wiki/PostgreSQLFullText they mention
limiting the size of tsvector that will attempt to index: "So I am
adding the tsvectors only to those entries with a size < 500KB"

I think maybe this has not been implemented for rt 4.0.2?

I would appreciate some advice on how I can proceed with getting the
rest of my data indexed. I think any of the following would be
suitable but I don't know how to implement them (I am not a coder or a
dba) and could use some help. Options:

- modify the rt-fulltext-indexer script to truncate strings that are
"too long for tsvector". or
- modify the rt-fulltext-indexer script to skip tickets that have that
issue and continue indexing other tickets. or
- find out which ticket is causing the problem (hopefully only one)
and maybe I can delete it before running the rt-fulltext-indexer
script. or
- maybe someone can suggest a better alternative

Please help.

Regards,
Anthony



More information about the rt-users mailing list