[rt-users] Large Queue size problems

Mon Apr 19 19:06:50 EDT 2010

Hi all,

Our company currently runs RT for customer support interactions as well 
as a central email abuse reporting system for customer IP blocks.  
Recently we setup a feedback system with a large hosted mail provider 
and we saw the level of incoming abuse/spam reports increase to 10s of 
thousands a day.  I have been trying to identify the source of an issue 
that essentially boils down to this:  When our RT queues are 'large' 
(over 100K tickets) the UI struggles to complete operations or consumes 
all system resources.

To mitigate the issues, we have been using rt-shredder to cull out the 
excess, but I have a backed up DB to test with.  What I have found is 
that on a particular type of search, the returned DB data set is so 
large the apache process handling the request consumes almost all 
available memory on the RT host, leading to swapping and/or a nasty 
segfault.

Our setup involves three hosts:

1 dedicated Gentoo based DB host running MySQL 5.0 with innodb based 
tables.  2G ram and 1 64bit quad core xeon running under VMWare vSphere 4.
2 load balanced Gentoo based apache servers running RT3.8.2 with the 
same proc/cpu specs as the DB host.

The magic search that overloads apache works as follows:

1) Click on our large queue from RT at a glance Quick Search.  The queue 
in question contains 184744 new, 7 open and 7731 stalled tickets in my 
dumped database.
2) Click on any ticket on any of the returned pages.

Apache then consumes so much memory that we have to kill the process at 
best or restart the server at worst.  In a browser, the ticket page 
often fails to load or may be partially completed before the host 
resources are exhausted.

The MySQL query also reveals that the last operation in this state 
returns a large chunk of data, and often pops up in the slow query log 
with an average execution time of 15  seconds.  My first thought was 
that we had an issue with out database.  However several days of testing 
indicated that this problem was directly related to apache/mod_perl 
having to drink from a firehose.

Here's the entry that always logs to the slow query log:

# Time: 100419 16:58:07
# User at Host: rtadmin[rtadmin] @ rt-test[172.20.0.99]
# Query_time: 15  Lock_time: 0  Rows_sent: 184751  Rows_examined: 377948
use rt3;
SELECT main.* FROM Tickets main  WHERE (main.Status != 'deleted') AND 
(main.Queue = '11' AND  ( main.Status = 'new' OR main.Status = 'open' ) 
) AND (main.EffectiveId = main.id) AND (main.Type = 'ticket')  ORDER BY 
main.id ASC;

Interestingly, searching for the specific ticket via the main page 
search box brings up the typed in ticket quickly and without incident.  
Another tidbit, is this appears to involve some level of caching.  If I 
follow the above steps, then kill the process and finally select another 
ticket NOT in the large queue (one off my own top 15 tickets) then the 
same behavior is observed.  Also, I see queries in the MySQL query log 
that include data related to the previous search.  I have performed a 
battery of tests stopping daemons, clearing mason cache, clearing 
browser cache and the like to figure all this out.

The one detail about our setup that I suspect plays a part here is that 
a previous admin wrote a series of email handling scripts that always 
re-writes the sender address before handing the email off to 
rt-mailgate.  We suspected at one point that part of the issue was 
related to the query that looks up other tickets created by the sender.  
An 'explain' in MySQL did show that the volume of data was forcing an on 
disk temp table and filesort, but I haven't directly correlated a slow 
DB operation to the consumption of memory on the apache side.  That 
email handling apparatus is currently being replaced, but does having 
the same 'Creator' on 100K + tickets sound like a really bad thing, or 
is this normal for large shops?

Can I get some feedback on how our system compares to others using RT?  
How many tickets do you collect in a day? What rough system specs are 
you running on?  Is this normal for large volumes of tickets?  Is the 
only answer ever more RAM?

Also, I updated the test rig to 3.8.7 from 3.8.2 today including all DB 
upgrade operations with no change.

Any assistance would be very much appreciated.  Our current game plan is 
to build an archiving system to keep the queue numbers down, but at some 
point large queues may be the norm for us.  Thanks.

-Nick Geron