[rt-users] Option to store attachments on the filesystem
ktm at rice.edu
Thu Dec 22 16:16:46 EST 2011
On Thu, Dec 22, 2011 at 08:42:43PM +0000, Geoff Mayes wrote:
> Hi Kevin and Joe,
>
> Joe -- that's exactly what I did at a previous company where we served 30TB of Bugzilla attachments and it worked very well. I plan to do that with RT too.
>
> Kevin -- thanks so much for your detailed response. Yeah, definitely, I think this is an important conversation to have and I am so happy that others are weighing in.
>
> > As far as the assertion that "a lot of folks would benefit from this
> > feature", I doubt that would be the case for the vast majority of RT
> > users. Most users can handle "one-stop-shopping" type applications
> > with far fewer problems.
>
> Yes, you're right. I was a bit overzealous in claiming that "a lot of folks would benefit from this feature." By "a lot" I did not mean the majority but instead that a not insignificant number of RT users would be interested in this feature (per the previous list postings I referenced).
>
> > Once you divorce the metadata repository
> > from the actual ticket data, you add a whole slew of different failure
> > modes that will require much more sophisticated administration processes
> > to prevent, ameliorate, or recover from.
>
> I don't think this is the case, especially if the feature is implemented well. RT only needs to deal with filesystem attachments on the "Create Ticket" and "Display Ticket" pages, where users upload or view attachments. When displaying a ticket, RT could use a 5-second timeout when accessing the local filesystem (fetched over AJAX so the rest of the page still loads quickly, entirely from the database); on a timeout or failure, RT would log the issue (and optionally email the admin) and display a user-friendly message in the same attachments UI area stating that the attachments could not be displayed.
>
I may be mistaken, but I thought that all ticket content is currently stored
as an attachment in the DB, not just the files uploaded via the Create Ticket or
Display Ticket screens. Perhaps a size threshold could be used to push larger
attachments (for some value of large) to a filesystem store, keeping only a
meta-link in the Attachments table. That would still allow searching and indexing
of the smaller updates while keeping the DB from being bogged down by the larger items.
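To make that concrete, here is a rough sketch (plain Perl, not RT code; the
threshold, path layout, and helper name are all made up) of what such a split
might look like:

    # Hypothetical helper: decide where an attachment body goes. Bodies over
    # the (made-up) threshold are written to a file store and only a meta-link
    # is kept in the Attachments table; small ones stay in the DB so they
    # remain searchable and indexable.
    use strict;
    use warnings;
    use File::Spec;
    use Digest::SHA qw(sha256_hex);

    my $THRESHOLD = 10 * 1024 * 1024;       # 10 MB -- pick your own value of "large"
    my $FILESTORE = '/var/rt/attachments';  # hypothetical local disk or NFS mount

    sub store_attachment_body {
        my ($content) = @_;
        return ('db', $content) if length($content) <= $THRESHOLD;

        my $digest = sha256_hex($content);
        my $path   = File::Spec->catfile($FILESTORE, substr($digest, 0, 2), $digest);
        # ... write $content to $path, creating directories and handling errors ...
        return ('file', "file://$path");    # the meta-link stored in the DB row
    }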
> I envision searching of attachments being disabled in RT if the user has chosen to store attachments on the filesystem, and this would of course be well-documented for those choosing to install RT with filesystem attachments. RT could still keep the Attachments table and store metadata about files in that table when a user uploads a file, but this isn't necessary and I'd let Best Practical decide. Keeping attachments tied to the Transactions table isn't necessary either, but it fits with the design of RT and would be nice to have. Again, all of these are internal RT implementation details, and depending on how well it is done, the local administrator could have an incredibly simple and easy experience if they chose to store attachments on the filesystem instead of in the database.
>
> The local filesystem or fileserver itself is, granted, an additional piece to manage, but it is an incredibly simple piece. I would much rather manage files on a filesystem than files in a database when the database could be 1.6GB instead of 15GB. And there are so many free (as well as expensive) ways to manage data (e.g. NFS, rsync, RAID, high-end redundant SANs, etc.). Most importantly, the binary data is separated from the textual data, and separate backup schedules and schemes are then possible. Yay for modular design! As I mentioned, I previously administered 30TB of attachments over NFS for a different tracking system and it worked very well.
>
> > Your reference to leveraging an existing SAN+SAN management team gives a hint to the increase in
> > both complexity and cost of running an instance.
>
> This cost is up to the user, depending on how secure and robust they would like their data to be. RT would only provide the option of local-filesystem or database attachment storage; the user can then decide whether the attachments are served from a cheap local SATA disk or from a super-expensive (or cheap) locally-mounted fileserver.
>
> > There are a wide range of RT users from systems that manage a handful
> > of tickets a week all the way to systems handling thousands of tickets
> > or more a week. Those on the small end can/should use whatever DB
> > backend that they are familiar with to simplify administration and
> > the "what did I do?!" errors due to a lack of familiarity.
>
> Totally agree with this. An option to store attachments on the filesystem, however, is database-agnostic, so RT admins can select this option with MySQL, Oracle, Postgres, SQLite, etc.
>
> > As you move towards larger implementations, your DB backend needs to be
> > chosen based on its viability in an enterprise/large-scale environment.
> > I do not know the level of your local MySQL expertise and I am certainly
> > not a MySQL expert, but a 15GB database does not strike me as particularly
> > large, by any metric. Maybe you would benefit by changing your backend DB
> > to something that scales better. I know that other DBs support tablespaces
> > that can allow you to move certain tables to different filesystems/locations
> > to provide for more parallel partitioning across more I/O resources.
>
> Our desire to store attachments outside of the database has, at this point, little to do with application performance and everything to do with backups, disaster recovery, upgrades, and downtime. That said, I do know that there are big performance gains to be had from storing attachments outside of the database. Check out one Bugzilla core developer's discussion of this issue and their work-in-progress implementation: https://bugzilla.mozilla.org/show_bug.cgi?id=577532. So moving attachments out of the database *is* an actual tuning option, just like the other options you mentioned. Why do something drastic like changing the database backend or performing complicated, expert-level tuning/sharding/partitioning, when I could just add a few config options to RT_SiteConfig.pm and run a script (for a pre-existing instance) that sets up my instance to serve attachments from the filesystem instead of the DB?
>
I am not against having the option at all. I think we need to have good
documentation on the trade-offs or strengths and weaknesses of each approach.
I do think that the more people count on the ticket system as a resource and
expect it to be continuously available, the larger the consequences of adding
additional "moving parts" to the system. Having the ticket metadata available
without the actual data is useless in many environments, and currently, in
general, all ticket content is stored as an attachment. It is very easy to
trivialize the issues involved in managing a filestore, whether it lives in a
SQL DB or on a filesystem (or other data store). When reliability and
availability are important, many measures need to be taken to ensure access to
all of the metadata and data and to handle business continuity and disaster
recovery.
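For what it's worth, if BP did add the option, the admin-facing surface Geoff
describes could stay quite small. Something along these lines in
RT_SiteConfig.pm -- note that none of these option names exist in RT today;
they are purely illustrative:

    # Hypothetical RT_SiteConfig.pm options -- not part of RT; shown only to
    # illustrate how small the configuration for such a feature could be.
    Set($AttachmentStorage,          'filesystem');           # or 'database' (current behaviour)
    Set($AttachmentStoragePath,      '/var/rt/attachments');  # local disk or NFS mount
    Set($AttachmentStorageThreshold, 10 * 1024 * 1024);       # only push bodies over 10 MB out of the DB
    Set($AttachmentStorageTimeout,   5);                      # seconds before the UI gives up and logs/mails the admin

The hard part, as noted above, is everything behind those options: consistency
between the DB and the file store, backups of both, and recovery when one of
them is unavailable.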
> Here's one recent example of how our current database size is negatively impacting us: we upgraded from 3.8.4 to 4.0.4 yesterday and it took almost an hour to dump our database and almost an hour to import it (we were upgrading MySQL and the OSes as well). And then we had to import it again because max_allowed_packet was set too small (which wouldn't have been a problem if attachments were outside the DB: an anecdotal rather than logical argument, but nonetheless a real-world occurrence, as errors happen), so add another hour instead of only another 10 minutes. If attachments were stored outside of the database, we could have reduced just the backup and import phases from 3 hours to 20 minutes. That is a huge difference, especially when your application is used by thousands of customers waiting to log back in. The positive ramifications continue: internal development of RT is much faster with a small database because we can copy it around the network faster, perform imports in 1/10th the time, and keep our development database up to date much more easily.
>
> If someone knows of a simpler way to cut the dump and restore times to a tenth, I would love to hear it, and I am totally open to a different solution.
>
As an example, with a different DB product you can replicate the backend to
a new instance and keep it in sync until the upgrade. Then the time to copy
the DB is effectively zero, because the work was done outside the critical
path of the upgrade.
> The main point I would like to restate is that larger or quickly-growing instances of RT are very different from smaller or slowly-growing instances. One pain point for the larger instances is the size of the database and how that affects backups, restores, disaster recovery, and development. Having the option to store attachments outside of the database allows the larger RT instances to more easily manage their data over a much longer period of time. Most importantly for the Best Practical folks, this option increases the appeal of RT to larger organizations, beyond the "small- to medium-sized" market described at http://requesttracker.wikia.com/wiki/ManualIntroduction. The addition of this feature, along with the recent SphinxSE option, truly makes RT more feasible and attractive to larger organizations.
I think that the option to store attachments out-of-band would be useful
in some situations and I would like to get some feedback from BP on that.
As far as the backup situation goes, a simple way to drop your backup time,
albeit with an increase in complexity, is to back up the Attachments table
and other large content tables using an incremental/differential process
rather than the "dump-it-all" approach. In other words, dump only the new
items every day. This will dramatically reduce your backup times as well
as simplify keeping test DBs up to date. Just a couple of ideas.
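For example, assuming MySQL, an Attachments table whose id column only ever
grows, and credentials already handled via ~/.my.cnf, a rough sketch of such
an incremental dump could look like this (the database name and paths are
made up):

    #!/usr/bin/perl
    # Sketch of an incremental dump of the Attachments table, using the
    # autoincrement id as a high-water mark. The DB name and paths are
    # hypothetical; periodic full dumps are still needed.
    use strict;
    use warnings;

    my $db         = 'rt4';                                  # hypothetical database name
    my $state_file = '/var/backups/rt/attachments.last_id';  # hypothetical path

    # Read the id we stopped at last time (0 on the first run).
    my $last_id = 0;
    if (open my $in, '<', $state_file) {
        chomp($last_id = <$in> // 0);
    }

    # Grab the current high-water mark first so the dump covers a fixed range.
    my $max_id = qx(mysql -N -B -e "SELECT COALESCE(MAX(id),0) FROM Attachments" $db);
    die "could not query $db\n" if $? != 0;
    chomp $max_id;
    exit 0 if $max_id <= $last_id;    # nothing new since the last run

    # Dump only the new rows; --no-create-info keeps the schema out of the
    # incremental file so it can be replayed on top of a full dump.
    my $out = "/var/backups/rt/attachments-incr-$max_id.sql";
    system('mysqldump', '--single-transaction', '--no-create-info',
           "--where=id > $last_id AND id <= $max_id",
           "--result-file=$out", $db, 'Attachments') == 0
        or die "mysqldump failed: $?\n";

    # Record the new high-water mark for the next run.
    open my $st, '>', $state_file or die "cannot write $state_file: $!\n";
    print {$st} "$max_id\n";

The incremental files then get replayed on top of an occasional full dump;
everything else stays in your normal nightly dump.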
Regards,
Ken