[rt-users] Storing messages containing invalid encodings

Kenneth Marshall ktm at rice.edu
Tue Jun 15 13:41:54 EDT 2010


On Tue, Jun 15, 2010 at 06:24:13PM +0100, Dominic Hargreaves wrote:
> Hello,
> 
> We have found that messages from one particular sender are declared
> as being in a UTF8 encoding, but contain byte sequences which are not
> valid in UTF8; in particular '0xb2', '0xb3', '0xb9' - they appear to
> relate to particularly brain-dead renderings of various quotation 
> marks: <http://www.memoryhole.net/kyle/2007/08/superscriptone.html>
> (although that page doesn't cover the extra breakage of inserting
> those particular bytes into a UTF8 encoded document).
> 
> With PostgreSQL at least, the attachments are stored internally as
> unicode characters, so PostgreSQL not unreasonably refuses to store such
> an attachment. Of course, it's then impossible to create a ticket.
> 
> In an ideal world, the correspondent would receive the error message,
> enquire further, be told why his/her message wasn't usable, and fix
> his/her software.
> 
> In practice, this is unlikely to happen in this particular case and the
> messages are considered of high value to the organisation.
> 
> So, what to do? I've thought of four possibilities:
> 
> One: validate all data received via RT and pass it out to a
> heuristic routine which would substitute all invalid characters by some
> number of U+FFFD characters before storing the message. This might be
> controversial behaviour if the expectation is that RT stores what was
> supplied to it.
> 
> An alternative approach would be to alter the database schema to allow
> for an attachment with unknown or invalid encoding; the binary data
> would be stored unmodified, and the web interface would offer for
> download the raw data for interpreting at the user's whim.
> 
> A third approach might involve filtering the incoming message outside of
> RT; this might be the most practical way to achieve the behaviour we
> desire, especially since it could be easily contained to individual queues.
> 
> Yet another acceptable workaround might be a much smaller modification
> to notify the queue owners that a message failed to be stored, as well
> as the correspondent.
> 
> Our logs indicate we've had 9 such occurrences (although some may relate
> to a separate UTF8 related bug fixed in 3.8.8 which we've only just
> installed) over 37,000 tickets so it's not a particularly common problem.
> 
> I would be interested to hear of anyone else encountering this issue,
> and any work taken to improve the situation for the unfortunate
> recipient of highly important garbage emails. When it comes down to both
> user expectations, and the oft-quoted principle of being liberal in
> what one accepts, there is clearly some room for improvement here.
> 
> Cheers,
> Dominic.
> 
> -- 
> Dominic Hargreaves, Systems Development and Support Team
> Computing Services, University of Oxford


Hi Dominic,

I would choose any approach that keeps bad data, in this case invalid
UTF-8, out of the database. Is it possible to reroute bad attachments
to separate storage for review by responsible parties, ideally before
they reach RT? That could be some sort of bad-data quarantine, similar
to anti-spam quarantines. Alternatively, RT could sanitize the data
automatically when needed, using iconv, and note that in the ticket somehow.
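
To make the sanitize-and-note idea concrete, here is a rough sketch in
Python. It is not RT code and does not use any RT API. The function names
(find_invalid_utf8, triage) and the wording of the note are made up for
illustration, and Python's errors="replace" decoding stands in for the
iconv step. It assumes the filter sees the raw attachment bytes before
they reach the database.

```python
from typing import List, Tuple


def find_invalid_utf8(raw: bytes) -> List[int]:
    """Return the byte offsets at which strict UTF-8 decoding fails."""
    offsets = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")  # strict decoding is the default
            break                      # the remainder is valid
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded
            offsets.append(pos + exc.start)
            pos += exc.end
    return offsets


def triage(raw: bytes) -> Tuple[str, str, str]:
    """Hypothetical pre-RT filter: returns (action, text, note).

    action is "store" for clean input and "sanitized" when invalid
    sequences were replaced with U+FFFD. The note is meant to be
    attached to the ticket so the substitution is visible.
    """
    bad = find_invalid_utf8(raw)
    if not bad:
        return "store", raw.decode("utf-8"), ""
    text = raw.decode("utf-8", errors="replace")
    note = ("%d invalid UTF-8 byte sequence(s) replaced with U+FFFD; "
            "first at byte offset %d" % (len(bad), bad[0]))
    return "sanitized", text, note


if __name__ == "__main__":
    # 0xb2 and 0xb3 are bare continuation bytes, so they are not valid here
    print(triage(b"He said \xb2hello\xb3"))
```

A quarantine variant would keep the original bytes somewhere outside the
database whenever the action is not "store", so that someone can review
exactly what the sender submitted.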

Regards,
Ken


