[rt-users] Storing messages containing invalid encodings

Dominic Hargreaves dominic.hargreaves at oucs.ox.ac.uk
Tue Jun 15 13:24:13 EDT 2010


Hello,

We have found that messages from one particular sender are declared
as being in a UTF8 encoding, but contain byte sequences which are not
valid in UTF8; in particular '0xb2', '0xb3', '0xb9' - they appear to
relate to particularly brain-dead renderings of various quotation 
marks: <http://www.memoryhole.net/kyle/2007/08/superscriptone.html>
(although that page doesn't cover the extra breakage of inserting
those particular bytes into a UTF8 encoded document).
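
To make the failure concrete, here is a minimal Python sketch (not RT code) showing that each of those three bytes is rejected by a strict UTF-8 decoder when it appears on its own. It also shows what the same bytes would mean under ISO-8859-1; that the sender's software started from a Latin-1-style encoding is only an assumption here. The helper name is_valid_utf8 is made up for illustration.

```python
# The three byte values reported in the message above.
SUSPECT_BYTES = [0xB2, 0xB3, 0xB9]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


for value in SUSPECT_BYTES:
    raw = bytes([value])
    # In UTF-8 these are continuation bytes, so they are invalid without
    # a preceding lead byte. In ISO-8859-1 they are superscript digits.
    print(hex(value), is_valid_utf8(raw), repr(raw.decode("latin-1")))
```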

With PostgreSQL at least, the attachments are stored internally as
Unicode characters, so PostgreSQL not unreasonably refuses to store such
an attachment. Of course, it's then impossible to create a ticket.

In an ideal world, the correspondent would receive the error message,
enquire further, be told why his/her message wasn't usable, and fix
his/her software.

In practice, this is unlikely to happen in this particular case and the
messages are considered of high value to the organisation.

So, what to do? I've thought of four possibilities:

One: validate all data received by RT and pass anything invalid to a
heuristic routine which would replace each invalid byte sequence with some
number of U+FFFD characters before storing the message. This might be
controversial behaviour if the expectation is that RT stores what was
supplied to it.
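
As a rough illustration of this first option, the sketch below uses Python's built-in "replace" error handler, which substitutes U+FFFD for undecodable bytes. It is not how RT handles input; the function name scrub_to_utf8 is hypothetical, and a real routine would need to decide how to record that the stored text differs from what was sent.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def scrub_to_utf8(raw: bytes) -> Tuple[str, int]:
    """Decode raw as UTF-8, replacing invalid sequences with U+FFFD.

    Returns the decoded text and the number of U+FFFD characters in it.
    Note that the count also includes any U+FFFD that was legitimately
    present in the input, so it is only an upper bound on the damage.
    """
    text = raw.decode("utf-8", errors="replace")
    return text, text.count(REPLACEMENT)


text, replaced = scrub_to_utf8(b"He said \xb2hello\xb3")
print(repr(text), replaced)
```

The returned string can be re-encoded as valid UTF-8, so a database that insists on well-formed text should accept it.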

An alternative approach would be to alter the database schema to allow
for an attachment with unknown or invalid encoding; the binary data
would be stored unmodified, and the web interface would offer for
download the raw data for interpreting at the user's whim.
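
A sketch of what this second option could look like, using SQLite from Python purely as a stand-in. The table and column names are invented for illustration and do not reflect RT's actual schema or PostgreSQL specifics: text that decodes cleanly goes into a text column, anything else is kept byte-for-byte in a binary column.

```python
import sqlite3

# Hypothetical schema, not RT's: one column for decoded text,
# one for raw bytes that could not be decoded.
SCHEMA = """
CREATE TABLE attachments (
    id INTEGER PRIMARY KEY,
    declared_charset TEXT,
    content_text TEXT,
    content_raw BLOB
)
"""


def store_attachment(conn, raw: bytes, declared_charset: str) -> int:
    """Store raw as text if it decodes cleanly, else as an unmodified blob."""
    try:
        text = raw.decode(declared_charset)
        blob = None
    except (UnicodeDecodeError, LookupError):
        # Invalid bytes or an unknown charset name: keep the original bytes.
        text = None
        blob = raw
    cur = conn.execute(
        "INSERT INTO attachments (declared_charset, content_text, content_raw)"
        " VALUES (?, ?, ?)",
        (declared_charset, text, blob),
    )
    conn.commit()
    return cur.lastrowid
```

A web interface could then show content_text when it is present and offer content_raw as a download otherwise.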

A third approach might involve filtering the incoming message outside of
RT; this might be the most practical way to achieve the behaviour we
desire, especially since it could be easily contained to individual queues.
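
For this third option, a filter could sit in the mail pipeline in front of RT. The sketch below uses Python's standard email package; the function name sanitize_message is hypothetical, and how the filter would be wired into a particular MTA or queue is left open. It only touches text parts that are declared as UTF-8 and fail strict decoding, and rewrites those with U+FFFD in place of the invalid bytes.

```python
import email
from email import policy


def sanitize_message(raw: bytes) -> bytes:
    """Return the message with invalid UTF-8 in text parts replaced by U+FFFD."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        charset = part.get_content_charset() or ""
        if charset not in ("utf-8", "utf8"):
            continue
        # decode=True undoes the Content-Transfer-Encoding and yields bytes.
        payload = part.get_payload(decode=True)
        try:
            payload.decode("utf-8")
        except UnicodeDecodeError:
            text = payload.decode("utf-8", errors="replace")
            # Rewrite only the damaged part; other parts stay as received.
            part.set_content(
                text, subtype=part.get_content_subtype(), charset="utf-8"
            )
    return msg.as_bytes()
```

Messages that are already valid pass through with their text parts untouched, which keeps the filter's effect limited to the rare broken cases.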

Yet another acceptable workaround might be a much smaller modification
to notify the queue owners that a message failed to be stored, as well
as the correspondent.

Our logs indicate we've had 9 such occurrences (although some may relate
to a separate UTF8-related bug fixed in 3.8.8, which we've only just
installed) out of 37,000 tickets, so it's not a particularly common problem.

I would be interested to hear of anyone else encountering this issue,
and any work taken to improve the situation for the unfortunate
recipient of highly important garbage emails. When it comes down to both
user expectations and the oft-quoted principle of being liberal in
what one accepts, there is clearly some room for improvement here.

Cheers,
Dominic.

-- 
Dominic Hargreaves, Systems Development and Support Team
Computing Services, University of Oxford