[rt-users] Data corruption with DBD::Pg 3.3.0
Alex Vandiver
alexmv at bestpractical.com
Mon Aug 11 18:15:50 EDT 2014
On 08/11/2014 05:23 PM, Dominic Hargreaves wrote:
> I don't see any sign of a bug against DBD::Pg in the CPAN bugtracker,
> and Debian now has 3.3.0 in unstable and testing.
The bug isn't DBD::Pg's fault -- which is why we haven't reported
anything -- but rather a case of it becoming _more_ correct, while
lurking code in the bowels of DBIx::SearchBuilder was incorrect all
along and now interacts poorly with it. Specifically:
https://github.com/bestpractical/dbix-searchbuilder/blob/master/lib/DBIx/SearchBuilder/Handle.pm#L577-L579
...which takes characters that we're trying to insert into the database
and encodes them in UTF-8[1] -- which is then _double_ encoded when
DBD::Pg 3.3.0 realizes that the database column is textual. Prior to
3.3.0, it accepted bytes and inserted bytes, which we would later read
out as characters. Now, it accepts bytes and attempts to insert them as
character codepoints, so that the data round-trips and we get the same
character codepoints out. That is more correct, as 3.2.1 relied on the
"UTF-8" flag to guess whether the incoming data was codepoints or bytes,
which was a false premise.
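As a rough illustration of the double encoding described above, here is
a minimal sketch in Python rather than Perl. It is not the code from
DBIx::SearchBuilder or DBD::Pg; the variable names and the sample string
are assumptions chosen only to show the effect:

```python
# Illustrative stand-in only: Python, not the actual Perl code paths.
text = "café"  # the characters the application wants to store

# Step 1: the application encodes characters to UTF-8 bytes too early.
encoded_once = text.encode("utf-8")

# Step 2: a driver that treats each incoming byte as a character
# codepoint (modelled here with a latin-1 decode) encodes it again.
as_codepoints = encoded_once.decode("latin-1")
double_encoded = as_codepoints.encode("utf-8")

# Reading the column back as UTF-8 text yields mojibake, not "café".
read_back = double_encoded.decode("utf-8")
print(repr(read_back))
```

The single "é" ends up stored as two characters, because each of its two
UTF-8 bytes was treated as a codepoint in its own right.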
Those lines are, unfortunately, only part of the problem. Other places
exist in RT which blindly pass bytes (not characters) to textual
columns, which need to be resolved in order for RT to work properly with
DBD::Pg. In other words, the internals of RT are riddled with places
that make the same false assumptions about the "UTF-8" flag as DBD::Pg
3.2.1 did, which mostly canceled each other out.
> Could you say a bit more about the problem and what plans there are
> to fix/workaround it for RT? Forcing a lower version of DBD::Pg isn't
> a practical option in a packaged environment like Debian.
I've pushed https://github.com/bestpractical/rt/tree/4.0/utf8-reckoning
which addresses the deeper issues needed for RT to work. It is
currently in review, and will be merged as quickly as a branch of
that size can be. It passes all tests on both versions of DBD::Pg, but
further testing (carefully, as it might cause data corruption with
non-ASCII characters) would be appreciated.
> This is a pretty serious issue.
Fixing this is indeed high priority for us, as mostly-unrecoverable data
corruption is never a good thing. Once the branch gets merged, I expect
we'll roll release candidates in short order.
- Alex
[1] This is a slight lie, due to perl internals. In some rare cases,
for strings which contain only codepoints which exist in ISO-8859-1, it
instead encodes them in ISO-8859-1 before treating those bytes as
codepoints and double-encoding in UTF-8, for all of your mojibake needs.
Wonderful, no?
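
Reading footnote [1] literally, the same character can reach the
database as different bytes depending on how the string happened to be
stored internally. The sketch below models that in Python; the function
name and the `stored_as_latin1` parameter are hypothetical, and this is
a simplification, not a description of perl's actual internals:

```python
# Hypothetical model of footnote [1]; names are invented for illustration.
def bytes_reaching_database(text: str, stored_as_latin1: bool) -> bytes:
    # The application flattens characters to bytes, in UTF-8 or, for
    # some strings limited to ISO-8859-1 codepoints, in ISO-8859-1.
    raw = text.encode("latin-1" if stored_as_latin1 else "utf-8")
    # The driver then reads each byte as a codepoint and encodes in UTF-8.
    return raw.decode("latin-1").encode("utf-8")

utf8_path = bytes_reaching_database("é", stored_as_latin1=False)
latin1_path = bytes_reaching_database("é", stored_as_latin1=True)
print(utf8_path, latin1_path)
```

In this model the two paths disagree for the same input, which is why
the resulting corruption is inconsistent and hard to undo mechanically.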