[rt-users] Re: [rt-devel] Patch for RT 3.0.3 attachment conversion problem (2)

Thu Jun 26 12:27:31 EDT 2003

On Thu, Jun 26, 2003 at 08:09:10PM +0900, Dan Kogai wrote:
> But one thing you should be careful about is that a guessed encoding is,
> after all, just a guess.  You should not rely too much upon it.  If you
> have an alternate way to tell the encoding explicitly, use that instead.

Advice very well taken.  Since these are MIME entities we're talking
about, RT will use all available hints (content-type.charset, etc.)
before falling back to Guess.
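
A minimal sketch of that order of preference, not RT's actual code:
the helper name pick_encoding and its argument layout are made up for
illustration, and only the declared charset is consulted as a hint.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use Encode::Guess;

# Hypothetical helper: trust an explicitly declared charset first and
# only guess when it is missing or unknown to Encode.
sub pick_encoding {
    my ( $octets, $declared, @suspects ) = @_;

    if ( defined $declared and length $declared ) {
        my $enc = find_encoding($declared);
        return $enc if $enc;
    }

    # guess_encoding returns an encoding object on success and a plain
    # error string (or nothing) on failure, so test with ref().
    my $guess = guess_encoding( $octets, @suspects );
    return ref $guess ? $guess : undef;
}
```

A caller would pass the raw body plus whatever charset the MIME
headers declared, and treat an undef result as "leave the octets
alone".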

> >Cc'ing Kogai-san to try finding a solution.  Kogai-san, can we
> >somehow disable this helpful guessing of "\x00", via a
> >$Encode::Guess::NoUTF32Guessing control variable or something?
> 
> That's possible.  Though the name should be NoUTF1632 (horrible but
> more accurate) or something because it guesses not only UTF-32 (which
> is hardly ever used for the time being) but also UTF-16.

I'd say $NoUTFAutoGuess is the right name, since it should eliminate
all unrequested guessing of this kind.
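
A short usage sketch, assuming an Encode::Guess that already carries
this variable (on an unpatched module the assignment has no effect).
The sample octets are made up: ASCII letters with embedded NUL bytes,
which is the case that triggers the UTF-16/32 heuristic.

```perl
use strict;
use warnings;
use Encode::Guess;

# Illustrative input only: ASCII with embedded NUL bytes.
my $octets = "a\x00b\x00";

# Default behaviour: the NUL bytes invite a UTF-16/32 guess.
my $default = guess_encoding($octets);

# With the flag set (scoped via local), only the suspects and ascii
# are considered.
my $restricted = do {
    local $Encode::Guess::NoUTFAutoGuess = 1;
    guess_encoding($octets);
};

# A successful guess is an encoding object; a failure is an error string.
for my $pair ( [ default => $default ], [ restricted => $restricted ] ) {
    my ( $label, $guess ) = @$pair;
    printf "%-10s %s\n", $label, ref $guess ? $guess->name : "failed: $guess";
}
```

Using local keeps the setting from leaking into other callers of
guess_encoding in the same process.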

Code and POD patch as below, against 1.08. :-)

Thanks,
/Autrijus/

--- Guess.pm.orig	Fri Jun 27 00:17:48 2003
+++ Guess.pm	Fri Jun 27 00:25:33 2003
@@ -18,6 +18,7 @@
 sub perlio_ok { 0 }
 
 our @EXPORT = qw(guess_encoding);
+our $NoUTFAutoGuess = 0;
 
 sub import { # Exporter not used so we do it on our own
     my $callpkg = caller;
@@ -70,22 +71,27 @@
     return unless defined $octet and length $octet;
 
     # cheat 0: utf8 flag;
-    Encode::is_utf8($octet) and return find_encoding('utf8');
+    if ( Encode::is_utf8($octet) ) {
+	return find_encoding('utf8') if !$NoUTFAutoGuess;
+	Encode::_utf8_off($octet);
+    }
     # cheat 1: BOM
     use Encode::Unicode;
-    my $BOM = unpack('n', $octet);
-    return find_encoding('UTF-16') 
-	if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
-    $BOM = unpack('N', $octet);
-    return find_encoding('UTF-32') 
-	if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe0000));
+    if (!$NoUTFAutoGuess) {
+	my $BOM = unpack('n', $octet);
+	return find_encoding('UTF-16') 
+	    if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
+	$BOM = unpack('N', $octet);
+	return find_encoding('UTF-32') 
+	    if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe0000));
+    }
     my %try =  %{$obj->{Suspects}};
     for my $c (@_){
 	my $e = find_encoding($c) or die "Unknown encoding: $c";
 	$try{$e->name} = $e;
 	$DEBUG and warn "Added: ", $e->name;
     }
-    if ($octet =~ /\x00/o){ # if \x00 found, we assume UTF-(16|32)(BE|LE)
+    if (!$NoUTFAutoGuess and $octet =~ /\x00/o){ # if \x00 found, we assume UTF-(16|32)(BE|LE)
 	my $utf;
 	my ($be, $le) = (0, 0);
 	if ($octet =~ /\x00\x00/o){ # UTF-32(BE|LE) assumed
@@ -188,6 +194,10 @@
 
  # tries all major Japanese Encodings as well
   use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
+
+If the C<$Encode::Guess::NoUTFAutoGuess> variable is set to a true
+value, no heuristics will be applied to UTF8/16/32, and the result
+will be limited to the suspects and C<ascii>.
 
 =over 4
 