Moshe Kaminsky wrote:
* Dominic Mitchell <dom@xxxxxxxxxxxxx> [12/09/04 01:53]:
Moshe Kaminsky wrote:
It appears that HTML::Parser modifies some unicode characters while
parsing. The following program gives an example:
#########
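#!/usr/bin/perl
use strict;
use warnings;
use utf8;           # the string literal below is UTF-8
use HTML::Parser;
# Write every text chunk to word.txt through a UTF-8 output layer.
open TEST, '>:utf8', 'word.txt' or die "Cannot open word.txt: $!";
my $p = HTML::Parser->new(text_h => [sub { print TEST shift }, 'text']);
$p->parse("zespołów\n");
$p->eof;            # flush any text still buffered by the parser
close TEST;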
#########
After running it, 'word.txt' contains "zespołów" with the funny l and
the funny o following it transformed to something else. What am I doing
wrong?
I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.
It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a
nasty tendency to do this. :(
Thankfully the workaround is fairly simple. Add "use Encode" to the top
of the script, and change the callback slightly:
sub { print TEST decode_utf8(shift) }
seems to work ok here.
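
To illustrate what "losing the UTF-8 flag" does to the output, here is a minimal sketch that uses only core modules. It assumes the text handler receives raw UTF-8 octets with the flag off, as described above; encode_utf8() stands in for the parser, and to_chars() is a hypothetical helper name, not part of HTML::Parser or Encode.

```perl
use strict;
use warnings;
use utf8;                                  # the literal below is UTF-8 source text
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: decode only when the flag is missing, so text that is
# already a character string passes through untouched.
sub to_chars {
    my ($text) = @_;
    return utf8::is_utf8($text) ? $text : decode_utf8($text);
}

my $chars  = "zespołów";                   # character string, UTF-8 flag on
my $octets = encode_utf8($chars);          # stand-in for what the handler receives

# Print both forms through a :utf8 layer into in-memory buffers.
my ($broken, $fixed) = ('', '');
open my $broken_fh, '>:utf8', \$broken or die "open: $!";
print {$broken_fh} $octets;                # high bytes get encoded a second time
close $broken_fh;

open my $fixed_fh, '>:utf8', \$fixed or die "open: $!";
print {$fixed_fh} to_chars($octets);       # decoded first, so encoded only once
close $fixed_fh;

printf "broken: %d bytes, fixed: %d bytes\n", length($broken), length($fixed);
```

In the original script the same guard could be dropped into the callback, e.g. sub { print TEST to_chars(shift) }, which should also stay harmless if the text ever arrives already flagged.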
Thanks! That actually works. However, my real situation is that I'm
using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and
HTML::Parser. So to fix the problem, it appears that the only way is to
modify HTML::TreeBuilder? I wonder if the maintainers of HTML::Parser
are aware of this problem, and if so, why don't they do this
automatically (or at least add an option to do it automatically) before
giving the text to the handler?
Hmmm, it's a known problem:
http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS
It doesn't look unsolvable, but it's slightly beyond my XS skills. The
key is indicating the character encoding of what you're parsing, but
that's sometimes difficult to determine in advance (think HTML meta tags).
As to how to fix it via HTML::FormatText, I'm not sure. You'd need to
read through the code to find out what it's doing and fix it at an
appropriate point. But perhaps there is another way. Instead of
writing out to a file, can you write to an in-memory string? If so,
then that string would hold UTF-8 bytes without the UTF-8 flag set. So
you could fix that by running decode_utf8() over the string before
writing it to a file. Or simply write the file out without any encoding
layer, which would leave the UTF-8 bytes untransformed.
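
A rough sketch of those two options, using only core modules. The formatter itself is not called here: $octets is an assumed stand-in for the string it would return (UTF-8 bytes, flag off), and the file names and slurp_raw() are made up for the illustration.

```perl
use strict;
use warnings;
use utf8;                                  # the literal below is UTF-8 source text
use Encode qw(encode_utf8 decode_utf8);
use File::Temp qw(tempdir);

# Stand-in for the formatter's output: UTF-8 octets with the flag off.
my $octets = encode_utf8("zespołów\n");

my $dir = tempdir(CLEANUP => 1);

# Option 1: decode to characters, then let the :utf8 layer encode on output.
open my $decoded_fh, '>:utf8', "$dir/decoded.txt" or die "open: $!";
print {$decoded_fh} decode_utf8($octets);
close $decoded_fh or die "close: $!";

# Option 2: write the octets unchanged through a raw handle (no encoding layer).
open my $raw_fh, '>:raw', "$dir/raw.txt" or die "open: $!";
print {$raw_fh} $octets;
close $raw_fh or die "close: $!";

# Hypothetical helper: read a file back as raw bytes for comparison.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    local $/;                              # slurp mode
    return scalar <$fh>;
}

my $same = slurp_raw("$dir/decoded.txt") eq slurp_raw("$dir/raw.txt");
print $same ? "both files hold the same bytes\n" : "files differ\n";
```

Either route should leave the same bytes on disk; the raw handle is the smaller change, while decoding first keeps the string usable as characters elsewhere in the program.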
-Dom
--
| Semantico: creators of major online resources |
| URL: http://www.semantico.com/ |
| Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232 |
| Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |