Re: HTML::Parser modifies unicode characters: msg#00024
lang.perl.modules.lwp
Moshe Kaminsky wrote:
* Dominic Mitchell <dom@xxxxxxxxxxxxx> [13/09/04 12:05]:
Moshe Kaminsky wrote:
* Dominic Mitchell <dom@xxxxxxxxxxxxx> [12/09/04 01:53]:
Moshe Kaminsky wrote:
It appears that HTML::Parser modifies some unicode characters while
parsing. The following program gives an example:
#########
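#!/usr/bin/perl
use HTML::Parser;
use utf8;    # the source file itself is UTF-8
open TEST, '>:utf8', 'word.txt' or die "Cannot open word.txt: $!";
my $p = HTML::Parser->new(text_h => [sub {print TEST shift}, 'text']);
$p->parse("zespołów\n");
$p->eof;     # flush any text the parser is still buffering
close TEST;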
#########
After running it, 'word.txt' contains "zespołów" with the funny l and
the funny o following it transformed to something else. What am I doing
wrong?
I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.
It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a
nasty tendency to do this. :(
Thankfully the workaround is fairly simple. Add "use Encode" to the top
of the script, and change the callback slightly:
sub { print TEST decode_utf8(shift) }
seems to work ok here.
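
A slightly more defensive version of that callback can be sketched as
follows. The helper name to_characters is made up for illustration, and
the sketch assumes that any unflagged text reaching the handler is
UTF-8. It decodes only when the flag is missing, so it should also be
harmless with a parser that already hands over character strings. The
HTML::Parser part is skipped if the module is not installed.

```perl
use strict;
use warnings;
use utf8;                        # string literals below are UTF-8
use Encode qw(decode_utf8);

# Hypothetical helper: return a character string whether the handler
# was given raw UTF-8 octets or an already-decoded string.
# Assumption: unflagged input is UTF-8, not Latin-1 or anything else.
sub to_characters {
    my ($text) = @_;
    return utf8::is_utf8($text) ? $text : decode_utf8($text);
}

my $collected = '';
if (eval { require HTML::Parser; 1 }) {
    my $p = HTML::Parser->new(
        text_h => [ sub { $collected .= to_characters(shift) }, 'text' ],
    );
    $p->parse("zespołów\n");
    $p->eof;                     # flush buffered text
}
```

In the original script the same helper would wrap the argument of the
print, i.e. sub { print TEST to_characters(shift) }.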
Thanks! That actually works. However, my real situation is that I'm
using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and
HTML::Parser. So to fix the problem, it appears that the only way is to
modify HTML::TreeBuilder? I wonder if the maintainers of HTML::Parser
are aware of this problem, and if so, why don't they do this
automatically (or at least add an option to do it automatically) before
giving the text to the handler?
Hmmm, it's a known problem:
http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS
Thanks. I must say, though, that the explanation there is quite vague. I
don't see myself deducing your solution from this statement.
It's more just guesswork, based on experience with Perl's Unicode. Most
problems come down to something or other losing the UTF-8 flag on a
scalar and are solved with the Encode module. Encode::is_utf8() is a
handy tool for checking whether this is happening.
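
To make that check concrete, here is a minimal sketch using
Encode::is_utf8(), the public name of the flag test. The byte string is
hand-written to stand in for what a handler might receive. It is not
output captured from HTML::Parser.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# "zespołów" as raw UTF-8 octets (ł = C5 82, ó = C3 B3); illustrative data.
my $octets = "zespo\xC5\x82\xC3\xB3w";
my $chars  = decode_utf8($octets);      # the same word as a character string

for my $pair ([octets => $octets], [chars => $chars]) {
    my ($name, $value) = @$pair;
    printf "%-6s flag=%d length=%d\n",
        $name, (Encode::is_utf8($value) ? 1 : 0), length $value;
}
# prints:
# octets flag=0 length=10
# chars  flag=1 length=8
```

The length difference is the giveaway: without the flag Perl counts
bytes, with it Perl counts characters.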
It doesn't look unsolvable, but it's slightly beyond my XS skills.
The key is indicating the character encoding of what you're parsing,
but that's sometimes difficult to determine in advance (think HTML
meta tags).
I know nothing about XS, unfortunately, but the way I imagine it is that
at some point, HTML::Parser calls the method given by text_h, passing
the text to it. So instead of just passing the text, I suggest that it
should pass decode_utf8 applied to the text. Alternatively, call a fixed
(usual perl) sub 'foo', giving it the value of text_h and the text, and
foo will apply decode_utf8 to the text and then pass the result to
text_h.
The trouble is that there's no guarantee that in the general case, the
input will always be UTF-8. At some point in all this, the input
character encoding needs to be specified. Only from that can the
appropriate action be taken.
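
As a rough illustration of that point, the sketch below decodes with an
explicitly named encoding instead of assuming UTF-8. The function name
decode_input and its 'UTF-8' fallback are hypothetical. In practice the
charset would have to come from somewhere like an HTTP header or a meta
tag, which this sketch does not attempt to find.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw octets into a character string using a
# caller-supplied charset. The 'UTF-8' default is only an assumption.
sub decode_input {
    my ($octets, $charset) = @_;
    $charset = 'UTF-8' unless defined $charset;
    return decode($charset, $octets);
}

# The same word in two encodings yields the same character string.
my $from_utf8   = decode_input("zespo\xC5\x82\xC3\xB3w");
my $from_latin2 = decode_input("zespo\xB3\xF3w", 'iso-8859-2');
print "same text\n" if $from_utf8 eq $from_latin2;
```

Feeding the ISO-8859-2 bytes through decode_utf8 instead would not give
the right characters, which is why a blanket decode_utf8 inside the
parser is not a general fix.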
As to how to fix it via HTML::FormatText, I'm not sure. You'd need to
read through the code to find out what it's doing and fix at an
appropriate point.
I did it. It is in fact in HTML::TreeBuilder. The thing is that I'm
giving this code to people, so now I need to tell people to do this
change as well (and they might not have the right permission, might not
know perl, may have a different version of HTML::TreeBuilder ...)
But perhaps there is another way. Instead of writing out to a file,
can you write to an in-memory string? If so, that string would hold
UTF-8 bytes without the UTF-8 flag set. So you could fix that by running
"decode_utf8()" over that string before writing it to a file. Or
simply write that file out without any encoding layer, which would do no
transformation of the UTF-8 bytes.
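
A minimal sketch of the in-memory idea, assuming the collected output
really is unflagged UTF-8 (if it is already a character string,
decoding it again is the wrong move). The chunks are simulated by hand
and deliberately split in the middle of a character, to show why
decoding once at the end is safer than decoding each piece. The
variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $buffer  = '';
my $collect = sub { $buffer .= shift };    # stands in for a text handler

# Simulated chunks of UTF-8 octets; the first ends halfway through "ł".
$collect->($_) for "zespo\xC5", "\x82\xC3\xB3w\n";

# Decode once, over the complete buffer.
my $text = decode_utf8($buffer);

# Write through an encoding layer; an in-memory file keeps this self-contained.
my $written = '';
open my $fh, '>:encoding(UTF-8)', \$written or die "Cannot open: $!";
print {$fh} $text;
close $fh or die "Cannot close: $!";

print "round trip ok\n" if $written eq $buffer;
```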
In the real life example I'm not writing to a file at all, I just did it
in the example to make it easy to verify. But the usage is hidden inside
HTML::FormatText, which gives me a text formatting of the whole html
page. And if I try to use decode_utf8 on this result, I get other
gibberish (presumably because that result already is a perl string).
-Dom
--
| Semantico: creators of major online resources |
| URL: http://www.semantico.com/ |
| Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232 |
| Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |