
Re: HTML::Parser modifies unicode characters: msg#00022

lang.perl.modules.lwp

Subject: Re: HTML::Parser modifies unicode characters

* Dominic Mitchell <dom@xxxxxxxxxxxxx> [13/09/04 12:05]:
> Moshe Kaminsky wrote:
>
> >* Dominic Mitchell <dom@xxxxxxxxxxxxx> [12/09/04 01:53]:
> >
> >>Moshe Kaminsky wrote:
> >>
> >>>It appears that HTML::Parser modifies some unicode characters while
> >>>parsing. The following program gives an example:
> >>>
> >>>#########
> >>>
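> >>>#!/usr/bin/perl
> >>>use strict;
> >>>use warnings;
> >>>use HTML::Parser;
> >>>use utf8;   # the string literal below is UTF-8 in the source
> >>>open TEST, '>:utf8', 'word.txt' or die "Cannot open word.txt: $!";
> >>>my $p = HTML::Parser->new(text_h => [sub { print TEST shift }, 'text']);
> >>>$p->parse("zespołów\n");
> >>>close TEST;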
> >>>
> >>>#########
> >>>
> >>>After running it, 'word.txt' contains "zespołów" with the funny l and
> >>>the funny o following it transformed to something else. What am I doing
> >>>wrong?
> >>>I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.
> >>
> >>It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a
> >>nasty tendency to do this. :(
> >>
> >>Thankfully the workaround is fairly simple. Add "use Encode" to the top
> >>of the script, and change the callback slightly:
> >>
> >> sub { print TEST decode_utf8(shift) }
> >>
> >>seems to work ok here.
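
The effect of a lost UTF-8 flag can be reproduced without HTML::Parser. The sketch below assumes Dominic's diagnosis is right, i.e. that the handler receives the UTF-8 octets of the text with the flag off. The helper name written_octets is made up for this illustration, and the in-memory file stands in for word.txt.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# "zespołów" written with escapes so the example does not depend on
# the encoding of this source file.
my $chars = "zespo\x{142}\x{f3}w";   # 8 characters
my $bytes = encode_utf8($chars);     # same text as UTF-8 octets, flag off

# Hypothetical helper: print a string through a :utf8 layer into an
# in-memory file and return the octets that would end up on disk.
sub written_octets {
    my ($string) = @_;
    my $out = '';
    open my $fh, '>:utf8', \$out or die "cannot open in-memory file: $!";
    print {$fh} $string;
    close $fh;
    return $out;
}

# Undecoded octets are treated as Latin-1 characters and encoded again.
my $broken = written_octets($bytes);

# Decoding first yields characters, which the layer encodes exactly once.
my $fixed = written_octets(decode_utf8($bytes));

printf "characters: %d, octets: %d\n", length $chars, length $bytes;
printf "undecoded output: %d octets, decoded output: %d octets\n",
    length $broken, length $fixed;
```

The undecoded output is longer than the input because every octet above 0x7F is encoded a second time, which is what shows up as garbage in place of the two accented letters.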
> >
> >
> >Thanks! That actually works. However, my real situation is that I'm
> >using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and
> >HTML::Parser. So to fix the problem, it appears that the only way is to
> >modify HTML::TreeBuilder? I wonder if the maintainers of HTML::Parser
> >are aware of this problem, and if so, why don't they do this
> >automatically (or at least add an option to do it automatically) before
> >giving the text to the handler?
>
> Hmmm, it's a known problem:
>
> http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS

Thanks. I must say, though, that the explanation there is quite vague. I
don't see myself deducing your solution from this statement.

>
> It doesn't look unsolvable, but it's slightly beyond my XS skills.
> The key is indicating the character encoding of what you're parsing,
> but that's sometimes difficult to determine in advance (think HTML
> meta tags).

I know nothing about XS, unfortunately, but the way I imagine it, at
some point HTML::Parser calls the handler given by text_h and passes
the text to it. So instead of passing the text as is, I suggest that it
pass decode_utf8 applied to the text. Alternatively, it could call a
fixed (ordinary Perl) sub 'foo', giving it the value of text_h and the
text; foo would apply decode_utf8 to the text and then pass the result
to text_h.
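
The 'foo' idea can be sketched in plain Perl as a wrapper around the handler. This is only an illustration: decoding_handler is a made-up name, not part of HTML::Parser, and the parser is replaced by a direct call that passes UTF-8 octets, which is what the XS code is assumed to do. It also assumes the input really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper (not part of HTML::Parser): wrap a text handler so
# that its first argument is decoded from UTF-8 octets before the real
# handler sees it. Remaining arguments are passed through unchanged.
sub decoding_handler {
    my ($handler) = @_;
    return sub {
        my ($text, @rest) = @_;
        return $handler->(decode_utf8($text), @rest);
    };
}

# Stand-in for the parser: call the wrapped handler with octets.
my @seen;
my $wrapped = decoding_handler(sub { push @seen, $_[0] });
$wrapped->(encode_utf8("zespo\x{142}\x{f3}w\n"));

printf "handler received %d characters\n", length $seen[0];
```

With the real module the wrapped sub would go where the handler goes in the original example, e.g. text_h => [decoding_handler(sub { ... }), 'text']. That only helps when you construct the parser yourself, which is not the case when it is created inside HTML::TreeBuilder.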
>
> As to how to fix it via HTML::FormatText, I'm not sure. You'd need to
> read through the code to find out what it's doing and fix at an
> appropriate point.

I did it. The fix is in fact in HTML::TreeBuilder. The trouble is that
I'm giving this code to other people, so now I need to tell them to make
this change as well (and they might not have the right permissions,
might not know Perl, may have a different version of HTML::TreeBuilder ...)

> But perhaps there is another way. Instead of writing out to a file,
> can you write to an in-memory string? If so, then that string would be
> UTF-8 without the UTF-8 flag set. So you could fix that by running
> "decode_utf8()" over that string before writing it to a file. Or
> simply write that file out without any encoding layer, which would do
> no transformation of the UTF-8 bytes.

In real life I'm not writing to a file at all; I only did that in the
example to make it easy to verify. The actual usage is hidden inside
HTML::FormatText, which gives me a text rendering of the whole HTML
page. And if I try to use decode_utf8 on this result, I get other
gibberish (presumably because that result is already a Perl string).

Thanks for the help.
Moshe

>
> -Dom
>
> --
> | Semantico: creators of major online resources |
> | URL: http://www.semantico.com/ |
> | Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232 |
> | Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |
>

--
I love deadlines. I like the whooshing sound they make as they fly by.
-- Douglas Adams

Moshe Kaminsky <kaminsky@xxxxxxxxxxxxxxx>

