Re: HTML::Parser modifies unicode characters: msg#00018

lang.perl.modules.lwp

Subject: Re: HTML::Parser modifies unicode characters

* Dominic Mitchell <dom@xxxxxxxxxxxxx> [12/09/04 01:53]:
> Moshe Kaminsky wrote:
> >It appears that HTML::Parser modifies some unicode characters while
> >parsing. The following program gives an example:
> >
> >#########
> >
> >#!/usr/bin/perl
> >use HTML::Parser;
> >use utf8;
> >open TEST, '>:utf8', 'word.txt';
> >my $p = new HTML::Parser text_h => [sub {print TEST shift}, 'text'];
> >$p->parse("zespołów\n");
> >close TEST;
> >
> >#########
> >
> >After running it, 'word.txt' contains "zespołów" with the funny l and
> >the funny o following it transformed to something else. What am I doing
> >wrong?
> >I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.
>
> It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a
> nasty tendency to do this. :(
>
> Thankfully the workaround is fairly simple. Add "use Encode" to the top
> of the script, and change the callback slightly:
>
> sub { print TEST decode_utf8(shift) }
>
> seems to work ok here.
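
For reference, the workaround quoted above can be written out as a complete script. This is only a sketch, and it assumes every string reaching the handler is either already a character string or UTF-8 octets. The helper name `as_characters` and the `@chunks` array are made up for illustration. The helper decodes only when the UTF-8 flag is off, because an unconditional decode_utf8() can fail on text that already contains wide characters, which is what a parser that keeps the flag would hand back.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # this source file is itself UTF-8
use HTML::Parser;

# Hypothetical helper: return $text as a character string. It decodes
# as UTF-8 only when the UTF-8 flag is off, i.e. when the parser handed
# back raw octets. Assumption: such octets really are UTF-8.
sub as_characters {
    my ($text) = @_;
    utf8::decode($text) unless utf8::is_utf8($text);
    return $text;
}

# Collect the text chunks instead of printing from inside the handler.
my @chunks;
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @chunks, as_characters(shift) }, 'text' ],
);
$p->parse("zespołów\n");
$p->eof;

# Write the characters out through an encoding layer, as in the
# original test program.
open my $out, '>:encoding(UTF-8)', 'word.txt' or die "word.txt: $!";
print {$out} @chunks;
close $out or die "word.txt: $!";
```

The guard relies on the internal UTF-8 flag, so it is a pragmatic check rather than a general test: a byte string that was really Latin-1 but happened to form valid UTF-8 would be decoded wrongly.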

Thanks! That works. However, in my real code I'm using
HTML::FormatText, which in turn relies on HTML::TreeBuilder and
HTML::Parser, so I don't control the text handler myself. Does that
mean the only fix is to modify HTML::TreeBuilder? I wonder whether the
maintainers of HTML::Parser are aware of this problem and, if so, why
the parser doesn't decode the text automatically (or at least offer an
option to do so) before passing it to the handler.
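
One possible way to avoid patching HTML::TreeBuilder is to apply the same decoding once, to the string that HTML::FormatText returns. This is an untested-against-3.36 sketch: it assumes the formatted result comes back as UTF-8 octets whenever the flag was lost, and the helper `as_characters` is again hypothetical. Since the formatter would then have wrapped lines by counting octets rather than characters, line lengths may come out slightly short for non-ASCII text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # this source file is itself UTF-8
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: decode as UTF-8 only when the string is still
# raw octets (UTF-8 flag off). Assumption: such octets are UTF-8.
sub as_characters {
    my ($text) = @_;
    utf8::decode($text) unless utf8::is_utf8($text);
    return $text;
}

# Build the tree as usual; nothing in the parsing stage is modified.
my $tree = HTML::TreeBuilder->new;
$tree->parse("<p>zespołów</p>");
$tree->eof;

# Format to plain text, then repair the result in one place.
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
my $text      = as_characters($formatter->format($tree));
$tree->delete;

binmode STDOUT, ':encoding(UTF-8)';
print $text;
```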

Anyway, thanks again.
Moshe

>
> -Dom
>

--
I love deadlines. I like the whooshing sound they make as they fly by.
-- Douglas Adams

Moshe Kaminsky <kaminsky@xxxxxxxxxxxxxxx>
Home: 08-9456841

