Re: HTML::Parser modifies unicode characters: msg#00025

lang.perl.modules.lwp

Subject: Re: HTML::Parser modifies unicode characters

* Dominic Mitchell <dom@xxxxxxxxxxxxx> [13/09/04 14:01]:
> [snip]
> >I know nothing about XS, unfortunately, but the way I imagine it is
> >that at some point, HTML::Parser calls the method given by text_h,
> >passing the text to it. So instead of just passing the text, I
> >suggest that it should pass decode_utf8 applied to the text.
> >Alternatively, call a fixed (usual perl) sub 'foo', giving it the
> >value of text_h and the text, and foo will apply decode_utf8 to the
> >text and then pass the result to text_h.
>
> The trouble is that there's no guarantee that in the general case, the
> input will always be UTF-8. At some point in all this, the input
> character encoding needs to be specified. Only from that can the
> appropriate action be taken.
>

Well, my opinion, at least, is that HTML::Parser should insist on having
(in the terminology of the Encode docs) a Perl string as input, rather
than octets. If your HTML happens to be a sequence of octets in some
encoding, you can convert it to a Perl string using Encode prior to
passing it to HTML::Parser.
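
A minimal sketch of what I mean, assuming the input is known to be UTF-8
(the sample octets, the variable names, and the choice of the 'dtext'
argspec are only for illustration, and this assumes a reasonably recent
HTML::Parser):

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser;

# Octets as they might arrive from a file or socket: "café" encoded as
# UTF-8 (the e-acute is the two bytes C3 A9).
my $octets = "<p>caf\xC3\xA9</p>";

# Convert the octets to a Perl (character) string before parsing. The
# encoding has to be known from somewhere else; 'UTF-8' is assumed here.
my $html = decode('UTF-8', $octets);

# Collect whatever the text handler receives.
my @chunks;
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @chunks, shift }, 'dtext' ],
);
$parser->parse($html);
$parser->eof;

# The handler saw characters, so length() counts characters, not octets.
my $text = join '', @chunks;
print "text length: ", length($text), "\n";
```

The point is that the caller, who is the only one that knows the
encoding, does the decoding, and the parser never has to guess.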

Thanks,
Moshe
