Re: HTML::Parser modifies unicode characters: msg#00025 (lang.perl.modules.lwp)
* Dominic Mitchell <dom@xxxxxxxxxxxxx> [13/09/04 14:01]:
> [snip]
> >I know nothing about XS, unfortunately, but the way I imagine it is
> >that at some point, HTML::Parser calls the method given by text_h,
> >passing the text to it. So instead of just passing the text, I
> >suggest that it should pass decode_utf8 applied to the text.
> >Alternatively, call a fixed (usual perl) sub 'foo', giving it the
> >value of text_h and the text, and foo will apply decode_utf8 to the
> >text and then pass the result to text_h.
>
> The trouble is that there's no guarantee that in the general case, the
> input will always be UTF-8. At some point in all this, the input
> character encoding needs to be specified. Only from that can the
> appropriate action be taken.
>

Well, my opinion, at least, is that HTML::Parser should insist on having
(in the terminology of the Encode docs) a perl string as input, rather
than octets. If your HTML happens to be a sequence of octets in some
encoding, you can convert it to a perl string using Encode prior to
passing it to HTML::Parser.

Thanks,
Moshe
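
To make the suggestion concrete, here is a minimal sketch of decoding octets with Encode before the parser ever sees them. It assumes the input is UTF-8 (in real code the encoding has to come from somewhere, e.g. the HTTP headers), and the sample markup and the @chunks variable are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser ();

# Assumption: these octets are UTF-8. Substitute the real encoding name
# if the document is in something else.
my $octets = "<p>caf\xC3\xA9</p>";        # raw bytes, as read from a file or socket
my $html   = decode('UTF-8', $octets);    # now a perl (character) string

# Collect whatever the text handler is given.
my @chunks;
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @chunks, shift }, 'text' ],
);

$parser->parse($html);
$parser->eof;

# Encode again only at the output boundary.
binmode STDOUT, ':encoding(UTF-8)';
print join('', @chunks), "\n";
```

This way the text handler only ever receives characters, and the question of which encoding the input was in is settled once, before parsing, rather than inside HTML::Parser.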