Re: HTML::Parser modifies unicode characters (lang.perl.modules.lwp, msg#00018)
* Dominic Mitchell <dom@xxxxxxxxxxxxx> [12/09/04 01:53]:
> Moshe Kaminsky wrote:
> >It appears that HTML::Parser modifies some unicode characters while
> >parsing. The following program gives an example:
> >
> >#########
> >
> >#!/usr/bin/perl
> >use HTML::Parser;
> >use utf8;
> >open TEST, '>:utf8', 'word.txt';
> >my $p = new HTML::Parser text_h => [sub {print TEST shift}, 'text'];
> >$p->parse("zespołów\n");
> >close TEST;
> >
> >#########
> >
> >After running it, 'word.txt' contains "zespołów" with the funny l and
> >the funny o following it transformed to something else. What am I doing
> >wrong?
> >I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.
>
> It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a
> nasty tendency to do this. :(
>
> Thankfully the workaround is fairly simple. Add "use Encode" to the top
> of the script, and change the callback slightly:
>
>     sub { print TEST decode_utf8(shift) }
>
> seems to work ok here.

Thanks! That actually works. However, my real situation is that I'm
using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and
HTML::Parser. So to fix the problem, it appears that the only way is to
modify HTML::TreeBuilder?

I wonder if the maintainers of HTML::Parser are aware of this problem,
and if so, why don't they do this automatically (or at least add an
option to do it automatically) before giving the text to the handler?

Anyway, thanks again.
Moshe

>
> -Dom
>

-- 
I love deadlines. I like the whooshing sound they make as they fly by.
    -- Douglas Adams

Moshe Kaminsky <kaminsky@xxxxxxxxxxxxxxx>
Home: 08-9456841
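
For readers trying the workaround quoted above, here is a minimal sketch of the same idea using only the core Encode module. It does not call HTML::Parser at all: lossy_passthrough is a hypothetical stand-in that imitates a module handing back raw UTF-8 bytes, and decoding_handler is an illustrative helper name, not part of any library. Whether a given HTML::Parser version actually behaves this way is an assumption taken from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # the string literal below is UTF-8 source text
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical stand-in for a module that returns raw UTF-8 bytes
# instead of a character string (the behaviour described in the thread).
sub lossy_passthrough {
    my ($text) = @_;
    return encode_utf8($text);             # characters -> UTF-8 encoded bytes
}

# Illustrative helper: wrap any text handler so that it always
# receives decoded characters rather than bytes.
sub decoding_handler {
    my ($handler) = @_;
    return sub { $handler->(decode_utf8(shift)) };
}

my $word  = "zespołów";
my $bytes = lossy_passthrough($word);

# Collect what the wrapped handler sees.
my @seen;
my $handler = decoding_handler(sub { push @seen, shift });
$handler->($bytes);

binmode STDOUT, ':utf8';
printf "bytes: %d, characters: %d\n", length($bytes), length($seen[0]);
print "round trip ok\n" if $seen[0] eq $word;
```

With the real parser, the wrapped handler would be passed where the original callback went, for example as the first element of the text_h array reference in the quoted script.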