Re: HTML::Parser modifies unicode characters (msg#00023, lang.perl.modules.lwp)
Moshe Kaminsky wrote:
* Dominic Mitchell <dom@xxxxxxxxxxxxx> [12/09/04 01:53]:

Hmmm, it's a known problem:

http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS

It doesn't look unsolvable, but it's slightly beyond my XS skills. The key is indicating the character encoding of what you're parsing, but that's sometimes difficult to determine in advance (think HTML meta tags).

As to how to fix it via HTML::FormatText, I'm not sure. You'd need to read through the code to find out what it's doing and fix it at an appropriate point.

But perhaps there is another way. Instead of writing out to a file, can you write to an in-memory string? If so, then that string would be UTF-8 without the UTF-8 flag set. So you could fix that by doing "decode_utf8()" over that string before writing it to a file. Or simply write that file out without any encoding layer, which would do no transformation of the UTF-8 bytes.

-Dom

--
| Semantico: creators of major online resources |
| URL: http://www.semantico.com/ |
| Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232 |
| Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |
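
For anyone following along, here is a rough sketch of the two options suggested above. It is not taken from HTML::FormatText or HTML::Parser: the `$bytes` string is a made-up stand-in for whatever the formatter hands back, and `write_decoded` / `write_raw` are hypothetical helper names. It only assumes the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical stand-in for the formatter's in-memory output:
# the UTF-8 octets for "café €", with no UTF-8 flag set.
my $bytes = "caf\xc3\xa9 \xe2\x82\xac";

# Option 1: decode the octets into characters first, then write
# through an encoding layer that turns them back into UTF-8.
sub write_decoded {
    my ($path, $octets) = @_;
    my $chars = decode_utf8($octets);
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    print {$fh} $chars;
    close $fh or die "Cannot close $path: $!";
    return;
}

# Option 2: write the octets untouched, with no encoding layer,
# so the UTF-8 bytes are not transformed at all.
sub write_raw {
    my ($path, $octets) = @_;
    open my $fh, '>:raw', $path
        or die "Cannot open $path: $!";
    print {$fh} $octets;
    close $fh or die "Cannot close $path: $!";
    return;
}

# The same data is 9 octets before decoding and 6 characters after.
printf "%d octets, %d characters\n",
    length($bytes), length(decode_utf8($bytes));
```

Either helper should leave the same bytes on disk for valid UTF-8 input, e.g. `write_decoded('out.txt', $bytes)` or `write_raw('out.txt', $bytes)`. The thing to avoid is mixing the two, i.e. printing the undecoded octets through an `:encoding(UTF-8)` layer, which would encode them a second time.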