Re: How can I become universal utf/unicode: msg#00029 (lang.perl.modules.lwp)
* J and T wrote:
>Sometimes when fetching a document you have no idea the encoding and
>sometimes you do. What I want to know is how do I convert the incoming Web
>page regardless of encoding to UTF-8 as well as encode entities to something
>like Aacute (for keyword matching)?

You need to determine the character encoding of the document and then transcode the byte stream from that encoding to UTF-8.

There are a number of rules for determining the character encoding of text/html resources. Unfortunately these are underspecified and contradict each other. Worse, most documents carry no encoding information at all, which means you have to "guess" an encoding, and others are encoded differently from what they declare. In those cases you need to either reject the document or attempt to recover from the problem.

There is an HTML::Encoding module on CPAN that can help you determine the encoding, but there are probably some bugs, and the interface will almost certainly change once I get around to looking at it again (I haven't done so for years). It should nevertheless give you a good starting point. If that module (or similar code) does not yield any encoding information, there is Encode::Guess, which helps a bit to determine the encoding. More sophisticated solutions than Encode::Guess are, AFAICT, not available on CPAN. You could try to interface with, or reuse code from, a web browser; MSHTML, for example, performs byte pattern analysis to determine an encoding. A simpler approach is to fall back to, e.g., Windows-1252. Which you choose depends on how good you need the results to be.

Over at the W3C Markup Validator we currently use the information as HTML::Encoding would report it; if that fails, we fall back to UTF-8, and if the document is not decodable as UTF-8, it is rejected. This means that lots of documents are rejected.

Once the input is UTF-8 encoded, you can use HTML::Parser as usual. I am not sure whether it sets the UTF-8 flag, but either way it should report the data in the same encoding, so you could set the flag later.
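
The fallback order described above can be sketched with nothing but the core Encode module. This is only an illustration under stated assumptions, not the validator's actual code: `decode_html_bytes` is a hypothetical helper, `$declared` stands for whatever encoding name HTML::Encoding or the HTTP headers reported (possibly undef), and Windows-1252 as the last resort is simply the "simpler approach" mentioned above. A guessing step such as Encode::Guess could be slotted in before that last resort.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw response bytes into a Perl character string.
# Returns the decoded text and the name of the encoding that was used.
sub decode_html_bytes {
    my ($bytes, $declared) = @_;

    # 1. Use the declared encoding, but only if the bytes really decode with it.
    # 2. Otherwise try strict UTF-8 ('UTF-8' is strict, 'utf8' is lax).
    for my $name (grep { defined($_) && length($_) } $declared, 'UTF-8') {
        # Work on a copy: with a CHECK argument, decode() may modify its input.
        my $copy = $bytes;
        # FB_CROAK makes decode() die on malformed input or an unknown name.
        my $text = eval { decode($name, $copy, Encode::FB_CROAK) };
        return ($text, $name) if defined $text;
    }

    # 3. Last resort (an assumption of this sketch): Windows-1252.
    #    Without a CHECK argument, undecodable bytes become U+FFFD.
    my $copy = $bytes;
    return (decode('cp1252', $copy), 'cp1252');
}

# Usage: a lone 0xE9 byte is not valid UTF-8, so the fallback applies.
my ($text, $used) = decode_html_bytes("caf\xE9", undef);
my $utf8 = encode('UTF-8', $text);    # UTF-8 bytes, ready for further processing
```

Whether to fall back silently as in step 3 or to reject the document, as the validator does, depends on how much wrongly decoded text you can tolerate.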