Re: HTML::Entities and WinLatin1 NCRs [PATCH]: msg#00008 lang.perl.modules.lwp
Chris Darroch <chrisd@xxxxxxxxxxxxxx> writes:

> I use the HTML::Entities module quite a bit and have really
> appreciated its support for Unicode characters > 256 with Perl 5.8.
>
> I do have one particular issue that crops up for me, and I thought
> it might affect others as well, so I'm including a crude set of
> patches with my "fix". In short, I have to support HTML documents
> authored by a wide variety of people, and over time they've
> accumulated numeric character references to the troublesome set
> of characters between 128 and 159, mostly due to authors working
> on Windows platforms. The same documents now may also have
> character references to the Unicode code points for those characters.
>
> Here's a simple example: "two &#151; em &#8212; dashes".
>
> Now, in my particular situation, I sometimes want to decode
> these entities to the same code point, so that, for example, I can
> match strings against each other. At first I thought I might
> get away with this:
>
>     $a = Encode::encode('utf8', $a);    # force no utf8 flag
>     HTML::Entities::decode_entities($a);
>     $a = Encode::decode('cp1252', $a) unless (Encode::is_utf8($a));
>
> But while that will turn "&#151;" into U+2014, it turns
> "&#151;&#8212;" into U+0097 U+2014, which doesn't help.
>
> So, I whacked into place a decode_entities_cp1252() function
> that decodes any numeric character references in the 128-159
> range (except for a couple of undefined ones) to the UTF-8
> equivalents. I'm positive there are nicer, more elegant, and
> probably more flexible ways to do this, but lacking additional
> time to experiment, this is where I stopped.

To me it feels wrong to add such a kludge to HTML::Entities. It just
seems to be the wrong level for such manipulations.
I would suggest that you just post-process the string that
decode_entities() returns to fix up the Windows mess using tr///;
example:

    sub cp1252_fixup {
        # replaces the additional WinLatin-1 chars in the 0x80 - 0x9F range
        # with the corresponding Unicode character
        my $str = shift;
        $str =~ tr/\x80-\x9f/\x{20AC}\x{FFFD}\x{201A}\x{192}\x{201E}\x{2026}\x{2020}\x{2021}\x{2C6}\x{2030}\x{160}\x{2039}\x{152}\x{FFFD}\x{17D}\x{FFFD}\x{FFFD}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{2DC}\x{2122}\x{161}\x{203A}\x{153}\x{FFFD}\x{17E}\x{178}/;
        $str;
    }

    my $str = "Here's a simple example: two &#151; em &#8212; dashes";

    use HTML::Entities;
    $str = cp1252_fixup(HTML::Entities::decode_entities($str));

    use Data::Dump;
    print Data::Dump::dump($str), "\n";

Dan: Would it make sense to make Encode provide something like
cp1252_fixup, or is there already a way to do this with Encode?

Regards,
Gisle
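
On the closing question, one possible approach (not confirmed anywhere in this thread) is to let Encode supply the table by decoding each stray character individually. The sketch below is only an illustration: the name cp1252_fixup_via_encode is hypothetical, and the input string is a hand-written stand-in for what decode_entities() would return for the example above, so that only the core Encode module is needed. It also assumes that Encode's default handling of the five bytes cp1252 leaves undefined is acceptable, which is not checked here.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the thread): pass each character in the
# 0x80-0x9F range through Encode's cp1252 table, one at a time, and
# leave every other character untouched.
sub cp1252_fixup_via_encode {
    my $str = shift;
    # pack('C', ...) turns the code point back into a single raw byte,
    # which is what Encode::decode() expects as input.
    $str =~ s/([\x80-\x9f])/Encode::decode('cp1252', pack('C', ord $1))/ge;
    return $str;
}

# Stand-in for the decoded example: one U+0097 and one U+2014.
my $decoded = "two \x{97} em \x{2014} dashes";
my $fixed   = cp1252_fixup_via_encode($decoded);

# List the non-ASCII code points that remain after the fixup.
printf "U+%04X\n", ord($_) for grep { ord($_) > 0x7f } split //, $fixed;
```

Run as is, this prints U+2014 twice, so both dashes end up on the same code point and the strings can be compared. A per-character s///e is likely slower than the tr/// version on large inputs; the only thing it buys is not having to maintain the 32-entry table by hand.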