windows1252-to-Unicode.sed: msg#00003 editors.sed.user
I just wrote this. Thought somebody else might be able to use it.

------------------------
# filename: windows1252-to-Unicode.sed
#   author: Eric Pement - eric.pement[at]moody.edu
#     date: 2005-12-06 16:14
#
# GNU sed script to convert Windows-1252 characters in the range
# of € to Ÿ (0x80 to 0x9F) to Unicode character entities
# for HTML and XHTML. Characters in this range are LEGAL for the
# windows-1252 character set, but are ILLEGAL for the ISO-8859-1
# character set (the default encoding for web display).

# Use this script if you have input text generated in Microsoft
# Word/Works, and you find input characters in this range
# (usually, the curly quotes, ellipsis, or the en/em dash), and
# you want to change the character encoding to ISO-8859-1 (i.e.,
# Latin-1) with Unicode character entities.

# On the 'find' side, I used \xHH instead of \dNN because \xHH
# is supported in BOTH ssed default and in ssed -R (Perlmode),
# whereas \dNN is not supported if using ssed -R.

# On the 'replace' side, HTML character entities can be rendered
# with either &#NNNN; (base 10) or &#xHHHH; (base 16). I chose
# to use base 16, since that's how they are listed in the charts
# at http://unicode.org/charts and http://unicode.org/unibook .

s/\x80/\&#x20AC;/g;   # 128 - Euro symbol
s/\x81/\&#x2122;/g;   # 129 - (nonstandard) TradeMark symbol
s/\x82/\&#x201A;/g;   # 130 - low-9 single quote
s/\x83/\&#x0192;/g;   # 131 - 'f' with hook (function symbol)
s/\x84/\&#x201E;/g;   # 132 - low-9 double quote
s/\x85/\&#x2026;/g;   # 133 - ellipsis
s/\x86/\&#x2020;/g;   # 134 - dagger
s/\x87/\&#x2021;/g;   # 135 - double dagger
# Next one is susceptible to several substitutions
# Note the Unicode variants:
#    005E - standard circumflex
#    02C4 - modifier letter up arrowhead
#    02C6 - modifier letter circumflex accent
#    2038 - caret (used at baseline)
#    2227 - logical AND
#    2303 - UP arrowhead
s/\x88/\&#x02C6;/g;   # 136 - circumflex accent
#
s/\x89/\&#x2030;/g;   # 137 - per thousand [0/00]
s/\x8A/\&#x0160;/g;   # 138 - uppercase 'S' with caron
s/\x8B/\&#x2039;/g;   # 139 - L angle single quote, like <
s/\x8C/\&#x0152;/g;   # 140 - Uppercase 'OE' ligature
# s/\x8D//g;          # 141 - (nonstandard) no character?
s/\x8E/\&#x017D;/g;   # 142 - (nonstd?) uppercase 'Z' with caron
# s/\x8F//g;          # 143 - (nonstandard) no character?
# s/\x90//g;          # 144 - (nonstandard) no character?
s/\x91/\&#x2018;/g;   # 145 - L single quote
s/\x92/\&#x2019;/g;   # 146 - R single quote (or apostrophe)
s/\x93/\&#x201C;/g;   # 147 - L double quote
s/\x94/\&#x201D;/g;   # 148 - R double quote
s/\x95/\&#x2022;/g;   # 149 - bullet
s/\x96/\&#x2013;/g;   # 150 - en dash (between numbers)
s/\x97/\&#x2014;/g;   # 151 - em dash (between words)
s/\x98/\&#x02DC;/g;   # 152 - small tilde
s/\x99/\&#x2122;/g;   # 153 - TradeMark symbol
s/\x9A/\&#x0161;/g;   # 154 - small 's' with caron
s/\x9B/\&#x203A;/g;   # 155 - R angle single quote, like >
s/\x9C/\&#x0153;/g;   # 156 - small 'oe' ligature
# s/\x9D//g;          # 157 - (nonstandard) no character?
s/\x9E/\&#x017E;/g;   # 158 - small 'z' with caron
s/\x9F/\&#x0178;/g;   # 159 - uppercase 'Y' with diaeresis
#--- [end of sed script] ---
------------------------

--
Eric Pement - eric.pement@xxxxxxxxx
Educational Technical Services, MBI
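
A quick way to try the rules before running the whole file is to feed a few known bytes through a subset of them. The sketch below is not part of the original post: the `convert` function name and the sample sentence are made up for illustration, and it assumes GNU sed (for `\xHH` in the pattern). It also sets `LC_ALL=C`, on the assumption that the input should be treated as raw bytes; in a UTF-8 locale the lone bytes 0x80 to 0x9F are invalid sequences and may not be handled as expected. The sample bytes are written in octal because POSIX `printf` guarantees octal escapes but not `\xHH`.

```shell
#!/bin/sh
# Sketch only: three rules copied from the script above, run in the
# C locale so that sed works on single bytes rather than UTF-8 text.
# 'convert' is a hypothetical helper name, not part of the script.
convert() {
    LC_ALL=C sed -e 's/\x93/\&#x201C;/g' \
                 -e 's/\x94/\&#x201D;/g' \
                 -e 's/\x97/\&#x2014;/g'
}

# Octal \223, \224 and \227 are the bytes 0x93, 0x94 and 0x97
# (left double quote, right double quote, em dash in Windows-1252).
result=$(printf 'He said \223yes\224 \227 twice.\n' | convert)
printf '%s\n' "$result"
# prints: He said &#x201C;yes&#x201D; &#x2014; twice.

# The same character as a base-10 entity, for comparison:
# 0x201C is 8220 in decimal.
decimal_entity=$(printf '&#%d;' 0x201C)
printf '%s\n' "$decimal_entity"
# prints: &#8220;
```

To run the complete script the same way, something like `LC_ALL=C sed -f windows1252-to-Unicode.sed input.html > output.html` should work, where `input.html` and `output.html` are placeholder file names.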