
windows1252-to-Unicode.sed: msg#00003

editors.sed.user

Subject: windows1252-to-Unicode.sed

I just wrote this. Thought somebody else might be able to use it.

------------------------
# filename: windows1252-to-Unicode.sed
# author: Eric Pement - eric.pement[at]moody.edu
# date: 2005-12-06 16:14
#
# GNU sed script to convert Windows-1252 characters in the range
# of &#128; to &#159; (0x80 to 0x9F) to Unicode character entities
# for HTML and XHTML. Characters in this range are LEGAL for the
# windows-1252 character set, but are ILLEGAL for the ISO-8859-1
# character set (the default encoding for web display).

# Use this script if you have input text generated in Microsoft
# Word/Works, and you find input characters in this range
# (usually, the curly quotes, ellipsis, or the en/em dash), and
# you want to change the character encoding to ISO-8859-1 (i.e.,
# Latin-1) with Unicode character entities.

# On the 'find' side, I used \xHH instead of \dNN because \xHH
# is supported in BOTH ssed default and in ssed -R (Perlmode),
# whereas \dNN is not supported if using ssed -R.

# On the 'replace' side, HTML character entities can be rendered
# with either &#NNNN; (base 10) or &#xHHHH; (base 16). I chose
# to use base 16, since that's how they are listed in the charts
# at http://unicode.org/charts and http://unicode.org/unibook .

s/\x80/\&#x20AC;/g; # 128 - Euro symbol
s/\x81/\&#x2122;/g; # 129 - (nonstandard) TradeMark symbol
s/\x82/\&#x201A;/g; # 130 - low-9 single quote
s/\x83/\&#x0192;/g; # 131 - 'f' with hook (function symbol)
s/\x84/\&#x201E;/g; # 132 - low-9 double quote
s/\x85/\&#x2026;/g; # 133 - ellipsis
s/\x86/\&#x2020;/g; # 134 - dagger
s/\x87/\&#x2021;/g; # 135 - double dagger

# Next one is susceptible to several substitutions
# Note the Unicode variants:
# 005E - standard circumflex
# 02C4 - modifier letter up arrowhead
# 02C6 - modifier letter circumflex accent
# 2038 - caret (used at baseline)
# 2227 - logical AND
# 2303 - UP arrowhead
s/\x88/\&#x02C6;/g; # 136 - circumflex accent
#

s/\x89/\&#x2030;/g; # 137 - per thousand [0/00]
s/\x8A/\&#x0160;/g; # 138 - uppercase 'S' with caron
s/\x8B/\&#x2039;/g; # 139 - L angle single quote, like <
s/\x8C/\&#x0152;/g; # 140 - Uppercase 'OE' ligature

# s/\x8D//g; # 141 - (nonstandard) no character?

s/\x8E/\&#x017D;/g; # 142 - (nonstd?) uppercase 'Z' with caron

# s/\x8F//g; # 143 - (nonstandard) no character?
# s/\x90//g; # 144 - (nonstandard) no character?

s/\x91/\&#x2018;/g; # 145 - L single quote
s/\x92/\&#x2019;/g; # 146 - R single quote (or apostrophe)
s/\x93/\&#x201C;/g; # 147 - L double quote
s/\x94/\&#x201D;/g; # 148 - R double quote
s/\x95/\&#x2022;/g; # 149 - bullet
s/\x96/\&#x2013;/g; # 150 - en dash (between numbers)
s/\x97/\&#x2014;/g; # 151 - em dash (between words)
s/\x98/\&#x02DC;/g; # 152 - small tilde
s/\x99/\&#x2122;/g; # 153 - TradeMark symbol
s/\x9A/\&#x0161;/g; # 154 - small 's' with caron
s/\x9B/\&#x203A;/g; # 155 - R angle single quote, like >
s/\x9C/\&#x0153;/g; # 156 - small 'oe' ligature

# s/\x9D//g; # 157 - (nonstandard) no character?

s/\x9E/\&#x017E;/g; # 158 - small 'z' with caron
s/\x9F/\&#x0178;/g; # 159 - uppercase 'Y' with diaeresis
#--- [end of sed script] ---

------------------------
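
A minimal usage sketch, not part of the original post: it copies three rules from the script above into a temporary file and runs them over a short Windows-1252 sample. It assumes GNU sed (for the \xHH escapes) and a POSIX shell. The variable names and the sample text are hypothetical, and LC_ALL=C is my own assumption, added so that sed treats the input as single bytes rather than as UTF-8 sequences.

```shell
#!/bin/sh
# Hypothetical demo: apply three rules copied from the script above
# to a short Windows-1252 sample. Assumes GNU sed (for \xHH escapes).

rules=$(mktemp)
cat > "$rules" <<'EOF'
s/\x93/\&#x201C;/g; # 147 - L double quote
s/\x94/\&#x201D;/g; # 148 - R double quote
s/\x97/\&#x2014;/g; # 151 - em dash (between words)
EOF

# printf octal escapes: \223 = 0x93, \224 = 0x94, \227 = 0x97
# LC_ALL=C is an assumption (not in the original post): it makes sed
# handle the input byte by byte instead of decoding it as UTF-8.
result=$(printf '\223Hi\224 \227 ok\n' | LC_ALL=C sed -f "$rules")
rm -f "$rules"

printf '%s\n' "$result"
```

With the full script saved under the filename given in its header, the equivalent call would presumably be: sed -f windows1252-to-Unicode.sed input.txt > output.html (input.txt and output.html are placeholder names).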

--
Eric Pement - eric.pement@xxxxxxxxx
Educational Technical Services, MBI


