windows1252-to-Unicode.sed: msg#00003

Subject: windows1252-to-Unicode.sed
I just wrote this.  Thought somebody else might be able to use it.

------------------------
# filename: windows1252-to-Unicode.sed
#   author: Eric Pement - eric.pement[at]moody.edu
#     date: 2005-12-06 16:14
#
# GNU sed script to convert Windows-1252 characters in the range
# of &#128; to &#159; (0x80 to 0x9F) to Unicode character entities
# for HTML and XHTML. Characters in this range are LEGAL for the
# windows-1252 character set, but are ILLEGAL for the ISO-8859-1
# character set (the default encoding for web display).

# Use this script if you have input text generated in Microsoft
# Word/Works, and you find input characters in this range
# (usually, the curly quotes, ellipsis, or the en/em dash), and
# you want to change the character encoding to ISO-8859-1 (i.e.,
# Latin-1) with Unicode character entities.

# On the 'find' side, I used \xHH instead of \dNN because \xHH
# is supported in BOTH ssed default and in ssed -R (Perlmode),
# whereas \dNN is not supported if using ssed -R.

# On the 'replace' side, HTML character entities can be rendered
# with either &#NNNN; (base 10) or &#xHHHH; (base 16). I chose
# to use base 16, since that's how they are listed in the charts
# at http://unicode.org/charts and http://unicode.org/unibook .

s/\x80/\&#x20AC;/g;   # 128 - Euro symbol
s/\x81/\&#x2122;/g;   # 129 - (nonstandard) TradeMark symbol
s/\x82/\&#x201A;/g;   # 130 - low-9 single quote
s/\x83/\&#x0192;/g;   # 131 - 'f' with hook (function symbol)
s/\x84/\&#x201E;/g;   # 132 - low-9 double quote
s/\x85/\&#x2026;/g;   # 133 - ellipsis
s/\x86/\&#x2020;/g;   # 134 - dagger
s/\x87/\&#x2021;/g;   # 135 - double dagger

# Next one is susceptible to several substitutions
# Note the Unicode variants:
# 005E - standard circumflex
# 02C4 - modifier letter up arrowhead
# 02C6 - modifier letter circumflex accent
# 2038 - caret (used at baseline)
# 2227 - logical AND
# 2303 - UP arrowhead
s/\x88/\&#x02C6;/g;   # 136 - circumflex accent
#

s/\x89/\&#x2030;/g;   # 137 - per thousand [0/00]
s/\x8A/\&#x0160;/g;   # 138 - uppercase 'S' with caron
s/\x8B/\&#x2039;/g;   # 139 - L angle single quote, like <
s/\x8C/\&#x0152;/g;   # 140 - Uppercase 'OE' ligature

# s/\x8D//g;          # 141 - (nonstandard) no character?

s/\x8E/\&#x017D;/g;   # 142 - (nonstd?) uppercase 'Z' with caron

# s/\x8F//g;          # 143 - (nonstandard) no character?
# s/\x90//g;          # 144 - (nonstandard) no character?

s/\x91/\&#x2018;/g;   # 145 - L single quote
s/\x92/\&#x2019;/g;   # 146 - R single quote (or apostrophe)
s/\x93/\&#x201C;/g;   # 147 - L double quote
s/\x94/\&#x201D;/g;   # 148 - R double quote
s/\x95/\&#x2022;/g;   # 149 - bullet
s/\x96/\&#x2013;/g;   # 150 - en dash (between numbers)
s/\x97/\&#x2014;/g;   # 151 - em dash (between words)
s/\x98/\&#x02DC;/g;   # 152 - small tilde
s/\x99/\&#x2122;/g;   # 153 - TradeMark symbol
s/\x9A/\&#x0161;/g;   # 154 - small 's' with caron
s/\x9B/\&#x203A;/g;   # 155 - R angle single quote, like >
s/\x9C/\&#x0153;/g;   # 156 - small 'oe' ligature

# s/\x9D//g;          # 157 - (nonstandard) no character?

s/\x9E/\&#x017E;/g;   # 158 - small 'z' with caron
s/\x9F/\&#x0178;/g;   # 159 - uppercase 'Y' with diaeresis
#--- [end of sed script] ---

------------------------
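
A minimal usage sketch, assuming GNU sed is installed as "sed" (the \xHH
escapes need it). To keep the example self-contained it writes a
three-rule excerpt of the script to a temporary file. In practice you
would pass the full windows1252-to-Unicode.sed with -f. The sample text
and the temporary file are made up for illustration. LC_ALL=C is an
assumption on my part: it asks sed to treat the input as raw bytes,
which avoids surprises in UTF-8 locales, where 0x80 to 0x9F are not
valid characters on their own.

```shell
# Assumes GNU sed; the temp file stands in for windows1252-to-Unicode.sed.
script=$(mktemp)
cat > "$script" <<'EOF'
s/\x93/\&#x201C;/g;   # 147 - L double quote
s/\x94/\&#x201D;/g;   # 148 - R double quote
s/\x97/\&#x2014;/g;   # 151 - em dash (between words)
EOF

# Build a Windows-1252 sample with printf octal escapes:
# \223 = 0x93, \224 = 0x94, \227 = 0x97
result=$(printf '\223hi\224 \227 ok\n' | LC_ALL=C sed -f "$script")
printf '%s\n' "$result"

rm -f "$script"
```

With the full script saved to disk, the equivalent call would be
something like: LC_ALL=C sed -f windows1252-to-Unicode.sed in.txt > out.html
(in.txt and out.html are placeholder names).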

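The script's comments mention that an entity can be written in base 10
or base 16. As a small illustration (not part of the script), the shell's
printf can show both forms for one code point. U+201C is only an example
value, and the variable names are hypothetical.

```shell
# Print the base-10 and base-16 entity forms of a single code point.
codepoint=0x201C                                   # left double quote, example only
decimal_entity=$(printf '&#%d;' "$codepoint")      # base 10 form
hex_entity=$(printf '&#x%04X;' "$codepoint")       # base 16 form, 4 hex digits
printf '%s %s\n' "$decimal_entity" "$hex_entity"
```

Both forms name the same character, so either could be used on the
'replace' side of the script.
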
-- 
Eric Pement - eric.pement@xxxxxxxxx
Educational Technical Services, MBI


