posted on Thursday, May 05, 2005 8:57 AM by joefield
Character Set Shenanigans
For various reasons I've been spending a lot of time trying to understand character sets recently. There's a wealth of knowledge out there, not least Joel's excellent introductory article.
I also found Jon's resources very useful, along with the excellent character set tables here. And the UltraEdit text editor of course, which shows you the hex for Unicode, UTF-8, 1252 etc.
Interesting things (for me, living in Western Europe) were:
- Typing Alt + 243 on the numeric keypad will input the character "¾". This is because, without a leading zero, Windows looks the code up in the OEM code page, which here is IBM 850.
- But... typing Alt + 0243 on the numeric keypad will input the character "ó". With the leading zero, the code is looked up in the Western European 1252 character set instead.
- The Euro symbol goes walkies a bit - in ISO8859-1 (Latin 1) it wasn't present. It's in ISO8859-15 at 164 decimal. But in 1252 it's at 128 decimal. And its Unicode code point is 0x20AC.
- ISO8859-1 occupies the first 256 code positions (and ASCII the first 128 positions) of the UCS.
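- Latin 1 and 1252 are identical from 160 to 255 (and from 0 to 127, since both contain ASCII). They differ only from 128 to 159, where Latin 1 has control codes and 1252 has printable characters such as the Euro symbol.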
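
If you want to check the points above for yourself, here is a minimal Python sketch. It assumes Python's built-in codecs "cp850", "cp1252", "iso8859-15" and "latin-1" match the code pages discussed here. The helper decode_byte and the variable names are my own, purely for illustration. It only models the byte-to-character lookup, not the Windows keyboard handling itself.

```python
def decode_byte(value, codec):
    """Decode a single byte value (0-255) using the named codec."""
    return bytes([value]).decode(codec)

# Alt + 243 vs Alt + 0243: the same number, looked up in two code pages.
alt_243 = decode_byte(243, "cp850")    # OEM code page (IBM 850)
alt_0243 = decode_byte(243, "cp1252")  # Western European 1252

# The Euro symbol: code point U+20AC, stored at different byte values.
EURO = "\u20ac"
euro_8859_15 = EURO.encode("iso8859-15")[0]
euro_1252 = EURO.encode("cp1252")[0]
try:
    EURO.encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    # Latin 1 has no byte for the Euro symbol.
    euro_in_latin1 = False

# Latin 1 byte n decodes to the code point numbered n, for every byte.
latin1_matches_ucs = all(
    ord(decode_byte(n, "latin-1")) == n for n in range(256)
)

# Latin 1 and 1252 agree on every byte from 160 to 255.
upper_range_identical = all(
    decode_byte(n, "latin-1") == decode_byte(n, "cp1252")
    for n in range(160, 256)
)

print(ascii(alt_243), ascii(alt_0243))
print(euro_8859_15, euro_1252, euro_in_latin1)
print(latin1_matches_ucs, upper_range_identical)
```

The results are printed with ascii() so that the output doesn't itself fall foul of whatever code page your console happens to be using.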
I could go on but someone might get sleepy.