Thursday, May 05, 2005 - Posts

Character Set Shenanigans

For various reasons I've been spending a lot of time trying to understand character sets recently. There's a wealth of knowledge out there, not least Joel's excellent introductory article.

I also found Jon's resources very useful, as well as some excellent character set tables. And the UltraEdit text editor, of course, which shows you the hex for Unicode, UTF-8, 1252 etc.

Interesting things (for me, living in Western Europe) were:

  • Typing Alt + 243 on the numeric keypad will input the character "¾". This is because it selects the code from the IBM 850 code page.
  • But... typing Alt + 0243 on the numeric keypad will input the character "ó". This time the code is selected from the Western European 1252 character set.
  • The Euro symbol goes walkies a bit - it isn't present in ISO8859-1 (Latin 1) at all. It's in ISO8859-15 at 164 decimal, but in 1252 it's at 128 decimal. And its Unicode code point is U+20AC.
  • ISO8859-1 occupies the first 256 code positions (and ASCII the first 128 positions) of the UCS.
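  • Latin 1 and 1252 are identical from 160 to 255. They only differ from 128 to 159, where Latin 1 has control codes and 1252 has printable characters (the Euro among them).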
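
If you want to check the points above for yourself, here is a rough sketch using Python's built-in codecs. The codec names ("cp850", "cp1252", "latin-1", "iso8859-15") are Python's spellings, and the helper names are made up for illustration. It doesn't reproduce the Alt key behaviour itself, which is a Windows input feature. It only decodes byte 243 under each code page, on the assumption that that is what the keypad entry selects.

```python
# Sketch: check the character set observations with Python's built-in codecs.
# Helper names are hypothetical; only single-byte codecs are assumed here.

def decode_byte(value, codec):
    """Decode one byte value (0-255) using the named single-byte codec."""
    return bytes([value]).decode(codec)

def euro_position(codec):
    """Return the Euro sign's byte value in a single-byte codec, or None if absent."""
    try:
        return "\u20ac".encode(codec)[0]
    except UnicodeEncodeError:
        return None

# Byte 243 read through the two code pages behind Alt + 243 and Alt + 0243.
alt_243 = decode_byte(243, "cp850")
alt_0243 = decode_byte(243, "cp1252")

# Where the Euro sign lives (if anywhere) in each character set.
euro = {name: euro_position(name) for name in ("latin-1", "iso8859-15", "cp1252")}

# Latin 1 bytes 0-255 and ASCII bytes 0-127 map straight onto the same code points.
latin1_matches_ucs = [ord(c) for c in bytes(range(256)).decode("latin-1")] == list(range(256))
ascii_matches_ucs = [ord(c) for c in bytes(range(128)).decode("ascii")] == list(range(128))

# Latin 1 and 1252 agree on every byte from 160 to 255.
upper = bytes(range(160, 256))
upper_half_identical = upper.decode("latin-1") == upper.decode("cp1252")

if __name__ == "__main__":
    # Print code points rather than raw characters so any console can show them.
    print("byte 243 in cp850  -> U+%04X" % ord(alt_243))
    print("byte 243 in cp1252 -> U+%04X" % ord(alt_0243))
    print("Euro byte positions:", euro)
    print("Latin 1 == first 256 code points:", latin1_matches_ucs)
    print("ASCII == first 128 code points:", ascii_matches_ucs)
    print("Latin 1 == 1252 for 160-255:", upper_half_identical)
```

The comparison stops at 160 on purpose: a few positions between 128 and 159 are undefined in Python's "cp1252" codec, so decoding that whole range would raise an error.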

I could go on but someone might get sleepy.