-------------------------------------------------------------------------------
Hints and Tips for handling Unicode (UTF-8).

Also see "locale.hints" in this same directory
-------------------------------------------------------------------------------
VIM and Unicode

In VIM (insert mode), type Ctrl-V u followed by the hexadecimal code point
of the character you want.
Example:
    Ctrl-V u 20ac

To see the actual character encoding in the file use...
   ga    for the code point of the character under the cursor
         (and what to type to get that character)
   g8    for the hexadecimal bytes of the UTF-8 encoding
         (but not the code point)

If you are using a UTF-8 font you can force UTF-8 on the file using
   :set encoding=utf-8 fileencodings=
But you will need a terminal that understands UTF-8 (or gvim)
You can also set the current locale to do this automatically

How else?

-------------------------------------------------------------------------------
Unicode Display

You can display UTF-8 output (say from the perl examples below) using a
Gnome Terminal, if you first set...
   Terminal -> Set Character Encoding -> Unicode (UTF-8)

XTerm should be able to handle it but you need to set a locale during login.

-------------------------------------------------------------------------------
Testing Unicode

You can test your unicode strings by running them through the "iconv"
program.  For example...

   iconv -f UTF-8 -t UTF-16 unicode_file > /dev/null && echo "Valid UTF-8"

NOTE: recode can be used in place of iconv
   recode utf8..utf16

NOTE: normal ASCII files without any low level control codes are valid UTF-8

To print the unicode codes of a UTF-16 file use od -t x2
   od -t x2 unicode.txt

For a UTF-8 file, convert it to UTF-16 first
   iconv -f utf8 -t utf16 unicode.txt | od -t x2

Conversion in perl...

   use Encode;
   $text = 'Текст кириллица';
   # decode the cp1251 bytes into characters, then encode those as UTF-8
   $text = encode("utf8", decode("cp1251", $text));
   print "$text\n";

-------------------------------------------------------------------------------
X window Selections

Selections store UTF-8 as "\u" encoded strings

   xselection PRIMARY
   \u6d4b\u8bd5\u7528\u7684\u6c49\u5b57

To return it to utf-8 use...
   env LC_CTYPE=en_AU.utf8 printf `xselection PRIMARY`'\n'
or
   env LC_CTYPE=en_AU.utf8 printf '\u6d4b\u8bd5\u7528\u7684\u6c49\u5b57\n'

-------------------------------------------------------------------------------
Perl and Unicode

chr() converts a unicode code point into a character, which is written out
as UTF-8 when the output handle has a ":utf8" layer.  Without such an output
setting a warning about 'wide' characters may also be generated.

   perl -e 'binmode(STDOUT, ":utf8"); print chr(0x015C)' | od -t x1
   0000000 c5 9c
   0000002

Or using the -C option to set the input and output string attributes

   perl -CO -e 'print "\x{6d4b}\x{8bd5}\x{7528}\x{7684}\x{6c49}\x{5b57}\n";'

   perl -CO -e \
     'print pack("U*", 0x6d4b, 0x8bd5, 0x7528, 0x7684, 0x6c49, 0x5b57), "\n";'

Convert UTF-16 to UTF-8
(the "n*" unpack assumes big-endian UTF-16 without surrogate pairs)

   utf-16_source | perl -CO -ne 'print pack("U*",unpack("n*", $_)), "\n"'

And the reverse, UTF-8 to UTF-16

   utf-8_source | iconv -f UTF-8 -t UTF-16 - | od -t x1

NOTE: recode can be used in place of iconv
   recode utf8..utf16

-------------------------------------------------------------------------------
X windows and Unicode

If you ever get a message about
   Warning: Missing charsets in String to FontSet conversion
   Warning: Unable to load any usable fontset
then the application uses Xaw widgets, which do not handle unicode fonts.
The solution is to set the environment variable "LANG=C" before running it.
Example applications include "xmessage"

To add unicode keys to your X window keyboard

Add the following to a xmodmap file like ".Xmodmap"

  ! Unicode modifications
  ! The first line sets the key Alt-Right as the 3rd & 4th 'ModeSwitch' control
  ! See http://www.cl.cam.ac.uk/~mgk25/unicode.html#input
  !
  ! NOTEs: Right-Alt and these keys produce...
  !   []     typographic single quotes
  !   {}     typographic double quotes
  !   23     superscript 2 and 3
  !   d      degree symbol
  !   -nm    hyphen, n-dash, m-dash
  !   M      micro symbol
  !   *      multiply
  !   /      divide
  !   $      euro symbol
  !   space  no-break or shifted space
  !
  keycode 113 = Mode_switch Mode_switch
  keysym d = d NoSymbol degree NoSymbol
  keysym m = m NoSymbol emdash mu
  keysym n = n NoSymbol endash NoSymbol
  keysym 2 = 2 quotedbl twosuperior NoSymbol
  keysym 3 = 3 numbersign threesuperior NoSymbol
  keysym 4 = 4 dollar EuroSign NoSymbol
  keysym space = space NoSymbol nobreakspace NoSymbol
  keysym minus = minus underscore U2212 NoSymbol
  keysym slash = slash NoSymbol division NoSymbol
  keysym asterisk = asterisk NoSymbol multiply NoSymbol
  keycode 34 = bracketleft braceleft leftsinglequotemark leftdoublequotemark
  keycode 35 = bracketright braceright rightsinglequotemark rightdoublequotemark

-------------------------------------------------------------------------------
UTF-8 Encoding...

For a summary of how UTF-8 came to be (which made unicode practical in modern
systems), see
   http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

It was invented by Ken Thompson and Rob Pike in September 1992 and
immediately put to use in Plan-9 and IBM X/Open

It also gives the actual original proposal, and the original UTF-8 to UTF-16
translation subroutines, though the encoding is very slightly different, to
make it slightly easier to determine where you are in a sequence.
It also can be extended a little more, though there is not much need for it.

See RFC3629

  * A byte without the high bit set is the same in ASCII and UTF-8
       EG:  5A -> 5A   (uppercase Z)
  * All bytes of a multi-byte sequence have the high bit set
  * The start byte of any multi-byte sequence has the two highest bits set
    (11), while all the other bytes in the sequence have 10 as their
    high bits
  * Byte values C0, C1, F5 - FF will never appear in UTF-8
    For a more detailed summary of valid byte values see...
      http://www.phrack.org/phrack/62/p62-0x09_UTF8_Shellcode.txt
    (This is actually for the use by crackers, but good reading)
  * Searching in UTF-8 works as normal (unless you want all e's to match)
  * The number of bytes in an encoding is defined by the number of high bits
    set in the first byte
       EG:  C5 9B   the C is 1100 in binary, or 2 high bits,
                    thus two bytes for the character

As such...

   Unicode Character    valid bits   UTF String
         00 -     7F         7       0xxxxxxx
       0080 -   07FF        11       110xxxxx 10xxxxxx
       0800 -   FFFF        16       1110xxxx 10xxxxxx 10xxxxxx
     010000 - 10FFFF        21       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example encoding and decoding

  Unicode "Latin Small Letter s with Acute"
     015B  -> 0000 0001 0101 1011     convert to binary
           -> xxx0 0101 xx01 1011     re-organise bits (shifts)
           -> 1100 0101 1001 1011     add top level bits
           -> C59B                    final UTF-8 string

  Reverse UTF-8:
     E2 80 9C
           -> 1110 0010 1000 0000 1001 1100     convert to binary
           -> xxxx 0010 xx00 0000 xx01 1100     remove top level bits
           -> 0010 0000 0001 1100               re-join the remaining bits
           -> 201C  in Unicode
           -> "Left Double Quotation Mark"
              or "Double Turned Comma Quotation Mark"
              or (in my words) "Opening Double Quote"

-------------------------------------------------------------------------------
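UTF-8 Encoding by hand (shell sketch)

The bit shuffling in the table and worked examples above can be tried out
directly in the shell.  What follows is only a minimal sketch using plain
POSIX sh arithmetic.  The function names "utf8_encode" and "utf8_decode" are
made up for this example (they are not standard commands), and no validation
is done: overlong forms, surrogates, and stray continuation bytes are not
detected.

```shell
#!/bin/sh
# utf8_encode HEX   - print the UTF-8 bytes (in hex) for a unicode code point
# utf8_decode B1 [B2 [B3 [B4]]]  - print the code point for UTF-8 hex bytes
# Both are hypothetical helpers for illustration, with no input validation.

utf8_encode() (
  cp=$(( 0x$1 ))
  if [ "$cp" -lt 128 ]; then            # 0xxxxxxx
    printf '%02X\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then         # 110xxxxx 10xxxxxx
    printf '%02X %02X\n' \
      $(( 0xC0 | (cp >> 6) )) \
      $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt 65536 ]; then        # 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02X %02X %02X\n' \
      $(( 0xE0 | (cp >> 12) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else                                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02X %02X %02X %02X\n' \
      $(( 0xF0 | (cp >> 18) )) \
      $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  fi
)

utf8_decode() (
  b1=$(( 0x$1 ))
  if [ "$b1" -lt 128 ]; then            # single byte, plain ASCII
    cp=$b1
  elif [ "$b1" -lt 224 ]; then          # start byte 110xxxxx, two bytes
    cp=$(( ((b1 & 0x1F) << 6) | (0x$2 & 0x3F) ))
  elif [ "$b1" -lt 240 ]; then          # start byte 1110xxxx, three bytes
    cp=$(( ((b1 & 0x0F) << 12) | ((0x$2 & 0x3F) << 6) | (0x$3 & 0x3F) ))
  else                                  # start byte 11110xxx, four bytes
    cp=$(( ((b1 & 0x07) << 18) | ((0x$2 & 0x3F) << 12) \
         | ((0x$3 & 0x3F) << 6) | (0x$4 & 0x3F) ))
  fi
  printf '%04X\n' "$cp"
)
```

For example, with the two characters used above...
   utf8_encode 015B        prints  C5 9B
   utf8_decode E2 80 9C    prints  201C

-------------------------------------------------------------------------------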