Internationalisation FAQs
What can I use for a Unicode editor?
Microsoft Word has supported Unicode since Word 97. GNU Emacs can be configured to support Unicode as well. And if you are desperate, Notepad will do the job.
The Sun JDK includes an encoding converter called native2ascii, which rewrites non-ASCII characters as \uXXXX escapes. Its -reverse switch performs the opposite conversion.
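To make the \uXXXX escape form concrete, here is a minimal Python sketch of the same idea. It is not the JDK tool itself: the function names to_ascii_escapes and from_ascii_escapes are made up for this illustration, and details of the real tool's output (such as the case of the hex digits) may differ.

```python
import re

def to_ascii_escapes(text):
    """Replace every non-ASCII character with \\uXXXX escapes (illustrative only)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                      # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)          # one escape for BMP characters
        else:
            # Characters beyond U+FFFF are written as a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)

def from_ascii_escapes(text):
    """Turn \\uXXXX escapes back into characters (illustrative only)."""
    raw = re.sub(r"\\u([0-9a-fA-F]{4})",
                 lambda m: chr(int(m.group(1), 16)), text)
    # Recombine any surrogate pairs into single characters.
    return raw.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

For example, to_ascii_escapes("café") gives the ASCII-only string caf\u00e9, and from_ascii_escapes turns it back into the original text.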
What are the "typical" internationalisation issues?
Text isolation
Date, currency and number formatting, which differ between locales
Encoding methods (Unicode, DBCS, wide characters, multibyte, ASCII)
Tools and libraries
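As an illustration of the formatting issue in the list above, the following Python sketch renders the same number and date under three conventions. The CONVENTIONS table and the function names are hand-written assumptions for this example only; a real application would take these rules from a locale library rather than hard-coding them.

```python
from datetime import date

# Hypothetical, hand-written table for illustration; not taken from any library.
CONVENTIONS = {
    "en_US": {"thousands": ",", "decimal": ".", "date": "%m/%d/%Y"},
    "en_GB": {"thousands": ",", "decimal": ".", "date": "%d/%m/%Y"},
    "de_DE": {"thousands": ".", "decimal": ",", "date": "%d.%m.%Y"},
}

def format_number(value, locale_name):
    """Format a number with two decimals using the locale's separators."""
    conv = CONVENTIONS[locale_name]
    text = f"{value:,.2f}"                      # e.g. "1,234,567.89"
    # Swap separators via a placeholder so the two replacements do not collide.
    return (text.replace(",", "\0")
                .replace(".", conv["decimal"])
                .replace("\0", conv["thousands"]))

def format_date(day, locale_name):
    """Format a date in the locale's customary numeric order."""
    return day.strftime(CONVENTIONS[locale_name]["date"])
```

With this table, format_number(1234567.891, "de_DE") yields 1.234.567,89 while "en_US" yields 1,234,567.89, and the date 31 December 1999 comes out as 12/31/1999, 31/12/1999 or 31.12.1999 depending on the locale.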
Are there other cultural and usability issues?
Cultural issues like colour, use and perception of metaphors, teaching methods, eye movement patterns, etc.
GUI icons - An icon that is obviously a mailbox or a phone booth to an American may look like a porta-pottie or a cow barn to someone from Europe or Asia.
Dialog items may need to be moved around so that they are more logically located when localising for a right-to-left reading culture.
Are there any Standard Conventions for Internationalisation?
Measurement System
m, kg, s, A, K, mol, cd
ISO 1000, ISO 31
http://physics.nist.gov/cuu/Units/
SI units are now used universally, with the exception of the USA. Britain is now predominantly metric, with very few exceptions (pints in pubs, miles on street signs, and some older people who still think of body weight in stones). One of the few areas the metric system has yet to reach is typography, where doing everything in millimetres would be a significant simplification, but the stubbornness of US software vendors currently prevents this. Even though SI units are widely known in the US, the associated style conventions for their usage are not; commonly seen wrong notations include KPH for km/h, Kgs for kg, secs for s, and KHz for kHz.
Paper Format
A4, ISO 216
http://www.cl.cam.ac.uk/~mgk25/iso-paper.html
Dominant office paper format absolutely everywhere on this planet, except for the USA and Canada.
Time Format
00:00 - 23:59
ISO 8601
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
Used in most countries, used only partially in English-speaking countries, used relatively little in the US (where a strange am/pm convention dominates, even though the 23:59 notation is widely known and understood).
Date Format
1999-12-31
ISO 8601
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
Used in some countries, though far less widely than the 23:59 time notation. It has recently become the official date notation (or at least an officially recognised alternative notation) in most European countries, and it is widely used in parts of Asia as well.
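The two ISO 8601 notations above (23:59 and 1999-12-31) are easy to produce and parse with a standard library. A short Python sketch follows; the helper names iso_date, iso_time and parse_iso are chosen here for illustration and are not part of any standard.

```python
from datetime import datetime

def iso_date(moment):
    """Return the date as YYYY-MM-DD."""
    return moment.strftime("%Y-%m-%d")

def iso_time(moment):
    """Return the time of day on the 24-hour clock as HH:MM."""
    return moment.strftime("%H:%M")

def parse_iso(text):
    """Parse a 'YYYY-MM-DD HH:MM' string back into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M")

# The last minute of 1999, matching the examples in the entries above.
last_minute = datetime(1999, 12, 31, 23, 59)
```

Because the fields run from most to least significant, dates written this way sort chronologically when compared as plain strings.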
Telephone numbers
+44 1223 334676
ITU-T Recommendation E.123
Very widely recognised today in international business correspondence, especially since it has become the notation required to appear on fax transmissions.
Character Set
ISO 10646/Unicode
http://www.unicode.org/
Today the dominant character set in non-Latin information processing (mostly thanks to Microsoft Windows). Not yet universally used because of numerous legacy applications, but very likely to replace all other character set standards in the foreseeable future (~10 years). Most Unicode applications will handle only subsets of the standard, probably forever, and the old character set compatibility problem will turn into a Unicode subset incompatibility problem (which should be orders of magnitude less troublesome).
Graphical Icons
ISO 7000, ISO 3864, IEC 417, UN road sign guidelines
Some of these are extremely widely known, at least in industrialised countries (e.g. the IEC 417 symbols on tape recorders, copying machines and telephones, and the ISO 3864 warning symbols for ionising radiation, biohazards, flammable materials, poisons and explosives).
How do I determine byte ordering?
U+FEFF is known as the "byte-order mark". It can be placed at the front of a UTF-16 (or deprecated UCS-2) encoded file to give a hint to the byte order. If the first 16-bit value you read is xFEFF, the data is in your own byte order. If you read xFFFE (the byte-swapped mark), you can deduce that the data is in the opposite byte order to yours. For example, if the octets FF FE appear at the beginning of a data stream, you can assume that the system providing the stream is little-endian. The character is still U+FEFF; only its serialised form is xFFFE. This is an important distinction, because U+FFFE itself is a non-character. When U+FEFF appears in the middle of a data stream it is not a byte-order mark but ZERO WIDTH NO-BREAK SPACE.
Alternatively, you can exchange data in UTF-8, which is defined as a sequence of octets and so has no byte-order problem.
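The check can be sketched in a few lines of Python. This assumes the data is already known to be UTF-16 (a UTF-32 little-endian stream begins with the same two octets), and the function names are invented for this example. Falling back to big-endian when no mark is present follows the Unicode default for unmarked UTF-16, but treat that choice as an assumption of the sketch.

```python
import codecs

def detect_utf16_byte_order(data):
    """Return 'big', 'little', or None if the data carries no byte-order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):    # octets FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):    # octets FF FE
        return "little"
    return None

def decode_utf16(data, default="big"):
    """Decode UTF-16 data, using the mark if present and a default otherwise."""
    order = detect_utf16_byte_order(data)
    if order is None:
        order, body = default, data             # no mark: fall back to the default
    else:
        body = data[2:]                         # skip the two-octet mark
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return body.decode(codec)
```

For instance, the octets FF FE 68 00 69 00 are detected as little-endian and decode to the text "hi".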
The i18n-prog homepage is at:
http://www.acoin.com/i18n/i18n-prog.htm