Sorting
Sorting data is a common task for applications, and it is highly locale-dependent. The sort order can change from one locale to another even when both use the same script, and different scripts may define sorting in very different ways. You must also be aware of what you are sorting: the expected order of things may vary from one culture to another.
Rely as much as possible on the system to do the work. The first step toward doing this is to set the locale properly. On the Macintosh, use the resources of type 'itl2'. To compare strings, use functions such as LCMapString in Windows (with the LCMAP_SORTKEY flag) or strxfrm in C, which transform strings into sort keys.
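As a minimal sketch of letting the system do the work, the following uses Python's standard locale module, whose locale.strxfrm wraps the C function of the same name. The helper name sort_with_locale is hypothetical, and locale names such as "de_DE.UTF-8" are platform-dependent assumptions: they must be installed on the system, otherwise locale.setlocale raises locale.Error.

```python
import locale


def sort_with_locale(words, locale_name):
    """Sort words with the collation rules of the named system locale.

    sort_with_locale is a hypothetical helper for this sketch; the locale
    name must exist on the system or locale.Error is raised.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a sort key that can be compared directly.
        return sorted(words, key=locale.strxfrm)
    finally:
        # Restore the previous collation locale so callers are not affected.
        locale.setlocale(locale.LC_COLLATE, previous)
```

Calling sort_with_locale(words, "C") falls back to plain code-point order, while a name such as "de_DE.UTF-8" applies whatever German rules the platform ships. The result therefore depends on the system's locale data.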
Here are a few cases of collation issues:
Some single characters may need to be sorted as two characters
For example, in German phone-book ordering the 'ä' is treated as the two letters 'ae' together, so the word Bäcker sorts between Baden and Bahn. Likewise, the character 'ß' is treated as the two letters 'ss' together.
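The expansion of one character into two can be sketched as a sort-key function. This is only an illustration: the EXPANSIONS table and the name german_phonebook_key are assumptions made for this example, not a complete German tailoring, and a real application should prefer the system's collation services.

```python
import unicodedata

# Illustrative expansion table (an assumption for this sketch, not a full tailoring).
EXPANSIONS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def german_phonebook_key(word):
    """Return a sort key in which umlauts and 'ß' expand to two letters."""
    # Normalize to NFC so that 'a' + combining diaeresis becomes a single 'ä'.
    composed = unicodedata.normalize("NFC", word)
    primary = composed.lower().translate(EXPANSIONS)
    # The original word is kept as a tie-breaker for otherwise equal keys.
    return (primary, word)


print(sorted(["Bahn", "Bäcker", "Baden"], key=german_phonebook_key))
# prints ['Baden', 'Bäcker', 'Bahn']
```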
Some groups of two characters may need to be sorted as one character
In traditional Spanish ordering, the pair of letters 'ch' is treated as a single letter that sorts after 'c' and before 'd'. This rule was changed in Spain in 1994: 'ch' is now considered as two distinct letters, going back to the old tradition of before 1803. However, the traditional rule still applies in some other Spanish-speaking countries. The pair 'll' and even 'rr' also have specific behaviors to consider when sorting.
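A contraction of this kind can be sketched by splitting each word into collation elements before ranking them. The alphabet table and the name traditional_spanish_key are assumptions made for this example. The sketch handles only 'ch' and 'll', expects unaccented words, and ranks any unknown character after 'z'.

```python
# Traditional alphabet in which 'ch' and 'll' count as single letters
# (illustrative table for this sketch only).
TRADITIONAL_ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: index for index, letter in enumerate(TRADITIONAL_ALPHABET)}


def traditional_spanish_key(word):
    """Return a list of letter ranks, treating 'ch' and 'll' as one letter each."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])  # contraction: two characters, one letter
            i += 2
        else:
            key.append(RANK.get(text[i], len(RANK)))  # unknown characters sort last
            i += 1
    return key


words = ["cuna", "chico", "dama", "cama"]
print(sorted(words, key=traditional_spanish_key))  # prints ['cama', 'cuna', 'chico', 'dama']
print(sorted(words))                               # prints ['cama', 'chico', 'cuna', 'dama']
```

The second call shows the order once 'ch' is treated as two distinct letters.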
Punctuation within words
Punctuation marks, such as hyphens, are usually ignored. However, they should still be taken into account, because a hyphen may be the only difference between two words. For example, the English black-bird comes after blackbird but before blackboard.
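One way to get this behavior, sketched here under simplifying assumptions, is a multi-level key: compare the letters first with punctuation removed, and use the punctuation only to break ties. The function name ignore_punctuation_key is hypothetical, and breaking ties by the count of punctuation characters is a choice made for this example.

```python
import unicodedata


def ignore_punctuation_key(word):
    """Return a key that ignores punctuation first, then uses it as a tie-breaker."""
    # Unicode general categories starting with 'P' are punctuation ('-' is Pd).
    letters = "".join(ch for ch in word if not unicodedata.category(ch).startswith("P"))
    punctuation_count = len(word) - len(letters)
    # Words without punctuation sort before otherwise identical words with it.
    return (letters.lower(), punctuation_count, word)


words = ["blackboard", "black-bird", "blackbird"]
print(sorted(words, key=ignore_punctuation_key))
# prints ['blackbird', 'black-bird', 'blackboard']
```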
Diacritical marks are important
There are many cases in languages where diacritical marks are the only elements used to differentiate between words. For example, the French forêt (forest) and foret (drill), or cote (quotation), côte (rib) and côté (side).
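A rough sketch of accent-aware ordering compares base letters first and lets the diacritical marks decide only between otherwise identical words. The name accent_aware_key is hypothetical, and using the decomposed string as the second level is a simplification, not the full Unicode Collation Algorithm.

```python
import unicodedata


def accent_aware_key(word):
    """Return a two-level key: base letters first, diacritical marks second."""
    # NFD splits 'ê' into 'e' followed by a combining circumflex.
    decomposed = unicodedata.normalize("NFD", word.lower())
    # combining() is non-zero for combining marks, so this keeps base letters only.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, decomposed)


words = ["côté", "forum", "forêt", "cote", "foret", "côte"]
print(sorted(words, key=accent_aware_key))
# prints ['cote', 'côte', 'côté', 'foret', 'forêt', 'forum']
```

By contrast, a plain code-point sort of precomposed (NFC) strings would place forêt after forum, because 'ê' has a higher code point than any unaccented letter.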
Sorting Ideographs
There are different ways to sort ideographic characters:
By radical, then by the number of strokes.
By the number of strokes, then by radical.
By phonetics (pronunciation).
By code-point values.
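The orderings listed above can give different results for the same characters, as the following sketch shows. SAMPLE_DATA is a tiny hand-entered table of Kangxi radical numbers and total stroke counts, included only for illustration. A real implementation would take this data from a source such as the Unicode Han Database (its kRSUnicode and kTotalStrokes fields). Phonetic ordering is left out because it needs pronunciation data.

```python
# character -> (Kangxi radical number, total stroke count)
# Small illustrative sample only; real data would come from a character database.
SAMPLE_DATA = {
    "人": (9, 2),
    "休": (9, 6),
    "山": (46, 3),
    "明": (72, 8),
    "木": (75, 4),
}


def radical_then_strokes(ch):
    """Key for ordering by radical first, then by number of strokes."""
    return SAMPLE_DATA[ch]


def strokes_then_radical(ch):
    """Key for ordering by number of strokes first, then by radical."""
    radical, strokes = SAMPLE_DATA[ch]
    return (strokes, radical)


chars = ["木", "休", "明", "人", "山"]
print(sorted(chars, key=radical_then_strokes))  # prints ['人', '休', '山', '明', '木']
print(sorted(chars, key=strokes_then_radical))  # prints ['人', '山', '木', '休', '明']
print(sorted(chars, key=ord))                   # prints ['人', '休', '山', '明', '木']
```

For this sample, code-point order matches the radical-then-strokes order. That is not guaranteed for arbitrary characters.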
Also see the SORTID reference table for Windows, and a code sample to sort items in a list box.