
Sorting

Sorting data is a common task for applications, and it is highly locale-dependent. The sort order can change from one locale to another even when both use the same script, and different scripts may define sorting in very different ways. You must also be aware of what you are sorting: the expected order of things may vary from one culture to another.

Rely on the system as much as possible to do the work. The first step is to set the locale properly. On the Macintosh, use the resources of type 'itl2'. To compare strings, use functions that convert strings into sort keys, such as LCMapString (with the LCMAP_SORTKEY flag) on Windows or strxfrm in C.
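As a minimal sketch of the sort-key approach, the Python snippet below reaches the C library's strxfrm through Python's standard locale module. The helper name sort_with_system_collation is hypothetical. The resulting order depends on which locales are installed on the machine, so no particular output is assumed here.

```python
import locale


def sort_with_system_collation(words, locale_name=""):
    """Sort words with the collation rules of locale_name.

    An empty locale_name asks the system for the user's default locale.
    (Hypothetical helper, for illustration only.)
    """
    # Remember the current collation setting so it can be restored.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The requested locale is not installed: keep the current one.
        pass
    try:
        # strxfrm turns each string into a sort key; comparing the keys
        # is equivalent to comparing the strings with strcoll.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


print(sort_with_system_collation(["pear", "apple", "fig"]))
```

Computing each key once and sorting on the keys is usually cheaper than calling a locale-aware comparison for every pair of strings.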

Here are a few cases of collation issues:

Some single characters may need to be sorted as two characters

For example, in German phone-book ordering the 'ä' is sorted as the two letters 'ae', so the word Bäcker falls between Baden and Bahn. (German dictionary ordering instead sorts 'ä' like 'a'.) Likewise, the character 'ß' is sorted as the two letters 'ss', etc.
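The expansion can be sketched as a sort key that replaces each special character with its two-letter form before comparing. This is only an illustration, assuming phone-book style ordering; the table and the function name german_phonebook_key are hypothetical, and a real application should prefer the system's collation.

```python
import unicodedata

# Expansion table (assumption: German phone-book style ordering).
GERMAN_EXPANSIONS = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
)


def german_phonebook_key(word):
    """Return a key in which each umlaut or 'ß' counts as two letters."""
    # NFC makes sure 'a' + combining diaeresis becomes the single 'ä'.
    composed = unicodedata.normalize("NFC", word.lower())
    return composed.translate(GERMAN_EXPANSIONS)


words = ["Bahn", "Bäcker", "Baden"]
print(sorted(words, key=german_phonebook_key))
# → ['Baden', 'Bäcker', 'Bahn']
```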

Some groups of two characters may need to be sorted as one character

In traditional Spanish sorting, the pair of letters 'ch' is treated as a single letter that sorts after 'c'. This rule was changed in Spain in 1994: 'ch' is now considered two distinct letters, a return to the tradition that preceded 1803. However, the traditional rule still applies in some other Spanish-speaking countries. The pairs 'll' and even 'rr' also have specific behaviors to consider when sorting.
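A contraction such as the traditional 'ch' can be sketched by splitting each word into collation elements, where a digraph counts as one element with its own rank. The alphabet table and the name traditional_spanish_key are assumptions made for this example; accented vowels and 'rr' are not handled.

```python
# Traditional alphabet in which 'ch' and 'll' are letters of their own
# (assumption for this sketch; accented vowels are left out).
TRADITIONAL_ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l",
    "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w",
    "x", "y", "z",
]
RANK = {letter: index for index, letter in enumerate(TRADITIONAL_ALPHABET)}


def traditional_spanish_key(word):
    """Return a list of ranks, reading 'ch' and 'll' as single letters."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            # Characters missing from the table sort after all letters.
            key.append(RANK.get(text[i], len(RANK)))
            i += 1
    return key


words = ["cuna", "chico", "dedo", "cama"]
print(sorted(words, key=traditional_spanish_key))
# → ['cama', 'cuna', 'chico', 'dedo']
print(sorted(words))  # 'ch' read as two distinct letters
# → ['cama', 'chico', 'cuna', 'dedo']
```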

Punctuation within words

Punctuation marks, such as hyphens, are usually ignored. However, a hyphen should still be taken into account when it is the only difference between two words. For example, in English, black-bird comes after blackbird but before blackboard.
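One way to get this behavior is a two-level key: compare the letters first, and fall back on the punctuation only to break ties. The sketch below is an illustration under that assumption; the name punctuation_tiebreak_key is hypothetical.

```python
def punctuation_tiebreak_key(word):
    """Compare letters first; punctuation only decides between equal words."""
    # First level: the word with every non-alphanumeric character removed.
    letters = "".join(ch for ch in word if ch.isalnum())
    # Second level: how many characters were ignored, so that the
    # unpunctuated spelling comes first. Third level: the original text.
    ignored = len(word) - len(letters)
    return (letters.lower(), ignored, word)


words = ["blackboard", "black-bird", "blackbird"]
print(sorted(words, key=punctuation_tiebreak_key))
# → ['blackbird', 'black-bird', 'blackboard']
```

A plain code-point sort would instead put black-bird first, because the hyphen has a lower code than any letter.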

Diacritical marks are important

In many languages, diacritical marks are the only elements that differentiate between words. For example, the French forêt (forest) and foret (drill), or cote (quotation), côte (rib) and côté (side).
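A rough sketch of this idea compares the base letters first and uses the accents only as a tie-breaker. It relies on Unicode decomposition from Python's unicodedata module; the name accent_aware_key is hypothetical, and the real rules of a given language (French included) are more subtle than this simplification.

```python
import unicodedata


def accent_aware_key(word):
    """Compare base letters first, then let accents break ties."""
    # NFD splits 'ô' into 'o' followed by a combining circumflex.
    decomposed = unicodedata.normalize("NFD", word.lower())
    # First level: the letters with all combining marks removed.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Second level: the decomposed text, so unaccented forms come first.
    return (base, decomposed)


words = ["côté", "coter", "cote", "côte"]
print(sorted(words, key=accent_aware_key))
# → ['cote', 'côte', 'côté', 'coter']
print(sorted(words))  # code-point order separates the related words
# → ['cote', 'coter', 'côte', 'côté']
```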

Sorting Ideographs

There are different ways to sort ideographic characters:

By radical, then by the number of strokes.
By the number of strokes, then by radical.
By phonetics (pronunciation).
By code-point values.
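The first two orderings differ only in which field of the key comes first, as the sketch below tries to show. The table is a tiny hand-written sample (Kangxi radical number and total stroke count for six characters), and the function names are hypothetical; a real application would take this data from the system or from a character database. Phonetic ordering needs pronunciation data and is not shown.

```python
# Sample table: character -> (Kangxi radical number, total strokes).
# Hand-written for illustration; not a complete data source.
RADICAL_STROKES = {
    "人": (9, 2),
    "休": (9, 6),
    "口": (30, 3),
    "山": (46, 3),
    "林": (75, 8),
    "森": (75, 12),
}


def by_radical_then_strokes(ch):
    radical, strokes = RADICAL_STROKES[ch]
    return (radical, strokes)


def by_strokes_then_radical(ch):
    radical, strokes = RADICAL_STROKES[ch]
    return (strokes, radical)


chars = ["森", "口", "休", "山", "人", "林"]
print(sorted(chars, key=by_radical_then_strokes))
# → ['人', '休', '口', '山', '林', '森']
print(sorted(chars, key=by_strokes_then_radical))
# → ['人', '口', '山', '休', '林', '森']
print(sorted(chars, key=ord))  # by code-point value
```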

Also see the SORTID reference table for Windows, and a code sample to sort items in a list box.


