Sorting strings in Python is something that you can almost take for granted if you do it all the time. Yet, it can become surprisingly tricky considering all the edge cases lurking in the vast Unicode standard. Understanding Unicode isnโt an easy feat, so prepare yourself for a whirlwind tour of surprising edge cases and effective ways of dealing with them.
Note: In this tutorial, youโll often use the term Unicode string to mean any string with non-Latin letters and characters like emoji. However, strings consisting of Basic Latin letters also fall into this category because the underlying ASCII table is a subset of Unicode.
Python pioneered and popularized a robust sorting algorithm called Timsort, which now ships with several major programming languages, including Java, Rust, and Swift. When you call sorted() or list.sort(), Python uses this algorithm under the surface to rearrange elements. As long as your sequence contains comparable elements of compatible types, then you can sort numbers, strings, and other data types in the expected order:
>>> math_constants = [6.28, 2.72, 3.14]
>>> sorted(math_constants)
[2.72, 3.14, 6.28]
>>> fruits = ["orange", "banana", "lemon", "apple"]
>>> sorted(fruits)
['apple', 'banana', 'lemon', 'orange']
>>> people = [("John", "Doe"), ("Anna", "Smith")]
>>> sorted(people)
[('Anna', 'Smith'), ('John', 'Doe')]
Unless you specify otherwise, Python sorts these elements by value in ascending order, comparing them pairwise. Each pair must contain elements that are comparable using either the less than (<) or greater than (>) operator. Sometimes, these comparison operators are undefined for specific data types like complex numbers or between two distinct types. In these cases, a comparison will fail:
>>> 4 + 2j < 2 + 4j
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'complex' and 'complex'
>>> 3.14 < "orange"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'float' and 'str'
In most cases, youโll be working with homogeneous sequences comprising elements of the same type, so youโll rarely run into this problem in the wild. However, things may start to fall apart once you throw strings with non-Latin characters into the mix, such as letters with diacritics or accents:
>>> polish_names = ["Zbigniew", "Ludmiลa", "ลปaneta", "ลukasz"]
>>> sorted(polish_names)
['Ludmiลa', 'Zbigniew', 'ลukasz', 'ลปaneta']
In this example, you run into a common challenge associated with sorting strings. When a string contains characters whose ordinal values extend beyond the usual ASCII range, then you might get unexpected results like what you have here. The name ลukasz ends up after Zbigniew, even though the letter ล (pronounced like a w sound in English) occurs earlier than Z in the Polish alphabet. Whyโs that?
Python sorts strings lexicographically by comparing Unicode code points of the individual characters from left to right. It just so happens that the letter ล has a higher ordinal value in Unicode than the letter Z, making it greater than any of the Latin letters:
>>> ord("ล")
321
>>> ord("Z")
90
>>> any("ล" < letter for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
False
Code points are unique yet arbitrary identifiers of Unicode characters, and they donโt inherently agree with the alphabetical order of spoken languages. So, lexicographic sorting wonโt be appropriate for languages other than English.
Different cultures follow different rules for sorting strings, even when they share the same alphabet or parts of it with other cultures. For example, the ch digraph is considered two separate letters (c and h) in Polish, but it becomes a stand-alone letter placed between h and i in the Czech alphabet. This is known as a contraction in Unicode.
Note: Language rules occasionally change. For instance, the Spanish alphabet treated the ch and ll digraphs as single letters until 1994, when the official language regulation institution, RAE, declared them separate.
At some point, the accent ordering in French changed from backward to forward everywhere in the world except for Canada, which still sticks to the old tradition:
| Accent Ordering | |
|---|---|
| France | cote, cotรฉ, cรดte, cรดtรฉ |
| Canada | cote, cรดte, cotรฉ, cรดtรฉ |
In modern French, words are compared from left to right. The earliest letter with an accent, which is รด in the example above, corresponds to a greater sort key, pushing the word to the end of the list. On the other hand, in Canada, words are compared backward from right to left, so the last accentโs position determines the final order.
Depending on their geographic location worldwide, individuals speaking the same language may expect a slightly different sorting order.
Moreover, the sorting order can sometimes differ within the same culture, depending on the context. For example, most German phone books tend to treat letters with an umlaut (รค, รถ, รผ) similarly to the ae, oe, and ue letter combinations. However, other countries overwhelmingly treat these letters the same as their Latin counterparts (a, o, u).
Thereโs no universally correct way to sort Unicode strings. You need to tell Python which rules to apply to get the desired ordering. So, how do you sort Unicode strings alphabetically in Python?
Get Your Code: Click here to download the free sample code that shows you how to sort Unicode strings alphabetically with Python.
How to Sort Strings Using the Unicode Collation Algorithm (UCA)
The problem of sorting Unicode strings isnโt unique to Python. Itโs a common challenge in any programming language or database. To address it, Technical Report #10 in the Unicode Technical Standard (UTS) describes the collation of Unicode strings, which is a consistent way of comparing two strings to establish their sorting order.
The Unicode Collation Algorithm (UCA) assigns a hierarchy of numeric weights to each character, allowing the creation of binary sort keys that account for accents and other special cases. These keys are defined at four levels that determine various features of a character:
- Primary: The base letter
- Secondary: The accents
- Tertiary: The letter case
- Quaternary: Other features
Later in this tutorial, youโll learn how to leverage these weight levels to customize the Unicode collation algorithm by, for example, ignoring the letter case in case-insensitive sorting.
Note: This hierarchical nature of character weights allows you to compare sort keys incrementally to increase performance. It also helps conserve memory. For example, some implementations of the UCA take advantage of the trie data structure to efficiently store and retrieve the weights for a given string.
While the UCA supplies the Default Unicode Collation Element Table (DUCET), you should generally customize this default collation table to the specific needs of a particular language and application. Itโs virtually impossible to ensure the desired sort order for all languages using only one character table. Therefore, software libraries implementing the UCA usually rely on the Common Locale Data Repository (CLDR) to provide such customization.
This repository contains several XML documents with language-specific information. For example, the collation rules for sorting text in Polish explain the relationship between the letters z, ลบ, and ลผ in both uppercase and lowercase:
&Z<ลบ<<<ลน<ลผ<<<ลป
Without going into the technical details of the collation rule syntax, you can observe that z comes before ลบ, which comes before ลผ. The same rule applies regardless of whether the letters are uppercase or lowercase.
To use the Unicode Collation Algorithm in Python, you can install the pyuca library into your virtual environment. While this library is slightly obsolete and only supports Unicode up to version 10, you can download the latest DUCET table from Unicodeโs official website and supply the corresponding filename when creating a collator object:
>>> import pyuca
>>> collator = pyuca.Collator("allkeys-15.0.0.txt") # Unicode version 15.0.0
>>> collator.sort_key("zb")
(9139, 8397, 0, 32, 32, 0, 2, 2, 0)
>>> collator.sort_key("ลผa")
(9139, 8371, 0, 32, 46, 32, 0, 2, 2, 2, 0)
>>> collator.sort_key("zb") < collator.sort_key("ลผa")
False
Alternatively, you can create a collator instance without specifying the file, in which case itโll use an older version that was distributed with the library. Calling the collatorโs .sort_key() method on a string reveals a tuple of weights that you can use in a comparison.
Notice that weights at the first index in both tuples are identical, meaning that the letters z and ลผ are essentially treated as equal, disregarding accents. This is an intended behavior of the default collation table in UCA when you leave it without customization, but as youโll soon learn, this isnโt ideal.
The second weight, which corresponds to the following letter, determines the sorting order of both strings. The remaining weights are irrelevant at this point.
You can sort a sequence of Unicode strings using pyuca by providing the reference to your collatorโs .sort_key() method as the key function:
>>> polish_names = ["Zbigniew", "Ludmiลa", "ลปaneta", "ลukasz"]
>>> sorted(polish_names, key=collator.sort_key)
['Ludmiลa', 'ลukasz', 'ลปaneta', 'Zbigniew']
Remember to omit the parentheses at the end of the method name to avoid calling it! You want to pass a reference to a function or method that Python will later call for each item in the list. If you were to call your method here, then it would execute immediately, returning a single value instead of the callable object that sorted() expects.
As you can see, the default collation table in UCA doesnโt conform to the Polish language rules, as the name ลปaneta should sort after Zbigniew. Unfortunately, pyuca doesnโt let you specify custom rules to alleviate that. While you could manually try editing the downloaded text file before supplying it to the collator, itโs a tedious and error-prone task.
Not only is pyuca limited in this regard, but it also hasnโt been actively maintained for a few years now. As a pure-Python library, it might negatively affect the performance of more demanding applications. Overall, youโre better off finding an alternative tool like PyICU, which is a Python binding to IBMโs open-source International Components for Unicode (ICU) library implemented in C++ and Java.
Note: PyICU is only available as a source distribution, so you need to have a C++ compiler and, on select platforms, a few third-party dependencies to install it. Follow the official instructions in the README file if you face any difficulties.
Once you have PyICU installed, getting started is pretty straightforward and doesnโt look much different than using pyuca. Hereโs an example of how you can create a collator and supply an optional set of rules specific to a given language:
>>> from icu import Collator, Locale
>>> collator = Collator.createInstance(Locale("pl_PL"))
>>> sorted(polish_names, key=collator.getSortKey)
['Ludmiลa', 'ลukasz', 'Zbigniew', 'ลปaneta']
In this case, you use the Polish flavor of the Unicode collation rules, which results in the expected sort order of names. To get a complete list of all available languages, use the following code snippet:
>>> from icu import Locale
>>> Locale.getAvailableLocales().keys()
['af',
'af_NA',
'af_ZA',
'agq',
'agq_CM',
โฎ
'zu',
'zu_ZA']
These languages are built into the library and shipped with it using a binary format, independently of your operating system, which may support a different set of languages. You can use any of these to customize your collator instance.
Additionally, you may build a collator from scratch with entirely custom rules expressed in a format similar to the one youโve seen in the XML file before:
>>> from icu import RuleBasedCollator
>>> collator = RuleBasedCollator(
... """
... &A<ฤ
<<<ฤ
... &C<ฤ<<<ฤ
... &E<ฤ<<<ฤ
... &L<ล<<<ล
... &N<ล<<<ล
... &O<รณ<<<ร
... &S<ล<<<ล
... &Z<ลบ<<<ลน<ลผ<<<ลป
... """
... )
>>> sorted(polish_names, key=collator.getSortKey)
['Ludmiลa', 'ลukasz', 'Zbigniew', 'ลปaneta']
Youโve created an instance of the RuleBasedCollator with the collation rules for sorting text in Polish, which you borrowed from the Common Locale Data Repository (CLDR). As a result, the Polish names are now arranged correctly.
This is barely scratching the surface of whatโs possible with PyICU. The library isnโt just for sorting Unicode strings. It can also help you develop multilingual applications for users worldwide.
The biggest downside of PyICU is its potentially cumbersome installation. Itโs also a relatively large library, both in terms of size on disk and an extensive API. If one of these becomes a deal-breaker for you, or if you canโt use any third-party libraries, then consider alternative options.
Next up, youโll learn about the locale module in Pythonโs standard library.
Leverage Pythonโs locale Module to Sort Unicode Strings
Pythonโs official sorting how-to guide recommends using the standard library module locale to respect cultural and geographical conventions when sorting strings. Itโs a Python interface to C localization functions that expose the regional settings on a POSIX-compatible operating system. Some of them include:
- Address and phone formats
- Currency symbol
- Date and time format
- Decimal symbol (for example, point vs comma)
- Language
- Measurement units
- Paper size (for example, A4 vs Letter)
These parameters are collectively known as locale. You can control them globally through a set of environment variables, like LC_ALL, which will apply to your entire operating system. Alternatively, you can selectively override some of them within your program when it runs.
Note: The concept of locale in computing is closely related to internationalization (i18n) and localization (L10n), which are about designing and adapting software to support multiple languages, respectively.
The locale is about so much more than a language. For example, English has several dialects spoken on many continents. Each dialect may have distinct traits, such as these:
| American | British | |
|---|---|---|
| Spelling | color | colour |
| Vocabulary | truck | lorry |
| Measurement System | imperial | metric |
| Currency | dollars | pounds |
| Pronunciation | uh-loo-muh-nuhm | al-yoo-min-ee-um |
Even regions of the same country sometimes exhibit slight variations. To account for these possible differences, a locale identifier consists of two mandatory and two optional parts:
- Language: A two-letter, lowercase language code (ISO 639)
- Territory: A two-letter, uppercase country code (ISO 3166)
- Charmap: An optional character encoding
- Modifier: An optional modifier for specific locale variations
For example, the identifier ca_ES.UTF-8@valencia represents the Catalan language spoken in Spain using the Valencian dialect with the UTF-8 character encoding. On the other hand, en_US is another valid identifier that corresponds to English spoken in the United States with your operating systemโs default character encoding.
If youโre on macOS or a Linux distribution, then you can check your current locale by typing the locale command at your command prompt:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
You get information about specific locale categories and their values. The LC_COLLATE category controls how strings are compared, which determines the sorting order of Unicode strings.
To reveal the complete list of available locales installed in your system, append the -a flag to the locale command:
$ locale -a
C
C.UTF-8
en_GB.utf8
en_US.utf8
pl_PL.utf8
POSIX
The C and POSIX locales are defaults defining a minimal set of regional settings that must comply with the C programming language on a POSIX-compliant system.
To install an additional locale, pick one from a list of the supported locales and then generate the corresponding localization files from the chosen template. If youโre on Linux, then you can list the supported locales by revealing the contents of a special file:
$ cat /usr/share/i18n/SUPPORTED | grep pt
pt_BR.UTF-8 UTF-8
pt_BR ISO-8859-1
pt_PT.UTF-8 UTF-8
pt_PT ISO-8859-1
pt_PT@euro ISO-8859-15
$ sudo locale-gen pt_BR.UTF-8
Generating locales (this might take a while)...
pt_BR.UTF-8... done
Generation complete.
These commands enable a new locale for Portuguese as spoken in Brazil. If youโd like your operating system to pick it up, then you must also install the corresponding language pack before changing the current locale:
$ sudo apt install language-pack-pt
$ export LC_ALL="pt_BR.UTF-8"
The package name that you should install may vary. For instance, the one above is suitable for Linux distributions based on Debian. After changing the current locale, most of the standard Unix commands will reflect that in their output:
$ cat no-such-file
cat: no-such-file: Arquivo ou diretรณrio inexistente
When you try to print the contents of a missing file with cat, you get an error message in Portuguese. Note that setting environment variables has only a temporary effect that will last until you close the current terminal session. To apply the new locale permanently, edit your shell configuration file accordingly.
Now, how do you control the locale in Python? Youโve already seen that the default lexicographic sorting in Python may lead to incorrect results. To fix that, you can try using locale.strxfrm(), which transforms a string into a locale-aware counterpart, as the sort key:
>>> import locale
>>> locale.getlocale()
('en_US', 'UTF-8')
>>> polish_names = ["Zbigniew", "Ludmiลa", "ลปaneta", "ลukasz"]
>>> sorted(polish_names) # Lexicographic sorting
['Ludmiลa', 'Zbigniew', 'ลukasz', 'ลปaneta']
>>> sorted(polish_names, key=locale.strxfrm)
['Ludmiลa', 'ลukasz', 'ลปaneta', 'Zbigniew']
Depending on your current locale, the list may end up sorted differently than without the custom key. However, the result can still be incorrect when you donโt specify the right locale. In such a case, Python relies on your operating systemโs default locale, which may be unsuitable for the given language.
Note: Youโll also find another function locale.strcoll(), which compares two strings according to the current locale. It used to be convenient in legacy Python versions, whose sorting functions accepted the cmp parameter, which was dropped in Python 3.
When sorting strings, you always want to override the collation rules to match the language in question:
>>> locale.setlocale(locale.LC_COLLATE, "pl_PL.UTF-8")
'pl_PL.UTF-8'
>>> sorted(polish_names, key=locale.strxfrm)
['Ludmiลa', 'ลukasz', 'Zbigniew', 'ลปaneta']
At last, you get the expected result by setting the collation category to the Polish locale.
Note: You must have the correct locale installed in your operating system before using it in your programs. Otherwise, Python will silently fall back to the operating systemโs default locale without raising an error:
>>> locale.setlocale(locale.LC_COLLATE, "pl_PL.UTF-8")
'pl_PL.UTF-8'
>>> sorted(polish_names, key=locale.strxfrm)
['Ludmiลa', 'Zbigniew', 'ลปaneta', 'ลukasz']
>>> locale.getlocale()
(None, 'UTF-8')
If the default locale is unset or doesnโt match the language that you wish to work with, then youโll still get an incorrect sorting order.
Unfortunately, calling locale.setlocale() is a global setting that wonโt help you with sorting words in more than one language simultaneously. It also isnโt thread-safe, so youโd typically invoke it once after starting your program, as opposed to handling the incoming HTTP requests in a multi-threaded web application, for example.
Apart from being able to handle at most one language at a time, locale will only work on POSIX-compliant systems, making it less portable than other solutions. It requires the correct locale to be installed in your system, and it supports ISO 14651, which is a subset of the complete Unicode Collation Algorithm. That said, it may be enough in more straightforward use cases.
Transliterate Foreign Characters Into Latin Equivalents
You donโt always care about precise collation rules of foreign languages. Instead, you may prefer to make the sort order appear correct to an English speaker by converting each string into a Latin script that best approximates the original. In linguistics, such conversion is known as the romanization of text, and it can take a few forms, including a combination of these:
- Transliteration: Swapping letters from one script with another
- Transcription: Representing a foreign script phonetically
The basic idea behind transliteration is to replace every non-ASCII letter with its closest Latin counterpart before comparing the strings. You can also use this technique to represent Unicode on a legacy system that only supports ASCII characters or to create a clean URL. Note that simply stripping any character beyond ASCII would alter the text, removing too much information.
Here are a few examples of words transliterated from different alphabets:
| Alphabet | Original | Transliteration |
|---|---|---|
| Cyrillic | ะะตััะพะณัะฐะด | Petrograd |
| Greek | ฮฮฝฯฯฯฯ | Knosos |
| Kurdish | Xirabreลkรช | Xirabreske |
| Norwegian | Svolvรฆr | Svolvaer |
| Polish | ลรณdลบ | Lodz |
This method works best with alphabets derived from Latin or Greek, as opposed to other alphabets like Arabic, Chinese, or Hebrew. To handle non-Roman languages, youโll need to devise a way of transcribing sounds associated with the visual symbols.
As an example, take a look at these city names and their transcriptions:
| Alphabet | Original | Transcription |
|---|---|---|
| Arabic | ุงูุฑูุงุถ | Riyadh |
| Chinese | ๅไบฌ | Beijing |
| Hebrew | ืึฐืจืึผืฉึธืืึทืึดื | Jerusalem |
| Japanese | ๆฑไบฌ | Dongjing (Tokyo) |
| Korean | ์์ธ | Seoul |
Transcribing a given text is inherently more complicated than transliterating it, so youโll only consider transliteration from now on.
One way to strip accents or diacritics from Latin letters is by taking advantage of Unicode equivalence, which states that alternative sequences of code points can encode semantically identical characters. Even if two Unicode strings look the same to a human reader, they may have different lengths and contents, affecting the sort order. For example, the letter รฉ has two canonically equivalent Unicode representations:

The representation on the left encodes the last letter in the word cafรฉ using Unicode code point U+00E9. Conversely, the representation on the right encodes the same letter using two code points: U+0065 for the Latin letter e, followed by U+0301, which is the acute accent that combines with the preceding letter. The combining characters like these donโt stand on their own but modify how other characters render.
Note: These alternative Unicode string representations as code points are independent of their encoded byte representations, such as UTF-8 or UTF-16.
The Unicode standard specifies four normal forms and the corresponding algorithms for converting between them through a process called Unicode normalization. In Python, you can use the standard-library module unicodedata to perform such a normalization:
>>> cafe_precomposed = "caf\N{latin small letter e with acute}"
>>> cafe_combining = "cafe\N{combining acute accent}"
>>> cafe_precomposed
'cafรฉ'
>>> cafe_combining
'cafeฬ'
>>> len(cafe_precomposed)
4
>>> len(cafe_combining)
5
>>> cafe_precomposed == cafe_combining
False
>>> import unicodedata
>>> decompose = lambda x: unicodedata.normalize("NFD", x)
>>> compose = lambda x: unicodedata.normalize("NFC", x)
>>> decompose(cafe_precomposed) == cafe_combining
True
>>> cafe_precomposed == compose(cafe_combining)
True
Notice that the two variables render the same glyphs when you evaluate them, but the underlying strings have different lengths and donโt compare as equal. Normalizing either of the strings into the common Unicode representation allows for an apples-to-apples comparison.
Note: Some fonts may miss certain glyphs or render them in a strange way. For example, if your font doesnโt contain either the precomposed or decomposed variant of the acute accent, then you wonโt see the whole picture:

In this case, even though the second string contains two separate code points representing the letter รฉ, it appears as a plain e on the screen. To learn more about fonts, check out Choosing the Best Coding Font for Programming.
Unicodeโs NFD form allows you to decompose a Unicode string into plain Latin letters while discarding any combining characters that may result from this normalization. To make your life easier, define the following helper function, which youโll evolve later:
>>> import unicodedata
>>> def transliterate_v1(text: str) -> str:
... return (
... unicodedata.normalize("NFD", text)
... .encode("ascii", errors="ignore")
... .decode("ascii")
... )
...
After decomposing a string, you .encode() it into a sequence of bytes while ignoring ordinal values that canโt fit in the ASCII range. Then, you .decode() the resulting bytes into a Python string again, which you return to the caller.
To verify if your function works as expected, you can test it against a few pangrams, which are humorous and often nonsensical sentences containing every letter from the given alphabet:
>>> pangrams = {
... "Czech": "Pลรญliลก ลพluลฅouฤkรฝ kลฏล รบpฤl ฤรกbelskรฉ รณdy",
... "Polish": "Pรณjdลบmyลผ haftnฤ
ฤ z wklฤsลoลci guberล",
... "Icelandic": "Kรฆmi nรฝ รถxi hรฉr, ykist รพjรณfum nรบ bรฆรฐi vรญl og รกdrepa",
... }
>>> for label, pangram in pangrams.items():
... print(label, pangram, transliterate_v1(pangram), "", sep="\n")
...
Czech
Pลรญliลก ลพluลฅouฤkรฝ kลฏล รบpฤl ฤรกbelskรฉ รณdy
Prilis zlutoucky kun upel dabelske ody
Polish
Pรณjdลบmyลผ haftnฤ
ฤ z wklฤsลoลci guberล
Pojdzmyz haftnac z wklesosci gubern
Icelandic
Kรฆmi nรฝ รถxi hรฉr, ykist รพjรณfum nรบ bรฆรฐi vรญl og รกdrepa
Kmi ny oxi her, ykist jofum nu bi vil og adrepa
You define and iterate through a Python dictionary of three pangrams before printing their transliterations on the screen. Unfortunately, while your solution works fine with most letters, it sometimes removes them instead of finding a replacement.
Apparently, certain letters with diacritics donโt have a corresponding combining character in Unicode. For example, the Polish letter ล is a standalone character with an inseparable stroke, making it already normalized:
>>> from unicodedata import decomposition, is_normalized, normalize
>>> polish_diacritics = "ฤฤฤลลรลลปลน"
>>> for letter in polish_diacritics:
... print(letter, is_normalized("NFD", letter), decomposition(letter))
...
ฤ False 0041 0328
ฤ False 0043 0301
ฤ False 0045 0328
ล True
ล False 004E 0301
ร False 004F 0301
ล False 0053 0301
ลป False 005A 0307
ลน False 005A 0301
>>> normalize("NFD", "ล")
'ล'
Trying to decompose ล yields the same letter because itโs already decomposed. You can mitigate this unwanted behavior by allowing your transliteration function to accept an optional character mapping to .translate() the string with:
>>> import sys
>>> import unicodedata
>>> def transliterate_v2(text: str, mapping: dict[str, str] = None) -> str:
... combining_characters = "".join(
... character
... for code_point in range(sys.maxunicode)
... if unicodedata.combining(character := chr(code_point))
... )
... if mapping:
... src, dst = ["".join(x) for x in zip(*mapping.items())]
... table = str.maketrans(src, dst, combining_characters)
... else:
... table = str.maketrans(dict.fromkeys(combining_characters))
... return unicodedata.normalize("NFD", text).translate(table)
...
You use a generator expression to prepare a string with all the combining characters in Unicode. If the caller provided a custom mapping, then you separate its keys and values into two strings called src and dst, which you pass to str.maketrans() to build a translation table. Otherwise, you build a translation table that just removes all combining characters. Finally, you normalize the text to the NFD form before applying the translation table.
This new function helps deal with special cases, but it doesnโt scale too well, as you still have to manually build the translation table yourself each time. Moreover, it doesnโt let you map a single code point, such as U+00E6 representing the letter รฆ, into more than one character, which might be necessary:
>>> transliterate_v2("ลukasz ลบle zลoลผyล ลรณลผko", {"ล": "L", "ล": "l"})
'Lukasz zle zlozyl lozko'
>>> transliterate_v2("Kรฆmi", {"รฆ": "ae"})
Traceback (most recent call last):
...
ValueError: the first two maketrans arguments must have equal length
To handle most corner cases of transliteration and even address elements of transcription, you can install Unidecode or another third-party library:
>>> import unidecode
>>> def transliterate_v3(text: str) -> str:
... return unidecode.unidecode(text)
...
>>> text = "Kรฆmi nรฝ รถxi hรฉr, ykist รพjรณfum nรบ bรฆรฐi vรญl og รกdrepa"
>>> print(text, transliterate_v3(text), sep="\n")
Kรฆmi nรฝ รถxi hรฉr, ykist รพjรณfum nรบ bรฆรฐi vรญl og รกdrepa
Kaemi ny oxi her, ykist thjofum nu baedi vil og adrepa
Unlike your earlier attempts at transliteration, calling the unidecode() function can make the resulting string longer than the original, so no information gets lost during the process.
Note: You can also do transliteration with the PyICU library that youโve seen before:
>>> import icu
>>> tr = icu.Transliterator.createInstance("Any-ASCII")
>>> tr.transliterate("Kรฆmi nรฝ รถxi hรฉr, ykist รพjรณfum nรบ bรฆรฐi vรญl og รกdrepa")
'Kaemi ny oxi her, ykist thjofum nu baedi vil og adrepa'
When creating a transliterator instance, you must provide the source and target writing script identifiers.
Now, if you want to sort a sequence of Unicode strings according to their transliterations but without altering them, then you can use your transliteration function as the sort key:
>>> words = ["wklฤsลoลci", "kรฆmi", "ลพluลฅouฤkรฝ", "รพjรณfum"]
>>> sorted(words, key=transliterate_v3)
['kรฆmi', 'รพjรณfum', 'wklฤsลoลci', 'ลพluลฅouฤkรฝ']
This wonโt give you an accurate result, but it should look right to most English-speaking audiences.
Another common challenge associated with sorting Unicode strings, which also requires Unicode normalization, is case-insensitive comparison. Itโs quite widespread in real-life scenarios, so youโll read about it now.
Perform Case-Insensitive Sorting of Unicode Strings
When you sort strings comprising only ASCII characters, then uppercase letters always come first because of their underlying ordinal values:
>>> animals = ["Turtle", "hedgehog", "monkey", "Hystrix", "fish"]
>>> sorted(animals)
['Hystrix', 'Turtle', 'fish', 'hedgehog', 'monkey']
Whether this is desired depends on the language and context. Either way, string sorting in Python is case-sensitive by default. To disregard the letter case when comparing strings, you can convert all letters to lowercase or uppercase and use them as the sort key:
>>> sorted(animals, key=str.lower)
['fish', 'hedgehog', 'Hystrix', 'monkey', 'Turtle']
Now, your strings are sorted alphabetically regardless of their letter case. Unfortunately, while this quick fix works for most English words, it breaks down when encountering accented letters like the French รฉ:
>>> animaux = ["Tortue", "hรฉrissonne", "รlรฉphant", "poisson", "รฉlรฉphant"]
>>> sorted(animaux, key=str.lower)
['hรฉrissonne', 'poisson', 'Tortue', 'รlรฉphant', 'รฉlรฉphant']
Under the surface, you still sort your strings lexicographically using character code points. So, even though you compare a lowercase letter รฉ against lowercase h, the latter comes first because it has a smaller ordinal value.
Itโs worth noting that the concept of uppercase and lowercase letters isnโt universal to all writing systems like it appears in the Latin alphabet. Besides, as you spend more time working with foreign texts, you may run into some really surprising edge cases in Unicode.
For instance, the German letter ร (sharp S) never occurs at the start of a word. Therefore, for a long time, it was only available as a lowercase letter without an uppercase variant in the alphabet:
>>> "ร".islower()
True
>>> "ร".upper()
'SS'
When you try uppercasing it in Python, you get a double-S digraph instead. But, some fonts do provide a capital form of sharp S, and Unicode even assigns a dedicated code point (U+1E9E) to it:
>>> capital_sharp_s = "\N{latin capital letter sharp s}"
>>> capital_sharp_s
'แบ'
>>> capital_sharp_s.lower()
'ร'
>>> capital_sharp_s.lower().upper() == capital_sharp_s
False
Nevertheless, you canโt restore its original capital form using chained calls to .lower() and .upper() in Python. At the same time, various dialects of the German language in countries like Switzerland donโt use sharp S at all.
Another interesting quirk is the Greek letter ฮฃ (sigma), which has two alternative lowercase forms depending on where you put it in a word:
>>> "ฮฃ".lower()
'ฯ'
>>> "ฮฮฃ".lower()
'ฮดฯ'
>>> "ฯ".upper() == "ฯ".upper()
True
At the end of a word, a lowercase sigma takes the final form ฯ, but everywhere else, it appears as the more traditional ฯ.
Uppercase versions of the Latin b, Greek ฮฒ, and Cyrillic ะฒ have an identical visual appearance despite having distinct code points:
>>> for lowercase in ("b", "ฮฒ", "ะฒ"):
... uppercase = lowercase.upper()
... print(uppercase, hex(ord(uppercase)))
...
B 0x42
ฮ 0x392
ะ 0x412
As you can see, there are far too many corner cases to try and cover them all by hand.
Fortunately, Python 3.3 introduced a new string method called str.casefold(), which adds support for Unicode case folding for the purpose of caseless comparison of text. According to the Unicode glossary, folding lets you temporarily ignore certain distinctions between characters. In general, case folding works similarly to lowercasing a string, but there are a few notable exceptions:
>>> "ฤ".lower(), "ฤ".casefold()
('ฤ
', 'ฤ
')
>>> "ร".lower(), "ร".casefold()
('ร', 'ss')
>>> "\u13ad".lower(), "\u13ad".casefold()
('๊ญฝ', 'แญ')
Special cases like the German sharp S are handled differently, letting you compare words like GROSS and groร as equal. Also, starting from Unicode version 8.0, Cherokee letters like U+13AD representing แญ fold to uppercase letters in order to give the stability guarantees required by case folding.
Thatโs great! Can you now use str.casefold() instead of str.lower() or str.upper() as your key function for case-insensitive sorting of Unicode strings in Python? Well, not exactly.
First of all, it wonโt address the problem associated with arbitrary ordinal values of accented letters. Secondly, the strings that you wish to compare must share a common Unicode representation. Because case folding doesnโt always preserve the normal form of the original string, you should generally perform the normalization after folding your strings.
So, to implement a case-insensitive key function for sorting strings, as specified by the Unicode Standard, you must combine str.casefold() with Unicode normalization:
>>> import functools
>>> import unicodedata
>>> def case_insensitive(text: str) -> str:
... nfd = functools.partial(unicodedata.normalize, "NFD")
... return nfd(nfd(text).casefold())
...
>>> case_insensitive("groร") == case_insensitive("GROSS")
True
>>> case_insensitive("groร") == case_insensitive("grรณb")
False
Using the functools module, you define a partial function nfd() that takes a string as an argument and applies the Unicode canonical decomposition to it. You then call this function twice to normalize the string before and after the case folding. Calling it twice acts as an extra safeguard against rare edge cases involving one particular combining character in Greek that requires special treatment to fully comply with the Unicode Standard.
Note: This code is adapted from Pythonโs official Unicode how-to guide, which explains how to correctly compare strings in a case-insensitive way. It follows the formal specification from Section 3.13 Default Case Algorithms in Chapter 3 of the Unicode Standard.
Having this new key function in place, you can now sort your French animals alphabetically while ignoring the letter case:
>>> sorted(animaux, key=case_insensitive)
['รlรฉphant', 'รฉlรฉphant', 'hรฉrissonne', 'poisson', 'Tortue']
Success! While exploring the quirks of Unicode is undoubtedly a great learning experience, you can achieve a similar effect with much less effort using the Unicode Collation Algorithm (UCA).
If you already have the PyICU library installed, then you can ignore the letter case entirely by tweaking your collator object:
>>> from icu import Collator, Locale
>>> collator = Collator.createInstance(Locale("fr_FR"))
>>> collator.setStrength(Collator.SECONDARY)
>>> sorted(animaux, key=collator.getSortKey)
['รlรฉphant', 'รฉlรฉphant', 'hรฉrissonne', 'poisson', 'Tortue']
By decreasing the collatorโs strength to secondary, you make it disregard the subsequent levels of character weights in UCA, including the tertiary level responsible for uppercase and lowercase differences. You can also decide whether uppercase letters should come first or the other way around by configuring the CaseFirst parameter.
Note: In the pure-Python pyuca library, lowercase letters always come before their uppercase counterparts, and you donโt have control over this behavior:
>>> from pyuca import Collator
>>> sorted(animaux, key=Collator().sort_key)
['รฉlรฉphant', 'รlรฉphant', 'hรฉrissonne', 'poisson', 'Tortue']
The weight values embedded in the Default Unicode Collation Element Table (DUCET), which the library uses under the hood, determine such ordering. Notice that this breaks the stability of the sorting algorithm, as the two elephants swapped positions in the list.
Comparing strings in a case-insensitive way is one of the most common requirements when sorting or searching for strings. But, there are many more special cases that you may need to consider, like the handling of punctuation, whitespace characters, or numbers. Libraries like PyICU let you customize the collation rules accordingly.
Next up, youโll learn about the natural sort order, which your users might expect in some scenarios.
Sort Strings With Numbers in Natural Order
Arranging strings in the natural sort order is similar to alphabetical sorting, but it takes digits that may be part of the string into account. A typical example of that would involve listing a directory full of log files whose names contain a version number:
$ ls -v
log.1 log.2 ... log.10 log.11 log.12 ... log.100 log.101 log.102 ...
While such ordering looks natural to most humans, the lexicographic sorting in Python and other programming languages produces a less-than-intuitive result for the same input sequence of strings:
>>> sorted(f"log.{i}" for i in range(1, 111))
['log.1',
'log.10',
'log.100',
'log.101',
โฎ
'log.11',
'log.110',
'log.12',
โฎ
'log.2',
'log.20',
'log.21',
โฎ
'log.99']
Because Python compares your strings by value, all digits end up treated as ordinary character code points instead of numbers. So, a string that contains "10" comes before "2", disrupting the natural ordering of numbers. To fix this, you can try using regular expressions by isolating the individual alphanumeric chunks of the string into separate elements to compare:
>>> import re
>>> def natural_order(text: str) -> tuple[str | int, ...]:
... return tuple(
... int(chunk) if chunk.isdigit() else chunk
... for chunk in re.split(r"(\d+)", text)
... )
...
>>> natural_order("log.42")
('log.', 42, '')
>>> sorted([f"log.{i}" for i in range(1, 111)], key=natural_order)
['log.1',
'log.2',
'log.3',
'log.4',
โฎ
'log.10',
'log.11',
'log.12',
โฎ
'log.100',
'log.101',
'log.102',
โฎ
'log.110']
Your sort key function, natural_order(), returns a tuple whose elements are either strings or integers. This compound key allows Python to compare the corresponding tuple elements one by one from left to right using the appropriate comparison for the data type.
Note that this is a fairly straightforward implementation that canโt handle signed numbers, floating-point numbers, or any of the edge cases that youโve learned so far. To deal with special cases like locale-aware and case-insensitive sorting, you should reach for a hassle-free library like natsort, which plays nicely with PyICU:
>>> filenames = [
... "raport_maj-2023.xlsx",
... "bลฤdy.log.3",
... "bลฤdy.log.1",
... "Raport_kwiecieล_2023.xlsx",
... "Bลฤdy.log.101",
... ]
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "pl_PL.UTF-8")
'pl_PL.UTF-8'
>>> from natsort import natsorted, ns
>>> natsorted(filenames, alg=ns.LOCALE | ns.IGNORECASE)
['bลฤdy.log.1',
'bลฤdy.log.3',
'Bลฤdy.log.101',
'Raport_kwiecieล_2023.xlsx',
'raport_maj-2023.xlsx']
Use the locale module to set the desired language and country code. When natsort detects PyICU in your virtual environment, itโll use it and only fall back to locale if it canโt find this third-party library. Just make sure to set the locale before importing natsort. Otherwise, itโll fall back to your operating systemโs default locale.
So far, youโve been sorting plain strings in Python, which is rarely the case in real-world programming projects. Youโre more likely to sort complex data structures or objects by their specific attributes. In the next section, youโll learn how to do just that.
Sort Complex Objects by More Than One Key
Ordering people by name is a classic example of sorting complex objects, which often comes up in practice. Suppose that you need to compile a list of employees in a company according to their first and last names for an annual report. You can leverage Pythonโs named tuple to represent each person:
>>> from typing import NamedTuple
>>> class Person(NamedTuple):
... first_name: str
... last_name: str
...
... def __repr__(self):
... return f"{self.first_name} {self.last_name}"
...
>>> people = [
... Person("Zbigniew", "Nowak"),
... Person("Anna", "Wรณjcik"),
... Person("ลukasz", "Kowalski"),
... Person("ลปaneta", "Jabลoลska"),
... Person("Anna", "Nowak"),
... Person("Ludmiลa", "Wiลniewska"),
... ]
>>> sorted(people)
[Anna Nowak,
Anna Wรณjcik,
Ludmiลa Wiลniewska,
Zbigniew Nowak,
ลukasz Kowalski,
ลปaneta Jabลoลska]
Because each person is a tuple consisting of the first and last names, Python knows how to compare these objects without requiring you to implement any special methods in your class. When you pass a list of Person instances to the sorted() function, each object becomes a compound key. Unfortunately, sorting still relies on the lexicographic order of Unicode strings, so you donโt get the expected result.
You can fix the situation by defining your own compound key using the pyuca collator:
>>> import pyuca
>>> collator = pyuca.Collator()
>>> def compound_key(person: Person) -> tuple:
... return (
... collator.sort_key(person.first_name),
... collator.sort_key(person.last_name),
... )
...
>>> sorted(people, key=compound_key)
[Anna Nowak,
Anna Wรณjcik,
Ludmiลa Wiลniewska,
ลukasz Kowalski,
ลปaneta Jabลoลska,
Zbigniew Nowak]
This time, the result comes out a bit better but still not quite correct, as Zbigniew is placed after ลปaneta. Do you remember why?
Itโs the same problem that you ran into earlier while using pyuca, which relies on the default collation table without letting you customize it for the specific locale. Fortunately, you can define a compound key using PyICU equally well:
>>> from icu import Collator, Locale
>>> collator = Collator.createInstance(Locale("pl_PL"))
>>> def compound_key(person: Person) -> tuple:
... return (
... collator.getSortKey(person.first_name),
... collator.getSortKey(person.last_name),
... )
...
>>> sorted(people, key=compound_key)
[Anna Nowak,
Anna Wรณjcik,
Ludmiลa Wiลniewska,
ลukasz Kowalski,
Zbigniew Nowak,
ลปaneta Jabลoลska]
Overall, this code looks very much the same but uses a slightly different API that exposes the locale customization. When you configure suitable language rules, you get the desired result.
If youโre curious, then you can experiment by swapping the attributes in your compound key to observe how that affects the sorting order:
>>> def by_last_name_first(person: Person) -> tuple:
... return (
... collator.getSortKey(person.last_name),
... collator.getSortKey(person.first_name),
... )
...
>>> sorted(people, key=by_last_name_first)
[ลปaneta Jabลoลska,
ลukasz Kowalski,
Anna Nowak,
Zbigniew Nowak,
Ludmiลa Wiลniewska,
Anna Wรณjcik]
In this case, youโve sorted your employees primarily by last name. You then sort individuals who share the same last name by first name.
Now you know how to sort complex objects using a custom compound key, which takes locale-aware Unicode collation rules into account. With that skill under your belt, you conclude this comprehensive tutorial.
Conclusion
Congratulations on successfully navigating the intricate task of sorting Unicode strings in Python! Youโre now aware of the potential hurdles that you might face in real-world projects, as well as strategies to overcome them. Youโve gained insight into the Unicode Collation Algorithm (UCA) and several Python libraries that expose it to you.
You know the difference between alphabetical and lexicographic sorting, understand what collation rules are, and can explain how the canonical decomposition in Unicode works. Youโre able to customize your sorting order to make it locale-aware, reflecting specific cultural conventions. Furthermore, you can romanize non-Latin texts through transliteration and transcription for compatibility with systems that only support ASCII.
Your newfound expertise lets you safely perform case-insensitive sorting and natural sorting within the Unicode domain. Finally, you can construct compound sort keys for ordering complex objects based on a subset of attributes. Armed with this thorough knowledge, youโre well prepared to develop reliable multilingual applications catered to a global user base.
Get Your Code: Click here to download the free sample code that shows you how to sort Unicode strings alphabetically with Python.



