This post is a shorter version of my previous post Unicode for Curious Developers Loving Code ๐. My motivation is to persuade you that you must learn more about Unicode and that you should read my other post ๐.
Modern programs must handle UnicodeโPython has excellent support for Unicode, and will keep getting better.
โ โGuido van Rossum, Inventor of Python
Unicode is omnipresent. Your operating system, your programming language, and your everyday applications all support Unicode so that you can use your own language in a multilingual environment like the internet. All this is possible because developers before you learned about Unicode. What about you? Letโs test your comprehension using a few puzzlers.
Puzzler 1: Normalization
What is the output of this program?
Answer
The answer is False
. Thatโs right. You read correctly. Try to copy/paste this snippet in your terminal using the Python interpreter. You can even convert this line to your favorite (modern) programming language and get the same result. But why?
Explanation
Unicode defines more than 100,000 characters (and has room for more than one million more!). Unicode characters are code points, not glyphs. This means that Unicode declares that U+0041 is A, U+03B1 is ฮฑ, etc. Unicode doesnโt care how characters look. Your operating system is often responsible for displaying these characters on screen using glyphs defined in font files.
So, when you are reading the previous example, you are viewing the rendered representation of this Python program (also made of Unicode characters). You see glyphs, not code points. Right?
The program can be made more explicit like this:
Both lines return False
again, but the result now seems a lot more obvious. Unicode defines several characters (or sequences of characters) resulting in the same glyph. For example, both Unicode text U+00E0 and U+0061 U+0300 represent the latin letter ร
. These representations are canonically equivalent in Unicode, which means they must be considered identical. Why not just avoid duplicates instead?
Unicode encourages the use of combining characters to define accentuated letters. For example, the combining character U+0300 COMBINING GRAVE ACCENT
modifies the base character U+0061 LATIN SMALL LETTER A
. But Unicode was introduced after hundreds of incomplete character sets like ASCII. Therefore, texts created before Unicode would have to be converted, and Unicode was designed to make this conversion as simple as possible. If a character (e.g., ร
) was defined by an existing character set (e.g., Windows-1252), Unicode included this character to support a one-to-one mapping with this character set. But why Python considers the two strings as different?
Python, like almost all programming languages, considers two Unicode strings to be equals if they are composed of the same sequence of code points. To solve this problem, Unicode defines several text normalization algorithms that replace equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points. Unicode normalization is often implemented by programming languages in special modules:
Normalization is not applied systematically. But you should probably use it when sorting or indexing texts. (Imagine if Google didnโt normalize texts in its index, you would miss many results when searching for accentuated texts.)
Puzzler 2: Encoding
Unicode is also famous for its emojis. Consider the following program in Go. What is the output?
Answer
The answer is 3/4
(not 1/1
). Thatโs right. You read correctly. You can try this example in different languages and get different results (Python 3 outputs 1/1
, Java outputs 2/2
). Why?
Explanation
One million of code points is a lot! Imagine that you are creating a new programming language and must define a string data type to store a sequence of Unicode code points. Computers only work with 0s and 1s, and this sequence of Unicode code points must be converted to bytes. Each code point must logically be stored using 32 bits (e.g., 1,000,00010 = 111101000010010000002), which means a naive implementation would use a variable of type []int32
to store a Unicode text. Right?
The fact is almost all characters in everyday use are stored in the first 65,536 code points (requiring only 16 bits). In addition, the vast majority of characters in Latin languages are still the famous ASCII characters which are the first 128 Unicode characters (required only 8 bits).
For example, saving the following Python program in a file encoded in UTF-32 (Note: UTF-32 is one of the encodings defined by Unicode and outputs each code point using 32 bits):
Produces the following bytes:
Thatโs a lot of 00
. This is not optimal. Imagine if Go was using this representation internally and your application works mainly with strings. We can do better. How?
Programming languages adopt different solutions to represent strings, which explains why we have different results. For example, Go stores string literals as a []byte
containing the UTF-8 encoding (UTF-8 requires 1 byte for ASCII characters, 3 bytes for most non-ASCII characters, and 4 bytes for most emojis and rare characters). The function len
in Go simply returns the number of bytes in the UTF-8 representation of a string. For example, the U+270B RAISED HAND
(โ) requires 3 bytes in UTF-8 and U+1F91A RAISED BACK OF HAND EMOJI
(๐ค) requires 4 bytes in UTF-8 since characters are not stored in the same block inside the vast Unicode table. So, yes, rotating your hand has its importance when working with Unicode ๐.
To illustrate the gain of using UTF-8, here is the same file stored using this encoding:
You now understand why we commonly save our files in UTF-8, and why this encoding is the default on most systems.
Other languages like Java use UTF-16 encoding for their internal string data type representation (UTF-16 uses 2 bytes for the first 65,536 characters and 4 bytes for the remaining ones). Python uses a similar approach, but the implementation does not expose these details to the developer. In short, you must understand how your programming language works.
Puzzler 3: Emojis
Here is another program using flags. What is the output?
Answer
The program outputs True
and True
. Thatโs weird. Why would different flags be considered equal? It just doesnโt make sense. Or maybe it is.
Explanation
We have already discussed how accentuated letters can be formed using combining characters like U+0300 COMBINING GRAVE ACCENT
. To understand this puzzler, you need to know that some Emojis are also defined by combining characters. For example, Unicode defines a series of Regional Indicator Symbol
for every letter A-Z. Emoji country flags combine two regional symbols corresponding to the two-letter country code defined by the ISO 3166-1 standard. For example,
- ๐ซ๐ท (France,
FR
) is defined by the sequence U+1F1EBRegional Indicator Symbol Letter F
+ U+1F1F7Regional Indicator Symbol Letter R
. - ๐ซ๐ฎ (Finland,
FI
) is defined by the sequence U+1F1EBRegional Indicator Symbol Letter F
+ U+1F1EERegional Indicator Symbol Letter I
. - ๐ง๐ท(Brazil,
BR
) is defined by the sequence U+1F1E7Regional Indicator Symbol Letter B
+ U+1F1F7Regional Indicator Symbol Letter R
.
In Python, string indexing returns the i-nth code point in the Unicode sequence. So "๐ซ๐ท"[0]
returns the Regional Indicator Symbol Letter F
and "๐ซ๐ท"[1]
returns the Regional Indicator Symbol Letter R
. This explains the output (FR=FI and FR=BR).
Combining characters are also used by skin tones. Unicode defines a code point for every color defined by the Fitzpatrick scale: U+1F3FF Dark skin tone
, U+1F3FE Medium Dark skin tone
, U+1F3FD Medium skin tone
, U+1F3FC Medium Light skin tone
, and U+1F3FB Light skin tone
. For example:
Comparing Unicode texts containing the same emoji using different skin tones is tricky:
Unicode normalization that we covered in Puzzler 1 doesnโt help:
As there is no current support in standard libraries, the most obvious solution is to ignore skin tones completely and compare only the base Unicode character:
Puzzler 4: Regex
What is the output of the following program?
Answer
The answer is 1
. The regex only found one match (10 mAh
). Why?
Explanation
The metacharacter \w
matches a single word character defined by the expression [a-zA-Z_0-9]
. It works great with ASCII characters like m
but not with Unicode letters like ยต
.
In addition to assigning a unique code point to every single character in use by any language, Unicode also provides a database defining a list of properties for every character. One of these properties is General_Category
(Lu
for uppercase letter, Nd
for decimal number, etc.). Programming languages import this database in their code to implement common functions like isUpper()
, toLowerCase()
, isLetter()
, and also to extend the behavior of their regular-expression engine.
Java supports other classes like \p{Lu}
to match an uppercase letter or just \p{L}
to match any Unicode letter:
- 1
- We replaced the ASCII-only class
\w
with the Unicode-compatible class\p{L}
.
This article introduced some pitfalls when working with Unicode. There is so much more to cover.
If I succeeded in arousing your curiosity, I recommend you to read Unicode for Curious Developers Loving Code ๐. It will take you less than one hour (thatโs a lot for a blog post, I know) but compared to the time spent debugging an issue, one hour is a small price to pay to understand what you are doing. Learning always pays off.
Puzzler 5: Bonus
What is the output of this program?
Answer
The anwser is True
. Thatโs right. You read correctly, again. Based on what we covered in this article, you may be able to found the explanation.
Explanation
Python accepts non-ASCII characters for identifiers like variable names but normalizes them using the NFKC algorithm (one of the four normalization algorithms defined by the Unicode Standard). For example:
Both Unicode characters โ
and H
normalize to the same character. It means both identifiers represent the same variable, which means that when we are updating one of them, we are updating the same, unique variable. But why normalize identifiers?
It may seem wrong, but normalizing identifiers is a great idea. Unicode contains a lot of characters, and some different characters are represented visually using very similar glyphs. Compare with this more subtle example (this example can be more or less relevant depending on the fonts available on your system):
Note that Python does not accept any Unicode character in identifiers:
Only characters belonging to specific categories such as Lu
(uppercase letters) or Ll
(lowercase letters) are accepted. Emojis could therefore not be used in Python variable names, but some languages like Haskell arenโt that restrictive.