# Text and Byte Sequences
These notes are based on *Fluent Python*. Most content is summarized from the book, with some original commentary. Converted from Jupyter to Markdown; the original notebook is in this repo.
## Character Issues
The Unicode standard clearly distinguishes between a character's identity (its code point) and its byte representation:
- Code points: numbers from 0 to 1,114,111 (decimal), shown as `U+` followed by 4–6 hex digits. Examples: `A` is `U+0041`, the euro sign is `U+20AC`, the treble clef is `U+1D11E`.
- Encoding: an algorithm that converts between code points and byte sequences. `A` (`U+0041`) is the single byte `\x41` in UTF-8 but `\x41\x00` in UTF-16LE; the euro sign `€` (`U+20AC`) is `\xe2\x82\xac` in UTF-8 but `\xac\x20` in UTF-16LE.
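The encoding differences described above can be checked directly in Python (a quick sketch; `utf_16_le` is the little-endian, no-BOM variant of UTF-16):

```python
# Same code points, different byte sequences depending on the encoding
print('A'.encode('utf_8'))           # b'A'          (1 byte)
print('A'.encode('utf_16_le'))       # b'A\x00'      (2 bytes)
print('\u20ac'.encode('utf_8'))      # b'\xe2\x82\xac'  (euro sign, 3 bytes)
print('\u20ac'.encode('utf_16_le'))  # bytes 0xAC 0x20  (2 bytes)
```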
```python
s = 'café'             # 4 Unicode characters
len(s)                 # 4
b = s.encode('utf_8')  # UTF-8 encode str → bytes
b                      # b'caf\xc3\xa9' ('é' is encoded as two bytes)
len(b)                 # 5
b.decode('utf_8')      # UTF-8 decode bytes → str: 'café'
```

## Bytes Overview
Python 3 has two binary sequence types:
- bytes (immutable)
- bytearray (mutable)
Elements are integers 0–255, not characters. Indexing returns an `int`; slicing always returns a sequence of the same type, even for length-1 slices.
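A quick sketch of the immutable/mutable distinction between the two types (variable names here are illustrative):

```python
data = bytes([99, 97, 102, 101])  # b'cafe' — immutable
arr = bytearray(data)             # mutable copy

arr[-1] = 0xE9       # assigning an int in 0-255 works on a bytearray
print(arr)           # bytearray(b'caf\xe9')

try:
    data[-1] = 0xE9  # the same assignment on bytes raises TypeError
except TypeError as e:
    print('bytes is immutable:', e)
```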
```python
cafe = bytes('café', encoding='utf_8')
cafe           # b'caf\xc3\xa9'
cafe[0]        # 99 (an int)
cafe[:1]       # b'c' (bytes of length 1)
cafe_arr = bytearray(cafe)
cafe_arr[-1:]  # bytearray(b'\xa9')
```

## Basic Codecs
Python has many codecs (e.g. `'utf_8'`, `'latin_1'`), used by `open()`, `str.encode()`, `bytes.decode()`, and similar functions.
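The same codec names work across the text I/O APIs mentioned above. A sketch of the round trip through `open()` (the file path here is illustrative, created in a temp directory):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'nino.txt')

# Write text with an explicit codec, then read the raw bytes back
with open(path, 'w', encoding='utf_8') as f:
    f.write('El Niño')

with open(path, 'rb') as f:   # binary mode: no decoding
    raw = f.read()

print(raw)                    # b'El Ni\xc3\xb1o'
print(raw.decode('utf_8'))    # 'El Niño'
os.remove(path)
```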
```python
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')
```

## Unicode Normalization
Because of combining characters, the same text can be represented by different code point sequences.
```python
s1 = 'café'        # composed: 'é' is a single code point, U+00E9
s2 = 'cafe\u0301'  # decomposed: 'e' followed by U+0301 COMBINING ACUTE ACCENT
```

Use `unicodedata.normalize('NFC', s)` (or `'NFD'`, `'NFKC'`, `'NFKD'`) on both strings before comparison.
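A quick sketch of normalization at work (the strings are redefined here so the block is self-contained):

```python
from unicodedata import normalize

s1 = 'café'        # composed form: 4 code points
s2 = 'cafe\u0301'  # decomposed form: 5 code points
print(s1 == s2)    # False, even though both render as 'café'

# Normalize both sides to the same form, then compare
print(normalize('NFC', s1) == normalize('NFC', s2))  # True
print(len(normalize('NFC', s2)))  # 4 — NFC composes 'e' + accent into 'é'
print(len(normalize('NFD', s1)))  # 5 — NFD decomposes 'é'
```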
## Dual-Mode APIs
Some stdlib APIs accept both `str` and `bytes` (e.g. the `re` and `os` modules) and change behavior based on the argument type.
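The type-dependent behavior can be sketched with `re`: the pattern and subject must both be `str` or both be `bytes`, and `str` patterns are Unicode-aware while `bytes` patterns are ASCII-only.

```python
import re

# str pattern on str → str results; bytes pattern on bytes → bytes results
print(re.findall(r'\d+', 'room 42'))    # ['42']
print(re.findall(rb'\d+', b'room 42'))  # [b'42']

# In a str pattern, \d matches any Unicode decimal digit,
# e.g. Tamil digits (U+0BE7, U+0BE8); a bytes \d would not
print(re.findall(r'\d+', '\u0be7\u0be8'))  # ['௧௨']
```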