# Text and Byte Sequences
These notes are based on *Fluent Python*. Most content is summarized from the book, with some original commentary. Converted from Jupyter to Markdown; the original notebook is in this repo.
## Character Issues
The Unicode standard clearly distinguishes between a character's identity (its code point) and its byte representation:
- Code points: numbers from 0 to 1,114,111 (decimal), shown as `U+` followed by 4–6 hex digits. Examples: `A` is `U+0041`, the euro sign is `U+20AC`, the treble clef is `U+1D11E`.
- Encoding: an algorithm that converts between code points and byte sequences. `A` (`U+0041`) is the single byte `\x41` in UTF-8 but `\x41\x00` in UTF-16LE; the euro sign `€` (`U+20AC`) is `\xe2\x82\xac` in UTF-8 but `\xac\x20` in UTF-16LE.
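The encoding differences described above can be checked directly in Python (a quick sketch; `utf_16_le` is the little-endian, no-BOM variant of UTF-16):

```python
# Same code points, different byte sequences depending on the encoding
print('A'.encode('utf_8'))           # b'A'          (1 byte)
print('A'.encode('utf_16_le'))       # b'A\x00'      (2 bytes)
print('\u20ac'.encode('utf_8'))      # b'\xe2\x82\xac'  (euro sign, 3 bytes)
print('\u20ac'.encode('utf_16_le'))  # bytes 0xAC 0x20  (2 bytes)
```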
```python
s = 'café'             # 4 Unicode characters
len(s)                 # 4
b = s.encode('utf_8')  # UTF-8 encode str → bytes
b                      # b'caf\xc3\xa9' ('é' is encoded as two bytes)
len(b)                 # 5
b.decode('utf_8')      # UTF-8 decode bytes → str: 'café'
```

## Bytes Overview
Python 3 has two binary sequence types:
- bytes (immutable)
- bytearray (mutable)
Elements are integers 0–255, not characters. Indexing returns an `int`; slicing always returns a sequence of the same type, even for length-1 slices.
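A quick sketch of the immutable/mutable distinction between the two types (variable names here are illustrative):

```python
data = bytes([99, 97, 102, 101])  # b'cafe' — immutable
arr = bytearray(data)             # mutable copy

arr[-1] = 0xE9       # assigning an int in 0-255 works on a bytearray
print(arr)           # bytearray(b'caf\xe9')

try:
    data[-1] = 0xE9  # the same assignment on bytes raises TypeError
except TypeError as e:
    print('bytes is immutable:', e)
```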
```python
cafe = bytes('café', encoding='utf_8')
cafe           # b'caf\xc3\xa9'
cafe[0]        # 99 (an int)
cafe[:1]       # b'c' (bytes of length 1)
cafe_arr = bytearray(cafe)
cafe_arr[-1:]  # bytearray(b'\xa9')
```

## Basic Codecs
Python has many codecs (e.g. `'utf_8'`, `'latin_1'`), used by `open()`, `str.encode()`, `bytes.decode()`, and similar functions.
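The same codec names work across the text I/O APIs mentioned above. A sketch of the round trip through `open()` (the file path here is illustrative, created in a temp directory):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'nino.txt')

# Write text with an explicit codec, then read the raw bytes back
with open(path, 'w', encoding='utf_8') as f:
    f.write('El Niño')

with open(path, 'rb') as f:   # binary mode: no decoding
    raw = f.read()

print(raw)                    # b'El Ni\xc3\xb1o'
print(raw.decode('utf_8'))    # 'El Niño'
os.remove(path)
```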
```python
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')
```

## Unicode Normalization
Because of combining characters, the same text can be represented by different code point sequences.
```python
s1 = 'café'        # composed: 'é' is a single code point, U+00E9
s2 = 'cafe\u0301'  # decomposed: 'e' followed by U+0301 COMBINING ACUTE ACCENT
```

Use `unicodedata.normalize('NFC', s)` (or `'NFD'`, `'NFKC'`, `'NFKD'`) on both strings before comparison.
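A quick sketch of normalization at work (the strings are redefined here so the block is self-contained):

```python
from unicodedata import normalize

s1 = 'café'        # composed form: 4 code points
s2 = 'cafe\u0301'  # decomposed form: 5 code points
print(s1 == s2)    # False, even though both render as 'café'

# Normalize both sides to the same form, then compare
print(normalize('NFC', s1) == normalize('NFC', s2))  # True
print(len(normalize('NFC', s2)))  # 4 — NFC composes 'e' + accent into 'é'
print(len(normalize('NFD', s1)))  # 5 — NFD decomposes 'é'
```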
## Dual-Mode APIs
Some stdlib APIs accept both `str` and `bytes` (e.g. the `re` and `os` modules) and change behavior based on the argument type.
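The type-dependent behavior can be sketched with `re`: the pattern and subject must both be `str` or both be `bytes`, and `str` patterns are Unicode-aware while `bytes` patterns are ASCII-only.

```python
import re

# str pattern on str → str results; bytes pattern on bytes → bytes results
print(re.findall(r'\d+', 'room 42'))    # ['42']
print(re.findall(rb'\d+', b'room 42'))  # [b'42']

# In a str pattern, \d matches any Unicode decimal digit,
# e.g. Tamil digits (U+0BE7, U+0BE8); a bytes \d would not
print(re.findall(r'\d+', '\u0be7\u0be8'))  # ['௧௨']
```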