Dictionaries and Sets

These notes are based on Fluent Python. Most of the content is summarized from the book, with a small portion being the author's own understanding. The notes were converted from Jupyter to Markdown; the original Jupyter notebook is in this repo.

General mapping types

The collections.abc module provides two abstract base classes, Mapping and MutableMapping. They define the formal interface for dict and other mapping-like types.

All mapping types in the standard library are implemented on top of dict. Because of this, they share a common limitation: only hashable types can be used as keys in these mappings (this requirement applies only to keys; values do not need to be hashable).

If an object is hashable, its hash value must remain constant during its lifetime, and it must implement the __hash__() method. A hashable object must also implement __eq__() so that it can be compared with other keys. If two hashable objects are equal, then their hash values must be the same.

Atomic immutable types (str, bytes, and numeric types) are all hashable. frozenset is also hashable, because by definition it can only contain hashable elements.
A tuple is hashable only if all of its elements are hashable.

In general, instances of user-defined classes are hashable by default, and their hash values are derived from the result of the id() function. This means such objects are all unequal to each other by default. If a class implements __eq__ and that method depends on the internal state of the object, then the object is hashable only if all of that internal state is immutable.

python

tt = (1, 2, (30, 40))
hash(tt)

python

tl = (1, 2, [30, 40])
hash(tl)

Raises:

text

TypeError: unhashable type: 'list'

python

tf = (1, 2, frozenset([30, 40]))
hash(tf)

Dictionaries provide multiple construction methods

python

a = dict(one=1, two=2, three=3)
b = {'one': 1, 'two': 2, 'three': 3}
c = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
d = dict({'three': 3, 'one': 1, 'two': 2})
e = dict([('two', 2), ('one', 1), ('three', 3)])
a == b == c == d == e

python

# You can also build new dicts with dict comprehensions
DIAL_CODES = [
    (86, 'China'),
    (91, 'India'),
    (1, 'United States'),
    (62, 'Indonesia'),
    (55, 'Brazil'),
    (92, 'Pakistan'),
    (880, 'Bangladesh'),
    (234, 'Nigeria'),
    (7, 'Russia'),
    (81, 'Japan'),
]


country_code = {country: code for code, country in DIAL_CODES}
country_code

python

{code: country.upper() for country, code in country_code.items() if code < 66 }

Common dictionary methods:

Method	Description
`dict.clear()`	Remove all items from the dictionary
`dict.copy()`	Return a shallow copy of the dictionary
`dict.fromkeys(seq[, val])`	Return a new dict with keys from `seq` and value `val` for all keys
`dict.get(key[, default])`	Return the value for `key` if present, otherwise return `default`
`dict.items()`	Return a new view of the dictionary's key-value pairs
`dict.keys()`	Return a new view of the dictionary's keys
`dict.pop(key[, default])`	Remove and return the value for `key` if present; otherwise return `default`
`dict.popitem()`	Remove and return a key-value pair from the dictionary (for compatibility, returns the last inserted item)
`dict.setdefault(key[, default])`	If `key` is in the dict, return its value; if not, insert `key` with `default` and return `default`
`dict.update([other])`	Update the dictionary from another mapping or iterable of key-value pairs
`dict.values()`	Return a new view of the dictionary's values

You can use setdefault to handle missing keys.

Sometimes, even when a key is not present in a mapping, it is convenient to get a default value when reading it. There are two ways to achieve this:

Use collections.defaultdict instead of a plain dict.
Define a dict subclass and implement the __missing__ method.

python

import collections


index = collections.defaultdict(list)
index['unknown word']

If a class inherits from dict and provides a __missing__ method, then when __getitem__ is called with a missing key, Python automatically calls __missing__ instead of raising KeyError.

python

class StrKeyDict0(dict):
    def __missing__(self, key):
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default

    def __contains__(self, key):
        return key in self.keys([]) or str(key) in self.keys([])

Variants of dict

collections.OrderedDict: preserves insertion order, so key iteration order is always predictable.
collections.ChainMap: groups multiple mapping objects so that key lookups search them one by one as if they were a single mapping.
collections.Counter: a mapping that stores integer counts for keys. Each update increments the counter. This is useful for counting hashable objects and is also known as a bag or multiset in other languages.
collections.UserDict: a pure-Python implementation that wraps a real dict. It inherits from collections.MutableMapping, and is designed as a convenient base class for writing custom mapping types.

For creating custom mapping types, it is usually more convenient to subclass UserDict than dict. A key reason is that the built-in dict sometimes takes internal shortcuts in its method implementations, which can force you to override more methods than expected in subclasses. UserDict avoids these problems.

python

# This mapping converts all keys to strings for add/update/lookup operations
import collections


class StrKeyDict(collections.UserDict):
    def __missing__(self, key):
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]

    def __contains__(self, key):
        return str(key) in self.data

    def __setitem__(self, key, item):
        self.data[str(key)] = item

Because UserDict inherits from MutableMapping, all remaining mapping methods of StrKeyDict are inherited from UserDict, MutableMapping, and Mapping. The ultimate base class Mapping is an abstract base class (ABC), but it still provides several useful concrete methods.

Two such methods are particularly worth noting:

MutableMapping.update: This method can be used directly. It is also used inside __init__ so that the constructor can accept various forms of input (other mappings, iterables of (key, value) pairs, and keyword arguments). Internally it uses self[key] = value to add new items, which in turn invokes the custom __setitem__ method.
This is one reason to subclass UserDict instead of dict: dict might bypass overridden methods in some internal operations.
Mapping.get: Called inside __getitem__ to provide a default value for missing keys. Internally it is implemented roughly as self[key] if key in self else default, which again relies on the custom __getitem__ implementation.

Immutable mapping types

Starting from Python 3.3, the types module provides MappingProxyType, which creates a read-only view of a mapping. If you pass a mapping to MappingProxyType, it returns a dynamic, read-only view of that mapping. "Dynamic" means that changes to the original mapping are visible through the view, but the view itself cannot be used to modify the underlying data.

python

from types import MappingProxyType


d = {1: 'A'}
d_proxy = MappingProxyType(d)
d_proxy

python

d_proxy[1]

python

d_proxy[2] = 'X'  # raises TypeError

python

# But you can still modify the original mapping
d[2] = 'B'
d_proxy

Set theory

Conceptually, a set is a collection of unique objects. Because of this, sets are often used for deduplication.

Set elements must be hashable. The set type itself is not hashable, but frozenset is. Therefore, you can create a set containing multiple distinct frozenset objects.

python

l = ['spam', 'spam', 'eggs', 'spam']
set(l)

python

list(set(l))

Besides uniqueness, sets also implement many basic infix operators. Given two sets a and b:

a | b returns their union,
a & b returns their intersection,
a - b returns the difference.

Except for the empty set, set literals such as {1}, {1, 2}, etc. look exactly like their mathematical form. The empty set must be written as set().

Literals such as {1, 2, 3} are faster and more readable than set([1, 2, 3]). The latter is slower because Python must first look up the set constructor by name, then build a list, and finally pass that list to the constructor.

python

# Use dis.dis (the disassembler) to compare bytecode of the two approaches
from dis import dis


dis('{1}')

python

dis('set([1])')

Set comprehensions:

python

from unicodedata import name
{chr(i) for i in range(32, 256) if 'SIGN' in name(chr(i), '')}

Common set operations:

Operation	Description
`set1.union(set2)`	Return a new set with elements from both `set1` and `set2`
`set1.intersection(set2)`	Return a new set with elements common to `set1` and `set2`
`set1.difference(set2)`	Return a new set with elements in `set1` but not in `set2`
`set1.symmetric_difference(set2)`	Return a new set with elements in either set but not both
`set1.issubset(set2)`	Return `True` if `set1` is a subset of `set2`, otherwise `False`
`set1.issuperset(set2)`	Return `True` if `set1` is a superset of `set2`, otherwise `False`
`set1.isdisjoint(set2)`	Return `True` if `set1` and `set2` have no elements in common, otherwise `False`
`set1.add(elem)`	Add element `elem` to `set1`
`set1.remove(elem)`	Remove `elem` from `set1`; raise an error if `elem` is not present
`set1.discard(elem)`	Remove `elem` from `set1` if present; do nothing if not present
`set1.pop()`	Remove and return an arbitrary element from `set1`; raise an error if the set is empty
`set1.clear()`	Remove all elements from `set1`

Under the hood of dict and set

A benchmark of using the in operator to search for 1000 elements in dictionaries of five different sizes:

On that particular machine, searching for 1000 floats among 1000 dictionary keys took about 0.000202 seconds. Running the same search on a dictionary with 10,000,000 elements took only about 0.000337 seconds.

Adding lists into the benchmark for comparison:

Internal structure of a hash table: A hash table is essentially a sparse array (an array with empty slots). In standard data structure textbooks, each slot in a hash table is usually called a bucket. In Python's dict hash table, each key-value pair occupies one bucket. Each bucket holds a reference to the key and a reference to the value. Because all buckets have the same size, they can be accessed by an offset.

Python tries to keep roughly one-third of the buckets empty. When the table becomes too full, it is resized and all entries are copied into a larger table.

To put an object into a hash table, Python first computes the hash value of the key. In Python, this is done with the hash() function.

Hash values and equality

The built-in hash() function works for all built-in types. For instances of user-defined classes, hash() calls their __hash__ method. If two objects compare equal, their hash values must also be equal; otherwise hash tables will not function correctly. For example, since 1 == 1.0 is True, it must also be the case that hash(1) == hash(1.0), even though the internal representations of an integer and a float are completely different.

To serve as hash table indexes, hash values should be well-distributed over the index space. Ideally, the more similar two unequal objects are, the more different their hash values should be.

Starting from Python 3.3, the hash values of str, bytes, and datetime objects are salted with a random value. The salt is a process-wide constant, but a new random salt is generated every time the Python interpreter starts. This is a security measure to mitigate certain DoS attacks.

Hash table algorithm

To retrieve my_dict[search_key], Python first computes hash(search_key), then uses the lower bits of the hash value as an offset into the hash table (the number of bits used depends on the table size).
If the bucket at that offset is empty, Python raises KeyError.
If the bucket is not empty, it contains a (found_key, found_value) pair. Python then checks whether search_key == found_key. If so, it returns found_value.
If search_key and found_key are not equal, this is a hash collision. Python then probes other locations in the table according to its collision resolution strategy until it finds either the key or an empty bucket.
When inserting a new item, Python may decide to resize the hash table if it becomes too full. Resizing increases the table size, and therefore the number of bits used from the hash value, which helps reduce the likelihood of collisions.

Although this algorithm may look expensive, in practice-even with millions of items-most lookups experience no collisions, and on average there may be just one or two collisions per lookup. Even in unlucky cases the number of collisions is usually very small.

Strengths and limitations of dict/set

1. Keys must be hashable

A hashable object must satisfy all of the following:

Support the hash() function and provide an unchanging hash value over its lifetime.
Support equality comparison via __eq__().
If a == b is True, then hash(a) == hash(b) must also be True.

All user-defined objects are hashable by default because their hash values are derived from id(), and by default they are all unequal.

2. Dictionaries have high memory overhead

Because dictionaries use hash tables, and hash tables must be sparse, dictionaries are not memory-efficient.

If you need to store a very large number of records, it is often better to use a list of tuples or namedtuple instances, rather than a list of dictionaries in a JSON-like style. Tuples save space for two reasons: you avoid the hash table overhead, and you do not need to store field names with every record.

3. Key lookup is fast

dict is a classic space-for-time trade-off: dictionaries use more memory, but they provide very fast access times that are almost independent of the number of stored items-as long as the dictionary fits in memory.

4. Key order depends on insertion order

When inserting new keys, if a hash collision occurs, Python might store the new key in a different location than where it would naturally go. As a result, two dictionaries created from lists of key-value pairs with different insertion orders, such as dict([(key1, value1), (key2, value2)]) and dict([(key2, value2), (key1, value1)]), will compare equal, but the internal order of their keys may differ if collisions occurred during insertion.

In contrast, consider PHP. The PHP manual describes arrays as follows: "An array in PHP is actually an ordered map. A map is a type that associates values to keys. This type is optimized for several different uses; it can be treated as an array, list (vector), hash table (map implementation), dictionary, collection, stack, queue, and more."

It is hard to imagine the complexity and overhead of implementing something that mixes lists and ordered dictionaries in this way.

PHP and Ruby borrowed their hash literal syntax from Perl, using => between keys and values. JavaScript borrowed its syntax from Python, using :. JSON, derived from JavaScript, is essentially a subset of Python's literal syntax. Other than the spelling differences in the literals true, false, and null, JSON is fully compatible with Python's syntax.

Dictionaries and Sets ​

General mapping types ​

Variants of dict ​

Immutable mapping types ​

Set theory ​

Under the hood of dict and set ​

Hash values and equality ​

Hash table algorithm ​

Strengths and limitations of dict/set ​

Dictionaries and Sets

General mapping types

Variants of dict

Immutable mapping types

Set theory

Under the hood of dict and set

Hash values and equality

Hash table algorithm

Strengths and limitations of dict/set