Python 内置数据结构深度解析

list、dict、set、tuple 不只是数据容器。搞懂它们的底层实现和时间复杂度，才能写出高性能 Python。

list：动态数组

底层实现

Python 的 list 是一个动态数组（dynamic array），底层是一块连续的内存，存储的是对象引用（指针），而非对象本身。

内存预分配策略：为了避免每次 append 都重新分配内存，list 会预留额外空间：

import sys

lst = []
sizes = []
for i in range(20):
    sizes.append(sys.getsizeof(lst))
    lst.append(i)

print(sizes)
# 增长模式约为：0, 4, 8, 16, 25, 35, 46, ...

时间复杂度

操作	复杂度	说明
`append(x)`	O(1) 均摊	偶尔触发扩容 O(n)
`pop()`	O(1)	尾部删除
`insert(0, x)`	O(n)	要移动所有元素
`pop(0)`	O(n)	要移动所有元素
`list[i]`	O(1)	随机访问
`x in list`	O(n)	线性扫描
`sort()`	O(n log n)	Timsort

# 错误示范：用 list 模拟队列（O(n) 出队）
queue = []
queue.append(1)
queue.append(2)
queue.pop(0)    # O(n)！

# 正确做法：用 deque
from collections import deque
q = deque()
q.append(1)
q.append(2)
q.popleft()     # O(1)

bisect：有序插入

import bisect

sorted_list = [1, 3, 5, 7, 9]
bisect.insort(sorted_list, 4)    # [1, 3, 4, 5, 7, 9]，O(log n) 查找位置

bisect.bisect_left(sorted_list, 5)   # 2（插入 5 的左边位置）
bisect.bisect_right(sorted_list, 5)  # 4（插入 5 的右边位置）

dict：哈希表

底层实现

CPython 3.6+ 的 dict 使用了紧凑哈希表，3.7+ 开始正式保证插入顺序。

实现原理：

维护一个索引数组（稀疏）和一个按插入顺序排列的条目数组（紧凑）
通过 hash(key) 计算哈希值，映射到索引数组的槽位
哈希冲突用开放地址法（linear probing + perturbation）解决

# hash 冲突示例
print(hash("a"))       # 某个整数
print(hash(1))         # 1
print(hash(1.0))       # 1（1 == 1.0，所以 hash 相等）

时间复杂度

操作	平均	最坏（哈希冲突严重）
`d[k]`	O(1)	O(n)
`d[k] = v`	O(1)	O(n)
`del d[k]`	O(1)	O(n)
`k in d`	O(1)	O(n)
`len(d)`	O(1)	-

dict 进阶用法

from collections import defaultdict, Counter, OrderedDict, ChainMap

# defaultdict：访问不存在的 key 时自动初始化
word_count = defaultdict(int)
for word in "hello world hello".split():
    word_count[word] += 1
# {'hello': 2, 'world': 1}

# 构建邻接表
graph = defaultdict(list)
for u, v in edges:
    graph[u].append(v)

# Counter：计数器，dict 的子类
c = Counter("abracadabra")
# Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
c.most_common(3)         # [('a', 5), ('b', 2), ('r', 2)]
c1 + c2                  # 合并计数
c1 - c2                  # 差集计数

# ChainMap：多个 dict 的逻辑合并（不拷贝）
defaults = {"color": "red", "size": 10}
overrides = {"color": "blue"}
config = ChainMap(overrides, defaults)
config["color"]    # "blue"（先查 overrides）
config["size"]     # 10（overrides 没有，查 defaults）

set / frozenset：哈希集合

底层实现

set 本质是一个只有 key 没有 value 的哈希表，内存消耗比 dict 少。

s1 = {1, 2, 3, 4}
s2 = {3, 4, 5, 6}

s1 & s2   # 交集：{3, 4}
s1 | s2   # 并集：{1,2,3,4,5,6}
s1 - s2   # 差集：{1, 2}
s1 ^ s2   # 对称差：{1,2,5,6}

# 成员检测对比
import time

big_list = list(range(1000000))
big_set = set(big_list)

# list: O(n)
start = time.time()
999999 in big_list
print(f"list: {time.time()-start:.6f}s")   # ~0.01s

# set: O(1)
start = time.time()
999999 in big_set
print(f"set: {time.time()-start:.6f}s")    # ~0.000001s

frozenset

# 不可变，可作 dict key 或 set 的元素
fs = frozenset([1, 2, 3])

# 应用：统计不重复的边（无向图）
edges = frozenset({frozenset({1,2}), frozenset({2,1})})
len(edges)   # 1，因为 {1,2} == {2,1}

tuple：不可变序列

# tuple 比 list 内存更小
import sys
lst = [1, 2, 3, 4, 5]
tpl = (1, 2, 3, 4, 5)
print(sys.getsizeof(lst))  # 104 bytes
print(sys.getsizeof(tpl))  # 80 bytes

# tuple 创建更快
import timeit
timeit.timeit("[1, 2, 3, 4, 5]", number=1000000)  # ~0.08s
timeit.timeit("(1, 2, 3, 4, 5)", number=1000000)  # ~0.02s

# tuple 可作 dict key
locations = {
    (0, 0): "origin",
    (1, 0): "right",
    (0, 1): "up",
}

# 解包
a, b, c = (1, 2, 3)
first, *rest = (1, 2, 3, 4, 5)

collections 模块深度使用

deque：双端队列

from collections import deque

dq = deque([1, 2, 3], maxlen=5)  # maxlen 限制大小，满了自动丢弃最旧的

dq.append(4)        # 右端添加 O(1)
dq.appendleft(0)    # 左端添加 O(1)
dq.pop()            # 右端删除 O(1)
dq.popleft()        # 左端删除 O(1)
dq.rotate(1)        # 循环右移
dq.rotate(-1)       # 循环左移

# 滑动窗口最大值
def sliding_window_max(nums, k):
    dq = deque()    # 存下标，单调递减
    result = []
    for i, x in enumerate(nums):
        while dq and nums[dq[-1]] < x:
            dq.pop()
        dq.append(i)
        if dq[0] <= i - k:
            dq.popleft()
        if i >= k - 1:
            result.append(nums[dq[0]])
    return result

heapq：堆（优先队列）

import heapq

heap = [3, 1, 4, 1, 5, 9, 2, 6]
heapq.heapify(heap)           # O(n) 原地建堆

heapq.heappush(heap, 0)       # O(log n) 插入
smallest = heapq.heappop(heap)  # O(log n) 取最小值

# Top-K 问题
nums = [3, 1, 4, 1, 5, 9, 2, 6]
heapq.nlargest(3, nums)    # [9, 6, 5]
heapq.nsmallest(3, nums)   # [1, 1, 2]

# 最大堆：取反
heap = [-x for x in nums]
heapq.heapify(heap)
max_val = -heapq.heappop(heap)

时间复杂度汇总对比

操作	list	dict	set	tuple
访问元素 `[i]`	O(1)	O(1)	不支持	O(1)
成员检测 `in`	O(n)	O(1)	O(1)	O(n)
插入（尾部/任意）	O(1)/O(n)	O(1)	O(1)	不可变
删除（尾部/任意）	O(1)/O(n)	O(1)	O(1)	不可变
排序	O(n log n)	不支持	不支持	不支持
长度 `len`	O(1)	O(1)	O(1)	O(1)

内存占用对比

import sys
from collections import namedtuple

# list of tuples
data_tuples = [(i, i*2, f"item_{i}") for i in range(1000)]

# list of dicts
data_dicts = [{"x": i, "y": i*2, "name": f"item_{i}"} for i in range(1000)]

# list of namedtuples
Item = namedtuple("Item", ["x", "y", "name"])
data_named = [Item(i, i*2, f"item_{i}") for i in range(1000)]

print(f"tuples: {sys.getsizeof(data_tuples[0])} bytes each")  # ~72
print(f"dicts:  {sys.getsizeof(data_dicts[0])} bytes each")   # ~232
print(f"named:  {sys.getsizeof(data_named[0])} bytes each")   # ~72

# 结论：存大量结构化数据时，tuple/namedtuple 比 dict 节省约 3x 内存

实际场景选型建议

场景	推荐结构	原因
有序序列，频繁尾部增删	`list`	均摊 O(1)
有序序列，频繁两端增删	`deque`	两端 O(1)
有序序列，需要二分查找	排序 `list` + `bisect`	O(log n) 查找
键值映射	`dict`	O(1) 查找
计数	`Counter`	语义清晰
快速成员检测	`set`	O(1) in
不可变数据/dict key	`tuple/frozenset`	可哈希
优先队列/最值	`heapq`	O(log n)
结构化数据（只读）	`namedtuple/dataclass`	语义+内存