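Take Control of Your Python Code: Speed It Up by 500%

Python is often criticized for being slower than languages such as C, C++, or Rust, but with the right techniques from Python's extensive standard library you can significantly improve the performance of your Python code.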

1. Slots

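Python's flexibility often comes at a performance cost, especially in memory usage. By default, Python stores each instance's attributes in a dictionary, which can be inefficient.

Using __slots__ can reduce memory usage and improve performance.

Here is a basic class that stores its attributes in a dictionary: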

from pympler import asizeof

class person:

    def __init__(self, name, age):
        self.name = name
        self.age = age

unoptimized_instance = person("Harry", 20)
print(f"UnOptimized memory instance: {asizeof.asizeof(unoptimized_instance)} bytes")

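In this example we create an unoptimized instance and can see that it takes up 520 bytes of memory (the exact figure varies with the Python version), which is a lot for a single object compared with other languages.

Now we optimize the class with the __slots__ class variable: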

from pympler import asizeof

class person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

unoptimized_instance = person("Harry", 20)
print(f"UnOptimized memory instance: {asizeof.asizeof(unoptimized_instance)} bytes")

class Slotted_person:
    __slots__ = ['name', 'age']
    def __init__(self, name, age):
        self.name = name
        self.age = age

optimized_instance = Slotted_person("Harry", 20)
print(f"Optimized memory instance: {asizeof.asizeof(optimized_instance)} bytes")

Using __slots__ makes the instance about 75% more memory-efficient, which shrinks the program's footprint and can in turn improve speed.
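The saving comes from dropping the per-instance dictionary. Here is a minimal sketch of that effect, using hypothetical PlainPoint and SlottedPoint classes that are not part of the example above:

```python
class PlainPoint:
    """Regular class: attributes live in a per-instance __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class SlottedPoint:
    """Slotted class: only the names listed in __slots__ are allowed."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


plain = PlainPoint(1, 2)
slotted = SlottedPoint(1, 2)

plain_has_dict = hasattr(plain, "__dict__")      # regular instances carry a dict
slotted_has_dict = hasattr(slotted, "__dict__")  # slotted instances do not

plain.z = 3  # fine: stored in plain.__dict__
try:
    slotted.z = 3  # "z" is not listed in __slots__
    new_attribute_rejected = False
except AttributeError:
    new_attribute_rejected = True

print(plain_has_dict, slotted_has_dict, new_attribute_rejected)
```

The trade-off is visible in the last step: a slotted instance cannot take attributes that were not declared up front.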

Here is a comparison:

import time
import gc  # Garbage collection
from pympler import asizeof

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

class SlottedPerson:
    __slots__ = ['name', 'age']
    def __init__(self, name, age):
        self.name = name
        self.age = age

# Measure memory usage and creation time
def measure_time_and_memory(cls, name, age, iterations=1000):
    gc.collect()  # Force garbage collection
    start_time = time.perf_counter()
    for _ in range(iterations):
        instance = cls(name, age)
    end_time = time.perf_counter()
    memory_usage = asizeof.asizeof(instance)
    avg_time = (end_time - start_time) / iterations
    return memory_usage, avg_time * 1000  # Convert to milliseconds

# Measurements for the unoptimized class
unoptimized_memory, unoptimized_time = measure_time_and_memory(Person, "Harry", 20)
print(f"Unoptimized memory instance: {unoptimized_memory} bytes")
print(f"Time taken to create unoptimized instance: {unoptimized_time:.6f} milliseconds")

# Measurements for the optimized class
optimized_memory, optimized_time = measure_time_and_memory(SlottedPerson, "Harry", 20)
print(f"Optimized memory instance: {optimized_memory} bytes")
print(f"Time taken to create optimized instance: {optimized_time:.6f} milliseconds")

# Compute the speedup
speedup = unoptimized_time / optimized_time
print(f"{speedup:.2f} times faster")

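An explicit gc.collect() call had to be added before each measurement because of overhead from garbage collection and from other processes running in the background. Such small fluctuations can occasionally produce seemingly counterintuitive results: for example, the optimized instance may take slightly longer to create even though it uses less memory. That may look like a nuisance, but the memory savings are worth it.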
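One way to reduce that noise is to repeat the measurement and keep the best run. The sketch below uses the standard timeit module, which also switches garbage collection off while timing. The helper name best_creation_time and its default arguments are assumptions for illustration only.

```python
import timeit


class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age


class SlottedPerson:
    __slots__ = ["name", "age"]

    def __init__(self, name, age):
        self.name = name
        self.age = age


def best_creation_time(cls, repeat=5, number=100_000):
    """Return the best per-instance creation time in seconds."""
    timings = timeit.repeat(lambda: cls("Harry", 20), repeat=repeat, number=number)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings) / number


plain_time = best_creation_time(Person)
slotted_time = best_creation_time(SlottedPerson)
print(f"Person:        {plain_time * 1e9:.1f} ns per instance")
print(f"SlottedPerson: {slotted_time * 1e9:.1f} ns per instance")
```

The numbers will differ between machines and Python versions, so treat them as a relative comparison only.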

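2. List comprehensions

When iterating over data in Python, the choice between a for loop and a list comprehension can significantly affect performance. List comprehensions are not only a more Pythonic way to write loops, they are also faster in most cases.

Let's look at an example that builds a list of the squares of the numbers from 1 to 10 million: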

import time

# Using a for loop
start = time.perf_counter()
squares_loop = []

for i in range(1, 10_000_001):
    squares_loop.append(i ** 2)
end = time.perf_counter()

print(f"For loop: {end - start:.6f} seconds")

# Using a list comprehension
start = time.perf_counter()
squares_comprehension = [i ** 2 for i in range(1, 10_000_001)]
end = time.perf_counter()

print(f"List comprehension: {end - start:.6f} seconds")

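2.1 What is a list comprehension?

Under the hood, a list comprehension is compiled to a tight loop that appends each item with a dedicated bytecode instruction (LIST_APPEND in CPython). A standard for loop that calls append(), by contrast, needs several Python bytecode instructions per iteration, including a method lookup and a function call, which adds overhead.

You will typically find list comprehensions 30-50% faster than the equivalent for loop, though the exact gain depends on the Python version and the loop body. That is a significant improvement, and it makes list comprehensions both cleaner and faster than a typical for loop.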
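The bytecode difference can be inspected with the standard dis module. This is a sketch for CPython only; the function names and the opnames helper are made up for illustration.

```python
import dis
import types


def squares_loop(n):
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    return [i * i for i in range(n)]


def opnames(func):
    """Collect opcode names from func and any nested code objects."""
    names = set()
    pending = [func.__code__]
    while pending:
        code = pending.pop()
        names.update(ins.opname for ins in dis.get_instructions(code))
        # Older CPython versions compile a comprehension into a nested code object.
        pending.extend(c for c in code.co_consts if isinstance(c, types.CodeType))
    return names


loop_ops = opnames(squares_loop)
comprehension_ops = opnames(squares_comprehension)
print("LIST_APPEND in loop:         ", "LIST_APPEND" in loop_ops)
print("LIST_APPEND in comprehension:", "LIST_APPEND" in comprehension_ops)
```

Both functions return the same list; only the way each item gets appended differs.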

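2.2 When to use list comprehensions

Recommended for transformations and filtering, when you need a new list built from an existing iterable.
Avoid them for complex operations that need multiple nested loops or that would hurt readability.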
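A small sketch of the recommended case, combining a transformation and a filter in one comprehension. The sample data is made up for illustration.

```python
words = ["apple", "Banana", "cherry", "date"]

# Transformation (upper-case) plus filtering (more than four characters).
long_upper = [word.upper() for word in words if len(word) > 4]

print(long_upper)  # ['APPLE', 'BANANA', 'CHERRY']
```

Once the expression needs several nested for clauses, a plain loop is usually easier to read.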

Adopting list comprehensions in your Python code lets you write cleaner, faster, and more efficient scripts.

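3. The @lru_cache decorator

If your Python function repeatedly performs the same expensive computation, the lru_cache decorator from the functools module can improve performance dramatically by caching the results of previous calls. This is especially useful for recursive functions and for tasks that involve repeated computation.

3.1 What is `lru_cache`?

lru_cache stands for Least Recently Used cache. It caches the results of function calls and, whenever the input arguments are the same, retrieves the result from memory instead of recomputing it. By default it caches up to 128 calls, but you can configure this limit or even make the cache unbounded.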
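The caching behavior can be observed through cache_info(), which every lru_cache-wrapped function provides. A minimal sketch, where the square function is a stand-in chosen for illustration:

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def square(n):
    return n * n


square(4)  # miss: computed and stored
square(4)  # hit: returned from the cache
square(5)  # miss: new argument

info = square.cache_info()
print(info)  # CacheInfo(hits=1, misses=2, maxsize=128, currsize=2)
```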

A classic use case is computing Fibonacci numbers, where naive recursion leads to redundant computation.

Without lru_cache:

import time

def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

start = time.perf_counter()

print(f"Result: {fibonacci(35)}")
print(f"Time taken without cache: {time.perf_counter() - start:.6f} seconds")

With lru_cache:

from functools import lru_cache
import time

@lru_cache(maxsize=128)  # Cache the most recent 128 results
def fibonacci_cached(n):
    if n <= 1:
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)

start = time.perf_counter()

print(f"Result: {fibonacci_cached(35)}")
print(f"Time taken with cache: {time.perf_counter() - start:.6f} seconds")

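3.2 Performance comparison

Without caching, computing Fibonacci numbers takes far longer because of the repeated calls. With lru_cache, previously computed results are reused, which improves performance dramatically: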

Without cache: 3.456789 seconds
With cache: 0.000234 seconds

Speedup factor = Without cache time / With cache time
Speedup factor = 3.456789 seconds / 0.000234 seconds
Speedup factor ≈ 14772.60
Percentage improvement = (Speedup factor - 1) * 100
Percentage improvement = (14772.60 - 1) * 100
Percentage improvement ≈ 1477160%
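Timings vary from machine to machine, but the number of function calls does not. The following sketch counts how often each function body runs for n = 20. The calls dictionary and the function names are illustrative additions.

```python
from functools import lru_cache

calls = {"plain": 0, "cached": 0}


def fib_plain(n):
    calls["plain"] += 1
    if n <= 1:
        return n
    return fib_plain(n - 1) + fib_plain(n - 2)


@lru_cache(maxsize=128)
def fib_cached(n):
    calls["cached"] += 1  # the body only runs on a cache miss
    if n <= 1:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


plain_result = fib_plain(20)
cached_result = fib_cached(20)
print(plain_result, cached_result)  # 6765 6765
print(calls)  # {'plain': 21891, 'cached': 21}
```

The cached version runs its body once per distinct argument (0 through 20), while the plain version grows exponentially with n.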

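3.3 Configuring the cache

maxsize: limits the number of cached results (default 128). Set maxsize=None for an unbounded cache.
lru_cache(None): gives long-running programs an unbounded cache; entries are never evicted, so memory grows with the number of distinct arguments.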
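To see what the limit does, here is a sketch with a deliberately tiny cache. The shout function and maxsize=2 are chosen only to make the eviction visible.

```python
from functools import lru_cache


@lru_cache(maxsize=2)
def shout(word):
    return word.upper()


shout("a")  # miss; cache holds: a
shout("b")  # miss; cache holds: a, b
shout("c")  # miss; evicts "a", the least recently used entry
shout("a")  # miss again, because "a" was evicted

info = shout.cache_info()
print(info)  # CacheInfo(hits=0, misses=4, maxsize=2, currsize=2)

shout.cache_clear()  # empty the cache and reset the statistics
cleared = shout.cache_info()
print(cleared.currsize)  # 0
```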

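3.4 When to use `lru_cache`?

Repeated computations with the same inputs, such as recursive functions or API calls.
Functions for which recomputing is more expensive than caching.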
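One practical constraint when deciding is that lru_cache stores results in a dictionary keyed by the arguments, so the arguments must be hashable. A short sketch, where the total function is hypothetical:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def total(values):
    return sum(values)


try:
    total([1, 2, 3])  # a list is not hashable
    list_accepted = True
except TypeError:
    list_accepted = False

tuple_total = total((1, 2, 3))  # a tuple is hashable and can be cached

print(list_accepted, tuple_total)  # False 6
```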

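With the lru_cache decorator you can optimize your Python programs to save time and computing resources, which makes it an essential tool in any developer's performance toolkit.

4. Generators

Generators are a kind of iterable in Python, but unlike lists they do not store all of their values in memory. Instead they produce values on the fly, yielding one result at a time. This makes them an excellent choice for processing large datasets or streaming data.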
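A minimal sketch of that one-value-at-a-time behavior, using a hypothetical countdown generator function:

```python
def countdown(n):
    """Yield n, n-1, ..., 1, producing each value only when asked."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
first = next(gen)  # runs the body up to the first yield
rest = list(gen)   # pulls the remaining values

print(first, rest)  # 3 [2, 1]
```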

4.1 Simulating big data with a list versus a generator

We use a list and a generator to simulate processing a dataset of 10 million records.

Using a list

import sys

# Simulate big data as a list
big_data_list = [i for i in range(10_000_000)]

# Check memory usage
print(f"Memory usage for list: {sys.getsizeof(big_data_list)} bytes")

# Process the data
result = sum(big_data_list)
print(f"Sum of list: {result}")

Output:

Memory usage for list: 89095160 bytes
Sum of list: 49999995000000

Using a generator

import sys

# Simulate big data as a generator
big_data_generator = (i for i in range(10_000_000))

# Check memory usage
print(f"Memory usage for generator: {sys.getsizeof(big_data_generator)} bytes")

# Process the data
result = sum(big_data_generator)
print(f"Sum of generator: {result}")

The generator object itself takes only 192 bytes, so the saving is:

Memory saved = 89095160 bytes - 192 bytes
Memory saved = 89094968 bytes
Percentage saved = (Memory saved / List memory usage) * 100
Percentage saved = (89094968 bytes / 89095160 bytes) * 100
Percentage saved ≈ 99.9998%
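Two caveats are worth checking on a smaller scale. First, sys.getsizeof reports only the container itself, not the integers a list points to. Second, a generator can be consumed only once. A sketch with 100,000 items, where the variable names are illustrative:

```python
import sys

numbers_list = [i for i in range(100_000)]
numbers_gen = (i for i in range(100_000))

list_size = sys.getsizeof(numbers_list)  # the list's own pointer array
gen_size = sys.getsizeof(numbers_gen)    # small, independent of the item count

first_sum = sum(numbers_gen)   # consumes the generator
second_sum = sum(numbers_gen)  # nothing left, so the sum of no items is 0

print(f"list: {list_size} bytes, generator: {gen_size} bytes")
print(first_sum, second_sum)  # 4999950000 0
```

If the data has to be traversed more than once, either recreate the generator or keep a list.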

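4.2 Real-world example: processing a log file

Suppose you are analyzing a huge server log file and want to count the error messages:

Processing the log with a generator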

def log_file_reader(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line

# Count the error messages
error_count = sum(1 for line in log_file_reader("large_log_file.txt") if "ERROR" in line)

print(f"Total errors: {error_count}")

Here the generator reads the file one line at a time, which avoids loading the whole file into memory.
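To try this without a real server log, the sketch below writes a small temporary file first. The sample log lines and the count_errors helper are assumptions made for illustration.

```python
import os
import tempfile


def log_file_reader(file_path):
    """Yield the file's lines one at a time."""
    with open(file_path, "r") as file:
        for line in file:
            yield line


def count_errors(file_path):
    """Count the lines that contain the text ERROR."""
    return sum(1 for line in log_file_reader(file_path) if "ERROR" in line)


# Create a tiny sample log so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("INFO start\nERROR disk full\nWARNING slow\nERROR timeout\n")
    sample_path = tmp.name

error_count = count_errors(sample_path)
os.remove(sample_path)  # clean up the temporary file

print(f"Total errors: {error_count}")  # Total errors: 2
```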

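For large datasets, generators are a powerful tool for writing memory-efficient, scalable Python programs. They are particularly well suited to sequential data processing such as analyzing logs, streaming data, or working through huge CSV files. When memory is limited, replacing lists with generators can make your code faster and leaner.

5. Avoid global variables

Because of the way Python resolves variable names, accessing a local variable is faster than accessing a global one. An example makes the timing difference between the two clear.

5.1 Why local variables are faster

In Python, when a variable is referenced: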

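Local variables are accessed directly from the function's scope; CPython resolves them at compile time and reads them from a fixed slot.
Global variables are looked up by name in the module's global dictionary (and then in the built-ins if not found there), which adds an extra lookup step.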
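The two access paths show up as different bytecode instructions in CPython. A sketch using the standard dis module, with function and variable names invented for illustration:

```python
import dis

counter = 10  # module-level (global) name


def read_global():
    return counter  # looked up by name at run time


def read_local(value):
    return value  # read from the function's local slot


global_ops = [ins.opname for ins in dis.get_instructions(read_global)]
local_ops = [ins.opname for ins in dis.get_instructions(read_local)]

print(global_ops)  # contains LOAD_GLOBAL
print(local_ops)   # contains LOAD_FAST (or a variant of it on newer CPython)
```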

Timing comparison of global versus local variable access:

import time

# Global variable
global_var = 10

# Function that accesses the global variable
def access_global():
    global global_var
    return global_var

# Function that accesses a local variable
def access_local():
    local_var = 10
    return local_var

# Time global variable access
start_time = time.perf_counter()
for _ in range(1_000_000):
    access_global()  # Access global variable
end_time = time.perf_counter()
global_access_time = end_time - start_time

# Time local variable access
start_time = time.perf_counter()
for _ in range(1_000_000):
    access_local()  # Access local variable
end_time = time.perf_counter()
local_access_time = end_time - start_time

# Print the timings
print(f"Time taken to access global variable: {global_access_time:.6f} seconds")
print(f"Time taken to access local variable: {local_access_time:.6f} seconds")

Output:

Time taken to access global variable: 0.265412 seconds
Time taken to access local variable: 0.138774 seconds

Speedup factor = Time taken to access global variable / Time taken to access local variable
Speedup factor = 0.265412 seconds / 0.138774 seconds
Speedup factor ≈ 1.9125
Percentage improvement = (Speedup factor - 1) * 100
Percentage improvement = (1.9125 - 1) * 100
Percentage improvement ≈ 91.25%
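A common way to apply this in a hot loop is to bind a global or module attribute to a local name once, before the loop. The sketch below assumes math.sqrt as the function being called. The gain is usually small, especially on recent CPython versions, so measure before relying on it.

```python
import math


def roots_global(values):
    # math is looked up in the globals, and sqrt on the module, for every item.
    return [math.sqrt(v) for v in values]


def roots_local(values):
    sqrt = math.sqrt  # bind once to a local name
    return [sqrt(v) for v in values]


sample = [1.0, 4.0, 9.0]
global_result = roots_global(sample)
local_result = roots_local(sample)

print(global_result, local_result)  # [1.0, 2.0, 3.0] [1.0, 2.0, 3.0]
```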

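Final thoughts

Optimizing Python code does not have to be a daunting task. Techniques such as using __slots__ for memory efficiency, caching with functools.lru_cache, replacing loops with list comprehensions, and avoiding global variables can greatly improve your code's performance. In addition, using generators for large datasets keeps your application efficient and scalable.

Remember that optimization is a balancing act: focus on the areas with the biggest impact and do not overcomplicate your code. Python's simplicity and readability are its greatest strengths, and these techniques help you keep those qualities while unlocking better performance.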