[Python] Concurrency and Parallelism

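Table of contents

Concurrency and Parallelism (Python implementation)
- 1. Principles of concurrency and parallelism
  - 1.1 Basic concept of concurrency
  - 1.2 Basic concept of parallelism
  - 1.3 Differences between concurrency and parallelism
  - 1.4 What is the GIL
    - 1.4.1 How the GIL works
    - 1.4.2 Impact of the GIL on Python programming
    - 1.4.3 Some common Python interpreters
      - 1.4.3.1 CPython
      - 1.4.3.2 PyPy
      - 1.4.3.3 Jython
      - 1.4.3.4 IronPython
    - 1.4.4 Summary of the GIL's impact
- 2. Multithreading in Python
  - 2.1 _thread
  - 2.2 threading
  - 2.3 Daemon threads
  - 2.4 Joining threads (Join)
  - 2.5 Locks
    - 2.5.1 Multithreading without a lock
    - 2.5.2 Multithreading with a lock
- 3. Multiprocessing in Python
  - 3.1 multiprocessing in detail
  - 3.2 Process pool (Pool)
    - Example: running multiple tasks with Pool.map
  - 3.3 Creating child processes
  - 3.4 Inter-process communication
    - 3.4.1 Queue
    - 3.4.2 Pipe
  - 3.5 Sharing data with Manager
  - 3.6 Some common data-sharing approaches
    - 3.6.1 Using a database
    - 3.6.2 Using the file system
    - 3.6.3 Using a message queue service
    - 3.6.4 Using shared memory (Shared Memory)
    - 3.6.5 Manager (server process)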

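Concurrency and Parallelism (Python implementation)

1. Principles of concurrency and parallelism

1.1 Basic concept of concurrency

Concurrency means making progress on several tasks within the same period of time by interleaving them, whether or not they ever run at the same instant. In Python, the threading module provides concurrent execution. Although the Global Interpreter Lock (GIL) in CPython prevents threads from executing Python bytecode in parallel, threading is still well suited to I/O-bound work such as network requests.

Concurrency example (using the threading module):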

import threading
import time

def task1():
    print("Task 1 is running")
    time.sleep(1)  # simulate an I/O wait

def task2():
    print("Task 2 is running")
    time.sleep(1)  # simulate an I/O wait

if __name__ == "__main__":
    thread1 = threading.Thread(target=task1)
    thread2 = threading.Thread(target=task2)

    # Start both threads; they take turns on the interpreter instead of running simultaneously
    thread1.start()
    thread2.start()

    # Wait for both tasks to finish
    thread1.join()
    thread2.join()

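In this example task1 and task2 run concurrently: the threads take turns rather than running simultaneously on different cores. Because time.sleep releases the GIL while waiting, the two waits overlap, so the program finishes in roughly 1 second instead of 2 and looks parallel even though it is not.

1.2 Basic concept of parallelism

Parallelism means several tasks are literally executing at the same instant. In Python, the multiprocessing module provides this and is suited to CPU-bound work. It creates multiple processes, each with its own Python interpreter instance and its own GIL, so the processes can run in parallel on a multi-core CPU.

Parallelism example (using the multiprocessing module):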

from multiprocessing import Pool
import os

def process_task(item):
    print(f"{item} is being processed by process {os.getpid()}")

if __name__ == "__main__":
    items = ["Item1", "Item2", "Item3", "Item4"]

    # Process the items in parallel with a pool of worker processes
    with Pool(processes=4) as pool:
        pool.map(process_task, items)

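In this example Pool creates a pool of four worker processes and distributes the items among them. Each worker is an independent process that the operating system can schedule on a different core, which is what makes the execution truly parallel. With tasks this small, a single worker may end up handling more than one item, so the printed process IDs are not guaranteed to be all different.

1.3 Differences between concurrency and parallelism

Property	Concurrency	Parallelism
Definition	Handles multiple tasks within the same time period	Handles multiple tasks at literally the same instant
Execution environment	Single core or multiple cores	Usually requires multiple cores
How it works	Tasks are interleaved, typically by time slicing	Each task runs on its own core
Typical use	I/O-bound tasks such as network requests	CPU-bound tasks such as scientific computing
Examples	Web crawlers, file reading, etc.	Numerical computation, data processing, etc.

Side-by-side comparison of concurrent and parallel code (using the concurrent.futures module):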

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time
import os

def task(name):
    print(f"Task {name} is being executed by {os.getpid()}")
    time.sleep(1)  # simulate the time the task takes

if __name__ == "__main__":
    # Concurrent execution: thread pool (both tasks report the same process ID)
    with ThreadPoolExecutor(max_workers=2) as executor:
        executor.submit(task, "1")
        executor.submit(task, "2")

    # Parallel execution: process pool (tasks run in separate worker processes)
    with ProcessPoolExecutor(max_workers=2) as executor:
        executor.submit(task, "1")
        executor.submit(task, "2")

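In this example, ThreadPoolExecutor runs the tasks concurrently on threads inside one process, while ProcessPoolExecutor runs them in parallel in separate worker processes.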
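The example above submits tasks but never looks at what they return. As a minimal sketch of how results are usually collected, the block below keeps the Future objects returned by submit and reads them with as_completed. The function fetch and the jobs dictionary are hypothetical stand-ins for real I/O work; the same pattern should carry over to ProcessPoolExecutor as long as the function and its arguments can be pickled.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def fetch(name, delay):
    """Hypothetical I/O-bound task: the sleep stands in for a network call."""
    time.sleep(delay)
    return f"{name} done"


# Hypothetical job list: task name -> simulated wait in seconds
jobs = {"a": 0.2, "b": 0.1, "c": 0.3}
results = {}

with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() returns a Future immediately; remember which task it belongs to
    future_to_name = {executor.submit(fetch, name, delay): name
                      for name, delay in jobs.items()}
    # as_completed() yields each future as soon as it finishes
    for future in as_completed(future_to_name):
        # result() returns the value, or re-raises the task's exception
        results[future_to_name[future]] = future.result()

print(results)
```

Calling future.result() is also where an exception raised inside a task surfaces, so ignoring the futures can hide failures.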

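1.4 What is the GIL

The GIL (Global Interpreter Lock) is a mechanism that the CPython interpreter introduced to simplify memory management and keep the interpreter itself thread-safe. It allows only one thread at a time to execute Python bytecode, which protects the interpreter's internal state (for example, reference counts) from concurrent modification. It does not make your own multi-step operations on shared data atomic.

1.4.1 How the GIL works

In CPython, the standard Python interpreter, the GIL is a mutex. Only the thread that currently holds it may touch the interpreter's internal data structures.
A running thread gives up the GIL when it blocks on I/O, or when another thread has been waiting longer than the switch interval (5 ms by default since Python 3.2; older versions counted bytecode instructions instead). This gives the other threads a chance to run.
Any number of threads can be created, but because of the GIL they cannot execute CPU-bound Python code in parallel.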
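On current CPython versions, how long a thread may keep the GIL while others are waiting is governed by a time-based switch interval that the sys module exposes. The sketch below only reads the value, changes it and restores it; the 0.01 second setting is an arbitrary illustration, not a tuning recommendation.

```python
import sys

# How long a thread may keep running while other threads are waiting for the GIL
original_interval = sys.getswitchinterval()
print(f"switch interval: {original_interval} s")

# A larger interval means fewer forced switches between CPU-bound threads
sys.setswitchinterval(0.01)
changed_interval = sys.getswitchinterval()
print(f"changed to: {changed_interval} s")

# Restore the original value so the rest of the program is unaffected
sys.setswitchinterval(original_interval)
```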

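1.4.2 Impact of the GIL on Python programming

The GIL affects multithreaded Python programs, CPU-bound ones in particular, in the following ways:

It limits parallelism for CPU-bound tasks:
- In CPython, CPU-bound tasks (numerical computation, image processing and so on) cannot run in parallel on multiple threads. Even on a machine with several cores, only one thread executes Python bytecode at any moment, so adding threads does not speed up CPU-bound work.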
```
import threading
import time

def cpu_task():
    # Simulate a compute-intensive task
    count = 0
    for _ in range(100000000):
        count += 1

# Create several threads
threads = [threading.Thread(target=cpu_task) for _ in range(4)]

start_time = time.time()
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print("Total time:", time.time() - start_time)
```
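Even with 4 threads, the total time is about what it would take to run the four tasks one after another (it can even be slightly longer), because the GIL prevents them from using the CPU in parallel.
I/O-bound tasks still perform well:
- For I/O-bound tasks (network requests, file operations and so on), multithreading can still improve performance. The GIL is released while a thread waits for I/O, so other threads can use that waiting time to run.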
```
import threading
import time

def io_task():
    time.sleep(1)  # simulate an I/O operation

# Create several threads
threads = [threading.Thread(target=io_task) for _ in range(4)]

start_time = time.time()
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print("Total time:", time.time() - start_time)
```
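In this example the GIL is still present, but the threads switch while they are waiting on I/O, so the total time is close to 1 second rather than 4.
Multiprocessing as an alternative:
- To get around the GIL for CPU-bound tasks, use the multiprocessing module to create several processes. Each process has its own Python interpreter instance and its own GIL, so the tasks can run in parallel on multiple cores.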
```
from multiprocessing import Pool
import time

def cpu_task(_):
    count = 0
    for _ in range(100000000):
        count += 1

if __name__ == "__main__":
    start_time = time.time()
    with Pool(4) as pool:
        pool.map(cpu_task, range(4))
    print("Total time:", time.time() - start_time)
```
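A process pool can use all the available cores, so on a machine with at least four cores the elapsed time drops substantially compared with the threaded version.

The GIL keeps the interpreter thread-safe, but it limits how much parallelism multithreading can deliver. For I/O-bound tasks the impact is small. For CPU-bound tasks, multiprocessing rather than multithreading is the better choice because it can use every core.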

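1.4.3 Some common Python interpreters

Python has several interpreter implementations, and they differ in performance and features, especially in how they handle multithreading and multiprocessing. The main ones are listed below together with their GIL behavior and caveats.

1.4.3.1 CPython

CPython is the official reference implementation of Python, written in C. It is the most widely used interpreter and the one you get by default.
The GIL (Global Interpreter Lock) is the lock CPython uses to keep the interpreter thread-safe. Because CPython's memory management of Python objects (reference counting) is not itself thread-safe, the GIL allows only one thread per process to execute bytecode at a time.
For multithreaded CPU-bound tasks the GIL is a significant limit: even on a multi-core CPU, only one thread can hold the GIL and run at any moment.
Impact: for CPU-bound work, multithreading in CPython performs worse than multiprocessing and often no better than a single thread, because threads keep acquiring and releasing the GIL and that contention adds overhead. For I/O-bound work (file reads, network requests) the GIL matters much less, because a thread releases it while waiting for I/O and lets other threads run.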

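1.4.3.2 PyPy

PyPy is an alternative Python implementation written in RPython (Restricted Python). It includes a JIT (Just-in-Time) compiler, which can speed up Python code considerably.
GIL: PyPy also has a GIL. It can outperform CPython in many cases, but multithreaded CPU-bound tasks are still limited by the GIL.
Strengths: for long-running pure-Python code, loop-heavy code in particular, PyPy is often several times faster than CPython. Short scripts and code dominated by C extensions benefit less. It suits Python programs with high performance requirements that can profit from JIT compilation.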

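1.4.3.3 Jython

Jython is a Python implementation that runs on the Java Virtual Machine (JVM). It compiles Python code to Java bytecode and lets Python programs call Java libraries.
GIL: Jython has no GIL. It relies on the JVM's own threading model, so threads can run in parallel on a multi-core CPU and make better use of the cores.
Use cases: projects that need deep integration with Java, or that must run Python code inside a JVM environment. Jython does not support every CPython module, in particular modules bound to C code such as NumPy, and its stable releases implement Python 2.7 rather than Python 3.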

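1.4.3.4 IronPython

IronPython is a Python implementation designed for the .NET platform. It interoperates with .NET libraries, so Python code can coexist with C#, VB.NET and other .NET languages.
GIL: IronPython has no GIL. The .NET runtime manages threads itself and supports true parallel execution, so IronPython can use multiple cores and has an advantage for multithreaded work.
Use cases: projects that must integrate with the .NET platform, which is common in enterprise environments. Like Jython, IronPython is not fully compatible with the CPython standard library and C extension modules.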

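1.4.4 Summary of the GIL's impact

Performance overhead: because of the GIL, multithreading in CPython gives poor results for CPU-bound tasks, since threads compete for the GIL on every switch and that costs resources. For CPU-bound tasks, prefer multiple processes (for example the multiprocessing module) so that all cores are used.
Multithreading for I/O-bound tasks: although the GIL prevents threads from running in parallel, multithreading in CPython still improves performance for I/O-bound tasks (network requests, file operations and so on), because a thread releases the GIL while waiting for I/O and other threads can run in the meantime.
Choosing a different interpreter: if you need better multithreaded performance, especially for CPU-bound tasks, you can pick an interpreter without a GIL, such as Jython or IronPython. PyPy is usually faster than CPython on a single thread, but it also has a GIL, so it does not make CPU-bound threads run in parallel.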
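Since the behavior described above depends on which interpreter is running the code, it can help to check that at runtime. The sketch below uses the standard platform module; the variable names impl and gil_free_by_design are illustrative, and the set of GIL-free implementations simply restates the list from this section.

```python
import platform

# Typical values: "CPython", "PyPy", "Jython", "IronPython"
impl = platform.python_implementation()
print(f"Running on {impl} {platform.python_version()}")

# Per the comparison above, Jython and IronPython have no GIL,
# while CPython and PyPy do.
gil_free_by_design = impl in {"Jython", "IronPython"}
print("GIL-free implementation:", gil_free_by_design)
```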

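2. Multithreading in Python

The Python standard library provides two modules for multithreading: _thread and threading. _thread is the low-level module. threading wraps it in a higher-level API that makes multithreaded code more intuitive and easier to write. In practice you normally only need threading.

2.1 _thread

_thread is Python's most basic threading module. Its interface is primitive, so it is rarely used directly. It can start threads and offers a bare lock primitive (allocate_lock), but none of the higher-level tools such as join, events or conditions.

Example code: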

import _thread
import time

def print_time(thread_name, delay):
    for i in range(3):
        time.sleep(delay)
        print(f"{thread_name}: {time.ctime(time.time())}")

try:
    _thread.start_new_thread(print_time, ("Thread-1", 1))
    _thread.start_new_thread(print_time, ("Thread-2", 2))
except Exception as e:
    print("Error: unable to start thread", e)

time.sleep(5)  # keep the main thread alive so the output can be observed

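Sample output:

Thread-1: Wed Nov  6 16:44:12 2024
Thread-2: Wed Nov  6 16:44:13 2024
Thread-1: Wed Nov  6 16:44:13 2024
Thread-1: Wed Nov  6 16:44:14 2024
Thread-2: Wed Nov  6 16:44:15 2024

Note: _thread starts two threads. Thread-1 prints the time once per second and Thread-2 once every two seconds. _thread has no join, so the main thread calls time.sleep(5) to stay alive long enough to see the output. Thread-2's third print would come at about 6 seconds, after the main thread has already exited, which is why it appears only twice.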

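2.2 threading

The threading module is the high-level interface for multithreading in Python. It covers creating, managing and synchronizing threads through a friendlier API.

Example code: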

import threading
import time

class MyThread(threading.Thread):
    def __init__(self, name, delay):
        super().__init__()
        self.name = name
        self.delay = delay

    def run(self):
        for i in range(3):
            time.sleep(self.delay)
            print(f"{self.name}: {time.ctime(time.time())}")

thread1 = MyThread("Thread-1", 1)
thread2 = MyThread("Thread-2", 2)

thread1.start()
thread2.start()

thread1.join()
thread2.join()
print("Exiting Main Thread")

Sample output:

Thread-1: Wed Nov  6 16:46:41 2024
Thread-2: Wed Nov  6 16:46:42 2024
Thread-1: Wed Nov  6 16:46:42 2024
Thread-1: Wed Nov  6 16:46:43 2024
Thread-2: Wed Nov  6 16:46:44 2024
Thread-2: Wed Nov  6 16:46:46 2024
Exiting Main Thread

Note: the two MyThread instances run with different delays. join makes the main thread wait for both child threads to finish before it exits, so all of the output is produced.
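Subclassing Thread is not required. As a minimal sketch of the more common form, the block below passes a plain function through target and args, and has each thread write its result into its own slot of a shared list. The names worker and results are illustrative.

```python
import threading


def worker(index, results):
    # Each thread writes only to its own slot, so no lock is needed here
    results[index] = index * index


results = [None] * 4
threads = [
    threading.Thread(target=worker, args=(i, results), name=f"worker-{i}")
    for i in range(4)
]

for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every worker has stored its result

print(results)  # [0, 1, 4, 9]
```

A thread's target cannot return a value to the caller directly, which is why the results travel through a shared container here.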

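2.3 Daemon threads

A daemon thread is a thread that is terminated automatically once the main thread (and every other non-daemon thread) has finished. Daemon threads are typically used for background work such as logging. In Python you mark a thread as a daemon with thread.daemon = True or by passing daemon=True to the Thread constructor, in both cases before calling start(). The older setDaemon(True) method still works but has been deprecated since Python 3.10.

Example code: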

import threading
import time

def daemon_task():
    while True:
        time.sleep(1)
        print("Daemon thread running...")

daemon_thread = threading.Thread(target=daemon_task)
daemon_thread.daemon = True
daemon_thread.start()

time.sleep(5)  # the main thread ends after running for 5 seconds
print("Main thread ending...")

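Sample output:

Daemon thread running...
Daemon thread running...
Daemon thread running...
Daemon thread running...
Main thread ending...

Note: the daemon thread loops forever, printing once per second. The main thread does not join it. When the main thread finishes after 5 seconds the program exits and the daemon thread is stopped with it, which is why the infinite loop does not keep the program alive. Whether four or five lines are printed before the exit depends on timing.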
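Two details of the daemon flag are easy to check directly: a new thread inherits the flag from the thread that creates it, and the flag cannot be changed once the thread has started. The sketch below demonstrates both; the function background and the variable names are illustrative.

```python
import threading
import time


def background():
    # Runs forever; only safe because the thread is a daemon
    while True:
        time.sleep(0.1)


daemon_thread = threading.Thread(target=background, daemon=True)
daemon_thread.start()

# A thread created without the argument inherits the creator's flag
# (False when created from the main thread)
normal_thread = threading.Thread(target=lambda: None)

# The flag must be set before start(); changing it later raises RuntimeError
try:
    daemon_thread.daemon = False
    changed_after_start = True
except RuntimeError:
    changed_after_start = False

print(daemon_thread.daemon, normal_thread.daemon, changed_after_start)
```

Because daemon threads are stopped abruptly at exit, they are a poor fit for work that must finish cleanly, such as flushing a file.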

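2.4 Joining threads (Join)

The join method waits for a thread to finish. Use it when the main thread must wait for a child thread to complete before continuing.

Example code: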

import threading
import time

def task():
    time.sleep(2)
    print("Task completed")

thread = threading.Thread(target=task)
thread.start()

print("Waiting for the thread to complete...")
thread.join()
print("Thread has completed")

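Sample output:

Waiting for the thread to complete...
Task completed
Thread has completed

Note: after calling join, the main thread blocks until the child thread running task has finished, and only then prints its final message.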
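join also accepts a timeout, which is useful when the main thread should not wait indefinitely. join always returns None, so the way to tell whether the thread finished is to check is_alive() afterwards. A minimal sketch, with slow_task and the specific delays as illustrative assumptions:

```python
import threading
import time


def slow_task():
    time.sleep(0.5)  # stands in for work that takes a while


thread = threading.Thread(target=slow_task)
thread.start()

# Wait at most 0.1 s; join() returns None whether or not the thread finished
thread.join(timeout=0.1)
still_running = thread.is_alive()
print("Still running after timeout:", still_running)

# A second join() without a timeout waits for the real completion
thread.join()
finished = not thread.is_alive()
print("Finished:", finished)
```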

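2.5 Locks

In a multithreaded program, several threads accessing the same shared resource at the same time can cause a data race.

2.5.1 Multithreading without a lock

The following example uses no lock. It shows how several threads updating a shared variable without synchronization can leave the data inconsistent.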

import threading

counter = 0  # shared resource, not protected by a lock

def increment():
    global counter
    for _ in range(100000):
        counter += 1

# Create and start several threads
threads = [threading.Thread(target=increment) for _ in range(5)]
for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

print("Final counter value:", counter)

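Expected output:

Final counter value: 500000

Sample actual output (the result may differ on every run):

Final counter value: 467852

Explanation:
In this example counter is shared by several threads but is not protected by a lock, so the threads read and write it at the same time. Because of this race condition some increments are lost, and the final value can be smaller than the expected 500000. How often the loss shows up depends on the interpreter version: on recent CPython releases (3.10 and later, as far as I know) thread switches happen at fewer points, so this exact loop often prints 500000 anyway. That is an implementation detail, and the code is still incorrect.

Cause:
Each execution of counter += 1 actually consists of three steps:

Read the value of counter
Add 1 to the value
Write the result back to counter

When several threads interleave these steps, the reads and writes interfere with each other. For example, thread A reads counter and, before it writes the result back, thread B reads the same value and completes its own increment and write. Thread A then writes its stale result over thread B's update, so one increment disappears and the total is wrong.

This example shows that unsynchronized access to a shared variable from several threads can leave the data inconsistent. It deserves particular attention in real applications: a lock ensures that threads do not modify the shared resource at the same time, which avoids the race condition.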
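The three steps listed above are visible in the bytecode that CPython generates. As a small sketch, the block below disassembles a one-line increment with the standard dis module. The exact opcode names differ between Python versions, so the code only prints them and makes no assumption about the full list; increment_once is an illustrative helper.

```python
import dis

counter = 0


def increment_once():
    global counter
    counter += 1


# Collect the opcode names; the read (LOAD_GLOBAL) and the write
# (STORE_GLOBAL) are separate instructions with the addition in between
ops = [instruction.opname for instruction in dis.get_instructions(increment_once)]
print(ops)
```

Because the load and the store are distinct instructions, a thread switch between them is what allows one thread's update to overwrite another's.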

2.5.2 Multithreading with a lock

The threading module provides the Lock object for thread synchronization.

Example code:

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100000):
        with lock:  # acquire the lock; it is released when the block exits
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(5)]
for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

print("Final counter value:", counter)

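Sample output:

Final counter value: 500000

Note: the lock makes each update of counter thread-safe, because only one thread at a time can execute the read-add-write sequence, so the final count is always correct. Without the lock the result may be less than 500000.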
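A global variable plus a global lock works, but it is easy to forget the lock somewhere. One common alternative, sketched here under the assumption that the counter is the only shared state, is to keep the value and its lock together in a small class so that every access goes through the lock. SafeCounter and work are hypothetical names.

```python
import threading


class SafeCounter:
    """A counter whose value can only be touched while holding its lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # released automatically, even if an exception occurs
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


counter = SafeCounter()


def work(times):
    for _ in range(times):
        counter.increment()


threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("Final counter value:", counter.value)  # 50000
```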

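3. Multiprocessing in Python

Because of the Global Interpreter Lock (GIL), a multithreaded Python program cannot run CPU-bound tasks in true parallel, even on a multi-core CPU. The GIL lets only one thread execute Python bytecode at a time. That matters little for I/O-bound tasks but severely limits CPU-bound ones. To solve this, the multiprocessing module provides multi-process support: we can create several processes that use all the cores and handle tasks in parallel. The sections below cover the commonly used multiprocessing techniques in Python.

3.1 multiprocessing in detail

multiprocessing is Python's multi-process module. It runs tasks in multiple processes and suits CPU-bound work. Its API mirrors threading and supports creating processes, synchronizing them and communicating between them.

In multiprocessing, the Process class creates and starts a process. Each process has its own memory space and does not share variables with other processes, which avoids the data races that are common with threads. The simple example below shows how to create and start processes.

Example code: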

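from multiprocessing import Process
import time

def task(name):
    print(f"Task {name} is running")
    time.sleep(2)
    print(f"Task {name} is completed")

# The guard is required on platforms that start child processes with "spawn"
# (Windows, macOS): the child re-imports this module and must not start processes again.
if __name__ == "__main__":
    process1 = Process(target=task, args=("Process-1",))
    process2 = Process(target=task, args=("Process-2",))

    process1.start()
    process2.start()

    process1.join()
    process2.join()
    print("All processes are complete")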

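Sample output:

Task Process-1 is running
Task Process-2 is running
Task Process-1 is completed
Task Process-2 is completed
All processes are complete

Note: this example creates two independent processes, Process-1 and Process-2, which run the task function in parallel. The join calls make the main process wait for the child processes to finish, so the main program ends only after all of their output has been produced.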
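The statement that each process has its own memory space can be checked with a short experiment. In the sketch below a child process increments a module-level variable, and the parent then prints its own copy, which is unchanged. The names counter, bump and child are illustrative.

```python
from multiprocessing import Process

counter = 0  # module-level state; every process works on its own copy


def bump():
    global counter
    counter += 1
    return counter


def child():
    # Modifies the child's copy only
    print("child sees counter =", bump())


if __name__ == "__main__":
    p = Process(target=child)
    p.start()
    p.join()
    # The parent's variable was never touched by the child
    print("parent still sees counter =", counter)
```

This isolation is why processes avoid the data races seen with threads, and also why the communication tools in the following sections are needed whenever processes must exchange data.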

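3.2 Process pool (Pool)

When a large number of tasks must be processed, multiprocessing.Pool creates a pool of worker processes, which makes it easier to control how processes are created and managed. Pool offers methods such as map, apply, apply_async and starmap for dispatching tasks.

Example: running multiple tasks with Pool.map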

from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":
    with Pool(4) as pool:  # create a pool with 4 worker processes
        results = pool.map(square, [1, 2, 3, 4, 5])
    print("Results:", results)

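Sample output:

Results: [1, 4, 9, 16, 25]

Note: map passes each number in the list to square and returns the squared results in input order. The pool allows up to 4 processes to run at once, so the work can be spread over several cores. For a function as cheap as square the cost of starting processes and passing data outweighs the gain; the benefit appears when each task does substantial computation.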
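Of the Pool methods mentioned in this section, only map is shown above. As a rough sketch of two of the others, starmap unpacks each tuple into several arguments and apply_async submits a single call without blocking. The function power and the chosen inputs are illustrative.

```python
from multiprocessing import Pool


def power(base, exponent):
    return base ** exponent


if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # starmap: like map, but each tuple is unpacked into the arguments
        squares = pool.starmap(power, [(2, 2), (3, 2), (4, 2)])

        # apply_async: returns an AsyncResult at once; get() waits for the value
        async_result = pool.apply_async(power, (2, 10))

        print(squares)                       # [4, 9, 16]
        print(async_result.get(timeout=10))  # 1024
```

The get() call is placed inside the with block on purpose: leaving the block terminates the pool, so pending results should be collected before that.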

3.3 Creating child processes

Creating child processes with the Process class gives direct control over when each process starts and stops. A simple example:

from multiprocessing import Process

def child_task():
    print("Child process is running")

if __name__ == "__main__":
    process = Process(target=child_task)
    process.start()
    process.join()
    print("Child process has finished")

Sample output:

Child process is running
Child process has finished

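Note: the Process class creates and starts a child process directly. In multi-process code the main program must be protected with if __name__ == "__main__". On platforms that start children with the spawn method (Windows and macOS by default), the child re-imports the main module, and without the guard it would try to create child processes again.

3.4 Inter-process communication

Each process has its own memory space, so data in different processes is isolated. To pass data between processes, use the Queue or Pipe provided by multiprocessing.

3.4.1 Queue

The Queue class passes data between processes. It follows a first-in, first-out (FIFO) model: one process puts data on the queue and another process reads it off.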

from multiprocessing import Process, Queue

def producer(queue):
    queue.put("Data from producer")

def consumer(queue):
    data = queue.get()  # blocks until an item is available
    print("Received:", data)

if __name__ == "__main__":
    queue = Queue()
    p1 = Process(target=producer, args=(queue,))
    p2 = Process(target=consumer, args=(queue,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

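Sample output:

Received: Data from producer

Note: Queue is a process-safe structure. One process writes data to the queue and another reads from it. Here producer puts a message on the queue, and consumer takes it off and prints it. Because get blocks until data arrives, the order in which the two processes start does not matter.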
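The example above transfers a single item. When the consumer does not know in advance how many items will arrive, a common convention is to send a sentinel value that marks the end of the stream. The sketch below assumes None is never a legitimate item; SENTINEL, consume_all and the sample items are illustrative.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # assumption: None never appears as real data


def producer(queue, items):
    for item in items:
        queue.put(item)
    queue.put(SENTINEL)  # tell the consumer that nothing more will come


def consume_all(queue):
    """Read items until the sentinel arrives and return them in order."""
    received = []
    while True:
        item = queue.get()
        if item is SENTINEL:
            break
        received.append(item)
    return received


def consumer(queue):
    print("Received:", consume_all(queue))


if __name__ == "__main__":
    queue = Queue()
    p1 = Process(target=producer, args=(queue, ["a", "b", "c"]))
    p2 = Process(target=consumer, args=(queue,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
```

With several consumers, one sentinel per consumer is needed, since each get removes the item it reads.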

3.4.2 Pipe

Pipe is another inter-process communication mechanism. It creates a pair of connection objects through which two processes can exchange data.

from multiprocessing import Process, Pipe

def send_data(conn):
    conn.send("Message from sender")
    conn.close()

def receive_data(conn):
    print("Received:", conn.recv())
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p1 = Process(target=send_data, args=(child_conn,))
    p2 = Process(target=receive_data, args=(parent_conn,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

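Sample output:

Received: Message from sender

Note: Pipe creates a pair of connection objects. send_data sends a message through one end and receive_data receives it from the other. A pipe connects exactly two endpoints and is bidirectional by default, so it suits two-way communication between two processes.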
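The two-way nature of a pipe can be seen without starting any process, because both connection objects can be used from the same program. The sketch below sends a small message in each direction, then creates a one-way pipe with duplex=False. The variable names are illustrative, and the messages are kept tiny so they fit in the pipe's buffer.

```python
from multiprocessing import Pipe

# Default pipe: both ends can send and receive
left, right = Pipe()
left.send("ping")
first = right.recv()
right.send("pong")
second = left.recv()
print(first, second)  # ping pong

# One-way pipe: the first connection only receives, the second only sends
recv_end, send_end = Pipe(duplex=False)
send_end.send({"n": 1})
payload = recv_end.recv()
print(payload)  # {'n': 1}

for conn in (left, right, recv_end, send_end):
    conn.close()
```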

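3.5 Sharing data with Manager

In a multi-process program you sometimes need to share richer data structures, such as a list or a dict, among several processes. multiprocessing.Manager makes this possible.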

from multiprocessing import Process, Manager

def worker(shared_dict, key, value):
    shared_dict[key] = value

if __name__ == "__main__":
    with Manager() as manager:
        shared_dict = manager.dict()
        processes = [Process(target=worker, args=(shared_dict, i, i * 2)) for i in range(5)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print("Shared dictionary:", shared_dict)

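Sample output:

Shared dictionary: {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}

Note: this example uses the Manager's dict method to create a dictionary that several processes can share. Each process updates a different key, and the shared dictionary is printed once all processes have finished. The key order in the output can vary with the order in which the processes run. The Manager keeps the real object in a separate server process and gives each worker a proxy, so individual operations are safe across processes. A read-modify-write sequence on the same key is not atomic and still needs a lock.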
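The example above is safe without a lock only because every process writes a different key. When several processes update the same key, the read and the write are two separate proxy calls, so they need to be protected together. A minimal sketch using a lock created by the same manager; add_one and the key "hits" are illustrative.

```python
from multiprocessing import Manager, Process


def add_one(shared, lock, key):
    # Reading and writing are two separate calls to the manager,
    # so the lock is held across both
    with lock:
        shared[key] = shared.get(key, 0) + 1


if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.dict()
        lock = manager.Lock()
        processes = [
            Process(target=add_one, args=(shared, lock, "hits"))
            for _ in range(4)
        ]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print(dict(shared))  # {'hits': 4}
```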

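3.6 Some common data-sharing approaches

3.6.1 Using a database

A database is a general-purpose, persistent data store. Processes can share and exchange data through it, and they are not limited to the memory of a single machine. Common options include:

Relational databases (such as MySQL and PostgreSQL): different processes read and modify data through SQL queries. They suit scenarios that need a relational model and transactions.
NoSQL databases (such as Redis and MongoDB): Redis, for example, keeps data in memory, is fast and supports simple data structures, which suits real-time data exchange.

Example: sharing data through Redis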

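import redis
import multiprocessing

def write_to_redis():
    r = redis.Redis(host='localhost', port=6379, db=0)
    r.set('shared_key', 'Hello from process 1')

def read_from_redis():
    r = redis.Redis(host='localhost', port=6379, db=0)
    value = r.get('shared_key')  # bytes, or None if the key does not exist
    print("Read from Redis:", value.decode())

if __name__ == "__main__":
    # Create the processes; requires a Redis server listening on localhost:6379
    p1 = multiprocessing.Process(target=write_to_redis)
    p2 = multiprocessing.Process(target=read_from_redis)

    # Finish the write before starting the reader, otherwise the key may not exist yet
    p1.start()
    p1.join()
    p2.start()
    p2.join()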

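3.6.2 Using the file system

Processes can also share data through the file system, for example with JSON, CSV or plain text files, exchanging information by reading and writing those files. This approach suits data that does not need real-time updates and small-scale sharing.

Example: sharing data through a file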

import json
import multiprocessing

def write_to_file():
    data = {"shared_key": "Hello from process 1"}
    with open("shared_data.json", "w") as f:
        json.dump(data, f)

def read_from_file():
    with open("shared_data.json", "r") as f:
        data = json.load(f)
    print("Read from file:", data)

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=write_to_file)
    p2 = multiprocessing.Process(target=read_from_file)

    # Run the writer to completion first so the reader sees a complete file
    p1.start()
    p1.join()
    p2.start()
    p2.join()
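The file example works because the writer finishes before the reader starts. If a reader may open the file at any moment, one common precaution is to write to a temporary file and then rename it over the target, so a reader sees either the old content or the complete new content. The sketch below assumes the temporary file and the target are in the same directory; write_json_atomic is a hypothetical helper, and a temporary directory keeps the demonstration self-contained.

```python
import json
import os
import tempfile


def write_json_atomic(path, data):
    """Write data as JSON to path without ever exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on one file system
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        os.replace(tmp_path, path)  # swaps the new file in as a single step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "shared_data.json")
    write_json_atomic(target, {"shared_key": "Hello from process 1"})
    with open(target, encoding="utf-8") as f:
        loaded = json.load(f)

print("Read from file:", loaded)
```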