【学习笔记】科学计算

[pytorch 加速] CPU传输 & GPU计算的并行（pin_memory，non_blocking）

https://www.bilibili.com/video/BV15Xxve1EtZ

from IPython.display import Image
import os
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

references
- https://tigress-web.princeton.edu/~jdh4/PyTorchPerformanceTuningGuide_GTC2021.pdf
  - https://pytorch.org/blog/optimizing-cuda-rnn-with-torchscript/
- https://towardsdatascience.com/optimize-pytorch-performance-for-speed-and-memory-efficiency-2022-84f453916ea6
- https://blog.dailydoseofds.com/p/memory-pinning-to-accelerate-model

import torch
import torch.jit
import timeit

更快的数据传输

for epoch in range(epochs): for batch_idx, (data, target) in enumerate(train_loader):data, target = data.to(device), target.to(device)optimizer.zero_grad()output = model(data)loss = criterion(output, target)loss.backward()        optimizer.step()

standard PyTorch model training loop

data, target = data.to(device), target.to(device) transfers the data to the GPU from the CPU.
Everything executes on the GPU after the data transfer.

在这里插入图片描述

When the model is being trained on the 1st mini-batch, the CPU can transfer the 2nd mini-batch to the GPU.
This ensures that the GPU does not have to wait for the next mini-batch of data as soon as it completes processing an existing mini-batch.

在这里插入图片描述

While the CPU may remain idle, this process ensures that the GPU (which is the actual accelerator for our model training) always has data to work with.
Formally, this process is known as memory pinning, and it is used to speed up the data transfer from the CPU to the GPU by making the training workflow asynchronous.
enable pin_memory and set num_workers (muti-core processors) for faster transfers

train_loader = DataLoader(train_dataset,batch_size=64, shuffle=True,pin_memory=True, num_workers=8)

during the data transfer step in the training step, specify non_blocking=True, as depicted below:

for epoch in range(epochs):for batch_idx, (data, target) in enumerate(train_loader):data = data.to(device, non_blocking=True)target = target.to(device, non_blocking=True)optimizer.zero_grad()        output = model(data)loss = criterion(output, target)loss.backward()optimizer.step()

减少分页内存和pin memory的swap

以下以MNIST图像分类任务为例：

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import time
from tqdm.notebook import tqdm# 定义数据转换
transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5,), (0.5,))
])# 加载 MNIST 数据集
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)class SimpleNet(nn.Module):def __init__(self):super(SimpleNet, self).__init__()self.fc = nn.Sequential(nn.Flatten(),nn.Linear(28*28, 512),nn.ReLU(),nn.Linear(512, 10))def forward(self, x):return self.fc(x)device = torch.device("cuda" if torch.cuda.is_available() else "cpu")def train(loader, model, optimizer, device, non_blocking=False, epochs=5):model.train()total_loss = 0start_time = time.time()for epoch in tqdm(range(epochs), desc="Epochs"):batch_bar = tqdm(enumerate(loader), total=len(loader), desc=f"Epoch {epoch+1}/{epochs}", leave=False)for batch_idx, (data, target) in batch_bar:# 将数据移到设备上data = data.to(device, non_blocking=non_blocking)target = target.to(device, non_blocking=non_blocking)optimizer.zero_grad()output = model(data)loss = nn.functional.cross_entropy(output, target)loss.backward()optimizer.step()total_loss += loss.item()end_time = time.time()avg_loss = total_loss / (len(loader)*epochs)elapsed_time = end_time - start_timereturn avg_loss, elapsed_timefrom multiprocessing import cpu_count
loader1 = DataLoader(train_dataset, batch_size=64, shuffle=True)
loader2 = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=cpu_count()//2, pin_memory=True)model1 = SimpleNet().to(device)
optimizer1 = optim.SGD(model1.parameters(), lr=0.01)
avg_loss1, time1 = train(loader1, model1, optimizer1, device, non_blocking=False, epochs=5)
print(f"设置 1 - 平均损失: {avg_loss1:.4f}, 训练时间: {time1:.2f} 秒")

这样的运行情况是：

Epochs:   0%|          | 0/5 [00:00<?, ?it/s]
Epoch 1/5:   0%|          | 0/938 [00:00<?, ?it/s]
Epoch 2/5:   0%|          | 0/938 [00:00<?, ?it/s]
Epoch 3/5:   0%|          | 0/938 [00:00<?, ?it/s]
Epoch 4/5:   0%|          | 0/938 [00:00<?, ?it/s]
Epoch 5/5:   0%|          | 0/938 [00:00<?, ?it/s]
设置 1 - 平均损失: 0.3854, 训练时间: 42.71 秒

然后我们换一种方式，使用num_workers为CPU核的一半，并使用pin_memory

model2 = SimpleNet().to(device)
optimizer2 = optim.SGD(model2.parameters(), lr=0.01)
avg_loss2, time2 = train(loader2, model2, optimizer2, device, non_blocking=True, epochs=5)
print(f"设置 2 - 平均损失: {avg_loss2:.4f}, 训练时间: {time2:.2f} 秒")

运行结果为：

Epochs:   0%|          | 0/5 [00:00<?, ?it/s]
Epoch 1/5:   0%|          | 0/938 [00:00<?, ?it/s]
Epoch 2/5:   0%|          | 0/938 [00:00<?, ?it/s]
Epoch 3/5:   0%|          | 0/938 [00:00<?, ?it/s]
Epoch 4/5:   0%|          | 0/938 [00:00<?, ?it/s]
Epoch 5/5:   0%|          | 0/938 [00:00<?, ?it/s]
设置 2 - 平均损失: 0.3847, 训练时间: 19.92 秒

大约提升一倍的时间

另一个是JIT（Just-In-Time compilation) ）

JIT 通过将模型编译成中间表示（Intermediate Representation, IR），然后进一步将其转换为机器代码
Fuse the pointwise (elementwise) operations into a single kernel by PyTorch JIT
- JIT fuse the pointwise operations

# 创建一个大型的随机张量作为输入数据
x = torch.randn(15000, 15000)# 使用 JIT 编译的函数
@torch.jit.script
def fused_gelu(x):return x * 0.5 * (1.0 + torch.erf(x / 1.41421))# 未使用 JIT 编译的相同函数
def gelu(x):return x * 0.5 * (1.0 + torch.erf(x / 1.41421))# 使用 timeit 测量 JIT 编译函数的执行时间
jit_time = timeit.timeit('fused_gelu(x)', globals=globals(), number=100)
nonjit_time = timeit.timeit('gelu(x)', globals=globals(), number=100)print(jit_time, nonjit_time) # 20.05574530499871 31.39065190600013