PyTorch 源码学习：GPU 内存管理之它山之石—

TensorFlow 和 PyTorch 都是常用的深度学习框架，各自有一套独特但又相似的 GPU 内存管理机制（BFC 算法）。它山之石可以攻玉。了解 TensorFlow 的 BFC 算法有助于学习 PyTorch 管理 GPU 内存的精妙之处。本文重点关注 TensorFlow BFC 算法的核心思想，研究的 TensorFlow 版本为1f1379c5f721579c58b4ee560daac7df90f8519a。更多内容请参考：

Ubuntu 22.04 LTS 源码编译安装 PyTorch
【翻译】pytorch/CONTRIBUTING.md
【翻译】Pytorch机制，源代码分析与内存管理调研
深度学习框架与静态/动态计算图【笔记】
PyTorch 源码学习：阅读经验 & 代码结构
PyTorch 源码学习：从 Tensor 到 Storage
PyTorch 源码学习：Dispatch & Autograd & Operators
PyTorch 源码学习：GPU 内存管理之深入分析 CUDACachingAllocator

文章目录

1 关于 TensorFlow BFC 算法
2 关于 XLA
- 2.1 初见端倪
- 2.2 初识 XLA
3 主要数据结构
- 3.1 Chunk
- 3.2 Bin
- 3.3 辅助类
4 BFC 算法核心
- 4.1 分配内存
- - 4.1.1 调整大小
  - 4.1.2 查找Bin
  - 4.1.3 查找Chunk
  - 4.1.4 扩展内存
- 4.2 回收内存
- - 4.2.1 获取ChunkHandle
  - 4.2.2 更新状态
  - 4.2.3 合并Chunk
5 总结
参考资料
- 关于 XLA
- 关于 TensorFlow BFC

1 关于 TensorFlow BFC 算法

以下内容来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

背景：使用 GPU 训练时，一次训练任务无论是模型参数还是中间结果都需要占用大量显存。为了避免每次训练重新开辟显存带来计算之外的开销，一般框架的做法是在真正的训练任务开始前，将每个节点的输入和输出，以及模型参数的 shape 计算出来并全局开辟一次，例如 Caffe 就是这种做法。随着深度学习模型的发展和迭代，不仅模型训练的数据 shape 可能发生变化，就连模型本身在训练过程中也可能发生变化，那么按照固定 shape 一次开辟显存的做法就不能满足需求了。为此，TensorFlow 重新设计了较为灵活的显存管理机制，它使用了名为 BFC 的分配算法，并通过 BFC Allocator 为每个 Tensor 分配满足需求的显存。

问题：显存分配与回收的性能需求。Tensor 在每次创建时会得到存储区域，而每一轮训练都要重新创建新的 Tensor，那么这里面临的一个问题：如此频繁的分配和回收存储区，如何才能做的高效？试想对于 GPU 来说，如果Allocate函数直接封装 CUDA 中昂贵的cudaMalloc函数，当 Tensor 被释放时直接调用cudaFree函数，那么训练速度将会因为这些 overhead 大打折扣。

思路：存储池。如果你对操作系统这门课比较熟悉，那么应该很容易想到解决办法：将显存按照不同的大小一次性开辟出来，并组成存储池，每次调用Allocate函数时从存储池中获取，Tensor 回收时将显存重新挂到存储池中。这样做确实可以满足性能需求，但是需要为此设计一个相对复杂的存储管理器。BFC Allocator 就是 TensorFlow 中管理 GPU 显存的存储管理器。

2 关于 XLA

2.1 初见端倪

为什么要先说 XLA 呢？

笔者在 TensorFlow 仓库的 main 分支寻找与 BFC (Best-Fit with Coalescing) 算法有关的源码时，首先是定位到了 tensorflow/core/common_runtime/gpu，该目录下有 gpu_bfc_allocator.h 和 gpu_bfc_allocator.cc 两个文件，但并没有找到具体的与 BFC 算法相关的代码实现。

gpu_bfc_allocator.h 的核心代码：

#include <memory>
#include <optional>
#include <string>#include "xla/tsl/framework/allocator.h"
#include "xla/tsl/framework/bfc_allocator.h"
#include "xla/tsl/platform/macros.h"namespace tensorflow {// A GPU memory allocator that implements a 'best-fit with coalescing'
// algorithm.
class GPUBFCAllocator : public tsl::BFCAllocator {public:// See BFCAllocator::Options.struct Options {// Overridden by TF_FORCE_GPU_ALLOW_GROWTH if that envvar is set.bool allow_growth = false;// If nullopt, defaults to TF_ENABLE_GPU_GARBAGE_COLLECTION, or true if that// envvar is not present.//// Note:////  - BFCAllocator defaults garbage_collection to false, not true.//  - this is not the same override behavior as TF_FORCE_GPU_ALLOW_GROWTH.std::optional<bool> garbage_collection;double fragmentation_fraction = 0;bool allow_retry_on_failure = true;};GPUBFCAllocator(std::unique_ptr<tsl::SubAllocator> sub_allocator,size_t total_memory, const std::string& name,const Options& opts);~GPUBFCAllocator() override {}GPUBFCAllocator(const GPUBFCAllocator&) = delete;void operator=(const GPUBFCAllocator&) = delete;
};}  // namespace tensorflow

但从 gpu_bfc_allocator.h 的代码中，笔者发现GPUBFCAllocator类继承自tsl::BFCAllocator，而tsl命名空间就来自第三方库 third_party/xla。

而 BFC 算法的具体实现就在 xla/tsl/framework 目录下的 bfc_allocator.h 和 bfc_allocator.cc 两个文件中。

2.2 初识 XLA

TensorFlow 通过 XLA 编译器优化 GPU 代码执行。

下图来自：openxla/xla: A machine learning compiler for GPUs, CPUs, and ML accelerators

下图来自：技术栈架构 - AI技术栈解析及应用- 作者：张真瑜 | 山东大学智能创新研究院

在这里插入图片描述

OpenXLA 作为一个灵活的深度学习编译器框架，与 PyTorch 和 TensorFlow 深度集成，通过自定义算子、JIT 编译和 GPU 内核融合等技术，大幅提升了这些深度学习框架在 GPU 上的执行效率。同时，OpenXLA 还利用 CUDA Runtime API 和 CUDA Driver API，实现了对 GPU 硬件的精细控制和优化，包括内存管理、设备操作和内核启动等。

3 主要数据结构

3.1 Chunk

以下内容部分来自：TensorFlow 源码分析之内存管理BFC算法——DeepReve

整个内存空间由一个按基址升序排列的Chunk双向链表来表示，它们的直接前趋和后继必须在地址连续的内存空间。Chunk结构体里含有实际大小、请求大小、是否被占用、基址、直接前趋、直接后继、Bin索引等信息。

以下内容部分来自：Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理

Chunk 是 TensorFlow 的最小内存单位，由数倍 256 bytes (kMinAllocationSize) 的连续内存块组成。TensorFlow 的内存管理是基于 Chunk 的管理。

以下内容部分来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

Chunk 是 BFC 最核心的数据结构之一，在 TensorFlow 源码中是以struct来描述的。具体来说，一个 Chunk 代表一段连续的存储空间，BFC 要求各个 Chunk 要按照基地址升序排列并组织成双向链表，下图展示了 Chunk 的结构以及 Chunk 之间的连接关系。初始时，每个 Chunk 都有自己的size，并且这些size都是以 256 字节为模。应当注意，每个 Chunk 或者完全被标记为使用，或者完全标记为空闲，不存在该 Chunk 内只有部分空间被使用的情况。

Chunk的代码实现：

  // A Chunk points to a piece of memory that's either entirely free or entirely// in use by one user memory allocation.//// An AllocationRegion's memory is split up into one or more disjoint Chunks,// which together cover the whole region without gaps.  Chunks participate in// a doubly-linked list, and the prev/next pointers point to the physically// adjacent chunks.//// Since a chunk cannot be partially in use, we may need to split a free chunk// in order to service a user allocation.  We always merge adjacent free// chunks.//// Chunks contain information about whether they are in use or whether they// are free, and contain a pointer to the bin they are in.struct Chunk {size_t size = 0;  // Full size of buffer.// We sometimes give chunks that are larger than needed to reduce// fragmentation.  requested_size keeps track of what the client// actually wanted so we can understand whether our splitting// strategy is efficient.size_t requested_size = 0;// allocation_id is set to -1 when the chunk is not in use. It is assigned a// value greater than zero before the chunk is returned from// AllocateRaw, and this value is unique among values assigned by// the parent allocator.int64_t allocation_id = -1;void* ptr = nullptr;  // pointer to granted subbuffer.// If not kInvalidChunkHandle, the memory referred to by 'prev' is directly// preceding the memory used by this chunk.  E.g., It should start// at 'ptr - prev->size'ChunkHandle prev = kInvalidChunkHandle;// If not kInvalidChunkHandle, the memory referred to by 'next' is directly// following the memory used by this chunk.  E.g., It should be at// 'ptr + size'ChunkHandle next = kInvalidChunkHandle;// What bin are we in?BinNum bin_num = kInvalidBinNum;// Optional count when this chunk was most recently made free.uint64 freed_at_count = 0;bool in_use() const { return allocation_id != -1; }#ifdef TENSORFLOW_MEM_DEBUG// optional debugging infoconst char* op_name = nullptr;uint64 step_id = 0;int64 action_count = 0;
#endifstring DebugString(BFCAllocator* a,bool recurse) ABSL_NO_THREAD_SAFETY_ANALYSIS {string dbg;absl::StrAppend(&dbg, "  Size: ", strings::HumanReadableNumBytes(size)," | Requested Size: ", strings::HumanReadableNumBytes(requested_size)," | in_use: ", in_use(), " | bin_num: ", bin_num);if (recurse && prev != BFCAllocator::kInvalidChunkHandle) {Chunk* p = a->ChunkFromHandle(prev);absl::StrAppend(&dbg, ", prev: ", p->DebugString(a, false));}if (recurse && next != BFCAllocator::kInvalidChunkHandle) {Chunk* n = a->ChunkFromHandle(next);absl::StrAppend(&dbg, ", next: ", n->DebugString(a, false));}
#ifdef TENSORFLOW_MEM_DEBUGabsl::StrAppend(&dbg, ", for: ", op_name ? op_name : "UNKNOWN",", stepid: ", step_id, ", last_action: ", action_count);
#endifreturn dbg;}};

3.2 Bin

以下内容部分来自：TensorFlow内存管理bfc算法

BFC 算法采取的是被动分块的策略。最开始整个内存是一个 Chunk，随着用户申请空间的次数增加，最开始的大 Chunk 会被不断的 Split 开来，从而产生越来越多的小 Chunk。当 Chunk 数量很大时，为了寻找一个合适的内存块而遍历双链表无疑是一笔巨大的开销。

为了实现对空闲块的高效管理，BFC 算法设计了 Bin 这个抽象数据结构。

每个 Bin 都有一个size属性，一个 Bin 是一个拥有 Chunk size >= Bin size的空闲 Chunk 的集合。集合中的 Chunk 按照 Chunk size 的升序组织成单链表。
BFC 算法维护了一个 Bin 的集合：Bins。它由多个 Bin 以及从属于每个 Bin 的 Chunk 组成。内存中所有的空闲 Chunk 都由 Bins 管理。

图中每一列表示一个 Bin，列首方格中的数字表示 Bin 的size。Bin size 的大小都是 256 的 2^n 的倍。每个 Bin 下面挂载了一系列的空闲 Chunk，每个 Chunk 的 Chunk size 都大于等于所属的 Bin 的 Bin size，按照 Chunk size 的升序挂载成单链表。

以下内容部分来自：Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理

当申请新的内存的时候，如何更快高效的查找匹配的空闲 Chunk，这是非常重要的。TensorFlow 基于 Chunk 构建了一个全局的 Bin，每个 Bin 里管理的是内存大小在一定范围的 Chunk（内存大小范围 (2^bin_num)*256 到 (2^(bin_num+1))*256-1的，bin_num 代表的是 Bin 的序列号）。每个 Bin 里会保存着一个空闲的 Chunk 的 Set。

以下内容部分来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

如果我们想查询某一块符合条件的空闲 Chunk 并取出，那么只能对双向链表做遍历，显然这个效率不是很高。为了加速查询某块 Chunk 的速度，可以在创建 Chunk 链表时按一定顺序排列，并将整个有序链表在逻辑上切分成多个段，为每个段记录所包含的 Chunk 的范围，这种结构就是 Bin，它相当于一种索引。因此，Bin 结构是为了方便 Chunk 的查询而出现的。

在 BFC Allocator 中，每个段中 Chunk 的顺序是按照size和基地址升序排序的，每个 Bin 都设有自己的bin_size，该bin_size表示该段包含的最小 Chunk 的size。这样一来，用户端就可以根据所需要申请的 Memory 大小直接找到对应的 Bin，然后在该 Bin 中遍历寻找适合的 Chunk。为了能够根据bin_size直接定位到 Bin，规定bin_size与bin_num的大小关系为：bin_size=256 * 2^bin_num。用户在申请 Memory 时，会将实际大小映射到最适合的bin_size上，然后再根据bin_size与bin_num的关系找到对应的 Bin，进而在该段中遍历搜索。

Bin 中 Chunk 的是通过 Set 组织的，为了能在 Set 中体现双向链表的逻辑，只需要让 Chunk 在 Set 中按照规则升序排列，并修正前驱后继指针即可。

以下内容部分来自：极简入门TensorFlow 内存管理

BFCAllocator 下的两个比较重要的数据结构，Chunk 和 Bin，两者之间的关系如下图，看起来像一个个糖葫芦，第一个 bin size 为 256<<1，第二个为 256<<2, 一次类推，TF 内有 21 个 bin，最后 bin size 为 256 << 21 为 512MB，每一个 bin 下面会接下若干个大于 bin size 的 Chunk，整个内存空间由以下的结构来组织，当分配内存大小指定时，系统会遍历 bin，找到能够第一次满足 Chunk > bin_size，每一个 bin 下的 Chunk 是有序的（Bin 下的 ChunkComparator）

Bin的代码实现：

  // A Bin is a collection of similar-sized free chunks.// Allocated chunks are never in a Bin.struct Bin {// All chunks in this bin have >= bin_size memory.size_t bin_size = 0;class ChunkComparator {public:explicit ChunkComparator(BFCAllocator* allocator): allocator_(allocator) {}// Sort first by size and then use pointer address as a tie breaker.bool operator()(const ChunkHandle ha, const ChunkHandle hb) constABSL_NO_THREAD_SAFETY_ANALYSIS {const Chunk* a = allocator_->ChunkFromHandle(ha);const Chunk* b = allocator_->ChunkFromHandle(hb);if (a->size != b->size) {return a->size < b->size;}return a->ptr < b->ptr;}private:BFCAllocator* allocator_;  // The parent allocator};typedef std::set<ChunkHandle, ChunkComparator> FreeChunkSet;// List of free chunks within the bin, sorted by chunk size.// Chunk * not owned.FreeChunkSet free_chunks;Bin(BFCAllocator* allocator, size_t bs): bin_size(bs), free_chunks(ChunkComparator(allocator)) {}};

3.3 辅助类

以下内容部分来自：Nvidia GPU显存池-BFC

AllocationRegion 与 RegionManager 两个辅助类的主要功能是记录每次分配给用户的显存指针和对应 Chunk 的映射关系，方便后续显存回收。

在本文，我们把重点放在 TensorFlow BFC 算法的核心思想，不展开讨论辅助类 AllocationRegion 和 RegionManager，感兴趣可以学习：

AllocationRegion 的代码实现
RegionManager 的代码实现

4 BFC 算法核心

4.1 分配内存

以下内容部分来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

这是 BFCAllocator 的为用户分配 Chunk 的总体流程，该过程涉及到了几个比较重要的子过程。它们分别是

遍历搜索寻找最佳 Chunk 指针的FindChunkPtr过程，
当 Chunk 链表中不存在合适的 Chunk 以至于不得不向物理设备申请新存储空间的Extend过程，
以及分配 Chunk 时为缓解碎片问题而出现的SplitChunk过程。

总流程AllocateRawInternal的代码实现：

void* BFCAllocator::AllocateRawInternal(size_t unused_alignment,size_t num_bytes,bool dump_log_on_failure,uint64 freed_before) {if (num_bytes == 0) {VLOG(2) << "tried to allocate 0 bytes";return nullptr;}// 1) 调整大小// First, always allocate memory of at least kMinAllocationSize// bytes, and always allocate multiples of kMinAllocationSize bytes// so all memory addresses are nicely byte aligned.size_t rounded_bytes = RoundedBytes(num_bytes);// 2) 查找 Bin// The BFC allocator tries to find the best fit first.BinNum bin_num = BinNumForSize(rounded_bytes);absl::MutexLock l(&mutex_);if (!timestamped_chunks_.empty()) {// Merge timestamped chunks whose counts have become safe for general use.MergeTimestampedChunks(0);}// 3) 查找 Chunkvoid* ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);if (ptr != nullptr) {AddTraceMe("MemoryAllocation", ptr);return ptr;}// 4) 如果查找失败，那么扩展内存，然后再查找合适的空闲Chunk// Try to extendif (Extend(unused_alignment, rounded_bytes)) {ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);  // 4.8 再次查找 Chunkif (ptr != nullptr) {AddTraceMe("MemoryAllocation", ptr);return ptr;}}// 5) 第一次尝试合并，并再次查找 Chunkif ((freed_before == 0) && (!timestamped_chunks_.empty())) {// We're unable to satisfy an allocation request without a specific// timestamp requirement.  Rather than fail, try merging any held-out// timestamped chunks more aggressively until a free chunk of the necessary// size is formed.if (MergeTimestampedChunks(rounded_bytes)) {ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);if (ptr != nullptr) {AddTraceMe("MemoryAllocation", ptr);return ptr;}}}// 6) 第二次尝试合并，并再次查找 Chunk// Reaching this point means that no chunks can satisfy the request. Also,// the unallocated bytes cannot satisfy the request. Before giving up, let's// try deallocating free regions so that suballocator can combine them with// the unallocated bytes and form a larger region.if (DeallocateFreeRegions(rounded_bytes) &&Extend(unused_alignment, rounded_bytes)) {ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);if (ptr != nullptr) {AddTraceMe("MemoryAllocation", ptr);return ptr;}}// 7) 可用内存不足导致分配失败，报告 OOM// We searched all bins for an existing free chunk to use and// couldn't find one.  This means we must have run out of memory,// Dump the memory log for analysis.MaybeWriteMemoryMap();if (dump_log_on_failure) {LOG(WARNING)<< "Allocator (" << Name() << ") ran out of memory trying "<< "to allocate " << strings::HumanReadableNumBytes(num_bytes)<< " (rounded to " << rounded_bytes << ")" << "requested by op "<< tsl::profiler::ScopedMemoryDebugAnnotation::CurrentAnnotation().pending_op_name<< "\nIf the cause is memory fragmentation maybe the environment "<< "variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will "<< "improve the situation. \nCurrent allocation summary follows."<< "\nCurrent allocation summary follows.";DumpMemoryLog(rounded_bytes);LOG(WARNING) << RenderOccupancy();}return nullptr;
}

4.1.1 调整大小

RoundedBytes的代码实现：

size_t BFCAllocator::RoundedBytes(size_t bytes) {size_t rounded_bytes =(kMinAllocationSize *((bytes + kMinAllocationSize - 1) / kMinAllocationSize));DCHECK_EQ(size_t{0}, rounded_bytes % kMinAllocationSize);return rounded_bytes;
}

将每次分配的内存大小调整为kMinAllocationSize的N倍，这样所有内存地址都是很好地按字节对齐了。

4.1.2 查找Bin

BinNumForSize的代码实现：

  BinNum BinNumForSize(size_t bytes) {uint64 v = std::max<size_t>(bytes, 256) >> kMinAllocationBits;int b = std::min(kNumBins - 1, tsl::Log2Floor64(v));return b;}

Bin size 是 256 的 2^N 倍。std::max<size_t>(bytes, 256) >> kMinAllocationBits先将分配内存的字节数右移 8 位，然后把结果用在std::min(kNumBins - 1, tsl::Log2Floor64(v))，计算出的二进制有效位数即为 Bin 在 Bins 中的索引。

4.1.3 查找Chunk

FindChunkPtr的代码实现：

void* BFCAllocator::FindChunkPtr(BinNum bin_num, size_t rounded_bytes,size_t num_bytes, uint64 freed_before) {// First identify the first bin that could satisfy rounded_bytes.for (; bin_num < kNumBins; bin_num++) {// Start searching from the first bin for the smallest chunk that fits// rounded_bytes.Bin* b = BinFromIndex(bin_num);for (auto citer = b->free_chunks.begin(); citer != b->free_chunks.end();++citer) {// 1) 从之前得到的 Bin 索引开始，查找合适的空闲 Chunkconst BFCAllocator::ChunkHandle h = (*citer);BFCAllocator::Chunk* chunk = ChunkFromHandle(h);DCHECK(!chunk->in_use());if (freed_before > 0 && freed_before < chunk->freed_at_count) {continue;}if (chunk->size >= rounded_bytes) {// 2) 将找到的 Chunk 从 Bin 中移除// We found an existing chunk that fits us that wasn't in use, so remove// it from the free bin structure prior to using.RemoveFreeChunkIterFromBin(&b->free_chunks, citer);// 3) 拆分 Chunk：当以下两个条件之一满足时，SplitChunk过程将被触发。// 		1. 当 Chunk 的 size 是用户请求的 round size 两倍及以上时（用户请求的 size 会根据最小分配单元做 round 近似）// 		2. 当 Chunk 的 size 减去用户请求的 round size 后依然大于等于最大碎片限定时（128MB）// If we can break the size of the chunk into two reasonably large// pieces, do don't waste more than max_internal_fragmentation_bytes on// padding. If this threshold is not set by the user, then use 128MB as// the default.const int64_t max_internal_fragmentation_bytes =(opts_.fragmentation_fraction > 0.0)? opts_.fragmentation_fraction * memory_limit_: 128 << 20;if (chunk->size >= rounded_bytes * 2 ||static_cast<int64_t>(chunk->size) - rounded_bytes >=max_internal_fragmentation_bytes) {SplitChunk(h, rounded_bytes);chunk = ChunkFromHandle(h);  // Update chunk pointer in case it moved}// 4) 修改 Chunk 的请求大小、分配 ID（标记 Chunk 被占用）// The requested size of the returned chunk is what the user// has allocated.chunk->requested_size = num_bytes;// Assign a unique id and increment the id counter, marking the// chunk as being in use.chunk->allocation_id = next_allocation_id_++;// 5) 更新统计// Update stats.++stats_.num_allocs;stats_.bytes_in_use += chunk->size;if (stats_.bytes_in_use > stats_.peak_bytes_in_use) {VLOG(2) << "New Peak memory usage of " << stats_.bytes_in_use<< " bytes for " << Name();}stats_.peak_bytes_in_use =std::max(stats_.peak_bytes_in_use, stats_.bytes_in_use);stats_.largest_alloc_size =std::max<std::size_t>(stats_.largest_alloc_size, chunk->size);#ifdef TENSORFLOW_MEM_DEBUGif (ShouldRecordOpName()) {const auto& annotation =profiler::ScopedMemoryDebugAnnotation::CurrentAnnotation();if (annotation.pending_op_name != nullptr) {chunk->op_name = annotation.pending_op_name;} else {LOG(INFO) << "missing pending_op_name for " << Name()<< " reading addr "<< static_cast<const void*>(&annotation.pending_op_name)<< "\n"<< CurrentStackTrace();chunk->op_name = nullptr;}chunk->action_count = ++action_counter_;chunk->step_id = annotation.pending_step_id;int slot = chunk->action_count % MEM_DEBUG_SIZE_HISTORY_SIZE;size_history_[slot] = stats_.bytes_in_use;}
#endifVLOG(4) << "Returning: " << chunk->ptr;if (VLOG_IS_ON(4)) {LOG(INFO) << "A: " << RenderOccupancy();}// 6) 成功时返回找到的 Chunk 指针return chunk->ptr;}}}// 7) 失败时返回空指针return nullptr;
}

SplitChunk的代码实现：

void BFCAllocator::SplitChunk(BFCAllocator::ChunkHandle h, size_t num_bytes) {// Allocate the new chunk before we do any ChunkFromHandleChunkHandle h_new_chunk = AllocateChunk();Chunk* c = ChunkFromHandle(h);CHECK(!c->in_use() && (c->bin_num == kInvalidBinNum));// Create a new chunk starting num_bytes after cBFCAllocator::Chunk* new_chunk = ChunkFromHandle(h_new_chunk);new_chunk->ptr = static_cast<void*>(static_cast<char*>(c->ptr) + num_bytes);region_manager_.set_handle(new_chunk->ptr, h_new_chunk);// Set the new sizes of the chunks.new_chunk->size = c->size - num_bytes;c->size = num_bytes;// The new chunk is not in use.new_chunk->allocation_id = -1;// It inherits the freed time.new_chunk->freed_at_count = c->freed_at_count;// 1) 调整 Chunk 的前驱后继指针// Maintain the pointers.// c <-> c_neighbor becomes// c <-> new_chunk <-> c_neighborBFCAllocator::ChunkHandle h_neighbor = c->next;new_chunk->prev = h;new_chunk->next = h_neighbor;c->next = h_new_chunk;if (h_neighbor != kInvalidChunkHandle) {Chunk* c_neighbor = ChunkFromHandle(h_neighbor);c_neighbor->prev = h_new_chunk;}// 2) 根据 Free Chunk 的大小，将它插入到对应的 Bin 中// Add the newly free chunk to the free bin.InsertFreeChunkIntoBin(h_new_chunk);
}

4.1.4 扩展内存

Extend的代码实现：

bool BFCAllocator::Extend(size_t alignment, size_t rounded_bytes) {size_t available_bytes = memory_limit_ - *stats_.pool_bytes;// Rounds available_bytes down to the nearest multiple of kMinAllocationSize.available_bytes = (available_bytes / kMinAllocationSize) * kMinAllocationSize;// 1) 已占用的加上申请的内存大小，超过最大内存限制时，返回失败。// Do we have enough space to handle the client's request?// If not, fail immediately.if (rounded_bytes > available_bytes) {return false;}// 2) 循环将当前区域可分配的内存扩充 1 倍，直到大于等于申请的内存大小。// If curr_region_allocation_bytes_ is not enough to satisfy the// allocation, keep multiplying by a power of two until that is// sufficient.bool increased_allocation = false;while (rounded_bytes > curr_region_allocation_bytes_) {curr_region_allocation_bytes_ *= 2;increased_allocation = true;}// 3) 从内存池里分配内存// Try allocating.size_t bytes = std::min(curr_region_allocation_bytes_, available_bytes);size_t bytes_received;void* mem_addr = sub_allocator_->Alloc(alignment, bytes, &bytes_received);// 4) 如果失败，尝试分配申请内存大小的 90%。一直重复，直到分配成功或可用内存不足。if (mem_addr == nullptr) {static constexpr float kBackpedalFactor = 0.9;// Try allocating less memory.while (mem_addr == nullptr) {bytes = RoundedBytes(bytes * kBackpedalFactor);if (bytes < rounded_bytes) {return false;}mem_addr = sub_allocator_->Alloc(alignment, bytes, &bytes_received);}}if (!increased_allocation) {// Increase the region size of the next required allocation.curr_region_allocation_bytes_ *= 2;}VLOG(1) << "Extending allocation by "<< strings::HumanReadableNumBytes(bytes_received) << " bytes for "<< Name() << ".";*stats_.pool_bytes += bytes_received;*stats_.peak_pool_bytes =std::max(*stats_.pool_bytes, *stats_.peak_pool_bytes);VLOG(1) << "Total allocated bytes: "<< strings::HumanReadableNumBytes(*stats_.pool_bytes);VLOG(1) << "Allocated memory at " << mem_addr << " to "<< static_cast<void*>(static_cast<char*>(mem_addr) + bytes_received);// 5) 给分配的内存添加 AllocationRegionAllocationRegion* maybe_extended_region = nullptr;if (coalesce_regions_) {maybe_extended_region =region_manager_.AddOrExtendAllocationRegion(mem_addr, bytes_received);} else {region_manager_.AddAllocationRegion(mem_addr, bytes_received);}// 6) 创建一个空闲 Chunk// Create one large chunk for the whole memory space that will// be chunked later.ChunkHandle h = AllocateChunk();BFCAllocator::Chunk* c = ChunkFromHandle(h);c->ptr = mem_addr;c->size = bytes_received;c->allocation_id = -1;c->prev = kInvalidChunkHandle;c->next = kInvalidChunkHandle;c->freed_at_count = 0;region_manager_.set_handle(c->ptr, h);// If the region was extended, then there exists a previous chunk that should// be linked to the new chunk.if (maybe_extended_region != nullptr) {ChunkHandle prev =maybe_extended_region->get_handle(maybe_extended_region->ptr());BFCAllocator::Chunk* prev_chunk = ChunkFromHandle(prev);// Find the last recorded chunk in the extended region.while (prev_chunk->next != kInvalidChunkHandle) {prev = prev_chunk->next;prev_chunk = ChunkFromHandle(prev);}c->prev = prev;prev_chunk->next = h;}// 7) 将空闲 Chunk 插入 Bin 中// Maybe merge adjacent chunks and insert the chunk into the right bin.InsertFreeChunkIntoBin(TryToCoalesce(h, /*ignore_freed_at=*/false));return true;
}

4.2 回收内存

以下内容部分来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

因为在回收时只知道存储空间首地址指针，并不知道其对应的 Chunk，所以需要先借助region_manager等辅助工具获取其所对应的 Chunk 指针，然后考虑其前驱后继节点是否可以合并。下面展示了整体流程。

总流程DeallocateRawInternal的代码实现：

void BFCAllocator::DeallocateRawInternal(void* ptr) {if (ptr == nullptr) {VLOG(2) << "tried to deallocate nullptr";return;}absl::MutexLock l(&mutex_);// 1) 从 RegionManager 的指针到 ChunkHandle 的映射关系中得到 ChunkHandle// Find the chunk from the ptr.BFCAllocator::ChunkHandle h = region_manager_.get_handle(ptr);CHECK(h != kInvalidChunkHandle);// Record chunk information before it's freed.Chunk* chunk = ChunkFromHandle(h);void* chunk_ptr = chunk->ptr;int64_t req_bytes = chunk->requested_size;int64_t alloc_bytes = chunk->size;// 2) 将 Chunk 标记为空闲，然后把总占用的内存量减去 Chunk 的大小MarkFree(h);// Consider coalescing it.if (timing_counter_) {InsertFreeChunkIntoBin(h);timestamped_chunks_.push_back(h);} else {// 3) 合并 Chunk// 4) 将合并后的空闲 Chunk 插入 BinInsertFreeChunkIntoBin(TryToCoalesce(h, false));}// TraceMe needs to be added after MarkFree and InsertFreeChunkIntoBin for// correct aggregation stats (bytes_in_use, fragmentation).AddTraceMe("MemoryDeallocation", chunk_ptr, req_bytes, alloc_bytes);if (VLOG_IS_ON(4)) {LOG(INFO) << "F: " << RenderOccupancy();}
}

4.2.1 获取ChunkHandle

4.2.2 更新状态

MarkFree的代码实现：

void BFCAllocator::MarkFree(BFCAllocator::ChunkHandle h) {Chunk* c = ChunkFromHandle(h);CHECK(c->in_use() && (c->bin_num == kInvalidBinNum));// 1) 将 Chunk 标记为空闲// Mark the chunk as no longer in use.c->allocation_id = -1;// Optionally record the free time.if (timing_counter_) {c->freed_at_count = timing_counter_->next();}// 2) 更新状态，把总占用的内存量减去 Chunk 的大小// Updates the stats.stats_.bytes_in_use -= c->size;#ifdef TENSORFLOW_MEM_DEBUGif (ShouldRecordOpName()) {c->action_count = ++action_counter_;int slot = c->action_count % MEM_DEBUG_SIZE_HISTORY_SIZE;size_history_[slot] = stats_.bytes_in_use;}
#endif
}

4.2.3 合并Chunk

TryToCoalesce的代码实现：

BFCAllocator::ChunkHandle BFCAllocator::TryToCoalesce(ChunkHandle h,bool ignore_freed_at) {Chunk* c = ChunkFromHandle(h);if ((!ignore_freed_at) && c->freed_at_count > 0) return h;ChunkHandle coalesced_chunk = h;// 1) 合并直接后继// If the next chunk is free, merge it into c and delete it.if (c->next != kInvalidChunkHandle && !ChunkFromHandle(c->next)->in_use()) {Chunk* n = ChunkFromHandle(c->next);if ((n->freed_at_count == 0) || ignore_freed_at) {VLOG(4) << "Merging c->next " << n->ptr << " with c " << c->ptr;// 1.1) 将直接后继从 Bin 中移除RemoveFreeChunkFromBin(c->next);// 1.2) 合并Merge(h, c->next);}}// 2) 合并直接前趋// If the previous chunk is free, merge c into it and delete c.if (c->prev != kInvalidChunkHandle && !ChunkFromHandle(c->prev)->in_use()) {Chunk* n = ChunkFromHandle(c->prev);if ((n->freed_at_count == 0) || ignore_freed_at) {VLOG(4) << "Merging c " << c->ptr << " into c->prev " << n->ptr;coalesced_chunk = c->prev;// 2.1) 将直接前趋从 Bin 中移除RemoveFreeChunkFromBin(c->prev);// 2.2) 合并Merge(c->prev, h);}}return coalesced_chunk;
}

Merge的代码实现：

// Merges h1 and h2 when Chunk(h1)->next is h2 and Chunk(h2)->prev is c1.
// We merge Chunk(h2) into Chunk(h1).
void BFCAllocator::Merge(BFCAllocator::ChunkHandle h1,BFCAllocator::ChunkHandle h2) {Chunk* c1 = ChunkFromHandle(h1);Chunk* c2 = ChunkFromHandle(h2);// We can only merge chunks that are not in use.CHECK(!c1->in_use() && !c2->in_use());// c1's prev doesn't change, still points to the same ptr, and is// still not in use.// 1) 修改 Chunk 的指向关系// Fix up neighbor pointers//// c1 <-> c2 <-> c3 should become// c1 <-> c3BFCAllocator::ChunkHandle h3 = c2->next;c1->next = h3;CHECK(c2->prev == h1);if (h3 != kInvalidChunkHandle) {BFCAllocator::Chunk* c3 = ChunkFromHandle(h3);c3->prev = h1;}// 2) 更新 Chunk 大小// Set the new sizec1->size += c2->size;// Pick latest free time.c1->freed_at_count = std::max(c1->freed_at_count, c2->freed_at_count);// 3) 删除被合并的 ChunkDeleteChunk(h2);
}

5 总结

以下内容来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

本文总结了 TensorFlow 中存储管理器——BFC Allocator。它的设计思路来自于经典来的 dlmalloc 分配算法，是 Best fit coalecing 的简单实现版本。BFC Allocator 是为了应对 TensorFlow 中频繁分配释放存储空间需求的场景而出现的解决方案，通过事先将存储空间从物理设备上开辟好，并将这些空闲存储空间封装成 Chunk，组织成有序双向链表，然后利用 Bin 这一种索引结构为 Chunk 的查询做加速，最终完成了高效的分配算法。在实际分配时，可能会遇到 Chunk 链表中不存在符合要求的空闲 Chunk 情况，这时候就可能需要向物理设备中再次开辟新的存储空间，这个过程被视为对 Chunk 链表的扩展，对应的过程是Extend。因为是按 Chunk 进行分配，势必可能造成存储碎片，为了解决碎片问题，BFC Allocator 设计了SplitChunk和Merge函数。

参考资料

关于 XLA

openxla/xla: A machine learning compiler for GPUs, CPUs, and ML accelerators
OpenXLA (NVIDIA) - AI技术栈解析及应用- 作者：张真瑜 | 山东大学智能创新研究院
XLA：优化机器学习编译器 | TensorFlow
XLA优化原理简介-云社区-华为云
如何看待OpenXLA这个开源项目？ - TanyoKwok 郭天佑的回答 - 知乎

关于 TensorFlow BFC

TensorFlow 源码分析之内存管理BFC算法——DeepReve
TensorFlow内存管理bfc算法
Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理（文中还对 Region 进行了介绍）
TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack（文中有个分析出的结论：存储区分配的时机是在一个 Tensor 对象被创建时立即发生的；文中还对辅助类 AllocationRegion 与 RegionManager 进行了介绍）
极简入门TensorFlow 内存管理
Tensorflow源码剖析：Allocator（基础篇）（文中介绍了 Allocator 类）
- TensorFlow源码剖析：Allocator（提高篇）（文中介绍了 AllocatorStats 类，其余大部分内容和TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack相似）
- TensorFlow源码剖析：Allocator（进阶篇）（文中介绍了 AllocatorRetry 类）
- TensorFlow源码剖析：Allocator（应用篇）（文中介绍了 TensorFlow 下 GPU 相关配置项）
Nvidia GPU显存池-BFC（文中介绍了 Linux 内核内存池、tcmalloc 和 dlmalloc 等内存分配器、由allow_growth参数决定的两种分配模式）