Introduction to memory virtualization
On the host, a program's addresses are translated as HVA (host virtual address) --MMU--> HPA (host physical address).
A virtual machine running on that host faces a two-stage translation:
GVA (guest virtual address) --MMU--> GPA (guest physical address)
GPA (guest physical address) --VMM--> HPA (host physical address)
Guest memory translation used to rely on shadow page tables; today it mainly relies on EPT.
- Inside the guest, GVA --MMU--> GPA proceeds exactly as on bare metal: the guest OS cannot tell it is virtualized, so the MMU handles the translation as usual.
- The CPU, however, can tell it is running a guest; in that mode it automatically performs an extra walk of the EPT tables to translate GPA (guest physical address) to HPA (host physical address).
- The EPT tables are maintained by the VMM, which builds the GPA-to-HPA mappings and registers them as EPT entries; KVM installs an entry when an EPT walk fails and causes an EPT-violation VM exit.
Whether the current system supports EPT can be checked with:
cat /proc/cpuinfo | grep -E 'ept|pdpe1gb'
Typically EPT uses IA-32e paging: a 48-bit physical address walked through four levels of page tables, each level indexed by 9 address bits, with the final 12 bits giving the offset inside a 4 KB page.
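To make the split concrete, here is a minimal standalone C sketch (not QEMU/KVM code; the sample address is arbitrary) that decomposes a 48-bit address into the four 9-bit table indexes and the 12-bit page offset:

#include <stdint.h>
#include <stdio.h>

/* Split a 48-bit (guest-)physical address into IA-32e paging components:
 * four 9-bit table indexes (PML4, PDPT, PD, PT) and a 12-bit page offset. */
int main(void)
{
    uint64_t gpa = 0x0000123456789abcULL;

    unsigned pml4 = (gpa >> 39) & 0x1ff; /* bits 47..39 */
    unsigned pdpt = (gpa >> 30) & 0x1ff; /* bits 38..30 */
    unsigned pd   = (gpa >> 21) & 0x1ff; /* bits 29..21 */
    unsigned pt   = (gpa >> 12) & 0x1ff; /* bits 20..12 */
    unsigned off  = gpa & 0xfff;         /* bits 11..0  */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}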
Start QEMU with a QMP server listening on port 4444:
/home/xiyanxiyan10/project/qemu/build/qemu-system-x86_64 -m 10240 -enable-kvm -cpu host -s -kernel /home/xiyanxiyan10/project/linux-source-6.2.0/arch/x86/boot/bzImage -hda ./rootfs.img -nographic -append "root=/dev/sda rw console=ttyS0 nokaslr" -qmp tcp:127.0.0.1:4444,server,nowait
Use the qmp-shell script shipped with QEMU to get an interactive HMP shell and inspect the guest's memory layout:
xiyanxiyan10@xiyanxiyan10:~/project/qemu$ scripts/qmp/qmp-shell -H localhost:4444
Welcome to the HMP shell!
Connected to QEMU 8.2.50
(QEMU) info mtree
address-space: VGA
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
...
QEMU's memory data organization
MemoryRegion
As the name suggests, a MemoryRegion describes one region of memory. Its two key members are addr and size: the start address and length of the region.
From the memory-model point of view, the important property of MemoryRegion is that regions form a tree.
A simple diagram:
struct MemoryRegion
+------------------------+
|name                    |
| (const char *)         |
+------------------------+
|addr                    |
| (hwaddr)               |
|size                    |
| (Int128)               |
+------------------------+
|subregions              |
| QTAILQ_HEAD()          |
+------------------------+
            |
    +-------+------------------+
    |                          |
struct MemoryRegion        struct MemoryRegion
+------------------------+ +------------------------+
|name                    | |name                    |
| (const char *)         | | (const char *)         |
+------------------------+ +------------------------+
|addr                    | |addr                    |
| (hwaddr)               | | (hwaddr)               |
|size                    | |size                    |
| (Int128)               | | (Int128)               |
+------------------------+ +------------------------+
|subregions              | |subregions              |
| QTAILQ_HEAD()          | | QTAILQ_HEAD()          |
+------------------------+ +------------------------+
Now let's see what an actual MemoryRegion tree looks like.
Each address-space points at a root MemoryRegion, and under that root hangs a tree of MemoryRegion nodes.
(QEMU) info mtree
address-space: VGA
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
address-space: piix3-ide
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
    0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff
address-space: e1000
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
address-space: cpu-memory-0
address-space: memory
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000000000-00000000bfffffff (prio 0, ram): alias ram-below-4g @pc.ram 0000000000000000-00000000bfffffff
    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000a0000-00000000000bffff (prio 1, i/o): vga-lowmem
      00000000000c0000-00000000000dffff (prio 1, rom): pc.rom
      00000000000e0000-00000000000fffff (prio 1, rom): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fd000000-00000000fdffffff (prio 1, ram): vga.vram
      00000000febc0000-00000000febdffff (prio 1, i/o): e1000-mmio
      00000000febf0000-00000000febf0fff (prio 1, i/o): vga.mmio
        00000000febf0000-00000000febf017f (prio 0, i/o): edid
        00000000febf0400-00000000febf041f (prio 0, i/o): vga ioports remapped
        00000000febf0500-00000000febf0515 (prio 0, i/o): bochs dispi interface
        00000000febf0600-00000000febf0607 (prio 0, i/o): qemu extended regs
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
    00000000000a0000-00000000000bffff (prio 1, i/o): alias smram-region @pci 00000000000a0000-00000000000bffff
    00000000000c0000-00000000000c3fff (prio 1, ram): alias pam-rom @pc.ram 00000000000c0000-00000000000c3fff
    00000000000c4000-00000000000c7fff (prio 1, ram): alias pam-rom @pc.ram 00000000000c4000-00000000000c7fff
    00000000000c8000-00000000000cbfff (prio 1, ram): alias pam-rom @pc.ram 00000000000c8000-00000000000cbfff
    00000000000cb000-00000000000cdfff (prio 1000, ram): alias kvmvapic-rom @pc.ram 00000000000cb000-00000000000cdfff
    00000000000cc000-00000000000cffff (prio 1, ram): alias pam-rom @pc.ram 00000000000cc000-00000000000cffff
    00000000000d0000-00000000000d3fff (prio 1, ram): alias pam-rom @pc.ram 00000000000d0000-00000000000d3fff
    00000000000d4000-00000000000d7fff (prio 1, ram): alias pam-rom @pc.ram 00000000000d4000-00000000000d7fff
    00000000000d8000-00000000000dbfff (prio 1, ram): alias pam-rom @pc.ram 00000000000d8000-00000000000dbfff
    00000000000dc000-00000000000dffff (prio 1, ram): alias pam-rom @pc.ram 00000000000dc000-00000000000dffff
    00000000000e0000-00000000000e3fff (prio 1, ram): alias pam-rom @pc.ram 00000000000e0000-00000000000e3fff
    00000000000e4000-00000000000e7fff (prio 1, ram): alias pam-ram @pc.ram 00000000000e4000-00000000000e7fff
    00000000000e8000-00000000000ebfff (prio 1, ram): alias pam-ram @pc.ram 00000000000e8000-00000000000ebfff
    00000000000ec000-00000000000effff (prio 1, ram): alias pam-ram @pc.ram 00000000000ec000-00000000000effff
    00000000000f0000-00000000000fffff (prio 1, ram): alias pam-rom @pc.ram 00000000000f0000-00000000000fffff
    00000000fec00000-00000000fec00fff (prio 0, i/o): kvm-ioapic
    00000000fed00000-00000000fed003ff (prio 0, i/o): hpet
    00000000fee00000-00000000feefffff (prio 4096, i/o): kvm-apic-msi
    0000000100000000-00000002bfffffff (prio 0, ram): alias ram-above-4g @pc.ram 00000000c0000000-000000027fffffff
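The nesting that info mtree prints is built with the MemoryRegion API. A hedged sketch of how a board or device model might assemble such a subtree (the region names, sizes, and the GPA are invented for illustration; the calls themselves are QEMU's memory API):

/* Sketch: build a small MemoryRegion subtree the way a board/device would.
 * Assumes QEMU's memory API headers; names and addresses are made up. */
MemoryRegion *sysmem = get_system_memory();
MemoryRegion *container = g_new0(MemoryRegion, 1);
MemoryRegion *ram = g_new0(MemoryRegion, 1);

memory_region_init(container, NULL, "demo-container", 0x20000);
memory_region_init_ram(ram, NULL, "demo-ram", 0x10000, &error_fatal);

/* hang "demo-ram" at offset 0 inside the container ... */
memory_region_add_subregion(container, 0x0, ram);
/* ... and map the container into the system address space at GPA
 * 0xfe000000; "info mtree" would now show it nested under "system" */
memory_region_add_subregion(sysmem, 0xfe000000, container);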
FlatView/FlatRange
FlatView is a flattened view. Of what? Of the MemoryRegion tree, of course. As we just saw, MemoryRegions form a grand and stately tree, but when the mapping is actually consumed it is far more convenient flattened into a linear view.
As before, let's look at the shape of the data structure.
FlatView (An array of FlatRange)
+----------------------+
|nr |
|nr_allocated |
| (unsigned) | FlatRange FlatRange
+----------------------+
|ranges | ------> +---------------------+---------------------+
| (FlatRange *) | |offset_in_region |offset_in_region |
+----------------------+          |                     |                     |
                                  +---------------------+---------------------+
                                  |addr(AddrRange)      |addr(AddrRange)      |
                                  |    +----------------|    +----------------+
                                  |    |start (Int128)  |    |start (Int128)  |
                                  |    |size (Int128)   |    |size (Int128)   |
                                  +----+----------------+----+----------------+
                                  |mr                   |mr                   |
                                  | (MemoryRegion *)    | (MemoryRegion *)    |
                                  +---------------------+---------------------+
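Because a FlatView is essentially a sorted array of non-overlapping FlatRanges, looking up an address is a binary search. A self-contained sketch of the idea with simplified stand-in types (this is not QEMU's actual flatview_lookup):

#include <stdint.h>
#include <stddef.h>

/* Simplified stand-ins for QEMU's FlatRange/FlatView, illustration only. */
typedef struct { uint64_t start, size; void *mr; } FlatRangeDemo;
typedef struct { FlatRangeDemo *ranges; unsigned nr; } FlatViewDemo;

/* Binary search for the FlatRange containing addr; NULL if unmapped. */
static FlatRangeDemo *flat_lookup(FlatViewDemo *fv, uint64_t addr)
{
    unsigned lo = 0, hi = fv->nr;
    while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;
        FlatRangeDemo *fr = &fv->ranges[mid];
        if (addr < fr->start) {
            hi = mid;
        } else if (addr >= fr->start + fr->size) {
            lo = mid + 1;
        } else {
            return fr;        /* fr->start <= addr < fr->start + fr->size */
        }
    }
    return NULL;
}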
AddressSpace
Next, let's see how these pieces relate to one another.
AddressSpace
+-------------------------+
|name |
| (char *) | FlatView (An array of FlatRange)
+-------------------------+ +----------------------+
|current_map | -------->|nr |
| (FlatView *) | |nr_allocated |
+-------------------------+ | (unsigned) | FlatRange FlatRange
| | +----------------------+
| | |ranges | ------> +---------------------+---------------------+
| | | (FlatRange *) | |offset_in_region |offset_in_region |
| | +----------------------+ | | |
| | +---------------------+---------------------+
| | |addr(AddrRange) |addr(AddrRange) |
| | | +----------------| +----------------+
| | | |start (Int128) | |start (Int128) |
| | | |size (Int128) | |size (Int128) |
| | +----+----------------+----+----------------+
| | |mr |mr |
| | | (MemoryRegion *) | (MemoryRegion *) |
| | +---------------------+---------------------+
| |
| |
| |
| | MemoryRegion(system_memory/system_io)
+-------------------------+ +----------------------+
|root | | | root of a MemoryRegion
| (MemoryRegion *) | -------->| | tree
+-------------------------+ +----------------------+
RAMBlock
The RAMBlock structure describes the host memory that backs the guest's RAM; QEMU strings RAMBlocks together on a linked list.
ram_list (RAMList)
+------------------------------+
|dirty_memory[]                |
| (unsigned long *)            |
+------------------------------+
|blocks                        |
| QLIST_HEAD                   |
+------------------------------+
   |
   |    RAMBlock                          RAMBlock
   |    +---------------------------+     +---------------------------+
   +--> |next                       | --> |next                       |
        | QLIST_ENTRY(RAMBlock)     |     | QLIST_ENTRY(RAMBlock)     |
        +---------------------------+     +---------------------------+
        |offset                     |     |offset                     |
        |used_length                |     |used_length                |
        |max_length                 |     |max_length                 |
        | (ram_addr_t)              |     | (ram_addr_t)              |
        +---------------------------+     +---------------------------+
The GPA -> HVA mapping is carried by the pair MemoryRegion->addr (the GPA) and RAMBlock->host (the HVA).
The link between a MemoryRegion and its RAMBlock therefore establishes the mapping from guest memory to host virtual addresses.
RAMBlock                                                RAMBlock
+---------------------------+                           +---------------------------+
|next                       | ------------------------> |next                       |
| QLIST_ENTRY(RAMBlock)     |                           | QLIST_ENTRY(RAMBlock)     |
+---------------------------+                           +---------------------------+
|offset                     |                           |offset                     |
|used_length                |                           |used_length                |
|max_length                 |                           |max_length                 |
| (ram_addr_t)              |                           | (ram_addr_t)              |
+---------------------------+                           +---------------------------+
|host                       | virtual address of a ram  |host                       |
| (uint8_t *)               | in host (mmap)            | (uint8_t *)               |
+---------------------------+                           +---------------------------+
|mr                         |                           |mr                         |
| (struct MemoryRegion *)   |                           | (struct MemoryRegion *)   |
+---------------------------+                           +---------------------------+
    |                                                       |
    |  struct MemoryRegion                                  |  struct MemoryRegion
    +->+------------------------+                           +->+------------------------+
       |name                    |                              |name                    |
       | (const char *)         |                              | (const char *)         |
       +------------------------+                              +------------------------+
       |addr                    | physical address in guest    |addr                    |
       | (hwaddr)               | (offset in RAMBlock)         | (hwaddr)               |
       |size                    |                              |size                    |
       | (Int128)               |                              | (Int128)               |
       +------------------------+                              +------------------------+
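Once the right RAM MemoryRegion has been located, translating a GPA to an HVA is plain pointer arithmetic. A sketch under that assumption (field names follow the diagrams above; QEMU's real lookup path has more indirection and checks):

/* Sketch: GPA -> HVA for a RAM MemoryRegion "mr" whose
 * [mr->addr, mr->addr + size) range contains the GPA.
 * The region lookup itself and error handling are omitted. */
static uint8_t *gpa_to_hva(MemoryRegion *mr, hwaddr gpa)
{
    hwaddr offset_in_region = gpa - mr->addr;      /* offset inside the MR */
    return mr->ram_block->host + offset_in_region; /* host virtual address */
}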
MemoryListener
For EPT to work, the guest's memory layout must be communicated to KVM, and every later change must be pushed to KVM as well. This is done through MemoryListener, outlined below.
MemoryListener
+---------------------------+
|begin                      |
|commit                     |
+---------------------------+
|region_add                 |
|region_del                 |
+---------------------------+
|eventfd_add                |
|eventfd_del                |
+---------------------------+
|log_start                  |
|log_stop                   |
+---------------------------+
Each AddressSpace carries a list of MemoryListeners interested in its changes. Whenever the AddressSpace is updated, every listener hanging off it is invoked; among them are the listeners that forward the guest memory layout to KVM (kvm_region_add, kvm_region_del), and further listeners can be added for anything else that cares about memory changes.
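A hedged sketch of registering such a listener (it assumes QEMU's memory API and the global address_space_memory; the "demo" callback bodies are placeholders for real bookkeeping):

/* Sketch: a custom MemoryListener on the system AddressSpace. */
static void demo_region_add(MemoryListener *l, MemoryRegionSection *section)
{
    printf("region add: GPA 0x%" PRIx64 " size 0x%" PRIx64 "\n",
           (uint64_t)section->offset_within_address_space,
           int128_get64(section->size));
}

static void demo_region_del(MemoryListener *l, MemoryRegionSection *section)
{
    printf("region del: GPA 0x%" PRIx64 "\n",
           (uint64_t)section->offset_within_address_space);
}

static MemoryListener demo_listener = {
    .name       = "demo",
    .region_add = demo_region_add,
    .region_del = demo_region_del,
};

/* during setup, e.g. accelerator or device init: */
memory_listener_register(&demo_listener, &address_space_memory);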
Data structures
/**
 * struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
 */
struct AddressSpace {
    /* private: */
    struct rcu_head rcu;
    char *name;
    // root of the MemoryRegion tree
    MemoryRegion *root;

    // the MemoryRegion tree flattened into a linear view
    /* Accessed via RCU. */
    struct FlatView *current_map;

    int ioeventfd_nb;
    int ioeventfd_notifiers;
    struct MemoryRegionIoeventfd *ioeventfds;
    // list of callbacks invoked when the memory map changes
    QTAILQ_HEAD(, MemoryListener) listeners;
    // AddressSpaces are chained together on a linked list
    QTAILQ_ENTRY(AddressSpace) address_spaces_link;
};
Depending on which fields are filled in, the common kinds of MemoryRegion are:
- RAM: a chunk of host virtual memory actually allocated to the guest as physical memory.
- MMIO: a range of guest memory with no backing host virtual memory; accesses to it are trapped and dispatched to read/write callbacks, which is how device emulation works.
- ROM: like RAM, but read-only; writes are not possible.
- ROM device: reads behave like RAM (direct access), writes behave like MMIO (the write callback is invoked).
- container: holds several MemoryRegions, each at its own offset within the container. Containers mostly merge multiple regions into one, e.g. a PCI MemoryRegion containing both RAM and MMIO. Regions inside a container normally do not overlap, though there are exceptions.
- alias: a window into part of another region; it lets one region be carved into several discontiguous pieces.
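For the MMIO kind, the device supplies MemoryRegionOps so that guest accesses trap into callbacks. A minimal sketch (the "demo" register semantics and the device state s are invented for illustration; the API calls are QEMU's):

static uint64_t demo_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    /* addr is the offset within the region; return a dummy value */
    return 0xdeadbeef;
}

static void demo_mmio_write(void *opaque, hwaddr addr,
                            uint64_t data, unsigned size)
{
    /* react to the guest's store, e.g. latch a device register */
}

static const MemoryRegionOps demo_mmio_ops = {
    .read = demo_mmio_read,
    .write = demo_mmio_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

/* during device realize: create a 4 KiB MMIO window; "s" stands for a
 * hypothetical device state struct with an embedded MemoryRegion */
memory_region_init_io(&s->mmio, OBJECT(s), &demo_mmio_ops, s,
                      "demo-mmio", 0x1000);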
/**
 * MemoryRegion:
 *
 * A struct representing a memory region.
 */
struct MemoryRegion {
    Object parent_obj;

    /* private: */

    /* The following fields should fit in a cache line */
    bool romd_mode;
    bool ram;
    bool subpage;
    bool readonly; /* For RAM regions */
    bool nonvolatile;
    bool rom_device;
    bool flush_coalesced_mmio;
    bool unmergeable;
    uint8_t dirty_log_mask;
    bool is_iommu;
    // ram_block is the actually allocated backing memory, i.e. host
    // virtual memory, stored as a RAMBlock
    RAMBlock *ram_block;
    Object *owner;
    /* owner as TYPE_DEVICE. Used for re-entrancy checks in MR access hotpath */
    DeviceState *dev;

    // ops is a set of callbacks invoked when the MemoryRegion is
    // accessed, e.g. MMIO read/write requests
    const MemoryRegionOps *ops;
    void *opaque;
    MemoryRegion *container;
    int mapped_via_alias; /* Mapped via an alias, container might be NULL */
    Int128 size;
    // addr is the guest physical address this MemoryRegion starts at
    hwaddr addr;
    void (*destructor)(MemoryRegion *mr);
    uint64_t align;
    bool terminates;
    bool ram_device;
    bool enabled;
    bool warning_printed; /* For reservations */
    uint8_t vga_logging_count;
    MemoryRegion *alias;
    hwaddr alias_offset;
    // priority decides which region wins when regions overlap
    int32_t priority;
    // subregions links the child MemoryRegions of this region
    QTAILQ_HEAD(, MemoryRegion) subregions;
    // subregions_link chains siblings under the same parent
    QTAILQ_ENTRY(MemoryRegion) subregions_link;
    QTAILQ_HEAD(, CoalescedMemoryRange) coalesced;
    const char *name;
    unsigned ioeventfd_nb;
    MemoryRegionIoeventfd *ioeventfds;
    RamDiscardManager *rdm; /* Only for RAM */
    /* For devices designed to perform re-entrant IO into their own IO MRs */
    bool disable_reentrancy_guard;
};

/*
 * Memory region callbacks
 */
struct MemoryRegionOps {
    /* Read from the memory region. @addr is relative to @mr; @size is
     * in bytes. */
    uint64_t (*read)(void *opaque, hwaddr addr, unsigned size);

    /* Write to the memory region. @addr is relative to @mr; @size is
     * in bytes. */
    void (*write)(void *opaque, hwaddr addr, uint64_t data, unsigned size);

    MemTxResult (*read_with_attrs)(void *opaque, hwaddr addr,
                                   uint64_t *data, unsigned size,
                                   MemTxAttrs attrs);
    MemTxResult (*write_with_attrs)(void *opaque, hwaddr addr,
                                    uint64_t data, unsigned size,
                                    MemTxAttrs attrs);
    //...
};
How QEMU allocates guest memory
The guest's "physical" memory is backed by mapped host virtual memory.
Allocating guest memory in QEMU is therefore just the host allocating virtual memory:
the space is requested when the RAMBlock structure is initialized.
pc_memory_init()
  memory_region_allocate_system_memory()
    allocate_system_memory_nonnuma()
      memory_region_init_ram_nomigrate()
        memory_region_init_ram_shared_nomigrate()
        {
            mr->ram = true;
            mr->destructor = memory_region_destructor_ram;
            // allocate the RAMBlock
            mr->ram_block = qemu_ram_alloc(size, share, mr, errp);
        }
RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                         MemoryRegion *mr, Error **errp)
{
    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
    return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
}

static
RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                  void (*resized)(const char *,
                                                  uint64_t length,
                                                  void *host),
                                  void *host, uint32_t ram_flags,
                                  MemoryRegion *mr, Error **errp)
{
    // ...
    size = HOST_PAGE_ALIGN(size);
    max_size = HOST_PAGE_ALIGN(max_size);
    new_block = g_malloc0(sizeof(*new_block));
    new_block->mr = mr;
    new_block->resized = resized;
    //...
    ram_block_add(new_block, &local_err);
    //...
    return new_block;
}

static void ram_block_add(RAMBlock *new_block, Error **errp)
{
    //...
    if (!new_block->host) {
        if (xen_enabled()) {
            //...
        } else {
            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                  &new_block->mr->align,
                                                  shared, noreserve);
            //...
        }
    }
    //...
}

// how the space is allocated on POSIX hosts
/* alloc shared memory pages */
void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared,
                          bool noreserve)
{
    const uint32_t qemu_map_flags = (shared ? QEMU_MAP_SHARED : 0) |
                                    (noreserve ? QEMU_MAP_NORESERVE : 0);
    size_t align = QEMU_VMALLOC_ALIGN;
    void *ptr = qemu_ram_mmap(-1, size, align, qemu_map_flags, 0);
    //...
    return ptr;
}

void *qemu_ram_mmap(int fd, size_t size, size_t align,
                    uint32_t qemu_map_flags, off_t map_offset)
{
    //...
    ptr = mmap_activate(guardptr + offset, size, fd, qemu_map_flags,
                        map_offset);
    //...
    return ptr;
}

static void *mmap_activate(void *ptr, size_t size, int fd,
                           uint32_t qemu_map_flags, off_t map_offset)
{
    // ...
    // ultimately the pages come from mmap()
    activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd,
                         map_offset);
    // ...
    return activated_ptr;
}
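Stripped of QEMU's wrappers, the allocation above boils down to an anonymous mmap. A standalone sketch of the same system-call pattern:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Standalone sketch: reserve "guest RAM" the same way QEMU ultimately
 * does, with an anonymous mmap (no fd, like qemu_anon_ram_alloc
 * passing fd = -1 down to qemu_ram_mmap). */
int main(void)
{
    size_t size = 128 * 1024 * 1024; /* 128 MiB of "guest RAM" */

    void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ram == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("guest RAM backed at HVA %p\n", ram);
    munmap(ram, size);
    return 0;
}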
Building the memory dispatch table
Memory dispatch in QEMU means: given an AddressSpace and an address, quickly find the MemoryRegionSection containing that address, and from it the MemoryRegion. The structure behind this is AddressSpaceDispatch; the dispatch member of AddressSpace holds it and records that AddressSpace's dispatch information.
In short:
- phys_map acts like CR3;
- nodes is a multi-level page table stored as an array of nodes;
- sections holds the leaves that a node's ptr points at; each section contains a MemoryRegion.

When looking up a MemoryRegion, phys_map and nodes cooperate to quickly locate the matching section in sections, which yields the associated MemoryRegion.
AddressSpaceDispatch
+-------------------------+
|as                       |
| (AddressSpace *)        |
+-------------------------+
|mru_section              |
| (MemoryRegionSection *) |
+-------------------------+
|map (PhysPageMap)        |
|  +---------------------+
|  |sections             | ---> MemoryRegionSection[]: io_mem_unassigned,
|  | MemoryRegionSection*|      io_mem_notdirty, io_mem_rom, io_mem_watch,
|  +---------------------+      one mr in the tree, subpage_t->iomem, ...
|  |sections_nb          |      each entry holds: mr (MemoryRegion *),
|  |sections_nb_alloc    |      fv (FlatView *), size (Int128),
|  | (unsigned)          |      offset_within_region (hwaddr),
|  +---------------------+      offset_within_address_space (hwaddr, a GPA)
|  |nodes                | ---> nodes[i] = PhysPageEntry[P_L2_SIZE = 2^9]
|  | (Node *)            |      each PhysPageEntry is {u32 skip:6; u32 ptr:26}
|  +---------------------+      skip != 0: ptr indexes the next-level node
|  |nodes_nb             |      skip == 0: ptr indexes into sections[]
|  |nodes_nb_alloc       |      ptr == PHYS_MAP_NODE_NIL: unmapped hole
|  | (unsigned)          |      (P_L2_LEVELS = 6)
|  +---------------------+
+-------------------------+
|phys_map (PhysPageEntry) | root entry of the walk, like CR3:
| {u32 skip:6; u32 ptr:26}| its ptr selects the first node in nodes[]
+-------------------------+
The corresponding data structures:
typedef PhysPageEntry Node[P_L2_SIZE];

typedef struct PhysPageMap {
    struct rcu_head rcu;

    unsigned sections_nb;
    unsigned sections_nb_alloc;
    unsigned nodes_nb;
    unsigned nodes_nb_alloc;
    Node *nodes;
    MemoryRegionSection *sections;
} PhysPageMap;

struct AddressSpaceDispatch {
    MemoryRegionSection *mru_section;
    /* This is a multi-level map on the physical address space.
     * The bottom level has pointers to MemoryRegionSections.
     */
    PhysPageEntry phys_map;
    PhysPageMap map;
};

struct MemoryRegionSection {
    Int128 size;
    MemoryRegion *mr;
    FlatView *fv;
    hwaddr offset_within_region;
    hwaddr offset_within_address_space;
    bool readonly;
    bool nonvolatile;
    bool unmergeable;
};
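The lookup itself is a radix-tree walk driven by skip/ptr. A self-contained sketch of the walk, simplified from QEMU's phys_page_find (stand-in types, illustration only):

#include <stdint.h>

/* Simplified stand-ins for QEMU's PhysPageEntry walk. */
#define P_L2_BITS 9
#define P_L2_SIZE (1 << P_L2_BITS)
#define PHYS_MAP_NODE_NIL 0x3ffffffu

typedef struct { uint32_t skip : 6; uint32_t ptr : 26; } PPEntry;
typedef PPEntry NodeDemo[P_L2_SIZE];

/* Walk from the root entry down to a section index; levels (P_L2_LEVELS
 * in QEMU) are consumed 9 bits at a time from the top of the page number.
 * Returns an index into sections[], or -1 for an unmapped hole. */
static int phys_page_find_demo(PPEntry root, NodeDemo *nodes,
                               uint64_t page_nr, int levels)
{
    PPEntry lp = root;
    for (int i = levels; lp.skip && (i -= lp.skip) >= 0;) {
        if (lp.ptr == PHYS_MAP_NODE_NIL) {
            return -1;                       /* unmapped hole */
        }
        lp = nodes[lp.ptr][(page_nr >> (i * P_L2_BITS)) & (P_L2_SIZE - 1)];
    }
    return lp.ptr;                           /* index into sections[] */
}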
Committing memory to KVM
In QEMU, memory_region_transaction_commit is the key function for committing a memory transaction: it makes sure changes made at the MemoryRegion level are synchronized down to the memory subsystem and to hardware acceleration such as KVM. It is invoked whenever the memory map changes, typically in the following situations.
When the memory layout is initialized
- At VM startup QEMU builds the guest memory layout: it sets up the RAM regions and registers device memory (e.g. MMIO regions).
- Once the region changes are in place, memory_region_transaction_commit is called to commit them.

When memory regions are added or removed
- When regions are added or removed dynamically (e.g. memory or device hotplug), QEMU first edits the memory map and then calls memory_region_transaction_commit to publish the update.

When the guest address space is adjusted
- When a guest address space (such as the PCI address space) changes, the region mappings must be updated.
- These adjustments are usually triggered by a device model, which calls memory_region_transaction_commit after reprogramming its address space.

On snapshot restore or migration
- During snapshot restore or VM migration the memory map is rebuilt from scratch.
- QEMU calls memory_region_transaction_commit to bring the new layout into effect.

When access permissions or attributes change
- Changes to a region's access permissions (e.g. read/write) or attributes (e.g. caching policy) are likewise committed through the transaction mechanism.
For the list of callers of memory_region_transaction_commit in the source, search for the function name; a sample:
static void memory_region_finalize(Object *obj)
void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr, hwaddr size)
void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
void memory_region_del_eventfd(MemoryRegion *mr, hwaddr addr, unsigned size,
                               bool match_data, uint64_t data, EventNotifier *e)
...
The main flow of memory_region_transaction_commit is:
memory_region_transaction_commit()          // update topology or ioeventfds
  flatviews_reset()
    flatviews_init()
      flat_views = g_hash_table_new_full()
      empty_view = generate_memory_topology(NULL)
    generate_memory_topology()
  MEMORY_LISTENER_CALL_GLOBAL(begin, Forward)
  address_space_set_flatview()
    address_space_update_topology_pass(false)
    address_space_update_topology_pass(true)
  address_space_update_ioeventfds()
    address_space_add_del_ioeventfds()
  MEMORY_LISTENER_CALL_GLOBAL(commit, Forward)
- flatviews_reset: rebuild the FlatViews of all AddressSpaces
- MEMORY_LISTENER_CALL_GLOBAL(begin, Forward)
- address_space_set_flatview: add/delete regions according to the changes
- address_space_update_ioeventfds: add/delete eventfds according to the changes
- MEMORY_LISTENER_CALL_GLOBAL(commit, Forward)
The first three steps update QEMU's own memory-map structures in response to the event; the last step pushes the updated mapping down to KVM through the listener callbacks.
For a QEMU started in KVM mode, the listener's region callbacks are:
kml->listener.region_add = kvm_region_add;
kml->listener.region_del = kvm_region_del;
Taking region_add as an example, the call chain is:

kvm_region_add
  kvm_set_phys_mem
    kvm_set_user_memory_region
      kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) is what hands the memory map built by QEMU over to KVM.
Look at kvm_set_user_memory_region: it packs a KVMSlot into a kvm_userspace_memory_region and submits that to KVM.
The KVMSlot *slot parameter is produced by the callers above, which convert a MemoryRegionSection into a KVMSlot to match KVM's interface.
static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
{
    KVMState *s = kvm_state;
    struct kvm_userspace_memory_region mem;
    int ret;

    mem.slot = slot->slot | (kml->as_id << 16);
    mem.guest_phys_addr = slot->start_addr;
    mem.userspace_addr = (unsigned long)slot->ram;
    mem.flags = slot->flags;
    //...
    mem.memory_size = slot->memory_size;
    ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
    slot->old_flags = mem.flags;
    //...
    return ret;
}
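What QEMU does here can be reproduced with a handful of ioctls. A minimal standalone sketch that registers guest RAM with KVM directly (error handling omitted; slot number and GPA chosen arbitrarily):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Sketch: hand an anonymous mmap'd buffer to KVM as guest RAM at GPA 0,
 * via the same ioctl QEMU issues in kvm_set_user_memory_region(). */
int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

    size_t size = 2 * 1024 * 1024;  /* 2 MiB of guest RAM */
    void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct kvm_userspace_memory_region mem = {
        .slot            = 0,
        .guest_phys_addr = 0x0,                /* GPA */
        .memory_size     = size,
        .userspace_addr  = (unsigned long)ram, /* HVA */
        .flags           = 0,
    };
    /* from here on KVM can build EPT/NPT entries GPA 0..size -> this HVA */
    return ioctl(vm, KVM_SET_USER_MEMORY_REGION, &mem);
}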
KVM builds the EPT
After a vCPU is created, its initialization calls kvm_mmu_setup to initialize the MMU; the relevant call path goes through init_kvm_mmu.
static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, union kvm_cpu_role cpu_role)
{
    // ...
    context->root_role.word = root_role.word;
    // kvm_tdp_page_fault handles EPT page-access faults; the handler is
    // set according to the mode the vCPU is running in
    context->page_fault = kvm_tdp_page_fault;
    //...
}

int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
    // ...
    return direct_page_fault(vcpu, fault);
}
The call then reaches direct_page_fault, whose main tasks are:
- locate the faulting guest address range (GPA);
- use the kvm_memory_slot to find the corresponding host physical memory (HPA);
- update the second-level page tables (EPT/NPT) to install the mapping.
static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
    bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
    unsigned long mmu_seq;
    int r;

    // guest frame number (gfn) of the faulting address
    fault->gfn = fault->addr >> PAGE_SHIFT;
    // look up the host-side slot for this gfn; the slot links the guest
    // address to host memory and is the information QEMU reported to KVM
    fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
    // ...
    if (is_tdp_mmu_fault) {
        r = kvm_tdp_mmu_map(vcpu, fault);
    } else {
        r = make_mmu_pages_available(vcpu);
        if (r)
            goto out_unlock;
        // build the EPT or NPT mapping, depending on the CPU architecture
        r = __direct_map(vcpu, fault);
    }
    //...
    return r;
}
MMIO handling
Debugging
Debugging QEMU as it submits memory information to KVM.
The call stack below shows kvm_region_commit being triggered during VM startup, finally committing the assembled memory layout to KVM.
(gdb) bt
#0 kvm_set_user_memory_region (slot=0x7ffff4c79010, new=new@entry=true, kml=<optimized out>) at ../accel/kvm/kvm-all.c:282
#1 0x0000555555cba6f2 in kvm_set_phys_mem (kml=kml@entry=0x555556dd0040, section=section@entry=0x555557406ec0, add=<optimized out>, add@entry=true) at ../accel/kvm/kvm-all.c:1365
#2 0x0000555555cbab4c in kvm_region_commit (listener=0x555556dd0040) at ../accel/kvm/kvm-all.c:1574
#3 0x0000555555c56bae in memory_region_transaction_commit () at ../system/memory.c:1137
#4 memory_region_transaction_commit () at ../system/memory.c:1117
#5 0x0000555555b7a525 in pc_memory_init (pcms=pcms@entry=0x55555700fc80, system_memory=system_memory@entry=0x555557019600, rom_memory=rom_memory@entry=0x555556dc8dc0, pci_hole64_size=pci_hole64_size@entry=2147483648) at ../hw/i386/pc.c:961
#6 0x0000555555b604d3 in pc_init1 (machine=0x55555700fc80, pci_type=0x555555ef9ac7 "i440FX", host_type=0x555555ef9ae6 "i440FX-pcihost") at ../hw/i386/pc_piix.c:243
#7 0x0000555555908201 in machine_run_board_init (machine=0x55555700fc80, mem_path=<optimized out>, errp=<optimized out>, errp@entry=0x555556d3f378 <error_fatal>)at ../hw/core/machine.c:1541
#8 0x0000555555abe1c6 in qemu_init_board () at ../system/vl.c:2614
Debugging KVM's page-table construction: when the guest takes a page fault, the TDP map function is triggered (the host here is an AMD CPU).
kvm_tdp_mmu_map
- Proactive operation:
  - typically called when guest memory is configured or updated, e.g. when a mapping is initialized or changed;
  - can be called directly to establish mappings for a given range of guest addresses.
- Trigger condition:
  - does not depend on guest behavior; it is invoked by the host (KVM) itself.
- Typical scenarios:
  - when a kvm_memory_slot is created or adjusted;
  - when mappings are pre-built before the guest runs, to optimize performance.
- Input and responsibility:
  - input includes the GPA range, the starting HPA, and the mapping permissions;
  - updates the TDP page tables directly, possibly mapping many pages in one batch.

direct_page_fault
- Reactive operation:
  - handles TDP page faults raised while the guest is running;
  - triggered by hardware when the guest touches an unmapped address or one it lacks permission for.
- Trigger condition:
  - depends on guest behavior;
  - hardware detects the TDP fault or permission problem, the VM exits to the host, and the handler runs.
- Typical scenarios:
  - the guest accesses a GPA that has not yet been mapped;
  - the guest attempts an operation without sufficient permission, e.g. writing a read-only page.
- Input and responsibility:
  - input is the faulting GPA plus the fault information (read/write/execute permission);
  - locates the kvm_memory_slot for the GPA, computes the HPA, updates the TDP page tables, and resumes the guest.
Breakpoint 3, kvm_tdp_mmu_map (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/tdp_mmu.c:1159
1159 {
(gdb) c
Continuing.

Breakpoint 2, direct_page_fault (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/mmu.c:4267
4267 {
(gdb) bt
#0 direct_page_fault (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/mmu.c:4267
#1 0xffffffffa027b3ad in kvm_tdp_page_fault (vcpu=vcpu@entry=0xffff88810a168000, fault=fault@entry=0xffffc90000b83bd0) at arch/x86/kvm/mmu/mmu.c:4393
#2 0xffffffffa027b6bd in kvm_mmu_do_page_fault (prefetch=false, err=4, cr2_or_gpa=1008168, vcpu=0xffff88810a168000) at arch/x86/kvm/mmu/mmu_internal.h:291
#3 kvm_mmu_page_fault (vcpu=0xffff88810a168000, cr2_or_gpa=1008168, error_code=4294967300, insn=0x0 <fixed_percpu_data>, insn_len=0) at arch/x86/kvm/mmu/mmu.c:5592