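To let users configure and query the NIC hardware, or run commands such as create_cq, Mellanox NICs provide a command queue + mailbox mechanism. This section walks through that process using create_cq as the example.
The command queue (cmdq below) is a 4K-aligned, 4K-long region of physically contiguous memory. As shown in Figure 1, the cmdq holds multiple entries; both the number of entries and the size of each entry are read from the initialization segment.
The entry format is shown in Figure 2. Each software cmd corresponds to one cmdq entry, and the cmd's input is stored in that entry. If the input fits in 16B, it is stored entirely in command_input_inline_data; if it is larger than 16B, the remainder goes into the mailboxes shown in Figure 1. Mailboxes form a linked list, each mailbox holds 512B of data, and the pointer to the first mailbox is stored in the entry's input mailbox pointer.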
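To make the entry format concrete, here is a minimal userspace sketch of one cmdq entry. It is modeled on the driver's struct mlx5_cmd_layout as I recall it; the field order, the 64-byte total size, and the name cmdq_entry are assumptions and are not taken from this article. Multi-byte fields are big-endian on the device, so plain integer types are used here only to show sizes and offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one cmdq entry (assumed layout, see lead-in). */
struct cmdq_entry {
    uint8_t  type;        /* transport type */
    uint8_t  rsvd0[3];
    uint32_t inlen;       /* total input length in bytes */
    uint64_t in_ptr;      /* DMA address of the first input mailbox */
    uint32_t in[4];       /* first 16B of input, stored inline */
    uint32_t out[4];      /* first 16B of output, stored inline */
    uint64_t out_ptr;     /* DMA address of the first output mailbox */
    uint32_t outlen;      /* total output length in bytes */
    uint8_t  token;       /* must match the token in every mailbox */
    uint8_t  sig;         /* signature byte */
    uint8_t  rsvd1;
    uint8_t  status_own;  /* carries the ownership bit (HW or SW) */
};

/* The assumed layout packs to 64 bytes with no padding. */
static_assert(sizeof(struct cmdq_entry) == 64, "entry must be 64 bytes");

/* Number of input bytes that fit inline in the entry itself. */
static size_t inline_capacity(void)
{
    return sizeof(((struct cmdq_entry *)0)->in);
}
```

With this layout the inline input area is exactly the 16B mentioned above, and everything beyond it has to be reached through in_ptr.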
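Once the input has been written to the entry and its mailboxes, software rings the cmdq doorbell. The doorbell lives in the initialization segment and can be thought of as a bit vector: software sets the bit whose position is the index of the entry, which tells the hardware to execute it. The order in which hardware executes cmdq entries is not defined. When an entry is initialized its ownership is set to HW, and when the hardware finishes it sets ownership back to SW, so software can poll the ownership field to learn when the command is done.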
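The doorbell and ownership handshake can be illustrated with a small userspace model. Nothing here touches real hardware: fake_cmdq, sw_post, hw_run and sw_done are hypothetical names, the queue depth of 32 is made up, and the encoding of the ownership bit as 0x1 is an assumption. The fake device deliberately completes entries from the highest index down to stress that completion order is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

#define OWNER_HW    0x1u   /* assumed encoding of the ownership bit */
#define NUM_ENTRIES 32u    /* hypothetical queue depth */

struct fake_cmdq {
    uint8_t  status_own[NUM_ENTRIES]; /* one ownership byte per entry */
    uint32_t doorbell;                /* bit i set: entry i is pending */
};

/* Software side: hand entry idx to the hardware, then ring its bit. */
static void sw_post(struct fake_cmdq *q, unsigned idx)
{
    assert(idx < NUM_ENTRIES);
    q->status_own[idx] |= OWNER_HW;
    q->doorbell |= UINT32_C(1) << idx;
}

/* Stand-in for the device: completes every pending entry, highest
 * index first, and returns how many entries it completed. */
static unsigned hw_run(struct fake_cmdq *q)
{
    unsigned done = 0;

    for (int i = (int)NUM_ENTRIES - 1; i >= 0; i--) {
        uint32_t bit = UINT32_C(1) << i;

        if (!(q->doorbell & bit))
            continue;
        q->doorbell &= ~bit;
        q->status_own[i] &= (uint8_t)~OWNER_HW; /* give it back to SW */
        done++;
    }
    return done;
}

/* Software side: a command is done once ownership is back to SW. */
static int sw_done(const struct fake_cmdq *q, unsigned idx)
{
    assert(idx < NUM_ENTRIES);
    return !(q->status_own[idx] & OWNER_HW);
}
```

Posting entries 3 and 0 sets doorbell bits 3 and 0; after hw_run both entries report done regardless of the order in which they were posted.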
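Now let's look at how the driver does this.
Initialization
The main job of mlx5_cmd_init is to create the command queue buffer and tell the NIC hardware the bus address of that memory.
It first creates a DMA memory pool with dma_pool_create. Each block in the pool is sizeof(struct mlx5_cmd_prot_block) bytes and represents one mailbox; mlx5_cmd_prot_block is the type that describes a mailbox.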
int mlx5_cmd_init(struct mlx5_core_dev *dev)
{
    int size = sizeof(struct mlx5_cmd_prot_block);
    int align = roundup_pow_of_two(size);
    struct mlx5_cmd *cmd = &dev->cmd;
    ...
    cmd->pool = dma_pool_create("mlx5_cmd", mlx5_core_dma_dev(dev), size, align, 0);
    err = alloc_cmd_page(dev, cmd);
    ...
}
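Next, alloc_cmd_page uses dma_zalloc_coherent to allocate one page of coherent DMA memory, which becomes the command queue buffer. The virtual address is saved in cmd_alloc_buf and the DMA address in alloc_dma. The buffer must be page aligned: if the first allocation is already aligned, it is used directly as cmd_buf and dma. Otherwise it is freed and a second allocation of 2 * MLX5_ADAPTER_PAGE_SIZE - 1 bytes (just under two pages) is made, and cmd_buf and dma are set to the first page-aligned address inside it.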
static int alloc_cmd_page(struct mlx5_core_dev *dev, struct mlx5_cmd *cmd)
{
    cmd->cmd_alloc_buf = dma_zalloc_coherent(mlx5_core_dma_dev(dev), MLX5_ADAPTER_PAGE_SIZE,
                                             &cmd->alloc_dma, GFP_KERNEL);
    ...
    /* make sure it is aligned to 4K */
    if (!((uintptr_t)cmd->cmd_alloc_buf & (MLX5_ADAPTER_PAGE_SIZE - 1))) {
        cmd->cmd_buf = cmd->cmd_alloc_buf;
        cmd->dma = cmd->alloc_dma;
        cmd->alloc_size = MLX5_ADAPTER_PAGE_SIZE;
        return 0;
    }

    dma_free_coherent(mlx5_core_dma_dev(dev), MLX5_ADAPTER_PAGE_SIZE, cmd->cmd_alloc_buf,
                      cmd->alloc_dma);
    cmd->cmd_alloc_buf = dma_zalloc_coherent(mlx5_core_dma_dev(dev),
                                             2 * MLX5_ADAPTER_PAGE_SIZE - 1,
                                             &cmd->alloc_dma, GFP_KERNEL);
    ...
    cmd->cmd_buf = PTR_ALIGN(cmd->cmd_alloc_buf, MLX5_ADAPTER_PAGE_SIZE);
    cmd->dma = ALIGN(cmd->alloc_dma, MLX5_ADAPTER_PAGE_SIZE);
    cmd->alloc_size = 2 * MLX5_ADAPTER_PAGE_SIZE - 1;
    return 0;
}
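Why 2 * PAGE - 1 bytes are enough for the fallback allocation can be checked with a small userspace sketch. It uses calloc in place of the DMA allocator, and PAGE_SZ, align_up, page_buf and alloc_aligned_page are hypothetical names introduced only for this illustration. In the worst case the allocation starts one byte past a page boundary, so the aligned address is 4095 bytes in and a full page still fits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SZ 4096u  /* stands in for MLX5_ADAPTER_PAGE_SIZE */

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

struct page_buf {
    void *alloc;   /* pointer that must later be passed to free() */
    void *aligned; /* first page-aligned address inside alloc */
};

/* Over-allocate by PAGE_SZ - 1 bytes so that one whole aligned page
 * is guaranteed to lie inside the allocation. Returns 0 on success. */
static int alloc_aligned_page(struct page_buf *pb)
{
    pb->alloc = calloc(1, 2 * PAGE_SZ - 1);
    if (!pb->alloc)
        return -1;
    pb->aligned = (void *)align_up((uintptr_t)pb->alloc, PAGE_SZ);
    return 0;
}
```

The caller keeps both pointers, just as the driver keeps cmd_alloc_buf for freeing and cmd_buf for actual use.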
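mlx5_cmd_init then reads cmdq_addr_l_sz from the Initialization Segment (iseg below) and extracts log_sz and log_stride from its low bits: log_sz is the log2 of the number of cmdq entries, and log_stride is the log2 of the size of one entry. It then writes the DMA address of the cmdq into the iseg so that the hardware knows where the cmdq is. Finally it sets mode to CMD_MODE_POLLING and creates a single-threaded workqueue. That completes initialization.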
int mlx5_cmd_init(struct mlx5_core_dev *dev)
{
    ...
    cmd_l = ioread32be(&dev->iseg->cmdq_addr_l_sz) & 0xff;
    cmd->log_sz = cmd_l >> 4 & 0xf;
    cmd->log_stride = cmd_l & 0xf;
    cmd->max_reg_cmds = (1 << cmd->log_sz) - 1;
    cmd->bitmask = (1UL << cmd->max_reg_cmds) - 1;

    cmd_h = (u32)((u64)(cmd->dma) >> 32);
    cmd_l = (u32)(cmd->dma);
    iowrite32be(cmd_h, &dev->iseg->cmdq_addr_h);
    iowrite32be(cmd_l, &dev->iseg->cmdq_addr_l_sz);

    /* Make sure firmware sees the complete address before we proceed */
    wmb();
    ...
    cmd->mode = CMD_MODE_POLLING;
    cmd->allowed_opcode = CMD_ALLOWED_OPCODE_ALL;
    cmd->wq = create_singlethread_workqueue(cmd->wq_name);
    ...
}
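The bit arithmetic above is easy to try in userspace. The sketch below decodes a register value the same way and splits a 64-bit DMA address into the two 32-bit halves. The register value 0xabcdef56 and the address used in the usage note are made-up sample data, and cmdq_geometry, parse_cmdq_low_byte and split_dma are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

struct cmdq_geometry {
    unsigned log_sz;        /* log2 of the number of entries */
    unsigned log_stride;    /* log2 of the entry size in bytes */
    unsigned num_entries;
    unsigned entry_bytes;
    unsigned max_reg_cmds;  /* the driver subtracts one entry */
    unsigned long bitmask;  /* one set bit per free regular entry */
};

/* Decode the low byte of cmdq_addr_l_sz the way mlx5_cmd_init does. */
static struct cmdq_geometry parse_cmdq_low_byte(uint32_t cmdq_addr_l_sz)
{
    struct cmdq_geometry g;
    uint32_t low = cmdq_addr_l_sz & 0xff;

    g.log_sz = (low >> 4) & 0xf;
    g.log_stride = low & 0xf;
    g.num_entries = 1u << g.log_sz;
    g.entry_bytes = 1u << g.log_stride;
    g.max_reg_cmds = g.num_entries - 1;
    /* The shift below is only defined if the mask fits in a long. */
    assert(g.max_reg_cmds < sizeof(unsigned long) * CHAR_BIT);
    g.bitmask = (1UL << g.max_reg_cmds) - 1;
    return g;
}

/* Split a DMA address into the halves written to cmdq_addr_h and
 * cmdq_addr_l_sz. */
static void split_dma(uint64_t dma, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(dma >> 32);
    *lo = (uint32_t)dma;
}
```

For the sample low byte 0x56 this yields log_sz = 5 and log_stride = 6, that is 32 entries of 64 bytes, 31 regular commands, and a bitmask of 0x7fffffff. Because the cmdq is 4K aligned, the low 12 bits of its address are zero, which is what lets the address and the size fields share one register.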
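Issuing and executing a cmd
Input and output
Next, using create_cq as the example, let's see how a cmd is issued. The input of create_cq is described by struct mlx5_ifc_create_cq_in_bits. This struct does not describe real memory byte for byte: each u8 array element stands for one bit, so opcode[0x10] is declared as u8 only for programming convenience and actually occupies 16 bits. The output is described by struct mlx5_ifc_create_cq_out_bits.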
struct mlx5_ifc_create_cq_in_bits {
    u8 opcode[0x10];
    u8 uid[0x10];

    u8 reserved_at_20[0x10];
    u8 op_mod[0x10];

    u8 reserved_at_40[0x40];

    struct mlx5_ifc_cqc_bits cq_context;

    u8 reserved_at_280[0x60];

    u8 cq_umem_valid[0x1];
    u8 reserved_at_2e1[0x59f];

    u8 pas[][0x40];
};
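Since each array element in an _bits struct stands for one bit, a field is fully described by its bit offset and bit width inside a big-endian buffer. The sketch below shows one way such a field could be read. It is not the driver's MLX5_GET macro; load_be32 and get_bits are hypothetical helpers, and this version only handles fields that do not cross a 32-bit boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load a big-endian 32-bit word from a byte buffer. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read `width` bits starting at bit offset `bit_off`, where bit 0 is
 * the most significant bit of the first byte. The field must lie
 * within a single 32-bit dword. */
static uint32_t get_bits(const uint8_t *buf, size_t bit_off, unsigned width)
{
    unsigned in_dword = (unsigned)(bit_off % 32);
    uint32_t word, mask;

    assert(width >= 1 && in_dword + width <= 32);
    word = load_be32(buf + (bit_off / 32) * 4);
    mask = width == 32 ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
    return (word >> (32 - in_dword - width)) & mask;
}
```

With the struct above, opcode sits at offset 0x00 with width 0x10, uid at 0x10 with width 0x10, and op_mod at 0x30 with width 0x10, so get_bits(buf, 0x30, 0x10) would return op_mod from a buffer laid out this way.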
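First, how input and output are stored. mlx5_cmd_msg represents one msg and manages the data for the cmdq entry and its mailboxes: first holds the leading 16B, and if the input is longer than 16B the rest is stored in the mailbox list hanging off next. mlx5_cmd_mailbox represents one mailbox: buf is the virtual address of the mailbox memory and dma is the DMA address of that buf. next is the software-side link, a virtual address that software follows to walk the mailbox list. The hardware walks the list through the next field stored inside buf, which is a DMA address.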
struct mlx5_cmd_msg {
    struct list_head list;
    struct cmd_msg_cache *parent;
    u32 len;
    struct mlx5_cmd_first first;
    struct mlx5_cmd_mailbox *next;
};

struct mlx5_cmd_first {
    __be32 data[4];
};

struct mlx5_cmd_mailbox {
    void *buf;
    dma_addr_t dma;
    struct mlx5_cmd_mailbox *next;
};
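The two parallel links, a virtual-address chain for software and a DMA-address chain for hardware, can be modeled in userspace as follows. This is only a sketch: the DMA addresses are fabricated numbers, values are kept in host byte order rather than big-endian, and hw_block, mailbox, build_chain and free_chain are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct hw_block {          /* the part the device would read */
    uint64_t next_dma;     /* DMA address of the next block, 0 ends it */
    uint32_t block_num;    /* position of this block in the chain */
    uint8_t  token;
};

struct mailbox {           /* the part software tracks */
    struct hw_block *buf;  /* virtual address of the block */
    uint64_t dma;          /* fake DMA address, for illustration only */
    struct mailbox *next;  /* software-side link */
};

static void free_chain(struct mailbox *head)
{
    while (head) {
        struct mailbox *next = head->next;

        free(head->buf);
        free(head);
        head = next;
    }
}

/* Build n mailboxes by pushing each new one onto the front, so the
 * box created last ends up first and carries block_num 0. */
static struct mailbox *build_chain(int n, uint8_t token)
{
    struct mailbox *head = NULL;

    for (int i = 0; i < n; i++) {
        struct mailbox *mb = calloc(1, sizeof(*mb));

        if (!mb)
            goto fail;
        mb->buf = calloc(1, sizeof(*mb->buf));
        if (!mb->buf) {
            free(mb);
            goto fail;
        }
        mb->dma = UINT64_C(0x1000) * (uint64_t)(i + 1);
        mb->next = head;                           /* software link */
        mb->buf->next_dma = head ? head->dma : 0;  /* hardware link */
        mb->buf->block_num = (uint32_t)(n - i - 1);
        mb->buf->token = token;
        head = mb;
    }
    return head;
fail:
    free_chain(head);
    return NULL;
}
```

Walking head->next visits blocks 0, 1, 2 in order, and at every step the next_dma stored in the block equals the dma of the following mailbox, which is the invariant that keeps the two chains in agreement.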
Now to the execution of a cmd. The entry point is mlx5_cmd_do, which in effect calls cmd_exec directly with callback and context set to NULL and force_polling set to false.
static int cmd_exec(struct mlx5_core_dev *dev, void *in, int in_size, void *out,
                    int out_size, mlx5_cmd_cbk_t callback, void *context,
                    bool force_polling)
{
    struct mlx5_cmd_msg *inb;
    struct mlx5_cmd_msg *outb;

    opcode = MLX5_GET(mbox_in, in, opcode);
    pages_queue = is_manage_pages(in);
    gfp = callback ? GFP_ATOMIC : GFP_KERNEL;

    inb = alloc_msg(dev, in_size, gfp);
    ...
}
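Allocating the msg
cmd_exec first allocates the input msg through alloc_msg. A number of mlx5_cmd_msg objects were preallocated as a cache during initialization. If the input is no longer than 16 bytes (in_size <= 16), it fits in a single cmdq entry, so the msg is allocated directly without going through the cache. Otherwise alloc_msg looks in the cache for a msg that is large enough and uses it if one is found; on a cache miss it falls back to mlx5_alloc_cmd_msg.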
static struct mlx5_cmd_msg *alloc_msg(struct mlx5_core_dev *dev, int in_size,
                                      gfp_t gfp)
{
    struct mlx5_cmd_msg *msg = ERR_PTR(-ENOMEM);
    struct mlx5_cmd *cmd = &dev->cmd;

    if (in_size <= 16)
        goto cache_miss;
    ...
cache_miss:
    msg = mlx5_alloc_cmd_msg(dev, gfp, in_size, 0);
    return msg;
}
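mlx5_alloc_cmd_msg first allocates the msg and then calls mlx5_calc_cmd_blocks to work out how many mailboxes are needed: the cmdq entry holds the first 16B, and the remainder is divided among mailboxes of 512B each. Each mailbox is allocated with alloc_cmd_box, which takes one DMA block from cmd->pool through dma_pool_zalloc. Every new mailbox is pushed onto the front of the msg's list via its next field, which is the link software uses to walk the list. The mailbox's buf is then initialized: block_num is set, token is set to the token argument (0 for the input msg allocated here), and the next field inside buf is set to the DMA address of the following mailbox, which is the link hardware uses to walk the list.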
static struct mlx5_cmd_msg *mlx5_alloc_cmd_msg(struct mlx5_core_dev *dev,
                                               gfp_t flags, int size,
                                               u8 token)
{
    struct mlx5_cmd_mailbox *tmp, *head = NULL;
    struct mlx5_cmd_prot_block *block;
    struct mlx5_cmd_msg *msg;
    int err;
    int n;
    int i;

    msg = kzalloc(sizeof(*msg), flags);
    if (!msg)
        return ERR_PTR(-ENOMEM);

    msg->len = size;
    n = mlx5_calc_cmd_blocks(msg);

    for (i = 0; i < n; i++) {
        tmp = alloc_cmd_box(dev, flags);
        if (IS_ERR(tmp)) {
            mlx5_core_warn(dev, "failed allocating block\n");
            err = PTR_ERR(tmp);
            goto err_alloc;
        }

        block = tmp->buf;
        tmp->next = head;
        block->next = cpu_to_be64(tmp->next ? tmp->next->dma : 0);
        block->block_num = cpu_to_be32(n - i - 1);
        block->token = token;
        head = tmp;
    }
    msg->next = head;
    return msg;

err_alloc:
    while (head) {
        tmp = head->next;
        free_cmd_box(dev, head);
        head = tmp;
    }
    kfree(msg);

    return ERR_PTR(err);
}
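The block count used above follows directly from the 16B inline area and the 512B mailbox payload. A minimal sketch of that calculation, with hypothetical names INLINE_BYTES, BLOCK_BYTES and calc_blocks (the driver's own mlx5_calc_cmd_blocks is not shown in this article):

```c
#include <assert.h>

#define INLINE_BYTES 16   /* bytes held in the cmdq entry itself */
#define BLOCK_BYTES  512  /* data bytes held by one mailbox */

/* Number of mailboxes needed for a message of len bytes: subtract
 * what fits inline, then round the rest up to whole mailboxes. */
static int calc_blocks(int len)
{
    int inline_part, rest;

    assert(len >= 0);
    inline_part = len < INLINE_BYTES ? len : INLINE_BYTES;
    rest = len - inline_part;
    return (rest + BLOCK_BYTES - 1) / BLOCK_BYTES;
}
```

A 16-byte message needs no mailbox, 17 bytes need one, 528 bytes (16 + 512) still need one, and 529 bytes need two.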
static int cmd_exec(struct mlx5_core_dev *dev, void *in, int in_size, void *out,
                    int out_size, mlx5_cmd_cbk_t callback, void *context,
                    bool force_polling)
{
    ...
    token = alloc_token(&dev->cmd);

    err = mlx5_copy_to_msg(inb, in, in_size, token);

    outb = mlx5_alloc_cmd_msg(dev, gfp, out_size, token);
}
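Next a token is allocated. The token is simply an auto-incrementing uint8 that identifies one cmd; the cmdq entry and all mailboxes belonging to the same cmd must carry the same token. mlx5_copy_to_msg then copies the data from in into the msg inb, scattering the contiguous input across the cmd entry and the mailboxes. After that the output msg is allocated through mlx5_alloc_cmd_msg, the same function used for the input msg, this time with the token passed in.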
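The scatter copy can be sketched in userspace as below. This is an illustration under simplifying assumptions, not the driver's mlx5_copy_to_msg: block, msg and copy_to_msg are hypothetical, the blocks are ordinary structs rather than DMA memory, and the chain is assumed to have been sized beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INLINE_BYTES 16
#define BLOCK_BYTES  512

struct block {
    uint8_t data[BLOCK_BYTES];
    uint8_t token;
    struct block *next;
};

struct msg {
    uint8_t first[INLINE_BYTES]; /* goes into the cmdq entry */
    size_t len;
    struct block *next;          /* chain of mailboxes */
};

/* Scatter a contiguous buffer: the first 16B go inline, the rest is
 * cut into 512B pieces, one per block, each stamped with the token.
 * Returns 0 on success, -1 if the chain is too short. */
static int copy_to_msg(struct msg *to, const void *from, size_t size,
                       uint8_t token)
{
    const uint8_t *src = from;
    struct block *b;
    size_t n;

    assert(to && from);
    n = size < INLINE_BYTES ? size : INLINE_BYTES;
    memcpy(to->first, src, n);
    src += n;
    size -= n;

    for (b = to->next; size > 0; b = b->next) {
        if (!b)
            return -1;
        n = size < BLOCK_BYTES ? size : BLOCK_BYTES;
        memcpy(b->data, src, n);
        b->token = token;
        src += n;
        size -= n;
    }
    return 0;
}
```

For a 600-byte input, bytes 0 to 15 land in first, bytes 16 to 527 in the first block, and the remaining 72 bytes in the second block.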
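Execution
cmd_exec then calls mlx5_cmd_invoke, which allocates an mlx5_cmd_work_ent, ent, through cmd_alloc_ent to record the context of the command, such as its input and output. It initializes the work_struct ent->work with cmd_work_handler as the handler and submits ent->work to cmd->wq with queue_work; this wq is the one created during initialization (for page_queue commands the handler is called directly instead). wait_func then waits, via wait_for_completion, for cmd->wq to finish running ent->work.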
static int mlx5_cmd_invoke(struct mlx5_core_dev *dev, struct mlx5_cmd_msg *in,
                           struct mlx5_cmd_msg *out, void *uout, int uout_size,
                           mlx5_cmd_cbk_t callback,
                           void *context, int page_queue, u8 *status,
                           u8 token, bool force_polling)
{
    struct mlx5_cmd *cmd = &dev->cmd;
    struct mlx5_cmd_work_ent *ent;

    ent = cmd_alloc_ent(cmd, in, out, uout, uout_size,
                        callback, context, page_queue);

    ent->token = token;
    ent->polling = force_polling;

    init_completion(&ent->handling);
    if (!callback)
        init_completion(&ent->done);

    INIT_DELAYED_WORK(&ent->cb_timeout_work, cb_timeout_handler);
    INIT_WORK(&ent->work, cmd_work_handler);
    if (page_queue) {
        cmd_work_handler(&ent->work);
    } else if (!queue_work(cmd->wq, &ent->work)) {
        mlx5_core_warn(dev, "failed to queue work\n");
        err = -ENOMEM;
        goto out_free;
    }

    if (callback)
        goto out; /* mlx5_cmd_comp_handler() will put(ent) */

    err = wait_func(dev, ent);
    ...
}
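Now let's see what the workqueue actually runs, namely cmd_work_handler.
It first calls cmd_alloc_index. cmd->bitmask records which entries of the cmdq buffer are currently free. cmd_alloc_index uses find_first_bit to find the first set bit in bitmask, ret; if one is found, the bit is cleared to mark the entry as taken, and ret is recorded in ent->idx.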
static int cmd_alloc_index(struct mlx5_cmd *cmd)
{
    unsigned long flags;
    int ret;

    spin_lock_irqsave(&cmd->alloc_lock, flags);
    ret = find_first_bit(&cmd->bitmask, cmd->max_reg_cmds);
    if (ret < cmd->max_reg_cmds)
        clear_bit(ret, &cmd->bitmask);
    spin_unlock_irqrestore(&cmd->alloc_lock, flags);

    return ret < cmd->max_reg_cmds ? ret : -ENOMEM;
}
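The same bitmask bookkeeping can be tried without the kernel. The sketch below is single-threaded and leaves out the spinlock; alloc_index and free_index are hypothetical names, and the free path in particular is an assumption, since this article does not show how the driver returns an index.

```c
#include <assert.h>
#include <limits.h>

/* Find the lowest set bit below max_cmds, clear it, and return its
 * position. Returns -1 when no entry is free. */
static int alloc_index(unsigned long *bitmask, unsigned max_cmds)
{
    assert(max_cmds <= sizeof(*bitmask) * CHAR_BIT);
    for (unsigned i = 0; i < max_cmds; i++) {
        if (*bitmask & (1UL << i)) {
            *bitmask &= ~(1UL << i);   /* mark the entry as taken */
            return (int)i;
        }
    }
    return -1;
}

/* Assumed counterpart: set the bit again to release the entry. */
static void free_index(unsigned long *bitmask, unsigned idx)
{
    assert(idx < sizeof(*bitmask) * CHAR_BIT);
    *bitmask |= 1UL << idx;
}
```

Starting from a mask of 0x7 with three usable entries, successive calls hand out 0, 1 and 2 and then fail; releasing index 1 makes it the next one handed out.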
static void cmd_work_handler(struct work_struct *work)
{
    struct mlx5_cmd_work_ent *ent = container_of(work, struct mlx5_cmd_work_ent, work);
    struct mlx5_cmd *cmd = ent->cmd;
    bool poll_cmd = ent->polling;
    struct mlx5_cmd_layout *lay;
    int alloc_ret;
    int cmd_mode;

    dev = container_of(cmd, struct mlx5_core_dev, cmd);
    cb_timeout = msecs_to_jiffies(mlx5_tout_ms(dev, CMD));

    complete(&ent->handling);
    sem = ent->page_queue ? &cmd->pages_sem : &cmd->sem;
    down(sem);
    if (!ent->page_queue) {
        alloc_ret = cmd_alloc_index(cmd);
        if (alloc_ret < 0) {
            ...
            up(sem);
            return;
        }
        ent->idx = alloc_ret;
    } else {
        ...
    }

    cmd->ent_arr[ent->idx] = ent;
    lay = get_inst(cmd, ent->idx);
    ...
}
get_inst then returns the idx-th entry of the cmdq, lay. As mentioned earlier, log_stride is the log2 of the cmdq entry size, so the byte offset of the entry is (idx << cmd->log_stride).
static struct mlx5_cmd_layout *get_inst(struct mlx5_cmd *cmd, int idx)
{
    return cmd->cmd_buf + (idx << cmd->log_stride);
}
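As a quick arithmetic check of the offset calculation, here is a userspace sketch with hypothetical helpers entry_offset and entry_at. The log_stride value of 6 (64-byte entries) used in the usage note is only sample data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of entry idx when each entry is 2^log_stride bytes. */
static size_t entry_offset(unsigned idx, unsigned log_stride)
{
    assert(log_stride < sizeof(size_t) * 8);
    return (size_t)idx << log_stride;
}

/* Address of entry idx inside a queue buffer starting at cmd_buf.
 * The cast to a byte pointer makes the arithmetic portable C. */
static void *entry_at(void *cmd_buf, unsigned idx, unsigned log_stride)
{
    return (uint8_t *)cmd_buf + entry_offset(idx, log_stride);
}
```

With log_stride = 6, entry 3 starts at byte 192 of the buffer, and a 4K buffer has room for 64 such entries.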
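The handler then fills in lay. It copies the first 16B of the input msg into lay->in; if the input msg has a next, the input is longer than 16B, so the DMA address of the first mailbox is written to lay->in_ptr. It then sets inlen, and does the same for out_ptr and outlen. Finally it sets the token, changes the ownership bit to HW through status_own, and computes the signature.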
static void cmd_work_handler(struct work_struct *work)
{
    ...
    ent->lay = lay;
    memset(lay, 0, sizeof(*lay));
    memcpy(lay->in, ent->in->first.data, sizeof(lay->in));
    ent->op = be32_to_cpu(lay->in[0]) >> 16;
    if (ent->in->next)
        lay->in_ptr = cpu_to_be64(ent->in->next->dma);
    lay->inlen = cpu_to_be32(ent->in->len);
    if (ent->out->next)
        lay->out_ptr = cpu_to_be64(ent->out->next->dma);
    lay->outlen = cpu_to_be32(ent->out->len);
    lay->type = MLX5_PCI_CMD_XPORT;
    lay->token = ent->token;
    lay->status_own = CMD_OWNER_HW;
    set_signature(ent, !cmd->checksum_disabled);
    ...
}
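Once lay is filled in, the handler rings the cmd queue doorbell. The doorbell is a bit vector: software writes 1 << ent->idx, so the set bit tells the hardware which entry of cmd_buf holds the new cmd.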
static void cmd_work_handler(struct work_struct *work)
{
    ...
    wmb();
    iowrite32be(1 << ent->idx, &dev->iseg->cmd_dbell);
    /* if not in polling don't use ent after this point */
    if (cmd_mode == CMD_MODE_POLLING || poll_cmd) {
        poll_timeout(ent);
        /* make sure we read the descriptor after ownership is SW */
        rmb();
        mlx5_cmd_comp_handler(dev, 1ULL << ent->idx, ent->ret == -ETIMEDOUT ?
                              MLX5_CMD_COMP_TYPE_FORCED : MLX5_CMD_COMP_TYPE_POLLING);
    }
}
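poll_timeout then waits for the NIC to execute the cmd by polling the ownership bit of the cmdq entry; once ownership changes to SW, the hardware has finished. If that does not happen before the deadline, ent->ret is set to -ETIMEDOUT.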
static void poll_timeout(struct mlx5_cmd_work_ent *ent)
{
    struct mlx5_core_dev *dev = container_of(ent->cmd, struct mlx5_core_dev, cmd);
    u64 cmd_to_ms = mlx5_tout_ms(dev, CMD);
    unsigned long poll_end;
    u8 own;

    poll_end = jiffies + msecs_to_jiffies(cmd_to_ms + 1000);

    do {
        own = READ_ONCE(ent->lay->status_own);
        if (!(own & CMD_OWNER_HW)) {
            ent->ret = 0;
            return;
        }
        cond_resched();
    } while (time_before(jiffies, poll_end));

    ent->ret = -ETIMEDOUT;
}
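The shape of this polling loop can be reproduced in userspace with a spin budget standing in for the jiffies deadline. This is only a sketch: poll_owner is a hypothetical function, the ownership bit is assumed to be 0x1, and a real implementation would bound the wait by time rather than by iteration count.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define OWNER_HW 0x1u   /* assumed encoding of the ownership bit */

/* Re-read the status byte until the ownership bit clears or the spin
 * budget runs out. volatile forces a fresh load on every iteration,
 * playing the role READ_ONCE plays in the kernel code. */
static int poll_owner(const volatile uint8_t *status_own,
                      unsigned long max_spins)
{
    assert(status_own);
    for (unsigned long i = 0; i < max_spins; i++) {
        if (!(*status_own & OWNER_HW))
            return 0;           /* software owns the entry again */
    }
    return -ETIMEDOUT;
}
```

Only the ownership bit decides the outcome: a status byte with other bits set but the ownership bit clear still counts as completed.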
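mlx5_cmd_comp_handler then copies the first 16B of the hardware output from the entry into the output msg; the rest of the output can be read from the mailboxes that the output msg points to. It then calls complete to signal that execution has finished.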
static void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec,
                                  enum mlx5_comp_t comp_type)
{
    struct mlx5_cmd *cmd = &dev->cmd;
    struct mlx5_cmd_work_ent *ent;
    ...
    memcpy(ent->out->first.data, ent->lay->out, sizeof(ent->lay->out));
    complete(&ent->done);
    ...
}
Finally, back in cmd_exec, the output msg is copied back into out, which for create_cq is laid out as mlx5_ifc_create_cq_out_bits.