CUDA Implementation and Optimization of Self-Attention (Part 1)

Overview

Self-attention is the most critical and most complex part of the Transformer, and it is also the core of Transformer optimization. Understanding self-attention is key to understanding the Transformer in depth. This article therefore revolves around self-attention; since the topic is fairly involved, it is split into two parts, and this is Part 1.

01

A Simple CUDA Implementation of Self-Attention

The principle of self-attention is well known and has been analyzed at length in earlier articles, so it is not introduced again here; we only walk through the code.

1. CPU version

The following is the baseline CPU implementation; a few notes on it first:

• The input inp holds the QKV values obtained by multiplying x with QKV_weight. For b (batch size), t (sequence length) and h (head), the q (query_t) values are indexed as inp[b,t,h*hs:(h+1)*hs]; the k (key_t2) values are simply offset by C from that, i.e. inp[b,t,h*hs+C:(h+1)*hs+C]

• With q and k available, the attention score is computed as a dot product; each score is scaled right after it is computed (and the running maximum is tracked for the softmax), and once a whole row is done the causal mask is applied

• A softmax is applied to obtain the attention weights

• The v (value_t2) values are indexed and matrix-multiplied with the attention weights

// CPU code reference
void attention_forward_cpu(float* out, float* preatt, float* att,
                           const float* inp,
                           int B, int T, int C, int NH) {
    // input is (B, T, 3C) Q,K,V
    // preatt, att are (B, NH, T, T)
    // output is (B, T, C)
    int C3 = C*3;
    int hs = C / NH; // head size
    float scale = 1.0 / sqrtf(hs);

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            for (int h = 0; h < NH; h++) {
                const float* query_t = inp + b * T * C3 + t * C3 + h * hs;
                float* preatt_bth = preatt + b*NH*T*T + h*T*T + t*T;
                float* att_bth = att + b*NH*T*T + h*T*T + t*T;

                // pass 1: calculate query dot key and maxval
                float maxval = -10000.0f; // TODO something better
                for (int t2 = 0; t2 <= t; t2++) {
                    const float* key_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C; // +C because it's key

                    // (query_t) dot (key_t2)
                    float val = 0.0f;
                    for (int i = 0; i < hs; i++) {
                        val += query_t[i] * key_t2[i];
                    }
                    val *= scale;
                    if (val > maxval) {
                        maxval = val;
                    }
                    preatt_bth[t2] = val;
                }

                // pad with -INFINITY outside of autoregressive region for debugging comparisons
                for (int t2 = t+1; t2 < T; t2++) {
                    preatt_bth[t2] = -INFINITY;
                }

                // pass 2: calculate the exp and keep track of sum
                float expsum = 0.0f;
                for (int t2 = 0; t2 <= t; t2++) {
                    float expv = expf(preatt_bth[t2] - maxval);
                    expsum += expv;
                    att_bth[t2] = expv;
                }
                float expsum_inv = expsum == 0.0f ? 0.0f : 1.0f / expsum;

                // pass 3: normalize to get the softmax
                for (int t2 = 0; t2 < T; t2++) {
                    if (t2 <= t) {
                        att_bth[t2] *= expsum_inv;
                    } else {
                        // causal attention mask. not strictly necessary to set to zero here
                        // only doing this explicitly for debugging and checking to PyTorch
                        att_bth[t2] = 0.0f;
                    }
                }

                // pass 4: accumulate weighted values into the output of attention
                float* out_bth = out + b * T * C + t * C + h * hs;
                for (int i = 0; i < hs; i++) { out_bth[i] = 0.0f; }
                for (int t2 = 0; t2 <= t; t2++) {
                    const float* value_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C*2; // +C*2 because it's value
                    float att_btht2 = att_bth[t2];
                    for (int i = 0; i < hs; i++) {
                        out_bth[i] += att_btht2 * value_t2[i];
                    }
                }
            }
        }
    }
}

2. Initial CUDA implementation (V1)

This version still follows the basic idea of the CPU version; only the computation is organized differently, split across 3 kernels:

• Step 1: compute the attention scores using a total of B*NH*T*T threads, i.e. each thread computes one value

// attention calculation
int total_threads = B * NH * T * T;
int num_blocks = ceil_div(total_threads, block_size);
attention_query_key_kernel1<<<num_blocks, block_size>>>(preatt, inp, B, T, C, NH);

The kernel is implemented as follows:

__global__ void attention_query_key_kernel1(float* preatt, const float* inp,
                                            int B, int T, int C, int NH) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total_threads = B * NH * T * T;

    if (idx < total_threads) {
        int t2 = idx % T;
        int t = (idx / T) % T;
        if (t2 > t) {
            // autoregressive mask
            preatt[idx] = -INFINITY;
            return;
        }
        int h = (idx / (T * T)) % NH;
        int b = idx / (NH * T * T);

        int C3 = C*3;
        int hs = C / NH; // head size
        const float* query_t = inp + b * T * C3 + t * C3 + h * hs;
        const float* key_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C; // +C because it's key

        // (query_t) dot (key_t2)
        float val = 0.0f;
        for (int i = 0; i < hs; i++) {
            val += query_t[i] * key_t2[i];
        }
        val *= 1.0 / sqrtf(hs);

        preatt[idx] = val;
    }
}

• Step 2: the softmax, which was already discussed in detail in the earlier operator-optimization posts, so it is not repeated here

__global__ void attention_softmax_kernel1(float* att, const float* preatt,
                                          int B, int T, int NH) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total_threads = B * T * NH;

    if (idx < total_threads) {
        int h = idx % NH;
        int t = (idx / NH) % T;
        int b = idx / (NH * T);

        const float* preatt_bth = preatt + b*NH*T*T + h*T*T + t*T;
        float* att_bth = att + b*NH*T*T + h*T*T + t*T;

        // find maxval
        float maxval = -10000.0f; // TODO something better
        for (int t2 = 0; t2 <= t; t2++) {
            if (preatt_bth[t2] > maxval) {
                maxval = preatt_bth[t2];
            }
        }

        // calculate the exp and keep track of sum
        float expsum = 0.0f;
        for (int t2 = 0; t2 <= t; t2++) {
            float expv = expf(preatt_bth[t2] - maxval);
            expsum += expv;
            att_bth[t2] = expv;
        }
        float expsum_inv = expsum == 0.0f ? 0.0f : 1.0f / expsum;

        // normalize to get the softmax
        for (int t2 = 0; t2 < T; t2++) {
            if (t2 <= t) {
                att_bth[t2] *= expsum_inv;
            } else {
                // causal attention mask. not strictly necessary to set to zero here
                // only doing this explicitly for debugging and checking to PyTorch
                att_bth[t2] = 0.0f;
            }
        }
    }
}

• Step 3: matrix-multiply the attention weights with v

__global__ void attention_value_kernel1(float* out, const float* att, const float* inp,
                                        int B, int T, int C, int NH) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total_threads = B * T * NH;

    if (idx < total_threads) {
        int h = idx % NH;
        int t = (idx / NH) % T;
        int b = idx / (NH * T);

        int C3 = C*3;
        int hs = C / NH; // head size

        float* out_bth = out + b * T * C + t * C + h * hs;
        const float* att_bth = att + b*NH*T*T + h*T*T + t*T;

        for (int i = 0; i < hs; i++) { out_bth[i] = 0.0f; }
        for (int t2 = 0; t2 <= t; t2++) {
            const float* value_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C*2; // +C*2 because it's value
            float att_btht2 = att_bth[t2];
            for (int i = 0; i < hs; i++) {
                out_bth[i] += att_btht2 * value_t2[i];
            }
        }
    }
}

This completes the most basic self-attention implementation. The performance data is as follows:

block_size   32 | time 238.912872 ms
block_size   64 | time 252.689301 ms
block_size  128 | time 246.945175 ms
block_size  256 | time 261.469421 ms
block_size  512 | time 241.190613 ms

3. A simple flash attention implementation (V2)

Flash attention is an extremely important optimization of self-attention that is designed around the GPU memory hierarchy.
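
For reference (this is the standard FlashAttention-1 recurrence, not extra material from the original post), the per-tile update that attention_forward_kernel2 further below performs keeps, for every output row, a running row maximum m and running denominator l, and rescales the partial output O whenever a new K/V tile is processed:

m_new = max(m_prev, m_tile)
l_new = exp(m_prev - m_new) * l_prev + exp(m_tile - m_new) * l_tile
O_new = (1 / l_new) * ( l_prev * exp(m_prev - m_new) * O_prev + exp(m_tile - m_new) * (P_tile @ V_tile) )

where m_tile and l_tile are the row max and row sum of exponentials over the current tile and P_tile = exp(S - m_tile); these correspond directly to row_m_new, row_l_new and the O update in the kernel.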

• First, initialize the key parameters

// these are hardcoded to 32 for now
const int Bc = 32;
const int Br = 32;
// renaming these to be consistent with the kernel
// const int B = B;
const int nh = NH;
const int N = T;
const int d = C / NH;
// more
const int Tc = ceil((float) N / Bc);
const int Tr = ceil((float) N / Br);
const float softmax_scale = 1.0 / sqrt(d);

• Then compute the SRAM needed per block, to make sure shared memory is not exceeded

// calculate SRAM size needed per block, ensure we have enough shared memory
int col_tile_size = Bc * d;  // size of Kj, Vj
int row_tile_size = Br * d;  // size of Qi
const int sram_size =
    (2 * col_tile_size * sizeof(float))  // SRAM size for Kj, Vj
    + (row_tile_size * sizeof(float))  // SRAM size for Qi
    + (Bc * Br * sizeof(float));  // SRAM size for S
int max_sram_size;
cudaDeviceGetAttribute(&max_sram_size, cudaDevAttrMaxSharedMemoryPerBlock, 0);
if (sram_size > max_sram_size) {
    printf("Max shared memory: %d, requested shared memory: %d \n", max_sram_size, sram_size);
    printf("SRAM size exceeds maximum shared memory per block\n");
    printf("Try decreasing col_tile_size or row_tile_size further\n");
    exit(1);
}
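
As a quick sanity check on these numbers (the concrete sizes here are illustrative assumptions, not values from the original benchmark): with Bc = Br = 32 and a head size of d = 64 (e.g. C = 768, NH = 12), col_tile_size = row_tile_size = 32 * 64 = 2048 floats, so sram_size = (2*2048 + 2048 + 32*32) * 4 bytes = 7168 * 4 = 28672 bytes ≈ 28 KB, which fits within the default 48 KB of shared memory per block; with N = 1024 this also gives Tc = Tr = 1024 / 32 = 32 tiles.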

• To avoid complicated indexing, reshape and permute operations inside the flash attention kernel, a separate kernel first performs these operations

__global__ void permute_kernel(float* q, float* k, float* v,
                               const float* inp,
                               int B, int N, int NH, int d) {
    // okay so now, this kernel wants Q,K,V to all be of shape (B, NH, N, d)
    // but instead, we have a single tensor QKV (inp) of shape (B, N, 3, NH, d)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Q[b][nh_][n][d_] = inp[b][n][0][nh_][d_]
    if (idx < B * NH * N * d) {
        int b = idx / (NH * N * d);
        int rest = idx % (NH * N * d);
        int nh_ = rest / (N * d);
        rest = rest % (N * d);
        int n = rest / d;
        int d_ = rest % d;

        int inp_idx = \
            (b * N * 3 * NH * d)
            +   (n * 3 * NH * d)
            +       (0 * NH * d)
            +          (nh_ * d)
            +                d_;

        q[idx] = inp[inp_idx];
        k[idx] = inp[inp_idx + NH * d];
        v[idx] = inp[inp_idx + 2 * (NH * d)];
    }
}

• Then comes the core step, the flash attention kernel itself, which implements the tiled update described above:

__global__ void attention_forward_kernel2(
    const float* Q,
    const float* K,
    const float* V,
    const int N,
    const int d,
    const int Tc,
    const int Tr,
    const int Bc,
    const int Br,
    const float softmax_scale,
    float* l,
    float* m,
    float* O
) {
    int tx = threadIdx.x;
    int bx = blockIdx.x; int by = blockIdx.y;  // batch and head index

    // Offset into Q,K,V,O,l,m - different for each batch and head
    int qkv_offset = (bx * gridDim.y * N * d) + (by * N * d);  // gridDim.y = nh
    int lm_offset = (bx * gridDim.y * N) + (by * N);  // offset for l and m

    // Define SRAM for Q,K,V,S
    extern __shared__ float sram[];
    int tile_size = Bc * d;  // size of Qi, Kj, Vj
    float* Qi = sram;
    float* Kj = &sram[tile_size];
    float* Vj = &sram[tile_size * 2];
    float* S = &sram[tile_size * 3];

    for (int j = 0; j < Tc; j++) {

        // Load Kj, Vj to SRAM
        for (int x = 0; x < d; x++) {
            Kj[(tx * d) + x] = K[qkv_offset + (tile_size * j) + (tx * d) + x];
            Vj[(tx * d) + x] = V[qkv_offset + (tile_size * j) + (tx * d) + x];
        }
        __syncthreads();  // such that the inner loop can use the correct Kj, Vj

        for (int i = 0; i < Tr; i++)  {
            // if past the end of the sequence, break
            if (i * Br + tx >= N) {
                break;
            }

            // Load Qi to SRAM, l and m to registers
            for (int x = 0; x < d; x++) {
                Qi[(tx * d) + x] = Q[qkv_offset + (tile_size * i) + (tx * d) + x];
            }
            float row_m_prev = m[lm_offset + (Br * i) + tx];
            float row_l_prev = l[lm_offset + (Br * i) + tx];

            // S = QK^T, row_m = rowmax(S)
            // S[tx][y] = Sum_{x = 0}^{d-1} {Qi[tx][x] * Kj[y][x]}
            // row_m = Max_{y = 0}^{Bc-1} S[tx][y]
            // with causal masking
            float row_m = -INFINITY;
            for (int y = 0; y < Bc; y++) {
                if (j * Bc + y >= N) {
                    break;
                }
                float sum = 0;
                for (int x = 0; x < d; x++) {
                    sum += Qi[(tx * d) + x] * Kj[(y * d) + x];
                }
                sum *= softmax_scale;
                if (i * Br + tx < j * Bc + y)
                    sum = -INFINITY;
                S[(Bc * tx) + y] = sum;

                if (sum > row_m)
                    row_m = sum;
            }

            // implement softmax with causal masking
            // P = exp(S - row_m), row_l = rowsum(P)
            // P[tx][y] = exp(S[tx][y] - row_m)
            float row_l = 0;
            for (int y = 0; y < Bc; y++) {
                if (j * Bc + y >= N) {
                    break;
                }
                if (i * Br + tx < j * Bc + y)
                    S[(Bc * tx) + y] = 0;
                else
                    S[(Bc * tx) + y] = __expf(S[(Bc * tx) + y] - row_m);
                row_l += S[(Bc * tx) + y];
            }

            // Compute new m and l
            float row_m_new = max(row_m_prev, row_m);
            float row_l_new = (__expf(row_m_prev - row_m_new) * row_l_prev) + (__expf(row_m - row_m_new) * row_l);

            // Write O, l, m to HBM
            for (int x = 0; x < d; x++) {
                float pv = 0;  // Pij * Vj
                for (int y = 0; y < Bc; y++) {
                    if (j * Bc + y >= N) {
                        break;
                    }
                    pv += S[(Bc * tx) + y] * Vj[(y * d) + x];
                }
                O[qkv_offset + (tile_size * i) + (tx * d) + x] = (1 / row_l_new) \
                    * ((row_l_prev * __expf(row_m_prev - row_m_new) * O[qkv_offset + (tile_size * i) + (tx * d) + x]) \
                    + (__expf(row_m - row_m_new) * pv));
            }
            m[lm_offset + (Br * i) + tx] = row_m_new;
            l[lm_offset + (Br * i) + tx] = row_l_new;
        }
        __syncthreads();  // otherwise, thread can use the wrong Kj, Vj in inner loop
    }
}

• After the computation above, an unpermute operation is still required, as follows:

__global__ void unpermute_kernel(const float* inp, float *out, int B, int N, int NH, int d) {
    // out has shape (B, nh, N, d) but we need to unpermute it to (B, N, nh, d)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // out[b][n][nh_][d_] <- inp[b][nh_][n][d_]
    if (idx < B * NH * N * d) {
        int b = idx / (NH * N * d);
        int rest = idx % (NH * N * d);
        int nh_ = rest / (N * d);
        rest = rest % (N * d);
        int n = rest / d;
        int d_ = rest % d;

        int other_idx = (b * NH * N * d) + (n * NH * d) + (nh_ * d) + d_;
        out[other_idx] = inp[idx];
    }
}

This completes a simple forward pass of flash attention 1. Its performance is actually worse than V1, mainly because the problem size is small; the data is as follows:

block_size   32 | time 536.709961 ms
block_size   64 | time 526.100098 ms
block_size  128 | time 583.016235 ms
block_size  256 | time 573.955994 ms
block_size  512 | time 534.477051 ms

02

Efficient Implementations of Self-Attention

1. Using cuBLAS library functions (V3)

In the previous implementations every operation was written by hand; the results are correct, but performance is still far behind the official libraries. In this section, the matrix multiplications in self-attention are therefore implemented with the official cuBLAS library.

Only the two matrix multiplications are shown here. First, q@k.T:

// batched matrix multiply with cuBLAS
const float alpha = 1.0f;
const float beta = 0.0f;
cublasCheck(cublasSgemmStridedBatched(cublas_handle,
                                      CUBLAS_OP_T, CUBLAS_OP_N,
                                      T, T, HS,
                                      &alpha,
                                      k, HS, T * HS,
                                      q, HS, T * HS,
                                      &beta,
                                      preatt, T, T * T,
                                      B * NH));
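
A brief note on the argument order (this is my reading of the call; the original text does not spell it out): cuBLAS assumes column-major storage, so a row-major (T, HS) head of q or k is seen by cuBLAS as an (HS, T) column-major matrix with leading dimension HS. With CUBLAS_OP_T applied to k and CUBLAS_OP_N applied to q, each of the B*NH batched GEMMs computes

preatt[b,h,t,t2] = sum_{i=0}^{HS-1} q[b,h,t,i] * k[b,h,t2,i]

which is exactly q @ k.T written into preatt in row-major order; the strides T*HS and T*T advance the pointers from one (batch, head) pair to the next.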

Then att@v:

// new approach: first cuBLAS another batched matmul
// y = att @ v # (B, nh, T, T) @ (B, nh, T, hs) -> (B, nh, T, hs)
cublasCheck(cublasSgemmStridedBatched(cublas_handle,
                                      CUBLAS_OP_N, CUBLAS_OP_N,
                                      HS, T, T,
                                      &alpha,
                                      v, HS, T * HS,
                                      att, T, T * T,
                                      &beta,
                                      vaccum, HS, T * HS,
                                      B * NH));

Compared with V1, performance improves by roughly a factor of one hundred or more; the data is as follows:

block_size   32 | time 4.318913 ms
block_size   64 | time 2.606850 ms
block_size  128 | time 2.034935 ms
block_size  256 | time 2.031407 ms
block_size  512 | time 2.064406 ms

2. Operator fusion and online softmax (V4)

On top of V3, an online softmax is used and the scale operation is fused into it, as follows:
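
The kernel relies on the standard online-softmax recurrence: scanning a row once, the running maximum m and running denominator d are updated for each new element x_i as

m_i = max(m_{i-1}, x_i)
d_i = d_{i-1} * exp(m_{i-1} - m_i) + exp(x_i - m_i)

so the max and the normalizer are obtained in a single pass (this is what the sumval *= expf(...) rescaling below implements), and the 1/sqrt(hs) attention scale is folded in as inv_temperature.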

__global__ void softmax_forward_kernel5(float* out, float inv_temperature, const float* inp, int N, int T) {
    // inp, out shape: (N, T, T), where N = B * NH
    // fuses the multiplication by scale inside attention
    // directly autoregressive, so we only compute the lower triangular part
    // uses the online softmax algorithm
    assert(T % 4  == 0);
    namespace cg = cooperative_groups;
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    int idx = blockIdx.x * warp.meta_group_size() + warp.meta_group_rank();
    if(idx >= N * T) {
        return;
    }
    int own_pos = idx % T;
    int pos_by_4 = own_pos / 4;

    // one row of inp, i.e. inp[idx, :] of shape (T,)
    const float* x = inp + idx * T;

    // not INF, so we don't get NaNs accidentally when subtracting two values.
    float maxval = -FLT_MAX;
    float sumval = 0.0f;

    const float4* x_vec = reinterpret_cast<const float4*>(x);
    for (int i = warp.thread_rank(); i < pos_by_4; i += warp.size()) {
        float4 v = x_vec[i];
        float old_maxval = maxval;
        for(int k = 0; k < 4; ++k) {
            maxval = fmaxf(maxval, vec_at(v, k));
        }
        sumval *= expf(inv_temperature * (old_maxval - maxval));
        for(int k = 0; k < 4; ++k) {
            sumval += expf(inv_temperature * (vec_at(v, k) - maxval));
        }
    }

    if(4*pos_by_4 + warp.thread_rank() <= own_pos) {
        float old_maxval = maxval;
        maxval = fmaxf(maxval, x[4*pos_by_4 + warp.thread_rank()]);
        sumval *= expf(inv_temperature * (old_maxval - maxval));
        sumval += expf(inv_temperature * (x[4*pos_by_4 + warp.thread_rank()] - maxval));
    }

    float global_maxval = cg::reduce(warp, maxval, cg::greater<float>{});
    sumval *= expf(inv_temperature * (maxval - global_maxval));

    float sum = cg::reduce(warp, sumval, cg::plus<float>{});
    float norm = 1.f / sum;

    // divide the whole row by the sum
    for (int i = warp.thread_rank(); i <= own_pos; i += warp.size()) {
        // recalculation is faster than doing the round-trip through memory.
        float ev = expf(inv_temperature * (__ldcs(x + i) - global_maxval));
        __stcs(out + idx * T + i, ev * norm);
    }
}

The remaining operations are unchanged; performance improves slightly, as follows:

block_size   32 | time 1.198167 ms
block_size   64 | time 1.073088 ms
block_size  128 | time 1.042434 ms
block_size  256 | time 1.041798 ms
block_size  512 | time 1.044009 ms

3. Using FP16 for matrix operations (V5)

FP32<->FP16 type conversions are performed in the permute/unpermute stage, as follows:

if (!skip_permute || first_run_validation) {
    permute_kernel_lowp<<<num_blocks, block_size>>>(q, k, v, inp, B, T, NH, HS);
}
...
if (!skip_permute || first_run_validation) {
    unpermute_kernel_lowp<<<num_blocks, block_size>>>(vaccum, out, B, T, NH, HS);
}
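
The permute_kernel_lowp / unpermute_kernel_lowp kernels themselves are not reproduced in the original text. As a minimal sketch of the idea (assuming the low-precision type is __half and the layout matches the FP32 permute_kernel shown earlier; the actual kernel may differ), the FP32->FP16 conversion can be fused directly into the permute:

#include <cuda_fp16.h>

// hypothetical sketch: FP32 -> FP16 conversion fused into the permute
// (B, N, 3, NH, d) -> three (B, NH, N, d) tensors in half precision
__global__ void permute_kernel_lowp_sketch(__half* q, __half* k, __half* v,
                                           const float* inp,
                                           int B, int N, int NH, int d) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < B * NH * N * d) {
        int b = idx / (NH * N * d);
        int rest = idx % (NH * N * d);
        int nh_ = rest / (N * d);
        rest = rest % (N * d);
        int n = rest / d;
        int d_ = rest % d;

        int inp_idx = (b * N * 3 * NH * d) + (n * 3 * NH * d) + (nh_ * d) + d_;
        // convert on the fly while permuting
        q[idx] = __float2half(inp[inp_idx]);
        k[idx] = __float2half(inp[inp_idx + NH * d]);
        v[idx] = __float2half(inp[inp_idx + 2 * (NH * d)]);
    }
}

The matching unpermute_kernel_lowp would do the reverse: read __half values laid out as (B, NH, N, d) and write them back as FP32 in (B, N, NH, d) using __half2float.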

The performance data is as follows:

block_size   32 | time 0.866851 ms
block_size   64 | time 0.743674 ms
block_size  128 | time 0.703196 ms
block_size  256 | time 0.713902 ms
block_size  512 | time 0.712848 ms

A comparison of all the methods above is shown below; note that the axis in the original chart is logarithmic, and the compute device is an A100-80G.
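
Pulling the best time for each version out of the block_size sweeps above (best block_size in parentheses):

V1  naive 3-kernel implementation   ~238.9 ms  (block_size 32)
V2  simple flash attention          ~526.1 ms  (block_size 64)
V3  cuBLAS batched matmuls          ~2.031 ms  (block_size 256)
V4  fused scale + online softmax    ~1.042 ms  (block_size 256)
V5  FP16 permute/unpermute          ~0.703 ms  (block_size 128)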
