[Project Log] Deploying an LLM with llama.cpp on qemu-riscv64 using the RISC-V Vector Extension

Overview

This post runs an LLM built with llama.cpp on the qemu-riscv64 platform, using the RISC-V Vector (RVV) extension for acceleration.
Reference blog posts:
Accelerating llama.cpp with RISC-V Vector Extension
Demo of RVV-accelerated llama.cpp on the Banana Pi F3 RISC-V development board

The llama.cpp project

llama.cpp is a high-performance large-model inference framework written in C/C++. It focuses on fast, dependency-light inference on commodity hardware, supporting a wide range of quantization formats and architecture-specific SIMD kernels (AVX on x86, NEON on Arm, RVV on RISC-V, and so on).

Directory layout

src — core library sources implementing the model architectures
examples — sources for the bundled example programs
ggml — the tensor/compute-operation library

Source-code walkthrough

The following reference may help (note: some function names have changed since it was written):
CodeLeaner (WeChat official account): llama.cpp source-code analysis

The llama example

// llama  @examples/main/main.cpp
main()
  gpt_params_parse(argc, argv, params, LLAMA_EXAMPLE_MAIN, print_usage)  // parse the command-line model parameters
  llama_init_from_gpt_params()
    llama_load_model_from_file(params.model.c_str(), mparams);           // load the model weights and parameters
    llama_new_context_with_model(model, cparams);
      ggml_backend_cpu_init();                                           // *cpu_backend — pointer to cpu_backend_i
  llama_tokenize(ctx, prompt, true, true)                                // tokenize the prompt
  while                                                                  // token-generation loop
    llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))  // generate the next token
    llama_token_to_piece(ctx, id, params.special)
  gpt_perf_print(ctx, smpl);                                             // print performance results
llama.cpp model library

// body of the decode function  @src/llama.cpp
int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch)
  llama_decode_internal(*ctx, batch)
    while (lctx.sbatch.n_tokens > 0)
      llama_build_graph(lctx, ubatch, false);
      llama_graph_compute(lctx, gf, n_threads, threadpool);
        ggml_backend_cpu_set_n_threads(lctx.backend_cpu, n_threads);
        ggml_backend_cpu_set_threadpool(lctx.backend_cpu, threadpool);
        ggml_backend_cpu_set_abort_callback(lctx.backend_cpu, lctx.abort_callback, lctx.abort_callback_data);
        ggml_backend_sched_graph_compute_async(lctx.sched, gf);
ggml compute library functions
// backend compute-graph entry point  @ggml/src/ggml-backend.c
enum ggml_status ggml_backend_sched_graph_compute_async(ggml_backend_sched_t sched, struct ggml_cgraph * graph)
  ggml_backend_sched_alloc_graph(sched, graph)
  ggml_backend_sched_split_graph(sched, graph);  // splits the graph and dispatches it to cpu_backend, which in turn calls ggml_graph_compute

// main dispatcher that selects the per-op compute kernels  @ggml/src/ggml.c
enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan)
  ggml_graph_compute_thread(&threadpool->workers[0])
    ggml_compute_forward(&params, node);
      ggml_compute_forward_dup(params, tensor);
      ggml_compute_forward_add1(params, tensor);
      ggml_compute_forward_repeat(params, tensor);
      ggml_compute_forward_mul_mat(params, tensor);
      ggml_compute_forward_soft_max(params, tensor);
      ggml_compute_forward_rms_norm(params, tensor);

Structs

// CPU-backend dataflow struct: the compute entry points are stored as function-pointer members.
cpu_backend_i  // @ggml/src/ggml-backend.c
  ggml_backend_cpu_graph_plan_create()
    ggml_graph_plan()
  ggml_backend_cpu_graph_plan_compute()
    ggml_graph_compute()
  ggml_backend_cpu_graph_compute()
    ggml_graph_plan()
    ggml_graph_compute()

Meaning of the llama.cpp quantization formats (Qn_0, Qn_1, etc.)

sgsprog@hackmd.io: Linux kernel term project: llama.cpp performance analysis
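
For orientation before diving into the port: the Qn_0 formats store one FP16 scale per 32-weight block, while the Qn_1 formats add a per-block minimum (offset). Below is a sketch of the corresponding ggml block structs, modeled on upstream ggml; exact field types and names vary by version:

// Sketch of the ggml quantization block layouts (per upstream ggml; may vary by version).
#include <stdint.h>

typedef uint16_t ggml_fp16_t;      // IEEE-754 half, stored as raw bits

#define QK4_0 32
typedef struct {
    ggml_fp16_t d;                 // scale:  x ≈ d * (q - 8)
    uint8_t     qs[QK4_0 / 2];     // 32 x 4-bit quants, packed two per byte
} block_q4_0;

#define QK4_1 32
typedef struct {
    ggml_fp16_t d;                 // scale
    ggml_fp16_t m;                 // per-block minimum:  x ≈ d * q + m
    uint8_t     qs[QK4_1 / 2];     // 32 x 4-bit quants, packed two per byte
} block_q4_1;

#define QK8_0 32
typedef struct {
    ggml_fp16_t d;                 // scale:  x ≈ d * q
    int8_t      qs[QK8_0];         // 32 signed 8-bit quants
} block_q8_0;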

Analysis of the RVV port

Related GitHub links:
Using GGML's RVV support in llama.cpp (1)
Using GGML's RVV support in llama.cpp (2)
Tameem-10xE@llama.cpp Github:Added RISC-V Vector Intrinsics Support

The ported code initially lived in ggml.c and was later moved to ggml-quants.c.

Twelve functions were modified (a sketch of the RVV pattern they share follows the list):
quantize_row_q8_0
quantize_row_q8_1
ggml_vec_dot_q4_0_q8_0
ggml_vec_dot_q4_1_q8_1
ggml_vec_dot_q5_0_q8_0
ggml_vec_dot_q5_1_q8_1
ggml_vec_dot_q8_0_q8_0
ggml_vec_dot_q2_K_q8_K
ggml_vec_dot_q3_K_q8_K
ggml_vec_dot_q4_K_q8_K
ggml_vec_dot_q5_K_q8_K
ggml_vec_dot_q6_K_q8_K
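
All twelve kernels follow the same pattern: load one quantization block per operand, do a widening integer multiply, reduce to a scalar, and scale by the per-block FP16 scales. Below is a simplified sketch of the Q8_0 × Q8_0 dot product, modeled on the upstream RVV path in ggml-quants.c. The FP16 scale handling is simplified to a plain float here, and the real ggml_vec_dot_q8_0_q8_0 takes additional stride parameters:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

#define QK8_0 32
typedef struct {
    float  d;            // per-block scale (upstream stores raw FP16 bits; float keeps the sketch simple)
    int8_t qs[QK8_0];    // 32 signed 8-bit quants
} block_q8_0_sketch;

// Hypothetical simplified signature: the real function also takes byte strides.
static void vec_dot_q8_0_q8_0_rvv(int n, float * s,
                                  const block_q8_0_sketch * x, const block_q8_0_sketch * y) {
    const int nb = n / QK8_0;                  // number of 32-element blocks
    float sumf = 0.0f;
    // With vlen=256 and SEW=8, LMUL=1, one 32-element block fits in a single vector register.
    size_t vl = __riscv_vsetvl_e8m1(QK8_0);
    for (int i = 0; i < nb; i++) {
        vint8m1_t  bx    = __riscv_vle8_v_i8m1(x[i].qs, vl);                 // load quants of x
        vint8m1_t  by    = __riscv_vle8_v_i8m1(y[i].qs, vl);                 // load quants of y
        vint16m2_t prod  = __riscv_vwmul_vv_i16m2(bx, by, vl);               // widening multiply: int8*int8 -> int16
        vint32m1_t zero  = __riscv_vmv_v_x_i32m1(0, vl);
        vint32m1_t acc   = __riscv_vwredsum_vs_i16m2_i32m1(prod, zero, vl);  // widening reduce-sum -> int32
        int32_t sumi = __riscv_vmv_x_s_i32m1_i32(acc);
        sumf += sumi * x[i].d * y[i].d;                                      // apply both per-block scales
    }
    *s = sumf;
}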

These functions are registered as members of the per-quantization-type traits table:

static const ggml_type_traits_t type_traits[GGML_TYPE_COUNT] = {
    [GGML_TYPE_Q8_0] = {
        .type_name                = "q8_0",
        .blck_size                = QK8_0,
        .type_size                = sizeof(block_q8_0),
        .is_quantized             = true,
        .to_float                 = (ggml_to_float_t) dequantize_row_q8_0,
        .from_float               = quantize_row_q8_0,
        .from_float_ref           = (ggml_from_float_t) quantize_row_q8_0_ref,
        .from_float_to_mat        = quantize_mat_q8_0,
        .vec_dot                  = ggml_vec_dot_q8_0_q8_0,
        .vec_dot_type             = GGML_TYPE_Q8_0,
        ...
    },
    ...
}

and are invoked from the various ggml operator kernels:

static void ggml_compute_forward_mul_mat_one_chunk(...)
    ggml_vec_dot_t const vec_dot = type_traits[type].vec_dot;
    for (int64_t iir1 = ir1_start; iir1 < ir1_end; iir1 += blck_1) {
        for (int64_t iir0 = ir0_start; iir0 < ir0_end; iir0 += blck_0) {
            for (int64_t ir1 = iir1; ir1 < iir1 + blck_1 && ir1 < ir1_end; ir1 += num_rows_per_vec_dot) {
                ...
                for (int64_t ir0 = iir0; ir0 < iir0 + blck_0 && ir0 < ir0_end; ir0 += num_rows_per_vec_dot) {
                    vec_dot(ne00, &tmp[ir0 - iir0], (num_rows_per_vec_dot > 1 ? 16 : 0),
                            src0_row + ir0 * nb01, (num_rows_per_vec_dot > 1 ? nb01 : 0),
                            src1_col, (num_rows_per_vec_dot > 1 ? src1_col_stride : 0),
                            num_rows_per_vec_dot);
                }
                ...
            }
        }
    }
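
In this hot loop, vec_dot is whatever function the traits table registered for the tensor's quantization type, so when the build enables RVV (-march=rv64gcv) every quantized matrix multiplication in the graph dispatches into the vectorized kernels listed above.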

2024/10/02: Toolchain ready, but the qemu run gets killed

Tool versions

QEMU: (version screenshot omitted)
GCC version:
Github Release

llama.cpp:
llama.cpp GitHub, pulled on Oct 2

llama-7b model version:
Hugging Face GGUF file

Build

Building llama.cpp

cd llama.cpp
make RISCV_CROSS_COMPILE=1
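
RISCV_CROSS_COMPILE=1 selects the riscv64-unknown-linux-gnu cross toolchain and, as described in the 2024/10/05 entry below, builds with the RVV-enabled flags (-march=rv64gcv -mabi=lp64d).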

Run command

qemu-riscv64 -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-server -m /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf -p "Anything" -n 9
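
Here -L points qemu at the RISC-V sysroot for the dynamic linker and shared libraries, and the -cpu properties enable the vector extension: v=true turns RVV on, vlen=256 sets the vector register length to 256 bits, elen=64 allows element widths up to 64 bits, and vext_spec=v1.0 selects version 1.0 of the vector spec.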

Problem

Observed behavior

(terminal screenshots omitted; the run ends with the process being killed)

Possible cause

The memory available to the run may be too small.

2024/10/03: The 10xE team's latest version fixes the tokenizer problem, but the run is still killed

Latest version on GitHub

Tameem-10xE/llama.cpp Github

Problem: running the 7B model gets killed

Observed behavior

(screenshots omitted; loading the 7B model is killed)

Possible cause

It may be related to the swap file available to the qemu process.
See the issue filed by a team member:
Github Issue: qemu-riscv64 unexpectedly reached EOF error

Proposed fix

First try a smaller model; if that does not help, address the swap-file problem.

Running a 3B model: failed to allocate buffer of size

kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp$ qemu-riscv64  -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot  -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf -p "Anything" -n 9
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 27
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:   59 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  137 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_S mix - 3.66 bpw
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.48 GiB (3.96 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors:        CPU buffer size =  1518.09 MiB
.....................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 15032385568
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model '/home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf'
main: error: unable to load model

Possible cause

llama.cpp Github issue: Bug: ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 137438953504
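
A quick size check supports this: without -c, this build defaults n_ctx to the model's training context (131072, as shown in the log above), and the f16 KV cache then needs n_ctx × n_layer × (n_embd_k_gqa + n_embd_v_gqa) × 2 bytes = 131072 × 28 × (1024 + 1024) × 2 ≈ 15,032,385,536 bytes — which matches the failed 15032385568-byte allocation up to alignment overhead. Shrinking the context is therefore the natural fix.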

Fix

Try adjusting the runtime parameters:
llama.cpp GitHub parameter documentation

Successful run after reducing the context size (-c)

kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp$ qemu-riscv64  -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot  -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf -p "Anything" -n 9 -c 50
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 27
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:   59 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  137 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_S mix - 3.66 bpw
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.48 GiB (3.96 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors:        CPU buffer size =  1518.09 MiB
.....................................................................
llama_new_context_with_model: n_ctx      = 64
llama_new_context_with_model: n_batch    = 64
llama_new_context_with_model: n_ubatch   = 64
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     7.00 MiB
llama_new_context_with_model: KV self size  =    7.00 MiB, K (f16):    3.50 MiB, V (f16):    3.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =    32.06 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling seed: 2622092847
sampling params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler constr: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 64, n_batch = 2048, n_predict = 9, n_keep = 1

Anything else?
Yes. You can also consider the
llama_perf_print:    sampling time =      10.96 ms /    11 runs   (    1.00 ms per token,  1003.37 tokens per second)
llama_perf_print:        load time =   73770.30 ms
llama_perf_print: prompt eval time =   15435.63 ms /     2 tokens ( 7717.81 ms per token,     0.13 tokens per second)
llama_perf_print:        eval time =   74267.24 ms /     8 runs   ( 9283.41 ms per token,     0.11 tokens per second)
llama_perf_print:       total time =   89764.85 ms /    10 tokens
Log end

2024/10/05: Building a RISC-V llama.cpp without vector support, for a comparison experiment

Building llama.cpp

make CC="riscv64-unknown-linux-gnu-gcc -march=rv64gc -mabi=lp64d" CXX="riscv64-unknown-linux-gnu-g++ -march=rv64gc -mabi=lp64d"

Error: riscv64-unknown-linux-gnu-g++: error: ‘-march=native’: ISA string must begin with rv32 or rv64

Solution:
GCC GitHub:riscv64-unknown-linux-gnu-g++: error: ‘-march=native’: ISA string must begin with rv32 or rv64

That is, change lines 523 and 524 of the llama.cpp Makefile to compile for rv64gc; this is the section that originally enabled RVV for the cross build, as shown below:

(screenshot of the Makefile omitted)
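
The change amounts to dropping the v (vector) letter from the ISA string in those two lines. A sketch of the fragment — variable names and line numbers vary with the llama.cpp revision:

# before: RVV enabled for the RISC-V cross build
MK_CFLAGS   += -march=rv64gcv -mabi=lp64d
MK_CXXFLAGS += -march=rv64gcv -mabi=lp64d

# after: scalar-only baseline
MK_CFLAGS   += -march=rv64gc -mabi=lp64d
MK_CXXFLAGS += -march=rv64gc -mabi=lp64d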

Build command after the fix

make RISCV_CROSS_COMPILE=1

Run results

kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp.rv64gc$ qemu-riscv64  -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot  -cpu rv64 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf -p "Anything" -n 100 -c 1024
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama_codellama-7b-hf
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1686 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name     = codellama_codellama-7b-hf
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: PRE token        = 32007 '▁<PRE>'
llm_load_print_meta: SUF token        = 32008 '▁<SUF>'
llm_load_print_meta: MID token        = 32009 '▁<MID>'
llm_load_print_meta: EOT token        = 32010 '▁<EOT>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3891.33 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    98.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling seed: 3630221793
sampling params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler constr: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 1024, n_batch = 2048, n_predict = 100, n_keep = 1

Anything is possible if you believe it. The only thing you are not capable of doing is believing in anything.
- John L. Sullivan- To be honest, I've never had a problem with the fact that I'm a bad person and I'm a horrible human. I've always been okay with that.
- I think it's a mistake to think of yourself as being in the middle of the world, instead of at the
llama_perf_print:    sampling time =      94.53 ms /   104 runs   (    0.91 ms per token,  1100.12 tokens per second)
llama_perf_print:        load time =   44296.57 ms
llama_perf_print: prompt eval time =   79786.77 ms /     4 tokens (19946.69 ms per token,     0.05 tokens per second)
llama_perf_print:        eval time = 2360379.06 ms /    99 runs   (23842.21 ms per token,     0.04 tokens per second)
llama_perf_print:       total time = 2440911.31 ms /   103 tokens
Log end
