Converting Llama3 to GGUF format, 4-bit quantization, and GPU-accelerated inference with llama.cpp (Hygon DCU)

Important note: this article is compiled from material found online and only records the author's process of learning the relevant topics. It will be taken down on request if it infringes any rights.

Preface

This article uses the llama.cpp framework to convert the Llama3-8B-Instruct model to GGUF format, apply 8-bit quantization, and run inference with the 8-bit model on both CPU and GPU.

Test platform: Sugon Supercomputing Internet Platform (SCNet)

GPU/DCU: "heterogeneous accelerator card AI", 64 GB VRAM, PCIe (a GPU based on the ROCm platform)

For the detailed configuration of the test server, see: SCNet and the domestic DCU heterogeneous accelerator card.

I. References

llama.cpp repository: https://github.com/ggerganov/llama.cpp

Tutorial: How to convert HuggingFace model to GGUF format #2948

llamacpp_zh

Using llama.cpp for LLM format conversion, quantization, inference, and deployment

[Study notes]: Deploying large models on Ubuntu 22 with the quantization tool llama.cpp (CPU + GPU)

Llama3 fine-tuning tutorial: installing and deploying llama factory, the fine-tuning workflow, model quantization, and GGUF conversion

II. About llama.cpp

1. Introduction to llama.cpp

llama.cpp is a C++ library for running inference with large language models (LLMs) efficiently, locally or in the cloud. It is a pure C/C++ implementation with no external dependencies, and it provides AVX, AVX2, and AVX512 acceleration on x86 architectures. It also offers 2-, 3-, 4-, 5-, 6-, and 8-bit quantization to speed up inference and reduce memory usage. For models larger than the total available VRAM, the library supports mixed CPU+GPU inference for partial acceleration.


Compared with traditional Python-based implementations, llama.cpp runs directly in a C/C++ environment, removing the dependency on an interpreter, which can improve performance and reduce resource consumption. llama.cpp is also cross-platform: it can be built and run on many operating systems, including but not limited to macOS, Linux, and Windows, and it can be deployed in Docker containers.


2. Advantages of llama.cpp

Choosing llama.cpp as an LLM inference platform has several notable advantages:

  • Dependency-free implementation: llama.cpp does not depend on Python, PyTorch, TensorFlow, or similar frameworks; it runs directly in a C/C++ environment, reducing complexity and potential performance bottlenecks.
  • Cross-platform support: from Apple silicon to a wide range of GPUs and CPUs, llama.cpp is optimized for many kinds of hardware, delivering strong performance across systems.
  • Flexible performance configuration: users can quantize models at different bit widths (1.5-bit to 8-bit), which helps reduce memory usage while maintaining inference speed.
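The bit widths above trade memory for precision. As a minimal sketch of the underlying idea, here is block-wise symmetric 8-bit quantization in the spirit of Q8_0-style formats, in plain Python (an illustration only, not llama.cpp's actual kernel):

```python
def quantize_q8_block(block):
    """Symmetric 8-bit quantization of one block of weights:
    store int8 codes plus a single float scale per block."""
    scale = max(abs(x) for x in block) / 127.0
    if scale == 0.0:          # all-zero block: any scale reproduces it
        scale = 1.0
    q = [max(-127, min(127, round(x / scale))) for x in block]
    return q, scale

def dequantize_q8_block(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.75]
q, scale = quantize_q8_block(weights)
approx = dequantize_q8_block(q, scale)
# Each 32-bit float becomes one int8 code (plus a shared scale), roughly a
# 4x size reduction per block, at the cost of a small rounding error.
```

The per-block scale is what lets one byte per weight cover very different value ranges across a tensor; smaller blocks give better accuracy but more scale overhead.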

3. Goals of llama.cpp

Since its release, llama.cpp has racked up 63.2k stars on GitHub (as of August 8, 2024), even more dramatic than stable diffusion, a veritable "star rocket". Behind this is llama.cpp's focus on "AI at the edge", where "edge" is understood in contrast to the cloud: a personal laptop, a gaming PC, a phone, or even a Raspberry Pi all count as edge devices.

4. The GGUF format

Exploring GGUF: running large language models efficiently with llama.cpp

4.1 Introduction

In the field of artificial intelligence, large language models are advancing rapidly, demonstrating unprecedented capabilities in natural language processing, machine translation, intelligent assistants, and many other areas. As model sizes keep growing, however, these enormous neural networks face a series of challenges in storage, transfer, and loading. Traditional file formats struggle with such large volumes of data: they are inefficient, and their compatibility and extensibility cannot keep up with growing demands.

Against this background, developer Georgi Gerganov proposed the GGUF format, which stores models compactly, reducing model size and memory footprint and thereby improving inference speed and efficiency.

4.2 Introduction to GGUF

GGUF (Georgi Gerganov's Universal Format) is an innovative model file format introduced in the llama.cpp project. It is a binary format designed specifically for large language models, aimed at solving the storage-efficiency, loading-speed, compatibility, and extensibility problems that large models face in practice. By optimizing its data structures and encoding, GGUF significantly improves the storage efficiency of model files while guaranteeing fast loading. Its design also accounts for cross-platform and cross-framework compatibility, so models can run seamlessly across different hardware and software environments, which has greatly promoted the adoption and further development of large models. Today the GGUF format is widely used for deploying and sharing all kinds of large models, and it is especially popular in open-source communities such as Hugging Face.

For more information about GGUF, see: 2398#issuecomment-1682837610.

4.3 Advantages of GGUF

The main characteristics and advantages of GGUF-format models in practice include:

  • Efficient storage: the GGUF format optimizes how data is stored, reducing the storage footprint, which matters especially for large models.
  • Fast loading: GGUF supports fast loading of model data, which is valuable for applications that need immediate responses, such as online chatbots or real-time translation systems.
  • Efficient inference: GGUF optimizes model data for faster load times and inference speed, which is critical for applications that must respond quickly.
  • Memory optimization: through carefully designed data structures and storage schemes, GGUF reduces a model's runtime memory footprint, making it possible to deploy large language models on resource-constrained devices.
  • Complex tokenization support: GGUF supports complex tokenization, including the recognition and handling of special tokens, helping models understand and generate text more accurately.
  • Cross-platform compatibility: as a unified format, GGUF model files can be used across many kinds of hardware and operating systems, ensuring broad applicability.
  • Flexibility and extensibility: the GGUF design anticipates future extension and can accommodate the needs of different language models, including custom vocabularies and special operations.
  • Quantization support: GGUF supports multiple quantization techniques, allowing models to run at different precision levels and balancing performance against model size.

Through these innovations, the GGUF format has become a key factor in llama.cpp's ability to run large language models efficiently, giving developers a powerful tool for deploying and using advanced natural-language-processing capabilities in a wide range of environments.
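The file layout behind these properties starts with a small fixed-size header. Per the GGUF specification, a file begins with the 4-byte magic "GGUF", a uint32 format version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian; metadata and tensor descriptors follow. A short sketch of parsing just that header (the header bytes below are synthetic so the example runs without a real model file):

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF header: magic, version, tensor count, KV count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "kv_count": n_kv}

# Synthetic 24-byte header; for a real model, read the first 24 bytes
# of the .gguf file instead.
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 19)
info = read_gguf_header(header)
```

Everything after these 24 bytes (metadata values, tensor names, shapes, quantization types, offsets) is self-describing, which is what gives GGUF its extensibility.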

5. GGML

ggml.ai website: http://ggml.ai/

ggml repository: https://github.com/ggerganov/ggml

llama.cpp repository: https://github.com/ggerganov/llama.cpp

whisper.cpp repository: https://github.com/ggerganov/whisper.cpp

Breaking the seal: doubling LLM inference throughput with ggml.ai and llama.cpp

5.1 Introduction to ggml


5.2 Goals of ggml


6. llama-cpp-python

llama-cpp-python 代码仓库:https://github.com/abetlen/llama-cpp-python

llama-cpp-python 文档:https://llama-cpp-python.readthedocs.io/en/latest/

Installing llama-cpp-python with GPU Support

llama-cpp-python provides high-level Python bindings (a Python API) for llama.cpp.
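A minimal llama-cpp-python sketch (the GGUF filename below is a hypothetical placeholder; point it at your converted model; n_gpu_layers=-1 asks the backend to offload all layers to the GPU). The inference call is guarded so the snippet only loads the model if the file actually exists:

```python
from pathlib import Path

MODEL_PATH = "Meta-Llama-3-8B-Instruct-Q8_0.gguf"  # hypothetical filename

def llama_kwargs(model_path, n_gpu_layers=-1, n_ctx=4096):
    """Collect the Llama(...) constructor arguments in one place."""
    return {"model_path": model_path, "n_gpu_layers": n_gpu_layers, "n_ctx": n_ctx}

if Path(MODEL_PATH).exists():
    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(**llama_kwargs(MODEL_PATH))
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello, please introduce yourself."}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])
```

On a ROCm-based card like the DCU used here, GPU offload depends on llama-cpp-python being built against a GPU-enabled llama.cpp; otherwise the same code runs on CPU.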

III. Quick start with llama.cpp

Tested: tag=b3045 is confirmed to work.

llama.cpp/tree/b3045 代码仓库:https://github.com/ggerganov/llama.cpp/tree/b3045

1. Prepare the environment

Test environment, for reference only.

requirements.txt

accelerate==0.32.1
addict==2.4.0
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
apex @ https://cancon.hpccube.com:65024/directlink/4/apex/DAS1.0/apex-1.1.0+das1.0+0dd7f68.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=fdeb7c8a0b354a6a2faa61ae2055b2c2e7deb07bfa4aa7811068c5e02455ee1e
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.15.0
beautifulsoup4==4.12.3
bitsandbytes @ https://cancon.hpccube.com:65024/directlink/4/bitsandbytes/DAS1.0/bitsandbytes-0.37.0+das1.0+gitd3d888f.abi0.dtk2404.torch2.1-py3-none-any.whl#sha256=c46eb3f1555f2153424c3c0297e6645c0881cb76965cf5f3d11f77b52d80c19c
bleach==6.1.0
boltons @ file:///croot/boltons_1677628692245/work
brotlipy==0.7.0
certifi @ file:///croot/certifi_1707229174982/work/certifi
cffi @ file:///tmp/abs_98z5h56wf8/croots/recipe/cffi_1659598650955/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.1.7
coloredlogs==15.0.1
comm==0.2.2
conda-content-trust @ file:///tmp/abs_5952f1c8-355c-4855-ad2e-538535021ba5h26t22e5/croots/recipe/conda-content-trust_1658126371814/work
conda-package-handling @ file:///croot/conda-package-handling_1666940373510/work
contourpy==1.2.1
cryptography @ file:///croot/cryptography_1665612644927/work
cycler==0.12.1
datasets==2.19.2
debugpy==1.8.1
decorator==5.1.1
deepspeed @ https://cancon.hpccube.com:65024/directlink/4/deepspeed/DAS1.0/deepspeed-0.12.3+das1.0+gita724046.abi0.dtk2404.torch2.1.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=726d64f73ab2ed7bcd716dcb2af53bb3c790ab4a24180b1b9319e7a7ab2cc569
defusedxml==0.7.1
diffusers==0.29.2
dill==0.3.8
dnspython==2.6.1
einops==0.8.0
email_validator==2.1.1
exceptiongroup==1.2.1
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.19.1
filelock==3.14.0
fire==0.6.0
flash-attn @ https://cancon.hpccube.com:65024/directlink/4/flash_attn/DAS1.0/flash_attn-2.0.4+das1.0+82379d7.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=2facc1831d95b55bf1bca88c7f23163751f4c749e4f7fc9256d8311ddbb5d399
flatbuffers==24.3.25
fonttools==4.52.4
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.3.1
h11==0.14.0
hf_transfer==0.1.8
hjson==3.1.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface==0.0.1
huggingface-hub==0.24.5
humanfriendly==10.0
hypothesis==5.35.1
idna @ file:///croot/idna_1666125576474/work
importlib_metadata==7.1.0
invisible-watermark==0.2.0
ipykernel==6.29.4
ipython==8.24.0
ipywidgets==8.1.3
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
json5==0.9.25
jsonpatch @ file:///croot/jsonpatch_1714483231291/work
jsonpointer==2.1
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyter_ext_dataset==0.1.0
jupyter_ext_logo==0.1.0
jupyter_server==2.14.0
jupyter_server_terminals==0.5.3
jupyterlab==4.2.1
jupyterlab-language-pack-zh-CN==4.0.post6
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.2
jupyterlab_widgets==3.0.11
kiwisolver==1.4.5
lightop @ https://cancon.hpccube.com:65024/directlink/4/lightop/DAS1.0/lightop-0.3+das1.0+837dbb7.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=7f4eb1190a570c05a63a4aade326c87367c4e5ccf6ff82ad5e92220790817e5c
lmdeploy @ https://cancon.hpccube.com:65024/directlink/4/lmdeploy/DAS1.0/lmdeploy-0.1.0_das1.0+git782048c.abi0.dtk2404.torch2.1.-cp310-cp310-manylinux2014_x86_64.whl#sha256=499940e022de16b3f1211a52c2daa3a603b109a015487499c9e11a53c6d5ad2c
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.0.2
mmcv @ https://cancon.hpccube.com:65024/directlink/4/mmcv/DAS1.0/mmcv-2.0.1_das1.0+gitc0ccf15.abi0.dtk2404.torch2.1.-cp310-cp310-manylinux2014_x86_64.whl#sha256=4fc5ff39d232e5ca1efebf7cfdfcf9bc0675308cf40e5f17237c4f2eec66f210
mmengine==0.10.4
mmengine-lite==0.10.4
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.1
notebook_shim==0.2.4
numpy==1.24.3
onnxruntime @ https://cancon.hpccube.com:65024/directlink/4/onnxruntime/DAS1.0/onnxruntime-1.15.0+das1.0+gita9ca438.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=509446b41adb89e7507700482cb99e2c399ab3164bc9ea6d9a50e11f84a2406e
opencv-python==4.9.0.80
orjson==3.10.3
overrides==7.7.0
packaging @ file:///croot/packaging_1710807400464/work
pandas==2.2.2
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pillow==10.3.0
platformdirs==4.2.2
pluggy @ file:///tmp/build/80754af9/pluggy_1648024709248/work
prometheus_client==0.20.0
prompt_toolkit==3.0.45
protobuf==5.27.0
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==16.1.0
pyarrow-hotfix==0.6
pycosat @ file:///croot/pycosat_1666805502580/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==2.7.2
pydantic_core==2.18.3
Pygments==2.18.0
pynvml==11.5.0
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
pyparsing==3.1.2
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
pytz==2024.1
PyWavelets==1.6.0
PyYAML==6.0.1
pyzmq==26.0.3
ray==2.9.3
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rpds-py==0.18.1
ruamel.yaml @ file:///croot/ruamel.yaml_1666304550667/work
ruamel.yaml.clib @ file:///croot/ruamel.yaml.clib_1666302247304/work
safetensors==0.4.3
Send2Trash==1.8.3
sentencepiece==0.2.0
shellingham==1.5.4
six @ file:///tmp/build/80754af9/six_1644875935023/work
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
stack-data==0.6.3
starlette==0.37.2
sympy==1.12.1
termcolor==2.4.0
terminado==0.18.1
tiktoken==0.7.0
tinycss2==1.3.0
tokenizers==0.15.0
tomli==2.0.1
toolz @ file:///croot/toolz_1667464077321/work
torch @ https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.0/torch-2.1.0+das1.0+git00661e0.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=0b5f4be74ffdd6fe7540a844bf4f02e432b7d267b5e9fdd7f9448192d93bf3b6
torchaudio @ https://cancon.hpccube.com:65024/directlink/4/torchaudio/DAS1.0/torchaudio-2.1.2+das1.0+253903e.abi0.dtk2404.torch2.1.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=2a7b3bbe8b558f48784f302900fd1dff3ff9d10a3c139e00f2b136a76d6d7f1c
torchvision @ https://cancon.hpccube.com:65024/directlink/4/vision/DAS1.0/torchvision-0.16.0+das1.0+gitc9e7141.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=4d5e5071e89892cccb24c3ee0216cd79b3c22bc5cf1eb0eb49c2792d9f49fb62
tornado==6.4
tqdm @ file:///opt/conda/conda-bld/tqdm_1664392687731/work
traitlets==5.14.3
transformers==4.38.0
triton @ https://cancon.hpccube.com:65024/directlink/4/triton/DAS1.0/triton-2.1.0+das1.0+git3841f975.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=0dda810eb171af0b3f5cf90a1a4b2f41c9ef0ef08453762a798c86dd01fe976f
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.0
tzdata==2024.1
ujson==5.10.0
uri-template==1.3.0
urllib3 @ file:///croot/urllib3_1670526988650/work
uvicorn==0.30.0
uvloop==0.19.0
vllm @ https://cancon.hpccube.com:65024/directlink/4/vllm/DAS1.0/vllm-0.3.3+das1.0+git3380931.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=23bcdb8a6eb0382770dc7460ea3f7c85cd0c885913b28759eb8a9894731cdb87
watchfiles==0.22.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
widgetsnbextension==4.0.11
xformers @ https://cancon.hpccube.com:65024/directlink/4/xformers/DAS1.0/xformers-0.0.25+das1.0+gitd11e899.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=b086d1bd50bd19c82ca44c424fe193dfcdd48bdd6695d3e6a58f53764c64f428
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
zipp==3.19.0

envs.yaml

name: llama.cpp
channels:
  - https://repo.anaconda.com/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - boltons=23.0.0=py310h06a4308_0
  - brotlipy=0.7.0=py310h7f8727e_1002
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2024.3.11=h06a4308_0
  - certifi=2024.2.2=py310h06a4308_0
  - cffi=1.15.1=py310h74dc2b5_0
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - conda-content-trust=0.1.3=py310h06a4308_0
  - conda-package-handling=1.9.0=py310h5eee18b_1
  - cryptography=38.0.1=py310h9ce1e76_0
  - idna=3.4=py310h06a4308_0
  - jsonpatch=1.33=py310h06a4308_1
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.3=h5eee18b_3
  - openssl=1.1.1w=h7f8727e_0
  - pluggy=1.0.0=py310h06a4308_1
  - pycosat=0.6.4=py310h5eee18b_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pyopenssl=22.0.0=pyhd3eb1b0_0
  - pysocks=1.7.1=py310h06a4308_0
  - python=3.10.8=haa1d7c7_0
  - readline=8.2=h5eee18b_0
  - ruamel.yaml=0.17.21=py310h5eee18b_0
  - ruamel.yaml.clib=0.2.6=py310h5eee18b_1
  - setuptools=65.5.0=py310h06a4308_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.40.0=h5082296_0
  - tk=8.6.12=h1ccaba5_0
  - toolz=0.12.0=py310h06a4308_0
  - tqdm=4.64.1=py310h06a4308_0
  - urllib3=1.26.13=py310h06a4308_0
  - wheel=0.37.1=pyhd3eb1b0_0
  - xz=5.2.8=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
      - accelerate==0.32.1
      - addict==2.4.0
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - annotated-types==0.7.0
      - anyio==4.4.0
      - apex==1.1.0+0dd7f68.abi0.dtk2404.torch2.1
      - argon2-cffi==23.1.0
      - argon2-cffi-bindings==21.2.0
      - arrow==1.3.0
      - asttokens==2.4.1
      - async-lru==2.0.4
      - async-timeout==4.0.3
      - attrs==23.2.0
      - babel==2.15.0
      - beautifulsoup4==4.12.3
      - bitsandbytes==0.37.0+gitd3d888f.abi0.dtk2404.torch2.1
      - bleach==6.1.0
      - click==8.1.7
      - coloredlogs==15.0.1
      - comm==0.2.2
      - contourpy==1.2.1
      - cycler==0.12.1
      - datasets==2.19.2
      - debugpy==1.8.1
      - decorator==5.1.1
      - deepspeed==0.12.3+gita724046.abi0.dtk2404.torch2.1.0
      - defusedxml==0.7.1
      - diffusers==0.29.2
      - dill==0.3.8
      - dnspython==2.6.1
      - einops==0.8.0
      - email-validator==2.1.1
      - exceptiongroup==1.2.1
      - executing==2.0.1
      - fastapi==0.111.0
      - fastapi-cli==0.0.4
      - fastjsonschema==2.19.1
      - filelock==3.14.0
      - fire==0.6.0
      - flash-attn==2.0.4+82379d7.abi0.dtk2404.torch2.1
      - flatbuffers==24.3.25
      - fonttools==4.52.4
      - fqdn==1.5.1
      - frozenlist==1.4.1
      - fsspec==2024.3.1
      - h11==0.14.0
      - hf-transfer==0.1.8
      - hjson==3.1.0
      - httpcore==1.0.5
      - httptools==0.6.1
      - httpx==0.27.0
      - huggingface==0.0.1
      - huggingface-hub==0.24.5
      - humanfriendly==10.0
      - hypothesis==5.35.1
      - importlib-metadata==7.1.0
      - invisible-watermark==0.2.0
      - ipykernel==6.29.4
      - ipython==8.24.0
      - ipywidgets==8.1.3
      - isoduration==20.11.0
      - jedi==0.19.1
      - jinja2==3.1.4
      - json5==0.9.25
      - jsonpointer==2.4
      - jsonschema==4.22.0
      - jsonschema-specifications==2023.12.1
      - jupyter-client==8.6.2
      - jupyter-core==5.7.2
      - jupyter-events==0.10.0
      - jupyter-ext-dataset==0.1.0
      - jupyter-ext-logo==0.1.0
      - jupyter-lsp==2.2.5
      - jupyter-server==2.14.0
      - jupyter-server-terminals==0.5.3
      - jupyterlab==4.2.1
      - jupyterlab-language-pack-zh-cn==4.0.post6
      - jupyterlab-pygments==0.3.0
      - jupyterlab-server==2.27.2
      - jupyterlab-widgets==3.0.11
      - kiwisolver==1.4.5
      - lightop==0.3+837dbb7.abi0.dtk2404.torch2.1
      - lmdeploy==0.1.0-git782048c.abi0.dtk2404.torch2.1.
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - matplotlib==3.9.0
      - matplotlib-inline==0.1.7
      - mdurl==0.1.2
      - mistune==3.0.2
      - mmcv==2.0.1-gitc0ccf15.abi0.dtk2404.torch2.1.
      - mmengine==0.10.4
      - mmengine-lite==0.10.4
      - mpmath==1.3.0
      - msgpack==1.0.8
      - multidict==6.0.5
      - multiprocess==0.70.16
      - nbclient==0.10.0
      - nbconvert==7.16.4
      - nbformat==5.10.4
      - nest-asyncio==1.6.0
      - networkx==3.3
      - ninja==1.11.1.1
      - notebook-shim==0.2.4
      - numpy==1.24.3
      - onnxruntime==1.15.0+gita9ca438.abi0.dtk2404
      - opencv-python==4.9.0.80
      - orjson==3.10.3
      - overrides==7.7.0
      - packaging==24.0
      - pandas==2.2.2
      - pandocfilters==1.5.1
      - parso==0.8.4
      - pexpect==4.9.0
      - pillow==10.3.0
      - pip==24.0
      - platformdirs==4.2.2
      - prometheus-client==0.20.0
      - prompt-toolkit==3.0.45
      - protobuf==5.27.0
      - psutil==5.9.8
      - ptyprocess==0.7.0
      - pure-eval==0.2.2
      - py-cpuinfo==9.0.0
      - pyarrow==16.1.0
      - pyarrow-hotfix==0.6
      - pydantic==2.7.2
      - pydantic-core==2.18.3
      - pygments==2.18.0
      - pynvml==11.5.0
      - pyparsing==3.1.2
      - python-dateutil==2.9.0.post0
      - python-dotenv==1.0.1
      - python-json-logger==2.0.7
      - python-multipart==0.0.9
      - pytz==2024.1
      - pywavelets==1.6.0
      - pyyaml==6.0.1
      - pyzmq==26.0.3
      - ray==2.9.3
      - referencing==0.35.1
      - regex==2024.5.15
      - requests==2.32.3
      - rfc3339-validator==0.1.4
      - rfc3986-validator==0.1.1
      - rich==13.7.1
      - rpds-py==0.18.1
      - safetensors==0.4.3
      - send2trash==1.8.3
      - sentencepiece==0.2.0
      - shellingham==1.5.4
      - sniffio==1.3.1
      - sortedcontainers==2.4.0
      - soupsieve==2.5
      - stack-data==0.6.3
      - starlette==0.37.2
      - sympy==1.12.1
      - termcolor==2.4.0
      - terminado==0.18.1
      - tiktoken==0.7.0
      - tinycss2==1.3.0
      - tokenizers==0.15.0
      - tomli==2.0.1
      - torch==2.1.0+git00661e0.abi0.dtk2404
      - torchaudio==2.1.2+253903e.abi0.dtk2404.torch2.1.0
      - torchvision==0.16.0+gitc9e7141.abi0.dtk2404.torch2.1
      - tornado==6.4
      - traitlets==5.14.3
      - transformers==4.38.0
      - triton==2.1.0+git3841f975.abi0.dtk2404
      - typer==0.12.3
      - types-python-dateutil==2.9.0.20240316
      - typing-extensions==4.12.0
      - tzdata==2024.1
      - ujson==5.10.0
      - uri-template==1.3.0
      - uvicorn==0.30.0
      - uvloop==0.19.0
      - vllm==0.3.3+git3380931.abi0.dtk2404.torch2.1
      - watchfiles==0.22.0
      - wcwidth==0.2.13
      - webcolors==1.13
      - webencodings==0.5.1
      - websocket-client==1.8.0
      - websockets==12.0
      - widgetsnbextension==4.0.11
      - xformers==0.0.25+gitd11e899.abi0.dtk2404.torch2.1
      - xxhash==3.4.1
      - yapf==0.40.2
      - yarl==1.9.4
      - zipp==3.19.0
prefix: /opt/conda/envs/llama.cpp

2. Download llama.cpp

# Download llama.cpp
# If the download fails, download it manually and upload it to the server
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Check out the b3045 tag and create a b3045 branch
git checkout -b b3045 b3045

Directory contents before building:

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# tree -L 1
.
|-- AUTHORS
|-- CMakeLists.txt
|-- CMakePresets.json
|-- LICENSE
|-- Makefile
|-- Package.swift
|-- README-sycl.md
|-- README.md
|-- SECURITY.md
|-- ci
|-- cmake
|-- codecov.yml
|-- common
|-- convert-hf-to-gguf-update.py
|-- convert-hf-to-gguf.py
|-- convert-llama-ggml-to-gguf.py
|-- convert.py
|-- docs
|-- examples
|-- flake.lock
|-- flake.nix
|-- ggml-alloc.c
|-- ggml-alloc.h
|-- ggml-backend-impl.h
|-- ggml-backend.c
|-- ggml-backend.h
|-- ggml-common.h
|-- ggml-cuda
|-- ggml-cuda.cu
|-- ggml-cuda.h
|-- ggml-impl.h
|-- ggml-kompute.cpp
|-- ggml-kompute.h
|-- ggml-metal.h
|-- ggml-metal.m
|-- ggml-metal.metal
|-- ggml-opencl.cpp
|-- ggml-opencl.h
|-- ggml-quants.c
|-- ggml-quants.h
|-- ggml-rpc.cpp
|-- ggml-rpc.h
|-- ggml-sycl.cpp
|-- ggml-sycl.h
|-- ggml-vulkan-shaders.hpp
|-- ggml-vulkan.cpp
|-- ggml-vulkan.h
|-- ggml.c
|-- ggml.h
|-- ggml_vk_generate_shaders.py
|-- gguf-py
|-- grammars
|-- kompute
|-- kompute-shaders
|-- llama.cpp
|-- llama.h
|-- media
|-- models
|-- mypy.ini
|-- pocs
|-- prompts
|-- pyrightconfig.json
|-- requirements
|-- requirements.txt
|-- scripts
|-- sgemm.cpp
|-- sgemm.h
|-- spm-headers
|-- tests
|-- unicode-data.cpp
|-- unicode-data.h
|-- unicode.cpp
`-- unicode.h

3. Build llama.cpp

Build llama.cpp locally

3.1 Build the CPU version

# For rebuilds (not the first build), clean before compiling
make clean
make -j32
root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# make -j32
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE
I NVCCFLAGS: -std=c++11 -O3
I LDFLAGS:
I CC:        cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
I CXX:       c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml.c -o ggml.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c llama.cpp -o llama.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/common.cpp -o common.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/sampling.cpp -o sampling.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/grammar-parser.cpp -o grammar-parser.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/json-schema-to-grammar.cpp -o json-schema-to-grammar.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/console.cpp -o console.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c sgemm.cpp -o sgemm.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c unicode.cpp -o unicode.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c unicode-data.cpp -o unicode-data.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/train.cpp -o train.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/ngram-cache.cpp -o ngram-cache.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion  -c tests/test-c.c -o tests/test-c.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/build-info.cpp -o build-info.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c pocs/vdot/vdot.cpp -o pocs/vdot/vdot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c pocs/vdot/q8dot.cpp -o pocs/vdot/q8dot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/gguf/gguf.cpp -o examples/gguf/gguf.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/benchmark/benchmark-matmult.cpp -o examples/benchmark/benchmark-matmult.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/export-lora/export-lora.cpp -o examples/export-lora/export-lora.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf/gguf.o -o gguf
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/q8dot.o -o q8dot
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/vdot.o -o vdot
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  build-info.o ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/benchmark/benchmark-matmult.o -o benchmark-matmult
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/export-lora/export-lora.o -o export-lora
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/main/main.cpp -o examples/main/main.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/quantize/quantize.cpp -o examples/quantize/quantize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/quantize-stats/quantize-stats.cpp -o examples/quantize-stats/quantize-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/perplexity/perplexity.cpp -o examples/perplexity/perplexity.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/imatrix/imatrix.cpp -o examples/imatrix/imatrix.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/embedding/embedding.cpp -o examples/embedding/embedding.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/train-text-from-scratch/train-text-from-scratch.cpp -o examples/train-text-from-scratch/train-text-from-scratch.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp -o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/simple/simple.cpp -o examples/simple/simple.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/batched/batched.cpp -o examples/batched/batched.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/batched-bench/batched-bench.cpp -o examples/batched-bench/batched-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/save-load-state/save-load-state.cpp -o examples/save-load-state/save-load-state.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/server/server.cpp -o examples/server/server.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/gguf-split/gguf-split.cpp -o examples/gguf-split/gguf-split.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/eval-callback/eval-callback.cpp -o examples/eval-callback/eval-callback.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/llama-bench/llama-bench.cpp -o examples/llama-bench/llama-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/llava/llava-cli.cpp -o examples/llava/llava-cli.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/baby-llama/baby-llama.cpp -o examples/baby-llama/baby-llama.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/beam-search/beam-search.cpp -o examples/beam-search/beam-search.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/retrieval/retrieval.cpp -o examples/retrieval/retrieval.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/speculative/speculative.cpp -o examples/speculative/speculative.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/infill/infill.cpp -o examples/infill/infill.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/tokenize/tokenize.cpp -o examples/tokenize/tokenize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/parallel/parallel.cpp -o examples/parallel/parallel.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/finetune/finetune.cpp -o examples/finetune/finetune.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookahead/lookahead.cpp -o examples/lookahead/lookahead.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookup/lookup.cpp -o examples/lookup/lookup.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/passkey/passkey.cpp -o examples/passkey/passkey.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/gritlm/gritlm.cpp -o examples/gritlm/gritlm.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/baby-llama/baby-llama.o -o baby-llama
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/tokenize/tokenize.o -o tokenize
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/eval-callback/eval-callback.o -o eval-callback 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/save-load-state/save-load-state.o -o save-load-state
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/beam-search/beam-search.o -o beam-search
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf-split/gguf-split.o -o gguf-split
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/simple/simple.o -o simple
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gritlm/gritlm.o -o gritlm
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/embedding/embedding.o -o embedding
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  build-info.o ggml.o llama.o common.o sampling.o grammar-parser.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched-bench/batched-bench.o -o batched-bench 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/passkey/passkey.o -o passkey
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched/batched.o -o batched
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup.o -o lookup
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookahead/lookahead.o -o lookahead
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/llava/clip.cpp  -o examples/llava/clip.o -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/parallel/parallel.o -o parallel
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/train-text-from-scratch/train-text-from-scratch.o -o train-text-from-scratch
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookup/lookup-create.cpp -o examples/lookup/lookup-create.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/retrieval/retrieval.o -o retrieval
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o -o convert-llama2c-to-ggml
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/infill/infill.o -o infill
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/imatrix/imatrix.o -o imatrix
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/speculative/speculative.o -o speculative
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/finetune/finetune.o -o finetune
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/main/main.o -o main
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-create.o -o lookup-create
====  Run ./main -h for help.  ====
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookup/lookup-merge.cpp -o examples/lookup/lookup-merge.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-merge.o -o lookup-merge
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/perplexity/perplexity.o -o perplexity
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookup/lookup-stats.cpp -o examples/lookup/lookup-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  build-info.o ggml.o llama.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize-stats/quantize-stats.o -o quantize-stats
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-stats.o -o lookup-stats
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llama-bench/llama-bench.o -o llama-bench
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/llava/llava.cpp -o examples/llava/llava.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llava/llava-cli.o examples/llava/clip.o examples/llava/llava.o -o llava-cli
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o -Iexamples/server examples/server/server.o -o server

Directory layout after compilation:

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# tree -L 1
.
|-- AUTHORS
|-- CMakeLists.txt
|-- CMakePresets.json
|-- LICENSE
|-- Makefile
|-- Package.swift
|-- README-sycl.md
|-- README.md
|-- SECURITY.md
|-- baby-llama
|-- batched
|-- batched-bench
|-- beam-search
|-- benchmark-matmult
|-- build-info.o
|-- ci
|-- cmake
|-- codecov.yml
|-- common
|-- common.o
|-- console.o
|-- convert-hf-to-gguf-update.py
|-- convert-hf-to-gguf.py
|-- convert-llama-ggml-to-gguf.py
|-- convert-llama2c-to-ggml
|-- convert.py
|-- docs
|-- embedding
|-- eval-callback
|-- examples
|-- export-lora
|-- finetune
|-- flake.lock
|-- flake.nix
|-- ggml-alloc.c
|-- ggml-alloc.h
|-- ggml-alloc.o
|-- ggml-backend-impl.h
|-- ggml-backend.c
|-- ggml-backend.h
|-- ggml-backend.o
|-- ggml-common.h
|-- ggml-cuda
|-- ggml-cuda.cu
|-- ggml-cuda.h
|-- ggml-impl.h
|-- ggml-kompute.cpp
|-- ggml-kompute.h
|-- ggml-metal.h
|-- ggml-metal.m
|-- ggml-metal.metal
|-- ggml-opencl.cpp
|-- ggml-opencl.h
|-- ggml-quants.c
|-- ggml-quants.h
|-- ggml-quants.o
|-- ggml-rpc.cpp
|-- ggml-rpc.h
|-- ggml-sycl.cpp
|-- ggml-sycl.h
|-- ggml-vulkan-shaders.hpp
|-- ggml-vulkan.cpp
|-- ggml-vulkan.h
|-- ggml.c
|-- ggml.h
|-- ggml.o
|-- ggml_vk_generate_shaders.py
|-- gguf
|-- gguf-py
|-- gguf-split
|-- grammar-parser.o
|-- grammars
|-- gritlm
|-- imatrix
|-- infill
|-- json-schema-to-grammar.o
|-- kompute
|-- kompute-shaders
|-- libllava.a
|-- llama-bench
|-- llama.cpp
|-- llama.h
|-- llama.o
|-- llava-cli
|-- lookahead
|-- lookup
|-- lookup-create
|-- lookup-merge
|-- lookup-stats
|-- main
|-- media
|-- models
|-- mypy.ini
|-- ngram-cache.o
|-- parallel
|-- passkey
|-- perplexity
|-- pocs
|-- prompts
|-- pyrightconfig.json
|-- q8dot
|-- quantize
|-- quantize-stats
|-- requirements
|-- requirements.txt
|-- retrieval
|-- sampling.o
|-- save-load-state
|-- scripts
|-- server
|-- sgemm.cpp
|-- sgemm.h
|-- sgemm.o
|-- simple
|-- speculative
|-- spm-headers
|-- tests
|-- tokenize
|-- train-text-from-scratch
|-- train.o
|-- unicode-data.cpp
|-- unicode-data.h
|-- unicode-data.o
|-- unicode.cpp
|-- unicode.h
|-- unicode.o
`-- vdot

Notes on the key binaries:

  • main — runs inference with a model.
  • quantize — quantizes a model.
  • server — exposes the model through an API service.
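As a rough sketch of how these three binaries are typically invoked (the model paths, prompt, and port below are hypothetical placeholders, assuming a GGUF model has already been produced by the conversion step):

```shell
# Run inference with a GGUF model (path and prompt are placeholders)
./main -m ./models/llama-3-8b-instruct.Q8_0.gguf -p "Hello" -n 64

# Quantize an FP16 GGUF model down to 8-bit (Q8_0)
./quantize ./models/llama-3-8b-instruct.fp16.gguf ./models/llama-3-8b-instruct.Q8_0.gguf Q8_0

# Serve the model over HTTP on port 8080
./server -m ./models/llama-3-8b-instruct.Q8_0.gguf --port 8080
```

Run `./main -h`, `./quantize --help`, or `./server -h` to see the full option list for the version actually built.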

3.2 Building the GPU version

hipblas

speedup ROCm AMD Unified Memory Architecture #7399

Install and run llama.cpp with ROCm 5.7 on Ubuntu 22.04

HIP_VISIBLE_DEVICES

User Guide for AMDGPU Backend

Running llama 2 with llama.cpp, GPU-accelerated on an AMD Radeon RX 6900

Note that Hygon's DCU is a GPGPU built on the ROCm platform, so AMDGPU-related documentation generally applies to it as well.

# Query the GPU architecture
rocminfo | grep gfx
# or extract just the architecture name
rocminfo | grep gfx | head -1 | awk '{print $2}'

# Build with HIP/ROCm support
make -j32 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx928

(ollama) root@notebook-1823641624653922306-scnlbe5oi5-42808:~# rocminfo | grep gfx
  Name:                    gfx928
  Name:                    amdgcn-amd-amdhsa--gfx928:sramecc+:xnack-
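Combining the two steps above, here is a small sketch that detects the architecture automatically instead of hard-coding gfx928 (it assumes `rocminfo` is on PATH; the fallback value is the DCU architecture used in this article):

```shell
# Detect the first reported gfx architecture from rocminfo,
# falling back to gfx928 (the DCU used here) if detection fails.
GFX_ARCH=$(rocminfo | grep -o 'gfx[0-9a-f]*' | head -1)
make -j"$(nproc)" LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS="${GFX_ARCH:-gfx928}"
```

Using `-j"$(nproc)"` instead of a fixed `-j32` sizes the parallel build to the machine's actual core count.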
root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# make -j32 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx928
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA
I NVCCFLAGS: -std=c++11 -O3
I LDFLAGS:   -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
I CC:        cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
I CXX:       c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
...
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/main/main.cpp -o examples/main/main.o
...  (the remaining sources under examples/ and pocs/ are compiled with the identical flags; repetitive lines trimmed)
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
...  (the remaining object files are linked into their respective executables the same way, against /opt/dtk/lib with -lhipblas -lamdhip64 -lrocblas; repetitive lines trimmed)
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/llava/clip.cpp  -o examples/llava/clip.o -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o -o convert-llama2c-to-ggml -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/train-text-from-scratch/train-text-from-scratch.o -o train-text-from-scratch -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/retrieval/retrieval.o -o retrieval -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/imatrix/imatrix.o -o imatrix -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/speculative/speculative.o -o speculative -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/infill/infill.o -o infill -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/finetune/finetune.o -o finetune -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gritlm/gritlm.o -o gritlm -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/main/main.o -o main -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/passkey/passkey.o -o passkey -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup.o -o lookup -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
====  Run ./main -h for help.  ====
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/lookup/lookup-create.cpp -o examples/lookup/lookup-create.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/perplexity/perplexity.o -o perplexity -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-create.o -o lookup-create -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/lookup/lookup-merge.cpp -o examples/lookup/lookup-merge.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  build-info.o ggml.o llama.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize-stats/quantize-stats.o -o quantize-stats -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-merge.o -o lookup-merge -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/lookup/lookup-stats.cpp -o examples/lookup/lookup-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-stats.o -o lookup-stats -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llama-bench/llama-bench.o -o llama-bench -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/llava/llava.cpp -o examples/llava/llava.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llava/llava-cli.o examples/llava/clip.o examples/llava/llava.o -o llava-cli -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o -Iexamples/server examples/server/server.o -o server -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas

4. Preparing the Model

Find a model in a suitable format on Hugging Face and download it into the models directory of llama.cpp, or upload a locally downloaded model to the models directory.

4.1 Downloading the original LLaMA model

If you downloaded the original Meta LLaMA checkpoint, it must first be converted to HF format.

Use the convert_llama_weights_to_hf.py script provided by transformers to convert the original LLaMA weights to the Hugging Face format.

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir path_to_original_llama_root_dir \
    --model_size 7B \
    --output_dir path_to_original_llama_hf_dir

Note that the original LLaMA tokenizer.model must be placed directly in the directory specified by --input_dir, with the remaining checkpoint files under ${input_dir}/${model_size}. After the above command finishes, the converted HF weights are written to --output_dir.
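Before running the conversion it can be worth sanity-checking that the files are laid out as the script expects. A minimal sketch (the directory names are placeholders matching the command above, and the `consolidated.*.pth` filename pattern is an assumption about the original checkpoint naming):

```python
from pathlib import Path

def check_llama_layout(input_dir: str, model_size: str = "7B") -> bool:
    """Check the layout convert_llama_weights_to_hf.py expects:
    tokenizer.model at the top level, checkpoint files under <input_dir>/<model_size>."""
    root = Path(input_dir)
    has_tokenizer = (root / "tokenizer.model").is_file()
    size_dir = root / model_size
    has_weights = size_dir.is_dir() and any(size_dir.glob("consolidated.*.pth"))
    return has_tokenizer and has_weights
```

If this returns False, the script will typically fail partway through, so it is cheaper to fix the layout first.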

4.2 Downloading a GGUF model

Alternatively, download a ready-made GGUF model directly and skip the GGUF conversion step.

QuantFactory/Meta-Llama-3-8B-Instruct-GGUF

./main -m $(./scripts/hf.sh --repo QuantFactory/Meta-Llama-3-8B-Instruct-GGUF --file Meta-Llama-3-8B-Instruct.Q4_0.gguf --outdir ./models)
./main -m $(./scripts/hf.sh --url https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf --outdir ./models)
./main -m $(./scripts/hf.sh https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf --outdir ./models)
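The three invocations above are equivalent ways of naming the same file: a --repo/--file pair, a web-page (blob) URL, and a direct-download (resolve) URL. The relationship between the last two forms is a simple path rewrite, sketched here for illustration (this is not the actual hf.sh implementation):

```python
def blob_to_resolve(url: str) -> str:
    """Rewrite a Hugging Face web-page (blob) URL into a direct-download (resolve) URL."""
    return url.replace("/blob/", "/resolve/", 1)

print(blob_to_resolve(
    "https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF"
    "/blob/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf"
))
```

The resolve form is the one to use with plain download tools such as wget or curl.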

4.3 Downloading a Hugging Face model

This article uses the LLM-Research/Meta-Llama-3-8B-Instruct model as the example. Since the access request on Hugging Face was not approved, the model was downloaded from the ModelScope community instead.

For model download methods, see: Hugging Face和ModelScope大模型/数据集的下载加速方法

5. (Optional) Merging LoRA Weights

The original LLaMA model has little Chinese-language capability, whereas Chinese-LLaMA-Alpaca understands Chinese well. The approach is therefore to extend the original LLaMA model (HF format) with a Chinese vocabulary and merge it with the LoRA weights to produce full model weights.

For the detailed steps of merging LoRA weights, see: llama.cpp一种在本地CPU上部署的量化模型(超低配推理llama)

6. Converting to GGUF Format

Converting HuggingFace Models to GGUF/GGML

The convert-hf-to-gguf-update.py seems doesn’t work. #7088

Tutorial: How to convert HuggingFace model to GGUF format #2948

llama.cpp can convert models from several formats: PyTorch .pth checkpoints, Hugging Face .safetensors files, and the ggmlv3 format that llama.cpp itself used previously.

6.1 Conversion scripts

The conversion scripts are:

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ll | grep convert
-rwxr-xr-x  1 root root   13029 Aug  8 10:57 convert-hf-to-gguf-update.py*
-rwxr-xr-x  1 root root  127129 Aug  8 10:57 convert-hf-to-gguf.py*
-rwxr-xr-x  1 root root   18993 Aug  8 10:57 convert-llama-ggml-to-gguf.py*
-rwxr-xr-x  1 root root 2218136 Aug  8 11:02 convert-llama2c-to-ggml*
-rwxr-xr-x  1 root root   69417 Aug  8 10:57 convert.py*

Notes

  • convert-hf-to-gguf-update.py: downloads the tokenizer models of the specified models from Hugging Face and generates the get_vocab_base_pre() function for convert-hf-to-gguf.py.
  • convert-hf-to-gguf.py: converts from HuggingFace format to gguf.
  • convert-llama-ggml-to-gguf.py: converts from ggml format to gguf.
  • convert-llama2c-to-ggml: converts from the llama2.c model format to ggml.
  • convert.py: converts LLaMA models in PyTorch (*.pth, *.pt, *.bin) or safetensors format to gguf.
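The choice of script follows from the input format. A small dispatch helper, purely illustrative (the filename patterns are assumptions, not part of llama.cpp):

```python
def pick_convert_script(model_path: str) -> str:
    """Map an input model file to the matching llama.cpp conversion script (sketch)."""
    if "ggml" in model_path:                      # old ggmlv3 checkpoints, e.g. *.ggmlv3.q4_0.bin
        return "convert-llama-ggml-to-gguf.py"
    if model_path.endswith((".pth", ".pt", ".bin")):
        return "convert.py"
    if model_path.endswith(".safetensors"):
        return "convert-hf-to-gguf.py"            # convert.py can also read safetensors
    raise ValueError(f"unrecognized model format: {model_path}")
```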

6.2 convert.py

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# python convert.py -h
usage: convert.py [-h] [--dump] [--dump-single] [--vocab-only] [--no-vocab] [--outtype {f32,f16,q8_0}] [--vocab-dir VOCAB_DIR]
                  [--vocab-type VOCAB_TYPE] [--outfile OUTFILE] [--ctx CTX] [--concurrency CONCURRENCY] [--big-endian]
                  [--pad-vocab] [--skip-unknown] [--verbose] [--metadata METADATA] [--get-outfile]
                  model

Convert a LLaMA model to a GGML compatible file

positional arguments:
  model                 directory containing model file, or model file itself (*.pth, *.pt, *.bin)

options:
  -h, --help            show this help message and exit
  --dump                don't convert, just show what's in the model
  --dump-single         don't convert, just show what's in a single model file
  --vocab-only          extract only the vocab
  --no-vocab            store model without the vocab
  --outtype {f32,f16,q8_0}
                        output format - note: q8_0 may be very slow (default: f16 or f32 based on input)
  --vocab-dir VOCAB_DIR
                        directory containing tokenizer.model, if separate from model file
  --vocab-type VOCAB_TYPE
                        vocab types to try in order, choose from 'spm', 'bpe', 'hfft' (default: spm,hfft)
  --outfile OUTFILE     path to write to; default: based on input
  --ctx CTX             model training context (default: based on input)
  --concurrency CONCURRENCY
                        concurrency used for conversion (default: 8)
  --big-endian          model is executed on big endian machine
  --pad-vocab           add pad tokens when model vocab expects more than tokenizer metadata provides
  --skip-unknown        skip unknown tensor names instead of failing
  --verbose             increase output verbosity
  --metadata METADATA   Specify the path for a metadata file
  --get-outfile         get calculated default outfile name

Notes:

  • --outtype: one of {f32, f16, q8_0}
  • --vocab-type: one or more of {'spm', 'bpe', 'hfft'}
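The file that convert.py writes is a binary GGUF file with a small fixed header. As a quick sanity check on any conversion output, the header can be parsed directly; the sketch below is illustrative only (not part of llama.cpp) and assumes the GGUF v2+ header layout: 4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Parse the fixed GGUF header (assumes GGUF v2+ layout: uint64 counts)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))     # file format version
        n_tensors, = struct.unpack("<Q", f.read(8))   # number of tensors
        n_kv, = struct.unpack("<Q", f.read(8))        # number of metadata key/value pairs
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}
```

For the Llama3-8B conversion below, a correct output should report 291 tensors, matching the `[n/291] Writing tensor ...` lines in the log.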

6.3 Running the conversion

Convert the model downloaded from Hugging Face to gguf format, with FP16 output.

Compared with its two predecessors, Llama-3 greatly expands the vocabulary, from 32K to 128K tokens, and switches to a BPE tokenizer. The tokenization algorithm therefore has to be selected with the --vocab-type option: its default is spm, so a BPE vocab must be specified explicitly with --vocab-type bpe.

Note: the official documentation says convert.py does not support LLaMA 3 and recommends convert-hf-to-gguf.py instead. However, convert-hf-to-gguf.py does not accept --vocab-type and fails with: error: unrecognized arguments: --vocab-type bpe. In this test, convert.py completed without errors, so it is used here.
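Before running the conversion, the right --vocab-type can be sanity-checked against the tokenizer files shipped with the model. The heuristic below is a minimal illustrative sketch (guess_vocab_type is a hypothetical helper, not part of llama.cpp): a HuggingFace tokenizer.json whose model type is "BPE" calls for --vocab-type bpe, while a SentencePiece tokenizer.model indicates spm. This matches the log below, where tokenizer.json is loaded as type 'bpe'.

```python
import json
from pathlib import Path

def guess_vocab_type(model_dir: str) -> str:
    """Guess the --vocab-type flag from a model directory (illustrative heuristic)."""
    d = Path(model_dir)
    tok_json = d / "tokenizer.json"
    if tok_json.exists():
        # HuggingFace fast-tokenizer file; Llama-3 ships one with model type "BPE"
        model = json.loads(tok_json.read_text(encoding="utf-8")).get("model", {})
        if model.get("type") == "BPE":
            return "bpe"
    if (d / "tokenizer.model").exists():
        # SentencePiece model file, as used by Llama 1/2
        return "spm"
    raise ValueError(f"could not determine vocab type for {model_dir}")
```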

python convert.py models/Meta-Llama-3-8B-Instruct/ --outfile models/ggml-vocab-llama3-8B-instruct-f16.gguf --outtype f16 --vocab-type bpe
(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# python convert.py models/Meta-Llama-3-8B-Instruct/ --outfile models/ggml-vocab-llama3-8B-instruct-f16.gguf --outtype f16 --vocab-type bpe
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 8030261248 (8B)
INFO:convert:params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('models/Meta-Llama-3-8B-Instruct'))
INFO:convert:Loaded vocab file PosixPath('models/Meta-Llama-3-8B-Instruct/tokenizer.json'), type 'bpe'
INFO:convert:Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128009}, add special tokens unset>
INFO:convert:Writing models/ggml-vocab-llama3-8B-instruct-f16.gguf, format 1
WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128009
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>' }}{% endif %}
INFO:convert:[  1/291] Writing tensor token_embd.weight                      | size 128256 x   4096  | type F16  | T+   4
INFO:convert:[  2/291] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   5
INFO:convert:[  3/291] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   5
INFO:convert:[  4/291] Writing tensor blk.0.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  5/291] Writing tensor blk.0.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  6/291] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   5
INFO:convert:[  7/291] Writing tensor blk.0.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   5
INFO:convert:[  8/291] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[  9/291] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[ 10/291] Writing tensor blk.0.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   6
INFO:convert:[ 11/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   6
INFO:convert:[ 12/291] Writing tensor blk.1.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   6
INFO:convert:[ 13/291] Writing tensor blk.1.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   7
INFO:convert:[ 14/291] Writing tensor blk.1.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   7
INFO:convert:[ 15/291] Writing tensor blk.1.ffn_norm.weight                  | size   4096           | type F32  | T+   7
INFO:convert:[ 16/291] Writing tensor blk.1.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   7
INFO:convert:[ 17/291] Writing tensor blk.1.attn_output.weight               | size   4096 x   4096  | type F16  | T+   7
INFO:convert:[ 18/291] Writing tensor blk.1.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   7
INFO:convert:[ 19/291] Writing tensor blk.1.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   7
INFO:convert:[ 20/291] Writing tensor blk.2.attn_norm.weight                 | size   4096           | type F32  | T+   7
INFO:convert:[ 21/291] Writing tensor blk.2.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   8
INFO:convert:[ 22/291] Writing tensor blk.2.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   8
INFO:convert:[ 23/291] Writing tensor blk.2.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   8
INFO:convert:[ 24/291] Writing tensor blk.2.ffn_norm.weight                  | size   4096           | type F32  | T+   8
INFO:convert:[ 25/291] Writing tensor blk.2.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   8
INFO:convert:[ 26/291] Writing tensor blk.2.attn_output.weight               | size   4096 x   4096  | type F16  | T+   8
INFO:convert:[ 27/291] Writing tensor blk.2.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   8
INFO:convert:[ 28/291] Writing tensor blk.2.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   8
INFO:convert:[ 29/291] Writing tensor blk.3.attn_norm.weight                 | size   4096           | type F32  | T+   8
INFO:convert:[ 30/291] Writing tensor blk.3.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   9
INFO:convert:[ 31/291] Writing tensor blk.3.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   9
INFO:convert:[ 32/291] Writing tensor blk.3.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   9
INFO:convert:[ 33/291] Writing tensor blk.3.ffn_norm.weight                  | size   4096           | type F32  | T+   9
INFO:convert:[ 34/291] Writing tensor blk.3.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   9
INFO:convert:[ 35/291] Writing tensor blk.3.attn_output.weight               | size   4096 x   4096  | type F16  | T+   9
INFO:convert:[ 36/291] Writing tensor blk.3.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   9
INFO:convert:[ 37/291] Writing tensor blk.3.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   9
INFO:convert:[ 38/291] Writing tensor blk.4.attn_norm.weight                 | size   4096           | type F32  | T+   9
INFO:convert:[ 39/291] Writing tensor blk.4.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  10
INFO:convert:[ 40/291] Writing tensor blk.4.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  10
INFO:convert:[ 41/291] Writing tensor blk.4.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  11
INFO:convert:[ 42/291] Writing tensor blk.4.ffn_norm.weight                  | size   4096           | type F32  | T+  12
INFO:convert:[ 43/291] Writing tensor blk.4.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  12
INFO:convert:[ 44/291] Writing tensor blk.4.attn_output.weight               | size   4096 x   4096  | type F16  | T+  12
INFO:convert:[ 45/291] Writing tensor blk.4.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  12
INFO:convert:[ 46/291] Writing tensor blk.4.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  12
INFO:convert:[ 47/291] Writing tensor blk.5.attn_norm.weight                 | size   4096           | type F32  | T+  12
INFO:convert:[ 48/291] Writing tensor blk.5.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  12
INFO:convert:[ 49/291] Writing tensor blk.5.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  12
INFO:convert:[ 50/291] Writing tensor blk.5.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  12
INFO:convert:[ 51/291] Writing tensor blk.5.ffn_norm.weight                  | size   4096           | type F32  | T+  13
INFO:convert:[ 52/291] Writing tensor blk.5.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  13
INFO:convert:[ 53/291] Writing tensor blk.5.attn_output.weight               | size   4096 x   4096  | type F16  | T+  13
INFO:convert:[ 54/291] Writing tensor blk.5.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  13
INFO:convert:[ 55/291] Writing tensor blk.5.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  13
INFO:convert:[ 56/291] Writing tensor blk.6.attn_norm.weight                 | size   4096           | type F32  | T+  13
INFO:convert:[ 57/291] Writing tensor blk.6.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  13
INFO:convert:[ 58/291] Writing tensor blk.6.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  13
INFO:convert:[ 59/291] Writing tensor blk.6.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  14
INFO:convert:[ 60/291] Writing tensor blk.6.ffn_norm.weight                  | size   4096           | type F32  | T+  14
INFO:convert:[ 61/291] Writing tensor blk.6.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  14
INFO:convert:[ 62/291] Writing tensor blk.6.attn_output.weight               | size   4096 x   4096  | type F16  | T+  14
INFO:convert:[ 63/291] Writing tensor blk.6.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  14
INFO:convert:[ 64/291] Writing tensor blk.6.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  14
INFO:convert:[ 65/291] Writing tensor blk.7.attn_norm.weight                 | size   4096           | type F32  | T+  14
INFO:convert:[ 66/291] Writing tensor blk.7.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  14
INFO:convert:[ 67/291] Writing tensor blk.7.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  15
INFO:convert:[ 68/291] Writing tensor blk.7.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  15
INFO:convert:[ 69/291] Writing tensor blk.7.ffn_norm.weight                  | size   4096           | type F32  | T+  15
INFO:convert:[ 70/291] Writing tensor blk.7.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  15
INFO:convert:[ 71/291] Writing tensor blk.7.attn_output.weight               | size   4096 x   4096  | type F16  | T+  15
INFO:convert:[ 72/291] Writing tensor blk.7.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  15
INFO:convert:[ 73/291] Writing tensor blk.7.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  15
INFO:convert:[ 74/291] Writing tensor blk.8.attn_norm.weight                 | size   4096           | type F32  | T+  15
INFO:convert:[ 75/291] Writing tensor blk.8.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  16
INFO:convert:[ 76/291] Writing tensor blk.8.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  16
INFO:convert:[ 77/291] Writing tensor blk.8.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  16
INFO:convert:[ 78/291] Writing tensor blk.8.ffn_norm.weight                  | size   4096           | type F32  | T+  16
INFO:convert:[ 79/291] Writing tensor blk.8.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  16
INFO:convert:[ 80/291] Writing tensor blk.8.attn_output.weight               | size   4096 x   4096  | type F16  | T+  16
INFO:convert:[ 81/291] Writing tensor blk.8.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  16
INFO:convert:[ 82/291] Writing tensor blk.8.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  17
INFO:convert:[ 83/291] Writing tensor blk.10.attn_norm.weight                | size   4096           | type F32  | T+  17
INFO:convert:[ 84/291] Writing tensor blk.10.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  17
INFO:convert:[ 85/291] Writing tensor blk.10.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  17
INFO:convert:[ 86/291] Writing tensor blk.10.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  17
INFO:convert:[ 87/291] Writing tensor blk.10.ffn_norm.weight                 | size   4096           | type F32  | T+  17
INFO:convert:[ 88/291] Writing tensor blk.10.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  17
INFO:convert:[ 89/291] Writing tensor blk.10.attn_output.weight              | size   4096 x   4096  | type F16  | T+  17
INFO:convert:[ 90/291] Writing tensor blk.10.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  18
INFO:convert:[ 91/291] Writing tensor blk.10.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  18
INFO:convert:[ 92/291] Writing tensor blk.11.attn_norm.weight                | size   4096           | type F32  | T+  18
INFO:convert:[ 93/291] Writing tensor blk.11.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  18
INFO:convert:[ 94/291] Writing tensor blk.11.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  18
INFO:convert:[ 95/291] Writing tensor blk.11.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  18
INFO:convert:[ 96/291] Writing tensor blk.11.ffn_norm.weight                 | size   4096           | type F32  | T+  18
INFO:convert:[ 97/291] Writing tensor blk.11.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  18
INFO:convert:[ 98/291] Writing tensor blk.11.attn_output.weight              | size   4096 x   4096  | type F16  | T+  18
INFO:convert:[ 99/291] Writing tensor blk.11.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  19
INFO:convert:[100/291] Writing tensor blk.11.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  19
INFO:convert:[101/291] Writing tensor blk.12.attn_norm.weight                | size   4096           | type F32  | T+  19
INFO:convert:[102/291] Writing tensor blk.12.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  20
INFO:convert:[103/291] Writing tensor blk.12.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  20
INFO:convert:[104/291] Writing tensor blk.12.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  20
INFO:convert:[105/291] Writing tensor blk.12.ffn_norm.weight                 | size   4096           | type F32  | T+  20
INFO:convert:[106/291] Writing tensor blk.12.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  20
INFO:convert:[107/291] Writing tensor blk.12.attn_output.weight              | size   4096 x   4096  | type F16  | T+  20
INFO:convert:[108/291] Writing tensor blk.12.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  20
INFO:convert:[109/291] Writing tensor blk.12.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  20
INFO:convert:[110/291] Writing tensor blk.13.attn_norm.weight                | size   4096           | type F32  | T+  20
INFO:convert:[111/291] Writing tensor blk.13.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  21
INFO:convert:[112/291] Writing tensor blk.13.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  21
INFO:convert:[113/291] Writing tensor blk.13.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  21
INFO:convert:[114/291] Writing tensor blk.13.ffn_norm.weight                 | size   4096           | type F32  | T+  21
INFO:convert:[115/291] Writing tensor blk.13.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  22
INFO:convert:[116/291] Writing tensor blk.13.attn_output.weight              | size   4096 x   4096  | type F16  | T+  22
INFO:convert:[117/291] Writing tensor blk.13.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  22
INFO:convert:[118/291] Writing tensor blk.13.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  22
INFO:convert:[119/291] Writing tensor blk.14.attn_norm.weight                | size   4096           | type F32  | T+  22
INFO:convert:[120/291] Writing tensor blk.14.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  22
INFO:convert:[121/291] Writing tensor blk.14.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  22
INFO:convert:[122/291] Writing tensor blk.14.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  22
INFO:convert:[123/291] Writing tensor blk.14.ffn_norm.weight                 | size   4096           | type F32  | T+  22
INFO:convert:[124/291] Writing tensor blk.14.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  22
INFO:convert:[125/291] Writing tensor blk.14.attn_output.weight              | size   4096 x   4096  | type F16  | T+  22
INFO:convert:[126/291] Writing tensor blk.14.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  22
INFO:convert:[127/291] Writing tensor blk.14.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  23
INFO:convert:[128/291] Writing tensor blk.15.attn_norm.weight                | size   4096           | type F32  | T+  23
INFO:convert:[129/291] Writing tensor blk.15.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  23
INFO:convert:[130/291] Writing tensor blk.15.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  23
INFO:convert:[131/291] Writing tensor blk.15.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  23
INFO:convert:[132/291] Writing tensor blk.15.ffn_norm.weight                 | size   4096           | type F32  | T+  24
INFO:convert:[133/291] Writing tensor blk.15.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  24
INFO:convert:[134/291] Writing tensor blk.15.attn_output.weight              | size   4096 x   4096  | type F16  | T+  24
INFO:convert:[135/291] Writing tensor blk.15.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  24
INFO:convert:[136/291] Writing tensor blk.15.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  24
INFO:convert:[137/291] Writing tensor blk.16.attn_norm.weight                | size   4096           | type F32  | T+  24
INFO:convert:[138/291] Writing tensor blk.16.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  24
INFO:convert:[139/291] Writing tensor blk.16.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  24
INFO:convert:[140/291] Writing tensor blk.16.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  25
INFO:convert:[141/291] Writing tensor blk.16.ffn_norm.weight                 | size   4096           | type F32  | T+  25
INFO:convert:[142/291] Writing tensor blk.16.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  25
INFO:convert:[143/291] Writing tensor blk.16.attn_output.weight              | size   4096 x   4096  | type F16  | T+  25
INFO:convert:[144/291] Writing tensor blk.16.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  25
INFO:convert:[145/291] Writing tensor blk.16.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  25
INFO:convert:[146/291] Writing tensor blk.17.attn_norm.weight                | size   4096           | type F32  | T+  25
INFO:convert:[147/291] Writing tensor blk.17.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  25
INFO:convert:[148/291] Writing tensor blk.17.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  26
INFO:convert:[149/291] Writing tensor blk.17.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  26
INFO:convert:[150/291] Writing tensor blk.17.ffn_norm.weight                 | size   4096           | type F32  | T+  26
INFO:convert:[151/291] Writing tensor blk.17.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  26
INFO:convert:[152/291] Writing tensor blk.17.attn_output.weight              | size   4096 x   4096  | type F16  | T+  26
INFO:convert:[153/291] Writing tensor blk.17.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  26
INFO:convert:[154/291] Writing tensor blk.17.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  26
INFO:convert:[155/291] Writing tensor blk.18.attn_norm.weight                | size   4096           | type F32  | T+  26
INFO:convert:[156/291] Writing tensor blk.18.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  26
INFO:convert:[157/291] Writing tensor blk.18.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  27
INFO:convert:[158/291] Writing tensor blk.18.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  27
INFO:convert:[159/291] Writing tensor blk.18.ffn_norm.weight                 | size   4096           | type F32  | T+  27
INFO:convert:[160/291] Writing tensor blk.18.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  27
INFO:convert:[161/291] Writing tensor blk.18.attn_output.weight              | size   4096 x   4096  | type F16  | T+  27
INFO:convert:[162/291] Writing tensor blk.18.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  27
INFO:convert:[163/291] Writing tensor blk.18.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  27
INFO:convert:[164/291] Writing tensor blk.19.attn_norm.weight                | size   4096           | type F32  | T+  27
INFO:convert:[165/291] Writing tensor blk.19.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  28
INFO:convert:[166/291] Writing tensor blk.19.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  28
INFO:convert:[167/291] Writing tensor blk.19.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  28
INFO:convert:[168/291] Writing tensor blk.19.ffn_norm.weight                 | size   4096           | type F32  | T+  28
INFO:convert:[169/291] Writing tensor blk.19.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  28
INFO:convert:[170/291] Writing tensor blk.19.attn_output.weight              | size   4096 x   4096  | type F16  | T+  28
INFO:convert:[171/291] Writing tensor blk.19.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  29
INFO:convert:[172/291] Writing tensor blk.19.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  29
INFO:convert:[173/291] Writing tensor blk.20.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  29
INFO:convert:[174/291] Writing tensor blk.20.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  29
INFO:convert:[175/291] Writing tensor blk.20.attn_output.weight              | size   4096 x   4096  | type F16  | T+  29
INFO:convert:[176/291] Writing tensor blk.20.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  29
INFO:convert:[177/291] Writing tensor blk.20.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  29
INFO:convert:[178/291] Writing tensor blk.9.attn_norm.weight                 | size   4096           | type F32  | T+  29
INFO:convert:[179/291] Writing tensor blk.9.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  30
INFO:convert:[180/291] Writing tensor blk.9.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  30
INFO:convert:[181/291] Writing tensor blk.9.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  30
INFO:convert:[182/291] Writing tensor blk.9.ffn_norm.weight                  | size   4096           | type F32  | T+  30
INFO:convert:[183/291] Writing tensor blk.9.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  30
INFO:convert:[184/291] Writing tensor blk.9.attn_output.weight               | size   4096 x   4096  | type F16  | T+  30
INFO:convert:[185/291] Writing tensor blk.9.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  30
INFO:convert:[186/291] Writing tensor blk.9.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  30
INFO:convert:[187/291] Writing tensor blk.20.attn_norm.weight                | size   4096           | type F32  | T+  30
INFO:convert:[188/291] Writing tensor blk.20.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  31
INFO:convert:[189/291] Writing tensor blk.20.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  31
INFO:convert:[190/291] Writing tensor blk.20.ffn_norm.weight                 | size   4096           | type F32  | T+  31
INFO:convert:[191/291] Writing tensor blk.21.attn_norm.weight                | size   4096           | type F32  | T+  31
INFO:convert:[192/291] Writing tensor blk.21.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  31
INFO:convert:[193/291] Writing tensor blk.21.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  31
INFO:convert:[194/291] Writing tensor blk.21.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  32
INFO:convert:[195/291] Writing tensor blk.21.ffn_norm.weight                 | size   4096           | type F32  | T+  32
INFO:convert:[196/291] Writing tensor blk.21.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  32
INFO:convert:[197/291] Writing tensor blk.21.attn_output.weight              | size   4096 x   4096  | type F16  | T+  32
INFO:convert:[198/291] Writing tensor blk.21.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  32
INFO:convert:[199/291] Writing tensor blk.21.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  32
INFO:convert:[200/291] Writing tensor blk.22.attn_norm.weight                | size   4096           | type F32  | T+  32
INFO:convert:[201/291] Writing tensor blk.22.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  33
INFO:convert:[202/291] Writing tensor blk.22.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  33
INFO:convert:[203/291] Writing tensor blk.22.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  33
INFO:convert:[204/291] Writing tensor blk.22.ffn_norm.weight                 | size   4096           | type F32  | T+  33
INFO:convert:[205/291] Writing tensor blk.22.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  33
INFO:convert:[206/291] Writing tensor blk.22.attn_output.weight              | size   4096 x   4096  | type F16  | T+  33
INFO:convert:[207/291] Writing tensor blk.22.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  33
INFO:convert:[208/291] Writing tensor blk.22.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  33
INFO:convert:[209/291] Writing tensor blk.23.attn_norm.weight                | size   4096           | type F32  | T+  33
INFO:convert:[210/291] Writing tensor blk.23.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  33
INFO:convert:[211/291] Writing tensor blk.23.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  34
INFO:convert:[212/291] Writing tensor blk.23.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  34
INFO:convert:[213/291] Writing tensor blk.23.ffn_norm.weight                 | size   4096           | type F32  | T+  34
INFO:convert:[214/291] Writing tensor blk.23.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  34
INFO:convert:[215/291] Writing tensor blk.23.attn_output.weight              | size   4096 x   4096  | type F16  | T+  34
INFO:convert:[216/291] Writing tensor blk.23.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  34
INFO:convert:[217/291] Writing tensor blk.23.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  34
INFO:convert:[218/291] Writing tensor blk.24.attn_norm.weight                | size   4096           | type F32  | T+  34
INFO:convert:[219/291] Writing tensor blk.24.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  35
INFO:convert:[220/291] Writing tensor blk.24.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  35
INFO:convert:[221/291] Writing tensor blk.24.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  35
INFO:convert:[222/291] Writing tensor blk.24.ffn_norm.weight                 | size   4096           | type F32  | T+  35
INFO:convert:[223/291] Writing tensor blk.24.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  35
INFO:convert:[224/291] Writing tensor blk.24.attn_output.weight              | size   4096 x   4096  | type F16  | T+  35
INFO:convert:[225/291] Writing tensor blk.24.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  35
INFO:convert:[226/291] Writing tensor blk.24.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  35
INFO:convert:[227/291] Writing tensor blk.25.attn_norm.weight                | size   4096           | type F32  | T+  35
INFO:convert:[228/291] Writing tensor blk.25.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  36
INFO:convert:[229/291] Writing tensor blk.25.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  36
INFO:convert:[230/291] Writing tensor blk.25.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  36
INFO:convert:[231/291] Writing tensor blk.25.ffn_norm.weight                 | size   4096           | type F32  | T+  37
INFO:convert:[232/291] Writing tensor blk.25.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  37
INFO:convert:[233/291] Writing tensor blk.25.attn_output.weight              | size   4096 x   4096  | type F16  | T+  37
INFO:convert:[234/291] Writing tensor blk.25.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  37
INFO:convert:[235/291] Writing tensor blk.25.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  37
INFO:convert:[236/291] Writing tensor blk.26.attn_norm.weight                | size   4096           | type F32  | T+  37
INFO:convert:[237/291] Writing tensor blk.26.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  37
INFO:convert:[238/291] Writing tensor blk.26.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  37
INFO:convert:[239/291] Writing tensor blk.26.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  37
INFO:convert:[240/291] Writing tensor blk.26.ffn_norm.weight                 | size   4096           | type F32  | T+  38
INFO:convert:[241/291] Writing tensor blk.26.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  38
INFO:convert:[242/291] Writing tensor blk.26.attn_output.weight              | size   4096 x   4096  | type F16  | T+  38
INFO:convert:[243/291] Writing tensor blk.26.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  38
INFO:convert:[244/291] Writing tensor blk.26.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  38
INFO:convert:[245/291] Writing tensor blk.27.attn_norm.weight                | size   4096           | type F32  | T+  38
INFO:convert:[246/291] Writing tensor blk.27.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  39
INFO:convert:[247/291] Writing tensor blk.27.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  39
INFO:convert:[248/291] Writing tensor blk.27.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  39
INFO:convert:[249/291] Writing tensor blk.27.ffn_norm.weight                 | size   4096           | type F32  | T+  39
INFO:convert:[250/291] Writing tensor blk.27.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  39
INFO:convert:[251/291] Writing tensor blk.27.attn_output.weight              | size   4096 x   4096  | type F16  | T+  39
INFO:convert:[252/291] Writing tensor blk.27.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  39
INFO:convert:[253/291] Writing tensor blk.27.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  39
INFO:convert:[254/291] Writing tensor blk.28.attn_norm.weight                | size   4096           | type F32  | T+  39
INFO:convert:[255/291] Writing tensor blk.28.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  40
INFO:convert:[256/291] Writing tensor blk.28.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  40
INFO:convert:[257/291] Writing tensor blk.28.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  40
INFO:convert:[258/291] Writing tensor blk.28.ffn_norm.weight                 | size   4096           | type F32  | T+  40
INFO:convert:[259/291] Writing tensor blk.28.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  40
INFO:convert:[260/291] Writing tensor blk.28.attn_output.weight              | size   4096 x   4096  | type F16  | T+  40
INFO:convert:[261/291] Writing tensor blk.28.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  40
INFO:convert:[262/291] Writing tensor blk.28.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  40
INFO:convert:[263/291] Writing tensor blk.29.attn_norm.weight                | size   4096           | type F32  | T+  40
INFO:convert:[264/291] Writing tensor blk.29.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  41
INFO:convert:[265/291] Writing tensor blk.29.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  41
INFO:convert:[266/291] Writing tensor blk.29.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  41
INFO:convert:[267/291] Writing tensor blk.29.ffn_norm.weight                 | size   4096           | type F32  | T+  41
INFO:convert:[268/291] Writing tensor blk.29.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  41
INFO:convert:[269/291] Writing tensor blk.29.attn_output.weight              | size   4096 x   4096  | type F16  | T+  41
INFO:convert:[270/291] Writing tensor blk.29.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  41
INFO:convert:[271/291] Writing tensor blk.29.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  41
INFO:convert:[272/291] Writing tensor blk.30.attn_norm.weight                | size   4096           | type F32  | T+  41
INFO:convert:[273/291] Writing tensor blk.30.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  42
INFO:convert:[274/291] Writing tensor blk.30.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  42
INFO:convert:[275/291] Writing tensor blk.30.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  43
INFO:convert:[276/291] Writing tensor blk.30.ffn_norm.weight                 | size   4096           | type F32  | T+  43
INFO:convert:[277/291] Writing tensor blk.30.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  43
INFO:convert:[278/291] Writing tensor blk.30.attn_output.weight              | size   4096 x   4096  | type F16  | T+  43
INFO:convert:[279/291] Writing tensor blk.30.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  43
INFO:convert:[280/291] Writing tensor blk.30.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  43
INFO:convert:[281/291] Writing tensor blk.31.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  43
INFO:convert:[282/291] Writing tensor blk.31.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  43
INFO:convert:[283/291] Writing tensor blk.31.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  43
INFO:convert:[284/291] Writing tensor blk.31.attn_output.weight              | size   4096 x   4096  | type F16  | T+  43
INFO:convert:[285/291] Writing tensor blk.31.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  43
INFO:convert:[286/291] Writing tensor blk.31.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  44
INFO:convert:[287/291] Writing tensor output.weight                          | size 128256 x   4096  | type F16  | T+  48
INFO:convert:[288/291] Writing tensor blk.31.attn_norm.weight                | size   4096           | type F32  | T+  49
INFO:convert:[289/291] Writing tensor blk.31.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  49
INFO:convert:[290/291] Writing tensor blk.31.ffn_norm.weight                 | size   4096           | type F32  | T+  49
INFO:convert:[291/291] Writing tensor output_norm.weight                     | size   4096           | type F32  | T+  49
INFO:convert:Wrote models/ggml-vocab-llama3-8B-instruct-f16.gguf

7. Quantize the model

quantize

Quantization of LLMs with llama.cpp

A Concise Handbook of Llama.cpp Quantization

7.1 View the supported quantization types

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./quantize -h
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quatized model in the same shards as input
  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.

Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16     : 14.00G, -0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing
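The Q4_0 format listed in the `quantize -h` output stores each block of 32 weights as one fp16 scale (16 bits) plus 32 four-bit values, i.e. (16 + 32 × 4) / 32 = 4.5 bits per weight. A quick back-of-the-envelope check in plain Python (no llama.cpp needed) reproduces the per-tensor sizes that `quantize` prints during conversion, here for the `token_embd.weight` tensor of Llama3-8B (4096 × 128256):

```python
# Q4_0 block layout: one fp16 scale (16 bits) + 32 weights at 4 bits each
bits_per_weight = (16 + 32 * 4) / 32       # = 4.5 bits per weight

n_elements = 4096 * 128256                 # token_embd.weight of Llama3-8B
f16_mib = n_elements * 16 / 8 / 2**20      # original F16 size in MiB
q4_mib = n_elements * bits_per_weight / 8 / 2**20

print(f"{f16_mib:.2f} MiB -> {q4_mib:.2f} MiB")  # 1002.00 MiB -> 281.81 MiB
```

This matches the `size =  1002.00 MiB ->   281.81 MiB` line that appears in the quantization log below, and explains the roughly 3.5x size reduction from F16 to Q4_0.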

Notes:

  1. Use quantize to quantize the model; it offers a range of quantization bit widths: Q2, Q3, Q4, Q5, Q6, Q8, and F16.
  2. Quantized model names follow the pattern: Q + quantization bits + variant. The fewer the quantization bits, the lower the hardware requirements, but also the lower the model accuracy.
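To make the accuracy trade-off in note 2 concrete, here is a minimal Python sketch of the Q4_0 idea: each 32-weight block shares one fp16 scale, and every weight in the block is rounded to one of 16 levels. This is a simplified illustration of the scheme, not ggml's exact implementation (which additionally packs two 4-bit values per byte and uses slightly different rounding):

```python
import numpy as np

def quantize_q4_0(block):
    """Quantize one block of 32 floats: one fp16 scale + 32 values in 0..15."""
    assert block.size == 32
    amax = block[np.argmax(np.abs(block))]  # element with largest magnitude (sign kept)
    scale = amax / -8.0                     # maps amax exactly onto level -8
    inv = 1.0 / scale if scale != 0 else 0.0
    q = np.clip(np.round(block * inv) + 8, 0, 15).astype(np.uint8)
    return np.float16(scale), q

def dequantize_q4_0(scale, q):
    """Reconstruct approximate weights from the 4-bit codes."""
    return (q.astype(np.float32) - 8) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal(32).astype(np.float32)
scale, q = quantize_q4_0(w)
w_hat = dequantize_q4_0(scale, q)
print("max reconstruction error:", float(np.max(np.abs(w - w_hat))))
```

The reconstruction error per weight is bounded by roughly half the block scale, which is why aggressive low-bit formats (Q2, Q3) show larger perplexity increases in the table above than Q5, Q6, or Q8.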

7.2 Run the quantization

Apply 4-bit quantization to the FP16 model.

./quantize models/ggml-vocab-llama3-8B-instruct-f16.gguf models/ggml-vocab-llama3-8B-instruct-q4_0.gguf Q4_0
(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./quantize models/ggml-vocab-llama3-8B-instruct-f16.gguf models/ggml-vocab-llama3-8B-instruct-q4_0.gguf Q4_0
main: build = 3045 (59b0d077)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: quantizing 'models/ggml-vocab-llama3-8B-instruct-f16.gguf' to 'models/ggml-vocab-llama3-8B-instruct-q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from models/ggml-vocab-llama3-8B-instruct-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
[   1/ 291]                    token_embd.weight - [ 4096, 128256,     1,     1], type =    f16, converting to q4_0 .. size =  1002.00 MiB ->   281.81 MiB
[   2/ 291]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   3/ 291]                blk.0.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[   4/ 291]                blk.0.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[   5/ 291]                  blk.0.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[   6/ 291]                blk.0.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   7/ 291]                  blk.0.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[   8/ 291]             blk.0.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[   9/ 291]                  blk.0.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  10/ 291]                  blk.0.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  11/ 291]               blk.1.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  12/ 291]                blk.1.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  13/ 291]                blk.1.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  14/ 291]                  blk.1.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  15/ 291]                blk.1.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  16/ 291]                  blk.1.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  17/ 291]             blk.1.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  18/ 291]                  blk.1.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  19/ 291]                  blk.1.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  20/ 291]               blk.2.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  21/ 291]                blk.2.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  22/ 291]                blk.2.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  23/ 291]                  blk.2.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  24/ 291]                blk.2.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  25/ 291]                  blk.2.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  26/ 291]             blk.2.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  27/ 291]                  blk.2.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  28/ 291]                  blk.2.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  29/ 291]               blk.3.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  30/ 291]                blk.3.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  31/ 291]                blk.3.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  32/ 291]                  blk.3.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  33/ 291]                blk.3.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  34/ 291]                  blk.3.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  35/ 291]             blk.3.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  36/ 291]                  blk.3.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  37/ 291]                  blk.3.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  38/ 291]               blk.4.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  39/ 291]                blk.4.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  40/ 291]                blk.4.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  41/ 291]                  blk.4.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  42/ 291]                blk.4.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  43/ 291]                  blk.4.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  44/ 291]             blk.4.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  45/ 291]                  blk.4.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  46/ 291]                  blk.4.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  47/ 291]               blk.5.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  48/ 291]                blk.5.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  49/ 291]                blk.5.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  50/ 291]                  blk.5.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  51/ 291]                blk.5.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  52/ 291]                  blk.5.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  53/ 291]             blk.5.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  54/ 291]                  blk.5.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  55/ 291]                  blk.5.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  56/ 291]               blk.6.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  57/ 291]                blk.6.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  58/ 291]                blk.6.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  59/ 291]                  blk.6.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  60/ 291]                blk.6.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  61/ 291]                  blk.6.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  62/ 291]             blk.6.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  63/ 291]                  blk.6.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  64/ 291]                  blk.6.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  65/ 291]               blk.7.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  66/ 291]                blk.7.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  67/ 291]                blk.7.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  68/ 291]                  blk.7.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  69/ 291]                blk.7.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  70/ 291]                  blk.7.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  71/ 291]             blk.7.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  72/ 291]                  blk.7.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  73/ 291]                  blk.7.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  74/ 291]               blk.8.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  75/ 291]                blk.8.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  76/ 291]                blk.8.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  77/ 291]                  blk.8.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  78/ 291]                blk.8.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  79/ 291]                  blk.8.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  80/ 291]             blk.8.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  81/ 291]                  blk.8.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  82/ 291]                  blk.8.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  83/ 291]              blk.10.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  84/ 291]               blk.10.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  85/ 291]               blk.10.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  86/ 291]                 blk.10.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  87/ 291]               blk.10.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  88/ 291]                 blk.10.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  89/ 291]            blk.10.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  90/ 291]                 blk.10.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  91/ 291]                 blk.10.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  92/ 291]              blk.11.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  93/ 291]               blk.11.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  94/ 291]               blk.11.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  95/ 291]                 blk.11.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  96/ 291]               blk.11.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  97/ 291]                 blk.11.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  98/ 291]            blk.11.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  99/ 291]                 blk.11.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 100/ 291]                 blk.11.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 101/ 291]              blk.12.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 102/ 291]               blk.12.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 103/ 291]               blk.12.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 104/ 291]                 blk.12.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 105/ 291]               blk.12.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 106/ 291]                 blk.12.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 107/ 291]            blk.12.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 108/ 291]                 blk.12.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 109/ 291]                 blk.12.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 110/ 291]              blk.13.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 111/ 291]               blk.13.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 112/ 291]               blk.13.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 113/ 291]                 blk.13.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 114/ 291]               blk.13.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 115/ 291]                 blk.13.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 116/ 291]            blk.13.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 117/ 291]                 blk.13.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 118/ 291]                 blk.13.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 119/ 291]              blk.14.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 120/ 291]               blk.14.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 121/ 291]               blk.14.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 122/ 291]                 blk.14.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 123/ 291]               blk.14.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 124/ 291]                 blk.14.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 125/ 291]            blk.14.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 126/ 291]                 blk.14.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 127/ 291]                 blk.14.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 128/ 291]              blk.15.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 129/ 291]               blk.15.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 130/ 291]               blk.15.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 131/ 291]                 blk.15.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 132/ 291]               blk.15.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 133/ 291]                 blk.15.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 134/ 291]            blk.15.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 135/ 291]                 blk.15.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 136/ 291]                 blk.15.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 137/ 291]              blk.16.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 138/ 291]               blk.16.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 139/ 291]               blk.16.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 140/ 291]                 blk.16.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 141/ 291]               blk.16.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 142/ 291]                 blk.16.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 143/ 291]            blk.16.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 144/ 291]                 blk.16.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 145/ 291]                 blk.16.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 146/ 291]              blk.17.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 147/ 291]               blk.17.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 148/ 291]               blk.17.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 149/ 291]                 blk.17.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 150/ 291]               blk.17.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 151/ 291]                 blk.17.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 152/ 291]            blk.17.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 153/ 291]                 blk.17.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 154/ 291]                 blk.17.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 155/ 291]              blk.18.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 156/ 291]               blk.18.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 157/ 291]               blk.18.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 158/ 291]                 blk.18.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 159/ 291]               blk.18.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 160/ 291]                 blk.18.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 161/ 291]            blk.18.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 162/ 291]                 blk.18.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 163/ 291]                 blk.18.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 164/ 291]              blk.19.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 165/ 291]               blk.19.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 166/ 291]               blk.19.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 167/ 291]                 blk.19.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 168/ 291]               blk.19.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 169/ 291]                 blk.19.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 170/ 291]            blk.19.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 171/ 291]                 blk.19.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 172/ 291]                 blk.19.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 173/ 291]               blk.20.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 174/ 291]                 blk.20.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 175/ 291]            blk.20.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 176/ 291]                 blk.20.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 177/ 291]                 blk.20.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 178/ 291]               blk.9.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 179/ 291]                blk.9.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 180/ 291]                blk.9.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 181/ 291]                  blk.9.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 182/ 291]                blk.9.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 183/ 291]                  blk.9.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 184/ 291]             blk.9.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 185/ 291]                  blk.9.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 186/ 291]                  blk.9.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 187/ 291]              blk.20.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 188/ 291]               blk.20.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 189/ 291]                 blk.20.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 190/ 291]               blk.20.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 191/ 291]              blk.21.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 192/ 291]               blk.21.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 193/ 291]               blk.21.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 194/ 291]                 blk.21.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 195/ 291]               blk.21.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 196/ 291]                 blk.21.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 197/ 291]            blk.21.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 198/ 291]                 blk.21.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 199/ 291]                 blk.21.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 200/ 291]              blk.22.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 201/ 291]               blk.22.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 202/ 291]               blk.22.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 203/ 291]                 blk.22.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 204/ 291]               blk.22.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 205/ 291]                 blk.22.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 206/ 291]            blk.22.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 207/ 291]                 blk.22.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 208/ 291]                 blk.22.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 209/ 291]              blk.23.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 210/ 291]               blk.23.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 211/ 291]               blk.23.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 212/ 291]                 blk.23.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 213/ 291]               blk.23.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 214/ 291]                 blk.23.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 215/ 291]            blk.23.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 216/ 291]                 blk.23.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 217/ 291]                 blk.23.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 218/ 291]              blk.24.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 219/ 291]               blk.24.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 220/ 291]               blk.24.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 221/ 291]                 blk.24.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 222/ 291]               blk.24.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 223/ 291]                 blk.24.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 224/ 291]            blk.24.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 225/ 291]                 blk.24.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 226/ 291]                 blk.24.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 227/ 291]              blk.25.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 228/ 291]               blk.25.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 229/ 291]               blk.25.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 230/ 291]                 blk.25.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 231/ 291]               blk.25.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 232/ 291]                 blk.25.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 233/ 291]            blk.25.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 234/ 291]                 blk.25.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 235/ 291]                 blk.25.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 236/ 291]              blk.26.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 237/ 291]               blk.26.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 238/ 291]               blk.26.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 239/ 291]                 blk.26.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 240/ 291]               blk.26.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 241/ 291]                 blk.26.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 242/ 291]            blk.26.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 243/ 291]                 blk.26.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 244/ 291]                 blk.26.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 245/ 291]              blk.27.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 246/ 291]               blk.27.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 247/ 291]               blk.27.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 248/ 291]                 blk.27.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 249/ 291]               blk.27.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 250/ 291]                 blk.27.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 251/ 291]            blk.27.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 252/ 291]                 blk.27.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 253/ 291]                 blk.27.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 254/ 291]              blk.28.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 255/ 291]               blk.28.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 256/ 291]               blk.28.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 257/ 291]                 blk.28.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 258/ 291]               blk.28.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 259/ 291]                 blk.28.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 260/ 291]            blk.28.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 261/ 291]                 blk.28.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 262/ 291]                 blk.28.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 263/ 291]              blk.29.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 264/ 291]               blk.29.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 265/ 291]               blk.29.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 266/ 291]                 blk.29.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 267/ 291]               blk.29.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 268/ 291]                 blk.29.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 269/ 291]            blk.29.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 270/ 291]                 blk.29.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 271/ 291]                 blk.29.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 272/ 291]              blk.30.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 273/ 291]               blk.30.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 274/ 291]               blk.30.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 275/ 291]                 blk.30.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 276/ 291]               blk.30.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 277/ 291]                 blk.30.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 278/ 291]            blk.30.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 279/ 291]                 blk.30.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 280/ 291]                 blk.30.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 281/ 291]               blk.31.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 282/ 291]                 blk.31.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 283/ 291]                 blk.31.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 284/ 291]            blk.31.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 285/ 291]                 blk.31.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 286/ 291]                 blk.31.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 287/ 291]                        output.weight - [ 4096, 128256,     1,     1], type =    f16, converting to q6_K .. size =  1002.00 MiB ->   410.98 MiB
[ 288/ 291]              blk.31.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 289/ 291]               blk.31.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 290/ 291]               blk.31.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 291/ 291]                   output_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
llama_model_quantize_internal: model size  = 15317.02 MB
llama_model_quantize_internal: quant size  =  4437.80 MB

main: quantize time = 61476.12 ms
main:    total time = 61476.12 ms

经过Q4_0量化后,模型大小从15317.02 MB降至4437.80 MB,约为原来的29%;代价是权重精度从16位浮点数(f16)降低到4位量化(q4_0),可能带来一定的精度损失。

更详细的使用教程请访问:https://github.com/ggerganov/llama.cpp#quantization

8. 模型推理

8.1 main指令

llama.cpp/examples/main

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./main -h
usage: ./main [options]

options:
  -h, --help            show this help message and exit
  --version             show version and build info
  -i, --interactive     run in interactive mode
  --special             special tokens output enabled
  --interactive-specials
                        allow special tokens in user text, in interactive mode
  --interactive-first   run in interactive mode and wait for input right away
  -cnv, --conversation  run in conversation mode (does not print special tokens and suffix/prefix)
  -ins, --instruct      run in instruction mode (use with Alpaca models)
  -cml, --chatml        run in chatml mode (use with ChatML-compatible models)
  --multiline-input     allows you to write or paste multiple lines without ending each in '\'
  -r PROMPT, --reverse-prompt PROMPT
                        halt generation at PROMPT, return control in interactive mode
                        (can be specified more than once for multiple prompts).
  --color               colorise output to distinguish prompt and user input from generations
  -s SEED, --seed SEED  RNG seed (default: -1, use random seed for < 0)
  -t N, --threads N     number of threads to use during generation (default: 128)
  -tb N, --threads-batch N
                        number of threads to use during batch and prompt processing (default: same as --threads)
  -td N, --threads-draft N
                        number of threads to use during generation (default: same as --threads)
  -tbd N, --threads-batch-draft N
                        number of threads to use during batch and prompt processing (default: same as --threads-draft)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: empty)
  -e, --escape          process prompt escapes sequences (\n, \r, \t, \', \", \\)
  --prompt-cache FNAME  file to cache prompt state for faster startup (default: none)
  --prompt-cache-all    if specified, saves user input and generations to cache as well.
                        not supported with --interactive or other interactive options
  --prompt-cache-ro     if specified, uses the prompt cache but does not update it.
  --random-prompt       start with a randomized prompt.
  --in-prefix-bos       prefix BOS to user inputs, preceding the `--in-prefix` string
  --in-prefix STRING    string to prefix user inputs with (default: empty)
  --in-suffix STRING    string to suffix after user inputs with (default: empty)
  -f FNAME, --file FNAME
                        prompt file to start generation.
  -bf FNAME, --binary-file FNAME
                        binary file containing multiple choice tasks.
  -n N, --n-predict N   number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
  -c N, --ctx-size N    size of the prompt context (default: 512, 0 = loaded from model)
  -b N, --batch-size N  logical maximum batch size (default: 2048)
  -ub N, --ubatch-size N
                        physical maximum batch size (default: 512)
  --samplers            samplers that will be used for generation in the order, separated by ';'
                        (default: top_k;tfs_z;typical_p;top_p;min_p;temperature)
  --sampling-seq        simplified sequence for samplers that will be used (default: kfypmt)
  --top-k N             top-k sampling (default: 40, 0 = disabled)
  --top-p N             top-p sampling (default: 0.9, 1.0 = disabled)
  --min-p N             min-p sampling (default: 0.1, 0.0 = disabled)
  --tfs N               tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
  --typical N           locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
  --repeat-last-n N     last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
  --repeat-penalty N    penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
  --presence-penalty N  repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
  --frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
  --dynatemp-range N    dynamic temperature range (default: 0.0, 0.0 = disabled)
  --dynatemp-exp N      dynamic temperature exponent (default: 1.0)
  --mirostat N          use Mirostat sampling.
                        Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.
                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
  --mirostat-lr N       Mirostat learning rate, parameter eta (default: 0.1)
  --mirostat-ent N      Mirostat target entropy, parameter tau (default: 5.0)
  -l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS
                        modifies the likelihood of token appearing in the completion,
                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
  --grammar GRAMMAR     BNF-like grammar to constrain generations (see samples in grammars/ dir)
  --grammar-file FNAME  file to read grammar from
  -j SCHEMA, --json-schema SCHEMA
                        JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object.
                        For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead
  --cfg-negative-prompt PROMPT
                        negative prompt to use for guidance. (default: empty)
  --cfg-negative-prompt-file FNAME
                        negative prompt file to use for guidance. (default: empty)
  --cfg-scale N         strength of guidance (default: 1.000000, 1.0 = disable)
  --rope-scaling {none,linear,yarn}
                        RoPE frequency scaling method, defaults to linear unless specified by the model
  --rope-scale N        RoPE context scaling factor, expands context by a factor of N
  --rope-freq-base N    RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
  --rope-freq-scale N   RoPE frequency scaling factor, expands context by a factor of 1/N
  --yarn-orig-ctx N     YaRN: original context size of model (default: 0 = model training context size)
  --yarn-ext-factor N   YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation)
  --yarn-attn-factor N  YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
  --yarn-beta-slow N    YaRN: high correction dim or alpha (default: 1.0)
  --yarn-beta-fast N    YaRN: low correction dim or beta (default: 32.0)
  --pooling {none,mean,cls}
                        pooling type for embeddings, use model default if unspecified
  -dt N, --defrag-thold N
                        KV cache defragmentation threshold (default: -1.0, < 0 - disabled)
  --ignore-eos          ignore end of stream token and continue generating (implies --logit-bias 2-inf)
  --penalize-nl         penalize newline tokens
  --temp N              temperature (default: 0.8)
  --all-logits          return logits for all tokens in the batch (default: disabled)
  --hellaswag           compute HellaSwag score over random tasks from datafile supplied with -f
  --hellaswag-tasks N   number of tasks to use when computing the HellaSwag score (default: 400)
  --winogrande          compute Winogrande score over random tasks from datafile supplied with -f
  --winogrande-tasks N  number of tasks to use when computing the Winogrande score (default: 0)
  --multiple-choice     compute multiple choice score over random tasks from datafile supplied with -f
  --multiple-choice-tasks N
                        number of tasks to use when computing the multiple choice score (default: 0)
  --kl-divergence       computes KL-divergence to logits provided via --kl-divergence-base
  --keep N              number of tokens to keep from the initial prompt (default: 0, -1 = all)
  --draft N             number of tokens to draft for speculative decoding (default: 5)
  --chunks N            max number of chunks to process (default: -1, -1 = all)
  -np N, --parallel N   number of parallel sequences to decode (default: 1)
  -ns N, --sequences N  number of sequences to decode (default: 1)
  -ps N, --p-split N    speculative decoding split probability (default: 0.1)
  -cb, --cont-batching  enable continuous batching (a.k.a dynamic batching) (default: disabled)
  -fa, --flash-attn     enable Flash Attention (default: disabled)
  --mmproj MMPROJ_FILE  path to a multimodal projector file for LLaVA. see examples/llava/README.md
  --image IMAGE_FILE    path to an image file. use with multimodal models. Specify multiple times for batching
  --mlock               force system to keep model in RAM rather than swapping or compressing
  --no-mmap             do not memory-map model (slower load but may reduce pageouts if not using mlock)
  --numa TYPE           attempt optimizations that help on some NUMA systems
                        - distribute: spread execution evenly over all nodes
                        - isolate: only spawn threads on CPUs on the node that execution started on
                        - numactl: use the CPU map provided by numactl
                        if run without this previously, it is recommended to drop the system page cache before using this
                        see https://github.com/ggerganov/llama.cpp/issues/1437
  --rpc SERVERS         comma separated list of RPC servers
  --verbose-prompt      print a verbose prompt before generation (default: false)
  --no-display-prompt   don't print prompt at generation (default: false)
  -gan N, --grp-attn-n N
                        group-attention factor (default: 1)
  -gaw N, --grp-attn-w N
                        group-attention width (default: 512.0)
  -dkvc, --dump-kv-cache
                        verbose print of the KV cache
  -nkvo, --no-kv-offload
                        disable KV offload
  -ctk TYPE, --cache-type-k TYPE
                        KV cache data type for K (default: f16)
  -ctv TYPE, --cache-type-v TYPE
                        KV cache data type for V (default: f16)
  --simple-io           use basic IO for better compatibility in subprocesses and limited consoles
  --lora FNAME          apply LoRA adapter (implies --no-mmap)
  --lora-scaled FNAME S apply LoRA adapter with user defined scaling S (implies --no-mmap)
  --lora-base FNAME     optional model to use as a base for the layers modified by the LoRA adapter
  --control-vector FNAME
                        add a control vector
  --control-vector-scaled FNAME S
                        add a control vector with user defined scaling S
  --control-vector-layer-range START END
                        layer range to apply the control vector(s) to, start and end inclusive
  -m FNAME, --model FNAME
                        model path (default: models/$filename with filename from --hf-file or --model-url if set, otherwise models/7B/ggml-model-f16.gguf)
  -md FNAME, --model-draft FNAME
                        draft model for speculative decoding (default: unused)
  -mu MODEL_URL, --model-url MODEL_URL
                        model download url (default: unused)
  -hfr REPO, --hf-repo REPO
                        Hugging Face model repository (default: unused)
  -hff FILE, --hf-file FILE
                        Hugging Face model file (default: unused)
  -ld LOGDIR, --logdir LOGDIR
                        path under which to save YAML logs (no logging if unset)
  -lcs FNAME, --lookup-cache-static FNAME
                        path to static lookup cache to use for lookup decoding (not updated by generation)
  -lcd FNAME, --lookup-cache-dynamic FNAME
                        path to dynamic lookup cache to use for lookup decoding (updated by generation)
  --override-kv KEY=TYPE:VALUE
                        advanced option to override model metadata by key. may be specified multiple times.
                        types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
  -ptc N, --print-token-count N
                        print token count every N tokens (default: -1)
  --check-tensors       check model tensor data for invalid values

log options:
  --log-test            Run simple logging test
  --log-disable         Disable trace logs
  --log-enable          Enable trace logs
  --log-file            Specify a log filename (without extension)
  --log-new             Create a separate new log file on start. Each log file will have unique name: "<name>.<ID>.log"

Parameter reference

| Flag | Description |
| --- | --- |
| -m | Path to the LLaMA model file |
| -mu | Download the model file from a remote HTTP URL |
| -i | Run in interactive mode |
| -ins | Run in instruction mode, a ChatGPT-style conversational mode |
| -f | Prompt template file; for Alpaca-style models, load the prompts/alpaca.txt template |
| -n | Maximum length of the generated reply (default: -1, i.e. unlimited) |
| -c | Size of the prompt context; larger values let the model draw on a longer conversation history (default: 512) |
| -b | Batch size (default: 2048) |
| -t | Number of threads (default depends on the machine; 128 on this test server) |
| --repeat_penalty | Strength of the penalty against repeated text in the reply |
| --temp | Temperature; lower values make the reply less random |
| --top_p, --top_k | Sampling parameters used during decoding |
| --color | Colorize output to distinguish user input from generated text |

For the full official documentation, see: https://github.com/ggerganov/llama.cpp/tree/master/examples/main
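As an illustration of how these flags combine, a one-shot (non-interactive) run might look like the sketch below. This is a hypothetical example, not a command from the original write-up: the model path is the one used later in this article, and the prompt text is made up; `-p` supplies a single prompt in place of the interactive `-ins`/`-f` combination.

```shell
# Hypothetical one-shot run: -p passes a single prompt instead of interactive mode.
./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf \
  -p "Explain what GGUF is in one sentence." \
  -n 128 -c 2048 -t 8 --temp 0.2 --repeat_penalty 1.1 --color
```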

8.2 CPU Inference

When CPU inference starts, the program appears to stall while it works, with CPU utilization staying above 90%.

# Run inference in instruction mode
./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1
(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1
Log start
main: build = 3045 (59b0d077)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1723114741
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/ggml-vocab-llama3-8B-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 256.
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  4437.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 128 / 255 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
Reverse prompt: '### Instruction:'
sampling:
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.200
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 256, n_keep = 19

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

Below is an instruction that describes a task. Write a response that appropriately completes the request.
> hi
Hello! I'm happy to help. Please provide more context or clarify what you would like me to assist you with, and I'll do my best to respond accordingly.
> 

Type your prompt after the > prompt. Press Ctrl+C to interrupt generation, and end a line with \ to continue your input on the next line. For help and a description of all parameters, run ./main -h.
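As a sanity check on the log above, the reported "KV self size = 256.00 MiB" follows directly from the printed hyperparameters: with an f16 cache (2 bytes per value), each of K and V stores n_layer × n_embd_k_gqa × n_ctx values (32 × 1024 × 2048 here). A minimal sketch of the arithmetic:

```shell
# Each of K and V stores n_layer * n_embd_k_gqa * n_ctx f16 values (2 bytes each).
awk 'BEGIN {
  k_mib = 32 * 1024 * 2048 * 2 / (1024 * 1024)   # K cache in MiB
  printf "K: %.2f MiB, V: %.2f MiB, total: %.2f MiB\n", k_mib, k_mib, 2 * k_mib
}'
# -> K: 128.00 MiB, V: 128.00 MiB, total: 256.00 MiB
```

This also explains why -ctk/-ctv (quantized KV cache types) and a smaller -c shrink memory use: both scale this product directly.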

8.3 GPU/DCU Inference

Use -ngl N (or --n-gpu-layers N) to set how many network layers are offloaded to the GPU.

# Select the GPU
export HIP_VISIBLE_DEVICES="0"
# Specify the GFX version
export HSA_OVERRIDE_GFX_VERSION=9.2.8
# Run inference in instruction mode
./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1 --n_gpu_layers 40 --no-mmap
# Or, as a single line
export HSA_OVERRIDE_GFX_VERSION=9.2.8 && export HIP_VISIBLE_DEVICES=0 && ./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1 --n_gpu_layers 40 --no-mmap
(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# export HSA_OVERRIDE_GFX_VERSION=9.2.8 && export HIP_VISIBLE_DEVICES=0 && ./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1 --n_gpu_layers 40 --no-mmap
Log start
main: build = 3045 (59b0d077)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1723178798
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/ggml-vocab-llama3-8B-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 256.
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: DCU K100_AI, compute capability 9.2, VMM: no
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  4155.99 MiB
llm_load_tensors:  ROCm_Host buffer size =   281.81 MiB
..........................................
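With 33/33 layers offloaded, the two device-side buffers account for the whole model: ROCm0 (weights resident on the DCU) plus ROCm_Host (which I read as a pinned host-side staging buffer) add up exactly to the 4437.80 MiB that the CPU-only run above allocated. A quick check:

```shell
# ROCm0 + ROCm_Host buffer sizes from the log above; the sum should equal the
# 4437.80 MiB "CPU buffer size" reported in the CPU-only run.
awk 'BEGIN { printf "%.2f MiB\n", 4155.99 + 281.81 }'
# -> 4437.80 MiB
```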

