Deploying Open-Source Models - Hands-On with GLM - glm-4-9b-chat - vLLM Integration (Part 4)

1. Preface

GLM-4 is a foundation model released by the Zhipu AI team on January 16, 2024. It is designed to understand and plan complex user instructions automatically and can invoke a web browser; its capabilities include data analysis, chart creation, and PPT generation. It supports a 128K context window, giving it strong long-text processing and precise recall, and its Chinese alignment ability surpasses GPT-4. Compared with earlier GLM models, GLM-4 improves overall performance by 60% and markedly strengthens instruction following and multimodal capability, making it suitable for a wide range of applications. Although it still trails the top international models in some areas, its Chinese-language capability puts it at the forefront of domestic large models. Development of the GLM series began in 2020, and this high-performance system is the result of multiple iterations and improvements.

Part 1 of this series, Deploying Open-Source Models - Hands-On with GLM - glm-4-9b-chat - Quick Start (Part 1), covered the basics of getting started with glm-4-9b-chat.

Part 2, Deploying Open-Source Models - Hands-On with GLM - glm-4-9b-chat - Batch Inference (Part 2), covered batch inference with glm-4-9b-chat.

Part 3, Deploying Open-Source Models - Hands-On with GLM - glm-4-9b-chat - Gradio Integration (Part 3), covered integrating Gradio for a web UI.

This installment covers integrating vLLM for inference acceleration.


2. Terminology

2.1. GLM-4-9B

GLM-4-9B is an open-source pre-trained model from Zhipu AI's GLM-4 series, released on June 6, 2024. It is designed for efficient language understanding and generation, supports context inputs of up to 1M tokens (roughly two million Chinese characters), has stronger base capabilities, covers 26 languages, and marks the series' first significant step into multimodal capability.

The base capabilities of GLM-4-9B include:

- Overall Chinese and English performance improved by 40%, with especially strong gains in Chinese alignment, instruction following, and engineering-code tasks

- Performance gains over Llama 3 8B, particularly on complex tasks such as math problem solving and code writing

- Enhanced function-calling capability, with performance improved by 40%

- Support for multi-turn dialogue as well as advanced features such as web browsing, code execution, and custom tool calling, enabling it to process large amounts of information quickly and return high-quality answers

2.2. GLM-4-9B-Chat

GLM-4-9B-Chat is the dialogue variant that Zhipu AI ships in the GLM-4-9B series. It is designed for multi-turn conversation and adds advanced capabilities that make it more efficient and flexible across natural-language tasks.

2.3. vLLM

vLLM is an open-source inference-acceleration framework for large language models. Its PagedAttention mechanism manages the attention KV-cache tensors efficiently, delivering 14-24x the throughput of HuggingFace Transformers.
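To get a feel for the vLLM API before building the full server in section 4, here is a minimal offline-inference sketch (assumptions: the local model path used later in this article, FP16 and an 8K context to match the V100 setup in section 3):

# Minimal offline batching with vLLM; PagedAttention schedules both prompts in one call.
from vllm import LLM, SamplingParams

llm = LLM(model="/data/model/glm-4-9b-chat", trust_remote_code=True,
          dtype="float16", max_model_len=8192)
sampling_params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=256)

outputs = llm.generate(["请介绍一下广州。", "什么是vLLM?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)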


3. Prerequisites

3.1. Base Environment and Prerequisites

1. Operating system: CentOS 7

2. GPU: NVIDIA Tesla V100 32GB, CUDA version 12.2

3. Minimum hardware: a GPU with enough memory to hold the FP16 weights plus KV cache - the startup log in section 4.2 shows the weights alone occupy about 17.6 GB

3.2. Download the Model

Hugging Face:

https://huggingface.co/THUDM/glm-4-9b-chat/tree/main
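The repository can be fetched with the huggingface-cli tool from the huggingface-hub package pinned in section 3.4 (a sketch; the target directory matches the model path used later in this article):

huggingface-cli download THUDM/glm-4-9b-chat --local-dir /data/model/glm-4-9b-chat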

ModelScope:

https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat

Example download using git-lfs:
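A minimal sketch, assuming the ModelScope repository path ZhipuAI/glm-4-9b-chat and cloning into the model directory used later in this article:

git lfs install
git clone https://www.modelscope.cn/ZhipuAI/glm-4-9b-chat.git /data/model/glm-4-9b-chat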

3.3. Create a Virtual Environment

conda create --name glm4 python=3.10
conda activate glm4

3.4. Install Dependencies

Quote each version specifier so the shell does not treat >= as output redirection:

pip install "torch>=2.5.0"
pip install "torchvision>=0.20.0"
pip install "transformers>=4.46.0"
pip install "huggingface-hub>=0.25.1"
pip install "sentencepiece>=0.2.0"
pip install "jinja2>=3.1.4"
pip install "pydantic>=2.9.2"
pip install "timm>=1.0.9"
pip install "tiktoken>=0.7.0"
pip install "numpy==1.26.4"
pip install "accelerate>=1.0.1"
pip install "sentence_transformers>=3.1.1"
pip install "openai>=1.51.0"
pip install "einops>=0.8.0"
pip install "pillow>=10.4.0"
pip install "sse-starlette>=2.1.3"
pip install "bitsandbytes>=0.43.3"

# used with the vLLM framework
pip install "vllm>=0.6.3"
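Once the installs finish, a quick sanity check (assuming the glm4 environment is still active) confirms the key libraries import and the GPU is visible:

python -c "import torch, transformers, vllm; print(torch.__version__, transformers.__version__, vllm.__version__, torch.cuda.is_available())"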

4. Technical Implementation

4.1. vLLM Server Implementation

# -*- coding: utf-8 -*-
import time
from asyncio.log import logger
import re
import uvicorn
import gc
import json
import torch
import random
import string
from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, LogitsProcessor
from sse_starlette.sse import EventSourceResponse

EventSourceResponse.DEFAULT_PING_INTERVAL = 1000

MAX_MODEL_LENGTH = 8192


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


def generate_id(prefix: str, k=29) -> str:
    suffix = ''.join(random.choices(string.ascii_letters + string.digits, k=k))
    return f"{prefix}{suffix}"


class ModelCard(BaseModel):
    id: str = ""
    object: str = "model"
    created: int = Field(default_factory=lambda: int(time.time()))
    owned_by: str = "owner"
    root: Optional[str] = None
    parent: Optional[str] = None
    permission: Optional[list] = None


class ModelList(BaseModel):
    object: str = "list"
    data: List[ModelCard] = ["glm-4"]


class FunctionCall(BaseModel):
    name: Optional[str] = None
    arguments: Optional[str] = None


class ChoiceDeltaToolCallFunction(BaseModel):
    name: Optional[str] = None
    arguments: Optional[str] = None


class UsageInfo(BaseModel):
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens: Optional[int] = 0


class ChatCompletionMessageToolCall(BaseModel):
    index: Optional[int] = 0
    id: Optional[str] = None
    function: FunctionCall
    type: Optional[Literal["function"]] = 'function'


class ChatMessage(BaseModel):
    # Note on the "function" role: older OpenAI API versions require adding a
    # "function" role here and mapping it to "observation" in process_messages.
    role: Literal["user", "assistant", "system", "tool"]
    content: Optional[str] = None
    function_call: Optional[ChoiceDeltaToolCallFunction] = None
    tool_calls: Optional[List[ChatCompletionMessageToolCall]] = None


class DeltaMessage(BaseModel):
    role: Optional[Literal["user", "assistant", "system"]] = None
    content: Optional[str] = None
    function_call: Optional[ChoiceDeltaToolCallFunction] = None
    tool_calls: Optional[List[ChatCompletionMessageToolCall]] = None


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Literal["stop", "length", "tool_calls"]


class ChatCompletionResponseStreamChoice(BaseModel):
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length", "tool_calls"]]
    index: int


class ChatCompletionResponse(BaseModel):
    model: str
    id: Optional[str] = Field(default_factory=lambda: generate_id('chatcmpl-', 29))
    object: Literal["chat.completion", "chat.completion.chunk"]
    choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
    created: Optional[int] = Field(default_factory=lambda: int(time.time()))
    system_fingerprint: Optional[str] = Field(default_factory=lambda: generate_id('fp_', 9))
    usage: Optional[UsageInfo] = None


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.8
    top_p: Optional[float] = 0.8
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False
    tools: Optional[Union[dict, List[dict]]] = None
    tool_choice: Optional[Union[str, dict]] = None
    repetition_penalty: Optional[float] = 1.1


class InvalidScoreLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if torch.isnan(scores).any() or torch.isinf(scores).any():
            scores.zero_()
            scores[..., 5] = 5e4
        return scores


def process_response(output: str, tools: dict | List[dict] = None, use_tool: bool = False) -> Union[str, dict]:
    lines = output.strip().split("\n")
    arguments_json = None
    special_tools = ["cogview", "simple_browser"]
    tools = {tool['function']['name'] for tool in tools} if tools else {}

    if len(lines) >= 2 and lines[1].startswith("{"):
        function_name = lines[0].strip()
        arguments = "\n".join(lines[1:]).strip()
        if function_name in tools or function_name in special_tools:
            try:
                arguments_json = json.loads(arguments)
                is_tool_call = True
            except json.JSONDecodeError:
                is_tool_call = function_name in special_tools
            if is_tool_call and use_tool:
                content = {
                    "name": function_name,
                    "arguments": json.dumps(arguments_json if isinstance(arguments_json, dict) else arguments,
                                            ensure_ascii=False)
                }
                if function_name == "simple_browser":
                    search_pattern = re.compile(r'search\("(.+?)"\s*,\s*recency_days\s*=\s*(\d+)\)')
                    match = search_pattern.match(arguments)
                    if match:
                        content["arguments"] = json.dumps({
                            "query": match.group(1),
                            "recency_days": int(match.group(2))
                        }, ensure_ascii=False)
                elif function_name == "cogview":
                    content["arguments"] = json.dumps({"prompt": arguments}, ensure_ascii=False)
                return content
    return output.strip()


@torch.inference_mode()
async def generate_stream_glm4(params):
    messages = params["messages"]
    tools = params["tools"]
    tool_choice = params["tool_choice"]
    temperature = float(params.get("temperature", 1.0))
    repetition_penalty = float(params.get("repetition_penalty", 1.0))
    top_p = float(params.get("top_p", 1.0))
    max_new_tokens = int(params.get("max_tokens", 8192))

    messages = process_messages(messages, tools=tools, tool_choice=tool_choice)
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    params_dict = {
        "n": 1,
        "best_of": 1,
        "presence_penalty": 1.0,
        "frequency_penalty": 0.0,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": -1,
        "repetition_penalty": repetition_penalty,
        "stop_token_ids": [151329, 151336, 151338],
        "ignore_eos": False,
        "max_tokens": max_new_tokens,
        "logprobs": None,
        "prompt_logprobs": None,
        "skip_special_tokens": True,
    }
    sampling_params = SamplingParams(**params_dict)
    async for output in engine.generate(prompt=inputs, sampling_params=sampling_params, request_id=f"{time.time()}"):
        output_len = len(output.outputs[0].token_ids)
        input_len = len(output.prompt_token_ids)
        ret = {
            "text": output.outputs[0].text,
            "usage": {
                "prompt_tokens": input_len,
                "completion_tokens": output_len,
                "total_tokens": output_len + input_len
            },
            "finish_reason": output.outputs[0].finish_reason,
        }
        yield ret
    gc.collect()
    torch.cuda.empty_cache()


def process_messages(messages, tools=None, tool_choice="none"):
    _messages = messages
    processed_messages = []
    msg_has_sys = False

    def filter_tools(tool_choice, tools):
        function_name = tool_choice.get('function', {}).get('name', None)
        if not function_name:
            return []
        filtered_tools = [
            tool for tool in tools
            if tool.get('function', {}).get('name') == function_name
        ]
        return filtered_tools

    if tool_choice != "none":
        if isinstance(tool_choice, dict):
            tools = filter_tools(tool_choice, tools)
        if tools:
            processed_messages.append({
                "role": "system",
                "content": None,
                "tools": tools
            })
            msg_has_sys = True

    if isinstance(tool_choice, dict) and tools:
        processed_messages.append({
            "role": "assistant",
            "metadata": tool_choice["function"]["name"],
            "content": ""
        })

    for m in _messages:
        role, content, func_call = m.role, m.content, m.function_call
        tool_calls = getattr(m, 'tool_calls', None)
        if role == "function":
            processed_messages.append({
                "role": "observation",
                "content": content
            })
        elif role == "tool":
            processed_messages.append({
                "role": "observation",
                "content": content,
                "function_call": True
            })
        elif role == "assistant":
            if tool_calls:
                for tool_call in tool_calls:
                    processed_messages.append({
                        "role": "assistant",
                        "metadata": tool_call.function.name,
                        "content": tool_call.function.arguments
                    })
            else:
                for response in content.split("\n"):
                    if "\n" in response:
                        metadata, sub_content = response.split("\n", maxsplit=1)
                    else:
                        metadata, sub_content = "", response
                    processed_messages.append({
                        "role": role,
                        "metadata": metadata,
                        "content": sub_content.strip()
                    })
        else:
            if role == "system" and msg_has_sys:
                msg_has_sys = False
                continue
            processed_messages.append({"role": role, "content": content})

    if not tools or tool_choice == "none":
        for m in _messages:
            if m.role == 'system':
                processed_messages.insert(0, {"role": m.role, "content": m.content})
                break
    return processed_messages


@app.get("/health")
async def health() -> Response:
    """Health check."""
    return Response(status_code=200)


@app.get("/v1/models", response_model=ModelList)
async def list_models():
    model_card = ModelCard(id="glm-4")
    return ModelList(data=[model_card])


@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
    if len(request.messages) < 1 or request.messages[-1].role == "assistant":
        raise HTTPException(status_code=400, detail="Invalid request")

    gen_params = dict(
        messages=request.messages,
        temperature=request.temperature,
        top_p=request.top_p,
        max_tokens=request.max_tokens or 1024,
        echo=False,
        stream=request.stream,
        repetition_penalty=request.repetition_penalty,
        tools=request.tools,
        tool_choice=request.tool_choice,
    )
    logger.debug(f"==== request ====\n{gen_params}")

    if request.stream:
        predict_stream_generator = predict_stream(request.model, gen_params)
        output = await anext(predict_stream_generator)
        if output:
            return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")
        logger.debug(f"First result output:\n{output}")
        function_call = None
        if output and request.tools:
            try:
                function_call = process_response(output, request.tools, use_tool=True)
            except Exception:
                logger.warning("Failed to parse tool call")
        if isinstance(function_call, dict):
            function_call = ChoiceDeltaToolCallFunction(**function_call)
            generate = parse_output_text(request.model, output, function_call=function_call)
            return EventSourceResponse(generate, media_type="text/event-stream")
        else:
            return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")

    response = ""
    async for response in generate_stream_glm4(gen_params):
        pass

    if response["text"].startswith("\n"):
        response["text"] = response["text"][1:]
    response["text"] = response["text"].strip()

    usage = UsageInfo()
    function_call, finish_reason = None, "stop"
    tool_calls = None
    if request.tools:
        try:
            function_call = process_response(response["text"], request.tools, use_tool=True)
        except Exception as e:
            logger.warning(f"Failed to parse tool call: {e}")
    if isinstance(function_call, dict):
        finish_reason = "tool_calls"
        function_call_response = ChoiceDeltaToolCallFunction(**function_call)
        function_call_instance = FunctionCall(
            name=function_call_response.name,
            arguments=function_call_response.arguments
        )
        tool_calls = [
            ChatCompletionMessageToolCall(
                id=generate_id('call_', 24),
                function=function_call_instance,
                type="function"
            )
        ]

    message = ChatMessage(
        role="assistant",
        content=None if tool_calls else response["text"],
        function_call=None,
        tool_calls=tool_calls,
    )
    logger.debug(f"==== message ====\n{message}")

    choice_data = ChatCompletionResponseChoice(
        index=0,
        message=message,
        finish_reason=finish_reason,
    )
    task_usage = UsageInfo.model_validate(response["usage"])
    for usage_key, usage_value in task_usage.model_dump().items():
        setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)

    return ChatCompletionResponse(
        model=request.model,
        choices=[choice_data],
        object="chat.completion",
        usage=usage
    )


async def predict_stream(model_id, gen_params):
    output = ""
    is_function_call = False
    has_send_first_chunk = False
    created_time = int(time.time())
    function_name = None
    response_id = generate_id('chatcmpl-', 29)
    system_fingerprint = generate_id('fp_', 9)
    tools = {tool['function']['name'] for tool in gen_params['tools']} if gen_params['tools'] else {}
    delta_text = ""
    async for new_response in generate_stream_glm4(gen_params):
        decoded_unicode = new_response["text"]
        delta_text += decoded_unicode[len(output):]
        output = decoded_unicode
        lines = output.strip().split("\n")

        # Check whether the output is a tool call.
        # This is a simple name comparison and cannot intercept every
        # non-tool output (e.g. misaligned arguments).
        # TODO: extend the logic here if you need more robust handling.
        if not is_function_call and len(lines) >= 2:
            first_line = lines[0].strip()
            if first_line in tools:
                is_function_call = True
                function_name = first_line
                delta_text = lines[1]

        # Tool-call chunks
        if is_function_call:
            if not has_send_first_chunk:
                function_call = {"name": function_name, "arguments": ""}
                tool_call = ChatCompletionMessageToolCall(
                    index=0,
                    id=generate_id('call_', 24),
                    function=FunctionCall(**function_call),
                    type="function"
                )
                message = DeltaMessage(
                    content=None,
                    role="assistant",
                    function_call=None,
                    tool_calls=[tool_call]
                )
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=None
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id=response_id,
                    choices=[choice_data],
                    created=created_time,
                    system_fingerprint=system_fingerprint,
                    object="chat.completion.chunk"
                )
                # The empty first yield is consumed by create_chat_completion's
                # anext() probe and signals that this stream is a tool call.
                yield ""
                yield chunk.model_dump_json(exclude_unset=True)
                has_send_first_chunk = True

            function_call = {"name": None, "arguments": delta_text}
            delta_text = ""
            tool_call = ChatCompletionMessageToolCall(
                index=0,
                id=None,
                function=FunctionCall(**function_call),
                type="function"
            )
            message = DeltaMessage(
                content=None,
                role=None,
                function_call=None,
                tool_calls=[tool_call]
            )
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=None
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id=response_id,
                choices=[choice_data],
                created=created_time,
                system_fingerprint=system_fingerprint,
                object="chat.completion.chunk"
            )
            yield chunk.model_dump_json(exclude_unset=True)

        # The user requested a function call, but the framework has not yet
        # decided whether this output is one.
        elif (gen_params["tools"] and gen_params["tool_choice"] != "none") or is_function_call:
            continue

        # Regular text chunks
        else:
            finish_reason = new_response.get("finish_reason", None)
            if not has_send_first_chunk:
                message = DeltaMessage(
                    content="",
                    role="assistant",
                    function_call=None,
                )
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=finish_reason
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id=response_id,
                    choices=[choice_data],
                    created=created_time,
                    system_fingerprint=system_fingerprint,
                    object="chat.completion.chunk"
                )
                yield chunk.model_dump_json(exclude_unset=True)
                has_send_first_chunk = True

            message = DeltaMessage(
                content=delta_text,
                role="assistant",
                function_call=None,
            )
            delta_text = ""
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=finish_reason
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id=response_id,
                choices=[choice_data],
                created=created_time,
                system_fingerprint=system_fingerprint,
                object="chat.completion.chunk"
            )
            yield chunk.model_dump_json(exclude_unset=True)

    # Tool calls need one extra chunk to align with the OpenAI API.
    if is_function_call:
        yield ChatCompletionResponse(
            model=model_id,
            id=response_id,
            system_fingerprint=system_fingerprint,
            choices=[
                ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=DeltaMessage(
                        content=None,
                        role=None,
                        function_call=None,
                    ),
                    finish_reason="tool_calls"
                )
            ],
            created=created_time,
            object="chat.completion.chunk",
            usage=None
        ).model_dump_json(exclude_unset=True)
    elif delta_text != "":
        message = DeltaMessage(
            content="",
            role="assistant",
            function_call=None,
        )
        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=message,
            finish_reason=None
        )
        chunk = ChatCompletionResponse(
            model=model_id,
            id=response_id,
            choices=[choice_data],
            created=created_time,
            system_fingerprint=system_fingerprint,
            object="chat.completion.chunk"
        )
        yield chunk.model_dump_json(exclude_unset=True)

        finish_reason = 'stop'
        message = DeltaMessage(
            content=delta_text,
            role="assistant",
            function_call=None,
        )
        delta_text = ""
        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=message,
            finish_reason=finish_reason
        )
        chunk = ChatCompletionResponse(
            model=model_id,
            id=response_id,
            choices=[choice_data],
            created=created_time,
            system_fingerprint=system_fingerprint,
            object="chat.completion.chunk"
        )
        yield chunk.model_dump_json(exclude_unset=True)
        yield '[DONE]'
    else:
        yield '[DONE]'


async def parse_output_text(model_id: str, value: str, function_call: ChoiceDeltaToolCallFunction = None):
    delta = DeltaMessage(role="assistant", content=value)
    if function_call is not None:
        delta.function_call = function_call
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=delta,
        finish_reason=None
    )
    chunk = ChatCompletionResponse(
        model=model_id,
        choices=[choice_data],
        object="chat.completion.chunk"
    )
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


if __name__ == "__main__":
    MODEL_PATH = "/data/model/glm-4-9b-chat"
    tensor_parallel_size = 1
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    engine_args = AsyncEngineArgs(
        model=MODEL_PATH,
        tokenizer=MODEL_PATH,
        # If you have multiple GPUs, set this to the number of GPUs.
        tensor_parallel_size=tensor_parallel_size,
        dtype=torch.float16,
        trust_remote_code=True,
        # Fraction of GPU memory to occupy; tune it to your GPU. For example,
        # on an 80 GB card, to use only 24 GB set 24/80 = 0.3.
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        worker_use_ray=False,
        disable_log_requests=True,
        max_model_len=MAX_MODEL_LENGTH,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)

4.2. Starting the vLLM Server

(glm4) [root@gpu test]# python -u glm_server.py 
WARNING 11-06 12:11:19 config.py:1668] Casting torch.bfloat16 to torch.float16.
WARNING 11-06 12:11:23 config.py:395] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 11-06 12:11:23 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/data/model/glm-4-9b-chat', speculative_config=None, tokenizer='/data/model/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/model/glm-4-9b-chat, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 11-06 12:11:24 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 11-06 12:11:24 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-06 12:11:24 selector.py:115] Using XFormers backend.
/usr/local/miniconda3/envs/glm4/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/miniconda3/envs/glm4/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-06 12:11:25 model_runner.py:1056] Starting to load model /data/model/glm-4-9b-chat...
INFO 11-06 12:11:25 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-06 12:11:25 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/10 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  10% Completed | 1/10 [00:00<00:08,  1.01it/s]
Loading safetensors checkpoint shards:  20% Completed | 2/10 [00:01<00:07,  1.13it/s]
Loading safetensors checkpoint shards:  30% Completed | 3/10 [00:02<00:06,  1.14it/s]
Loading safetensors checkpoint shards:  40% Completed | 4/10 [00:03<00:05,  1.15it/s]
Loading safetensors checkpoint shards:  50% Completed | 5/10 [00:04<00:04,  1.18it/s]
Loading safetensors checkpoint shards:  60% Completed | 6/10 [00:05<00:03,  1.08it/s]
Loading safetensors checkpoint shards:  70% Completed | 7/10 [00:06<00:02,  1.07it/s]
Loading safetensors checkpoint shards:  80% Completed | 8/10 [00:07<00:01,  1.13it/s]
Loading safetensors checkpoint shards:  90% Completed | 9/10 [00:08<00:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:08<00:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:08<00:00,  1.11it/s]
INFO 11-06 12:11:35 model_runner.py:1067] Loading model weights took 17.5635 GB
INFO 11-06 12:11:37 gpu_executor.py:122] # GPU blocks: 12600, # CPU blocks: 6553
INFO 11-06 12:11:37 gpu_executor.py:126] Maximum concurrency for 8192 tokens per request: 24.61x
INFO:     Started server process [1627618]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
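With Uvicorn running, the /health and /v1/models routes defined in section 4.1 can be smoke-tested from another shell:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models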

4.3. Client Implementation

# -*- coding: utf-8 -*-
from openai import OpenAI

base_url = "http://127.0.0.1:8000/v1/"
client = OpenAI(api_key="EMPTY", base_url=base_url)

MODEL_PATH = "/data/model/glm-4-9b-chat"


def chat(use_stream=False):
    messages = [
        {
            "role": "system",
            # "You are a professional tour guide."
            "content": "你是一名专业的导游。",
        },
        {
            "role": "user",
            # "Please recommend some attractions typical of Guangzhou."
            "content": "请推荐一些广州特色的景点?",
        }
    ]
    response = client.chat.completions.create(
        model=MODEL_PATH,
        messages=messages,
        stream=use_stream,
        max_tokens=8192,
        temperature=0.4,
        presence_penalty=1.2,
        top_p=0.9,
    )
    if response:
        if use_stream:
            for chunk in response:
                msg = chunk.choices[0].delta.content
                print(msg, end='', flush=True)
        else:
            print(response)
    else:
        print("Error: empty response")


if __name__ == "__main__":
    chat(use_stream=True)
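The same endpoint can also be exercised with curl instead of the OpenAI SDK (a sketch; the model field simply echoes the path the client sends, since the server does not validate it):

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/data/model/glm-4-9b-chat",
        "messages": [
          {"role": "system", "content": "你是一名专业的导游。"},
          {"role": "user", "content": "请推荐一些广州特色的景点?"}
        ],
        "max_tokens": 1024,
        "stream": false
      }'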

4.4. Invoking the Client

(glm4) [root@gpu test]# python -u glm_client.py
当然可以!广州是中国广东省的省会,历史悠久,文化底蕴深厚,同时也是一座现代化的大都市。以下是一些广州的特色景点推荐:
1. **白云山** - 广州著名的风景区,有“羊城第一秀”之称。山上空气清新,景色优美,是登山和观赏广州市区全景的好地方。
2. **珠江夜游** - 乘坐游船在珠江上欣赏两岸的夜景,可以看到广州塔、海心沙等著名地标,以及璀璨的灯光秀。
3. **长隆旅游度假区** - 包括长隆野生动物世界、长隆水上乐园、长隆国际大马戏等多个主题公园,适合家庭游玩。
4. **陈家祠** - 又称陈氏书院,是一座典型的岭南传统建筑,以其精美的木雕、石雕和砖雕闻名。
5. **越秀公园** - 公园内有五羊雕像,是广州的象征之一。还有中山纪念碑、镇海楼等历史遗迹。
6. **北京路步行街** - 这里集合了购物、餐饮、娱乐于一体,是一条充满活力的商业街区。
7. **上下九步行街** - 这条古老的街道以骑楼建筑为特色,两旁有许多老字号商店和小吃店,是体验广州传统文化的好去处。
8. **广州塔(小蛮腰)** - 作为广州的地标性建筑,游客可以从这里俯瞰整个城市的壮丽景观。
9. **南越王宫博物馆** - 展示了两千多年前南越国的历史文化,馆内有一座复原的宫殿模型。
10. **荔湾湖公园** - 一个集自然风光与人文景观于一体的公园,湖水清澈,环境宜人。
11. **广州动物园** - 是中国最大的城市动物园之一,拥有多种珍稀动物。
12. **广州艺术博物院** - 收藏了大量珍贵的艺术品和历史文物,是了解广东乃至华南地区文化艺术的重要场所。
这些景点不仅展示了广州的自然美景,也体现了其丰富的文化遗产和现代都市的风貌。希望您在广州旅行时能有一个愉快的体验!
