Contents
- 1. Test Environment
- 1.1 GPU
- 1.2 Model
- 1.3 Deployment Environment
- 1.3.1 Docker
- 1.3.2 Launch Command
- 2. Test Questions
- 2.1 ~20-character question
- 2.2 ~50-character question
- 2.3 ~100-character question
- 3. Test Code
- 3.1 General test code
- 3.2 RAG test code (for reference only)
- 4. Test Results
- 4.1 General test results
- 4.2 RAG test results
1. Test Environment
1.1 GPU
1.2 Model
Qwen2.5-32B-Instruct
1.3 Deployment Environment
The model is served with xinference.
1.3.1 Docker
docker run \
-v ~/.xinference:/root/.xinference \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v ~/.cache/modelscope:/root/.cache/modelscope \
-e XINFERENCE_MODEL_SRC=modelscope \
-p 9998:9997 \
--gpus all \
--shm-size 20g \
--name xinference \
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest \
xinference-local \
-H 0.0.0.0 \
--log-level debug
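The container maps host port 9998 to xinference's default port 9997, so the OpenAI-compatible API is reachable at http://127.0.0.1:9998/v1. Below is a minimal sketch (assuming xinference's OpenAI-compatible /v1/models route) for confirming the service is up before launching a model:
import requests

# Quick health check: list the models currently served by xinference
# via its OpenAI-compatible /v1/models route (assumption: default routes enabled)
resp = requests.get("http://127.0.0.1:9998/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))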
1.3.2 Launch Command
xinference launch --model_path ~/.cache/modelscope/hub/qwen/Qwen2___5-32B-Instruct --model-engine vLLM --model-name qwen2.5-instruct --size-in-billions 32 --model-format pytorch --quantization none --n-gpu 4 --gpu-idx 0,1,2,3
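Before running the concurrency benchmark, a single request can confirm that the model launched correctly. A minimal sketch, assuming the model name qwen2.5-instruct and host port 9998 from the commands above:
from openai import OpenAI

# xinference exposes an OpenAI-compatible endpoint on host port 9998
client = OpenAI(base_url="http://127.0.0.1:9998/v1", api_key="not used actually")

# One non-streaming request as a smoke test before benchmarking
reply = client.chat.completions.create(
    model="qwen2.5-instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)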
2. Test Questions
Below are three questions that require the model to "think deeply" (generated with ChatGPT):
2.1 ~20-character question
宇宙中是否存在多重宇宙,如果存在,它们之间如何相互影响?
(Does a multiverse exist, and if so, how do the individual universes influence one another?)
2.2 ~50-character question
人工智能是否可能真正理解“情感”的含义?如果可以,情感的理解会如何影响人类与 AI 的关系?
(Can artificial intelligence truly understand what "emotion" means? If so, how would that understanding affect the relationship between humans and AI?)
2.3 ~100-character question
在人类社会的未来发展中,科技不断进步是否会导致人类完全依赖技术?如果技术突然崩溃,人类是否具备足够的韧性重新建立自给自足的社会?如何在高速发展的同时保留这种生存能力?
(As technology keeps advancing, will human society become completely dependent on it? If technology suddenly collapsed, would humans have enough resilience to rebuild a self-sufficient society? How can this survival capability be preserved while developing at high speed?)
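The general test code in section 3.1 hardcodes the ~100-character question; the -20 and -50 columns in section 4.1 come from re-running it with the other two prompts. A small sketch of how the three prompts could be kept in one place and swapped in (the PROMPTS dict and build_messages helper are illustrative, not part of the original scripts):
# Illustrative helper: the three benchmark prompts, keyed by approximate length
PROMPTS = {
    20: "宇宙中是否存在多重宇宙,如果存在,它们之间如何相互影响?",
    50: "人工智能是否可能真正理解“情感”的含义?如果可以,情感的理解会如何影响人类与 AI 的关系?",
    100: "在人类社会的未来发展中,科技不断进步是否会导致人类完全依赖技术?如果技术突然崩溃,人类是否具备足够的韧性重新建立自给自足的社会?如何在高速发展的同时保留这种生存能力?",
}

def build_messages(length: int):
    # Build the messages payload expected by the chat completions API
    return [{"role": "user", "content": PROMPTS[length]}]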
3. Test Code
3.1 General test code
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9998/v1", api_key="not used actually")

def openai_request():
    try:
        response = client.chat.completions.create(
            model="qwen2.5-instruct",
            messages=[{"role": "user", "content": "在人类社会的未来发展中,科技不断进步是否会导致人类完全依赖技术?如果技术突然崩溃,人类是否具备足够的韧性重新建立自给自足的社会?如何在高速发展的同时保留这种生存能力?"}],
            stream=True,
        )
        for chunk in response:
            content = chunk.choices[0].delta.content
            if content:
                # print(content, end='', flush=True)
                pass
        return True
    except Exception as e:  # the OpenAI client raises its own exception types, not requests.RequestException
        print(f"Request failed: {e}")
        return False

# Measure QPS for a given number of requests and threads
def calculate_qps(num_requests, num_threads):
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        start_time = time.time()
        futures = [executor.submit(openai_request) for _ in range(num_requests)]
        successful_requests = sum([future.result() for future in futures])
        end_time = time.time()
    duration = end_time - start_time  # total test duration
    qps = successful_requests / duration  # completed requests per second
    return qps, successful_requests, duration

req_test = [(10, 10), (10, 20), (10, 50), (20, 20), (20, 50), (50, 50), (50, 100)]
for num_threads, num_requests in req_test:
    qps, successful_requests, duration = calculate_qps(num_requests, num_threads)
    print(f"Concurrent requests: {num_threads}")
    print(f"Total requests: {num_requests}")
    print(f"Total test time: {duration:.2f}s")
    print(f"QPS: {qps:.2f} requests/s")
    print(f"Successful requests: {successful_requests}")
    print("*" * 33)
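The script above only reports QPS, which for streaming requests mixes queueing and generation time. The sketch below (not part of the original test) additionally records per-request wall-clock latency against the same assumed endpoint and model name:
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9998/v1", api_key="not used actually")

def timed_request(prompt):
    # Send one streaming request and measure how long the full stream takes
    start = time.time()
    response = client.chat.completions.create(
        model="qwen2.5-instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for _ in response:
        pass  # drain the stream; the content itself is not needed here
    return time.time() - start

def latency_stats(prompt, num_requests=10, num_threads=10):
    # Run the requests concurrently and report average / worst latency
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        latencies = list(executor.map(lambda _: timed_request(prompt), range(num_requests)))
    return sum(latencies) / len(latencies), max(latencies)

avg, worst = latency_stats("宇宙中是否存在多重宇宙,如果存在,它们之间如何相互影响?")
print(f"avg latency: {avg:.2f}s, max latency: {worst:.2f}s")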
3.2 RAG test code (for reference only)
import requests
import time
from concurrent.futures import ThreadPoolExecutor
import uuid

# URL and payload of the RAG service under test
url = "http://127.0.0.1:7860/v1/chat/completions"
payload = {
    "conversation_id": "0",
    "messages": [
        {"role": "user", "content": "空调系统由哪几个系统组成"},  # replace with any other question as needed
    ],
    "stream": True,
    "temperature": 0
}

# Single request
def send_request():
    try:
        # Copy the payload per request so concurrent threads do not share mutable state
        req_payload = dict(payload)
        req_payload["conversation_id"] = str(uuid.uuid4())
        response = requests.post(url, json=req_payload, stream=True)
        for chunk in response.iter_content(chunk_size=512):  # receive the streamed response in chunks
            if chunk:
                chunk = chunk.decode('utf-8', errors='ignore').strip()
                # the streamed chunks can be processed here
                # print(chunk)  # optional: print each streamed part
        return True  # the full response was received successfully
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return False

# Measure QPS for a given number of requests and threads
def calculate_qps(num_requests, num_threads):
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        start_time = time.time()
        futures = [executor.submit(send_request) for _ in range(num_requests)]
        successful_requests = sum([future.result() for future in futures])
        end_time = time.time()
    duration = end_time - start_time  # total test duration
    qps = successful_requests / duration  # completed requests per second
    return qps, successful_requests, duration

req_test = [(10, 10), (10, 20), (10, 50), (20, 20), (20, 50), (50, 50), (50, 100)]
for num_threads, num_requests in req_test:
    qps, successful_requests, duration = calculate_qps(num_requests, num_threads)
    print(f"Concurrent requests: {num_threads}")
    print(f"Total requests: {num_requests}")
    print(f"Total test time: {duration:.2f}s")
    print(f"QPS: {qps:.2f} requests/s")
    print(f"Successful requests: {successful_requests}")
    print("*" * 33)
4. Test Results
4.1 General test results
The time and QPS columns are reported separately for the ~20-, ~50-, and ~100-character questions from section 2.
Concurrent requests | Total requests | Time (s), 20-char | QPS, 20-char | Time (s), 50-char | QPS, 50-char | Time (s), 100-char | QPS, 100-char |
---|---|---|---|---|---|---|---|
10 | 10 | 25.54 | 0.39 | 28.95 | 0.35 | 30.56 | 0.33 |
10 | 20 | 47.85 | 0.42 | 57.46 | 0.35 | 64.24 | 0.31 |
10 | 50 | 122.27 | 0.41 | 135.08 | 0.37 | 151.01 | 0.33 |
20 | 20 | 34.52 | 0.58 | 35.91 | 0.56 | 44.84 | 0.45 |
20 | 50 | 83.04 | 0.6 | 93.56 | 0.53 | 106.35 | 0.47 |
50 | 50 | 49.91 | 1.00 | 54.11 | 0.92 | 66.72 | 0.75 |
50 | 100 | 101.47 | 0.99 | 110.77 | 0.90 | 123.49 | 0.81 |
4.2 RAG test results
These results were obtained against the RAG service using the code in section 3.2.
Concurrent requests | Total requests | Time (s) | QPS | Failed |
---|---|---|---|---|
10 | 1000 | 2968.06 | 0.33 | 0 |
10 | 100 | 299.89 | 0.33 | 0 |
10 | 50 | 178.77 | 0.28 | 0 |
10 | 20 | 61.24 | 0.33 | 0 |
10 | 10 | 32.87 | 0.3 | 0 |
20 | 20 | 54.89 | 0.36 | 0 |
20 | 40 | 108.23 | 0.36 | 1 |
20 | 50 | 136.97 | 0.35 | 2 |
50 | 50 | 120.85 | 0.15 | 32 |
50 | 100 | 224.15 | 0.08 | 82 |