The French AI startup Mistral AI recently announced its next generation of language models: Ministral 3B and Ministral 8B.
Both new models belong to the "Ministraux" family, designed for edge devices and edge-computing scenarios, and support context lengths of up to 128,000 tokens. That makes them not only capable but also usable in settings where data privacy and local processing matter most.
Mistral says the Ministraux models are well suited to a range of applications such as on-device translation, offline smart assistants, local analytics, and autonomous robotics. For further efficiency, they can also be paired with larger language models such as Mistral Large, acting as effective intermediaries in multi-step workflows.
On performance, Mistral's published benchmarks show Ministral 3B and 8B beating many comparable models, such as Google's Gemma 2 2B and Meta's Llama 3.1 8B, across several categories. Notably, despite its smaller parameter count, Ministral 3B outperforms the earlier Mistral 7B on some tests.
Ministral 8B, for its part, performs strongly across the board, particularly on knowledge, commonsense, function calling, and multilingual benchmarks.
As for pricing, both new models are already available through Mistral AI's API: Ministral 8B costs $0.10 per million tokens and Ministral 3B $0.04 per million tokens. Mistral is also releasing the model weights of Ministral 8B Instruct for research use. Notably, both models will soon be available through cloud partners such as Google Vertex AI and AWS.
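As a rough illustration of that pricing, the sketch below estimates monthly API cost from a token volume. Only the per-million-token prices come from the announcement; the 50-million-token volume and the helper function are made up for illustration.

# Back-of-the-envelope cost estimate from the published API prices.
PRICE_PER_MILLION_TOKENS = {
    "ministral-8b": 0.10,  # USD per 1M tokens
    "ministral-3b": 0.04,  # USD per 1M tokens
}

def estimate_cost(model: str, total_tokens: int) -> float:
    """Estimated cost in USD for processing `total_tokens` tokens."""
    return PRICE_PER_MILLION_TOKENS[model] * total_tokens / 1_000_000

# Hypothetical volume: 50 million tokens per month on each model.
for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${estimate_cost(model, 50_000_000):.2f} per month")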
mistralai/Ministral-8B-Instruct-2410
We introduce two new state-of-the-art models for local intelligence, on-device computing, and at-the-edge use cases. We call them the Ministraux: Ministral 3B and Ministral 8B. The Ministral-8B-Instruct-2410 language model is an instruct fine-tuned model released under the Mistral Research License that significantly outperforms existing models of similar size. If you are interested in using Ministral-3B or Ministral-8B commercially, outperforming Mistral-7B, please contact us. For more details about the Ministraux, please refer to our release blog post.
Ministral 8B Key Features
- Released under the Mistral Research License; contact us for a commercial license
- Trained with a 128k context window and interleaved sliding-window attention
- Trained on a large proportion of multilingual and code data
- Supports function calling
- Vocabulary size of 131k, using the V3-Tekken tokenizer
Basic Instruct Template (V3-Tekken)
<s>[INST]user message[/INST]assistant response</s>[INST]new user message[/INST]
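To make the template concrete, here is a minimal sketch that assembles a two-turn conversation string by hand. It only mirrors the control tokens shown above; for real use, prefer the mistral_common tokenizer used in the examples further below, which also handles the 131k V3-Tekken vocabulary.

def build_prompt(turns: list[tuple[str, str]]) -> str:
    """Assemble a raw V3-Tekken-style prompt from (user, assistant) turns.
    Leave the last assistant reply empty so the model completes that turn."""
    prompt = "<s>"
    for user_msg, assistant_msg in turns:
        prompt += f"[INST]{user_msg}[/INST]"
        if assistant_msg:
            prompt += f"{assistant_msg}</s>"
    return prompt

print(build_prompt([
    ("user message", "assistant response"),
    ("new user message", ""),  # the model answers this turn
]))
# <s>[INST]user message[/INST]assistant response</s>[INST]new user message[/INST]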
Ministral 8B Architecture
Feature | Value |
---|---|
Architecture | Dense Transformer |
Parameters | 8,019,808,256 |
Layers | 36 |
Heads | 32 |
Dim | 4096 |
KV Heads (GQA) | 8 |
Hidden Dim | 12288 |
Head Dim | 128 |
Vocab Size | 131,072 |
Context Length | 128k |
Attention Pattern | Ragged (128k,32k,32k,32k) |
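Several of the numbers in this table can be cross-checked against each other. The short sketch below does that arithmetic; the parameter estimate is only an approximation (it assumes untied input/output embeddings and a gated MLP, and ignores norm weights), not an exact reconstruction of the 8,019,808,256 figure.

# Consistency checks on the architecture table above.
dim, heads, kv_heads, head_dim = 4096, 32, 8, 128
hidden_dim, layers, vocab_size = 12288, 36, 131_072

assert dim // heads == head_dim                        # 4096 / 32 = 128
print("query heads per KV head:", heads // kv_heads)   # GQA: 32 / 8 = 4

# Rough parameter estimate (assumptions noted above).
attn = dim * (heads * head_dim) + 2 * dim * (kv_heads * head_dim) + (heads * head_dim) * dim
mlp = 3 * dim * hidden_dim                 # gate, up and down projections
emb = 2 * vocab_size * dim                 # input embedding + output head, assumed untied
approx_params = emb + layers * (attn + mlp)
print(f"approximate parameters: {approx_params:,}")    # ~8.02B, close to the table's total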
Benchmarks
Base Model
Knowledge & Commonsense
Model | MMLU | AGIEval | Winogrande | Arc-c | TriviaQA |
---|---|---|---|---|---|
Mistral 7B Base | 62.5 | 42.5 | 74.2 | 67.9 | 62.5 |
Llama 3.1 8B Base | 64.7 | 44.4 | 74.6 | 46.0 | 60.2 |
Ministral 8B Base | 65.0 | 48.3 | 75.3 | 71.9 | 65.5 |
Gemma 2 2B Base | 52.4 | 33.8 | 68.7 | 42.6 | 47.8 |
Llama 3.2 3B Base | 56.2 | 37.4 | 59.6 | 43.1 | 50.7 |
Ministral 3B Base | 60.9 | 42.1 | 72.7 | 64.2 | 56.7 |
Code & Math
Model | HumanEval pass@1 | GSM8K maj@8 |
---|---|---|
Mistral 7B Base | 26.8 | 32.0 |
Llama 3.1 8B Base | 37.8 | 42.2 |
Ministral 8B Base | 34.8 | 64.5 |
Gemma 2 2B | 20.1 | 35.5 |
Llama 3.2 3B | 14.6 | 33.5 |
Ministral 3B | 34.2 | 50.9 |
Multilingual
Model | French MMLU | German MMLU | Spanish MMLU |
---|---|---|---|
Mistral 7B Base | 50.6 | 49.6 | 51.4 |
Llama 3.1 8B Base | 50.8 | 52.8 | 54.6 |
Ministral 8B Base | 57.5 | 57.4 | 59.6 |
Gemma 2 2B Base | 41.0 | 40.1 | 41.7 |
Llama 3.2 3B Base | 42.3 | 42.2 | 43.1 |
Ministral 3B Base | 49.1 | 48.3 | 49.5 |
Instruct Models
Model | MTBench | Arena Hard | Wild bench |
---|---|---|---|
Mistral 7B Instruct v0.3 | 6.7 | 44.3 | 33.1 |
Llama 3.1 8B Instruct | 7.5 | 62.4 | 37.0 |
Gemma 2 9B Instruct | 7.6 | 68.7 | 43.8 |
Ministral 8B Instruct | 8.3 | 70.9 | 41.3 |
Gemma 2 2B Instruct | 7.5 | 51.7 | 32.5 |
Llama 3.2 3B Instruct | 7.2 | 46.0 | 27.2 |
Ministral 3B Instruct | 8.1 | 64.3 | 36.3 |
Code & Math
Model | MBPP pass@1 | HumanEval pass@1 | Math maj@1 |
---|---|---|---|
Mistral 7B Instruct v0.3 | 50.2 | 38.4 | 13.2 |
Gemma 2 9B Instruct | 68.5 | 67.7 | 47.4 |
Llama 3.1 8B Instruct | 69.7 | 67.1 | 49.3 |
Ministral 8B Instruct | 70.0 | 76.8 | 54.5 |
Gemma 2 2B Instruct | 54.5 | 42.7 | 22.8 |
Llama 3.2 3B Instruct | 64.6 | 61.0 | 38.4 |
Ministral 3B Instruct | 67.7 | 77.4 | 51.7 |
Demo
vLLM
pip install --upgrade vllm
pip install --upgrade mistral_common
from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Ministral-8B-Instruct-2410"

sampling_params = SamplingParams(max_tokens=8192)

# Note that running Ministral 8B on a single GPU requires 24 GB of GPU RAM.
# If you want to divide the GPU requirement over multiple devices, add e.g. `tensor_parallel_size=2`.
llm = LLM(model=model_name, tokenizer_mode="mistral", config_format="mistral", load_format="mistral")

prompt = "Do we need to think for 10 seconds to find the answer of 1 + 1?"

messages = [
    {
        "role": "user",
        "content": prompt
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
# You don't need to think for 10 seconds to find the answer to 1 + 1. The answer is 2,
# and you can easily add these two numbers in your mind very quickly without any delay.
Server
vllm serve mistralai/Ministral-8B-Instruct-2410 --tokenizer_mode mistral --config_format mistral --load_format mistral
Note: running Ministral-8B on a single GPU requires 24 GB of GPU memory.
To split the GPU requirement across multiple devices, add e.g. --tensor-parallel-size 2.
Client
curl --location 'http://<your-node-url>:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
    "model": "mistralai/Ministral-8B-Instruct-2410",
    "messages": [
      {
        "role": "user",
        "content": "Do we need to think for 10 seconds to find the answer of 1 + 1?"
      }
    ]
}'
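Because the vLLM server exposes an OpenAI-compatible endpoint, the same request can also be sent from Python. The sketch below uses the openai client package (not part of the original card); the placeholder URL and dummy token mirror the curl example.

from openai import OpenAI

# Point the OpenAI client at the vLLM server started above.
client = OpenAI(base_url="http://<your-node-url>:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="mistralai/Ministral-8B-Instruct-2410",
    messages=[
        {"role": "user", "content": "Do we need to think for 10 seconds to find the answer of 1 + 1?"}
    ],
)

print(response.choices[0].message.content)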
Mistral-inference
pip install mistral_inference --upgrade
Download
from huggingface_hub import snapshot_download
from pathlib import Path

mistral_models_path = Path.home().joinpath('mistral_models', '8B-Instruct')
mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="mistralai/Ministral-8B-Instruct-2410", allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"], local_dir=mistral_models_path)
Chat
mistral-chat $HOME/mistral_models/8B-Instruct --instruct --max_tokens 256
Passkey detection
In this example, the passkey message has over 100k tokens and mistral-inference does not have a chunked pre-fill mechanism, so running the example below requires a lot of GPU memory (80 GB). For a more memory-efficient solution, we recommend using vLLM.
from mistral_inference.transformer import Transformer
from pathlib import Path
import json
from mistral_inference.generate import generate
from huggingface_hub import hf_hub_download

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Path where the model was downloaded in the step above.
mistral_models_path = Path.home().joinpath('mistral_models', '8B-Instruct')

def load_passkey_request() -> ChatCompletionRequest:
    passkey_file = hf_hub_download(repo_id="mistralai/Ministral-8B-Instruct-2410", filename="passkey_example.json")

    with open(passkey_file, "r") as f:
        data = json.load(f)

    message_content = data["messages"][0]["content"]

    return ChatCompletionRequest(messages=[UserMessage(content=message_content)])

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path, softmax_fp32=False)

completion_request = load_passkey_request()
tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)  # The pass key is 13005.
Instruct following
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path)

completion_request = ChatCompletionRequest(messages=[UserMessage(content="How often does the letter r occur in Mistral?")])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)
Function calling
from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.tekken import SpecialTokenPolicy

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
tekken = tokenizer.instruct_tokenizer.tokenizer
tekken.special_token_policy = SpecialTokenPolicy.IGNORE

model = Transformer.from_folder(mistral_models_path)

completion_request = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the users location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris?"),
    ],
)

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)
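The decoded result for a tool-call turn is a JSON description of the requested call rather than free text. As a hedged sketch (the exact output format may vary, and get_current_weather below is a made-up local stub, not part of mistral_inference), the arguments could be parsed and dispatched like this:

import json

def get_current_weather(location: str, format: str) -> str:
    """Hypothetical local implementation of the tool declared above."""
    return f"22 degrees {format} and sunny in {location}"

# `result` comes from the function-calling example above and is expected to
# look roughly like: [{"name": "get_current_weather", "arguments": {...}}]
tool_calls = json.loads(result)

for call in tool_calls:
    if call["name"] == "get_current_weather":
        args = call["arguments"]
        if isinstance(args, str):  # some outputs encode the arguments as a JSON string
            args = json.loads(args)
        print(get_current_weather(**args))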