# ChatGLM Series
## Introduction

ChatGLM is a family of bilingual (Chinese-English) dialogue models from Zhipu AI, built on the GLM (General Language Model) architecture. The GLM-4 series supports a 128K context window, tool calling, and code execution, and is a mainstream choice in China for open-weight, self-hosted deployment.
## Installation

```bash
pip install zhipuai
# For local deployment
pip install transformers torch accelerate
```
## API Usage (BigModel Platform)

```python
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="your-api-key")

response = client.chat.completions.create(
    model="glm-4-flash",  # free model
    messages=[
        {"role": "user", "content": "Explain what an AI agent is"}
    ]
)
print(response.choices[0].message.content)
```
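The SDK follows the OpenAI-style Chat Completions interface, so multi-turn conversation works by resending the accumulated message list on each call. A minimal sketch (the `append_turn` helper and the sample strings are illustrative, not part of the SDK):

```python
def append_turn(messages, role, content):
    """Return a new message list with one more chat turn appended."""
    return messages + [{"role": role, "content": content}]

# Build up a two-turn exchange; the assistant reply would normally come
# from response.choices[0].message.content of the previous call.
history = append_turn([], "user", "What is an AI agent?")
history = append_turn(history, "assistant", "An AI agent is a system that ...")
history = append_turn(history, "user", "Give a concrete example.")

# response = client.chat.completions.create(model="glm-4-flash", messages=history)
```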
## Model Selection

| Model | Context | Strengths | Pricing |
|---|---|---|---|
| glm-4-flash | 128K | Very fast | Free |
| glm-4-air | 128K | Balanced performance | Low cost |
| glm-4 | 128K | Flagship model | Standard |
| glm-4-long | 1M | Ultra-long context | Standard |
| glm-4v | 8K | Vision understanding | Standard |
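The table above reduces model routing to a couple of checks. A hedged sketch of that policy (the `pick_model` helper is an illustrative assumption, not an official API):

```python
def pick_model(context_tokens: int, vision: bool = False) -> str:
    # Policy derived from the table: glm-4v for image inputs (8K context),
    # glm-4-long beyond 128K tokens, otherwise the free glm-4-flash.
    if vision:
        return "glm-4v"
    if context_tokens > 128_000:
        return "glm-4-long"
    return "glm-4-flash"
```

Swap in glm-4 or glm-4-air for the default branch when accuracy or cost matters more than latency.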
## Local Deployment (Open-Source Models)
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load ChatGLM3-6B (needs roughly 13 GB of VRAM)
model_path = "THUDM/chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True
)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",           # distribute layers across available GPUs
    torch_dtype=torch.float16    # half precision to save VRAM
).eval()

# Single-turn chat
response, history = model.chat(
    tokenizer,
    "Hello, introduce yourself",
    history=[]
)
print(response)

# Multi-turn chat: pass the returned history back in
response, history = model.chat(
    tokenizer,
    "What can you do?",
    history=history
)
print(response)
```
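The "~13 GB VRAM" figure follows from simple arithmetic: float16 stores 2 bytes per parameter, so 6B parameters take about 11.2 GiB for the weights alone, before activations and KV cache. A quick feasibility check before loading (a hedged sketch; the 20% overhead factor is a rough assumption):

```python
def fits_in_vram(n_params: float, free_gib: float, overhead: float = 1.2) -> bool:
    """Rough check: fp16 weights (2 bytes/param) times an overhead factor must fit."""
    needed_gib = n_params * 2 / 2**30 * overhead
    return free_gib >= needed_gib

ok_24gb = fits_in_vram(6e9, 24.0)   # a 24 GB card comfortably fits ChatGLM3-6B
ok_8gb = fits_in_vram(6e9, 8.0)     # an 8 GB card does not -> consider quantization
```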
## Streaming Output

```python
# Streaming via the API
stream = client.chat.completions.create(
    model="glm-4-flash",
    messages=[{"role": "user", "content": "Write an outline for a loan risk assessment report"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
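When the streamed reply must also be kept (for logging, or for the next turn's history), accumulate the deltas as they arrive. A sketch using a stand-in list of delta strings (real deltas come from the `stream` iterator above, where `delta.content` can be `None` on some chunks):

```python
def collect_stream(deltas):
    """Join streamed text deltas into the full reply, skipping empty chunks."""
    parts = []
    for delta in deltas:
        if delta:  # skip None/empty deltas
            parts.append(delta)
    return "".join(parts)

full_text = collect_stream(["Risk ", None, "assessment ", "framework"])
```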
## Tool Calling

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "query_credit_score",
            "description": "Look up a customer's credit score",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string", "description": "Customer ID"},
                    "query_type": {
                        "type": "string",
                        "enum": ["basic", "detailed"],
                        "description": "Type of lookup"
                    }
                },
                "required": ["customer_id"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="glm-4",
    messages=[{"role": "user", "content": "Look up the credit score for customer C001"}],
    tools=tools
)
tool_call = response.choices[0].message.tool_calls[0]
print(f"Tool: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
```
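The response above only says which tool the model wants to call; your code must parse the JSON arguments, run the function, and send the result back as a `tool` message so the model can compose the final answer. A hedged sketch of that dispatch step (`query_credit_score` here is a local stand-in implementation, not a real credit-bureau API):

```python
import json

def query_credit_score(customer_id: str, query_type: str = "basic") -> dict:
    # Stand-in for a real credit-score lookup.
    return {"customer_id": customer_id, "score": 720, "detail": query_type}

LOCAL_TOOLS = {"query_credit_score": query_credit_score}

def run_tool_call(name: str, arguments: str) -> str:
    """Dispatch a model-requested tool call and serialize the result."""
    result = LOCAL_TOOLS[name](**json.loads(arguments))
    return json.dumps(result, ensure_ascii=False)

tool_output = run_tool_call("query_credit_score", '{"customer_id": "C001"}')
# Feed the result back, e.g.:
# messages.append({"role": "tool", "content": tool_output,
#                  "tool_call_id": tool_call.id})
# then call client.chat.completions.create(...) again for the final answer.
```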
## Quantized Deployment (Low VRAM)

```python
# INT4 quantization: about 6 GB of VRAM
model = AutoModel.from_pretrained(
    "THUDM/chatglm3-6b",
    trust_remote_code=True
).quantize(4).cuda().eval()

# INT8 quantization: about 10 GB of VRAM
model = AutoModel.from_pretrained(
    "THUDM/chatglm3-6b",
    trust_remote_code=True
).quantize(8).cuda().eval()
```
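The INT4/INT8 VRAM figures follow per-parameter arithmetic: 4-bit weights take 0.5 bytes per parameter and 8-bit take 1 byte, with the remaining headroom in the quoted totals going to activations and KV cache. A rough estimator (illustrative only):

```python
def quantized_weight_gib(n_params: float, bits: int) -> float:
    """Approximate weight memory in GiB at a given quantization bit-width."""
    return n_params * (bits / 8) / 2**30

int4 = quantized_weight_gib(6e9, 4)   # ~2.8 GiB of weights
int8 = quantized_weight_gib(6e9, 8)   # ~5.6 GiB of weights
```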
## Recommendations

- Quick prototyping: glm-4-flash (free and fast)
- Production API: glm-4-air (best price/performance)
- Private deployment: quantized ChatGLM3-6B (runs in 6 GB of VRAM)
- High-accuracy tasks: glm-4 (flagship)