Hugging Face:
https://huggingface.co/openai/gpt-oss-20b
https://huggingface.co/openai/gpt-oss-120b
OpenAI blog post: https://openai.com/zh-Hans-CN/index/introducing-gpt-oss/
gpt-oss-120b — for production, general-purpose, high-reasoning use cases; fits on a single GPU with 80GB of memory (e.g., NVIDIA H100 or AMD MI300X). 117B total parameters, of which 5.1B are active per token.
gpt-oss-20b — for lower-latency, local, or specialized use cases. 21B total parameters, of which 3.6B are active per token.
| Model | Layers | Total Params | Active Params per Token | Total Experts | Active Experts per Token | Context Length |
|---|---|---|---|---|---|---|
| gpt-oss-120b | 36 | 117B | 5.1B | 128 | 4 | 128k |
| gpt-oss-20b | 24 | 21B | 3.6B | 32 | 4 | 128k |
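Only 4 of the experts fire for each token, which is why the active parameter count is so much smaller than the total. Below is a minimal sketch of top-k expert routing (toy dimensions and a plain loop for readability; this is not the actual gpt-oss kernel):

```python
import torch

def moe_forward(x, gate, experts, k=4):
    """Toy top-k MoE routing: each token is sent to its k highest-scoring
    experts, and their outputs are combined with softmaxed router scores."""
    scores = gate(x)                                  # [tokens, num_experts]
    weights, idx = torch.topk(scores, k, dim=-1)      # pick top-k experts per token
    weights = torch.softmax(weights, dim=-1)          # normalize over the chosen k
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                       # loops for clarity, not speed
        for j in range(k):
            out[t] += weights[t, j] * experts[int(idx[t, j])](x[t])
    return out

# Hypothetical sizes, far smaller than gpt-oss:
num_experts, hidden = 8, 16
gate = torch.nn.Linear(hidden, num_experts, bias=False)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
)
tokens = torch.randn(5, hidden)
print(moe_forward(tokens, gate, experts).shape)  # torch.Size([5, 16])
```

Scaling the same idea up to 128 experts with 4 active per token is what produces the 117B-total / 5.1B-active split in the table.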
Parameter breakdown by component:
| Component | 120b | 20b |
|---|---|---|
| MLP | 114.71B | 19.12B |
| Attention | 0.96B | 0.64B |
| Embed + Unembed | 1.16B | 1.16B |
| Active Parameters | 5.13B | 3.61B |
| Total Parameters | 116.83B | 20.91B |
| Checkpoint Size | 60.8GiB | 12.8GiB |
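These figures can be roughly reproduced from the published config values (hidden size 2880, gated-SwiGLU experts with a 2880-wide intermediate, and a vocabulary of 201,088 tokens — treat these as assumptions here). The sketch below ignores biases, router weights, and norms, so it lands slightly under the table:

```python
def moe_params(layers, experts, hidden=2880, inter=2880):
    # gate + up + down projections per expert, per layer
    return layers * experts * 3 * hidden * inter

def embed_params(hidden=2880, vocab=201_088):
    return vocab * hidden  # one embedding matrix; the unembed matrix is the same size

for name, layers, experts in [("gpt-oss-120b", 36, 128), ("gpt-oss-20b", 24, 32)]:
    total_mlp = moe_params(layers, experts)
    active_mlp = moe_params(layers, 4)  # only 4 experts fire per token
    print(f"{name}: MLP {total_mlp/1e9:.2f}B (active {active_mlp/1e9:.2f}B), "
          f"embed+unembed {2*embed_params()/1e9:.2f}B")
# gpt-oss-120b: MLP 114.66B (active 3.58B), embed+unembed 1.16B
# gpt-oss-20b:  MLP 19.11B  (active 2.39B), embed+unembed 1.16B
```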
Running with vLLM
When sampling from gpt-oss directly with vLLM, it is important that the input prompt follows the harmony response format; otherwise the model will not behave correctly. The openai-harmony SDK can be used to render it.
```python
import json

from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
)
from vllm import LLM, SamplingParams

# --- 1) Render the prefill with Harmony ---
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

convo = Conversation.from_messages(
    [
        Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
        Message.from_role_and_content(
            Role.DEVELOPER,
            DeveloperContent.new().with_instructions("Always respond in riddles"),
        ),
        Message.from_role_and_content(Role.USER, "What is the weather like in SF?"),
    ]
)

prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)

# Harmony stop tokens (pass to sampler so they won't be included in output)
stop_token_ids = encoding.stop_tokens_for_assistant_actions()

# --- 2) Run vLLM with prefill ---
llm = LLM(
    model="openai/gpt-oss-120b",
    trust_remote_code=True,
)

sampling = SamplingParams(
    max_tokens=128,
    temperature=1,
    stop_token_ids=stop_token_ids,
)

outputs = llm.generate(
    prompt_token_ids=[prefill_ids],  # batch of size 1
    sampling_params=sampling,
)

# vLLM gives you both text and token IDs
gen = outputs[0].outputs[0]
text = gen.text
output_tokens = gen.token_ids  # <-- these are the completion token IDs (no prefill)

# --- 3) Parse the completion token IDs back into structured Harmony messages ---
entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)

# 'entries' is a sequence of structured conversation entries
# (assistant messages, tool calls, etc.)
for message in entries:
    print(f"{json.dumps(message.to_dict())}")
```