HuggingFace:

https://huggingface.co/openai/gpt-oss-20b

https://huggingface.co/openai/gpt-oss-120b

OpenAI blog post: https://openai.com/zh-Hans-CN/index/introducing-gpt-oss/

gpt-oss-120b — for production, general-purpose, high-reasoning use cases; it fits on a single GPU with 80 GB of memory (e.g. NVIDIA H100 or AMD MI300X). The model has 117B total parameters, of which 5.1B are active per token.

gpt-oss-20b — for lower latency, and for local or specialized use cases. The model has 21B total parameters, of which 3.6B are active per token.
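For local experiments, the checkpoints can be pulled straight from the Hugging Face Hub. A minimal sketch using the `huggingface_hub` client; the `local_dir` path is just an example, not a required location:

```python
# Download the gpt-oss-20b checkpoint from the Hugging Face Hub.
# `local_dir` is an arbitrary example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/gpt-oss-20b",
    local_dir="gpt-oss-20b",  # where to place the weights locally
)
```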

| Model | Layers | Total Params | Active Params per Token | Total Experts | Active Experts per Token | Context Length |
|---|---|---|---|---|---|---|
| gpt-oss-120b | 36 | 117B | 5.1B | 128 | 4 | 128k |
| gpt-oss-20b | 24 | 21B | 3.6B | 32 | 4 | 128k |
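The "active experts per token" column reflects top-k MoE routing: a router scores every expert for each token and only the top 4 experts actually run, which is why active parameters sit far below total parameters. The sketch below is a generic top-k router in PyTorch to illustrate the idea; it is not the actual gpt-oss implementation, and all sizes here are toy values:

```python
# Generic top-k MoE routing sketch (illustrative only, NOT gpt-oss's real code).
# With num_experts=128 and k=4, each token activates 4/128 of the expert MLPs.
import torch

num_tokens, d_model, num_experts, k = 8, 512, 128, 4  # toy sizes

hidden = torch.randn(num_tokens, d_model)
router = torch.nn.Linear(d_model, num_experts, bias=False)

logits = router(hidden)                             # [num_tokens, num_experts]
gate, expert_idx = torch.topk(logits, k=k, dim=-1)  # pick 4 experts per token
gate = torch.softmax(gate, dim=-1)                  # weights over the chosen experts

# Only the selected experts' MLPs would run for each token;
# their outputs are mixed using the `gate` weights.
print(expert_idx[0], gate[0])
```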

Model parameter breakdown:

| Component | 120b | 20b |
|---|---|---|
| MLP | 114.71B | 19.12B |
| Attention | 0.96B | 0.64B |
| Embed + Unembed | 1.16B | 1.16B |
| Active Parameters | 5.13B | 3.61B |
| Total Parameters | 116.83B | 20.91B |
| Checkpoint Size | 60.8 GiB | 12.8 GiB |
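A quick consistency check on the checkpoint sizes: dividing bytes by total parameters gives roughly 4.5 and 5.3 bits per weight, consistent with the MoE weights being stored in MXFP4 (about 4.25 bits/param) while attention and embedding tensors stay at higher precision. A back-of-the-envelope calculation:

```python
# Back-of-the-envelope: average bits per parameter from the table above.
GIB = 2**30

for name, size_gib, total_params in [
    ("gpt-oss-120b", 60.8, 116.83e9),
    ("gpt-oss-20b",  12.8,  20.91e9),
]:
    bits = size_gib * GIB * 8 / total_params
    print(f"{name}: {bits:.2f} bits/param")
# ~4.47 and ~5.26 bits/param
```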

Running with vLLM

https://cookbook.openai.com/articles/gpt-oss/run-vllm

When sampling directly with vLLM, it is important that the input prompt follows the harmony response format, otherwise the model will not behave correctly. The openai-harmony SDK can be used to render it:

```python
import json
from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
)
 
from vllm import LLM, SamplingParams
 
# --- 1) Render the prefill with Harmony ---
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
 
convo = Conversation.from_messages(
    [
        Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
        Message.from_role_and_content(
            Role.DEVELOPER,
            DeveloperContent.new().with_instructions("Always respond in riddles"),
        ),
        Message.from_role_and_content(Role.USER, "What is the weather like in SF?"),
    ]
)
 
prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
 
# Harmony stop tokens (pass to sampler so they won't be included in output)
stop_token_ids = encoding.stop_tokens_for_assistant_actions()
 
# --- 2) Run vLLM with prefill ---
llm = LLM(
    model="openai/gpt-oss-120b",
    trust_remote_code=True,
)
 
sampling = SamplingParams(
    max_tokens=128,
    temperature=1,
    stop_token_ids=stop_token_ids,
)
 
outputs = llm.generate(
    prompt_token_ids=[prefill_ids],   # batch of size 1
    sampling_params=sampling,
)
 
# vLLM gives you both text and token IDs
gen = outputs[0].outputs[0]
text = gen.text
output_tokens = gen.token_ids  # <-- these are the completion token IDs (no prefill)
 
# --- 3) Parse the completion token IDs back into structured Harmony messages ---
entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)
 
# 'entries' is a sequence of structured conversation entries (assistant messages, tool calls, etc.).
for message in entries:
    print(f"{json.dumps(message.to_dict())}")
```
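For serving rather than offline sampling, the same cookbook article also covers `vllm serve`, which exposes an OpenAI-compatible endpoint and applies the harmony format for you. A minimal client sketch, assuming a server is already running with `vllm serve openai/gpt-oss-120b` on the default port (the `api_key` value is a placeholder; vLLM does not check it unless configured to):

```python
# Query a locally served gpt-oss model through vLLM's OpenAI-compatible API.
# Assumes `vllm serve openai/gpt-oss-120b` is already running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What is the weather like in SF?"}],
)
print(resp.choices[0].message.content)
```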