HuggingFace:

https://huggingface.co/openai/gpt-oss-20b

https://huggingface.co/openai/gpt-oss-120b

OpenAI blog post: https://openai.com/zh-Hans-CN/index/introducing-gpt-oss/

gpt-oss-120b — for production, general-purpose, high-reasoning use cases; it fits on a single GPU with 80 GB of memory (e.g. NVIDIA H100 or AMD MI300X). The model has 117B total parameters, of which 5.1B are active per token.

gpt-oss-20b — for lower latency, and for local or specialized use cases. The model has 21B total parameters, of which 3.6B are active per token.
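For local experiments, the checkpoints can be pulled straight from the Hugging Face Hub. A minimal sketch using the `huggingface_hub` client; the `local_dir` path is just an example, not a required location:

```python
# Download the gpt-oss-20b checkpoint from the Hugging Face Hub.
# `local_dir` is an arbitrary example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/gpt-oss-20b",
    local_dir="gpt-oss-20b",  # where to place the weights locally
)
```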

| Model | Layers | Total Params | Active Params per Token | Total Experts | Active Experts per Token | Context Length |
|---|---|---|---|---|---|---|
| gpt-oss-120b | 36 | 117B | 5.1B | 128 | 4 | 128k |
| gpt-oss-20b | 24 | 21B | 3.6B | 32 | 4 | 128k |
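The "active experts per token" column reflects top-k MoE routing: a router scores every expert for each token and only the top 4 experts actually run, which is why active parameters sit far below total parameters. The sketch below is a generic top-k router in PyTorch to illustrate the idea; it is not the actual gpt-oss implementation, and all sizes here are toy values:

```python
# Generic top-k MoE routing sketch (illustrative only, NOT gpt-oss's real code).
# With num_experts=128 and k=4, each token activates 4/128 of the expert MLPs.
import torch

num_tokens, d_model, num_experts, k = 8, 512, 128, 4  # toy sizes

hidden = torch.randn(num_tokens, d_model)
router = torch.nn.Linear(d_model, num_experts, bias=False)

logits = router(hidden)                             # [num_tokens, num_experts]
gate, expert_idx = torch.topk(logits, k=k, dim=-1)  # pick 4 experts per token
gate = torch.softmax(gate, dim=-1)                  # weights over the chosen experts

# Only the selected experts' MLPs would run for each token;
# their outputs are mixed using the `gate` weights.
print(expert_idx[0], gate[0])
```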

Model parameter breakdown:

| Component | 120b | 20b |
|---|---|---|
| MLP | 114.71B | 19.12B |
| Attention | 0.96B | 0.64B |
| Embed + Unembed | 1.16B | 1.16B |
| Active Parameters | 5.13B | 3.61B |
| Total Parameters | 116.83B | 20.91B |
| Checkpoint Size | 60.8 GiB | 12.8 GiB |
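A quick consistency check on the checkpoint sizes: dividing bytes by total parameters gives roughly 4.5 and 5.3 bits per weight, consistent with the MoE weights being stored in MXFP4 (about 4.25 bits/param) while attention and embedding tensors stay at higher precision. A back-of-the-envelope calculation:

```python
# Back-of-the-envelope: average bits per parameter from the table above.
GIB = 2**30

for name, size_gib, total_params in [
    ("gpt-oss-120b", 60.8, 116.83e9),
    ("gpt-oss-20b",  12.8,  20.91e9),
]:
    bits = size_gib * GIB * 8 / total_params
    print(f"{name}: {bits:.2f} bits/param")
# ~4.47 and ~5.26 bits/param
```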

Running with vLLM

https://cookbook.openai.com/articles/gpt-oss/run-vllm

When sampling directly with vLLM, it is important that the input prompt follows the harmony response format, otherwise the model will not behave correctly. The openai-harmony SDK can be used to render it:

```python
import json
from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
)
 
from vllm import LLM, SamplingParams
 
# --- 1) Render the prefill with Harmony ---
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
 
convo = Conversation.from_messages(
    [
        Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
        Message.from_role_and_content(
            Role.DEVELOPER,
            DeveloperContent.new().with_instructions("Always respond in riddles"),
        ),
        Message.from_role_and_content(Role.USER, "What is the weather like in SF?"),
    ]
)
 
prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
 
# Harmony stop tokens (pass to sampler so they won't be included in output)
stop_token_ids = encoding.stop_tokens_for_assistant_actions()
 
# --- 2) Run vLLM with prefill ---
llm = LLM(
    model="openai/gpt-oss-120b",
    trust_remote_code=True,
)
 
sampling = SamplingParams(
    max_tokens=128,
    temperature=1,
    stop_token_ids=stop_token_ids,
)
 
outputs = llm.generate(
    prompt_token_ids=[prefill_ids],   # batch of size 1
    sampling_params=sampling,
)
 
# vLLM gives you both text and token IDs
gen = outputs[0].outputs[0]
text = gen.text
output_tokens = gen.token_ids  # <-- these are the completion token IDs (no prefill)
 
# --- 3) Parse the completion token IDs back into structured Harmony messages ---
entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)
 
# 'entries' is a sequence of structured conversation entries (assistant messages, tool calls, etc.).
for message in entries:
    print(f"{json.dumps(message.to_dict())}")
```
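For serving rather than offline sampling, the same cookbook article also covers `vllm serve`, which exposes an OpenAI-compatible endpoint and applies the harmony format for you. A minimal client sketch, assuming a server is already running with `vllm serve openai/gpt-oss-120b` on the default port (the `api_key` value is a placeholder; vLLM does not check it unless configured to):

```python
# Query a locally served gpt-oss model through vLLM's OpenAI-compatible API.
# Assumes `vllm serve openai/gpt-oss-120b` is already running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What is the weather like in SF?"}],
)
print(resp.choices[0].message.content)
```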