
Open reasoning model with a 256K context window, native INT4 quantization, and enhanced tool use
Kimi K2 Thinking is the most capable open-source thinking model. Built as a thinking agent, it reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool use across 200–300 sequential calls. At the same time, Kimi K2 Thinking is a natively INT4-quantized model with a 256K context window, achieving lossless reductions in inference latency and GPU memory usage.
This model is ready for commercial/non-commercial use.
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA Kimi-K2-Thinking Model Card.
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Modified MIT License.
Global
This model is designed for advanced reasoning, agentic AI with deep thinking capabilities, multi-step problem-solving with tool orchestration, complex mathematical reasoning, coding with autonomous workflows, and research tasks requiring long-horizon agency. It can be used for autonomous research workflows, complex coding projects spanning hundreds of steps, mathematical problem-solving with extended reasoning, web browsing and information synthesis, and tool-orchestrated task execution.
Key Features
build.nvidia.com (12/08/2025): Available via link
Hugging Face: Available via link
References:
Architecture Type: Transformer
Input Types: Text, Tool Definitions
Input Formats: String, JSON
Input Parameters: One-Dimensional (1D)
Other Input Properties: The model has a context window of up to 256,000 tokens. Supports interleaved reasoning traces and tool calls.
Input Context Length (ISL): 256K
Output Format: String, JSON (for tool calls)
Output Parameters: One-Dimensional (1D)
Other Output Properties: Includes separate reasoning_content traces alongside final responses. Supports streaming and non-streaming modes.
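For illustration, here is a minimal sketch of reading both output channels over an OpenAI-compatible endpoint. The non-streaming `reasoning_content` field matches the chat example later in this card; the streaming `delta.reasoning_content` field and the endpoint/model ID are assumptions about a typical local serving setup, not fixed values:

```python
import openai

# Hypothetical local endpoint and model ID; adjust to your deployment.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    # Reasoning tokens arrive on a separate field from final-answer tokens.
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)
    elif delta.content:
        print(delta.content, end="", flush=True)
```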
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engines: vLLM, SGLang, KTransformers
Supported Hardware:
Operating Systems: Linux
Kimi K2 Thinking v1.0
Training Data Collection: Undisclosed
Training Labeling: Undisclosed
Training Properties: Trained with Quantization-Aware Training (QAT) during the post-training phase for native INT4 support (a generic illustrative sketch follows below).
Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed
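For orientation only, this is a generic sketch of the fake-quantization step that QAT-style training typically inserts into the forward pass so weights adapt to INT4 rounding; it is not Moonshot's actual pipeline, which is undisclosed:

```python
import torch

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor INT4: 16 integer levels in [-8, 7].
    scale = (w.abs().max() / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -8, 7)
    # Straight-through estimator: quantized values forward, identity gradient backward.
    return w + (q * scale - w).detach()
```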
Evaluation Benchmark Score:
Evaluation Data Collection: Hybrid: Human, Automated
Evaluation Labeling: Human
Evaluation Properties: HLE, AIME25, HMMT25, IMO-AnswerBench, GPQA, MMLU-Pro, MMLU-Redux, Longform Writing, HealthBench, BrowseComp, BrowseComp-ZH, Seal-0, FinSearchComp-T3, Frames, SWE-bench Verified, SWE-bench Multilingual, Multi-SWE-bench, SciCode, LiveCodeBench, Terminal-Bench
| Benchmark | Setting | Kimi K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
|---|---|---|---|---|---|---|---|
| Reasoning Tasks | | | | | | | |
| HLE (Text-only) | no tools | 23.9 | 26.3 | 19.8* | 7.9 | 19.8 | 25.4 |
| HLE | w/ tools | 44.9 | 41.7* | 32.0* | 21.7 | 20.3* | 41.0 |
| HLE | heavy | 51.0 | 42.0 | - | - | - | 50.7 |
| AIME25 | no tools | 94.5 | 94.6 | 87.0 | 51.0 | 89.3 | 91.7 |
| AIME25 | w/ python | 99.1 | 99.6 | 100.0 | 75.2 | 58.1* | 98.8 |
| AIME25 | heavy | 100.0 | 100.0 | - | - | - | 100.0 |
| HMMT25 | no tools | 89.4 | 93.3 | 74.6* | 38.8 | 83.6 | 90.0 |
| HMMT25 | w/ python | 95.1 | 96.7 | 88.8* | 70.4 | 49.5* | 93.9 |
| HMMT25 | heavy | 97.5 | 100.0 | - | - | - | 96.7 |
| IMO-AnswerBench | no tools | 78.6 | 76.0* | 65.9* | 45.8 | 76.0* | 73.1 |
| GPQA | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |
| Agentic Search Tasks | | | | | | | |
| BrowseComp | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 | - |
| BrowseComp-ZH | w/ tools | 62.3 | 63.0* | 42.4* | 22.2 | 47.9 | - |
| Seal-0 | w/ tools | 56.3 | 51.4* | 53.4* | 25.2 | 38.5* | - |
| FinSearchComp-T3 | w/ tools | 47.4 | 48.5* | 44.0* | 10.4 | 27.0* | - |
| Frames | w/ tools | 87.0 | 86.0* | 85.0* | 58.1 | 80.2* | - |
| Coding Tasks | | | | | | | |
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 | - |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3* | 68.0 | 55.9 | 57.9 | - |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3* | 44.3 | 33.5 | 30.6 | - |
| SciCode | no tools | 44.8 | 42.9 | 44.7 | 30.7 | 37.7 | - |
| LiveCodeBench | no tools | 64.8 | 64.4 | 60.4 | 49.8 | 60.8 | - |
| Terminal-Bench | w/ tools | 36.8 | 42.0 | - | 5.0 | 26.7 | - |
| General Tasks | | | | | | | |
| MMLU-Pro | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 | - |
| MMLU-Redux | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 | - |
| Longform Writing | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 | - |
| HealthBench | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 | - |
Acceleration Engine: vLLM, SGLang, KTransformers
Test Hardware: NVIDIA H100, NVIDIA A100
You can access the Kimi K2 Thinking API at https://platform.moonshot.ai, which provides OpenAI- and Anthropic-compatible endpoints.
Currently, Kimi K2 Thinking is recommended to run on the following inference engines: vLLM, SGLang, and KTransformers.
Deployment examples can be found in the Model Deployment Guide.
Once the local inference service is up, you can interact with it through the chat endpoint:
```python
import openai

def simple_chat(client: openai.OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": "which one is bigger, 9.11 or 9.9? think carefully."}]},
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        temperature=1.0,
        max_tokens=4096,
    )
    # The final answer and the reasoning trace are returned as separate fields.
    print(f"k2 answer: {response.choices[0].message.content}")
    print("=====below is reasoning content======")
    print(f"reasoning content: {response.choices[0].message.reasoning_content}")
```
NOTE
The recommended temperature for Kimi K2 Thinking is `temperature = 1.0`.
If no special instructions are required, the system prompt above is a good default.
Kimi K2 Thinking uses the same tool-calling settings as Kimi K2 Instruct. To enable tool calling, pass the list of available tools in each request; the model then autonomously decides when and how to invoke them.
The following example demonstrates calling a weather tool end-to-end:
```python
import json

from openai import OpenAI

# Your tool implementation
def get_weather(city: str) -> dict:
    return {"weather": "Sunny"}

# Tool schema definition
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information. Call this when the user asks about the weather.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "Name of the city"
                }
            }
        }
    }
}]

# Map tool names to their implementations
tool_map = {
    "get_weather": get_weather
}

def tool_call_with_client(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "What's the weather like in Beijing today? Use the tool to check."}
    ]
    finish_reason = None
    # Keep looping while the model keeps requesting tool calls.
    while finish_reason is None or finish_reason == "tool_calls":
        completion = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=1.0,
            tools=tools,  # tool list defined above
            tool_choice="auto"
        )
        choice = completion.choices[0]
        finish_reason = choice.finish_reason
        if finish_reason == "tool_calls":
            # Append the assistant turn containing the tool calls, then one
            # "tool" message with the result of each call.
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tool_call_name = tool_call.function.name
                tool_call_arguments = json.loads(tool_call.function.arguments)
                tool_function = tool_map[tool_call_name]
                tool_result = tool_function(**tool_call_arguments)
                print("tool_result:", tool_result)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": tool_call_name,
                    "content": json.dumps(tool_result)
                })
    print("-" * 100)
    print(choice.message.content)
```
The `tool_call_with_client` function implements the pipeline from user query to tool execution.
This pipeline requires the inference engine to support Kimi K2 Thinking's native tool-parsing logic.
For more information, see the Tool Calling Guide.
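As a usage sketch, the loop can be driven with the same kind of client as before (endpoint and model ID are again assumptions for a local deployment, not fixed values):

```python
# Hypothetical local endpoint, as in the earlier chat example.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tool_call_with_client(client, "moonshotai/Kimi-K2-Thinking")
```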
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here