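import os
from openai import OpenAI

# Read the key from the environment; Python does not expand the literal string "$NVIDIA_API_KEY".
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)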
completion = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b",
    messages=[{"role": "user", "content": ""}],  # put your prompt in "content"
    temperature=1,
    top_p=0.95,
    max_tokens=16384,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "reasoning_budget": 16384},
    stream=True,
)
for chunk in completion:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content is not a standard OpenAI field, so read it defensively
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="")
    if delta.content is not None:
        print(delta.content, end="")

Follow the steps below to download and run the NVIDIA NIM inference microservice for this model on your infrastructure of choice.
$ docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Pull and run the NVIDIA NIM with the command below. This will download the optimized model for your infrastructure.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod -R a+w "$LOCAL_NIM_CACHE"
docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:latest
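The first start can take a while because the container downloads the model, so it may help to wait until the server reports ready before sending requests. The sketch below polls a health endpoint until it answers with HTTP 200. The endpoint path `/v1/health/ready`, the helper names `http_probe` and `wait_until_ready`, and the timing defaults are assumptions made for illustration; check the NIM documentation for the image you are running.

```python
import time
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts are all OSError subclasses.
        return False


def wait_until_ready(url, probe=http_probe, timeout_s=1800.0, interval_s=10.0, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

With the container started as above, `wait_until_ready("http://0.0.0.0:8000/v1/health/ready")` would block until the server responds or the timeout expires, assuming that path is the one your image exposes.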
You can now make a local API call using this curl command:
curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-3-super-120b-a12b",
        "messages": [{"role":"user", "content":"Which number is larger, 9.11 or 9.8?"}],
        "max_tokens": 64
    }'
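The same request can also be sent from Python. The sketch below mirrors the curl call using only the standard library. `build_chat_request` and `send` are hypothetical helper names, and the response shape (`choices[0].message.content`) is assumed to follow the OpenAI-compatible chat completions format used by the hosted API example.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://0.0.0.0:8000/v1",
                       model="nvidia/nemotron-3-super-120b-a12b", max_tokens=64):
    """Build (but do not send) the POST request that the curl command issues."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text (assumed response shape)."""
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Against a running container, `send(build_chat_request("Which number is larger, 9.11 or 9.8?"))` should return the model's reply as a string. Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.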
For more details on getting started with this NIM, visit the NVIDIA NIM Docs.