Install NemoClaw on DGX Station with local vLLM inference and Telegram bot integration
These steps prepare a fresh DGX Station for NemoClaw. If Docker, the NVIDIA runtime, and vLLM are already configured, skip to Phase 2.
IMPORTANT
Disk space: NemoClaw’s onboard flow pulls a multi-gigabyte sandbox image and runs Docker, k3s, and the gateway together. If root or Docker’s data disk is nearly full (for example only a few gigabytes free), onboarding can fail with generic errors such as “K8s namespace not ready” with no clear hint about storage. Before you start, check free space: df -h / /var/lib/docker. NVIDIA recommends at least 40 GB free on the filesystem that holds Docker layers (often / or /var/lib/docker); treat under ~15 GB as high risk for first-time onboard failures.
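A scripted version of that check, as a minimal sketch (the 15 GB threshold mirrors the guidance above, and the path assumes Docker's default data root):
# Warn if the filesystem backing /var/lib/docker has less than ~15 GB free.
avail_kb=$(df --output=avail /var/lib/docker 2>/dev/null | tail -1)
if [ "${avail_kb:-0}" -lt $((15 * 1024 * 1024)) ]; then
  echo "WARNING: under 15 GB free on the Docker data disk; onboarding may fail."
else
  echo "OK: $((avail_kb / 1024 / 1024)) GB free on the Docker data disk."
fi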
OpenShell's gateway runs k3s inside Docker. On DGX Station (Ubuntu 24.04, cgroup v2), Docker must be configured with the NVIDIA runtime and host cgroup namespace mode.
Configure the NVIDIA container runtime for Docker:
sudo nvidia-ctk runtime configure --runtime=docker
Expected:
INFO Loading config from /etc/docker/daemon.json
INFO Wrote updated config to /etc/docker/daemon.json
INFO It is recommended that docker daemon be restarted.
Set the cgroup namespace mode required by OpenShell on DGX Station:
sudo python3 -c "
import json, os
path = '/etc/docker/daemon.json'
d = json.load(open(path)) if os.path.exists(path) else {}
d['default-cgroupns-mode'] = 'host'
json.dump(d, open(path, 'w'), indent=2)
"
Restart Docker:
sudo systemctl restart docker
Verify the NVIDIA runtime works:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Expected:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| 0 NVIDIA GB300 On | 00000009:06:00.0 Off | 0 |
| N/A 46C P0 215W / 1300W | 18661MiB / 256703MiB | 0% Default |
+-----------------------------------------+------------------------+----------------------+
If you get a permission denied error on docker, add your user to the Docker group and activate the new group in your current session:
sudo usermod -aG docker $USER
newgrp docker
This applies the group change immediately. Alternatively, you can log out and back in instead of running newgrp docker.
NOTE
DGX Station uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without default-cgroupns-mode: host, the gateway can fail with "Failed to start ContainerManager" errors.
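To double-check that both settings are active after the restart, a quick sanity check against the config file and standard docker info output:
grep -E 'default-cgroupns-mode|nvidia' /etc/docker/daemon.json
docker info 2>/dev/null | grep -iE 'cgroup|runtimes|default runtime'
daemon.json should show "default-cgroupns-mode": "host" and an nvidia runtime entry; docker info should report Cgroup Version: 2 and list nvidia under Runtimes.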
Install pip and the Hugging Face CLI (if not already installed):
sudo apt install -y python3-pip
pip3 install --break-system-packages huggingface-hub
Download Nemotron 3 Super 120B in NVFP4 quantization (~60 GB; may take 10--20 minutes depending on network speed):
hf download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
Expected (on a fresh download; cached downloads complete instantly):
Fetching 36 files: 100%|██████████| 36/36 [15:42<00:00, 26.18s/it]
/home/nvidia/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/snapshots/0d6fa3ecad422a...
Verify the download completed:
ls ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/
Expected:
blobs refs snapshots
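Optionally, confirm the on-disk size as well; an obviously small total usually means an interrupted download (re-running hf download resumes from the cache):
du -sh ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/
Expect a figure on the order of 60 GB for this NVFP4 checkpoint.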
NOTE
The NVFP4 quantization is chosen because it fits entirely in one GB300 GPU’s 256 GB HBM3e with room for KV cache. On a two-GPU station you can still use NVFP4 with --tensor-parallel-size 1 and a single visible GPU, or shard with --tensor-parallel-size 2. For other quantization variants, see Troubleshooting.
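To see how much HBM3e each GPU has, and how much is currently free, before deciding between the single-GPU and tensor-parallel launches below:
nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv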
Launch vLLM using the NVIDIA-optimized container image.
Single GPU (default on one-GPU systems, or pin to one GPU on multi-GPU stations): vLLM can emit mixed device warnings if several GPUs are visible but the model is only meant to use one. Pinning avoids accidentally placing weights on an unexpected device.
docker run -d --name vllm-nemotron \
--runtime nvidia --gpus '"device=0"' \
-e CUDA_VISIBLE_DEVICES=0 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--restart unless-stopped \
nvcr.io/nvidia/vllm:26.03-py3 \
python3 -m vllm.entrypoints.openai.api_server \
--model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--trust-remote-code \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser nemotron_v3
Two GPUs (tensor parallel): If your DGX Station has two Blackwell GPUs and you want Nemotron sharded across both, use both devices and set tensor parallel size to 2 (VRAM is summed across the GPUs):
docker run -d --name vllm-nemotron \
--runtime nvidia --gpus all \
-e CUDA_VISIBLE_DEVICES=0,1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--restart unless-stopped \
nvcr.io/nvidia/vllm:26.03-py3 \
python3 -m vllm.entrypoints.openai.api_server \
--model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--trust-remote-code \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser nemotron_v3
Pick a GPU index by name (optional one-liner): To print the device index of the first GPU whose name contains GB300 (adjust the pattern if your nvidia-smi name string differs), run on the host:
nvidia-smi --query-gpu=index,name --format=csv,noheader | awk -F', ' '/GB300/ { gsub(/^ +/, "", $1); print $1; exit }'
Use that index in Docker as --gpus '"device=N"' (replace N with the printed index).
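For example, a small sketch that captures the index in a shell variable so you can splice it into the docker run command (the nested quotes around device= are what Docker expects for the --gpus value):
# Capture the host index of the first GPU whose name matches GB300.
GPU_IDX=$(nvidia-smi --query-gpu=index,name --format=csv,noheader \
  | awk -F', ' '/GB300/ {gsub(/^ +/, "", $1); print $1; exit}')
echo "Selected host GPU index: ${GPU_IDX}"
# In the single-GPU command above, replace --gpus '"device=0"' with:
#   --gpus "\"device=${GPU_IDX}\""
Inside the container the selected GPU is the only device visible, so it enumerates as device 0 there and CUDA_VISIBLE_DEVICES=0 from the single-GPU command still applies.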
NOTE
--tool-call-parser qwen3_xml: Nemotron’s tool-call wire format is exposed through vLLM’s Qwen3-compatible XML tool parser — the name refers to the parser implementation, not the base model. This pairing is what vLLM expects for correct function/tool calling with this checkpoint.
The first startup loads ~70 GB of weights into GPU memory. Watch the logs until you see the model is ready:
docker logs -f vllm-nemotron
Wait until you see the following in the logs (typically 3--5 minutes):
INFO Loading weights took 55.47 seconds
INFO Model loading took 69.39 GiB memory and 71.31 seconds
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
Then verify the API is responding:
curl -s http://localhost:8000/v1/models
Expected:
{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]}
Send a test request to warm up the model before proceeding to Step 4. The first inference request compiles CUDA graphs and can take 30--90 seconds:
curl -s --max-time 120 http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"Say hello."}],"max_tokens":10}'
Expected (the first request may take 30--90 seconds; subsequent requests are much faster):
{"id":"chatcmpl-...","object":"chat.completion","model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","choices":[{"index":0,"message":{"role":"assistant","content":"..."},"finish_reason":"length"}],...}
IMPORTANT
Warm up the model before running the NemoClaw installer. The onboard wizard validates the vLLM endpoint with a short timeout. If the model has not served at least one request, this validation will time out and the install will fail.
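If you are scripting the bring-up, a small readiness gate like the following (a sketch built from the commands above) waits for the API and forces one warm-up completion before the installer runs:
# Poll until the vLLM API answers, then send one short warm-up request.
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "Waiting for vLLM to start..."
  sleep 15
done
curl -s --max-time 180 http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"ping"}],"max_tokens":5}' \
  > /dev/null && echo "vLLM is warm."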
IMPORTANT
Always start vLLM via the Docker container -- do not run vllm serve directly on the host. The NVIDIA container image (nvcr.io/nvidia/vllm:26.03-py3) includes optimized kernels for the GB300's Blackwell architecture that are not available in the pip-installed version.
NOTE
Key flags explained:
--tensor-parallel-size -- 1 for a single visible GPU; 2 when you expose two GPUs for tensor-parallel sharding (see Step 3).
--trust-remote-code -- required for the Mamba2-Transformer hybrid architecture.
--max-model-len 32768 -- maximum context length (increase up to 1M if VRAM allows).
--enable-auto-tool-choice --tool-call-parser qwen3_xml -- enables function/tool calling for the agent (see the note above on the parser name).
--reasoning-parser nemotron_v3 -- separates chain-of-thought reasoning from the response so the TUI/Web UI can display them cleanly.
The installer script installs Node.js (if needed), OpenShell, the NemoClaw CLI, and runs onboarding to create a sandbox. The vLLM provider requires the experimental flag and an extended inference timeout (the default 15-second validation timeout is too short for a 120B model).
This path is best for SSH sessions, automation, and documentation — no arrow-key TUI in the terminal.
NEMOCLAW_EXPERIMENTAL=1 \
NEMOCLAW_NON_INTERACTIVE=1 \
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
NEMOCLAW_SANDBOX_NAME=my-assistant \
NEMOCLAW_PROVIDER=vllm \
NEMOCLAW_MODEL="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \
bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)"
Optional: include Telegram in the first onboard without typing the token over SSH — export credentials on the host before running the installer (same variables the NemoClaw Telegram bridge guide documents):
export TELEGRAM_BOT_TOKEN='<paste-token-here>'
# Optional DM allowlist (comma-separated Telegram user IDs):
# export TELEGRAM_ALLOWED_IDS='123456789,987654321'
Use Telegram Desktop or web.telegram.org on a laptop to copy the token from @BotFather and paste into your SSH session (or into a small env file you source). Typing a 46+ character token on a phone keyboard into a remote shell is error-prone.
To persist TELEGRAM_BOT_TOKEN across reboots, keep it in a root-owned or user-only file and source it from your shell profile (example — adjust path and permissions):
mkdir -p ~/.nemoclaw
install -m 600 /dev/null ~/.nemoclaw/telegram.env
nano ~/.nemoclaw/telegram.env # add: export TELEGRAM_BOT_TOKEN='...'
grep -q 'nemoclaw/telegram.env' ~/.bashrc || echo 'source ~/.nemoclaw/telegram.env 2>/dev/null' >> ~/.bashrc
NemoClaw also stores messaging credentials in its credential store when you onboard or run nemoclaw … channels add telegram; the file above is mainly for re-running scripts or non-interactive flows that read the environment.
If you prefer the wizard:
NEMOCLAW_EXPERIMENTAL=1 \
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \
bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)"
The wizard asks six high-level prompts (third-party notice, inference provider, Brave Search, messaging channels, sandbox name, policy presets). In parallel, the installer prints eight numbered onboard sub-phases, [1/8] … [8/8] (preflight, gateway, inference detection, inference route, messaging channels, sandbox creation, OpenClaw inside sandbox, policy presets). Those two numberings are different on purpose — the [n/8] lines are internal progress steps; the numbered prompts below are what you answer in the TUI.
1. Third-party notice: type yes to accept and continue.
2. Inference provider: select Local vLLM [experimental] — running.
3. Brave Search: skip if you don't have a Brave Search API key.
4. Messaging channels: enable Telegram here if you already have a bot token, or skip and add it later (see the Telegram steps below).
5. Sandbox name: for example my-assistant. Names must be lowercase alphanumeric with hyphens only.
6. Policy presets: pypi and npm are selected by default. Press Enter to confirm.
The install takes approximately 3 minutes. Example milestones in the output (wording may vary slightly by release):
[1/3] Node.js
Node.js found: v22.22.2
[2/3] NemoClaw CLI
Installing NemoClaw from GitHub...
Verified: nemoclaw is available at /home/nvidia/.local/bin/nemoclaw
[3/3] Onboarding
[1/8] Preflight checks
✓ Docker is running
✓ NVIDIA GPU detected: 2 GPU(s), 256703 MB VRAM # example on a two-GPU system
[2/8] Starting OpenShell gateway
✓ Gateway is healthy
[3/8] Configuring inference (NIM)
✓ Using existing vLLM on localhost:8000
Detected model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
[4/8] Setting up inference provider
✓ Inference route set: vllm-local / nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
[5/8] Messaging channels
(example) Telegram disabled — skipped
# or: Telegram enabled; token stored in credential store
[6/8] Creating sandbox
✓ Sandbox 'my-assistant' created
[7/8] Setting up OpenClaw inside sandbox
✓ OpenClaw gateway launched inside sandbox
[8/8] Policy presets
Applied preset: pypi
Applied preset: npm
When complete you will see:
──────────────────────────────────────────────────
Sandbox my-assistant (Landlock + seccomp + netns)
Model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (Local vLLM)
──────────────────────────────────────────────────
Run: nemoclaw my-assistant connect
Status: nemoclaw my-assistant status
Logs: nemoclaw my-assistant logs --follow
OpenClaw UI (tokenized URL; treat it like a password)
http://127.0.0.1:18789/#token=<long-token-here>
──────────────────────────────────────────────────
IMPORTANT
Save the tokenized Web UI URL printed at the end -- you will need it in Step 8. It looks like:
http://127.0.0.1:18789/#token=<long-token-here>
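One way to keep the URL out of your shell history while keeping it handy is the same 600-permission file pattern used for the Telegram token (the file name here is an arbitrary choice):
install -m 600 /dev/null ~/.nemoclaw/webui-url.txt
nano ~/.nemoclaw/webui-url.txt   # paste the full http://127.0.0.1:18789/#token=... URL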
IMPORTANT
NEMOCLAW_EXPERIMENTAL=1 is required for the vLLM provider. Without it, the installer will report "Requested provider 'vllm' is not available in this environment."
IMPORTANT
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 extends the validation timeout from the default 15 seconds to 300 seconds. Without this, the endpoint validation will fail on a cold 120B model, even if you warmed it up in Step 3 -- the installer sends its own test prompt which may be slower.
NOTE
If nemoclaw is not found after install, run source ~/.bashrc to reload your shell path.
Connect to the sandbox:
nemoclaw my-assistant connect
Expected:
sandbox@my-assistant:~$
You are now inside the sandboxed environment. Verify that the inference route is working:
curl -sf https://inference.local/v1/models
Expected:
{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]}
Still inside the sandbox, send a test message through the OpenClaw gateway (the default path). The --local flag is intentionally blocked inside the NemoClaw OpenShell sandbox — it would bypass gateway controls — so the command you may see in generic OpenClaw quickstarts will fail here.
openclaw agent --agent main -m "hello" --session-id test
Expected (the agent will think, then respond -- first response may take 30--90 seconds): streaming or printed assistant text ending with a normal reply.
If you see a response from the agent, inference is working end-to-end.
Launch the terminal UI for an interactive chat session:
openclaw tui
Press Ctrl+C to exit the TUI.
Exit the sandbox to return to the host:
exit
If accessing the Web UI directly on the DGX Station (keyboard and monitor attached), open a browser and navigate to the tokenized URL from Step 4. Prefer 127.0.0.1 in the URL bar (not localhost) so it matches strict gateway origin checks:
http://127.0.0.1:18789/#token=<long-token-here>
If accessing the Web UI from a remote machine, you need to set up port forwarding.
First, find your DGX Station's IP address. On the Station, run:
hostname -I | awk '{print $1}'
Start the port forward on the DGX Station host:
openshell forward start 18789 my-assistant --background
Expected:
Forwarding 127.0.0.1:18789 -> my-assistant:18789 (background)
If the forward was already started during onboarding, you will see:
Error: Port 18789 is already forwarded to sandbox 'my-assistant'.
This is fine -- the forward is already running.
Then from your remote machine, create an SSH tunnel to the Station (replace <your-station-ip> with the IP address from above):
ssh -L 18789:127.0.0.1:18789 <your-user>@<your-station-ip>
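If you prefer not to keep an interactive shell open for the tunnel, standard OpenSSH can run the same forward in the background (-f backgrounds after authentication, -N skips running a remote command):
ssh -fN -L 18789:127.0.0.1:18789 <your-user>@<your-station-ip>
Kill that ssh process when you are done with the Web UI.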
Now open the tokenized URL in your remote machine's browser; the tunnel binds it to your local loopback:
http://127.0.0.1:18789/#token=<long-token-here>
IMPORTANT
Use 127.0.0.1, not localhost -- the gateway origin check requires an exact match.
Messaging (Telegram, Discord, Slack) is wired during onboarding — credentials are stored, OpenShell providers are created, and channel configuration is baked into the sandbox image. Runtime config under /sandbox/.openclaw/ is not safely patchable from inside the running sandbox.
nemoclaw start does not start the Telegram bridge. In current NemoClaw releases it starts optional host services such as the cloudflared tunnel when installed; Telegram delivery stays under OpenShell. See NemoClaw commands and Set up Telegram bridge.
Open Telegram, find @BotFather, send /newbot, and follow the prompts. Copy the bot token.
Tip: Use Telegram Desktop or web.telegram.org so you can copy-paste the token into your terminal or env file instead of typing 46+ characters from your phone into SSH.
Export the token on the host, then run the installer / onboard again (non-interactive variables from Step 4, plus TELEGRAM_BOT_TOKEN). The wizard’s Messaging channels step (installer phase [5/8]) is the right time to toggle Telegram interactively.
Re-onboarding after a sandbox exists is supported; NemoClaw can detect token changes and rebuild the sandbox — see the official Telegram bridge page.
On the host (run exit if you are inside nemoclaw … connect):
Add the telegram network preset:
nemoclaw my-assistant policy-add
When prompted, select telegram and confirm.
export TELEGRAM_BOT_TOKEN='<your-bot-token>'
nemoclaw my-assistant channels add telegram
Follow the prompts to rebuild when asked (or run nemoclaw my-assistant rebuild --yes afterward if non-interactive mode queued a rebuild — see NEMOCLAW_NON_INTERACTIVE=1 behavior in the commands reference).
To stop or start the telegram channel later, use the nemoclaw channels stop / nemoclaw channels start patterns described in Set up Telegram bridge (exact subcommand spelling may vary slightly by NemoClaw version; use nemoclaw --help if in doubt).
Check overall status:
nemoclaw status
Open Telegram, find your bot, and send it a message.
NOTE
The first response may take 30--90 seconds for a 120B parameter model running locally.
NOTE
To persist TELEGRAM_BOT_TOKEN for shell-based flows, use a chmod 600 env file and source it from ~/.bashrc as shown in Step 4.
NOTE
For chat allowlists and advanced Telegram behavior, see NemoClaw Telegram bridge documentation.
Stop any running auxiliary services (Telegram bridge, cloudflared tunnel):
nemoclaw stop
Expected:
[services] All services stopped.
Stop the port forward (always pass port and sandbox name):
openshell forward list
openshell forward stop 18789 my-assistant
Stop and remove the vLLM container so the name vllm-nemotron is free for a future run. The playbook created the container with --restart unless-stopped, so docker stop alone is not enough: Docker would restart it after reboot and the container would keep reserving GPU memory.
docker update --restart=no vllm-nemotron 2>/dev/null || true
docker stop vllm-nemotron
docker rm vllm-nemotron
To remove the container in one step even if it is running: docker rm -f vllm-nemotron.
Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and vLLM are preserved.
cd ~/.nemoclaw/source
./uninstall.sh
Uninstaller flags:
| Flag | Effect |
|---|---|
| --yes | Skip the confirmation prompt |
| --keep-openshell | Leave the openshell binary in place |
| --delete-models | Remove local inference models pulled by older NemoClaw flows (the upstream flag name still references Ollama). It does not remove the Hugging Face weights used by this playbook’s vLLM container — delete those separately (below). |
To also remove the vLLM container and cached model weights:
./uninstall.sh --yes
docker rm -f vllm-nemotron 2>/dev/null || true
rm -rf ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/
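If you also want the disk space back from the vLLM container image itself (the uninstaller preserves it), remove the image as well, assuming nothing else on the host uses it:
docker image rm nvcr.io/nvidia/vllm:26.03-py3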
The uninstaller runs 6 steps. Among other things it removes the nemoclaw npm package, deletes local models when --delete-models is passed, and removes the state directories (~/.nemoclaw, ~/.config/openshell, ~/.config/nemoclaw) and the OpenShell binary.
NOTE
The source clone at ~/.nemoclaw/source is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller.
| Command | Description |
|---|---|
| nemoclaw my-assistant connect | Shell into the sandbox |
| nemoclaw my-assistant status | Show sandbox status and inference config |
| nemoclaw my-assistant logs --follow | Stream sandbox logs in real time |
| nemoclaw list | List all registered sandboxes |
| nemoclaw tunnel start | Start optional host services such as cloudflared (public dashboard URL when installed); does not start Telegram |
| nemoclaw start | Deprecated alias for tunnel/aux host services — not for Telegram |
| nemoclaw stop | Stop host auxiliary services started by nemoclaw tunnel start / nemoclaw start |
| nemoclaw <sandbox> channels add telegram | Store Telegram token and rebuild sandbox (host) |
| openshell term | Open the monitoring TUI on the host |
| openshell forward list | List active port forwards |
| openshell forward start 18789 my-assistant --background | Start port forwarding for Web UI |
| openshell forward stop 18789 my-assistant | Stop Web UI port forward |
| docker logs -f vllm-nemotron | Stream vLLM inference server logs |
| docker restart vllm-nemotron | Restart the vLLM inference server |
| curl http://localhost:8000/v1/models | Check vLLM API status |
| cd ~/.nemoclaw/source && ./uninstall.sh | Remove NemoClaw (preserves Docker, Node.js, vLLM image) |