Install NemoClaw on DGX Station with local vLLM inference and Telegram bot integration
These steps prepare a fresh DGX Station for NemoClaw. If Docker, the NVIDIA runtime, and vLLM are already configured, skip to Phase 2.
IMPORTANT
Disk space: NemoClaw’s onboard flow pulls a multi-gigabyte sandbox image and runs Docker, k3s, and the gateway together. If root or Docker’s data disk is nearly full (for example only a few gigabytes free), onboarding can fail with generic errors such as “K8s namespace not ready” with no clear hint about storage. Before you start, check free space: df -h / /var/lib/docker. NVIDIA recommends at least 40 GB free on the filesystem that holds Docker layers (often / or /var/lib/docker); treat under ~15 GB as high risk for first-time onboard failures.
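A scripted version of that check, as a minimal sketch (the 15 GB threshold mirrors the guidance above, and the path assumes Docker's default data root):
# Warn if the filesystem backing /var/lib/docker has less than ~15 GB free.
avail_kb=$(df --output=avail /var/lib/docker 2>/dev/null | tail -1)
if [ "${avail_kb:-0}" -lt $((15 * 1024 * 1024)) ]; then
  echo "WARNING: under 15 GB free on the Docker data disk; onboarding may fail."
else
  echo "OK: $((avail_kb / 1024 / 1024)) GB free on the Docker data disk."
fi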
OpenShell's gateway runs k3s inside Docker. On DGX Station (Ubuntu 24.04, cgroup v2), Docker must be configured with the NVIDIA runtime and host cgroup namespace mode.
Configure the NVIDIA container runtime for Docker:
sudo nvidia-ctk runtime configure --runtime=docker
Expected:
INFO Loading config from /etc/docker/daemon.json
INFO Wrote updated config to /etc/docker/daemon.json
INFO It is recommended that docker daemon be restarted.
Set the cgroup namespace mode required by OpenShell on DGX Station:
sudo python3 -c "
import json, os
path = '/etc/docker/daemon.json'
d = json.load(open(path)) if os.path.exists(path) else {}
d['default-cgroupns-mode'] = 'host'
json.dump(d, open(path, 'w'), indent=2)
"
Restart Docker:
sudo systemctl restart docker
Verify the NVIDIA runtime works:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Expected:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| 0 NVIDIA GB300 On | 00000009:06:00.0 Off | 0 |
| N/A 46C P0 215W / 1300W | 18661MiB / 256703MiB | 0% Default |
+-----------------------------------------+------------------------+----------------------+
If you get a permission denied error on docker, add your user to the Docker group and activate the new group in your current session:
sudo usermod -aG docker $USER
newgrp docker
This applies the group change immediately. Alternatively, you can log out and back in instead of running newgrp docker.
NOTE
DGX Station uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without default-cgroupns-mode: host, the gateway can fail with "Failed to start ContainerManager" errors.
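To double-check that both settings are active after the restart, a quick sanity check against the config file and standard docker info output:
grep -E 'default-cgroupns-mode|nvidia' /etc/docker/daemon.json
docker info 2>/dev/null | grep -iE 'cgroup|runtimes|default runtime'
daemon.json should show "default-cgroupns-mode": "host" and an nvidia runtime entry; docker info should report Cgroup Version: 2 and list nvidia under Runtimes.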
Install pip and the Hugging Face CLI (if not already installed):
sudo apt install -y python3-pip
pip3 install --break-system-packages huggingface-hub
Download Nemotron 3 Super 120B in NVFP4 quantization (~60 GB; may take 10--20 minutes depending on network speed):
hf download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
Expected (on a fresh download; cached downloads complete instantly):
Fetching 36 files: 100%|██████████| 36/36 [15:42<00:00, 26.18s/it]
/home/nvidia/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/snapshots/0d6fa3ecad422a...
Verify the download completed:
ls ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/
Expected:
blobs refs snapshots
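Optionally, confirm the on-disk size as well; an obviously small total usually means an interrupted download (re-running hf download resumes from the cache):
du -sh ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/
Expect a figure on the order of 60 GB for this NVFP4 checkpoint.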
NOTE
The NVFP4 quantization is chosen because it fits entirely in one GB300 GPU’s 256 GB HBM3e with room for KV cache. On a two-GPU station you can still use NVFP4 with --tensor-parallel-size 1 and a single visible GPU, or shard with --tensor-parallel-size 2. For other quantization variants, see Troubleshooting.
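To see how much HBM3e each GPU has, and how much is currently free, before deciding between the single-GPU and tensor-parallel launches below:
nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv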
Launch vLLM using the NVIDIA-optimized container image.
Single GPU (default on one-GPU systems, or pin to one GPU on multi-GPU stations): vLLM can emit mixed device warnings if several GPUs are visible but the model is only meant to use one. Pinning avoids accidentally placing weights on an unexpected device.
docker run -d --name vllm-nemotron \
--runtime nvidia --gpus '"device=0"' \
-e CUDA_VISIBLE_DEVICES=0 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--restart unless-stopped \
nvcr.io/nvidia/vllm:26.03-py3 \
python3 -m vllm.entrypoints.openai.api_server \
--model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--trust-remote-code \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser nemotron_v3
Two GPUs (tensor parallel): If your DGX Station has two Blackwell GPUs and you want Nemotron sharded across both, use both devices and set tensor parallel size to 2 (VRAM is summed across the GPUs):
docker run -d --name vllm-nemotron \
--runtime nvidia --gpus all \
-e CUDA_VISIBLE_DEVICES=0,1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--restart unless-stopped \
nvcr.io/nvidia/vllm:26.03-py3 \
python3 -m vllm.entrypoints.openai.api_server \
--model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--trust-remote-code \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser nemotron_v3
Pick a GPU index by name (optional one-liner): To print the device index of the first GPU whose name contains GB300 (adjust the pattern if your nvidia-smi name string differs), run on the host:
nvidia-smi --query-gpu=index,name --format=csv,noheader | awk -F', ' '/GB300/ { gsub(/^ +/, "", $1); print $1; exit }'
Use that index in Docker as --gpus '"device=N"' (replace N with the printed index).
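For example, a small sketch that captures the index in a shell variable so you can splice it into the docker run command (the nested quotes around device= are what Docker expects for the --gpus value):
# Capture the host index of the first GPU whose name matches GB300.
GPU_IDX=$(nvidia-smi --query-gpu=index,name --format=csv,noheader \
  | awk -F', ' '/GB300/ {gsub(/^ +/, "", $1); print $1; exit}')
echo "Selected host GPU index: ${GPU_IDX}"
# In the single-GPU command above, replace --gpus '"device=0"' with:
#   --gpus "\"device=${GPU_IDX}\""
Inside the container the selected GPU is the only device visible, so it enumerates as device 0 there and CUDA_VISIBLE_DEVICES=0 from the single-GPU command still applies.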
NOTE
--tool-call-parser qwen3_xml: Nemotron’s tool-call wire format is exposed through vLLM’s Qwen3-compatible XML tool parser — the name refers to the parser implementation, not the base model. This pairing is what vLLM expects for correct function/tool calling with this checkpoint.
The first startup loads ~70 GB of weights into GPU memory. Watch the logs until you see the model is ready:
docker logs -f vllm-nemotron
Wait until you see the following in the logs (typically 3--5 minutes):
INFO Loading weights took 55.47 seconds
INFO Model loading took 69.39 GiB memory and 71.31 seconds
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
Then verify the API is responding:
curl -s http://localhost:8000/v1/models
Expected:
{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]}
Send a test request to warm up the model before proceeding to Step 4. The first inference request compiles CUDA graphs and can take 30--90 seconds:
curl -s --max-time 120 http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"Say hello."}],"max_tokens":10}'
Expected (the first request may take 30--90 seconds; subsequent requests are much faster):
{"id":"chatcmpl-...","object":"chat.completion","model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","choices":[{"index":0,"message":{"role":"assistant","content":"..."},"finish_reason":"length"}],...}
IMPORTANT
Warm up the model before running the NemoClaw installer. The onboard wizard validates the vLLM endpoint with a short timeout. If the model has not served at least one request, this validation will time out and the install will fail.
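If you are scripting the bring-up, a small readiness gate like the following (a sketch built from the commands above) waits for the API and forces one warm-up completion before the installer runs:
# Poll until the vLLM API answers, then send one short warm-up request.
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "Waiting for vLLM to start..."
  sleep 15
done
curl -s --max-time 180 http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"ping"}],"max_tokens":5}' \
  > /dev/null && echo "vLLM is warm."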
IMPORTANT
Always start vLLM via the Docker container -- do not run vllm serve directly on the host. The NVIDIA container image (nvcr.io/nvidia/vllm:26.03-py3) includes optimized kernels for the GB300's Blackwell architecture that are not available in the pip-installed version.
NOTE
Key flags explained:
--tensor-parallel-size -- 1 for a single visible GPU; 2 when you expose two GPUs for tensor-parallel sharding (see Step 3).
--trust-remote-code -- required for the Mamba2-Transformer hybrid architecture.
--max-model-len 32768 -- maximum context length (increase up to 1M if VRAM allows).
--enable-auto-tool-choice --tool-call-parser qwen3_xml -- enables function/tool calling for the agent (see the note above on the parser name).
--reasoning-parser nemotron_v3 -- separates chain-of-thought reasoning from the response so the TUI/Web UI can display them cleanly.
The installer script installs Node.js (if needed), OpenShell, the NemoClaw CLI, and runs onboarding to create a sandbox. The vLLM provider requires the experimental flag and an extended inference timeout (the default 15-second validation timeout is too short for a 120B model).
This path is best for SSH sessions, automation, and documentation — no arrow-key TUI in the terminal.
NEMOCLAW_EXPERIMENTAL=1 \
NEMOCLAW_NON_INTERACTIVE=1 \
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
NEMOCLAW_SANDBOX_NAME=my-assistant \
NEMOCLAW_PROVIDER=vllm \
NEMOCLAW_MODEL="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \
bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)"
Optional: include Telegram in the first onboard without typing the token over SSH — export credentials on the host before running the installer (same variables the NemoClaw Telegram bridge guide documents):
export TELEGRAM_BOT_TOKEN='<paste-token-here>'
# Optional DM allowlist (comma-separated Telegram user IDs):
# export TELEGRAM_ALLOWED_IDS='123456789,987654321'
Use Telegram Desktop or web.telegram.org on a laptop to copy the token from @BotFather and paste into your SSH session (or into a small env file you source). Typing a 46+ character token on a phone keyboard into a remote shell is error-prone.
To persist TELEGRAM_BOT_TOKEN across reboots, keep it in a root-owned or user-only file and source it from your shell profile (example — adjust path and permissions):
mkdir -p ~/.nemoclaw
install -m 600 /dev/null ~/.nemoclaw/telegram.env
nano ~/.nemoclaw/telegram.env # add: export TELEGRAM_BOT_TOKEN='...'
grep -q 'nemoclaw/telegram.env' ~/.bashrc || echo 'source ~/.nemoclaw/telegram.env 2>/dev/null' >> ~/.bashrc
NemoClaw also stores messaging credentials in its credential store when you onboard or run nemoclaw … channels add telegram; the file above is mainly for re-running scripts or non-interactive flows that read the environment.
If you prefer the wizard:
NEMOCLAW_EXPERIMENTAL=1 \
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \
bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)"
The wizard asks six high-level prompts (third-party notice, inference provider, Brave Search, messaging channels, sandbox name, policy presets). In parallel, the installer prints eight numbered onboard sub-phases, [1/8] … [8/8] (preflight, gateway, inference detection, inference route, messaging channels, sandbox creation, OpenClaw inside sandbox, policy presets). Those two numberings are different on purpose — the [n/8] lines are internal progress steps; the numbered prompts below are what you answer in the TUI.
1. Third-party notice: type yes to accept and continue.
2. Inference provider: select Local vLLM [experimental] — running.
3. Brave Search: skip if you don't have a Brave Search API key.
4. Messaging channels: enable Telegram here if you already have a bot token, or skip and add it later (see the Telegram steps below).
5. Sandbox name: for example my-assistant. Names must be lowercase alphanumeric with hyphens only.
6. Policy presets: pypi and npm are selected by default. Press Enter to confirm.
The install takes approximately 3 minutes. Example milestones in the output (wording may vary slightly by release):
[1/3] Node.js
Node.js found: v22.22.2
[2/3] NemoClaw CLI
Installing NemoClaw from GitHub...
Verified: nemoclaw is available at /home/nvidia/.local/bin/nemoclaw
[3/3] Onboarding
[1/8] Preflight checks
✓ Docker is running
✓ NVIDIA GPU detected: 2 GPU(s), 256703 MB VRAM # example on a two-GPU system
[2/8] Starting OpenShell gateway
✓ Gateway is healthy
[3/8] Configuring inference (NIM)
✓ Using existing vLLM on localhost:8000
Detected model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
[4/8] Setting up inference provider
✓ Inference route set: vllm-local / nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
[5/8] Messaging channels
(example) Telegram disabled — skipped
# or: Telegram enabled; token stored in credential store
[6/8] Creating sandbox
✓ Sandbox 'my-assistant' created
[7/8] Setting up OpenClaw inside sandbox
✓ OpenClaw gateway launched inside sandbox
[8/8] Policy presets
Applied preset: pypi
Applied preset: npm
When complete you will see:
──────────────────────────────────────────────────
Sandbox my-assistant (Landlock + seccomp + netns)
Model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (Local vLLM)
──────────────────────────────────────────────────
Run: nemoclaw my-assistant connect
Status: nemoclaw my-assistant status
Logs: nemoclaw my-assistant logs --follow
OpenClaw UI (tokenized URL; treat it like a password)
http://127.0.0.1:18789/#token=<long-token-here>
──────────────────────────────────────────────────
IMPORTANT
Save the tokenized Web UI URL printed at the end -- you will need it in Step 8. It looks like:
http://127.0.0.1:18789/#token=<long-token-here>
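One way to keep the URL out of your shell history while keeping it handy is the same 600-permission file pattern used for the Telegram token (the file name here is an arbitrary choice):
install -m 600 /dev/null ~/.nemoclaw/webui-url.txt
nano ~/.nemoclaw/webui-url.txt   # paste the full http://127.0.0.1:18789/#token=... URL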
IMPORTANT
NEMOCLAW_EXPERIMENTAL=1 is required for the vLLM provider. Without it, the installer will report "Requested provider 'vllm' is not available in this environment."
IMPORTANT
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 extends the validation timeout from the default 15 seconds to 300 seconds. Without this, the endpoint validation will fail on a cold 120B model, even if you warmed it up in Step 3 -- the installer sends its own test prompt which may be slower.
NOTE
If nemoclaw is not found after install, run source ~/.bashrc to reload your shell path.
Connect to the sandbox:
nemoclaw my-assistant connect
Expected:
sandbox@my-assistant:~$
You are now inside the sandboxed environment. Verify that the inference route is working:
curl -sf https://inference.local/v1/models
Expected:
{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]}
Still inside the sandbox, send a test message through the OpenClaw gateway (the default path). The --local flag is intentionally blocked inside the NemoClaw OpenShell sandbox — it would bypass gateway controls — so the command you may see in generic OpenClaw quickstarts will fail here.
openclaw agent --agent main -m "hello" --session-id test
Expected (the agent will think, then respond -- first response may take 30--90 seconds): streaming or printed assistant text ending with a normal reply.
If you see a response from the agent, inference is working end-to-end.
Launch the terminal UI for an interactive chat session:
openclaw tui
Press Ctrl+C to exit the TUI.
Exit the sandbox to return to the host:
exit
If accessing the Web UI directly on the DGX Station (keyboard and monitor attached), open a browser and navigate to the tokenized URL from Step 4. Prefer 127.0.0.1 in the URL bar (not localhost) so it matches strict gateway origin checks:
http://127.0.0.1:18789/#token=<long-token-here>
If accessing the Web UI from a remote machine, you need to set up port forwarding.
First, find your DGX Station's IP address. On the Station, run:
hostname -I | awk '{print $1}'
Start the port forward on the DGX Station host:
openshell forward start 18789 my-assistant --background
Expected:
Forwarding 127.0.0.1:18789 -> my-assistant:18789 (background)
If the forward was already started during onboarding, you will see:
Error: Port 18789 is already forwarded to sandbox 'my-assistant'.
This is fine -- the forward is already running.
Then from your remote machine, create an SSH tunnel to the Station (replace <your-station-ip> with the IP address from above):
ssh -L 18789:127.0.0.1:18789 <your-user>@<your-station-ip>
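If you prefer not to keep an interactive shell open for the tunnel, standard OpenSSH can run the same forward in the background (-f backgrounds after authentication, -N skips running a remote command):
ssh -fN -L 18789:127.0.0.1:18789 <your-user>@<your-station-ip>
Kill that ssh process when you are done with the Web UI.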
Now open the tokenized URL in your remote machine's browser; the tunnel binds it to your local loopback:
http://127.0.0.1:18789/#token=<long-token-here>
IMPORTANT
Use 127.0.0.1, not localhost -- the gateway origin check requires an exact match.
Messaging (Telegram, Discord, Slack) is wired during onboarding — credentials are stored, OpenShell providers are created, and channel configuration is baked into the sandbox image. Runtime config under /sandbox/.openclaw/ is not safely patchable from inside the running sandbox.
nemoclaw start does not start the Telegram bridge. In current NemoClaw releases it starts optional host services such as the cloudflared tunnel when installed; Telegram delivery stays under OpenShell. See NemoClaw commands and Set up Telegram bridge.
Open Telegram, find @BotFather, send /newbot, and follow the prompts. Copy the bot token.
Tip: Use Telegram Desktop or web.telegram.org so you can copy-paste the token into your terminal or env file instead of typing 46+ characters from your phone into SSH.
Export the token on the host, then run the installer / onboard again (non-interactive variables from Step 4, plus TELEGRAM_BOT_TOKEN). The wizard’s Messaging channels step (installer phase [5/8]) is the right time to toggle Telegram interactively.
Re-onboarding after a sandbox exists is supported; NemoClaw can detect token changes and rebuild the sandbox — see the official Telegram bridge page.
On the host (run exit if you are inside nemoclaw … connect):
Add the telegram network preset:
nemoclaw my-assistant policy-add
When prompted, select telegram and confirm.
export TELEGRAM_BOT_TOKEN='<your-bot-token>'
nemoclaw my-assistant channels add telegram
Follow the prompts to rebuild when asked (or run nemoclaw my-assistant rebuild --yes afterward if non-interactive mode queued a rebuild — see NEMOCLAW_NON_INTERACTIVE=1 behavior in the commands reference).
To stop or start the telegram channel later, use the nemoclaw channels stop / nemoclaw channels start patterns described in Set up Telegram bridge (exact subcommand spelling may vary slightly by NemoClaw version; use nemoclaw --help if in doubt).
Check overall status:
nemoclaw status
Open Telegram, find your bot, and send it a message.
NOTE
The first response may take 30--90 seconds for a 120B parameter model running locally.
NOTE
To persist TELEGRAM_BOT_TOKEN for shell-based flows, use a chmod 600 env file and source it from ~/.bashrc as shown in Step 4.
NOTE
For chat allowlists and advanced Telegram behavior, see NemoClaw Telegram bridge documentation.
Stop any running auxiliary services (Telegram bridge, cloudflared tunnel):
nemoclaw stop
Expected:
[services] All services stopped.
Stop the port forward (always pass port and sandbox name):
openshell forward list
openshell forward stop 18789 my-assistant
Stop and remove the vLLM container so the name vllm-nemotron is free for a future run. The playbook created the container with --restart unless-stopped, so docker stop alone is not enough: Docker would restart it after reboot and the container would keep reserving GPU memory.
docker update --restart=no vllm-nemotron 2>/dev/null || true
docker stop vllm-nemotron
docker rm vllm-nemotron
To remove the container in one step even if it is running: docker rm -f vllm-nemotron.
Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and vLLM are preserved.
cd ~/.nemoclaw/source
./uninstall.sh
Uninstaller flags:
| Flag | Effect |
|---|---|
| --yes | Skip the confirmation prompt |
| --keep-openshell | Leave the openshell binary in place |
| --delete-models | Remove local inference models pulled by older NemoClaw flows (the upstream flag name still references Ollama). It does not remove the Hugging Face weights used by this playbook’s vLLM container — delete those separately (below). |
To also remove the vLLM container and cached model weights:
./uninstall.sh --yes
docker rm -f vllm-nemotron 2>/dev/null || true
rm -rf ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/
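If you also want the disk space back from the vLLM container image itself (the uninstaller preserves it), remove the image as well, assuming nothing else on the host uses it:
docker image rm nvcr.io/nvidia/vllm:26.03-py3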
The uninstaller runs 6 steps. Among other things it removes the nemoclaw npm package, deletes local models when --delete-models is passed, and removes the state directories (~/.nemoclaw, ~/.config/openshell, ~/.config/nemoclaw) and the OpenShell binary.
NOTE
The source clone at ~/.nemoclaw/source is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller.
| Command | Description |
|---|---|
| nemoclaw my-assistant connect | Shell into the sandbox |
| nemoclaw my-assistant status | Show sandbox status and inference config |
| nemoclaw my-assistant logs --follow | Stream sandbox logs in real time |
| nemoclaw list | List all registered sandboxes |
| nemoclaw tunnel start | Start optional host services such as cloudflared (public dashboard URL when installed); does not start Telegram |
| nemoclaw start | Deprecated alias for tunnel/aux host services — not for Telegram |
| nemoclaw stop | Stop host auxiliary services started by nemoclaw tunnel start / nemoclaw start |
| nemoclaw <sandbox> channels add telegram | Store Telegram token and rebuild sandbox (host) |
| openshell term | Open the monitoring TUI on the host |
| openshell forward list | List active port forwards |
| openshell forward start 18789 my-assistant --background | Start port forwarding for Web UI |
| openshell forward stop 18789 my-assistant | Stop Web UI port forward |
| docker logs -f vllm-nemotron | Stream vLLM inference server logs |
| docker restart vllm-nemotron | Restart the vLLM inference server |
| curl http://localhost:8000/v1/models | Check vLLM API status |
| cd ~/.nemoclaw/source && ./uninstall.sh | Remove NemoClaw (preserves Docker, Node.js, vLLM image) |