| Symptom | Cause | Fix |
|---|---|---|
| `nemoclaw: command not found` after install | Shell `PATH` not updated | Run `source ~/.bashrc` (or `source ~/.zshrc` for zsh), or open a new terminal window. |
| Installer fails with Node.js version error | Node.js version below 22.16 | Install Node.js 22.16+: `curl -fsSL https://deb.nodesource.com/setup_22.x \| sudo -E bash - && sudo apt-get install -y nodejs`, then re-run the installer. |
| `npm install` fails with `EACCES` permission error | npm global directory not writable | Run `mkdir -p ~/.npm-global && npm config set prefix ~/.npm-global && export PATH=~/.npm-global/bin:$PATH`, then re-run the installer. Add the `export` line to `~/.bashrc` to make it permanent. |
| Docker permission denied | User not in `docker` group | Run `sudo usermod -aG docker $USER`, then log out and back in. |
| Gateway fails with cgroup / "Failed to start ContainerManager" errors | Older OpenShell or Docker still using a private cgroup namespace for the gateway, so kubelet cannot see cgroup v2 controllers | First upgrade OpenShell (re-run the Phase 1 `nemoclaw.sh` install so you get a build that sets host cgroupns on the gateway container). If it still fails, force Docker's default to host mode by running the daemon.json cgroup fix below, then run `sudo systemctl restart docker`. |
| Gateway fails with "port 8080 is held by container..." | Another OpenShell gateway or container is using port 8080 | Stop the conflicting container: `openshell gateway destroy -g <old-gateway-name>` or `docker stop <container-name> && docker rm <container-name>`, then retry `nemoclaw onboard`. |
| Sandbox creation fails | Stale gateway state or DNS not propagated | Run `openshell gateway destroy && openshell gateway start`, then re-run the installer or `nemoclaw onboard`. |
| CoreDNS crash loop | Known issue on some DGX Spark configurations | Re-run the NemoClaw installer (`curl -fsSL https://www.nvidia.com/nemoclaw.sh \| bash`), which includes the CoreDNS fix. If the issue persists, see NemoClaw troubleshooting. |
| "No GPU detected" during onboard | DGX Spark GB10 reports unified memory differently | Expected on DGX Spark. The wizard still works and uses Ollama for inference. |
| Inference timeout or hangs | Ollama not running or not reachable | Check Ollama: `curl http://127.0.0.1:11434`. If not running: `sudo systemctl restart ollama`. Verify the NemoClaw auth proxy is healthy: `curl http://127.0.0.1:11435/api/tags`. If both respond, check `nemoclaw my-assistant status` for the Inference health line. |
| Agent gives no response or is very slow | First response can be slow, especially with larger models | Response time depends on model size (30B: a few seconds, 120B: 30–90 seconds). Verify the inference route: `nemoclaw my-assistant status`. |
| Port 18789 already in use | Another process is bound to the port | Run `lsof -i :18789`, then `kill <PID>`. If needed, use `kill -9 <PID>` to force-terminate. |
| Web UI port forward dies or dashboard unreachable | Port forward not active | Run `openshell forward stop 18789 my-assistant`, then `openshell forward start 18789 my-assistant --background`. |
| Web UI shows "origin not allowed" | Accessing via `localhost` instead of `127.0.0.1` | Use `http://127.0.0.1:18789/#token=...` in the browser. The gateway origin check requires `127.0.0.1` exactly. |
| Telegram bridge does not start | Telegram channel not registered with sandbox | Run `nemoclaw <sandbox-name> channels add telegram` to register the bot token and rebuild the sandbox. Verify with `nemoclaw <sandbox-name> status`. |
| Telegram stops responding after sandbox rebuild | Telegram long-polling session stale after rebuild | Run `nemoclaw <sandbox-name> recover` to restart the gateway. If still unresponsive, run `nemoclaw <sandbox-name> channels add telegram` to re-register and rebuild. |
| Telegram bot receives messages but does not reply | Telegram network egress policy not added | Run `nemoclaw <sandbox-name> policy-add`, select `telegram`, and confirm. This is a hot reload; no rebuild is needed. |
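
Several rows above come down to whether something is listening on a given local port (8080, 11434, 11435, 18789). The following is a minimal sketch, not part of NemoClaw or OpenShell, that probes those ports with a plain TCP connect. The helper names `is_listening` and `report` and the labels in `PORTS` are hypothetical and chosen for illustration. A successful connect only shows that a listener exists; it does not prove the service behind it is healthy, so the `curl` and `status` checks in the table still apply.

```python
import socket

# Ports taken from the troubleshooting table; the labels are descriptive only.
PORTS = {
    "OpenShell gateway": 8080,
    "Ollama": 11434,
    "NemoClaw auth proxy": 11435,
    "Web UI port forward": 18789,
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def report(host: str = "127.0.0.1") -> dict:
    """Map each labelled port to whether something accepts connections on it."""
    return {name: is_listening(host, port) for name, port in PORTS.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name} (port {PORTS[name]}): {'listening' if up else 'not reachable'}")
```

If a port that should be free reports as listening (for example 18789 before starting the port forward), the `lsof` step in the table identifies the owning process.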
daemon.json cgroup fix
Use this script as the fallback for the cgroup / "Failed to start ContainerManager" row above. It validates any existing `/etc/docker/daemon.json`, writes a `.bak` backup, sets `default-cgroupns-mode` to `host`, and atomically replaces the file. It exits non-zero with an error on stderr if anything fails, leaving the original `daemon.json` untouched.
```bash
sudo python3 - <<'PY'
import json, os, shutil, sys, tempfile

path = '/etc/docker/daemon.json'

# Load the existing config, or start from an empty object if the file is absent.
try:
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)
        if not isinstance(data, dict):
            raise ValueError(f'{path} is not a JSON object')
    else:
        data = {}
except (json.JSONDecodeError, ValueError, OSError) as e:
    print(f'error: failed to read {path}: {e}', file=sys.stderr)
    sys.exit(1)

# Back up the current file before changing anything.
if os.path.exists(path):
    try:
        shutil.copy2(path, path + '.bak')
    except OSError as e:
        print(f'error: failed to back up {path}: {e}', file=sys.stderr)
        sys.exit(1)

data['default-cgroupns-mode'] = 'host'

# Write to a temp file in the same directory, then atomically swap it in.
target_dir = os.path.dirname(path) or '/'
fd, tmp = tempfile.mkstemp(prefix='daemon.json.', dir=target_dir)
try:
    with os.fdopen(fd, 'w') as f:
        json.dump(data, f, indent=2)
        f.write('\n')
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)
except OSError as e:
    # Clean up the temp file; the original daemon.json has not been modified.
    if os.path.exists(tmp):
        try:
            os.unlink(tmp)
        except OSError:
            pass
    print(f'error: failed to write {path}: {e}', file=sys.stderr)
    sys.exit(1)
PY
```
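
After running the fix, it can help to confirm that the setting actually landed before restarting Docker. The following is a read-only sketch under stated assumptions: `cgroupns_mode` and `read_cgroupns_mode` are hypothetical helper names, and the only key inspected is `default-cgroupns-mode`, the same key the script above sets.

```python
import json
from typing import Optional


def cgroupns_mode(config_text: str) -> Optional[str]:
    """Return the default-cgroupns-mode value from daemon.json text, or None if unset."""
    data = json.loads(config_text)
    if not isinstance(data, dict):
        raise ValueError("daemon.json is not a JSON object")
    return data.get("default-cgroupns-mode")


def read_cgroupns_mode(path: str = "/etc/docker/daemon.json") -> Optional[str]:
    """Read the file at path and return its default-cgroupns-mode, or None if unset."""
    with open(path) as f:
        return cgroupns_mode(f.read())
```

Calling `read_cgroupns_mode()` should return `"host"` once the fix has been applied; the new default is picked up after `sudo systemctl restart docker`.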
NOTE
DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications have not yet been updated to take advantage of UMA, you may encounter memory issues even when usage is within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
For the latest known issues, please review the DGX Spark User Guide.
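
To see whether flushing the buffer cache made a difference, one option is to compare `/proc/meminfo` before and after the command. The sketch below is illustrative only: `parse_meminfo`, `cache_kb`, and `read_meminfo` are hypothetical helper names, and treating `Buffers` plus `Cached` as the figure of interest is an assumption made here, not guidance from the DGX Spark documentation.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most values are reported in kB; a few counters (such as
    HugePages_Total) carry no unit.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def cache_kb(info: dict) -> int:
    """Sum the Buffers and Cached fields (kB); missing fields count as 0."""
    return info.get("Buffers", 0) + info.get("Cached", 0)


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling `cache_kb(read_meminfo())` once before and once after the flush gives a rough view of how much cache was released; `MemAvailable` in the same dictionary is another field worth comparing.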