Run OpenClaw with local models in an NVIDIA OpenShell sandbox on DGX Station
| Symptom | Cause | Fix |
|---|---|---|
| openshell gateway start fails with "connection refused" or Docker errors | Docker is not running | Start Docker with sudo systemctl start docker or launch Docker Desktop, then retry openshell gateway start |
| openshell status shows gateway as unhealthy | Gateway container crashed or failed to initialize | Run openshell gateway destroy and then openshell gateway start to recreate it. Check Docker logs with docker ps -a and docker logs <container-id> for details |
| openshell sandbox create --from openclaw fails to build | Network issue pulling the community sandbox or Dockerfile build failure | Check internet connectivity and retry the command. If the build fails on a specific package, check whether the base image is compatible with your Docker version |
| Sandbox is in Error phase after creation | Policy validation failed or container startup crashed | Run openshell logs <sandbox-name> to see error details. Common causes: invalid policy YAML, missing provider credentials, or port conflicts |
Agent cannot reach inference.local inside the sandbox | Inference routing not configured or provider unreachable | Run openshell inference get to verify the provider and model are set. From the host, test vLLM: curl -s http://localhost:8000/v1/models. The provider base URL must use the host’s real IP (not 127.0.0.1/localhost) so the gateway container can reach vLLM (see instructions.md Step 6). |
| 503 verification failed or timeout when the gateway validates vLLM | vLLM not listening on all interfaces, firewall blocking port 8000, model still loading, or first-request CUDA graph compile | Ensure the vLLM server was started with --host 0.0.0.0 and port 8000 mapped (see Step 5). Warm up with a chat completion request before openshell inference set. Allow port 8000 if you use a host firewall: sudo ufw allow 8000/tcp comment 'vLLM for OpenShell Gateway' (then sudo ufw reload if needed). For very large models, try openshell inference set ... --no-verify after confirming vLLM works from the host. |
| Agent's outbound connections are all denied | Default policy does not include the required endpoints | Monitor denials with openshell logs <sandbox-name> --tail --source sandbox. Pull the current policy with openshell policy get <sandbox-name> --full, add the needed host/port under network_policies, and push with openshell policy set <sandbox-name> --policy <file> --wait |
| "Permission denied" or Landlock errors inside the sandbox | Agent trying to access a path not in read_only or read_write filesystem policy | Pull the current policy and add the path to read_write (or read_only if read access is sufficient). Push the updated policy. Note: filesystem policy is static and requires sandbox recreation |
| vLLM OOM or very slow inference | Model too large for available VRAM, --max-model-len too high, or GPU contention | Free GPU memory (close other GPU workloads), use a smaller Hugging Face model or quantized variant, or lower --max-model-len. Check docker logs for the vLLM container. Monitor with nvidia-smi |
| openshell sandbox connect hangs or times out | Sandbox not in Ready phase | Run openshell sandbox get <sandbox-name> to check the phase. If stuck in Provisioning, wait or check logs. If in Error, delete and recreate the sandbox |
| Policy push returns exit code 1 (validation failed) | Malformed YAML or invalid policy fields | Check the YAML syntax. Common issues: paths not starting with /, .. traversal in paths, root as run_as_user, or endpoints missing required host/port fields. Fix and re-push |
| openshell gateway start fails with "K8s namespace not ready" or times out waiting for the namespace | The k3s cluster inside the Docker container takes longer to bootstrap than the CLI timeout allows. Internal components (TLS secrets, Helm chart, namespace creation) may need extra time, especially on first run when images are pulled inside the container | First, check whether the container is still running and progressing: docker ps --filter name=openshell (look for health: starting). Inspect k3s state inside the container: docker exec <container> sh -c "KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get ns" and kubectl get pods -A. If pods are in ContainerCreating and the TLS secrets (navigator-server-tls, openshell-server-tls) are missing, the cluster is still bootstrapping; wait a few minutes and run openshell status again. If it does not recover, destroy with openshell gateway destroy (and docker rm -f <container> if needed) and retry openshell gateway start. Ensure Docker has enough memory and disk for the k3s cluster |
| openshell status says "No gateway configured" even though the Docker container is running | The gateway start command failed or timed out before it could save the gateway configuration to the local config store | The container may still be healthy: check with docker ps --filter name=openshell. If the container is running and healthy, run openshell gateway start again (it should detect the existing container). If the container is unhealthy or stuck, remove it with docker rm -f <container>, then run openshell gateway destroy followed by openshell gateway start |
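For the policy-related rows (denied connections, Landlock errors, validation failures), it can help to see the shape of a complete policy. The fragment below is a hypothetical sketch only: the field names (network_policies, read_only, read_write, run_as_user, host/port) are taken from the rows above, but the authoritative schema is whatever openshell policy get <sandbox-name> --full returns, so always start from a pulled policy and edit that.

```yaml
# Hypothetical policy sketch -- field names taken from the troubleshooting
# rows above; pull the real schema with: openshell policy get <name> --full
network_policies:
  endpoints:
    - host: api.example.com   # both host and port are required
      port: 443
filesystem:
  read_only:
    - /workspace/reference    # paths must be absolute, no ".." segments
  read_write:
    - /workspace/output       # filesystem changes require sandbox recreation
run_as_user: agent            # root is rejected by policy validation
```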
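Several rows above come down to one question: can the gateway container reach vLLM on the host's real IP rather than 127.0.0.1? The probe below is a minimal sketch, assuming vLLM serves on port 8000 as in Step 5; hostname -I is used here to guess the LAN IP, so substitute your DGX Station's address if it picks the wrong interface.

```shell
#!/bin/sh
# Guess the host's LAN IP; the gateway container cannot reach vLLM via
# 127.0.0.1, because inside the container that resolves to the container itself.
HOST_IP=$(hostname -I | awk '{print $1}')
URL="http://${HOST_IP}:8000/v1/models"
echo "Probing vLLM at ${URL}"

# --max-time keeps the probe from hanging if port 8000 is filtered.
if curl -s --max-time 5 "$URL" >/dev/null; then
    echo "vLLM reachable at ${URL}"
else
    echo "vLLM NOT reachable: check --host 0.0.0.0, the port mapping, and any firewall"
fi
```

If the probe succeeds from the host but the agent still cannot reach inference.local, re-check the provider base URL with openshell inference get.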