Run OpenClaw with local models in an NVIDIA OpenShell sandbox on DGX Station
| Symptom | Cause | Fix |
|---|---|---|
| openshell gateway start fails with "connection refused" or Docker errors | Docker is not running | Start Docker with sudo systemctl start docker or launch Docker Desktop, then retry openshell gateway start |
| openshell status shows gateway as unhealthy | Gateway container crashed or failed to initialize | Run openshell gateway destroy and then openshell gateway start to recreate it. Check Docker logs with docker ps -a and docker logs <container-id> for details |
| openshell sandbox create --from openclaw fails to build | Network issue pulling the community sandbox or Dockerfile build failure | Check internet connectivity and retry the command. If the build fails on a specific package, check whether the base image is compatible with your Docker version |
| Sandbox is in Error phase after creation | Policy validation failed or container startup crashed | Run openshell logs <sandbox-name> to see error details. Common causes: invalid policy YAML, missing provider credentials, or port conflicts |
Agent cannot reach inference.local inside the sandbox | Inference routing not configured or provider unreachable | Run openshell inference get to verify the provider and model are set. From the host, test vLLM: curl -s http://localhost:8000/v1/models. The provider base URL must use the host’s real IP (not 127.0.0.1/localhost) so the gateway container can reach vLLM (see instructions.md Step 6). |
| 503 verification failed or timeout when the gateway validates vLLM | vLLM not listening on all interfaces, firewall blocking port 8000, model still loading, or first-request CUDA graph compile | Ensure the vLLM server was started with --host 0.0.0.0 and port 8000 mapped (see Step 5). Warm up with a chat completion request before openshell inference set. Allow port 8000 if you use a host firewall: sudo ufw allow 8000/tcp comment 'vLLM for OpenShell Gateway' (then sudo ufw reload if needed). For very large models, try openshell inference set ... --no-verify after confirming vLLM works from the host. |
| Agent's outbound connections are all denied | Default policy does not include the required endpoints | Monitor denials with openshell logs <sandbox-name> --tail --source sandbox. Pull the current policy with openshell policy get <sandbox-name> --full, add the needed host/port under network_policies, and push with openshell policy set <sandbox-name> --policy <file> --wait |
| "Permission denied" or Landlock errors inside the sandbox | Agent trying to access a path not in read_only or read_write filesystem policy | Pull the current policy and add the path to read_write (or read_only if read access is sufficient). Push the updated policy. Note: filesystem policy is static and requires sandbox recreation |
| vLLM OOM or very slow inference | Model too large for available VRAM, --max-model-len too high, or GPU contention | Free GPU memory (close other GPU workloads), use a smaller Hugging Face model or quantized variant, or lower --max-model-len. Check docker logs for the vLLM container. Monitor with nvidia-smi |
| openshell sandbox connect hangs or times out | Sandbox not in Ready phase | Run openshell sandbox get <sandbox-name> to check the phase. If stuck in Provisioning, wait or check logs. If in Error, delete and recreate the sandbox |
| Policy push returns exit code 1 (validation failed) | Malformed YAML or invalid policy fields | Check the YAML syntax. Common issues: paths not starting with /, .. traversal in paths, root as run_as_user, or endpoints missing required host/port fields. Fix and re-push |
| openshell gateway start fails with "K8s namespace not ready" or times out waiting for the namespace | The k3s cluster inside the Docker container takes longer to bootstrap than the CLI timeout allows. Internal components (TLS secrets, Helm chart, namespace creation) may need extra time, especially on first run when images are pulled inside the container | First, check whether the container is still running and progressing: docker ps --filter name=openshell (look for health: starting). Inspect k3s state inside the container: docker exec <container> sh -c "KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get ns" and kubectl get pods -A. If pods are in ContainerCreating and the TLS secrets (navigator-server-tls, openshell-server-tls) are missing, the cluster is still bootstrapping; wait a few minutes and run openshell status again. If it does not recover, destroy with openshell gateway destroy (and docker rm -f <container> if needed) and retry openshell gateway start. Ensure Docker has enough memory and disk for the k3s cluster |
| openshell status says "No gateway configured" even though the Docker container is running | The gateway start command failed or timed out before it could save the gateway configuration to the local config store | The container may still be healthy: check with docker ps --filter name=openshell. If the container is running and healthy, run openshell gateway start again (it should detect the existing container). If the container is unhealthy or stuck, remove it with docker rm -f <container>, then run openshell gateway destroy followed by openshell gateway start |
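For the policy-related rows (denied connections, Landlock errors, validation failures), it can help to see the shape of a complete policy. The fragment below is a hypothetical sketch only: the field names (network_policies, read_only, read_write, run_as_user, host/port) are taken from the rows above, but the authoritative schema is whatever openshell policy get <sandbox-name> --full returns, so always start from a pulled policy and edit that.

```yaml
# Hypothetical policy sketch -- field names taken from the troubleshooting
# rows above; pull the real schema with: openshell policy get <name> --full
network_policies:
  endpoints:
    - host: api.example.com   # both host and port are required
      port: 443
filesystem:
  read_only:
    - /workspace/reference    # paths must be absolute, no ".." segments
  read_write:
    - /workspace/output       # filesystem changes require sandbox recreation
run_as_user: agent            # root is rejected by policy validation
```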
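Several rows above come down to one question: can the gateway container reach vLLM on the host's real IP rather than 127.0.0.1? The probe below is a minimal sketch, assuming vLLM serves on port 8000 as in Step 5; hostname -I is used here to guess the LAN IP, so substitute your DGX Station's address if it picks the wrong interface.

```shell
#!/bin/sh
# Guess the host's LAN IP; the gateway container cannot reach vLLM via
# 127.0.0.1, because inside the container that resolves to the container itself.
HOST_IP=$(hostname -I | awk '{print $1}')
URL="http://${HOST_IP}:8000/v1/models"
echo "Probing vLLM at ${URL}"

# --max-time keeps the probe from hanging if port 8000 is filtered.
if curl -s --max-time 5 "$URL" >/dev/null; then
    echo "vLLM reachable at ${URL}"
else
    echo "vLLM NOT reachable: check --host 0.0.0.0, the port mapping, and any firewall"
fi
```

If the probe succeeds from the host but the agent still cannot reach inference.local, re-check the provider base URL with openshell inference get.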