| Symptom | Cause | Fix |
|---|---|---|
| `openshell gateway start` fails with "connection refused" or Docker errors | Docker is not running | Start Docker with `sudo systemctl start docker` or launch Docker Desktop, then retry `openshell gateway start` |
| `openshell status` shows the gateway as unhealthy | Gateway container crashed or failed to initialize | Run `openshell gateway destroy` and then `openshell gateway start` to recreate it. Check Docker logs with `docker ps -a` and `docker logs <container-id>` for details |
| `openshell sandbox create --from openclaw` fails to build | Network issue pulling the community sandbox, or a Dockerfile build failure | Check internet connectivity and retry the command. If the build fails on a specific package, check whether the base image is compatible with your Docker version |
| Sandbox is in `Error` phase after creation | Policy validation failed or container startup crashed | Run `openshell logs <sandbox-name>` to see error details. Common causes: invalid policy YAML, missing provider credentials, or port conflicts |
| Agent cannot reach `inference.local` inside the sandbox | Inference routing not configured or provider unreachable | Run `openshell inference get` to verify that the provider and model are set. Test that Ollama is reachable from the host: `curl http://localhost:11434/api/tags`. Ensure the provider URL uses `host.docker.internal` instead of `localhost` |
| 503 "verification failed" or timeout when the gateway/sandbox accesses Ollama on the host | Ollama bound only to localhost, or host firewall blocking port 11434 | Make Ollama listen on all interfaces so the gateway container (e.g. on Docker network 172.17.x.x) can reach it: `OLLAMA_HOST=0.0.0.0 ollama serve &`. Allow port 11434 through the host firewall: `sudo ufw allow 11434/tcp comment 'Ollama for OpenShell Gateway'` (then `sudo ufw reload` if needed) |
| Agent's outbound connections are all denied | Default policy does not include the required endpoints | Monitor denials with `openshell logs <sandbox-name> --tail --source sandbox`. Pull the current policy with `openshell policy get <sandbox-name> --full`, add the needed host/port under `network_policies`, and push with `openshell policy set <sandbox-name> --policy <file> --wait` |
| "Permission denied" or Landlock errors inside the sandbox | Agent trying to access a path not in the `read_only` or `read_write` filesystem policy | Pull the current policy and add the path to `read_write` (or `read_only` if read access is sufficient), then push the updated policy. Note: filesystem policy is static and requires sandbox recreation |
| Ollama OOM or very slow inference | Model too large for available memory, or GPU contention | Free GPU memory (close other GPU workloads), try a smaller model (e.g. `gpt-oss:20b`), or reduce the context length. Monitor with `nvidia-smi` |
| `openshell sandbox connect` hangs or times out | Sandbox not in `Ready` phase | Run `openshell sandbox get <sandbox-name>` to check the phase. If stuck in `Provisioning`, wait or check the logs. If in `Error`, delete and recreate the sandbox |
| Policy push returns exit code 1 (validation failed) | Malformed YAML or invalid policy fields | Check the YAML syntax. Common issues: paths not starting with `/`, `..` traversal in paths, `root` as `run_as_user`, or endpoints missing the required `host`/`port` fields. Fix and re-push |
| `openshell gateway start` fails with "K8s namespace not ready" / timed out waiting for namespace | The k3s cluster inside the Docker container takes longer to bootstrap than the CLI timeout allows. Internal components (TLS secrets, Helm chart, namespace creation) may need extra time, especially on first run when images are pulled inside the container | First, check whether the container is still running and progressing: `docker ps --filter name=openshell` (look for `health: starting`). Inspect the k3s state inside the container: `docker exec <container> sh -c "KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get ns"` and `kubectl get pods -A`. If pods are in `ContainerCreating` and the TLS secrets (`navigator-server-tls`, `openshell-server-tls`) are missing, the cluster is still bootstrapping: wait a few minutes and run `openshell status` again. If it does not recover, destroy with `openshell gateway destroy` (and `docker rm -f <container>` if needed) and retry `openshell gateway start`. Ensure Docker has enough memory and disk for the k3s cluster |
| `openshell status` says "No gateway configured" even though the Docker container is running | The `gateway start` command failed or timed out before it could save the gateway configuration to the local config store | The container may still be healthy: check with `docker ps --filter name=openshell`. If the container is running and healthy, run `openshell gateway start` again (it should detect the existing container). If it is unhealthy or stuck, remove it with `docker rm -f <container>`, then run `openshell gateway destroy` followed by `openshell gateway start` |
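Several of the fixes above involve pulling, editing, and re-pushing the sandbox policy. As a sketch only (the authoritative schema is whatever `openshell policy get <sandbox-name> --full` returns; the structure below is inferred from the field names and validation errors mentioned in this table, and the host and paths are placeholders), an edited policy fragment might look like:

```yaml
# Hypothetical policy fragment -- structure inferred from the errors above;
# verify against the output of `openshell policy get <sandbox-name> --full`.
network_policies:
  endpoints:
    - host: api.example.com   # host and port are the required endpoint fields
      port: 443
filesystem:
  read_only:
    - /usr/share/data         # absolute paths only; no ".." traversal
  read_write:
    - /workspace
run_as_user: agent            # must not be root
```

Push it with `openshell policy set <sandbox-name> --policy <file> --wait`. Network policy changes take effect on push, but as noted above, the filesystem policy is static and only takes effect after the sandbox is recreated.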
NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications have not yet been updated to take full advantage of UMA, you may encounter memory issues even while within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

```shell
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
For the latest known issues, please review the DGX Spark User Guide.
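One follow-up to the Ollama binding fix in the table above: `OLLAMA_HOST=0.0.0.0 ollama serve &` only affects that one foreground process. If Ollama was installed as a systemd service (the default for the Linux installer), the persistent equivalent is a drop-in override, as described in the Ollama FAQ:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# (create with `sudo systemctl edit ollama`)
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```

After saving, run `sudo systemctl daemon-reload && sudo systemctl restart ollama`, then confirm the binding from a container with `curl http://host.docker.internal:11434/api/tags`.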