Run healthcare AI agents that analyze patient data and predict protein structures in an OpenShell sandbox on DGX Station

| Symptom | Cause | Fix |
|---|---|---|
| `make up` hangs on model pull | Nemotron-3-Super is ~86 GB and takes 15–25 min on first download (longer on slow links) | Wait. Check progress with `docker compose logs -f ollama`. If interrupted, re-run; the download resumes where it left off. |
| `OpenFold3: ✗ down` in `make status` | OpenFold3 takes ~3 minutes to load model weights on startup | Wait and re-run `make status`. Check logs with `docker compose logs -f openfold3`. |
| `failed to bind host port for 0.0.0.0:11434` on `docker compose up ollama` | Host Ollama is already listening on 11434 (common after the NemoClaw playbook) | Stop host Ollama: `sudo systemctl stop ollama && sudo systemctl disable ollama`. Or override in `.env`: `OLLAMA_PORT=11435`. `make setup` and `setup_sandbox.sh` source `.env` and configure the sandbox provider against the new port. |
| `unauthorized: <html><head><title>401 Authorization Required` when pulling `nvcr.io/nim/openfold/openfold3` | Docker is not authenticated against NGC; `NGC_API_KEY` in `.env` is the runtime credential, not the pull credential | Run `make ngc-login` (reads `NGC_API_KEY` from `.env`). Manual equivalent: `echo "$NGC_API_KEY" \| docker login nvcr.io -u '$oauthtoken' --password-stdin`. |
| OpenFold3 crashes with `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` | OpenFold3's PyTorch backend rejects multi-GPU containers; `count: all` exposes both GPUs on a dual-GPU station | `docker-compose.yml` pins to `LLM_GPU`/`OPENFOLD_GPU` (default `0`). On dual-GPU stations, set both to the GB300 index in `.env` and run `docker compose up -d --force-recreate openfold3`. |
| `NGC_API_KEY not set` error | `.env` file missing or NGC key not configured | Run `cp .env.example .env` and edit it to add your NGC API key from ngc.nvidia.com. |
| `exec format error` when pulling containers | Container architecture mismatch (x86 container on ARM64) | Ensure you're using ARM64-compatible containers. OpenFold3 (v1.3.0+) and Ollama support ARM64. Check with `docker inspect --format '{{.Architecture}}' <image>`. |
| Sandbox policy validation fails on startup | `landlock: hard_requirement` aborts if filesystem paths can't be enforced | Check that all paths in `sandbox-policy.yaml` exist on the system. If running on a non-standard DGX OS, try `compatibility: best_effort` temporarily to diagnose. |
| `node: command not found` or OpenShell rejects v18 | DGX Station ships with Node.js v18.19.1; OpenShell/OpenClaw need v22+ | Install Node.js 22: `curl -fsSL https://deb.nodesource.com/setup_22.x \| sudo -E bash - && sudo apt-get install -y nodejs`. `make prereq` validates the version automatically. |
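
The port conflict on 11434 described above can be confirmed before starting the stack. The following is a minimal sketch, assuming Python 3 is available on the host; `port_in_use` is a hypothetical helper written for illustration and is not part of the playbook.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    port = 11434  # default Ollama port named in the table above
    if port_in_use(port):
        print(f"Port {port} is taken: stop host Ollama or set OLLAMA_PORT in .env")
    else:
        print(f"Port {port} is free")
```

A `True` result only shows that something is listening on the port; it does not identify the process, so `sudo systemctl status ollama` is still the way to confirm that host Ollama is the owner.
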

| Symptom | Cause | Fix |
|---|---|---|
| Gateway fails with "ContainerManager" error | DGX Station uses cgroup v2 and needs the systemd driver flag | Start gateway with: `OPENSHELL_K3S_ARGS='--kubelet-arg=cgroup-driver=systemd' openshell gateway start` |
| `openshell status` returns "Connection reset by peer" right after gateway start | k3s inside the gateway container takes 10–15s to accept connections | Wait. Use the polling loop from `instructions.md` Step 4: `for i in $(seq 1 30); do openshell status 2>/dev/null \| grep -q Connected && break; sleep 2; done`. |
| `openshell status` shows "Not Connected" after 30s | Gateway not started or crashed | Run `openshell gateway start` (with the cgroup flag above). Check `docker ps` for the gateway container. |
| `openshell sandbox create` fails with "port already forwarded" or hangs on `--forward 18789` | A stale port forward from a previously deleted sandbox is still registered | List forwards: `openshell forward list`. Stop each one bound to `:18789`: `openshell forward stop 18789 <sandbox-name>`. `setup_sandbox.sh` does this automatically before re-creating the sandbox. |
| Existing OpenShell gateway from another playbook silently reused with new name | `openshell gateway start` resumes any existing gateway in stopped state | Acceptable, but to start clean, run `openshell gateway destroy` before `openshell gateway start`. |
| Port 18789 not accessible remotely | SSH tunnel not active or port forward dead inside sandbox | Check with `openshell forward list`. If dead: `openshell forward stop 18789 clinical-sandbox && openshell forward start -d 18789 clinical-sandbox`. Then re-establish the SSH tunnel from your machine. |
| `requests` library doesn't work in sandbox | Sandbox Python uses a `curl` subprocess for HTTP, not the `requests` library | This is by design. All HTTP calls in agent scripts must use `subprocess.run(["curl", ...])` and `json.loads()`. The `fhir_helpers.py` library handles this automatically. |
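
The `subprocess.run(["curl", ...])` plus `json.loads()` pattern from the last row could look roughly like this. It is a sketch under assumptions, not the code in `fhir_helpers.py`: the function name `curl_json`, its `timeout` default, and the injectable `runner` parameter are all hypothetical and exist only to keep the example testable.

```python
import json
import subprocess


def curl_json(url, timeout=30, runner=subprocess.run):
    """GET a URL through the curl binary and return the decoded JSON body.

    Hypothetical helper: `runner` defaults to subprocess.run and can be
    swapped for a stub in tests.
    """
    result = runner(
        # -s: no progress output, -f: non-zero exit on HTTP errors,
        # --max-time: overall timeout in seconds
        ["curl", "-sf", "--max-time", str(timeout), url],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"curl exited with {result.returncode} for {url}")
    return json.loads(result.stdout)
```

Raising on a non-zero exit code keeps a failed connection (for example curl exit code 7) from surfacing later as a confusing JSON decode error.
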

| Symptom | Cause | Fix |
|---|---|---|
| Agent returns empty response or timeout | Model unloaded from GPU memory after idle timeout | Send a warmup message first. Check that `OLLAMA_KEEP_ALIVE` is set to `4h` in `docker-compose.yml`. |
| `curl: (7) Failed to connect to inference.local` | OpenShell inference provider not configured or Ollama not running | Verify Ollama: `curl -sf http://localhost:${OLLAMA_PORT:-11434}/`. Re-run `make setup`; it configures the inference provider automatically. |
| Sandbox cannot reach host Ollama (only Docker bridge IP times out) | Host Ollama's systemd unit binds to 127.0.0.1 by default | Add a systemd override binding to all interfaces: run `sudo systemctl edit ollama` and insert `[Service]` followed by `Environment="OLLAMA_HOST=0.0.0.0"`, then run `sudo systemctl daemon-reload && sudo systemctl restart ollama`. Docker Ollama (the default in this playbook) already binds to 0.0.0.0. |
| OpenFold3 returns error for molecular visualization | Protein sequence too long or malformed input | OpenFold3 supports sequences up to 4096 amino acids (PyTorch backend) or 2048 (TensorRT). Check the protein sequence in the drug-target table of `build_viewer.py`. |
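
The length limits in the last row can be checked before a sequence is sent to OpenFold3. This is a minimal sketch: the 4096 and 2048 limits come from the table above, while the `check_sequence` name and the restriction to the 20 standard one-letter amino acid codes are assumptions made for illustration. OpenFold3 may accept additional symbols.

```python
# Limits taken from the troubleshooting table above.
MAX_LENGTH = {"pytorch": 4096, "tensorrt": 2048}
# Assumption: only the 20 standard one-letter amino acid codes are valid.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def check_sequence(sequence, backend="pytorch"):
    """Return a list of problems found in a protein sequence (empty if none)."""
    problems = []
    # Remove whitespace and newlines, then normalize case.
    seq = "".join(sequence.split()).upper()
    if not seq:
        problems.append("sequence is empty")
    limit = MAX_LENGTH[backend]
    if len(seq) > limit:
        problems.append(f"length {len(seq)} exceeds {limit} for {backend}")
    unknown = sorted(set(seq) - STANDARD_RESIDUES)
    if unknown:
        problems.append("non-standard characters: " + "".join(unknown))
    return problems
```

An empty list means the sequence passes both checks; anything else names the first thing to fix in the input.
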

| Symptom | Cause | Fix |
|---|---|---|
| `make setup` fails | Setup did not complete successfully | Re-run `make setup`; the script recreates the sandbox from scratch with fresh config. Ensure you're on OpenShell >= 0.0.33. |
| `make check` shows stale skills | Workspace skill copies don't match the repo after an update | The check output tells you which skills are stale. Re-run `make setup` or manually copy from `/sandbox/clinical-intelligence/skills/` to `~/.openclaw/workspace/skills/` inside the sandbox. |
| ENOENT errors for memory files in logs | OpenClaw tries to read daily memory files that don't exist | Create the memory directory: `mkdir -p ~/.openclaw/workspace/memory && touch ~/.openclaw/workspace/MEMORY.md` inside the sandbox. `make check` detects this. |
| Agent writes code from scratch instead of using helpers | Stale `IDENTITY.md` or `analysis-methods` skill in workspace | Run `make check` to verify. A stale workspace `IDENTITY.md` lacks the `fhir_helpers` import instruction, so the agent never loads the helpers. Re-running `make setup` recreates the sandbox with fresh config. |
| Agent uses wrong LOINC code for eGFR | Agent used its own training knowledge instead of reading the skill file | Run `make check` to verify skills are synced. The `fhir-basics` skill lists `33914-3` for eGFR. If the workspace copy is stale, the model falls back on its own (often wrong) LOINC codes. |

| Symptom | Cause | Fix |
|---|---|---|
| FHIR queries return 0 patients | Wrong SNOMED code format | Use bare codes: `code=44054006`, not `code=http://snomed.info/sct\|44054006`. The skill files contain the correct patterns. |
| Charts not visible in dashboard | Canvas directory not accessible or file not saved to correct path | Charts must be saved to `~/.openclaw/canvas/`. View the canvas at `http://localhost:18789/__openclaw__/canvas/`. |
| `make test-full` fails on L4/L5 agent tests | Agent query timed out, FHIR server unreachable from sandbox, or Ollama model unloaded | Check step by step: (1) `make status`: are Ollama and OpenFold3 healthy? (2) `make check`: are skills and config synced? (3) Send a warmup message in the dashboard to reload the model. (4) Run `make test --level 3` first to isolate whether the issue is infrastructure, config, or agent-level. |
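
The bare-code rule from the first row of this table can be enforced when building a query URL. The sketch below is illustrative only: `bare_code`, `condition_query`, the `Condition` resource type, and the `http://fhir.example/fhir` base URL are assumptions, not names taken from the playbook's skill files.

```python
from urllib.parse import urlencode


def bare_code(token: str) -> str:
    """Drop an optional 'system|' prefix from a FHIR token, keeping the code."""
    return token.rsplit("|", 1)[-1]


def condition_query(base_url: str, code: str) -> str:
    """Build a Condition search URL that always uses the bare code form."""
    query = urlencode({"code": bare_code(code)})
    return f"{base_url.rstrip('/')}/Condition?{query}"


if __name__ == "__main__":
    # Hypothetical base URL; substitute the FHIR server used by your sandbox.
    print(condition_query("http://fhir.example/fhir", "http://snomed.info/sct|44054006"))
```

Both `44054006` and `http://snomed.info/sct|44054006` produce the same `code=44054006` query, so a system-qualified code copied from elsewhere no longer yields an empty result set.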