Local Healthcare Agent on DGX Station

Duration: 60 min

Run healthcare AI agents that analyze patient data and predict protein structures in an OpenShell sandbox on DGX Station

Tags: AI Agent, DGX Station, FHIR, GB300, Healthcare, NemoClaw, Nemotron, OpenFold3, OpenShell
Troubleshooting

Docker and infrastructure

  • Symptom: `make up` appears to hang on the model pull
    Cause: Nemotron-3-Super is ~86 GB and takes 15–25 min to download the first time (longer on slow links).
    Fix: Wait. Check progress with `docker compose logs -f ollama`. If the pull is interrupted, re-run `make up`; it resumes where it left off.

  • Symptom: `OpenFold3: ✗ down` in `make status`
    Cause: OpenFold3 takes about 3 minutes to load model weights on startup.
    Fix: Wait, then re-run `make status`. Check the logs with `docker compose logs -f openfold3`.

  • Symptom: `failed to bind host port for 0.0.0.0:11434` on `docker compose up ollama`
    Cause: A host Ollama service is already listening on port 11434 (common after the NemoClaw playbook).
    Fix: Stop host Ollama with `sudo systemctl stop ollama && sudo systemctl disable ollama`, or override the port in `.env`, e.g. `OLLAMA_PORT=11435`. `make setup` and `setup_sandbox.sh` source `.env` and configure the sandbox provider against the new port.

  • Symptom: `unauthorized: <html><head><title>401 Authorization Required` when pulling `nvcr.io/nim/openfold/openfold3`
    Cause: Docker is not authenticated against NGC. `NGC_API_KEY` in `.env` is the runtime credential, not the pull credential.
    Fix: Run `make ngc-login`, which reads `NGC_API_KEY` from `.env`. Manual equivalent: `echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin`.

  • Symptom: OpenFold3 crashes with `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED`
    Cause: OpenFold3's PyTorch backend rejects containers that see more than one GPU, and `count: all` exposes both GPUs on a dual-GPU station.
    Fix: `docker-compose.yml` pins the containers to `LLM_GPU`/`OPENFOLD_GPU` (default `0`). On dual-GPU stations, set both to the GB300's GPU index in `.env`, then run `docker compose up -d --force-recreate openfold3`.

  • Symptom: `NGC_API_KEY not set` error
    Cause: The `.env` file is missing or the NGC key is not configured.
    Fix: Run `cp .env.example .env`, then edit `.env` to add your NGC API key from ngc.nvidia.com.

  • Symptom: `exec format error` when starting containers
    Cause: Container architecture mismatch (an x86 image on ARM64).
    Fix: Use ARM64-compatible images. OpenFold3 (v1.3.0+) and Ollama support ARM64. Check an image's architecture with `docker inspect --format '{{.Architecture}}' <image>`.

  • Symptom: Sandbox policy validation fails on startup
    Cause: `landlock: hard_requirement` aborts startup if any filesystem path in the policy can't be enforced.
    Fix: Check that every path in `sandbox-policy.yaml` exists on the system. On a non-standard DGX OS install, temporarily switch to `compatibility: best_effort` to diagnose.

  • Symptom: `node: command not found`, or OpenShell rejects Node.js v18
    Cause: DGX Station ships with Node.js v18.19.1; OpenShell and OpenClaw need v22 or later.
    Fix: Install Node.js 22 with `curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash - && sudo apt-get install -y nodejs`. `make prereq` checks the version automatically.
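
The Node.js requirement above can also be checked from a script. The following is a minimal Python sketch, not the actual `make prereq` logic; it assumes `node --version` prints a string such as `v18.19.1` and treats any major version below 22 as too old. The names `parse_node_major` and `check_node` are hypothetical.

```python
import re
import shutil
import subprocess

# OpenShell/OpenClaw need Node.js v22 or later.
MIN_NODE_MAJOR = 22


def parse_node_major(version_output: str) -> int:
    """Extract the major version from `node --version` output such as 'v18.19.1'."""
    match = re.match(r"\s*v?(\d+)\.", version_output)
    if match is None:
        raise ValueError(f"unrecognized Node.js version string: {version_output!r}")
    return int(match.group(1))


def check_node(min_major: int = MIN_NODE_MAJOR) -> str:
    """Return a one-line status message about the installed Node.js version."""
    if shutil.which("node") is None:
        return f"node: command not found; install Node.js v{min_major}+"
    result = subprocess.run(["node", "--version"], capture_output=True, text=True, check=True)
    version = result.stdout.strip()
    if parse_node_major(version) < min_major:
        return f"Node.js {version} is too old; need v{min_major}+"
    return f"Node.js {version} OK"
```

Calling `print(check_node())` on a stock DGX Station would report the v18.19.1 install as too old. Keeping the parser separate from the subprocess call lets it be tested without Node.js installed.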

Gateway and sandbox

  • Symptom: Gateway fails with a "ContainerManager" error
    Cause: DGX Station uses cgroup v2, so k3s needs the systemd cgroup driver flag.
    Fix: Start the gateway with `OPENSHELL_K3S_ARGS='--kubelet-arg=cgroup-driver=systemd' openshell gateway start`.

  • Symptom: `openshell status` returns "Connection reset by peer" right after the gateway starts
    Cause: k3s inside the gateway container takes 10–15 s to accept connections.
    Fix: Wait. Use the polling loop from `instructions.md` Step 4: `for i in $(seq 1 30); do openshell status 2>/dev/null | grep -q Connected && break; sleep 2; done`.

  • Symptom: `openshell status` shows "Not Connected" after 30 s
    Cause: The gateway was not started or has crashed.
    Fix: Run `openshell gateway start` (with the cgroup flag above). Check `docker ps` for the gateway container.

  • Symptom: `openshell sandbox create` fails with "port already forwarded" or hangs on `--forward 18789`
    Cause: A stale port forward from a previously deleted sandbox is still registered.
    Fix: List forwards with `openshell forward list`, then stop each one bound to port 18789 with `openshell forward stop 18789 <sandbox-name>`. `setup_sandbox.sh` does this automatically before re-creating the sandbox.

  • Symptom: An existing OpenShell gateway from another playbook is silently reused when you start one with a new name
    Cause: `openshell gateway start` resumes any existing gateway in the stopped state.
    Fix: This is acceptable. To start clean, run `openshell gateway destroy` before `openshell gateway start`.

  • Symptom: Port 18789 is not accessible remotely
    Cause: The SSH tunnel is not active, or the port forward inside the sandbox has died.
    Fix: Check with `openshell forward list`. If the forward is dead, run `openshell forward stop 18789 clinical-sandbox && openshell forward start -d 18789 clinical-sandbox`, then re-establish the SSH tunnel from your machine.

  • Symptom: The `requests` library doesn't work in the sandbox
    Cause: Sandbox Python makes HTTP calls through a curl subprocess, not the `requests` library.
    Fix: This is by design. All HTTP calls in agent scripts must use `subprocess.run(["curl", ...])` and `json.loads()`. The `fhir_helpers.py` library handles this automatically.
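
Because `requests` is unavailable in the sandbox, HTTP calls go through `curl`. Here is a minimal sketch of that pattern, assuming an endpoint that returns JSON (such as a FHIR server). It is not the `fhir_helpers.py` API; the helper names, the 30 s timeout, and the `Accept: application/fhir+json` header are illustrative choices.

```python
import json
import subprocess
from typing import List, Optional
from urllib.parse import urlencode


def build_curl_argv(url: str, params: Optional[dict] = None, timeout_s: int = 30) -> List[str]:
    """Build the argv for a quiet curl GET that fails on HTTP errors."""
    if params:
        url = f"{url}?{urlencode(params)}"
    return [
        "curl",
        "--silent",       # no progress meter on stderr
        "--show-error",   # but still report transport errors
        "--fail",         # exit non-zero on HTTP 4xx/5xx instead of printing the error body
        "--max-time", str(timeout_s),
        "-H", "Accept: application/fhir+json",
        url,
    ]


def curl_get_json(url: str, params: Optional[dict] = None, timeout_s: int = 30) -> dict:
    """GET `url` through a curl subprocess and parse the body with json.loads()."""
    argv = build_curl_argv(url, params, timeout_s)
    # check=True raises subprocess.CalledProcessError when curl exits non-zero.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Building the argv separately keeps the command testable without a network. With `--fail`, an HTTP error surfaces as a `CalledProcessError` rather than an HTML error page being handed to `json.loads()`.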

Inference and model

  • Symptom: Agent returns an empty response or times out
    Cause: The model was unloaded from GPU memory after the idle timeout.
    Fix: Send a warmup message first. Check that `OLLAMA_KEEP_ALIVE` is set to `4h` in `docker-compose.yml`.

  • Symptom: `curl: (7) Failed to connect to inference.local`
    Cause: The OpenShell inference provider is not configured, or Ollama is not running.
    Fix: Verify Ollama with `curl -sf http://localhost:${OLLAMA_PORT:-11434}/`. Re-run `make setup`; it configures the inference provider automatically.

  • Symptom: Sandbox cannot reach host Ollama (requests to the Docker bridge IP time out)
    Cause: Host Ollama's systemd unit binds to 127.0.0.1 by default.
    Fix: Add a systemd override that binds Ollama to all interfaces: run `sudo systemctl edit ollama` and insert a `[Service]` section containing `Environment="OLLAMA_HOST=0.0.0.0"`, then run `sudo systemctl daemon-reload && sudo systemctl restart ollama`. Docker Ollama (the default in this playbook) already binds to 0.0.0.0.

  • Symptom: OpenFold3 returns an error for molecular visualization
    Cause: The protein sequence is too long or the input is malformed.
    Fix: OpenFold3 supports sequences of up to 4096 amino acids with the PyTorch backend, or 2048 with TensorRT. Check the protein sequence in the drug-target table in `build_viewer.py`.
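
A pre-flight check before sending a sequence to OpenFold3 can catch both causes listed above. This hypothetical helper applies the 4096/2048 residue limits quoted in the entry and, as an assumption, accepts only the 20 standard one-letter amino-acid codes. OpenFold3 itself may accept a broader alphabet.

```python
from typing import List

# Residue limits quoted above for each OpenFold3 backend.
MAX_RESIDUES = {"pytorch": 4096, "tensorrt": 2048}

# Assumption: only the 20 standard one-letter amino-acid codes are allowed.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def validate_protein_sequence(sequence: str, backend: str = "pytorch") -> List[str]:
    """Return a list of problems with `sequence`; an empty list means it passed."""
    # Drop whitespace and newlines, e.g. from a pasted multi-line sequence.
    seq = "".join(sequence.split()).upper()
    if not seq:
        return ["sequence is empty"]
    problems = []
    limit = MAX_RESIDUES[backend]
    if len(seq) > limit:
        problems.append(f"length {len(seq)} exceeds the {backend} limit of {limit}")
    bad = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if bad:
        problems.append(f"non-standard residue letters: {''.join(bad)}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue with a drug-target sequence in one pass.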

Agent and skills

  • Symptom: `make setup` fails
    Cause: Setup did not complete successfully.
    Fix: Re-run `make setup`; the script recreates the sandbox from scratch with a fresh config. Make sure you're on OpenShell 0.0.33 or later.

  • Symptom: `make check` reports stale skills
    Cause: The workspace copies of the skills no longer match the repo after an update.
    Fix: The `make check` output lists which skills are stale. Re-run `make setup`, or, inside the sandbox, copy them manually from `/sandbox/clinical-intelligence/skills/` to `~/.openclaw/workspace/skills/`.

  • Symptom: `ENOENT` errors for memory files in the logs
    Cause: OpenClaw tries to read daily memory files that don't exist.
    Fix: Inside the sandbox, create the memory directory and file: `mkdir -p ~/.openclaw/workspace/memory && touch ~/.openclaw/workspace/MEMORY.md`. `make check` detects this condition.

  • Symptom: Agent writes code from scratch instead of using the helpers
    Cause: A stale `IDENTITY.md` or `analysis-methods` skill in the workspace.
    Fix: Run `make check` to verify. If the files are stale, the workspace `IDENTITY.md` lacks the `fhir_helpers` import instruction; re-run `make setup` to refresh the workspace.

  • Symptom: Agent uses the wrong LOINC code for eGFR
    Cause: The agent relied on its own training knowledge instead of reading the skill file.
    Fix: Run `make check` to verify that the skills are synced. The `fhir-basics` skill lists `33914-3` for eGFR; if the workspace copy is stale, the model falls back to its own (often wrong) LOINC codes.
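
A stale-skill check amounts to comparing the repo copy with the workspace copy file by file. The sketch below is a hypothetical illustration of that idea, not how `make check` is implemented. It defaults to the two paths named above and reports repo files that are missing from, or differ in, the workspace.

```python
import filecmp
from pathlib import Path
from typing import List

# Paths named above, as seen from inside the sandbox.
REPO_SKILLS = Path("/sandbox/clinical-intelligence/skills")
WORKSPACE_SKILLS = Path.home() / ".openclaw" / "workspace" / "skills"


def find_stale_skills(repo_dir: Path = REPO_SKILLS,
                      workspace_dir: Path = WORKSPACE_SKILLS) -> List[str]:
    """List repo skill files (relative paths) that are missing or different in the workspace."""
    stale = []
    for src in sorted(repo_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(repo_dir)
        dst = workspace_dir / rel
        # shallow=False compares file contents, not just os.stat() signatures.
        if not dst.is_file() or not filecmp.cmp(src, dst, shallow=False):
            stale.append(rel.as_posix())
    return stale
```

Comparing contents rather than timestamps matters here: a copy made before a repo update can have a newer mtime than the repo file and still be out of date.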

Demo and queries

  • Symptom: FHIR queries return 0 patients
    Cause: Wrong SNOMED code format.
    Fix: Use bare codes: `code=44054006`, not `code=http://snomed.info/sct|44054006`. The skill files contain the correct patterns.

  • Symptom: Charts are not visible in the dashboard
    Cause: The canvas directory is not accessible, or the file was not saved to the correct path.
    Fix: Save charts to `~/.openclaw/canvas/`. View the canvas at http://localhost:18789/__openclaw__/canvas/.

  • Symptom: `make test-full` fails on the L4/L5 agent tests
    Cause: The agent query timed out, the FHIR server is unreachable from the sandbox, or the Ollama model was unloaded.
    Fix: Check step by step: (1) `make status`: are Ollama and OpenFold3 healthy? (2) `make check`: are skills and config synced? (3) Send a warmup message in the dashboard to reload the model. (4) Run `make test --level 3` first to isolate whether the issue is in infrastructure, config, or the agent.
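
To keep queries in the bare-code form described in the SNOMED entry above, a script can strip any `system|` prefix before building the URL. This minimal sketch assumes diagnoses are searched on the FHIR `Condition` resource and uses a hypothetical base URL; the skill files remain the authoritative source for the exact query patterns.

```python
from urllib.parse import urlencode


def bare_code(token: str) -> str:
    """Strip an optional 'system|' prefix: 'http://snomed.info/sct|44054006' -> '44054006'."""
    return token.rsplit("|", 1)[-1]


def condition_search_url(base_url: str, code: str) -> str:
    """Build a Condition search URL that uses the bare code."""
    return f"{base_url.rstrip('/')}/Condition?{urlencode({'code': bare_code(code)})}"
```

For example, `condition_search_url("http://localhost:8080/fhir/", "http://snomed.info/sct|44054006")` yields `http://localhost:8080/fhir/Condition?code=44054006`, where the host and port are placeholders.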

Resources

  • OpenShell Documentation
  • NemoClaw (OpenClaw Plugin)
  • OpenFold3 NIM
  • HL7 FHIR R4 Specification

Copyright © 2026 NVIDIA Corporation