Run local CLI coding agents with Ollama on DGX Station (NVIDIA GB300) using glm-4.7-flash (fast) or unsloth/GLM-4.7-GGUF:Q8_0 (best quality)
Description: Verify the GPU is visible before installing anything.
nvidia-smi
Expected output (example): A table showing driver version and GPU(s). On DGX Station, the GPU name may appear as NVIDIA GB300 (without "Ultra"):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 5xx.xx Driver Version: 5xx.xx CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 NVIDIA GB300 On | 00000000:06:00.0 Off | 0 |
...
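For a scripted check, nvidia-smi's query mode prints just the fields you need. This is a minimal sketch that falls back gracefully when no NVIDIA driver is present:

```shell
# Query GPU name and total memory in CSV form; fall back if no driver is installed
GPU_INFO=$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null \
    || echo "no NVIDIA GPU detected")
echo "$GPU_INFO"
```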
Description: Install Ollama or ensure it is recent enough for modern coding models.
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
To install a specific version (e.g. 0.15.0 or newer, required for GLM-4.7-Flash):
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
If Ollama is already present and the version is 0.15.0 or newer, simply run:
ollama --version
Expected output (example):
ollama version is 0.15.0
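To gate later steps on the installed version, a small semantic-version comparison using sort -V works. This is a sketch; the grep pattern assumes ollama --version prints a dotted version string as shown above:

```shell
# version_ge A B: succeeds if version A >= version B (semantic compare via sort -V)
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Extract the dotted version from `ollama --version`, if the CLI is present
installed=$(ollama --version 2>/dev/null | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1)
if [ -n "$installed" ] && version_ge "$installed" "0.15.0"; then
    echo "Ollama $installed is new enough"
else
    echo "Ollama missing or older than 0.15.0; re-run the install script" >&2
fi
```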
Description: Download the model weights to your DGX Station. This playbook supports two model options on Ollama; choose one (or both) depending on whether you want fast loading and testing or best quality.
For fast loading and testing — glm-4.7-flash (~19 GB for latest; loads quickly; ensure Ollama 0.15.0+):
ollama pull glm-4.7-flash
For best quality — unsloth/GLM-4.7-GGUF at Q8_0 (larger download; slower to load):
ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0
Other glm-4.7-flash variants on GB300 (these use more GPU memory; the bf16 variant is ~60 GB):
ollama pull glm-4.7-flash:q8_0
ollama pull glm-4.7-flash:bf16
Expected output (example): Progress lines followed by "success" and the model in ollama list:
ollama list
NAME ID SIZE MODIFIED
glm-4.7-flash:latest abc123... 19 GB 1 minute ago
hf.co/unsloth/GLM-4.7-GGUF:Q8_0 def456... ... ...
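A scripted way to confirm a pull landed is to grep the output of ollama list. A sketch that skips cleanly if the ollama CLI is not on PATH:

```shell
MODEL="glm-4.7-flash"   # or hf.co/unsloth/GLM-4.7-GGUF:Q8_0
if command -v ollama >/dev/null 2>&1 && ollama list | grep -q "$MODEL"; then
    STATUS="present"
else
    STATUS="missing, or ollama not installed"
fi
echo "$MODEL: $STATUS"
```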
Description: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. glm-4.7-flash for fast testing, or hf.co/unsloth/GLM-4.7-GGUF:Q8_0 for best quality).
ollama run glm-4.7-flash
Or, if you pulled the larger model:
ollama run hf.co/unsloth/GLM-4.7-GGUF:Q8_0
Try a prompt like:
Write a short README checklist for a Python project.
Expected output: GLM-4.7-Flash may show Thinking... and reasoning text before the final answer, then the model's response. This is normal; wait for the reply to complete.
Exit the Ollama REPL when done: type /bye or press Ctrl+D.
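ollama run also accepts a one-shot prompt as a command-line argument, which is handy for scripted smoke tests instead of the interactive REPL. A sketch that skips if ollama or the model is unavailable:

```shell
# One-shot prompt: no REPL, the reply is printed to stdout
MODEL="glm-4.7-flash"
REPLY=$(command -v ollama >/dev/null 2>&1 \
    && ollama run "$MODEL" "Reply with the single word: ready" 2>/dev/null \
    || echo "[skipped: ollama or model unavailable]")
echo "$REPLY"
```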
Description: Install the CLI tool that will drive the local model.
curl -fsSL https://claude.ai/install.sh | sh
Verify the installation:
claude --version
Expected output (example): A version string such as claude 0.x.x or similar. If you see claude: command not found, ensure the install script added the CLI to your PATH (e.g. restart the terminal or source your shell profile); see Troubleshooting.
Description: Ollama defaults to a 4096 token context length. For coding agents and larger codebases, set it to 64K tokens. This increases memory usage. For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
Set the context length per session in the Ollama REPL (use the same model name you pulled, e.g. glm-4.7-flash or hf.co/unsloth/GLM-4.7-GGUF:Q8_0):
ollama run glm-4.7-flash
Then, in the Ollama prompt:
/set parameter num_ctx 64000
Exit when done: type /bye or press Ctrl+D. Note that a parameter set this way applies only to the current session; to keep it, save a derived model with /save <name> before exiting.
Optional method (set globally when serving Ollama):
sudo systemctl stop ollama
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
Keep this terminal open and run the next step in a new terminal.
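Another way to persist the 64K context is a Modelfile, which bakes num_ctx into a derived model. A sketch, assuming the fast model; the derived name glm-4.7-flash-64k is our choice, while FROM and PARAMETER are standard Modelfile directives:

```shell
# Write a Modelfile that derives a 64K-context variant of the base model
cat > Modelfile.64k <<'EOF'
FROM glm-4.7-flash
PARAMETER num_ctx 64000
EOF

# Build the derived model if the ollama CLI is available
if command -v ollama >/dev/null 2>&1; then
    ollama create glm-4.7-flash-64k -f Modelfile.64k
else
    echo "ollama not installed; Modelfile written to ./Modelfile.64k" >&2
fi
```

The derived model can then be used anywhere the base name was, e.g. ollama run glm-4.7-flash-64k.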
Description: Point Claude Code to the local Ollama server and launch it. Use the model you pulled: glm-4.7-flash (fast) or hf.co/unsloth/GLM-4.7-GGUF:Q8_0 (best quality).
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
claude --model glm-4.7-flash
If you are using the larger model:
claude --model hf.co/unsloth/GLM-4.7-GGUF:Q8_0
ANTHROPIC_AUTH_TOKEN=ollama — Claude Code requires a token to be set, but the local Ollama server does not validate it, so any placeholder value (here, ollama) works. No real Anthropic API key is needed.
ANTHROPIC_BASE_URL — tells Claude Code to send requests to your local Ollama server on port 11434 instead of Anthropic's cloud API.
Persist these variables (optional) so you don't have to re-export them in every terminal session. Add them to ~/.bashrc or your shell profile (e.g. ~/.zshrc):
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
source ~/.bashrc
Expected output: Claude Code starts and uses the local model.
Exit Claude Code when done: type /exit or press Ctrl+C.
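If Claude Code fails to connect, first confirm the Ollama server itself is reachable; its root endpoint answers with a plain health string. A sketch, assuming the default port 11434:

```shell
# Ollama's root endpoint normally replies with a short status string
HEALTH=$(curl -fsS http://localhost:11434 2>/dev/null || echo "Ollama server not reachable")
echo "$HEALTH"
```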
Description: Create a tiny repo and let Claude Code implement a function and tests.
mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo
printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
If you do not already have pytest installed:
python -m pip install -U pytest
In Claude Code, enter:
Please implement add() in math_utils.py and make sure the test passes.
Exit Claude Code when finished: type /exit or press Ctrl+C, then run the test:
python -m pytest -q
Expected output: the test passes; pytest ends with a line such as 1 passed in 0.01s.
Description: Remove the model and stop the Ollama service if you no longer need them. Remove the model first (while the Ollama server is running), then stop the service.
WARNING
The following removes the downloaded model files from disk.
1. Remove the model (Ollama must be running). Use the same name you pulled:
ollama rm glm-4.7-flash
Or, for the Hugging Face model:
ollama rm hf.co/unsloth/GLM-4.7-GGUF:Q8_0
Use the exact tag you pulled (e.g. glm-4.7-flash:bf16 if you used that variant).
2. Stop the Ollama service:
sudo systemctl stop ollama
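To sweep every variant you may have pulled in one pass, loop over the tags. A sketch; ollama rm on a name that was never pulled simply fails, which the loop tolerates:

```shell
# Remove each model tag if present; report tags that could not be removed
MODELS="glm-4.7-flash glm-4.7-flash:q8_0 glm-4.7-flash:bf16 hf.co/unsloth/GLM-4.7-GGUF:Q8_0"
for m in $MODELS; do
    command -v ollama >/dev/null 2>&1 && ollama rm "$m" 2>/dev/null || echo "skipped $m"
done
```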