Powers complex conversations with superior contextual understanding, reasoning and text generation.
Follow the steps below to download and run the NVIDIA NIM inference microservice for this model on your infrastructure of choice.
Install the NVIDIA GPU Operator
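The GPU Operator is typically installed with Helm from the same NVIDIA repository; the following is a minimal sketch (the release name and flags are illustrative; see the GPU Operator docs for driver and container toolkit options):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
# Install the GPU Operator into its own namespace and wait for it to settle.
helm install --wait gpu-operator nvidia/gpu-operator \
--create-namespace -n gpu-operator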
Install the NIM Operator:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
helm install nim-operator nvidia/k8s-nim-operator --create-namespace -n nim-operator
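To confirm the operator is running before moving on, you can check its pods (pod names will vary):

# The NIM Operator controller pod should reach the Running state.
kubectl get pods -n nim-operator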
Create the namespace and the secrets used to pull the NIM container image and download the model from NGC:

kubectl create ns nim-service
kubectl create secret -n nim-service docker-registry ngc-secret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=<PASTE_API_KEY_HERE>
kubectl create secret -n nim-service generic ngc-api-secret \
--from-literal=NGC_API_KEY=<PASTE_API_KEY_HERE>
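You can verify that both secrets exist and match the names referenced by the manifest below:

# Expect to see ngc-secret and ngc-api-secret listed.
kubectl get secrets -n nim-service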
Ensure that a default StorageClass exists in the cluster. If none is present, create an appropriate StorageClass before proceeding.
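To check for a default StorageClass, list the classes and look for the "(default)" marker; an existing class can also be annotated as the default (standard Kubernetes commands, not specific to NIM):

kubectl get storageclass
# Mark an existing class as the default (replace <name> with your class).
kubectl patch storageclass <name> \
-p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'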
NOTE:
- Set "model-size" based on the model and GPU type as described here.
- Set "nvidia.com/gpu: 1" based on the model and its GPU count requirements.

Create the NIMService manifest:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-31-70b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-70b-instruct
    tag: latest
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      size: "model-size"
      volumeAccessMode: "ReadWriteOnce"
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
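Apply the manifest and wait for the NIMService to become ready; the first start can take a while because the model is downloaded into the PVC (the nimservice.yaml filename here is illustrative):

kubectl apply -f nimservice.yaml
# Watch the custom resource and its pod until they report Ready/Running.
kubectl get nimservices -n nim-service
kubectl get pods -n nim-service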
Once the NIMService is ready, test it from inside the cluster by starting a temporary curl pod:

kubectl run --rm -it -n default curl --image=curlimages/curl:latest -- ash

From the pod's shell, send a chat completion request to the service:
curl -X "POST" \
  'http://llama-31-70b-instruct.nim-service:8000/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-70b-instruct",
    "messages": [
      {
        "content": "What should I do for a 4 day vacation at Cape Hatteras National Seashore?",
        "role": "user"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 1024,
    "stream": false,
    "frequency_penalty": 0.0,
    "stop": ["STOP"]
  }'
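To test from your workstation instead of an in-cluster pod, you can also port-forward the service and send the same request to localhost (standard kubectl usage, not part of the original steps; /v1/models is a quick connectivity check on the OpenAI-compatible API):

kubectl port-forward -n nim-service svc/llama-31-70b-instruct 8000:8000
# In a second terminal, confirm the model is being served.
curl http://localhost:8000/v1/models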
For more details on getting started with this NIM, visit the NVIDIA NIM Operator Docs.