
Build a Digital Human

Create intelligent, interactive avatars for customer service across industries

This experience showcases James and Aria, our interactive digital humans, who draw their knowledge of NVIDIA products and O-RAN specifications from direct access to the corresponding knowledge bases. The digital human and the RAG-powered backend application use a collection of NVIDIA NIM microservices, NVIDIA ACE and Maxine technologies, and ElevenLabs text-to-speech to provide natural, immersive responses. Using James or Aria as inspiration, users can download the digital human for customer service blueprint and customize it for their industry use case: ingesting documents for retrieval-augmented generation (RAG) and tailoring the avatar's look and voice to their application.

Use Case Description

The digital human for customer service NVIDIA AI Blueprint is powered by NVIDIA Tokkio, a workflow based on ACE technologies, to bring enterprise applications to life with a 3D animated digital human interface. With approachable, human-like interactions, customer service applications can provide more engaging user experiences compared to traditional customer service options.

This workflow is designed to integrate within your existing generative AI applications built using RAG. Use this workflow to start evolving your applications running in your data center, in the cloud, or at the edge, to include a full digital human interface.

Architecture Diagram


What’s included in the Blueprint

NIM and Other Software

NVIDIA AI Blueprints are customizable AI workflow examples that equip enterprise developers with NIM microservices, reference code, documentation, and a Helm chart for deployment.

The following NIM microservices are used by this blueprint: Audio2Face-3D, Audio2Face-2D, Llama 3 8B Instruct, Riva Parakeet-ctc-1.1b ASR, FastPitch-HifiGAN TTS, NeMo Retriever nv-embedqa-e5-v5 embedding, and NeMo Retriever nv-rerankqa-mistral4b-v3 reranking.

This blueprint provides a reference that shows how an LLM or a RAG application can easily be connected to a digital human pipeline. The digital human and the RAG application are deployed separately: the RAG application generates the text content of the interaction, while the Tokkio customer service workflow drives the live avatar interaction. The two components communicate over a REST API (a minimal sketch of this handoff appears after the component lists below), so users can develop and tune each part independently to fit their requirements. Included in this workflow are steps to set up and connect both components of the customer service pipeline. Each part of the pipeline consists of the following components:

Digital Human Pipeline

  • A composable Helm chart that sets up the digital human pipeline with ACE agent and deploys the Audio2Face-3D, Riva Parakeet, and FastPitch NIM microservices for the default stylized avatar. The pipeline also provides variations that incorporate Audio2Face-2D for 2D avatars.

RAG Pipeline

  • A Docker Compose application that deploys a Llama 3 LLM NIM, the NeMo Retriever nv-embedqa-e5-v5 embedding NIM, the NeMo Retriever nv-rerankqa-mistral4b-v3 reranking NIM, and a LangChain RAG pipeline with a FastAPI endpoint for multiturn chat.
  • Notebooks for ingesting domain-specific documents (O-RAN data) and for parameter-efficient fine-tuning on synthetic data generated from O-RAN documents.
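
As a rough illustration of the REST handoff mentioned above, the following Python sketch sends one multiturn chat request to the RAG application's FastAPI endpoint. The host, route, payload shape, and response field are assumptions for illustration; consult the RAG container's generated OpenAPI page for the actual schema.

    import requests

    # Hypothetical host/route and payload shape -- check the RAG container's
    # OpenAPI docs (e.g., /docs) for the real schema before using this.
    RAG_URL = "http://localhost:8081/generate"

    def ask_rag(question: str, history: list[dict]) -> str:
        """Send one multiturn chat request to the RAG FastAPI endpoint."""
        payload = {
            "messages": history + [{"role": "user", "content": question}],
            "use_knowledge_base": True,  # assumed flag: answer from ingested docs
        }
        resp = requests.post(RAG_URL, json=payload, timeout=60)
        resp.raise_for_status()
        return resp.json()["answer"]     # assumed response field

    answer = ask_rag("What does the O-RAN fronthaul specification cover?", [])
    print(answer)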

With this blueprint, users can do the following:

  1. Use the pre-built digital human Helm chart to create a digital human interface powered by a sample avatar asset (named Ben) and the Riva FastPitch text-to-speech (TTS) and automatic speech recognition (ASR) NIM microservices. By default, the pre-built Helm chart also connects to a Llama 3 8B NIM API endpoint to get users started with an interactive digital human.
  2. Use the RAG application to demonstrate the power of industry knowledge with the example O-RAN database, or ingest your own documents to customize the digital human's knowledge for your specific industry.
  3. Deploy the digital human experience and the RAG application on bare metal or on the cloud provider of their choice with simple one-click deployment scripts.

Example Walkthrough with Sample Input/Output

Audio2Face-3D NIM

Input
Input Type(s): Audio
Input Format: bytes
Input Parameters: Tuning Parameters, Audio
Other Properties Related to Input: Supported sampling rates: 16 kHz, 22.05 kHz, 44.1 kHz; all audio is resampled to 16 kHz. There is no maximum audio length.

Output
Output Type(s): Blendshape Coefficients
Output Format: Custom Protobuf Format
Output Parameters: Custom Protobuf Format
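
Since all input audio is resampled to 16 kHz anyway, it can help to resample and convert to 16-bit PCM bytes up front. Below is a minimal Python sketch, assuming soundfile and scipy are available and using an illustrative file name; the actual streaming call goes through the Audio2Face-3D protobuf interface, which is not shown here.

    from math import gcd

    import numpy as np
    import soundfile as sf
    from scipy.signal import resample_poly

    # Load speech audio (file name is illustrative) and downmix to mono.
    audio, rate = sf.read("line.wav", dtype="float32")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    # Resample to the 16 kHz rate the NIM normalizes to; resample_poly
    # needs an integer up/down ratio, so reduce it with the gcd.
    if rate != 16000:
        g = gcd(int(rate), 16000)
        audio = resample_poly(audio, 16000 // g, int(rate) // g)
        rate = 16000

    # Convert float samples to 16-bit PCM bytes for the audio payload.
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    audio_bytes = pcm16.tobytes()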

Audio2Face-2D NIM

Input
Input Type(s): Portrait Image, Audio
Input Format: RGB image; 32-bit float PCM audio
Input Parameters: 720p to 4K for the image, Audio
Other Properties Related to Input: Supported sampling rate: 16 kHz, mono channel audio. There is no maximum audio length.

Output
Output Format: Animated RGB Image
Output Parameters: Custom Protobuf Format
Other Properties Related to Output: Input images are post-processed using a proprietary technique; 3-channel, 32-bit images are supported.

Llama-3-8b NIM

Input
Input Format: Text
Input Parameters: Temperature, TopP

Output
Output Format: Text and code
Output Parameters: Max output tokens
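
Deployed LLM NIM microservices expose an OpenAI-compatible chat completions route, so the parameters above map directly onto a standard request. A minimal sketch, assuming a locally deployed Llama 3 8B NIM; host, port, and model name depend on your deployment.

    import requests

    # Assumed local endpoint for a Llama 3 8B NIM; adjust host and port
    # to match your deployment.
    url = "http://localhost:8000/v1/chat/completions"

    body = {
        "model": "meta/llama3-8b-instruct",
        "messages": [{"role": "user",
                      "content": "Summarize O-RAN in one sentence."}],
        "temperature": 0.2,   # input parameter listed above
        "top_p": 0.9,         # input parameter listed above
        "max_tokens": 256,    # caps output length, per the output spec above
    }
    resp = requests.post(url, json=body, timeout=120)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])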

Riva Parakeet-ctc-1_1b-asr NIM

Input
Input Type(s): Audio in English
Input Format(s): Linear PCM 16-bit 1 channel

Output
Output Type(s): Text String in English with Timestamps
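
For a sense of how these inputs and outputs look in practice, here is a hedged sketch of an offline transcription request using the nvidia-riva-client Python package. The gRPC URI, file name, and exact client API are assumptions; check the Riva documentation for the current interface.

    import riva.client  # pip install nvidia-riva-client

    # Assumed gRPC endpoint of a running Riva ASR service.
    auth = riva.client.Auth(uri="localhost:50051", use_ssl=False)
    asr = riva.client.ASRService(auth)

    config = riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,  # 16-bit PCM, per the spec
        sample_rate_hertz=16000,
        language_code="en-US",                          # audio in English
        audio_channel_count=1,                          # 1 channel, per the spec
        enable_word_time_offsets=True,                  # request word timestamps
    )

    with open("query.wav", "rb") as f:                  # illustrative file name
        audio_bytes = f.read()

    response = asr.offline_recognize(audio_bytes, config)
    for result in response.results:
        print(result.alternatives[0].transcript)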

Fastpitch-hifigan-tts NIM

Input
Input Format (for FastPitch, 1st stage): Text strings in English
Other Properties Related to Input: 400-character text string limit

Output
Output Format (for HifiGAN, 2nd stage): Audio of shape (batch x time) in WAV format
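
The 400-character limit means longer RAG answers must be split before synthesis. Here is a small self-contained Python sketch that chunks text on sentence boundaries; a single sentence longer than the limit would still need a hard split, which is omitted here.

    import re

    MAX_CHARS = 400  # FastPitch input limit from the spec above

    def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
        """Split text on sentence boundaries so each TTS request fits the limit."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        chunks, current = [], ""
        for sentence in sentences:
            if current and len(current) + len(sentence) + 1 > limit:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
        return chunks

    long_answer = ("The O-RAN fronthaul interface splits functions between "
                   "the distributed unit and the radio unit. ") * 8
    for piece in chunk_text(long_answer):
        print(len(piece), piece[:50])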

NeMo Retriever nv-embedqa-e5-v5 NIM

Input
Input Type: text
Input Format: list of strings with task-specific instructions

Output
Output Type: floats
Output Format: list of float arrays, each array containing the embeddings for the corresponding input string
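
Below is a hedged sketch of calling the embedding NIM, assuming a local deployment serving an OpenAI-style embeddings route where the task-specific instruction is conveyed through an input_type field; the host, port, and field names may differ in your deployment.

    import requests

    # Assumed local endpoint for the nv-embedqa-e5-v5 NIM.
    url = "http://localhost:8001/v1/embeddings"

    resp = requests.post(url, json={
        "model": "nvidia/nv-embedqa-e5-v5",
        "input": ["What does the O-RAN fronthaul specification cover?"],
        "input_type": "query",  # use "passage" when embedding documents
    }, timeout=60)
    resp.raise_for_status()

    # One float array per input string, per the output spec above.
    embedding = resp.json()["data"][0]["embedding"]
    print(len(embedding))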

NeMo Retriever nv-rerankqa-mistral4b-v3 NIM

Input
Input Type: Pair of Texts
Input Format: List of text pairs
Other Properties Related to Input: The model's maximum context length is 512 tokens. Texts longer than the maximum length must be chunked or truncated.

Output
Output Type: floats
Output Format: List of float arrays
Other Properties Related to Output: Each array contains a relevance score as a raw logit; the user can decide whether a sigmoid activation function is applied to the logits to produce a probability.
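
Since the choice between raw logits and probabilities is left to the user, this short self-contained sketch shows what applying the sigmoid looks like; the logit values are made up for illustration.

    import math

    def sigmoid(logit: float) -> float:
        """Map a raw reranker logit to a (0, 1) relevance score."""
        return 1.0 / (1.0 + math.exp(-logit))

    # Illustrative logits for three candidate passages scored against one query.
    logits = {"passage_a": 4.2, "passage_b": -1.3, "passage_c": 0.7}

    # Sigmoid is monotonic, so ranking by logits or by probabilities gives the
    # same order; apply it only if you want scores in (0, 1), e.g. to threshold.
    for name, logit in sorted(logits.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: logit={logit:+.1f} prob={sigmoid(logit):.3f}")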

Audio captured from the user is sent to the ACE agent, which orchestrates communication between the various NIM microservices. The ACE agent uses the Riva Parakeet NIM to convert the audio to text, which is then sent to the RAG pipeline. The RAG pipeline uses the NeMo Retriever embedding and reranking NIM microservices and the LLM NIM to answer the question with context from the documents fed to it. The text result is sent to TTS, and the voice output from TTS animates the digital human through the Audio2Face-3D or Audio2Face-2D NIM.
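
To make the orchestration order concrete, the sketch below walks one conversational turn through stand-in functions. Every function here is a hypothetical placeholder for the corresponding NIM call; in the real workflow the ACE agent performs this routing internally.

    # Hypothetical stand-ins for the NIM calls described above.
    def riva_asr_transcribe(audio: bytes) -> str:       # Parakeet ASR NIM
        return "What does the O-RAN fronthaul specification cover?"

    def rag_generate(question: str) -> str:            # embed + rerank + LLM NIMs
        return f"Stub answer for: {question}"

    def riva_tts_synthesize(text: str) -> bytes:       # FastPitch/HifiGAN NIM
        return text.encode()

    def audio2face_animate(speech: bytes) -> bytes:    # Audio2Face-3D/2D NIM
        return speech

    def handle_user_turn(audio_in: bytes) -> bytes:
        """One turn: user audio in, avatar animation data out."""
        question = riva_asr_transcribe(audio_in)
        answer = rag_generate(question)
        speech = riva_tts_synthesize(answer)
        return audio2face_animate(speech)

    print(handle_user_turn(b"\x00\x00"))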

API Definition

API interfaces for NIM collections conform to OpenAPI standards and can be readily integrated with NVIDIA NIM containers deployed in any compatible compute cluster. Integrating or replacing API-compatible components makes it easy to modify workloads to fit your specific use case. See the individual NIM documentation for integration details.

By default, the digital human RAG plugin supports an API that follows the OpenAPI specification. To customize the pipeline to connect to your own RAG system, follow the instructions here.
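
As a starting point for connecting your own RAG system, here is a minimal FastAPI sketch of a custom backend. The route name and message schema are illustrative assumptions; match them to the schema given in the blueprint's plugin instructions for a real deployment.

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ChatRequest(BaseModel):
        messages: list[dict]  # multiturn history with role/content entries

    class ChatResponse(BaseModel):
        answer: str

    @app.post("/generate", response_model=ChatResponse)  # assumed route name
    def generate(req: ChatRequest) -> ChatResponse:
        question = req.messages[-1]["content"]
        # Replace this stub with your own retriever + LLM chain.
        return ChatResponse(answer=f"Stub answer for: {question}")

    # Run with: uvicorn app:app --port 8081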

Minimum System Requirements

Hardware Requirements

Digital human pipeline

The digital human pipeline supports the following hardware:

  • T4
  • A10
  • L4
  • L40S

A minimum of 2 GPUs is required for 1 stream of the default 3D avatar workflow. Additional details on hardware and compute requirements for the different variants of the digital human workflow can be found here.

RAG pipeline

The RAG pipeline needs 2x A100 GPUs: one for the embedding and reranking NIM microservices and one for the LLM NIM.

OS Requirements

Both the digital human and the RAG pipeline can be deployed on Ubuntu 22.04.

Terms of Use

GOVERNING TERMS:
Your use of this trial service is governed by the NVIDIA API Trial Terms of Service.
ACE NIM and NGC Microservices: NVIDIA AI Product License.
Generative AI Examples: Apache 2.0.

ADDITIONAL TERMS: Meta Llama 3 Community License Agreement at https://llama.meta.com/llama3/license/.