This experience showcases James and Aria, our interactive digital humans, who can answer questions about NVIDIA's products or O-RAN specifications through direct access to the corresponding knowledge bases. The digital human and the RAG-powered backend application use a collection of NVIDIA NIM microservices, NVIDIA ACE and Maxine technologies, and ElevenLabs text-to-speech to provide natural, immersive responses. Taking James or Aria as inspiration, users can download the Digital Human for Customer Service blueprint and customize it for their industry use case: adding document ingestion with retrieval-augmented generation (RAG), and tailoring the avatar's look and voice to their application.
The digital human for customer service NVIDIA AI Blueprint is powered by NVIDIA Tokkio, a workflow based on ACE technologies, to bring enterprise applications to life with a 3D animated digital human interface. With approachable, human-like interactions, customer service applications can provide more engaging user experiences compared to traditional customer service options.
This workflow is designed to integrate within your existing generative AI applications built using RAG. Use this workflow to start evolving your applications running in your data center, in the cloud, or at the edge, to include a full digital human interface.
The following NIM microservices are used by this blueprint:
NVIDIA AI Blueprints are customizable AI workflow examples that equip enterprise developers with NIM microservices, reference code, documentation, and a Helm chart for deployment.
This blueprint serves as a reference showing how an LLM or RAG application can be easily connected to a digital human pipeline. The digital human and the RAG application are deployed separately: the RAG application generates the text content of the interaction, while the Tokkio customer service workflow provides live avatar interaction. The two components communicate through REST APIs, so users can tune either side to fit their requirements. This workflow includes steps to set up and connect both parts of the customer service pipeline:
Digital Human Pipeline
RAG Pipeline
With this blueprint, users can do the following:
Input
Input Type(s): Audio
Input Format: bytes
Input Parameters: Tuning Parameters, Audio
Other Properties Related to Input: Supported sampling rates: 22.05 kHz, 44.1 kHz, 16 kHz; all audio is resampled to 16 kHz. There is no max audio length.
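The service resamples all input audio to 16 kHz on its own; the sketch below only illustrates what that step does, using naive linear interpolation on a mono float buffer (a toy stand-in, not the service's actual resampler).

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naively resample a mono float sample list via linear interpolation.

    Illustrative only: the Audio2Face-3D service resamples input audio
    to 16 kHz server-side; this just shows what resampling means.
    """
    if src_rate == dst_rate:
        return list(samples)
    ratio = src_rate / dst_rate
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out

# One second of 44.1 kHz audio becomes 16000 samples at 16 kHz.
dst = resample_linear([0.0] * 44100, 44100, 16000)
```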
Output
Output Type(s): Blendshape Coefficients
Output Format: Custom Protobuf Format
Output Parameters: Custom Protobuf Format
Input
Input Type(s): Portrait Image, Audio
Input Format: RGB image, 32-bit float PCM audio
Input Parameters: 720p to 4K for the image, Audio
Other Properties Related to Input: Supported sampling rate: 16 kHz, mono-channel audio. There is no max audio length.
Output
Output Format: Animated RGB Image
Output Parameters: Custom Protobuf Format
Other Properties Related to Output: Input images are post-processed using a proprietary technique; 3-channel, 32-bit images supported.
Input
Input Format: Text
Input Parameters: Temperature, TopP
Output
Output Format: Text and code
Output Parameters: Max output tokens
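The parameters above (temperature, top-p, max output tokens) map onto the request body of an OpenAI-style chat completion call, which NIM LLM endpoints commonly expose. The model name below is a placeholder; substitute whichever model your deployment serves.

```python
import json

def build_chat_request(prompt, temperature=0.2, top_p=0.7, max_tokens=512):
    """Build an OpenAI-style chat completion payload (a sketch; the
    model name is a placeholder, not prescribed by this blueprint)."""
    return {
        "model": "meta/llama3-8b-instruct",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,   # tuning parameter from the spec above
        "top_p": top_p,               # tuning parameter from the spec above
        "max_tokens": max_tokens,     # caps the output token count
    }

payload = build_chat_request("What does this blueprint do?")
body = json.dumps(payload)  # ready to POST to the LLM NIM endpoint
```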
Input
Input Type(s): Audio in English
Input Format(s): Linear PCM 16-bit 1 channel
Output
Output Type(s): Text String in English with Timestamps
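The ASR input format above (linear PCM, 16-bit, one channel) can be produced from normalized float samples with a few lines of standard-library code; this is a generic sketch of the encoding, not a client for the Riva API.

```python
import struct

def floats_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit linear PCM bytes
    (mono, little-endian), matching the ASR input format above."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))           # clamp to the valid range
        out += struct.pack("<h", int(s * 32767))
    return bytes(out)

# Each sample becomes 2 bytes, so 4 samples yield 8 bytes.
pcm = floats_to_pcm16([0.0, 0.5, -0.5, 1.0])
```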
Input
Input Format (For FastPitch 1st Stage): Text Strings in English
Other Properties Related to Input: 400-Character Text String Limit
Output
Output Format (For HiFi-GAN 2nd Stage): Audio of shape (batch x time) in WAV format
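Given the 400-character input limit above, longer answers from the RAG pipeline need to be split before synthesis. A simple whitespace-aware chunker is sketched below; only the 400-character limit comes from the spec, the splitting strategy is up to you.

```python
def chunk_for_tts(text, limit=400):
    """Split text into chunks under the TTS 400-character limit,
    breaking on whitespace so words stay intact (a simple sketch)."""
    words, chunks, current = text.split(), [], ""
    for w in words:
        candidate = (current + " " + w).strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = w
        # note: a single word longer than `limit` would still overflow
    if current:
        chunks.append(current)
    return chunks
```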
Input
Input Type: text
Input Format: list of strings with task-specific instructions
Output
Output Type: floats
Output Format: list of float arrays, each array containing the embeddings for the corresponding input string
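The float arrays returned by the embedding model are typically compared by cosine similarity during retrieval. The toy vectors below stand in for real embedding output; this is a generic illustration, not part of the NIM API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, i.e. two of the
    float arrays the embedding model returns (one per input string)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy vectors standing in for real embedding output; identical
# vectors score ~1.0, orthogonal vectors score ~0.0.
score = cosine_similarity([0.1, 0.9, 0.0], [0.1, 0.9, 0.0])
```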
Input
Input Type: Pair of Texts
Input Format: List of text pairs
Other Properties Related to Input: The model's maximum context length is 512 tokens. Texts longer than the maximum length must be either chunked or truncated.
Output
Output Type: floats
Output Format: List of float arrays
Other Properties Related to Output: Each array contains a probability score (or the raw logits) for the corresponding input pair; the user can decide whether a sigmoid activation function is applied to the logits.
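Applying the optional sigmoid mentioned above is a one-liner: it maps a raw reranker logit to a relevance score in (0, 1).

```python
import math

def sigmoid(logit):
    """Map a raw reranker logit to a (0, 1) relevance probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# A logit of 0 maps to exactly 0.5; large positive logits approach 1.
mid_score = sigmoid(0.0)
```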
The audio data captured from the user is sent to the ACE Agent, which orchestrates communication between the various NIM microservices. The ACE Agent uses the Riva Parakeet NIM to convert the audio to text, which is then sent to the RAG pipeline. The RAG pipeline uses the NeMo Retriever embedding and reranking NIM microservices and an LLM NIM to answer the question with context from the documents fed to it. The text result is sent to TTS, and the resulting voice output is used to animate the digital human through the Audio2Face-3D or Audio2Face-2D NIM.
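The data flow above (speech in, animated avatar out) can be sketched as a chain of stages. Each stage below is injected as a callable standing in for the corresponding service call; the real workflow is orchestrated by the ACE Agent over the NIM APIs, not by code like this.

```python
def run_turn(audio_bytes, asr, rag, tts, animate):
    """Sketch of one conversational turn in the pipeline described above.

    asr/rag/tts/animate are placeholders for the Riva ASR NIM, the RAG
    endpoint, the TTS service, and Audio2Face, respectively.
    """
    question = asr(audio_bytes)        # speech -> text
    answer = rag(question)             # text -> grounded answer
    speech = tts(answer)               # answer -> synthesized audio
    return animate(speech)             # audio -> avatar animation data

# Stub stages to illustrate the shape of the data flow.
frames = run_turn(
    b"\x00\x01",
    asr=lambda audio: "What is NIM?",
    rag=lambda q: "NIM is a microservice.",
    tts=lambda text: b"audio",
    animate=lambda audio: {"blendshapes": len(audio)},
)
```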
API interfaces for NIM collections conform to OpenAPI standards and can be readily integrated with NVIDIA NIM containers deployed in any compatible compute cluster. Integration or replacement of API-compatible components allows you to easily modify the workload for your specific use case where needed. See the individual NIM documentation for integration details.
By default, the digital human RAG plugin has support for an API that follows the OpenAPI specification. To customize the pipeline to connect to your own RAG system, follow the instructions here.
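Connecting a custom RAG system therefore amounts to speaking its HTTP API. The sketch below builds such a request with the standard library; the /generate route and the "query"/"answer" fields are illustrative placeholders, so match them to whatever your RAG system's OpenAPI specification defines.

```python
import json
import urllib.request

def build_rag_request(base_url, question):
    """Build a POST request for a RAG server (the route and JSON
    schema here are hypothetical placeholders)."""
    return urllib.request.Request(
        base_url + "/generate",                       # hypothetical route
        data=json.dumps({"query": question}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def query_rag(base_url, question):
    """Send the request and extract the generated answer."""
    with urllib.request.urlopen(build_rag_request(base_url, question)) as resp:
        return json.loads(resp.read())["answer"]      # hypothetical field
```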
Hardware Requirements
Digital human pipeline
The digital human pipeline supports the following hardware:
A minimum of two GPUs is required for one stream of the default 3D avatar workflow. Additional details on hardware and compute requirements for different variants of the digital human workflow can be found here.
RAG pipeline
The RAG pipeline needs two A100 GPUs: one for the embedding and reranking NIM microservices and one for the LLM NIM.
OS Requirements
Both the digital human and the RAG pipeline can be deployed on Ubuntu 22.04.
GOVERNING TERMS:
Your use of this trial service is governed by the NVIDIA API Trial Terms of Service
ACE NIM and NGC Microservices - NVIDIA AI Product License
Generative AI Examples - Apache 2
ADDITIONAL TERMS:
Meta Llama 3 Community License Agreement at https://llama.meta.com/llama3/license/.
Create intelligent, interactive avatars for customer service across industries