
Unlock the power of on-the-go learning and tackle the challenge of information overload with generative AI-powered audio read-outs. Use this blueprint to build a generative AI application that transforms PDF data—such as training documents, technical research, or documentation—into personalized audio content.
Leverage large language models (LLMs), text-to-speech, and NVIDIA NIM microservices to deploy a customized solution tailored to your organization’s proprietary data. This approach remains compliant with privacy requirements throughout the process.
This blueprint is flexible and customizable, so you can add functionality that suits your users’ needs, whether that is specific branding, analytics, real-time translation, or a digital human interface to deepen engagement.
Architecture Diagram
Key Features
PDF to Markdown Service
- Extracts content from PDFs and converts it into markdown format for further processing.
Monologue or Dialogue Creation Service
- An LLM processes the markdown content, enriching or restructuring it into a natural, engaging monologue or dialogue script for audio.
Text-to-Speech (TTS) Service
- Converts the processed content into high-quality speech.
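The three services above form a linear pipeline. The following is a minimal sketch of that data flow only, not the blueprint's implementation: every function name here (`pdf_to_markdown`, `create_dialogue`, `synthesize`, `fake_tts`, `pdf_to_audio`) is hypothetical, and each stage is a placeholder where the real system would call a PDF parser, an LLM, and a TTS engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Turn:
    """One spoken segment of the script."""
    speaker: str
    text: str


def pdf_to_markdown(pdf_text: str) -> str:
    """Placeholder for the PDF to Markdown service.

    Assumption: the PDF text is already extracted; the first non-empty
    line is treated as the title and the rest as paragraphs.
    """
    lines = [ln.strip() for ln in pdf_text.splitlines() if ln.strip()]
    if not lines:
        return ""
    title, body = lines[0], lines[1:]
    return "\n\n".join([f"# {title}", *body])


def create_dialogue(markdown: str, speakers: Sequence[str]) -> List[Turn]:
    """Placeholder for the monologue/dialogue service.

    A real system would prompt an LLM; here paragraphs are simply
    assigned to speakers in round-robin order.
    """
    paragraphs = [p for p in markdown.split("\n\n") if p and not p.startswith("#")]
    return [Turn(speakers[i % len(speakers)], p) for i, p in enumerate(paragraphs)]


def synthesize(turns: List[Turn], tts: Callable[[str, str], bytes]) -> bytes:
    """Run each turn through a TTS callable and concatenate the results."""
    return b"".join(tts(t.speaker, t.text) for t in turns)


def fake_tts(speaker: str, text: str) -> bytes:
    """Stand-in TTS engine: returns labelled text bytes instead of audio."""
    return f"[{speaker}] {text}\n".encode("utf-8")


def pdf_to_audio(pdf_text: str,
                 tts: Callable[[str, str], bytes] = fake_tts,
                 monologue: bool = False) -> bytes:
    """PDF text -> markdown -> script -> 'audio' bytes."""
    markdown = pdf_to_markdown(pdf_text)
    speakers = ("Narrator",) if monologue else ("Host", "Guest")
    return synthesize(create_dialogue(markdown, speakers), tts)
```

Passing the TTS engine in as a callable keeps the speech stage swappable, so a real engine could replace `fake_tts` without touching the earlier stages.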
Minimum System Requirements
There are two ways of running this blueprint.
- NVIDIA Hosted-endpoints: All model inference is performed on NVIDIA's cloud infrastructure.
- NVIDIA RTX AI PCs and Workstations: Current support includes NVIDIA GeForce RTX 4090, GeForce RTX 5090, or NVIDIA RTX 6000 Ada GPUs.
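As a rough illustration of how an application might switch between these two modes, the sketch below resolves an inference base URL from a mode string and checks a GPU name against the supported list above. The names `resolve_target`, `is_supported_gpu`, and the environment variable names are hypothetical, and both default URLs are assumptions; check your own deployment for the actual endpoints.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed defaults, not taken from this document.
HOSTED_BASE_URL = "https://integrate.api.nvidia.com/v1"
LOCAL_BASE_URL = "http://localhost:8000/v1"

# GPU models named in the requirements above.
SUPPORTED_GPUS = ("GeForce RTX 4090", "GeForce RTX 5090", "RTX 6000 Ada")


@dataclass(frozen=True)
class InferenceTarget:
    mode: str       # "hosted" or "local"
    base_url: str   # where inference requests would be sent


def resolve_target(mode: str,
                   env: Optional[Mapping[str, str]] = None) -> InferenceTarget:
    """Pick a base URL for the given mode, allowing an env override."""
    env = os.environ if env is None else env
    if mode == "hosted":
        return InferenceTarget("hosted", env.get("HOSTED_BASE_URL", HOSTED_BASE_URL))
    if mode == "local":
        return InferenceTarget("local", env.get("LOCAL_BASE_URL", LOCAL_BASE_URL))
    raise ValueError(f"unknown mode {mode!r}; expected 'hosted' or 'local'")


def is_supported_gpu(gpu_name: str) -> bool:
    """Case-insensitive substring match against the supported GPU list."""
    name = gpu_name.lower()
    return any(model.lower() in name for model in SUPPORTED_GPUS)
```

For example, `resolve_target("local")` returns the local default unless `LOCAL_BASE_URL` is set in the environment.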
Software used in this blueprint
NVIDIA Hosted-endpoints: NIM microservices
NVIDIA RTX AI PCs and Workstations - NIM microservices
- mistral-nemo-12b-instruct for podcast transcript generation
- nv-yolox-page-elements-v1 and paddleocr for object detection and extraction. Used in conjunction with the NVIDIA-Ingest pipeline for PDF Ingestion and Extraction.
- parakeet-ctc-0.6b-asr for speech to text
- llama-3.2-nv-embedqa-1b-v2 for embedding and retrieval use-cases
3rd Party Software
- LangChain
- Docling Document Parser for PDF to Markdown Service
- ElevenLabs for Text-to-Speech Service
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure the models meet requirements for the relevant industry and use case and address unforeseen product misuse. For more detailed information on ethical considerations for the models, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI concerns here.
License
Use of the models in this blueprint is governed by the NVIDIA AI Foundation Models Community License.
Terms of Use
GOVERNING TERMS: The blueprint is governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products.
Meta Llama 3.3-70B
GOVERNING TERMS: The NIM container is governed by the NVIDIA Software License Agreement and the Product Specific Terms for AI Products.
Meta Llama 3.1 8B, Llama 3.1 70B Instruct, Llama 3.1 405B Instruct
GOVERNING TERMS: The NIM container is governed by the NVIDIA Software License Agreement and the Product Specific Terms for AI Products; use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement. ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.
Mistral Nemo 12B Instruct
GOVERNING TERMS: The NIM container is governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products; and the use of this model is governed by the AI Foundation Models Community License Agreement. ADDITIONAL INFORMATION: Apache 2.0.
NeMo Retriever PaddleOCR
GOVERNING TERMS: The NIM container is governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products; and the use of this model is governed by the AI Foundation Models Community License Agreement.
NVIDIA Retrieval QA Llama 3.2 1B Embedding v2, NeMo Retriever YOLOX Structured Images v1, Parakeet 0.6b CTC en-US NIM
GOVERNING TERMS: The NIM container is governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products; and the use of this model is governed by the AI Foundation Models Community License Agreement.