
Topic Modeling

45 MIN

Extract insights from massive text datasets using cuML's GPU-accelerated BERTopic

BERTopic · Data Science · Machine Learning · NLP · cuML
View on GitHub

Basic idea

Topic modeling helps you discover hidden themes in large document collections—but traditional methods crawl when datasets grow to millions of records. This playbook shows how to process 40 million Amazon product reviews in minutes using GPU-accelerated BERTopic.

BERTopic combines transformer embeddings with clustering to extract human-readable topics from text. By swapping CPU-based UMAP and HDBSCAN with GPU-accelerated versions from RAPIDS cuML, you get the same results dramatically faster—no code changes required.

  • Drop-in GPU acceleration: Load cuml.accel and your existing UMAP/HDBSCAN code runs on GPU automatically
  • Scale to millions: Process datasets that would take hours on CPU in minutes on GPU
  • Interactive visualizations: Explore topic distributions, relationships, and document clusters
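As a sketch, the drop-in swap is a single notebook cell run before the usual imports (assuming RAPIDS cuML is installed; the UMAP/HDBSCAN parameter values shown are illustrative, not taken from the playbook):

```python
# Notebook cell: enable cuML's accelerator BEFORE importing the CPU
# libraries it intercepts. This is the zero-code-change path: the
# imports below are unchanged from a CPU workflow, but now resolve to
# GPU implementations under the hood.
%load_ext cuml.accel

from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

topic_model = BERTopic(
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0),
    hdbscan_model=HDBSCAN(min_cluster_size=100, prediction_data=True),
)
```

Because only the load order changes, the same cell runs on CPU if you delete the magic line, which makes before/after timing comparisons straightforward.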

What you'll accomplish

You'll run a complete topic modeling pipeline on 40 million product reviews and generate interactive visualizations of discovered topics.

By the end, you'll be able to:

  • Use cuML's drop-in accelerators for UMAP and HDBSCAN
  • Generate sentence embeddings at scale with SentenceTransformers
  • Create topic visualizations including heatmaps, barcharts, and document datamaps
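Generating embeddings for 40 million reviews is usually done in chunks so GPU memory stays bounded. A minimal sketch of that pattern (the chunk size and model name in the comment are illustrative assumptions, not values from the playbook):

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items, so millions of reviews
    can stream through the embedding model without being materialized
    in memory all at once."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# Hypothetical usage with SentenceTransformers (not executed here):
# model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
# for chunk in batched(reviews, 100_000):
#     embeddings.append(model.encode(chunk, batch_size=1024))
```

The precomputed embeddings can then be passed to `BERTopic.fit_transform(docs, embeddings)` so the transformer forward pass is not repeated on every modeling run.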

What to know before starting

  • Experience with Python and Jupyter notebooks
  • Basic understanding of machine learning concepts (embeddings, clustering)
  • Familiarity with pandas DataFrames

Prerequisites

Hardware Requirements:

  • NVIDIA DGX Station with GB300 GPU
  • Minimum 64GB GPU memory for processing 40M documents
  • At least 50GB available storage for dataset and embeddings

Software Requirements:

  • Conda (Miniconda or Anaconda): conda --version
  • CUDA 13.0 compatible drivers: nvidia-smi
  • Network access to download the Amazon Reviews dataset (~14GB compressed)
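A quick preflight for these prerequisites can look like the following (a sketch that prints a hint rather than failing when a tool is missing):

```shell
# Preflight checks for the stated requirements. These only report
# status; they do not install anything.
command -v conda >/dev/null && conda --version || echo "conda not found"
command -v nvidia-smi >/dev/null && nvidia-smi -L || echo "nvidia-smi not found"
df -h .   # confirm roughly 50GB free for the dataset and embeddings
```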

Ancillary files

All required assets are in the playbook directory nvidia/station-topic-modeling/assets (see Instructions, Step 7). Key file:

  • video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_RTX6000_40M.ipynb - Complete Jupyter notebook with GPU-accelerated topic modeling pipeline (filename reflects original demo hardware; the notebook runs on GB300 and other NVIDIA GPUs)

Time & risk

  • Estimated time: 45 minutes (includes environment setup, dataset download, and embedding generation)
  • Risk level: Low
    • Large dataset download (~14GB) may take time depending on network speed
    • Embedding generation requires significant GPU memory
  • Rollback: Delete the downloaded dataset and any generated embedding files to restore state
  • Last Updated: 03/02/2026
    • First Publication
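The rollback step above amounts to a pair of deletions; a hedged sketch, where the directory names are placeholders for wherever you stored the dataset and embeddings, and the environment name is an assumption:

```shell
# Hypothetical rollback: remove the downloaded dataset and any
# generated embedding files. Paths are illustrative placeholders.
DATA_DIR=./amazon_reviews
EMB_DIR=./embeddings
rm -rf "$DATA_DIR" "$EMB_DIR"

# Optionally remove the conda environment created for the playbook
# (environment name is an assumption):
# conda env remove -n bertopic-gpu
```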

Resources

  • BERTopic Documentation
  • RAPIDS cuML Documentation