Topic Modeling

45 MIN

Extract insights from massive text datasets using BERTopic, GPU-accelerated with cuML

(DGX Station) Hugging Face cache permissions

On DGX Station, ensure the Hugging Face cache is writable so model downloads succeed:

mkdir -p $HOME/.cache/huggingface
sudo chown -R $USER:$USER $HOME/.cache/huggingface 2>/dev/null || true
sudo chmod -R u+rwX $HOME/.cache/huggingface 2>/dev/null || true

If you see "Permission denied" when downloading models later, re-run the chown/chmod lines above, substituting your actual username (e.g. nvidia) if $USER does not resolve correctly.
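A quick way to confirm the fix worked before downloading any models (a minimal check against the default Hugging Face cache path):

```shell
# Create the default Hugging Face cache directory if missing,
# then verify the current user can write to it.
mkdir -p "$HOME/.cache/huggingface"
if [ -w "$HOME/.cache/huggingface" ]; then
    echo "huggingface cache is writable"
else
    echo "huggingface cache is NOT writable -- re-run the chown/chmod commands"
fi
```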

Install RAPIDS cuDF and cuML

Create a new conda environment with RAPIDS libraries for GPU-accelerated data processing.

conda create -n rapids-25.10 \
  -c rapidsai -c conda-forge \
  cudf=25.10 cuml=25.10 python=3.11 'cuda-version=13.0'

This installs cuDF (GPU DataFrame library) and cuML (GPU machine learning library) that provide drop-in acceleration for pandas and scikit-learn operations.
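The "drop-in" part means unmodified pandas code runs on the GPU once the accelerator is loaded. A minimal sketch (the column names and data here are illustrative, not from the Amazon dataset):

```python
# In a notebook, load the accelerator BEFORE importing pandas:
#   %load_ext cudf.pandas
# After that, ordinary pandas calls are dispatched to cuDF on the GPU,
# falling back to CPU pandas for anything cuDF does not support.
import pandas as pd

# Illustrative reviews frame; the real notebook loads millions of rows.
df = pd.DataFrame({
    "rating": [5, 1, 4, 5, 2],
    "text": ["great", "bad", "good", "great", "meh"],
})

# Standard pandas operations -- groupby, aggregation, filtering --
# run GPU-accelerated when cudf.pandas is active, with no code changes.
mean_by_text = df.groupby("text")["rating"].mean()
print(mean_by_text["great"])  # 5.0
```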

Activate the conda environment

conda activate rapids-25.10

Install machine learning packages

Install UMAP, HDBSCAN, BERTopic, and supporting libraries for topic modeling.

pip install \
  transformers datasets sentence-transformers \
  umap-learn hdbscan==0.8.40 bertopic matplotlib \
  scikit-learn==1.4.2 datamapplot

These packages provide:

  • sentence-transformers: Generate text embeddings
  • umap-learn / hdbscan: Dimensionality reduction and clustering (GPU-accelerated via cuML)
  • bertopic: Topic modeling framework
  • datamapplot: Document visualization

NOTE

Pip may report dependency conflicts (e.g. dask/distributed being downgraded, or warnings about cuml/rapids-dask-dependency). BERTopic and the notebook will still run despite these warnings. If you need cuML and RAPIDS Dask to work together, keep the conda-default dask versions and install only the BERTopic stack via pip in a separate environment; see Troubleshooting.

Install visualization packages

Install JupyterLab and visualization libraries for interactive topic exploration.

conda install -c conda-forge \
    notebook=7.5.0 \
    jupyterlab=4.5.0 \
    ipywidgets=8.1.8 \
    jupyterlab-widgets=3.0.16 \
    bokeh=3.8.1 \
    colorcet=3.1.0 \
    datashader=0.18.2 \
    plotly=6.5.0

If conda reports PackagesNotFoundError for jupyterlab-widgets (e.g. on some platforms), install it with pip:

pip install jupyterlab-widgets

Install compatible PyTorch

Install PyTorch with CUDA 13.0 support for GPU-accelerated embedding generation.

pip install torch==2.9.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

Clone the repository and download the dataset

Clone the playbook repository and download the Amazon Electronics Reviews dataset.

git clone https://github.com/NVIDIA/dgx-station-playbooks
cd dgx-station-playbooks/nvidia/station-topic-modeling/assets

Download the dataset (~14GB compressed):

wget https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/Electronics.jsonl.gz
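The file is gzip-compressed JSON Lines (one review object per line). The notebook loads it with cuDF, but you can peek at the format with only the standard library; a sketch, where the rating and text field names are assumed from the Amazon Reviews 2023 schema:

```python
import gzip
import json

def iter_reviews(path, limit=None):
    """Yield review dicts from a gzip-compressed JSON Lines file,
    streaming so the full ~14 GB file is never held in memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)

# Inspect the first few reviews without decompressing everything:
# for review in iter_reviews("Electronics.jsonl.gz", limit=3):
#     print(review.get("rating"), review.get("text", "")[:80])
```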

Launch JupyterLab

Start JupyterLab from the assets directory:

jupyter lab

Select the rapids-25.10 kernel

In JupyterLab, open the notebook video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_RTX6000_40M.ipynb.

Select the rapids-25.10 kernel from the kernel selector in the top right corner of the notebook interface.

Execute all cells

Run all cells in the notebook sequentially. The notebook will:

  1. Load data with cuDF: GPU-accelerated pandas via %load_ext cudf.pandas
  2. Preprocess text: Clean and normalize review text
  3. Generate embeddings: Create sentence embeddings (~8 minutes for 40M documents)
  4. Enable GPU acceleration: Load cuML accelerators via %load_ext cuml.accel
  5. Run BERTopic: Cluster documents into topics using GPU-accelerated UMAP and HDBSCAN
  6. Visualize results: Generate interactive topic visualizations
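Under the hood, BERTopic labels each cluster using class-based TF-IDF (c-TF-IDF): term frequencies are computed per cluster rather than per document, so the highest-scoring terms become the topic's keywords. A simplified pure-Python sketch of that scoring step (BERTopic's implementation is vectorized and differs in detail):

```python
import math
from collections import Counter

def ctfidf(clusters):
    """clusters: dict mapping cluster id -> list of token lists.
    Returns dict mapping cluster id -> {term: c-TF-IDF score}."""
    # Term frequency per cluster: all documents in a cluster are
    # concatenated and treated as one large "class document".
    tf = {c: Counter(tok for doc in docs for tok in doc)
          for c, docs in clusters.items()}
    # Total frequency of each term across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of tokens per cluster (the "A" in c-TF-IDF).
    avg = sum(total.values()) / len(clusters)
    # Score = tf(term, cluster) * log(1 + A / tf(term, corpus)):
    # terms frequent in one cluster but rare overall score highest.
    return {
        c: {t: freq * math.log(1 + avg / total[t])
            for t, freq in counts.items()}
        for c, counts in tf.items()
    }

# Toy example: two "clusters" of tokenized reviews.
scores = ctfidf({
    0: [["battery", "life", "battery"], ["battery", "charger"]],
    1: [["screen", "bright"], ["screen", "resolution"]],
})
top = max(scores[0], key=scores[0].get)
print(top)  # "battery" dominates cluster 0
```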

Explore the results

After the notebook completes, you'll have:

  • Topic information table: Discovered topics with keywords and document counts
  • Topic visualization: Interactive 2D map of topic relationships
  • Bar chart: Top keywords for each topic
  • Heatmap: Topic similarity matrix
  • Document datamap: Visual clustering of documents by topic

Cleanup (optional)

Remove the conda environment when finished:

conda deactivate
conda env remove -n rapids-25.10

Remove the downloaded dataset:

rm Electronics.jsonl.gz

Remove generated embedding files and the cloned playbook directory if you no longer need them:

# Optional: remove Hugging Face cache (embedding cache from the notebook)
rm -rf ~/.cache/huggingface

# From the parent of dgx-station-playbooks/, remove the cloned repo
rm -rf dgx-station-playbooks/

Next steps

Apply this workflow to your own datasets:

  1. Adjust data size: Modify nrows parameter when loading data to process smaller subsets
  2. Tune clustering: Experiment with min_cluster_size and min_samples in HDBSCAN
  3. Try different embedding models: Swap all-MiniLM-L6-v2 for domain-specific models
  4. Export topics: Save the topic model using topic_model.save() for later analysis
  5. Monitor GPU usage: Run nvidia-smi -l 1 to watch GPU utilization during processing