Extract insights from massive text datasets using cuML's GPU-accelerated BERTopic
On DGX Station, ensure the Hugging Face cache is writable so model downloads succeed:
mkdir -p $HOME/.cache/huggingface
sudo chown -R $USER:$USER $HOME/.cache/huggingface 2>/dev/null || true
sudo chmod -R u+rwX $HOME/.cache/huggingface 2>/dev/null || true
If you see "Permission denied" when downloading models later, run the chown/chmod lines with your username (e.g. nvidia).
Create a new conda environment with RAPIDS libraries for GPU-accelerated data processing.
conda create -n rapids-25.10 \
-c rapidsai -c conda-forge \
cudf=25.10 cuml=25.10 python=3.11 'cuda-version=13.0'
This installs cuDF (GPU DataFrame library) and cuML (GPU machine learning library) that provide drop-in acceleration for pandas and scikit-learn operations.
conda activate rapids-25.10
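To confirm the drop-in acceleration works, here is a minimal sketch (the tiny DataFrame is illustrative; in the notebook the same effect comes from the %load_ext magics):
import cudf.pandas
cudf.pandas.install()   # route pandas calls through cuDF on the GPU
import cuml.accel
cuml.accel.install()    # route scikit-learn calls through cuML on the GPU

import pandas as pd
from sklearn.cluster import KMeans

# Ordinary pandas + scikit-learn code now runs GPU-accelerated where supported
df = pd.DataFrame({"x": [1.0, 2.0, 8.0, 9.0], "y": [1.0, 1.5, 8.5, 9.0]})
labels = KMeans(n_clusters=2, n_init="auto").fit_predict(df)
print(labels)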
Install UMAP, HDBSCAN, BERTopic, and supporting libraries for topic modeling.
Note: datamapplot will upgrade dask/distributed — the next step pins them back.
pip install \
transformers datasets sentence-transformers \
umap-learn hdbscan==0.8.40 bertopic matplotlib \
scikit-learn==1.4.2 datamapplot
Pin dask/distributed back to RAPIDS-compatible versions:
pip install "dask==2025.9.1" "distributed==2025.9.1"
These packages provide sentence-transformers for GPU embedding generation, umap-learn and hdbscan for dimensionality reduction and clustering, bertopic for the topic modeling framework, and datamapplot for topic map visualization.
NOTE: Pip may report dependency conflicts (e.g. dask/distributed downgraded, cuml/rapids-dask-dependency). BERTopic and the notebook can still run. If you need cuML and RAPIDS dask together, consider keeping the conda default dask versions and installing only the BERTopic stack via pip in a separate env; see Troubleshooting.
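To see how these packages fit together, here is a minimal BERTopic pipeline that swaps in cuML's UMAP and HDBSCAN for the CPU implementations (parameter values are illustrative, and docs stands in for your list of document strings):
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic

# GPU UMAP reduces the high-dimensional sentence embeddings
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
# GPU HDBSCAN clusters the reduced embeddings into topics
hdbscan_model = HDBSCAN(min_cluster_size=50, min_samples=10,
                        gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)  # docs: list[str] of reviews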
Install JupyterLab and visualization libraries for interactive topic exploration.
conda install -c conda-forge \
notebook=7.5.0 \
jupyterlab=4.5.0 \
ipywidgets=8.1.8 \
jupyterlab-widgets=3.0.16 \
bokeh=3.8.1 \
colorcet=3.1.0 \
datashader=0.18.2 \
plotly=6.5.0
If conda reports PackagesNotFoundError for jupyterlab-widgets (this happens on some platforms), install it with pip:
pip install jupyterlab-widgets
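These libraries back the interactive charts; for example, BERTopic's built-in visualizations are plotly figures that render inline in JupyterLab. A sketch, continuing the topic_model from above:
# Inter-topic distance map as an interactive plotly figure
fig = topic_model.visualize_topics()
fig.write_html("topics.html")  # save a standalone copy for sharing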
Install PyTorch with CUDA 13.0 support for GPU-accelerated embedding generation.
pip install torch==2.9.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
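A quick check that the CUDA build landed correctly:
import torch

print(torch.cuda.is_available())       # expect True
print(torch.cuda.get_device_name(0))   # the DGX Station GPU
print(torch.version.cuda)              # expect a 13.x build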
Clone the playbook repository and download the Amazon Electronics Reviews dataset.
git clone https://github.com/NVIDIA/dgx-station-playbooks
cd dgx-station-playbooks/nvidia/station-topic-modeling/assets
Download the dataset (~14GB compressed):
wget https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/Electronics.jsonl.gz
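Once the download finishes, loading a slice of the reviews is straightforward; a minimal sketch (with cudf.pandas loaded the read is GPU-accelerated, and the "text" column name follows the Amazon Reviews 2023 schema):
import pandas as pd

# Read the first 100k reviews; gzip compression is inferred from the extension
reviews = pd.read_json("Electronics.jsonl.gz", lines=True, nrows=100_000)
docs = reviews["text"].fillna("").astype(str).tolist()
print(len(docs), "reviews loaded")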
The notebook files are stored in Git LFS; pull them before opening JupyterLab, or it will throw a NotJSONError when you try to open them:
conda install -c conda-forge git-lfs
git lfs install
git lfs pull
Start JupyterLab from the assets directory:
jupyter lab
In JupyterLab, open the notebook video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_1M.ipynb.
Select the rapids-25.10 kernel from the kernel selector in the top right corner of the notebook interface.
Run all cells in the notebook sequentially. The first cells load the GPU accelerators so that the notebook's pandas and scikit-learn calls run on the GPU:
%load_ext cudf.pandas
%load_ext cuml.accel
The notebook then loads the reviews, generates sentence embeddings, fits the BERTopic model, and renders the topic visualizations. After it completes, you'll have a fitted BERTopic model, the generated embedding files, and interactive topic maps of the review corpus.
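The embedding step inside the notebook follows the standard sentence-transformers pattern; a minimal sketch (batch size is illustrative; docs and topic_model are as in the sketches above):
from sentence_transformers import SentenceTransformer

# Encode all documents on the GPU; the model downloads into the HF cache
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
embeddings = model.encode(docs, batch_size=256, show_progress_bar=True)

# Passing precomputed embeddings to BERTopic avoids re-encoding on refits
topics, probs = topic_model.fit_transform(docs, embeddings)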
Remove the conda environment when finished:
conda deactivate
conda env remove -n rapids-25.10
Remove the downloaded dataset:
rm Electronics.jsonl.gz
Remove generated embedding files and the cloned playbook directory if you no longer need them:
# Optional: remove Hugging Face cache (embedding cache from the notebook)
rm -rf ~/.cache/huggingface
# From the parent of dgx-station-playbooks/, remove the cloned repo
rm -rf dgx-station-playbooks/
Apply this workflow to your own datasets:
- Adjust the nrows parameter when loading data to process smaller subsets
- Tune min_cluster_size and min_samples in HDBSCAN
- Swap all-MiniLM-L6-v2 for domain-specific models
- Save fitted models with topic_model.save() for later analysis (see the sketch below)
- Run nvidia-smi -l 1 to watch GPU utilization during processing
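For the topic_model.save() item, a minimal save/load round trip (the path and serialization choices are illustrative):
# Safetensors serialization keeps the saved model portable across sessions
topic_model.save("electronics_topics",
                 serialization="safetensors",
                 save_ctfidf=True,
                 save_embedding_model="all-MiniLM-L6-v2")

from bertopic import BERTopic
loaded = BERTopic.load("electronics_topics")
print(loaded.get_topic_info().head())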