Digital Trace Blog

Studying Canada's Media Ecosystem

Topic Modeling with DT Data

Topic modeling plays an essential role in helping us make sense of the large volumes of unstructured text inherent in digital trace data. It is fully embedded in our data pipeline: we gather data, process it through topic models, index the outputs, and explore the results using visualization tools.

Our open-source implementation of this pipeline is available in our GitHub repository, where the README walks through how to set up and run the code. In this post, we focus on comparing the two primary approaches we rely on for topic modeling in digital trace data: BERTopic and Toponymy. Both methods aim to identify coherent themes in text, yet they do so with fundamentally different assumptions and computational strategies.

BERTopic: Embedding-Driven Topic Modeling

BERTopic is built on the idea that transformer-based embeddings provide a rich semantic representation of text. Instead of relying on word frequencies, it transforms each document into a numerical vector using models like Sentence-BERT, capturing conceptual similarity.
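As a minimal sketch of that embedding step (assuming the sentence-transformers package is installed; the example documents are invented, and the model choice is illustrative, though "all-MiniLM-L6-v2" is BERTopic's default English model):

    from sentence_transformers import SentenceTransformer

    # Each document becomes one dense vector capturing its meaning,
    # so conceptually similar texts end up close together in vector space.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    docs = [
        "Polls open across Canada for the federal election",
        "Advance voting turnout hits a record high",
    ]
    embeddings = embedder.encode(docs)
    print(embeddings.shape)  # (2, 384): one 384-dimensional vector per document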

Because these vectors live in high-dimensional space, BERTopic applies dimension reduction (typically via UMAP) to project them into a representation suitable for clustering. It then uses HDBSCAN, a density-based algorithm that automatically infers the number of clusters and identifies noise points. This makes it well suited to heterogeneous digital trace data.

Once clusters are identified, BERTopic produces interpretable representations using Class-Based TF-IDF (c-TF-IDF). This technique highlights words that uniquely characterize each cluster relative to the rest of the corpus. These keywords can optionally be passed to an LLM for clearer labels, but BERTopic does not require any external model calls and can run completely offline.
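Putting the pieces together, here is a hedged end-to-end sketch of this pipeline (the UMAP and HDBSCAN parameters shown are illustrative defaults, not tuned values from our pipeline):

    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    # UMAP compresses the embeddings to a few dimensions for clustering;
    # HDBSCAN then finds dense clusters and marks sparse points as noise (-1).
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
    hdbscan_model = HDBSCAN(min_cluster_size=25, prediction_data=True)

    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    topics, probs = topic_model.fit_transform(docs)  # docs: list of raw strings

    # c-TF-IDF keywords for each topic; topic -1 collects the noise documents.
    print(topic_model.get_topic_info())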

The topic modeling below was generated by the BERTopic model for Canada's Election Day, April 28, 2025, from digital trace data.

Toponymy: LLM-Assisted Topic Construction and Interpretation

Toponymy approaches topic modeling from a different conceptual angle, prioritizing semantic interpretation through large language models (LLMs). It separates topic construction into a structural phase and an interpretive phase.

The structural phase may use clustering results from embeddings (sometimes even BERTopic's) or rely on traditional methods. The distinguishing step is what happens afterward: the algorithm selects a set of representative elements (sample documents, key sentences) and provides them to an LLM.
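One common way to pick those representative elements is to take the documents closest to the cluster centroid in embedding space. The sketch below shows that generic heuristic; it is not necessarily Toponymy's exact selection rule, and the function name is our own:

    import numpy as np

    def pick_representatives(cluster_embeddings, cluster_docs, k=5):
        """Return the k documents nearest the cluster centroid (cosine similarity)."""
        centroid = cluster_embeddings.mean(axis=0)
        sims = cluster_embeddings @ centroid / (
            np.linalg.norm(cluster_embeddings, axis=1) * np.linalg.norm(centroid)
        )
        top = np.argsort(sims)[::-1][:k]  # indices of the k most central documents
        return [cluster_docs[i] for i in top]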

The LLM is then asked to infer the underlying theme, transforming the clusters into human-interpretable topics. The model generates a short label and often a fuller description of what unifies the documents, creating a richer semantic representation closer to how a human analyst would structure the material. Therefore, using Toponymy requires access to an LLM, either through an API key or a local setup.
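Concretely, the interpretive step looks something like the following sketch. It illustrates the general pattern rather than Toponymy's internal code: the function name, prompt wording, and model choice ("gpt-4o-mini") are our own assumptions, built on the standard openai client:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def label_cluster(sample_docs):
        """Ask an LLM to name the theme shared by a cluster's exemplar documents."""
        prompt = (
            "The documents below come from one cluster. Reply with a short "
            "topic label and a one-sentence description of what unifies them.\n\n"
            + "\n---\n".join(sample_docs)
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; any chat-capable model works
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content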

The topic modeling below was generated by the Toponymy model for Canada Election Day, April 28, 2025.

Summary of Methodological Distinctions
Feature            | BERTopic                          | Toponymy
LLM requirement    | Not required                      | Required (API or local LLM)
Offline use        | Fully supported                   | Supported only with local LLM
Topic construction | Embedding + clustering + c-TF-IDF | LLM-driven semantic interpretation
Topic style        | Stable, keyword-based             | Descriptive, human-like summaries
Performance        | Efficient for large datasets      | Dependent on LLM inference speed

Recommendations & Notes

You can find detailed instructions on how to run the code in our project’s README.

  1. LLM Dependencies: For Toponymy, an OpenAI API key is necessary to access external LLMs. However, you can run BERTopic entirely without an API key. If you have a local LLM model you wish to use, you can find instructions on how to connect it to our code here.
  2. Handling Big Data: For large-scale datasets, we suggest a sampling strategy before training. Train the model on a manageable subset of the data (sized to your hardware configuration), save the trained model, and then perform inference on the full dataset; see the sketch after this list.
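
A minimal sketch of that workflow with BERTopic (all_docs and the sample size of 100,000 are hypothetical placeholders; tune both to your data and hardware):

    import random

    from bertopic import BERTopic

    # Fit on a random sample that fits in memory, not the full corpus.
    sample_size = 100_000  # hypothetical; adjust to your hardware
    sample = random.sample(all_docs, min(sample_size, len(all_docs)))

    topic_model = BERTopic()
    topic_model.fit(sample)
    topic_model.save("election_day_model")  # persist the trained model

    # Inference over the full dataset is far cheaper than training on it.
    topics, probs = topic_model.transform(all_docs)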
