7 Comments
Aug 21, 2023 · Liked by Maarten Grootendorst

I really enjoy reading your post. It's intuitive and easy to follow. Thank you for sharing your work!

author

Thanks for the kind words! If there's any content you'd like to see, please let me know :-)


Is it not viable to let the LLM do the entire task of topic modelling, instead of just the topic representation at the end? I wonder if this would give better results.

author

Yes and no.

If the intent is to pass all documents to the LLM, that can be computationally prohibitive, especially with large datasets. Passing a million documents is, at the moment, not an easy feat, even with longer context lengths.

Passing a subset of documents is currently a more viable approach: you pass a small subset that your LLM can handle, based on hardware (VRAM/GPU) and context length, and let the LLM infer the topics. It often requires an iterative approach to get the topics right, though, and you tend to miss out on uncommon topics.
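The subset approach described here can be sketched roughly as follows. This is a minimal illustration, not BERTopic code; the function name and the document/character budgets are hypothetical choices standing in for whatever your hardware and context length allow.

```python
import random


def sample_documents_for_prompt(docs, max_docs=20, max_chars=4000, seed=42):
    """Sample a small subset of documents that fits a context budget and
    build a prompt asking the LLM to infer topics from that subset."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    subset = rng.sample(docs, min(max_docs, len(docs)))

    # Greedily keep documents until the rough character budget is spent.
    kept, used = [], 0
    for doc in subset:
        if used + len(doc) > max_chars:
            break
        kept.append(doc)
        used += len(doc)

    bullet_list = "\n".join(f"- {doc}" for doc in kept)
    return "Infer the main topics from these documents:\n" + bullet_list
```

Because the sample is small, rerunning with different seeds (the iterative approach mentioned above) helps surface topics a single draw would miss, although rare topics can still be absent from every sample.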

It is interesting that you mention this, since I am currently exploring some possibilities for integrating something similar into BERTopic!


Looks like the llama2 model needs a strictly positive temperature rather than 0.0.
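One simple way to avoid that error is to clamp the sampling temperature before passing it to the model. The helper below is hypothetical (not part of any library), and the 0.01 floor is an arbitrary small positive value chosen only for illustration.

```python
def safe_temperature(temperature: float, minimum: float = 0.01) -> float:
    """Clamp a requested sampling temperature to a strictly positive value,
    since some backends (e.g. llama2) reject a temperature of exactly 0.0."""
    return max(temperature, minimum)
```

With this, a requested temperature of 0.0 becomes 0.01 (close to greedy decoding), while any already-positive value passes through unchanged.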

author

I see, this was updated a few days ago! I'll make sure to change it. Thanks for spotting it!


I was just trying it and got that error. Thanks for the great post! I'm also exploring different representation models like gpt-3.5 or gpt-4.
