I really enjoyed reading your post. It's intuitive and easy to follow. Thank you for sharing your work!
Thanks for the kind words! If there's any content you'd like to see, please let me know :-)
Is it not viable to let the LLM do the entire task of topic modelling instead of just the topic representation at the end? I wonder if this would give better results.
Yes and no.
If the intent is to pass all documents to the LLM, then that might be computationally difficult, especially with large datasets. Passing a million documents is, at the moment, not an easy feat even with longer context lengths.
Passing a subset of documents is currently a more viable approach: you pass a small subset that your LLM can handle, based on your hardware (VRAM/GPU) and context length, and let the LLM infer the topics. It does, however, often require an iterative approach to get the topics right, and you tend to miss out on uncommon topics.
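To make the subset idea concrete, here is a minimal sketch of packing documents into batches that fit an assumed context budget. The function name, the token budget, and the words-to-tokens ratio are all illustrative assumptions, not part of any real API; a proper implementation would use the model's actual tokenizer.

```python
def batch_documents(docs, max_tokens=4096, tokens_per_word=1.3):
    """Greedily group documents into batches that fit a context window.

    Each batch could then be sent to the LLM in a single prompt asking
    it to infer the topics covered by those documents.
    """
    batches, current, used = [], [], 0
    for doc in docs:
        # Rough token estimate; a real tokenizer would be more accurate.
        cost = int(len(doc.split()) * tokens_per_word) + 1
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches
```

You would then prompt the LLM once per batch and merge the inferred topics across batches, which is where the iterative refinement mentioned above comes in.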
It is interesting that you mention this since I am currently exploring some possibilities for integrating something similar into BERTopic!
Looks like the llama2 model needs a strictly positive temperature instead of 0.0.
I see, this was updated a few days ago! I'll make sure to change it. Thanks for spotting it!
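Since the model rejects a temperature of exactly 0.0, one minimal workaround is to clamp the value to a small positive floor before passing it to the generation pipeline. The helper name and the floor value here are illustrative, not part of BERTopic or llama2's API:

```python
def safe_temperature(temperature: float, floor: float = 0.01) -> float:
    """Return a strictly positive temperature accepted by llama2.

    A temperature of 0.0 (greedy decoding) is rejected by the model,
    so we substitute a small positive value that behaves near-greedily.
    """
    return max(temperature, floor)
```

A near-zero temperature like 0.01 keeps generation close to deterministic while satisfying the strictly-positive requirement.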
I was just trying it and got that error. Thanks for the great post! I'm also exploring different representation models like GPT-3.5 and GPT-4.