The article is really good and helpful. Informative.

Thanks for the article, Maarten. It really helps with diving into AI.

I'm trying to analyse 6.6K reviews about retail companies that I web-scraped from the internet. The idea was to extract keywords from each review with KeyLLM. (I have two Tesla P40s in my home PC, so I think it won't take too long.) As the next step, I want to visualize them with UMAP, like you did in a previous article. But on the first step I received a message that the maximum context length (512) was exceeded.

Could you please tell me if this limit can be bypassed? Some of the reviews I'm analysing are close to 4K tokens.

You would have to either chunk the reviews to make sure you stay within the token limit, or increase the context length. You can find the relevant ctransformers parameters here: https://github.com/marella/ctransformers#config
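
As a minimal sketch of the second option, assuming the quantized Mistral setup from the article (point the names at whatever GGUF model you run locally), the context window can be raised through the `context_length` config parameter:

```python
from ctransformers import AutoModelForCausalLM

# Raise the context window so ~4K-token reviews fit.
# Model and file names follow the article's example; adjust them
# to whatever quantized model you actually run locally.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,         # offload layers to your GPUs
    context_length=4096,   # the limit you are currently hitting is 512
    hf=True,               # expose a Hugging Face-compatible interface
)
```

Note that a larger context window increases memory usage during inference, so how far you can push it is ultimately bounded by your GPUs.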

Thanks.

Is it possible to use it to anonymize documents?

Let's suppose I want to anonymize some docs for a list of topics, and let's assume I also have a small dictionary for these topics (to be complete).

How would you deal with this use case, please?

Anonymization can be a tricky subject. It also depends on the extent to which you want to anonymize. If it is only names and addresses, you might be able to use a NER-like algorithm to detect and replace them.
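
As a minimal sketch of that NER-based route (assuming spaCy and its `en_core_web_sm` model, which are not part of the article's stack), you could detect entities and swap them for placeholder tags:

```python
import spacy

# pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def anonymize(text: str) -> str:
    """Replace detected names and locations with placeholder tags."""
    doc = nlp(text)
    redacted = text
    # Walk entities back-to-front so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "GPE", "LOC", "FAC"}:
            redacted = (
                redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
            )
    return redacted

# Prints the sentence with detected entities replaced by their labels
print(anonymize("John Smith lives at 42 Baker Street in London."))
```

For topic-specific anonymization, your small dictionary could be applied the same way: match the dictionary terms in the text and replace each match with a tag before or after the NER pass.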

Thanks for this nice article. A few questions on the part where you mention "We assume that documents that are highly similar will have the same keywords, so there would be no need to extract keywords for all documents":

1. How do you choose which of the semantically close documents KeyLLM will use for keyword extraction?
2. What would the threshold be for the similarity check that groups similar documents, and is that something we can configure?

1. Behind the scenes, it uses sentence-transformers' `community_detection`, which finds groups of documents that are highly similar to one another.

2. In the last few examples, you can see me using the `threshold` parameter, which configures the grouping of similar documents. The value represents the cosine similarity between embeddings. Generally, we want to set this on the higher side, as we are searching for highly similar documents. Setting it too low will result in documents getting somewhat relevant keywords but not the correct ones. The actual value will depend on the embedding model, since they all have different similarity distributions (but it generally lies between 0 and 1).
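
To make that concrete, here is a minimal sketch of the grouping step on its own, calling `community_detection` from sentence-transformers directly (the documents are made up for illustration):

```python
from sentence_transformers import SentenceTransformer, util

# Toy reviews; the embedding model matches the one used later in this thread
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
docs = [
    "The website was slow and checkout kept failing.",
    "Checkout failed repeatedly and the site was very slow.",
    "Great customer service, the staff was friendly.",
]
embeddings = model.encode(docs, convert_to_tensor=True)

# Groups documents whose pairwise cosine similarity exceeds `threshold`;
# keywords then only need to be extracted once per group
clusters = util.community_detection(embeddings, threshold=0.8, min_community_size=2)
print(clusters)  # e.g. [[0, 1]]: the two similar reviews form one community
```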

Thanks for the quick response. I am assuming we can pass in an embedding model of our choice?

Yes! As you can see in the last two use cases, there is the option to select an embedding model. Here, `BAAI/bge-small-en-v1.5` is used, but you can use any other embedding model supported by KeyBERT.
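
As a minimal sketch of swapping in your own embedding model (the tiny `gpt2` generator here is only a stand-in to keep the example self-contained; the article uses a quantized Mistral model via ctransformers instead):

```python
from keybert import KeyLLM
from keybert.llm import TextGeneration
from sentence_transformers import SentenceTransformer
from transformers import pipeline

documents = [
    "The delivery was late and the package arrived damaged.",
    "My package came damaged and shipping took forever.",
    "Lovely store layout and helpful staff.",
]

# Stand-in generator so the sketch runs on its own
generator = pipeline("text-generation", model="gpt2", max_new_tokens=30)
llm = TextGeneration(generator)

# Swap in any embedding model supported by KeyBERT
embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = embedding_model.encode(documents, convert_to_tensor=True)

# Precomputed embeddings are passed in; `threshold` controls the grouping
kw_model = KeyLLM(llm)
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=0.75)
print(keywords)
```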
