The article is really good and helpful. Very informative.
Thanks for the article, Maarten. It really helps to dive into AI.
I'm trying to analyse 6.6K reviews about retail companies that I scraped from the web. The idea is to extract keywords from each review with KeyLLM (I have two Tesla P40s in my home PC, so I expect it won't take too long) and then visualize them with UMAP, as you did in a previous article. For now, though, I get a message at the first step that the maximum context length (512) was exceeded.
Could you please tell me if this limit can be bypassed? Some of the reviews I analyse are close to 4K tokens.
You would have to either chunk the reviews to make sure you stay within the token limit, or increase the context length. You can find the relevant ctransformers parameters here: https://github.com/marella/ctransformers#config
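For example, a minimal sketch of raising the context window when loading a GGUF model with ctransformers (the model name, file, and values below are placeholders, not the exact setup from the article):

```python
from ctransformers import AutoModelForCausalLM

# Raise the default context window so ~4K-token reviews fit in a single pass.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",           # placeholder model repo
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",   # placeholder model file
    model_type="mistral",
    gpu_layers=50,           # offload layers to the Tesla P40s
    context_length=4096,     # instead of the 512-token default you are hitting
)
```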
Thanks.
Is it possible to use it to anonymize documents?
Let's suppose I want to anonymize some docs for a list of topics, and let's assume I also have a small dictionary for these topics (to make it complete).
How would you deal with this use case, please?
Anonymization can be a tricky subject. It also depends on the extent to which you want to anonymize. If it is only names and addresses, you might be able to use a NER-like algorithm to detect and replace them, as in the sketch below.
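A minimal sketch of that idea with spaCy's pretrained NER (this sits outside KeyLLM/KeyBERT; the entity labels and the replacement format are illustrative assumptions):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def anonymize(text: str) -> str:
    doc = nlp(text)
    redacted = text
    # Replace detected entities from the end so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "GPE", "LOC", "ORG"}:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

print(anonymize("John Smith lives at 10 Downing Street, London."))
# -> "[PERSON] lives at 10 Downing Street, [GPE]."
```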
Thanks for this nice article. A few questions on the part where you mention "We assume that documents that are highly similar will have the same keywords, so there would be no need to extract keywords for all documents": 1) How do you choose which of the semantically close documents KeyLLM uses for keyword extraction? 2) What is the threshold for the similarity check that groups similar documents, and is that something we can configure?
1. Behind the scenes it uses sentence-transformers' `community_detection`, which finds groups of documents that are highly similar to one another.
2. In the last few examples, you can see me using the `threshold` parameter, which configures the grouping of similar documents. The value represents the cosine similarity between embeddings. Generally, we want to set this higher rather than lower, as we are searching for highly similar documents. Setting it too low will result in documents getting somewhat relevant keywords, but not the correct ones. The actual value depends on the embedding model, since models have different similarity distributions (but it generally lies between 0 and 1). The sketch below shows the grouping step on its own.
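A rough sketch of what that grouping looks like using sentence-transformers directly (the documents, model choice, and threshold value are illustrative assumptions):

```python
from sentence_transformers import SentenceTransformer, util

docs = [
    "The screen is great but the battery drains fast.",
    "Battery life is poor even though the display looks amazing.",
    "Shipping took three weeks and support never replied.",
]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(docs, convert_to_tensor=True)

# Documents whose cosine similarity exceeds the threshold end up in one community,
# so keywords only need to be extracted for one representative per community.
communities = util.community_detection(embeddings, threshold=0.75, min_community_size=1)
print(communities)  # e.g. [[0, 1], [2]]
```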
Thanks for the quick response. I am assuming we can pass in an embedding model of our choice?
Yes! As you can see in the last two use cases, there is the option to select an embedding model. Here, `BAAI/bge-small-en-v1.5` is used, but you can use any other embedding model supported by KeyBERT. A short sketch is below.
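Roughly, following the pattern from the article's last examples (the documents are placeholders, and `llm` is assumed to be whichever KeyBERT LLM wrapper you already set up, e.g. via ctransformers):

```python
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

documents = [
    "Great store, but the delivery was slow.",
    "Delivery took ages, although the staff were friendly.",
]

# Any sentence-transformers model can be swapped in here.
embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = embedding_model.encode(documents, convert_to_tensor=True)

kw_model = KeyLLM(llm)  # `llm` = your KeyBERT LLM wrapper (assumed already created)
keywords = kw_model.extract_keywords(
    documents, embeddings=embeddings, threshold=0.75
)
print(keywords)
```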