9 Comments

The simplicity of the explanation was very helpful.

Thank you for creating this

Nice work. I'd really appreciate it if you could do a PPO guide as well.

In the middle of the article, the H(x) shown is 3x3, whereas the softmax output G(x) has 4 values. That illustration is a bit confusing.

Is the softmax distribution calculated per row or per column of H(x)?

Thanks

Great article! I am wondering if we could translate your blog post into Chinese and share it with the AI community. We will keep the original link and state where it was translated from. Thank you.

Sure! As long as the source is shared, I'm all for it. When it is translated, could you share the link? I will then also add it to the beginning of the post.

Excellent blog. Very insightful!!!

Just one question: shouldn't the probabilities sum to 1 in the router diagram?

https://newsletter.maartengrootendorst.com/i/148217245/the-router

Also, in diagram 2 here, where the expert capacity is 3: for tokens 4 and 5, expert 2 has the highest probability. Shouldn't those tokens be sent to expert 2 instead of expert 4?

https://newsletter.maartengrootendorst.com/i/148217245/expert-capacity

You are correct, thank you for sharing! It seems I completely missed those things. I updated the ones you mentioned and also updated a couple of others that needed minor updates.

Thanks a lot for confirming. I really appreciate the quality and effort put into both the book and the blog.
