16 Comments
Shanya Chaubey's avatar

The simplicity of the explanation was very helpful.

Thank you for creating this

siyu's avatar

In the middle of the article, the H(x) shown is 3x3, whereas the softmax output G(x) has 4 values. That illustration is a bit confusing.

Is softmax distribution calculated per row or per column of H(x)?

Thanks

DO DUC TAI's avatar

I have the same confusion.

I think the router matrix should be 5*4 and the output should be 3*4. The softmax is taken on the rows so that it gives the output probabilities for 4 routers.

Each row represents a token and the top score in the row will select the expert to apply on that token.
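The row-wise reading described in this comment can be sketched as follows. This is only an illustration under assumptions: the shapes (3 tokens with 5 features, a 5x4 router matrix, 4 experts) follow the comment, the random values are made up, and the helper names `softmax` and `route` are hypothetical rather than taken from the article.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one row of router logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def route(tokens, router_weights):
    """Return per-token expert probabilities and the top-1 expert index."""
    num_experts = len(router_weights[0])
    # H(x) = tokens @ router_weights, shape (num_tokens, num_experts)
    logits = [
        [sum(t[k] * router_weights[k][e] for k in range(len(t)))
         for e in range(num_experts)]
        for t in tokens
    ]
    # Softmax is applied per row, so each token gets its own distribution
    probs = [softmax(row) for row in logits]
    # The highest probability in each row picks the expert for that token
    chosen = [max(range(num_experts), key=row.__getitem__) for row in probs]
    return probs, chosen

random.seed(0)  # made-up inputs, fixed only for reproducibility
tokens = [[random.gauss(0, 1) for _ in range(5)] for _ in range(3)]          # 3x5
router_weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]  # 5x4
probs, chosen = route(tokens, router_weights)                                # probs is 3x4
```

Each row of `probs` sums to 1, and `chosen` holds one expert index per token.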

DO DUC TAI's avatar

Thank you Maarten for the incredible article!!!

These posts make me appreciate so much the value of open source and open sharing in the community.

Jiang's avatar

Great post! Could you tell me what tool you used to create these diagrams?

Maarten Grootendorst's avatar

Sure! I generally use Figma to create these figures, but I do want to stress that the tool is almost irrelevant for creating these visuals. Creating the visuals in something like PowerPoint or Keynote would take the same effort.

The primary reason why I use Figma is so that I can easily create SVG files for lightweight hosting.

I would actually advise using something like https://excalidraw.com/ to create visuals. It's a minimal framework that allows you to focus on the design rather than 100s of unnecessary features.

Mayur's avatar

Wonderful article. It clarified many concepts with ease.

Ruben's avatar

Hi Maarten, I wonder why it is not possible to prune the non-active parameters of an MoE model at run time in order to reduce memory requirements. Thanks!

Maarten Grootendorst's avatar

During inference, any expert may be chosen, so these have to remain in memory ready to use for when they are called upon.
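A small sketch of why this is the case, assuming top-1 routing and made-up router probabilities (the numbers and the helper `top1` are hypothetical, not taken from the article): each token activates only one expert, but different tokens pick different experts, so over a sequence every expert can end up being needed.

```python
# Hypothetical router probabilities: one row per token, one column per expert
router_probs = [
    [0.70, 0.10, 0.10, 0.10],  # token 0 prefers expert 0
    [0.05, 0.80, 0.10, 0.05],  # token 1 prefers expert 1
    [0.10, 0.20, 0.10, 0.60],  # token 2 prefers expert 3
    [0.20, 0.10, 0.65, 0.05],  # token 3 prefers expert 2
]

def top1(row):
    """Index of the expert with the highest probability for one token."""
    return max(range(len(row)), key=row.__getitem__)

chosen = [top1(row) for row in router_probs]

# Only one expert is active per token...
active_per_token = 1
# ...but the set of experts needed across the whole sequence is larger
experts_needed = set(chosen)

print(chosen)          # [0, 1, 3, 2]
print(experts_needed)  # {0, 1, 2, 3}
```

Because the choice is only known once a token reaches the router, none of the four experts in this toy example could have been dropped from memory in advance.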

Juli's avatar

Nice work! I would much appreciate it if you could do a PPO guide.

Jinxu's avatar

Great article! I am wondering if we can translate your blog into Chinese and post it in an AI community. We will keep the original link and state where it is translated from. Thank you.

Maarten Grootendorst's avatar

Sure! As long as the source is shared, I'm all for it. When it is translated, could you share the link? I will then also add it to the beginning of the post.

PRATIK KUMAR's avatar

Excellent blog. Very insightful!!!

Just one question: shouldn't the probabilities sum to 1 in the router diagram?

https://newsletter.maartengrootendorst.com/i/148217245/the-router

Also, in diagram 2 here, where we have an expert capacity of 3: for tokens 4 and 5, expert 2 has the highest probability. Shouldn't they be sent to expert 2 instead of expert 4?

https://newsletter.maartengrootendorst.com/i/148217245/expert-capacity
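To make the capacity question concrete, here is a minimal sketch under one assumed rule: tokens are processed in order, and each goes to its highest-probability expert that still has room. The rule, the function name `route_with_capacity`, and the probabilities are all assumptions for illustration and do not reproduce the article's diagram. Indices are 0-based, so "expert 2" is index 1 and "expert 4" is index 3.

```python
def route_with_capacity(router_probs, capacity):
    """Greedy routing: best expert with remaining capacity, in token order."""
    num_experts = len(router_probs[0])
    load = [0] * num_experts
    assignment = []
    for row in router_probs:
        # Experts sorted from most to least probable for this token
        ranked = sorted(range(num_experts), key=lambda e: row[e], reverse=True)
        # First expert in that ranking that is not yet full (None if all are full)
        choice = next((e for e in ranked if load[e] < capacity), None)
        if choice is not None:
            load[choice] += 1
        assignment.append(choice)
    return assignment, load

# Made-up probabilities: every token ranks index 1 ("expert 2") highest
router_probs = [
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.5, 0.1, 0.3],
    [0.2, 0.4, 0.1, 0.3],
]
assignment, load = route_with_capacity(router_probs, capacity=3)
print(assignment)  # [1, 1, 1, 3, 3]
print(load)        # [0, 3, 0, 2]
```

Under this rule, a token is redirected away from its top expert only when that expert is already full, which is the condition the question above is checking.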

Maarten Grootendorst's avatar

You are correct, thank you for sharing! It seems I completely missed those things. I updated the ones you mentioned and also updated a couple of others that needed minor updates.

PRATIK KUMAR's avatar

Thanks a lot for confirming. Really appreciate the quality & effort put in both the book & the blog.
