A Visual Guide to Mixture of Experts (MoE)

Maarten Grootendorst

Oct 7, 2024

Demystifying the role of MoE in Large Language Models

17 Comments

Shanya Chaubey

Oct 11

The simplicity of the explanation was very helpful.

Thank you for creating this

siyu

Jan 9

In the middle of the article, the H(x) shown is 3×3, whereas the softmax output G(x) has 4 values. That illustration is a bit confusing.

Is the softmax distribution calculated per row or per column of H(x)?

Thanks

DO DUC TAI

Apr 16 (edited)

I have the same confusion.

I think the router matrix should be 5×4 and the output should be 3×4. The softmax is taken over each row, so it gives the output probabilities for the 4 experts.

Each row represents a token, and the top score in the row selects the expert to apply to that token.
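
The shape argument in this comment can be checked with a small sketch. It assumes 3 tokens with an embedding size of 5, a 5×4 router weight matrix, and 4 experts. All numbers and the helper names (`matmul`, `softmax`) are made up for illustration and are not taken from the article's figure.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax(row):
    """Numerically stable softmax over a single row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input: 3 tokens, each a 5-dimensional vector.
x = [[1.0, 0.0, 2.0, 0.5, -1.0],
     [0.0, 1.0, 0.0, 1.0, 0.0],
     [0.5, 0.5, 0.5, 0.5, 0.5]]

# Hypothetical router weights: 5 input dimensions x 4 experts.
w = [[0.2, -0.1, 0.4, 0.0],
     [0.1, 0.3, -0.2, 0.5],
     [-0.3, 0.2, 0.1, 0.1],
     [0.0, 0.6, 0.2, -0.4],
     [0.5, -0.2, 0.0, 0.3]]

h = matmul(x, w)                  # H(x): 3 tokens x 4 experts
g = [softmax(row) for row in h]   # G(x): softmax applied per row (per token)

# Top-1 routing: each token picks the expert with the highest probability.
chosen = [max(range(len(row)), key=row.__getitem__) for row in g]

print(len(h), len(h[0]))          # shape of H(x)
print([round(sum(row), 6) for row in g])
print(chosen)
```

With these shapes, every row of G(x) is a probability distribution over the 4 experts, and there is one row per token.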

Satya Saurabh Mishra

Apr 22

One of the best blogs for understanding MoE with visuals. Thank you, Maarten!

DO DUC TAI

Apr 16

Thank you Maarten for the incredible article!!!

These posts make me appreciate so much the value of open source and open sharing in the community.

Jiang

Feb 19

Great post! Could you tell me what tool you used to create these diagrams?

Maarten Grootendorst

Feb 23

Sure! I generally use Figma to create these figures, but I do want to stress that the tool is almost irrelevant for creating these visuals. Creating the visuals in something like PowerPoint or Keynote would take the same effort.

The primary reason why I use Figma is so that I can easily create SVG files for lightweight hosting.

I would actually advise using something like https://excalidraw.com/ to create visuals. It's a minimal framework that allows you to focus on the design rather than hundreds of unnecessary features.

Mayur

Feb 7

Wonderful article. It clarified many concepts with ease.

Ruben

Jan 28

Hi Maarten, I wonder why it is not possible to prune the non-active parameters of an MoE model at run time to reduce memory requirements. Thanks!

Maarten Grootendorst

Jan 28

During inference, any expert may be chosen, so these have to remain in memory ready to use for when they are called upon.
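
A rough sketch of the point in this reply, under stated assumptions: the parameter counts, the number of experts, the top-k value, and the per-token routes below are all hypothetical, and `moe_param_counts` is an illustrative helper rather than part of any library.

```python
def moe_param_counts(shared_params, expert_params, num_experts, top_k):
    """Return (loaded, active) parameter counts for a single MoE layer.

    loaded: everything that must sit in memory, because any expert may be picked.
    active: what is actually used to process one token.
    """
    loaded = shared_params + num_experts * expert_params
    active = shared_params + top_k * expert_params
    return loaded, active

# Hypothetical sizes: 8 experts, 2 experts used per token.
loaded, active = moe_param_counts(shared_params=1_000,
                                  expert_params=500,
                                  num_experts=8,
                                  top_k=2)
print(loaded, active)  # 5000 2000

# Hypothetical routing decisions for three consecutive tokens (top-2 each).
routes = [[0, 3], [5, 1], [7, 2]]
experts_touched = sorted({e for route in routes for e in route})
print(experts_touched)  # [0, 1, 2, 3, 5, 7]
```

Each token only activates 2 experts, but in this example three tokens already touch 6 of the 8 experts, and the router's choice is not known until the token arrives. That is why the full set stays loaded even though the active parameter count per token is much smaller.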

Jonson Wong

Jan 13

You can find a Chinese version here: https://mp.weixin.qq.com/s/0VxqGdmYU5BdQt6YbUI9DQ?token=647504773&lang=zh_CN

Juli

Jan 12

Nice work. I would much appreciate it if you could do a PPO guide.

Jinxu

Oct 18

Great article! I am wondering if we can translate your blog into Chinese and post it in an AI community. We will keep the original link and state where it was translated from. Thank you.

Maarten Grootendorst

Oct 18

Sure! As long as the source is shared, I'm all for it. When it is translated, could you share the link? I will then also add it to the beginning of the post.

PRATIK KUMAR

Oct 15 (edited)

Excellent blog. Very insightful!!!

Just one question: shouldn't the probabilities sum to 1 in the router diagram?

https://newsletter.maartengrootendorst.com/i/148217245/the-router

Also, in diagram 2 here, we have an expert capacity of 3, and expert 2 has the highest probability for tokens 4 and 5. Shouldn't they be sent to expert 2 instead of expert 4?

https://newsletter.maartengrootendorst.com/i/148217245/expert-capacity
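
The capacity question can be made concrete with a minimal sketch. It assumes one plausible policy: tokens are handled in order, each goes to its highest-probability expert that still has room, and a token spills to its next-best expert only when the preferred one is full. The probabilities and the function name `route_with_capacity` are hypothetical and are not the values from the article's diagram.

```python
def route_with_capacity(probs, capacity):
    """Assign each token (row of probs) to one expert, respecting capacity.

    Returns a list of expert indices (0-based), or None for a token
    that finds every expert full.
    """
    load = [0] * len(probs[0])
    assignment = []
    for row in probs:
        # Experts ordered from most to least probable for this token.
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        choice = next((e for e in ranked if load[e] < capacity), None)
        if choice is not None:
            load[choice] += 1
        assignment.append(choice)
    return assignment

# Hypothetical router output: 5 tokens x 4 experts, each row sums to 1.
probs = [
    [0.7, 0.1, 0.1, 0.1],   # token 1 prefers expert 1 (index 0)
    [0.6, 0.2, 0.1, 0.1],   # token 2 prefers expert 1 (index 0)
    [0.1, 0.6, 0.2, 0.1],   # token 3 prefers expert 2 (index 1)
    [0.1, 0.5, 0.1, 0.3],   # token 4 prefers expert 2 (index 1)
    [0.2, 0.4, 0.1, 0.3],   # token 5 prefers expert 2 (index 1)
]

print(route_with_capacity(probs, capacity=3))  # [0, 0, 1, 1, 1]
print(route_with_capacity(probs, capacity=2))  # [0, 0, 1, 1, 3]
```

With a capacity of 3, the second expert still has room for tokens 4 and 5, so they stay with their top choice. Only when the capacity drops to 2 does token 5 spill over to its next-best expert.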

Maarten Grootendorst

Oct 15

You are correct, thank you for sharing! It seems I completely missed those things. I updated the ones you mentioned and also updated a couple of others that needed minor updates.

PRATIK KUMAR

Oct 15

Thanks a lot for confirming. Really appreciate the quality & effort put in both the book & the blog.

Exploring Language Models
