2 Comments
siyu

In the middle of the article, the H(x) shown is 3x3, whereas the softmax output G(x) has 4 values. That illustration is a bit confusing.

Is the softmax distribution calculated per row or per column of H(x)?

Thanks

DO DUC TAI

I have the same confusion.

I think the router matrix should be 5x4 and the output should be 3x4. The softmax is taken over each row, so that it gives the output probabilities for the 4 experts.

Each row represents a token, and the top score in the row selects the expert to apply to that token.
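
To make the shapes concrete, here is a rough sketch of what I mean. It assumes 3 tokens, a hidden size of 5, 4 experts, and top-1 routing; the names (`x`, `w_router`, `h`, `g`) and the random values are my own illustration, not taken from the article.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def softmax(row):
    """Numerically stable softmax over a single row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
num_tokens, d_model, num_experts = 3, 5, 4  # assumed sizes

# x: one row per token (3 x 5); w_router: router weights (5 x 4).
x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_tokens)]
w_router = [[random.gauss(0, 1) for _ in range(num_experts)] for _ in range(d_model)]

# h = x @ w_router has one row per token and one column per expert (3 x 4).
h = matmul(x, w_router)

# Softmax is applied to each row, so every token gets a distribution over experts.
g = [softmax(row) for row in h]

# Top-1 routing: each token goes to the expert with the highest probability.
chosen = [max(range(num_experts), key=lambda j: row[j]) for row in g]

for t, (probs, e) in enumerate(zip(g, chosen)):
    print(f"token {t}: probs={[round(p, 3) for p in probs]} -> expert {e}")
```

With these shapes every row of `g` sums to 1 and has 4 entries, which would match a softmax output with 4 values per token.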
