I think the router matrix should be 5x4 and the output should be 3x4. The softmax is taken over each row, so every token gets a probability distribution over the 4 experts.
Each row represents a token, and the top score in that row selects the expert to apply to that token.
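Here is a small sketch of the shapes as I understand them. It is not the article's code: the 3 tokens, the embedding size of 5, the 4 experts, and all the numbers in `x` and `w` are made up for illustration, and `matmul` and `softmax` are just helper names I chose.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def softmax(row):
    """Numerically stable softmax over a single row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed input: 3 tokens, each with an embedding of size 5 (shape 3x5).
x = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

# Assumed router weights: embedding size 5 by 4 experts (shape 5x4).
w = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 4.0],
]

h = matmul(x, w)                  # H(x): shape 3x4, one row of scores per token
g = [softmax(row) for row in h]   # G(x): softmax applied to each row separately

# Top-1 routing: index of the highest probability in each row.
chosen = [max(range(len(row)), key=row.__getitem__) for row in g]

print(len(h), len(h[0]))  # 3 4
print(chosen)             # [0, 2, 3]
```

Each row of `g` sums to 1, so the softmax here is per token (per row), and `chosen` holds one expert index per token.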
In the middle of the article, the H(x) shown is 3x3, whereas the softmax output G(x) has 4 values. That illustration is a bit confusing.
Is the softmax distribution calculated per row or per column of H(x)?
Thanks
I have the same confusion.