DO DUC TAI

I have the same confusion.

I think the router matrix should be 5×4 and the output should be 3×4. The softmax is taken over each row, so that each row gives the routing probabilities over the 4 experts.

Each row represents a token, and the top score in that row selects the expert to apply to that token.
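
A rough sketch of those shapes, assuming 3 tokens with embedding size 5 and 4 experts (sizes inferred from the 5×4 and 3×4 shapes above). The variable names and the random values are made up for illustration and are not from the original post:

```python
import numpy as np

# Assumed sizes: 3 tokens, embedding dimension 5, 4 experts.
num_tokens, d_model, num_experts = 3, 5, 4

rng = np.random.default_rng(0)
x = rng.normal(size=(num_tokens, d_model))          # token embeddings, 3x5
w_router = rng.normal(size=(d_model, num_experts))  # router matrix, 5x4

logits = x @ w_router                               # router output, 3x4

# Row-wise softmax: each row becomes a distribution over the 4 experts.
shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Top-1 routing: the highest score in each row picks that token's expert.
chosen_expert = probs.argmax(axis=1)                # shape (3,), values in 0..3

print(probs.shape, chosen_expert)
```

Each row of `probs` sums to 1, and `chosen_expert` holds one expert index per token.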
