I think the router matrix should be 5×4 and the output should be 3×4. The softmax is taken across each row, so each row gives the probabilities over the 4 experts.
Each row represents a token, and the highest score in that row selects the expert applied to that token.
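
To make the shapes concrete, here is a minimal sketch in plain Python. It assumes 3 tokens with 5-dimensional embeddings (inferred from the 5×4 router and 3×4 output) and uses random values; all names here are my own and not taken from the article.

```python
import math
import random

def matmul(a, b):
    # Plain nested-list matrix product: (n x k) times (k x m) gives (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    # Numerically stable softmax over a single row.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
num_tokens, d_model, num_experts = 3, 5, 4  # assumed sizes

# Token embeddings: 3 x 5. Router weights: 5 x 4. Values are random placeholders.
tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_tokens)]
router = [[random.gauss(0, 1) for _ in range(num_experts)] for _ in range(d_model)]

logits = matmul(tokens, router)            # 3 x 4: one row per token
probs = [softmax(row) for row in logits]   # softmax across each row (over experts)

# Top-1 routing: index of the highest probability in each token's row.
chosen = [max(range(num_experts), key=row.__getitem__) for row in probs]

print(len(probs), len(probs[0]))  # rows x columns of the router output
print(chosen)                     # one expert index per token
```

Each row of `probs` sums to 1, and `chosen` holds one expert index per token, which matches the reading that rows are tokens and columns are experts.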
I have the same confusion.