I had been reading the Mamba paper, and it went way over my head. I read this after the paper, and it is so good, so easy to understand, and very well structured. Thank you!!
This post is brilliant, Maarten. Loved how you sucked us into this idea from ax to bread (explaining the idea down to its delicious CUDA)! Sorry if you answered this before, but how do you make the matrix visualizations? They are great.
Thank you for taking the time to break down the complexity into visually appealing, bite sized chunks. This is very interesting!
Thank you for sharing. Could you please tell us which tool you use to create these beautiful diagrams?
Thank you Maarten, great post! I have a question regarding the parallel scan: are the B_'s supposed to be different?
I am very grateful to you Maarten for writing such an incredible piece. I am looking forward to your book.
Thank you so much for the clear explanation of the process and for making the figures.
I am still a little confused about some of the definitions:
1. Does k in x_k mean the timestep t?
2. Do x_0, x_1, x_2 ... represent each token in the input sequence? If so, is the number of timesteps (k) equal to the input length (number of tokens)?
3. For Bx_k, is the matrix multiplication (1xD)*(DxN) = 1xN, or (LxD)*(DxN) = LxN?
Thank you. This was very helpful. The visualization is great.
Can you explain how the matrices B and C are parametrized and learned?
Thank you!