Thank you for taking the time to break down the complexity into visually appealing, bite sized chunks. This is very interesting!
Thanks
Can you please explain how the weights of the conv1d are picked?
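For reference, here is a minimal sketch of the depthwise causal conv1d as I understand it (toy sizes and variable names are mine; my understanding is that the kernel weights are learned parameters rather than hand-picked):

```python
import jax.numpy as jnp

# Minimal sketch of a depthwise causal conv1d (toy sizes). The kernel
# weights w would be learned parameters in practice, not hand-picked.
L, D, K = 8, 4, 4                          # sequence length, channels, kernel size
x = jnp.ones((L, D))                       # input sequence
w = 0.1 * jnp.ones((K, D))                 # one kernel per channel (depthwise)

x_pad = jnp.pad(x, ((K - 1, 0), (0, 0)))   # left-pad so the conv stays causal
y = jnp.stack([(x_pad[t:t + K] * w).sum(axis=0) for t in range(L)])
print(y.shape)                             # (8, 4): same length, one output per channel
```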
People like you who can explain very complex topics in understandable, bite-sized chunks are worth their weight in gold. Thank you!
Thank you so much for the well written visual guide.
Do you think it's valid to write the equations the following way?
h(t) = Ah(t-1) + Bx(t)
y(t) = Ch(t) + Dx(t)
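For what it's worth, here is a minimal sketch of that recurrence exactly as written (toy sizes; A and B here stand for the discretized Ā and B̄):

```python
import jax.numpy as jnp

# Minimal sketch of the discrete recurrence above (toy sizes;
# A_bar/B_bar stand in for the discretized A and B).
N, T = 4, 6
A_bar = 0.9 * jnp.eye(N)
B_bar = jnp.ones((N, 1))
C, D = jnp.ones((1, N)), jnp.ones((1, 1))

h = jnp.zeros((N, 1))
for t in range(T):
    x_t = jnp.ones((1, 1))
    h = A_bar @ h + B_bar @ x_t            # h(t) = A h(t-1) + B x(t)
    y_t = C @ h + D @ x_t                  # y(t) = C h(t) + D x(t)
```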
Such a well explained article! Thank you
Great article, thanks!
What do you think about writing also a visual guide to Hyena?
Thank you for sharing! Could you please share which tool you use to create these beautiful diagrams?
I had been reading the Mamba paper, and it went right over my head. Read this after the paper, and this is so good, so easy to understand, and very well structured. Thank you!!
Thank you Maarten, great post! I have a question regarding the parallel scan: are the B̄'s supposed to be different?
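In case it helps others wondering the same thing: my reading is that in the selective variant the B̄_k are input-dependent, so each timestep carries its own. A minimal sketch of the scan for h_k = a_k h_{k-1} + b_k x_k, assuming a scalar state per channel for simplicity:

```python
import jax
import jax.numpy as jnp

# Minimal sketch of the parallel scan over h_k = a_k * h_{k-1} + b_k,
# assuming a scalar state per channel; each step carries its own (a_k, b_k),
# which is why the B_bar's can differ per timestep.
def combine(left, right):
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2           # compose two linear-recurrence steps

T = 8
a = 0.9 * jnp.ones(T)                      # per-step decay (A_bar_k)
bx = jax.random.normal(jax.random.PRNGKey(0), (T,))  # B_bar_k * x_k
_, h = jax.lax.associative_scan(combine, (a, bx))
# h[k] matches the sequential recurrence at step k
```

Composing two linear steps is associative, which is what lets the scan run in parallel instead of sequentially.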
This post is brilliant Maarten, loved how you sucked us into this idea from ax to bread (explaining the idea down to its delicious CUDA)! Sorry if you answered this before, but how do you make the matrix visualizations? They are great!
I am very grateful to you Maarten for writing such an incredible piece. I am looking forward to your book.
Thank you so much for the clear explanation of the process and for making the visual figures.
I am still a little confused about some of the definitions:
1. Does k in x_k mean the timestep t?
2. Do x_0, x_1, x_2, ... represent the tokens in the input sequence? If so, is the number of timesteps (k) equal to the input length (number of tokens)?
3. For Bx_k, is the matrix multiplication (1xD)*(DxN) = 1xN, or (LxD)*(DxN) = LxN?
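As a shape check on question 3, assuming D-dimensional tokens and an N-dimensional state (sizes are illustrative):

```python
import jax.numpy as jnp

# Minimal shape check, assuming the per-timestep (recurrent) reading:
# x_k is one token (1 x D), so B x_k contributes a 1 x N state update.
L, D, N = 8, 16, 4
X = jnp.ones((L, D))                       # x_0 ... x_{L-1}: one row per token
B = jnp.ones((D, N))

x_k = X[3:4]                               # a single timestep k: (1, D)
print((x_k @ B).shape)                     # (1, N): one state update per step
print((X @ B).shape)                       # (L, N): all steps at once (unrolled view)
```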
Thank you. This was very helpful. The visualization is great.
Can you highlight how the matrices B and C are parametrized and learned?
Thank you!
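While waiting for an answer: my understanding from the paper is that in the selective variant B and C come from learned linear projections of the input. A minimal sketch of that idea (the weight names W_B and W_C are mine, not the paper's):

```python
import jax
import jax.numpy as jnp

# Minimal sketch of input-dependent B and C: each is produced by a learned
# linear projection of the token (W_B / W_C are illustrative names).
D, N, L = 16, 4, 8
kB, kC, kx = jax.random.split(jax.random.PRNGKey(0), 3)
W_B = jax.random.normal(kB, (D, N))
W_C = jax.random.normal(kC, (D, N))
X = jax.random.normal(kx, (L, D))          # token sequence

B_k = X @ W_B                              # (L, N): a different B for every timestep
C_k = X @ W_C                              # (L, N): likewise for C
```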
I would definitely recommend the Annotated S4 - https://srush.github.io/annotated-s4/
It demonstrates, using JAX, how these matrices are learned, and it is a great next step after reading through this visual guide. Going from a visual guide to a hands-on one is a nice learning pipeline.
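In that spirit, here is a minimal sketch of what learning those matrices looks like in JAX (toy sizes and a toy objective, not the real S4 parameterization):

```python
import jax
import jax.numpy as jnp

# Minimal sketch: A, B, C are ordinary parameters updated by gradient descent.
N = 4                                      # state size

def ssm_apply(params, xs):
    A, B, C = params
    def step(h, x):
        h = A @ h + B * x                  # h_k = A h_{k-1} + B x_k
        return h, C @ h                    # y_k = C h_k
    _, ys = jax.lax.scan(step, jnp.zeros(N), xs)
    return ys

def loss(params, xs, ys_target):
    return jnp.mean((ssm_apply(params, xs) - ys_target) ** 2)

params = (0.9 * jnp.eye(N), jnp.ones(N), jnp.ones(N))
xs = jax.random.normal(jax.random.PRNGKey(0), (16,))
grads = jax.grad(loss)(params, xs, jnp.roll(xs, 1))  # toy target: a one-step delay
params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
```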