24 Comments
Jenni_H

Thank you for taking the time to break down the complexity into visually appealing, bite-sized chunks. This is very interesting!

William

Great post, thank you so much for sharing your insights! I do have a different perspective regarding the claim that "RNNs could be fast for both training and inference." My understanding is that, during training, RNNs process sequences in a token-by-token manner due to their recurrent structure. In contrast, Transformers can leverage parallelism by processing all tokens in a sequence simultaneously using causal masks. This parallelization makes Transformer training potentially L times faster than RNNs, where L is the sequence length.
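As a rough illustration of the contrast described here (toy shapes and a placeholder recurrence, not the post's code), the RNN side needs a sequential scan, while masked attention computes all positions in one batched matrix product:

```python
import jax
import jax.numpy as jnp

# Toy shapes, purely for illustration.
L, D = 128, 16
x = jnp.ones((L, D))

# RNN-style training: the hidden state is threaded through a sequential scan,
# so step t cannot start before step t-1 has finished.
def rnn_cell(h, x_t):
    h = jnp.tanh(h + x_t)                      # placeholder recurrence
    return h, h

_, rnn_out = jax.lax.scan(rnn_cell, jnp.zeros(D), x)

# Transformer-style training: with a causal mask, all L positions are
# computed in one batched matrix product, which is what parallelizes training.
scores = x @ x.T / jnp.sqrt(D)
mask = jnp.tril(jnp.ones((L, L), dtype=bool))
attn = jax.nn.softmax(jnp.where(mask, scores, -jnp.inf), axis=-1)
attn_out = attn @ x

print(rnn_out.shape, attn_out.shape)           # (128, 16) (128, 16)
```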

Matti Eteläperä

This is a fine article, but with misleading terminology. Mamba is not a selective SSM; it is a gated neural network architecture. The selective SSM part in the original paper is S6 (S4 + selection), and Mamba can be implemented with or without it. In fact, Table 1 in the original paper lists a comparison with Mamba as the architecture and S4, Hyena, or S6 as the layer.

What I'm unable to understand is why the authors didn't name it Mamba-S6 and why they use vague terminology in parts of the paper. Mamba-S6 would have been so much more descriptive, but hey, the genie is out of the bottle.

Husun Shujaat

I had been reading the Mamba paper, and it went right over my head. I read this after the paper, and it is so good, so easy to understand, and very well structured. Thank you!!

Mohamed Mabrok

Thank you. This was very helpful. The visualization is great.

Can you highlight how the matrices B and C are parametrized and learned?

Thank you!

Maarten Grootendorst

I would definitely recommend the Annotated S4 - https://srush.github.io/annotated-s4/

It demonstrates, using JAX, how these matrices are learned, and it is a great next step after reading through this visual guide. Going from a visual guide to a hands-on one is a nice learning pipeline.
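To make that concrete, here is a minimal sketch (not the Annotated S4 code; the shapes and the fixed toy transition are placeholders) of B and C as ordinary learnable parameters that receive gradients like any other weight:

```python
import jax
import jax.numpy as jnp

N, D = 4, 1          # state size, input size (illustrative)
kB, kC = jax.random.split(jax.random.PRNGKey(0))
params = {
    "B": jax.random.normal(kB, (N, D)),   # input -> state, learned
    "C": jax.random.normal(kC, (1, N)),   # state -> output, learned
}
A_bar = 0.9 * jnp.eye(N)                  # fixed toy transition for this sketch

def ssm_apply(params, xs):
    # Run the discrete recurrence h_k = A_bar h_{k-1} + B x_k, y_k = C h_k.
    def step(h, x_t):
        h = A_bar @ h + params["B"] @ x_t
        return h, (params["C"] @ h)[0]
    _, ys = jax.lax.scan(step, jnp.zeros(N), xs)
    return ys

def loss(params, xs, ys_target):
    return jnp.mean((ssm_apply(params, xs) - ys_target) ** 2)

xs = jnp.ones((32, D))
ys_target = jnp.zeros(32)
grads = jax.grad(loss)(params, xs, ys_target)   # gradients w.r.t. B and C
print(jax.tree_util.tree_map(jnp.shape, grads))
```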

Michael

Great article; I have a much better understanding of the paper now, a hundred thanks to you. I am just wondering if you have an "advanced" visualization of the discretization process: typically showing A and B on one side and Abar and Bbar on the other, for some simple matrices and a few time steps. I don't know if this is even possible, but I am still trying to understand why Abar and Bbar are computed the way they are in the paper and why that discretizes the matrix. At the end of the day, A, B, Abar, and Bbar are all matrices/vectors, so it is not clear to me what it means to discretize a matrix.
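For what it's worth, the rule used in the paper is the zero-order hold: assume the input x is held constant over each interval of length Δ, and the continuous system h'(t) = Ah(t) + Bx(t) then has an exact step-wise solution. "Discretizing the matrix" really means discretizing the linear ODE it defines, and the bars denote the resulting step matrices:

```latex
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\,\Delta B
```

In the scalar case A = a this reduces to \bar{A} = e^{\Delta a} and \bar{B} = \frac{e^{\Delta a} - 1}{a} B, which may be the easiest version to visualize for a few time steps before moving to full matrices.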

Shiyi

What a great tutorial with nice and clear figures!

MetaModeler

"Thanks so much! I really appreciated how clearly you explained State Space Models; the visuals were excellent. By the way, how did you create those awesome animated GIFs?

Carlos

Hi, I think the decoder figure is wrong. In the text you say that the FFN comes after the MHA, but in the figure it is placed before it, according to the arrow in the "Decoder" square.

bohr

In the "Convolution Representation" GIF, I think the equation "y_2 = CAB x_0 ......" is wrong. After multiplying by the kernel, the equation should be "y_2 = CA^2B x_0 ......"
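For reference, unrolling the recurrence (bars denoting the discretized matrices, and dropping the D skip term) gives:

```latex
y_2 = C\bar{A}^{2}\bar{B}\,x_0 + C\bar{A}\bar{B}\,x_1 + C\bar{B}\,x_2,
\qquad
y_k = \sum_{j=0}^{k} C\,\bar{A}^{\,k-j}\bar{B}\,x_j
```

so the convolution kernel is (C\bar{B}, C\bar{A}\bar{B}, C\bar{A}^{2}\bar{B}, \dots), which agrees with the correction above: the coefficient of x_0 at step 2 is C\bar{A}^{2}\bar{B}.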

Ivaylo Dimitrov

Hi Maarten!

Thank you very much for this post, I learned a lot reading it.

I just wanted to ask whether, in the section "Selectively Retain Information", the shapes of matrices A, B, and C are all DxN, or whether that is true only for matrix C, with A and B having shapes NxN and NxD respectively.

Dr. Ashish Bamania

Very nicely explained! Thanks for this!

Walid Ahmed

Thanks

Can you please explain how the weights of the conv1d are picked?
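If it helps, as far as I understand the reference implementation, the short convolution in the block is depthwise (one small kernel per channel) and causal, and its weights are not hand-picked: they are learned parameters, initialized randomly and updated by the optimizer like any other weight. A minimal sketch with made-up sizes (not the actual implementation):

```python
import jax.numpy as jnp
from jax import random

def causal_depthwise_conv1d(x, kernels):
    # x: (L, D) sequence, kernels: (K, D) one length-K filter per channel
    K = kernels.shape[0]
    x_pad = jnp.pad(x, ((K - 1, 0), (0, 0)))   # left-pad so the output stays causal
    windows = jnp.stack([x_pad[i:i + x.shape[0]] for i in range(K)])  # (K, L, D)
    return jnp.einsum("kld,kd->ld", windows, kernels)  # depthwise: no channel mixing

key = random.PRNGKey(0)
x = random.normal(key, (16, 8))                # L=16 tokens, D=8 channels
kernels = 0.1 * random.normal(key, (4, 8))     # K=4; learned during training in practice
y = causal_depthwise_conv1d(x, kernels)
print(y.shape)                                 # (16, 8)
```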

Sander

People like you who can explain very complex topics in understandable, bite-sized chunks are worth their weight in gold. Thank you.

James

Thank you so much for the well written visual guide.

Do you think writing the equations the following way is valid?

h(t) = Ah(t-1) + Bx(t)

y(t) = Ch(t) + Dx(t)
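For comparison, with bars marking the discretized matrices, the recurrent view is usually written as the sketch below (the Dx term is often treated as a skip connection and left out):

```latex
h_k = \bar{A} h_{k-1} + \bar{B} x_k, \qquad y_k = C h_k
```

So the form in the comment matches it, provided A and B are understood as the discretized \bar{A} and \bar{B}.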

Swagata Ashwani

Such a well-explained article! Thank you