18 Comments
Jenni_H:

Thank you for taking the time to break down the complexity into visually appealing, bite-sized chunks. This is very interesting!

Matti Eteläperä:

This is a fine article, but with misleading terminology. Mamba is not a selective SSM; it is a gated neural network architecture. The selective SSM part in the original paper is S6 (S4 + selection), and Mamba can be implemented with or without it. In fact, Table 1 in the original paper lists a comparison with Mamba as the architecture and S4, Hyena, or S6 as the layer.

What I'm unable to understand is why the authors didn't name it Mamba-S6, and why they used vague terminology in parts of the paper. Mamba-S6 would have been so much more descriptive, but hey, the genie is out of the bottle.

Ivaylo Dimitrov:

Hi Maarten!

Thank you very much for this post; I learned a lot reading it.

I just wanted to ask: in the section "Selectively Retain Information", are the shapes of matrices A, B, and C all D×N? Or is that true only for matrix C, with matrices A and B having shapes N×N and N×D respectively?
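
For reference, a minimal shape sketch of the selective recurrence, assuming the conventions of the Mamba paper (batch b, sequence length L, model dimension D, state size N); names like Abar, B_t, and C_t are just placeholders:

```python
import torch

# Illustrative sizes: batch, sequence length, model dim, state size.
b, L, D, N = 2, 16, 64, 8

x = torch.randn(b, L, D)        # input sequence of token vectors

# The paper stores a diagonal A per channel, so A fits in a (D, N) tensor;
# Abar here stands in for its discretized version.
Abar = torch.rand(D, N)
# In the *selective* SSM, B and C are computed from the input, so they
# carry batch and length dimensions: one length-N vector per token.
B_t = torch.randn(b, L, N)
C_t = torch.randn(b, L, N)

h = torch.zeros(b, D, N)        # one length-N state per channel
y = torch.zeros(b, L, D)
for t in range(L):
    # broadcast B_t over channels and x_t over the state dimension
    h = Abar * h + B_t[:, t, None, :] * x[:, t, :, None]   # (b, D, N)
    y[:, t] = (h * C_t[:, t, None, :]).sum(-1)             # (b, D)
```

On this reading, the N×N / N×D shapes belong to the classic continuous-time formulation per channel, while the implementation batches A as (D, N) and the input-dependent B and C as (b, L, N).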

Dr. Ashish Bamania:

Very nicely explained! Thanks for this!

Walid Ahmed:

Thanks

Can you please explain how the weights of the 1D convolution are picked?
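
In case it helps: the short 1D convolution in the Mamba block is a learned layer, so its weights are not hand-picked; they are trained by gradient descent along with the rest of the network. A minimal sketch of a depthwise, causal Conv1d of the kind the block uses (sizes here are illustrative):

```python
import torch
import torch.nn as nn

D, kernel_size = 64, 4
conv = nn.Conv1d(
    in_channels=D, out_channels=D,
    kernel_size=kernel_size,
    groups=D,                      # depthwise: one small filter per channel
    padding=kernel_size - 1,       # pad so no position sees future tokens
)

x = torch.randn(2, D, 16)          # (batch, channels, length)
y = conv(x)[..., : x.shape[-1]]    # trim the right side to keep it causal
print(y.shape)                     # torch.Size([2, 64, 16])
```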

Sander:

People like you who can explain very complex topics in understandable, bite-sized chunks are worth their weight in gold. Thank you!

James:

Thank you so much for the well-written visual guide.

Do you think writing the equation the following way is valid?

h(t) = Ah(t-1) + Bx(t)

y(t) = Ch(t) + Dx(t)
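
For what it's worth, that form matches the recurrence in the guide once A and B are read as the discretized Ā and B̄, with the Dx(t) term acting as a skip connection. A toy sketch under those assumptions (all values below are made up):

```python
import numpy as np

# Toy sketch of the discrete recurrence, assuming A and B below are
# already the *discretized* matrices (A-bar and B-bar in the guide):
#   h_k = A h_{k-1} + B x_k
#   y_k = C h_k + D x_k      (D x_k is a skip connection on the input)
N, L = 4, 10
A = 0.9 * np.eye(N)              # toy discretized state matrix
B = np.ones((N, 1))
C = np.ones((1, N))
D = np.zeros((1, 1))             # often treated as a residual term

x = np.random.randn(L)
h = np.zeros((N, 1))
for k in range(L):
    h = A @ h + B * x[k]         # state update
    y_k = C @ h + D * x[k]       # output at step k
```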

Swagata Ashwani:

Such a well-explained article! Thank you

Boyuan Zhang:

Great article, thanks!

Daniel Kleine:

Great article, thanks!

What do you think about also writing a visual guide to Hyena?

Ikun:

Thank you for sharing. Could you please tell us which tool you use to create these beautiful diagrams?

Husun Shujaat:

I had been reading the Mamba paper, and it went over my head. I read this after the paper, and it is so good, so easy to understand, and very well structured. Thank you!!

Patrick:

Thank you Maarten, great post! I have a question regarding the parallel scan: are the B̄'s supposed to be different?
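
For context, in the selective SSM the B̄'s (and Ā's) do differ per timestep, since Δ and B are computed from each input token. Below is a small sketch of the associative operator the parallel scan relies on, assuming a scalar linear recurrence h_k = a_k·h_{k-1} + b_k per state entry (names are placeholders):

```python
import numpy as np

# Each step of the recurrence h_k = a_k * h_{k-1} + b_k can be packed
# into a pair (a_k, b_k). Composing two steps is associative, which is
# what lets the scan be evaluated in parallel instead of sequentially.
def combine(left, right):
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

a = np.array([0.5, 0.9, 0.7])     # per-step decay (differs per timestep)
b = np.array([1.0, 2.0, 3.0])     # per-step input contribution

acc = (1.0, 0.0)                  # identity element of the operator
for step in zip(a, b):
    acc = combine(acc, step)
print(acc[1])                     # ~5.03, same as running the recurrence step by step
```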

Egil:

This post is brilliant, Maarten. I loved how you drew us into this idea from axe to bread (explaining the idea all the way down to its delicious CUDA)! Sorry if you've answered this before, but how do you make the matrix visualizations? They are great.

Jj:

I am very grateful to you, Maarten, for writing such an incredible piece. I am looking forward to your book.

TonyK:

Thank you so much for the clear explanation of the process and for making the visual figures.

I am still a little confused about some of the definitions:

1. Does k in x_k mean the timestep t?

2. Do x_0, x_1, x_2, ... represent the tokens in the input sequence? If so, is the number of timesteps (k) equal to the input length (the number of tokens)?

3. For Bx_k, is the matrix multiplication (1×D)·(D×N) = 1×N, or (L×D)·(D×N) = L×N?
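
On question 3, a quick shape check may help; this is just a sketch, assuming token vectors of size D and a state of size N (X, B, and x_k here are placeholders):

```python
import numpy as np

# Shape check for the B x_k question; sizes are illustrative.
L, D, N = 10, 64, 8
X = np.random.randn(L, D)      # rows x_0 ... x_{L-1}, one token per timestep
B = np.random.randn(D, N)

x_k = X[3:4]                   # a single token at timestep k: shape (1, D)
print((x_k @ B).shape)         # (1, N) -- the per-step, recurrent view
print((X @ B).shape)           # (L, N) -- all L timesteps at once
```

Both views produce the same rows; the recurrent one handles one timestep at a time, while the parallel one stacks all L tokens.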
