A Visual Guide to Mamba and State Space…

Maarten Grootendorst

Feb 19, 2024

286

An Alternative to Transformers for Language Modeling

Read →

20 Comments

Jenni_H

Feb 23, 2024

Thank you for taking the time to break down the complexity into visually appealing, bite sized chunks. This is very interesting!

Expand full comment

William

Jul 3

Great post, thank you so much for sharing your insights! I do have a different perspective regarding the claim that "RNNs could be fast for both training and inference." My understanding is that, during training, RNNs process sequences in a token-by-token manner due to their recurrent structure. In contrast, Transformers can leverage parallelism by processing all tokens in a sequence simultaneously using causal masks. This parallelization makes Transformer training potentially L times faster than RNNs, where L is the sequence length.

Expand full comment

bohr

May 28

In the GIF--Convolution Representation, I think the equation "y_2 = CAB x_0 ......" is wrong.After mutiplying the kernel, the equation should be "y_2 = CA^2Bx_0 ......"

Expand full comment

Matti Eteläperä

Apr 27

This is a fine article, but with misleading terminology. Mamba is not a selective SSM, it is a gated neural network architecture. The selective SSM part in the original paper is S6 (S4+selection) and Mamba can be implemented with or without it. In fact, Table 1. in the original paper lists a comparison with Mamba as the architecture and S4, Hyena or S6 as the layer.

What I'm unable to understand is why the authors didn't name it Mamba-S6 and use vague terminology in parts of the paper. Mamba-S6 would have been so much more descriptive, but hey, the genie is out of the bottle.

Expand full comment

Ivaylo Dimitrov

Apr 25

Hi Maarten!

Thank you very much for this post, I learned a lot reading it.

I just wanted to ask if in the section "Selectively Retain Information" the Shapes of matrices A B C are all DxN or rather that's true only for matrix C and matrices A and B are with shapes NxN and NxD respectfully.

Expand full comment

Dr. Ashish Bamania

Nov 26, 2024

Very nicely explained! Thanks for this!

Expand full comment

Walid Ahmed

Nov 6, 2024

Thanks

Can you please explain how the weidyj of the conv 1d is picked?

Expand full comment

Sander

Sep 18, 2024

People like you who can explain very complex topics in undestandable, bitesized chunks are worth their weight in gold. thank you

Expand full comment

James

Aug 13, 2024

Thank you so much for the well written visual guide.

Do you think writing the equation the following way is valid?

h(t) = Ah(t-1) + Bx(t)

y(t) = Ch(t) + Dx(t)

Expand full comment

Swagata Ashwani

Jul 21, 2024

Such a well explained article! Thank you

Expand full comment

Boyuan Zhang

Jun 20, 2024

Great article, thanks!

Expand full comment

Daniel Kleine

Jun 2, 2024

Great article, thanks!

What do you think about writing also a visual guide to Hyena?

Expand full comment

Ikun

May 14, 2024

Thank you for your sharing, could you please share which tool you use to create these beautiful diagrams?

Expand full comment

Husun Shujaat

Apr 24, 2024

Í had been reading the mamba paper, and it went all over my head. Read this after the paper, and this is so good, so easy to understand, and very well structured. Thank you!!

Expand full comment

Patrick

Mar 27, 2024

Thank you Maarten, great post! I have a question regarding the parallel scan: are the B_'s suppose to be different ?

Expand full comment

Egil

Mar 24, 2024

This post is brilliant Maarten, loved how you sucked us in to this idea from ax to bread (explaining idea down to its delicious cuda)! Sorry if you answered before but how do you make the matrix vizualisations? They are great

Expand full comment

Exploring Language Models

A Visual Guide to Mamba and State Space…