Thank you for taking the time to break down the complexity into visually appealing, bite-sized chunks. This is very interesting!
Great post, thank you so much for sharing your insights! I do have a different perspective regarding the claim that "RNNs could be fast for both training and inference." My understanding is that, during training, RNNs process sequences in a token-by-token manner due to their recurrent structure. In contrast, Transformers can leverage parallelism by processing all tokens in a sequence simultaneously using causal masks. This parallelization makes Transformer training potentially L times faster than RNNs, where L is the sequence length.
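For readers following along, here is a quick NumPy sketch (my own, not from the post) of the contrast: the RNN hidden state depends on the previous step, so training walks the sequence token by token, while causally masked attention processes all L positions in one batched matrix product.

```python
# Toy comparison: sequential RNN recurrence vs. parallel masked attention.
import numpy as np

L, d = 6, 4                              # toy sequence length and width
rng = np.random.default_rng(0)
x = rng.standard_normal((L, d))          # token embeddings

# RNN: inherently sequential, step t needs the hidden state from step t-1.
W_h = rng.standard_normal((d, d))
W_x = rng.standard_normal((d, d))
h = np.zeros(d)
for t in range(L):
    h = np.tanh(W_h @ h + W_x @ x[t])

# Causal self-attention: all positions computed at once, no loop over t.
scores = x @ x.T / np.sqrt(d)            # single head, no projections, for brevity
scores[np.triu_indices(L, k=1)] = -np.inf   # mask out future tokens
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = weights @ x                        # (L, d), every token in parallel
```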
In the GIF for the Convolution Representation, I think the equation "y_2 = CABx_0 + ..." is wrong. After multiplying by the kernel, the equation should be "y_2 = CA^2Bx_0 + ...".
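For anyone who wants to verify this correction, a small NumPy check (my own sketch, not from the post) that unrolling the recurrence indeed gives y_2 = CA^2Bx_0 + CABx_1 + CBx_2:

```python
# Unroll h_k = A h_{k-1} + B x_k, y_k = C h_k with h_{-1} = 0 and compare
# against the kernel form with the corrected leading term CA^2B.
import numpy as np

rng = np.random.default_rng(0)
N = 3
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(3)               # scalar inputs x_0, x_1, x_2

# Recurrent form
h = np.zeros((N, 1))
for k in range(3):
    h = A @ h + B * x[k]
y2_recurrent = (C @ h).item()

# Convolution (kernel) form
y2_kernel = (C @ A @ A @ B * x[0] + C @ A @ B * x[1] + C @ B * x[2]).item()
assert np.isclose(y2_recurrent, y2_kernel)
```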
This is a fine article, but with misleading terminology. Mamba is not a selective SSM; it is a gated neural network architecture. The selective SSM part of the original paper is S6 (S4 + selection), and Mamba can be implemented with or without it. In fact, Table 1 in the original paper compares Mamba as the architecture with S4, Hyena, or S6 as the layer.
What I can't understand is why the authors didn't just name it Mamba-S6 instead of using vague terminology in parts of the paper. Mamba-S6 would have been so much more descriptive, but hey, the genie is out of the bottle.
Hi Maarten!
Thank you very much for this post, I learned a lot reading it.
I just wanted to ask: in the section "Selectively Retain Information", are the shapes of matrices A, B, and C all DxN, or is that true only for matrix C, with matrices A and B having shapes NxN and NxD respectively?
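For other readers with the same question, this is how I read the shapes in the Mamba paper (Algorithm 2); a sketch of my interpretation, not code from the post: A stays (D, N) with one row per channel, while the selective B and C are computed from the input and therefore carry a token dimension.

```python
# Shape sketch of the selective SSM parameters (my reading of the paper).
import numpy as np

b, l, d, n = 2, 16, 8, 4                 # batch, length, model width, state size
rng = np.random.default_rng(0)
x = rng.standard_normal((b, l, d))       # input sequence

A = rng.standard_normal((d, n))          # structured (diagonal) A: (d, n)
W_B = rng.standard_normal((d, n))        # s_B, a linear projection of the input
W_C = rng.standard_normal((d, n))        # s_C, a linear projection of the input

B = x @ W_B                              # (b, l, n): a different B per token
C = x @ W_C                              # (b, l, n): a different C per token
```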
Very nicely explained! Thanks for this!
Thanks
Can you please explain how the weights of the 1D convolution are picked?
People like you who can explain very complex topics in understandable, bite-sized chunks are worth their weight in gold. Thank you!
Thank you so much for the well-written visual guide.
Do you think writing the equations the following way is valid?
h(t) = Ah(t-1) + Bx(t)
y(t) = Ch(t) + Dx(t)
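For readers puzzling over the same question, a small sketch (my own, assuming the zero-order-hold discretization used in the S4/Mamba papers) of how a recurrence of that shape falls out of the continuous equations once A and B are replaced by their discretized counterparts:

```python
# Discretize the continuous SSM h'(t) = A h(t) + B x(t) with step `delta`
# (zero-order hold), giving the recurrence h_k = A_bar h_{k-1} + B_bar x_k.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
N = 3
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, 1))
delta = 0.1                              # step size

A_bar = expm(delta * A)                                        # exp(dA)
B_bar = np.linalg.solve(delta * A, A_bar - np.eye(N)) @ (delta * B)
# B_bar = (dA)^{-1} (exp(dA) - I) dB

h_prev = np.zeros((N, 1))
x_k = 1.0
h_k = A_bar @ h_prev + B_bar * x_k       # one step of the discrete recurrence
```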
Such a well explained article! Thank you
Great article, thanks!
What do you think about also writing a visual guide to Hyena?
Thank you for sharing! Could you please tell us which tool you use to create these beautiful diagrams?
I had been reading the Mamba paper, and it went right over my head. I read this after the paper, and it is so good, so easy to understand, and very well structured. Thank you!!
Thank you Maarten, great post! I have a question regarding the parallel scan: are the B's supposed to be different?
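Not the author, but as I read the selective SSM, the discretized matrices are computed per token, so the B's in the scan would indeed differ from step to step. For anyone curious why the recurrence can be scanned in parallel at all, here is a small sketch (mine, not from the post) of the associative operator behind it:

```python
# Each step of h_k = a_k * h_{k-1} + b_k is represented by the pair (a_k, b_k);
# two steps compose as (a1, b1) o (a2, b2) = (a2*a1, a2*b1 + b2), which is
# associative, so the steps can be combined tree-style in parallel.
import numpy as np

def combine(s1, s2):
    a1, b1 = s1
    a2, b2 = s2
    return a2 * a1, a2 * b1 + b2

rng = np.random.default_rng(0)
a = rng.standard_normal(8)               # per-token "A_bar" values
b = rng.standard_normal(8)               # per-token "B_bar * x_k" values

# Sequential recurrence
h = 0.0
for k in range(8):
    h = a[k] * h + b[k]

# Same result by folding pairs with the associative operator
# (a real implementation combines them in a parallel tree).
s = (a[0], b[0])
for k in range(1, 8):
    s = combine(s, (a[k], b[k]))
assert np.isclose(h, s[1])
```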
This post is brilliant Maarten, I loved how you sucked us into this idea from ax to bread (explaining the idea down to its delicious CUDA)! Sorry if you've answered this before, but how do you make the matrix visualizations? They are great.