15 Comments
Feb 23 · Liked by Maarten Grootendorst

Thank you for taking the time to break down the complexity into visually appealing, bite-sized chunks. This is very interesting!

Thanks

Can you please explain how the weight of the conv 1d is picked?

People like you, who can explain very complex topics in understandable, bite-sized chunks, are worth their weight in gold. Thank you.

Thank you so much for the well-written visual guide.

Do you think writing the equation the following way is valid?

h(t) = Ah(t-1) + Bx(t)

y(t) = Ch(t) + Dx(t)

Such a well explained article! Thank you

Great article, thanks!

Great article, thanks!

What do you think about also writing a visual guide to Hyena?

Thank you for sharing. Could you please tell us which tool you use to create these beautiful diagrams?

I had been reading the Mamba paper, and it went right over my head. I read this after the paper, and it is so good, so easy to understand, and very well structured. Thank you!!

Thank you Maarten, great post! I have a question regarding the parallel scan: are the B_'s supposed to be different?

This post is brilliant, Maarten. Loved how you sucked us into this idea from ax to bread (explaining the idea down to its delicious CUDA)! Sorry if you answered this before, but how do you make the matrix visualizations? They are great.

I am very grateful to you Maarten for writing such an incredible piece. I am looking forward to your book.

Feb 26·edited Feb 26

Thank you so much for the clear explanation of the process and for making the figures.

I am still a little confused about some of the definitions:

1. Does k in x_k mean the timestep t?

2. Do x_0, x_1, x_2, ... represent the tokens in the input sequence? If so, is the number of timesteps (k) equal to the input length (number of tokens)?

3. For Bx_k, is the matrix multiplication (1xD)*(DxN)=1xN, or (LxD)*(DxN)=LxN?

Thank you. This was very helpful. The visualization is great.

Can you highlight how the matrices B and C are parametrized and learned?

Thank you!

author

I would definitely recommend the Annotated S4 - https://srush.github.io/annotated-s4/

It demonstrates, using JAX, how these matrices are learned, and it is a great next step after reading through this visual guide. Going from a visual guide to a hands-on one is a nice learning pipeline.
