Discussion about this post

Jenni_H:

Thank you for taking the time to break down the complexity into visually appealing, bite-sized chunks. This is very interesting!

William:

Great post, thank you so much for sharing your insights! I do have a different perspective regarding the claim that "RNNs could be fast for both training and inference." My understanding is that, during training, RNNs process sequences in a token-by-token manner due to their recurrent structure. In contrast, Transformers can leverage parallelism by processing all tokens in a sequence simultaneously using causal masks. This parallelization makes Transformer training potentially L times faster than RNNs, where L is the sequence length.
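To illustrate the difference, here is a minimal sketch, assuming PyTorch and toy dimensions (the module choices, names, and sizes are illustrative, not taken from the post): an RNN cell must loop over the L tokens because each hidden state depends on the previous one, while a causally masked attention layer computes all L positions in a single call.

```python
import torch
import torch.nn as nn

L, d = 8, 16                      # sequence length, hidden size (toy values)
x = torch.randn(1, L, d)          # one batch of L token embeddings

# --- RNN: the hidden state forces a sequential loop over the L tokens ---
rnn_cell = nn.RNNCell(d, d)
h = torch.zeros(1, d)
rnn_outputs = []
for t in range(L):                # L dependent steps; cannot run in parallel
    h = rnn_cell(x[:, t, :], h)   # h_t depends on h_{t-1}
    rnn_outputs.append(h)
rnn_outputs = torch.stack(rnn_outputs, dim=1)            # (1, L, d)

# --- Transformer attention: all L positions in one parallel pass ---
attn = nn.MultiheadAttention(d, num_heads=2, batch_first=True)
causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
attn_outputs, _ = attn(x, x, x, attn_mask=causal_mask)   # (1, L, d), one call

print(rnn_outputs.shape, attn_outputs.shape)
```

The key contrast is that the RNN's for-loop carries a true data dependency across time steps, whereas the masked attention call is one batched matrix operation, which is why Transformer training parallelizes over the sequence length while RNN training does not.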
