19 Comments
Leon Chen's avatar

One of the best educational resources for LLMs on the internet. A heartfelt thank you for all the work you put into these.

Bal's avatar

Should I convince my manager to continue doing this kind of content? ~> yes

Maarten Grootendorst's avatar

Thank you! I'm definitely hoping to do more of these ;)

Emanuel Maceira's avatar

The Per-Layer Embeddings trick in E2B/E4B is genuinely clever for edge deployments. Storing the PLE lookup table in flash rather than VRAM is exactly the kind of hardware-aware design choice that separates models that benchmark well from models that actually ship on devices.

From the IoT and edge AI side, the E2B with audio + vision + text is the most exciting variant in the family. Most smart edge devices (security cameras, industrial sensors, voice-enabled gateways) need to process multiple modalities locally. Having a 2B-effective-parameter model that handles all three without requiring separate encoder pipelines per modality dramatically simplifies the deployment stack.

The variable resolution soft token budget is also underappreciated. In real edge deployments, you're constantly making tradeoffs between inference speed and accuracy based on power budget, thermal envelope, and connectivity. Being able to dial image resolution from 70 to 1120 tokens gives edge developers a runtime knob they've never had before -- you could run at 70 tokens on battery, switch to 560 when plugged in, all with the same model.
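That battery/plugged-in switching could be sketched roughly like this (the 70/560 token budgets are from the post; the power-state flags and the `num_tokens` knob are illustrative stand-ins, not a real device or model API):

```python
# Hypothetical sketch: pick a soft-token budget for image encoding based
# on the device's power state. Budgets (70 on battery, 560 plugged in)
# are the values mentioned above; the flags are placeholders for a real
# platform power/thermal API.

def choose_token_budget(on_battery: bool, thermal_throttled: bool) -> int:
    """Return how many soft image tokens to spend on this frame."""
    if on_battery or thermal_throttled:
        return 70    # lowest resolution: cheapest inference
    return 560       # plugged in: spend more tokens for accuracy

# usage (hypothetical): model.encode_image(img, num_tokens=choose_token_budget(...))
```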

Please keep doing these visual guides -- the architecture diagrams alone are worth more than most technical blog posts. And yes, convince your manager.

Brendan's avatar

Yes, please keep doing this content! Very information dense, and I didn’t get the feeling I was reading verbose slop. Thank you!!

Jakub Lála's avatar

Amazing walkthrough, learned a lot!

Quick question on the vision projector bit — you mention it's "a small neural network" and then call it a "linear projection." Is there a nonlinearity in there, or is it just a single linear layer? Just wondering.

Lukas Martak's avatar

+1 had the same question while reading.

WvG's avatar

I have followed Maarten's posts (especially the visual guides) ever since he started publishing them. They are very intuitive, easy to understand, and clearly explained, which is why the O'Reilly book "Hands-on LLMs" (which Maarten wrote together with Jay Alammar) is so popular: it uses a lot of these kinds of visuals.

Almost tempting to call these visuals "Deep visuals" as a tribute to Maarten's work at DeepMind now :)

As to the question of whether the manager needs to be convinced: I think that is the wrong question. A manager who does NOT support this work should not be Maarten's manager in the first place. Because if you think about it: if Google is big on making model families such as Gemma available to the public, then any (visual) guide that goes along with this fits right into that strategy.

Keep up the great work, Maarten! And let me know if I need to talk to your manager if he/she needs convincing :)

lou's avatar

yes!! you should definitely continue this kind of amazing content, with or without convincing your manager hahaha

Uneet Kumar Singh's avatar

Amazing read!

One confusion here:

"With Mixture of Experts in Gemma 4, only 8 experts and 1 shared expert is actually used for intermediate calculations. All other 119 experts can take a backseat. These are the active parameters and represent the “A” in “26B A4B”. "

Should it be 119 or 120 experts taking a back seat? There are 128 + 1 experts (mentioned in the figure as well), and out of these 9 are selected at a time, leaving 120 inactive.
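For what it's worth, the arithmetic the question describes can be checked directly (numbers taken from the quoted passage and the figure it references):

```python
# Expert counts from the quoted passage: 128 routed experts plus 1
# shared expert, with 8 routed experts + the shared expert active per token.
routed_experts = 128
shared_experts = 1
active_routed = 8

total = routed_experts + shared_experts    # 129 experts overall
active = active_routed + shared_experts    # 9 active per token
inactive = total - active
print(inactive)  # 120 experts take a back seat
```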

Selina's avatar

Best illustration of Gemma 4! I was trying to understand the code by simplifying it (https://animadversio.github.io/gemma4-simple/), but your post is a much better visual guide to the model! Learned a lot.

Matt Wigdahl's avatar

This is wonderful content! Definitely a vote from me on continuing if you are able!

Lukas Martak's avatar

YES. Convince your manager. This was great. Thanks for sharing!

Shanya Chaubey's avatar

Hi, thanks for this amazing explanation. Question: in global attention, how is key = query if the size of the query is doubled? Unless the size of the key is also doubled, to ensure we can multiply q^T * K. Dimension-wise, would both the q and K weights need to be 512?
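A minimal sketch of the shape constraint this question is about (the sizes here are illustrative, not the model's actual dimensions): for the scores q^T K to be defined, queries and keys must share the same per-head dimension, regardless of how many query or key heads there are.

```python
# Illustrative dot-product attention scores with plain lists.
# q and every key vector must have the same length (the head dimension);
# the number of key vectors is independent of that length.

d_head = 4                                  # shared head dimension
q = [1.0, 0.0, 2.0, 1.0]                    # one query vector, length d_head
K = [[1.0, 1.0, 0.0, 0.0],                  # two key vectors, each length d_head
     [0.0, 2.0, 1.0, 1.0]]

scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
print(scores)  # one attention score per key: [1.0, 3.0]
```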

Cha_le's avatar

Thank you so much for the write up. This has been very insightful.

Human Systems's avatar

Hey — I came across your writing and really liked how you think.

I’m exploring something similar from a different angle — writing about human behavior through a system design lens (like debugging internal patterns).

Just started publishing on Substack. If you ever get a moment to read, I’d genuinely value your perspective.

Also happy to support your work — feels like there’s an interesting overlap here.

Michael's avatar

I think there's a typo!

"These embeddings are quite a bit smaller (256 versus 1536 dimensions in E2B and 2056 in E4B)"

should be 2560, I think :)