One of the best educational resources for LLMs on the internet. A heartfelt thank you for all the work you put into these.
Should I convince my manager to continue doing this kind of content? ~> yes
Thank you! I'm definitely hoping to do more of these ;)
The Per-Layer Embeddings trick in E2B/E4B is genuinely clever for edge deployments. Storing the PLE lookup table in flash rather than VRAM is exactly the kind of hardware-aware design choice that separates models that benchmark well from models that actually ship on devices.
From the IoT and edge AI side, the E2B with audio + vision + text is the most exciting variant in the family. Most smart edge devices (security cameras, industrial sensors, voice-enabled gateways) need to process multiple modalities locally. Having a 2B-effective-parameter model that handles all three without requiring separate encoder pipelines per modality dramatically simplifies the deployment stack.
The variable resolution soft token budget is also underappreciated. In real edge deployments, you're constantly making tradeoffs between inference speed and accuracy based on power budget, thermal envelope, and connectivity. Being able to dial image resolution from 70 to 1120 tokens gives edge developers a runtime knob they've never had before -- you could run at 70 tokens on battery, switch to 560 when plugged in, all with the same model.
Please keep doing these visual guides -- the architecture diagrams alone are worth more than most technical blog posts. And yes, convince your manager.
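The battery/plugged-in tradeoff described above could be sketched as a tiny policy function. This is a made-up illustration, not a real API -- only the budget values (70 / 560 / 1120 soft tokens) come from the post; the selection policy is invented:

```python
# Hedged sketch of a power-aware soft-token budget selector.
# The budgets (70 / 560 / 1120) are from the post; the policy is made up.

def pick_token_budget(on_battery: bool, thermal_throttled: bool) -> int:
    """Choose how many soft image tokens to spend per frame."""
    if on_battery or thermal_throttled:
        return 70   # cheapest setting: lowest image resolution
    return 560      # plugged in and cool: higher resolution

# 1120 tokens would be the maximum-quality setting, e.g. offline review.
print(pick_token_budget(on_battery=True, thermal_throttled=False))   # 70
print(pick_token_budget(on_battery=False, thermal_throttled=False))  # 560
```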
Yes, please keep doing this content! Very information dense and didn’t get the feeling I was reading verbose slop. Thank you!!
Amazing walkthrough, learned a lot!
Quick question on the vision projector bit — you mention it's "a small neural network" and then call it a "linear projection." Is there a nonlinearity in there, or is it just a single linear layer? Just wondering.
+1 had the same question while reading.
I have been following Maarten's posts (especially the visual guides) ever since he started publishing them. They are very intuitive, easy to understand, and clearly explained. That is also why the O'Reilly book "Hands-on LLMs" (which Maarten wrote together with Jay Alammar) is so popular: it uses a lot of these kinds of visuals.
Almost tempting to call these visuals "Deep visuals" as a tribute to Maarten's work at DeepMind now :)
As to the question of whether the manager needs to be convinced: I think that is the wrong question. The real point is that a manager who does NOT support this work should not be Maarten's manager in the first place. Because if you think about it: if Google is big on making model families such as Gemma available to the public, then any (visual) guide that accompanies them fits right into that strategy.
Keep up the great work, Maarten! And let me know if I need to talk to your manager if he/she needs convincing :)
yes!! You should definitely continue this kind of amazing content, with or without convincing your manager hahaha
Amazing read!
One confusion here:
"With Mixture of Experts in Gemma 4, only 8 experts and 1 shared expert is actually used for intermediate calculations. All other 119 experts can take a backseat. These are the active parameters and represent the “A” in “26B A4B”. "
Should it be 119 or 120 experts taking a back seat? There are 128 + 1 experts (as mentioned in the figure as well), and 9 of these are selected at a time, leaving 120 inactive.
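For what it's worth, the expert bookkeeping can be written out explicitly (a minimal sketch using only the counts quoted above), and it does come out to 120:

```python
# Expert counts for the "26B A4B" MoE layer as quoted above.
NUM_ROUTED_EXPERTS = 128  # routed experts per layer (from the figure)
NUM_SHARED_EXPERTS = 1    # always-active shared expert
TOP_K = 8                 # routed experts selected per token

total_experts = NUM_ROUTED_EXPERTS + NUM_SHARED_EXPERTS  # 129
active_experts = TOP_K + NUM_SHARED_EXPERTS              # 9
inactive_experts = total_experts - active_experts        # 120, not 119

print(total_experts, active_experts, inactive_experts)  # 129 9 120
```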
Best illustration of Gemma 4! I was trying to understand the code by simplifying it: https://animadversio.github.io/gemma4-simple/ But your post is a much better visual guide to the model! Learned a lot.
This is wonderful content! Definitely a vote from me on continuing if you are able!
YES. Convince your manager. This was great. Thanks for sharing!
Hi, thanks for this amazing explanation. Question: in global attention, how is the key equal to the query if the size of the query is doubled? Unless the size of the key is also doubled to ensure we can multiply q^T * K. Dimension-wise, wouldn't both the q and K weights need to be 512?
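A quick shape check may make the question concrete. This assumes grouped-query attention, where only the NUMBER of query heads changes while the per-head dimension stays fixed -- so q^T * K still works per head. The dimensions below are illustrative, not the actual Gemma config:

```python
import numpy as np

# Illustrative grouped-query-attention shapes (NOT the real Gemma config):
# doubling the number of query heads leaves the per-head dimension alone,
# so q @ k^T still works once k is repeated across query-head groups.
head_dim = 64
num_q_heads = 8     # say, the "doubled" query heads
num_kv_heads = 4    # fewer key/value heads
seq_len = 16

q = np.random.randn(num_q_heads, seq_len, head_dim)
k = np.random.randn(num_kv_heads, seq_len, head_dim)

# Each pair of query heads shares one key head.
k_rep = np.repeat(k, num_q_heads // num_kv_heads, axis=0)
scores = q @ k_rep.transpose(0, 2, 1)
print(scores.shape)  # (8, 16, 16)
```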
Thank you so much for the write up. This has been very insightful.
Hey — I came across your writing and really liked how you think.
I’m exploring something similar from a different angle — writing about human behavior through a system design lens (like debugging internal patterns).
Just started publishing on Substack. If you ever get a moment to read, I’d genuinely value your perspective.
Also happy to support your work — feels like there’s an interesting overlap here.
I think there is a typo!
"These embeddings are quite a bit smaller (256 versus 1536 dimensions in E2B and 2056 in E4B)"
should be 2560, I think :)
Very helpful!