One of the best educational resources for LLMs on the internet. A heartfelt thank you for all the work you put into these.
Should I convince my manager to continue doing this kind of content? ~> yes
Thank you! I'm definitely hoping to do more of these ;)
The Per-Layer Embeddings trick in E2B/E4B is genuinely clever for edge deployments. Storing the PLE lookup table in flash rather than VRAM is exactly the kind of hardware-aware design choice that separates models that benchmark well from models that actually ship on devices.
From the IoT and edge AI side, the E2B with audio + vision + text is the most exciting variant in the family. Most smart edge devices (security cameras, industrial sensors, voice-enabled gateways) need to process multiple modalities locally. Having a 2B-effective-parameter model that handles all three without requiring separate encoder pipelines per modality dramatically simplifies the deployment stack.
The variable resolution soft token budget is also underappreciated. In real edge deployments, you're constantly making tradeoffs between inference speed and accuracy based on power budget, thermal envelope, and connectivity. Being able to dial image resolution from 70 to 1120 tokens gives edge developers a runtime knob they've never had before -- you could run at 70 tokens on battery, switch to 560 when plugged in, all with the same model.
Please keep doing these visual guides -- the architecture diagrams alone are worth more than most technical blog posts. And yes, convince your manager.
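The battery/plugged-in tradeoff described above could be sketched as a tiny policy function. This is a made-up illustration, not a real API -- only the budget values (70 / 560 / 1120 soft tokens) come from the post; the selection policy is invented:

```python
# Hedged sketch of a power-aware soft-token budget selector.
# The budgets (70 / 560 / 1120) are from the post; the policy is made up.

def pick_token_budget(on_battery: bool, thermal_throttled: bool) -> int:
    """Choose how many soft image tokens to spend per frame."""
    if on_battery or thermal_throttled:
        return 70   # cheapest setting: lowest image resolution
    return 560      # plugged in and cool: higher resolution

# 1120 tokens would be the maximum-quality setting, e.g. offline review.
print(pick_token_budget(on_battery=True, thermal_throttled=False))   # 70
print(pick_token_budget(on_battery=False, thermal_throttled=False))  # 560
```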
Yes, please keep doing this content! Very information dense and didn’t get the feeling I was reading verbose slop. Thank you!!
Amazing walkthrough, learned a lot!
Quick question on the vision projector bit — you mention it's "a small neural network" and then call it a "linear projection." Is there a nonlinearity in there, or is it just a single linear layer? Just wondering.
+1 had the same question while reading.
I have been following Maarten's posts (especially the visual guides) ever since he started publishing them. They are very intuitive, easy to understand, and clearly explained. That is also why the O'Reilly book "Hands-on LLMs" (which Maarten wrote together with Jay Alammar) is so popular: it uses a lot of these kinds of visuals.
Almost tempting to call these visuals "Deep visuals" as a tribute to Maarten's work at DeepMind now :)
As to the question of whether the manager needs to be convinced: I think that is the wrong question. The real point is that a manager who does NOT support this work should not be Maarten's manager in the first place. Because if you think about it: if Google is big on making model families such as Gemma available to the public, then any (visual) guide that accompanies them fits right into that strategy.
Keep up the great work, Maarten! And let me know if I need to talk to your manager if he/she needs convincing :)
yes!! You should definitely continue this kind of amazing content, with or without convincing your manager hahaha
Amazing read!
One confusion here:
"With Mixture of Experts in Gemma 4, only 8 experts and 1 shared expert is actually used for intermediate calculations. All other 119 experts can take a backseat. These are the active parameters and represent the “A” in “26B A4B”. "
Should it be 119 or 120 experts taking a back seat? There are 128 + 1 experts (as mentioned in the figure as well), and 9 of these are selected at a time, leaving 120 inactive.
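For what it's worth, the expert bookkeeping can be written out explicitly (a minimal sketch using only the counts quoted above), and it does come out to 120:

```python
# Expert counts for the "26B A4B" MoE layer as quoted above.
NUM_ROUTED_EXPERTS = 128  # routed experts per layer (from the figure)
NUM_SHARED_EXPERTS = 1    # always-active shared expert
TOP_K = 8                 # routed experts selected per token

total_experts = NUM_ROUTED_EXPERTS + NUM_SHARED_EXPERTS  # 129
active_experts = TOP_K + NUM_SHARED_EXPERTS              # 9
inactive_experts = total_experts - active_experts        # 120, not 119

print(total_experts, active_experts, inactive_experts)  # 129 9 120
```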
Best illustration of Gemma 4! I was trying to understand the code by simplifying it: https://animadversio.github.io/gemma4-simple/ But your post is a much better visual guide to the model! Learned a lot.
This is wonderful content! Definitely a vote from me on continuing if you are able!
YES. Convince your manager. This was great. Thanks for sharing!
Hi, thanks for this amazing explanation. Question: in global attention, how is the key equal to the query if the size of the query is doubled? Unless the size of the key is also doubled to ensure we can multiply q^T * K. Dimension-wise, wouldn't both the q and K weights need to be 512?
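A quick shape check may make the question concrete. This assumes grouped-query attention, where only the NUMBER of query heads changes while the per-head dimension stays fixed -- so q^T * K still works per head. The dimensions below are illustrative, not the actual Gemma config:

```python
import numpy as np

# Illustrative grouped-query-attention shapes (NOT the real Gemma config):
# doubling the number of query heads leaves the per-head dimension alone,
# so q @ k^T still works once k is repeated across query-head groups.
head_dim = 64
num_q_heads = 8     # say, the "doubled" query heads
num_kv_heads = 4    # fewer key/value heads
seq_len = 16

q = np.random.randn(num_q_heads, seq_len, head_dim)
k = np.random.randn(num_kv_heads, seq_len, head_dim)

# Each pair of query heads shares one key head.
k_rep = np.repeat(k, num_q_heads // num_kv_heads, axis=0)
scores = q @ k_rep.transpose(0, 2, 1)
print(scores.shape)  # (8, 16, 16)
```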
Thank you so much for the write up. This has been very insightful.
Hey — I came across your writing and really liked how you think.
I’m exploring something similar from a different angle — writing about human behavior through a system design lens (like debugging internal patterns).
Just started publishing on Substack. If you ever get a moment to read, I’d genuinely value your perspective.
Also happy to support your work — feels like there’s an interesting overlap here.
I think there is a typo!
"These embeddings are quite a bit smaller (256 versus 1536 dimensions in E2B and 2056 in E4B)"
should be 2560, I think :)
Very helpful!