A Visual Guide to Gemma 4 12B

Maarten Grootendorst

Jun 3

An in-depth explainer of Gemma 4 12B: a unified, encoder-free multimodal model!


13 Comments

Mohamed Yousef

Jun 4

Nice post, thanks!

But I am missing a very important piece of info: benchmarks :)

What net effect did this have? For example, a side-by-side comparison between two models with and without the encoders. Do we need more data? More training? What is the performance gap?

Thanks!


Max Andreacchi

Jun 5

I have the same question, curious if any evaluations exist identifying performance differences between multimodal models that have encoders versus those that don’t.

siyu

Jun 5

"The removal of the encoders, which are typically in charge of making sense of the multimodal inputs, places the burden of making sense of all outputs on the LLM."

What kind of sloppy writing is this??

No one is making "sense" of anything!! Model weights are just being updated to reduce the output error. The author should know better.

Jan

Jun 5

I'm curious how the attention mask is handled for the multimodal "patches" (tokens?). Generative LLMs often encode text (the prompt) with left-to-right attention (not always, I understand). But I would guess that for image/audio they keep all-to-all attention across the patches?
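
To make the question above concrete, here is a minimal sketch of the kind of mask it describes: left-to-right (causal) attention for text tokens, plus all-to-all attention inside each block of image or audio patches. It only illustrates the commenter's guess and is not confirmed as what Gemma 4 12B does. The function name `build_mixed_mask` and the `modality_ids` layout are hypothetical.

```python
from typing import List


def build_mixed_mask(modality_ids: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumed (hypothetical) layout: modality_ids[i] == 0 marks a text token;
    a positive value marks a patch and names the image/audio block it belongs to.
    """
    n = len(modality_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            # Standard left-to-right rule: a position sees itself and earlier ones.
            causal = k <= q
            # Patches of the same block see each other in both directions.
            same_block = modality_ids[q] != 0 and modality_ids[q] == modality_ids[k]
            mask[q][k] = causal or same_block
    return mask


# Two text tokens, one three-patch image (block 1), then one more text token.
ids = [0, 0, 1, 1, 1, 0]
for row in build_mixed_mask(ids):
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the three patch rows can see each other regardless of order, while the text rows only see positions at or before their own.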

Max Andreacchi

Jun 5

Incredible post making a complex topic very digestible. Kudos!

Rubens Mau

Jun 4

Thanks for your wonderful work!

Suneel Marthi

Jun 4

Enjoyed reading this post Maarten. Thank you.

Pengqian Han

Jun 4

Amazing tech and an amazing blog. Thanks for sharing!

Eteimorde Youdiowei

Jun 3

Awesome breakdown, thanks for this. I have a couple of questions. Since 2D-RoPE has been removed, how does the model learn relative positioning in images? The previous Gemma 4 vision models had both 2D-RoPE and an X/Y patch embeddings table. Doesn't removing 2D-RoPE affect the vision capabilities of the 12B model?

Sushrut Shitoot

Jun 3

Indeed! The previous posts were also super helpful, but the 12B really fills that empty slot: it has enough compute power yet is still usable locally by the average retail customer. I see much more value in this one. Thanks again!