13 Comments
Mohamed Yousef's avatar

Nice post.. thanks!

But I am missing a very important piece of info.. benchmarks :)

What net effect did this have? E.g., a side-by-side comparison between two models with and without the encoders. Do we need more data? More training? What gap do we have in terms of performance?

Thanks!

Max Andreacchi's avatar

I have the same question, curious if any evaluations exist identifying performance differences between multimodal models that have encoders versus those that don’t.

siyu's avatar

"The removal of the encoders, which are typically in charge of making sense of the multimodal inputs, places the burden of making sense of all outputs on the LLM."

What kind of sloppy writing is this??

No one is making "sense" of anything!! Model weights are just being updated to reduce the output error.. the author should know better

Jan's avatar

I'm curious how the attention mask is handled for the multi-modal "patches" (tokens?). Generative LLMs often encode text (the prompt) with left-to-right attention (not always, I understand). But I would guess that for image/audio they keep everyone-to-everyone attention across the patches?

Max Andreacchi's avatar

Incredible post making a complex topic very digestible. Kudos!

Rubens Mau's avatar

Thanks for your wonderful work!

Suneel Marthi's avatar

Enjoyed reading this post Maarten. Thank you.

Pengqian Han's avatar

Very amazing tech, and a very amazing blog. Thanks for sharing!

Eteimorde Youdiowei's avatar

Awesome breakdown, thanks for this. I have a couple of questions. Since 2D-RoPE has been removed, how does the model learn relative positioning in images? The previous Gemma 4 vision models had both 2D-RoPE and an X/Y patch embeddings table. Doesn't the removal of 2D-RoPE affect the vision capabilities of the 12B model?

Sushrut Shitoot's avatar

Indeed! The previous posts were also super helpful, but the 12B indeed fills that empty slot where you have enough compute power but the model is still locally usable by the average retail customer. I see much more value in this one. Thanks again!

Sushrut Shitoot's avatar

This is useful. Also, this was pretty quick! I literally saw it on X 15 minutes ago 😅

Maarten Grootendorst's avatar

It helps that I could prepare this in advance ;) How cool would it be though if I could make something like this in 15 minutes!