Nice post.. thanks!
But I am missing a very important piece of info: benchmarks :)
What net effect did this have? E.g. a side-by-side comparison between two models with and without the encoders. Do we need more data? More training? How big is the performance gap?
Thanks!
I have the same question. I'm curious whether any evaluations exist that identify performance differences between multimodal models that have encoders and those that don't.
"The removal of the encoders, which are typically in charge of making sense of the multimodal inputs, places the burden of making sense of all outputs on the LLM."
What kind of sloppy writing is this??
No one is making "sense" of anything!! Model weights are just being updated to reduce the output error. The author should know better.
I'm curious how the attention mask is handled for the multimodal "patches" (tokens?). Generative LLMs often encode text (the prompt) with left-to-right attention (not always, I understand). But I would guess that for image/audio they keep everyone-to-everyone attention across the patches?
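For what it's worth, the pattern guessed at above (left-to-right for text, everyone-to-everyone inside each image/audio block) can be sketched as a boolean mask. This only illustrates that guess and is not a description of how the model in the post handles it. The function name `build_mixed_attention_mask` and the `(kind, length)` segment format are made up for the example.

```python
import numpy as np

def build_mixed_attention_mask(segments):
    """Build a (T, T) boolean mask where mask[i, j] is True if position i may attend to j.

    segments: list of (kind, length) pairs, e.g. [("text", 2), ("image", 3)].
    Assumption: text is causal; any non-text block is bidirectional within itself.
    """
    total = sum(length for _, length in segments)
    # Causal baseline: every position sees itself and everything before it.
    mask = np.tril(np.ones((total, total), dtype=bool))
    start = 0
    for kind, length in segments:
        end = start + length
        if kind != "text":
            # Patches of the same image/audio block can all see each other.
            mask[start:end, start:end] = True
        start = end
    return mask

mask = build_mixed_attention_mask([("text", 2), ("image", 3), ("text", 2)])
print(mask.astype(int))
```

Note that with this layout the image patches still cannot see the text that comes after them; only attention inside the block is opened up.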
Incredible post making a complex topic very digestible. Kudos!
Thanks for your wonderful work!
Enjoyed reading this post Maarten. Thank you.
Amazing tech and an amazing blog, thanks for sharing!
Awesome breakdown, thanks for this. I have a couple of questions. Since 2D-RoPE has been removed, how does the model learn relative positioning in images? The previous Gemma 4 vision models had both 2D-RoPE and an X/Y patch embeddings table. Doesn't the removal of 2D-RoPE affect the vision capabilities of the 12B model?
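To make the question concrete, here is a minimal sketch of what an X/Y patch embeddings table generally looks like: one lookup table indexed by column, one by row, both added to each patch embedding. This is a generic illustration, not the actual Gemma implementation. The name `add_xy_position_embeddings` is hypothetical and the tables are random stand-ins for learned parameters.

```python
import numpy as np

def add_xy_position_embeddings(patches, x_table, y_table):
    """Add per-column and per-row position embeddings to a grid of patch embeddings.

    patches: (H, W, D) patch embeddings laid out on the image grid.
    x_table: (max_W, D) table indexed by column; y_table: (max_H, D) indexed by row.
    """
    h, w, _ = patches.shape
    # Broadcast the row embedding across columns and the column embedding across rows.
    return patches + y_table[:h, None, :] + x_table[None, :w, :]

rng = np.random.default_rng(0)
patches = rng.normal(size=(2, 3, 4))   # 2 rows x 3 columns of patches, dim 4
x_table = rng.normal(size=(8, 4))      # stand-in for a learned column table
y_table = rng.normal(size=(8, 4))      # stand-in for a learned row table
out = add_xy_position_embeddings(patches, x_table, y_table)
print(out.shape)
```

A table like this gives each patch an absolute (row, column) signal, whereas 2D-RoPE encodes relative offsets inside attention, which is why the question of dropping one of them is interesting.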
Indeed! The previous posts were also super helpful, but the 12B fills that empty slot where you have enough compute power and the model is still locally usable by the average retail customer. I see much more value in this one. Thanks again!
Great write up!
This is useful. Also, this was pretty quick! I literally saw it on X 15 minutes ago 😅
It helps that I could prepare this in advance ;) How cool would it be though if I could make something like this in 15 minutes!