Discussion about this post

Mohamed Yousef

Nice post, thanks!

But I am missing a very important piece of info: benchmarks :)

What net effect did this have? For example, a side-by-side comparison between two models with and without the encoders. Do we need more data? More training? What gap do we have in terms of performance?

Thanks!

Eteimorde Youdiowei

Awesome breakdown, thanks for this. I have a couple of questions. Since 2D-RoPE has been removed, how does the model learn relative positioning in images? The previous Gemma 4 vision models had both 2D-RoPE and an X/Y patch embeddings table. Doesn't removing 2D-RoPE affect the vision capabilities of the 12B model?
