In a major breakthrough for computer vision, Google DeepMind has introduced Vision Banana, an instruction-tuned model that combines image generation and visual understanding in a single system.
The research, published in the paper “Image Generators are Generalist Vision Learners”, challenges the long-held belief that generative models and vision models must remain separate. Vision Banana demonstrates that a single model can both create and interpret images with high accuracy.
Built on Google’s base image generator Nano Banana Pro, the model was enhanced through lightweight instruction tuning. This approach allows it to perform tasks such as semantic segmentation, instance segmentation, depth estimation, and surface normal prediction, while retaining its original image generation capabilities.
Instead of using separate modules for each task, Vision Banana outputs results as RGB images. These outputs follow precise color mappings, enabling them to be converted back into measurable data. This unified approach allows different tasks to be handled by simply changing prompts, without modifying the model itself.
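To illustrate how a colour-coded output image could be turned back into measurable data, here is a minimal sketch in Python. The article does not give the colour mapping Vision Banana uses, so the `PALETTE` values and the `decode_segmentation` helper are hypothetical. The nearest-colour matching is an assumption made here to tolerate small colour deviations in a generated image.

```python
import numpy as np

# Hypothetical palette: the real colour mapping is not given in the article.
PALETTE = {
    "background": (0, 0, 0),
    "person": (255, 0, 0),
    "car": (0, 0, 255),
}


def decode_segmentation(rgb, palette=PALETTE):
    """Map each pixel of an H x W x 3 image to the index of its nearest palette colour.

    Returns an H x W array of class indices and the list of class names,
    where index i in the array refers to names[i].
    """
    names = list(palette)
    colours = np.array([palette[n] for n in names], dtype=np.float32)  # (K, 3)
    pixels = rgb.astype(np.float32)[:, :, None, :]                     # (H, W, 1, 3)
    # Squared Euclidean distance from every pixel to every palette colour.
    dist = np.sum((pixels - colours[None, None, :, :]) ** 2, axis=-1)  # (H, W, K)
    return dist.argmin(axis=-1), names


# A 2 x 2 "generated" image whose colours are slightly off the palette.
image = np.array(
    [[[250, 5, 3], [2, 1, 0]],
     [[10, 0, 240], [255, 0, 0]]],
    dtype=np.uint8,
)
labels, names = decode_segmentation(image)
```

In this toy case the slightly noisy red and blue pixels are still assigned to "person" and "car", which is the property that makes an RGB output usable as a segmentation map.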
The model was also not trained on data from the evaluation benchmarks, so its scores are intended to reflect generalisation rather than memorisation.
Performance results highlight its capabilities. In semantic segmentation, it achieved an mIoU of 0.699, outperforming SAM 3’s 0.652. In reasoning segmentation, it reached a gIoU of 0.793, surpassing SAM 3 Agent. For metric depth estimation, it scored 0.929 compared to Depth Anything V3’s 0.918, despite using no real-world depth data.
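For readers unfamiliar with the segmentation metric quoted above, mIoU (mean intersection over union) averages, across classes, the overlap between predicted and reference masks divided by their union. The sketch below is a generic illustration, not the evaluation code used in the paper. The `mean_iou` function name is hypothetical, and individual benchmarks may differ in how they average across images or handle absent classes.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for two label maps.

    Classes that appear in neither the prediction nor the target are skipped.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")


pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

Here class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12, roughly 0.583.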
The approach also brings efficiency gains: it reduces the need for large datasets, supports multiple tasks within one model, and maintains its generative performance. In benchmarks, it achieved a 53.5% win rate in text-to-image tasks and remained competitive in image editing tasks.
Another key advantage is its ability to work across different hardware setups and conditions, making it more flexible for large-scale deployment.
Vision Banana signals a shift toward generalist AI models that can both generate and understand visual data, reducing complexity while improving performance across tasks.