Wednesday, March 11, 2026


Google launches Gemini Embedding 2 to unify text, images, audio and video understanding

A new development in artificial intelligence is set to simplify how machines process different types of data. Google has introduced Gemini Embedding 2, its first fully multimodal embedding model, designed to map text, images, audio and video into a single shared embedding space.

The company shared details of the model in a blog post, highlighting that Gemini Embedding 2 is the successor to its earlier text-only embedding model released last year. The new system is capable of understanding semantic meaning across more than 100 languages. It is currently available in public preview through the Gemini API and Vertex AI.

Typically, artificial intelligence systems embed different types of data, such as text, images, audio and video, in separate vector spaces, one per modality. When a user requests information, the model searches only within the relevant format. As a result, a concept like a “cat” mentioned in text and a “cat” shown in a video may be treated as two unrelated items.

Gemini Embedding 2 addresses this limitation with a single architecture that maps all forms of content into one shared embedding space. This allows the system to analyse mixed inputs, such as documents that contain both images and text, more naturally, closer to how humans interpret information.
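The idea of a shared embedding space can be sketched with toy vectors. The values below are illustrative only, not actual model outputs, and real embeddings have hundreds or thousands of dimensions; the point is that the same concept lands close together regardless of modality, which a similarity measure such as cosine similarity can detect:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings (hand-written illustrative values).
text_cat   = [0.9, 0.1, 0.2, 0.0]   # the word "cat" in a document
video_cat  = [0.8, 0.2, 0.1, 0.1]   # a cat appearing in a video frame
text_stock = [0.0, 0.1, 0.9, 0.7]   # an unrelated text snippet

# In a shared space, the cross-modal pair scores high and the
# unrelated pair scores low.
print(cosine_similarity(text_cat, video_cat))    # high (same concept)
print(cosine_similarity(text_cat, text_stock))   # low  (different concepts)
```

With separate per-modality spaces, this comparison would not even be well-defined; a unified space is what makes it a single arithmetic operation.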

According to Google, the new model simplifies “complex pipelines and enhances a wide variety of multimodal downstream tasks.” These include Retrieval-Augmented Generation (RAG), semantic search, sentiment analysis and data clustering.
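Semantic search, one of the downstream tasks Google names, reduces to ranking pre-computed embeddings by similarity to a query embedding. The sketch below uses hand-written toy vectors in place of real model outputs; in practice both the corpus and the query would be embedded by the model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical pre-computed embeddings for a tiny corpus.
corpus = {
    "How to care for a kitten":  [0.9, 0.1, 0.1],
    "Quarterly earnings report": [0.1, 0.9, 0.2],
    "Best cat food brands":      [0.8, 0.2, 0.1],
}

# Toy embedding standing in for the query "cat advice".
query_embedding = [0.85, 0.1, 0.1]

# Rank documents by similarity to the query; the top hit is the
# most semantically related, even without shared keywords.
ranked = sorted(corpus,
                key=lambda doc: cosine(query_embedding, corpus[doc]),
                reverse=True)
print(ranked[0])
```

In a RAG pipeline, the top-ranked documents would then be passed to a generative model as context, which is why retrieval quality depends directly on embedding quality.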

The model also offers several technical capabilities. It supports a text context window of up to 8,192 input tokens. It can process up to 6 images per request in PNG and JPEG formats, and handle video inputs of up to 120 seconds in MP4 and MOV formats. Additionally, the system can process audio data directly without requiring text transcriptions.

Gemini Embedding 2 can also embed PDF documents of up to 6 pages. Another key capability is its ability to understand interleaved inputs, meaning users can send multiple data types such as text and images in the same request. Google said this feature helps the model develop a more accurate understanding of complex real-world information.
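A client might pre-validate an interleaved request against the limits stated above (8,192 text tokens; 6 PNG/JPEG images; 120 seconds of MP4/MOV video; 6-page PDFs). The field names and request shape below are illustrative assumptions, not the actual Gemini API schema:

```python
# Limits as stated in Google's announcement; the dict layout is ours.
LIMITS = {
    "max_text_tokens": 8192,
    "max_images": 6,
    "image_formats": {"png", "jpeg"},
    "max_video_seconds": 120,
    "video_formats": {"mp4", "mov"},
    "max_pdf_pages": 6,
}

def validate_request(parts):
    """Return a list of limit violations for an interleaved request."""
    errors = []
    images = [p for p in parts if p["type"] == "image"]
    if len(images) > LIMITS["max_images"]:
        errors.append(f"too many images: {len(images)}")
    for p in parts:
        if p["type"] == "image" and p["format"] not in LIMITS["image_formats"]:
            errors.append(f"unsupported image format: {p['format']}")
        elif p["type"] == "video":
            if p["format"] not in LIMITS["video_formats"]:
                errors.append(f"unsupported video format: {p['format']}")
            if p["seconds"] > LIMITS["max_video_seconds"]:
                errors.append(f"video too long: {p['seconds']}s")
        elif p["type"] == "text" and p["tokens"] > LIMITS["max_text_tokens"]:
            errors.append(f"text too long: {p['tokens']} tokens")
        elif p["type"] == "pdf" and p["pages"] > LIMITS["max_pdf_pages"]:
            errors.append(f"PDF too long: {p['pages']} pages")
    return errors

# One interleaved request mixing text, an image and a video.
request = [
    {"type": "text", "tokens": 1200},
    {"type": "image", "format": "png"},
    {"type": "video", "format": "mp4", "seconds": 150},
]
print(validate_request(request))  # ['video too long: 150s']
```

The interleaving itself is just a single ordered list of mixed-type parts, which is what lets the model see text and images in context rather than as separate calls.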

