Microsoft unveils 3 new AI models for image, voice and speech processing The Mainstream
Monday, April 6, 2026


Microsoft unveils 3 new AI models for image, voice and speech processing

Expanding its artificial intelligence portfolio, Microsoft has introduced a new set of specialised models designed to enhance multimedia generation and transcription capabilities.

The company announced three new AI models, MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, covering speech-to-text conversion, voice generation, and image creation respectively. The models are currently available through Microsoft Foundry and the MAI Playground, and are also being integrated into several consumer products.

MAI-Transcribe-1 is positioned as a key highlight. Microsoft claims it delivers state-of-the-art speech-to-text transcription across 25 widely used languages. In the company's internal testing on the FLEURS benchmark, the model reportedly achieved lower error rates than competing models such as Gemini 3.1 Flash and GPT-Transcribe. Microsoft also says it offers the “best price-performance of any large cloud provider.”
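For readers unfamiliar with how transcription error rates are measured, the sketch below computes word error rate (WER), a metric commonly reported for speech-to-text benchmarks. The report does not say which metric Microsoft used, so treating "error rate" as WER is an assumption, and the function name `word_error_rate` is illustrative rather than taken from any Microsoft tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenisation and no text normalisation; real
    benchmark pipelines usually normalise case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A lower value means fewer substituted, dropped, or inserted words per reference word. A transcript that misses one word of a six-word sentence would score 1/6 under this definition.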

MAI-Voice-1 focuses on generating natural and expressive speech. It is designed to produce realistic voice output with emotional depth and consistency, even in long-form content. Within Foundry, users can create custom voices using just a few seconds of audio input. The model can generate 60 seconds of audio in 1 second and will support features such as Copilot Audio Expressions and Copilot Podcasts.
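To put the reported speed in context, the short calculation below converts "60 seconds of audio in 1 second" into an estimated generation time for longer clips. It is a back-of-the-envelope sketch that assumes the rate stays constant for any clip length, which the report does not confirm. The constant and function names are illustrative.

```python
# Figure reported in the article: 60 seconds of audio per 1 second of generation.
AUDIO_SECONDS_PER_SECOND = 60.0


def generation_time_seconds(audio_seconds: float,
                            speedup: float = AUDIO_SECONDS_PER_SECOND) -> float:
    """Estimated wall-clock time to synthesise a clip of the given length.

    Assumption: throughput scales linearly with clip length.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# A 30-minute podcast episode at the reported rate.
podcast_estimate = generation_time_seconds(30 * 60)
```

Under that assumption, a 30-minute episode (1,800 seconds of audio) would take roughly 30 seconds to generate.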

The third model, MAI-Image-2, builds on earlier capabilities to deliver improved image quality with faster output. Developed in collaboration with photographers, designers, and visual storytellers, it aims to enhance natural lighting, textures, and clarity of in-image text. WPP is among the first enterprise partners to adopt this model.

All three models are accessible via Microsoft Foundry and the MAI Playground, and MAI-Image-2 is also rolling out across Copilot, Bing, and PowerPoint.


