Google has introduced VaultGemma, a new AI model designed with privacy-preserving techniques to keep training data confidential.
VaultGemma is a small language model (SLM) with one billion parameters, described as the largest model yet trained with differential privacy (DP). It was developed by Google Research in collaboration with Google DeepMind, using a newly derived set of scaling laws for DP training.
The model weights are available for free on Hugging Face and Kaggle. In a blog post on September 12, Google said, “VaultGemma represents a significant step forward in the journey toward building AI that is both powerful and private by design. By developing and applying a new, robust understanding of the scaling laws for DP, we have successfully trained and released the largest open, DP-trained language model to date.”
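For readers who want to try the model, the released weights can be loaded like any other open checkpoint. The snippet below is a minimal sketch using the Hugging Face Transformers library; the model identifier shown is an assumption made for illustration and should be checked against the official Hugging Face or Kaggle listing.

```python
# Minimal sketch of loading the released weights with Hugging Face Transformers.
# "google/vaultgemma-1b" is an assumed identifier, used here for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/vaultgemma-1b"  # assumed; confirm on Hugging Face or Kaggle
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Differential privacy is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```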
Focus on Privacy and Data Protection
Data privacy remains a major challenge in AI development. Large language models such as ChatGPT and Gemini have raised concerns about memorising their training data, since personal or proprietary information can sometimes be retrieved from them through carefully designed prompts. For example, in an ongoing lawsuit against OpenAI, a media organisation alleged that ChatGPT reproduced its articles word for word.
Unlike approaches that add user-level protections only during fine-tuning, Google said, VaultGemma integrates differential privacy at the pre-training stage. Calibrated noise added during training prevents the model from memorising or reproducing individual training examples.
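In practice, this kind of training-time privacy is usually implemented with DP-SGD, which clips each example's gradient and adds calibrated Gaussian noise before every weight update. The sketch below illustrates that general mechanism in PyTorch; it is not Google's training code, and the clipping norm and noise multiplier are illustrative values.

```python
import torch
import torch.nn.functional as F

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1):
    """One illustrative DP-SGD step: clip each example's gradient,
    sum, add calibrated Gaussian noise, then update the weights.
    Hyperparameter values here are assumptions, not Google's settings."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Per-example gradients, clipped to a fixed L2 norm before summing.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    # Noise scaled to the clipping norm makes any single example's
    # contribution statistically hard to distinguish.
    batch_size = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=p.shape)
            p -= lr * (s + noise) / batch_size

# Toy usage on random data (illustrative only).
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))
dp_sgd_step(model, F.cross_entropy, x, y)
```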
Balancing Trade-offs
Applying differential privacy to large models is complex and comes with challenges such as reduced training stability, larger batch size requirements, and higher computational costs. To address this, Google developed new scaling laws that guide training configurations while balancing computing needs, privacy, and performance.
According to Google, “A key finding is that one should train a much smaller model with a much larger batch size than would be used without DP.”
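The intuition is straightforward: the Gaussian noise is added to the sum of clipped per-example gradients, so its effect on the averaged gradient shrinks as the batch grows. A rough back-of-the-envelope illustration, with assumed values:

```python
# Back-of-the-envelope: noise is added to the summed clipped gradients, so its
# standard deviation on the *averaged* gradient falls as 1 / batch_size.
clip_norm = 1.0          # assumed per-example clipping norm
noise_multiplier = 1.1   # assumed DP noise multiplier

for batch_size in (1_024, 16_384, 262_144):
    noise_std_on_mean = noise_multiplier * clip_norm / batch_size
    print(f"batch {batch_size:>7,}: noise std on averaged gradient ~ {noise_std_on_mean:.1e}")
```

Under this arithmetic, a larger batch dilutes the per-step noise, which is why the DP scaling laws push toward smaller models trained with much larger batches.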
Performance Benchmarks
VaultGemma achieved performance levels comparable to an older GPT-2 model of similar size across several standard academic benchmarks, including HellaSwag, BoolQ, PIQA, SocialIQA, TriviaQA, ARC-C, and ARC-E.
In privacy testing, Google prompted VaultGemma with partial text from training documents, and the model did not reproduce the corresponding continuations. Google noted, however, that the guarantee applies at the level of individual training sequences: if several sequences contain related information, the model can still learn and generate that fact.
The company emphasised that further research on differential privacy is required to narrow the gap between DP-trained models and non-DP-trained models in terms of utility.