Microsoft Azure Cognitive Service for Vision: Florence Foundation Model

Microsoft Research’s Project Florence initiative focuses on developing state-of-the-art computer vision technologies and a next-generation framework for visual recognition. Azure Cognitive Service for Vision lets developers add computer vision capabilities to their apps, such as analyzing images, reading text, and detecting faces, through prebuilt image tagging, text extraction with optical character recognition (OCR), and responsible facial recognition.

Microsoft today announced the public preview of its Florence foundation model, trained on billions of text-image pairs and integrated into Azure Cognitive Service for Vision as cost-effective, production-ready computer vision services. Microsoft will use the improved Vision services in Microsoft 365 apps such as Teams, PowerPoint, Outlook, Word, Designer, and OneDrive, as well as in Microsoft datacenters to enhance security and infrastructure reliability. Reddit will also use the improved Vision services to generate captions for hundreds of millions of images on its platform.

Some out-of-the-box features available include the following:

  • Dense captions: Automatically deliver rich captions, design suggestions, accessible alt-text, SEO optimization, and intelligent photo curation to support digital content.
  • Image retrieval: Improve search recommendations and advertisements with natural language queries that seamlessly measure the similarity between images and text.
  • Background removal: Transform the look and feel of images by segmenting people and objects from their original background and replacing that background with a preferred scene.
  • Model customization: Reduce the cost and time needed to deliver custom models that meet unique business demands with high precision, using just a handful of images.
  • Video summarization (Video TL;DR): Search and interact with video content in the same intuitive way you think and write. Locate relevant content without the need for additional metadata.
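
To make the image retrieval feature above more concrete, here is a minimal sketch of how similarity between a text query and a set of images can be measured once both are represented as vectors in a shared space. It is not the Azure Cognitive Service for Vision API: the function names, file names, and the three-dimensional vectors are all hypothetical, and cosine similarity is assumed as the measure purely for illustration. In practice the vectors would come from a model such as Florence.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_images(query_vector, image_vectors):
    """Sort image names by similarity to the query, most similar first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in image_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real ones would be produced by a vision model.
image_vectors = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "city.jpg": [0.1, 0.9, 0.2],
    "forest.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a natural language query such as "sunny beach".
query_vector = [0.8, 0.2, 0.1]

ranking = rank_images(query_vector, image_vectors)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy values, the image whose vector points in nearly the same direction as the query ranks first. A production system would typically replace the linear scan with an approximate nearest-neighbor index.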