At Build 2022, Microsoft announced a new capability in the Azure Translator service that allows customers to translate PDF documents containing scanned image content directly. Until now, customers had to preprocess such documents through an OCR engine before translation; the new feature eliminates that step.
Document Translator allows customers to translate documents into more than 110 languages and dialects while preserving the layout and formatting of the original file.
Translating PDFs with scanned image content has been a highly requested feature among Document translation customers.
Azure Translator’s Document translation service can now:
- identify whether a PDF document contains scanned image content,
- route PDFs containing scanned image content to an OCR engine internally to extract the text,
- reconstruct the translated content as a regular text PDF while retaining the original layout and structure.
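The three steps above can be pictured with a small sketch. This is purely illustrative and is not Microsoft's implementation: the `Page` type, `has_scanned_content`, `translate_pdf`, and the `ocr` and `translate` callables are all hypothetical names introduced here, and layout reconstruction is not modeled at all.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for one PDF page."""
    text: str = ""      # embedded text layer; empty for a scanned page
    image: bytes = b""  # raster content, if any


def has_scanned_content(pages: List[Page]) -> bool:
    """Step 1: detect pages that carry an image but no text layer."""
    return any(p.image and not p.text.strip() for p in pages)


def translate_pdf(
    pages: List[Page],
    ocr: Callable[[bytes], str],
    translate: Callable[[str], str],
) -> List[Page]:
    """Steps 2 and 3: OCR only the pages that need it, then translate.

    The result is a list of text-only pages; a real service would also
    have to preserve layout, which this sketch does not attempt.
    """
    translated = []
    for page in pages:
        # Route to OCR only when there is no usable text layer.
        source_text = page.text if page.text.strip() else ocr(page.image)
        translated.append(Page(text=translate(source_text)))
    return translated
```

The point of the design is that OCR is applied per page and only where no text layer exists, so ordinary text PDFs pass through unchanged.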
Currently, Document translation supports PDF documents containing scanned image content from 68 source languages into 87 target languages.
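Because the OCR step happens inside the service, a scanned PDF should be submitted the same way as any other document. The sketch below builds, but does not send, a batch translation request. It assumes the v1.0 Document translation REST path (`/translator/text/batch/v1.0/batches`) and the `Ocp-Apim-Subscription-Key` header; the function name and the example values are hypothetical, and the current API version may differ.

```python
import json
import urllib.request


def build_translation_request(
    endpoint: str,
    key: str,
    source_url: str,
    target_url: str,
    target_language: str,
) -> urllib.request.Request:
    """Build (without sending) a batch document translation request.

    source_url and target_url are assumed to be storage container URLs
    that the service is allowed to read from and write to.
    """
    body = {
        "inputs": [
            {
                "source": {"sourceUrl": source_url},
                "targets": [
                    {"targetUrl": target_url, "language": target_language}
                ],
            }
        ]
    }
    return urllib.request.Request(
        # Assumed v1.0 route; check the current API reference before use.
        url=endpoint.rstrip("/") + "/translator/text/batch/v1.0/batches",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job. Note that nothing in the request marks the PDF as scanned, which is consistent with the service detecting that on its own.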