In recent years, artificial intelligence (AI) has rapidly transformed the way people interact with computers. The ability to create AI systems that understand natural language and perform complex tasks has opened up a world of possibilities for businesses and individuals alike. However, to build and train these systems, massive computing power is required. That’s where Microsoft Azure comes in.
In 2019, Microsoft partnered with OpenAI to build supercomputing resources on Azure that were specifically designed for training large language models. This infrastructure included thousands of NVIDIA AI-optimized GPUs linked together in a high-throughput, low-latency network based on NVIDIA Quantum InfiniBand communications for high-performance computing.
The scale of the cloud-computing infrastructure OpenAI needed to train its models was unprecedented – far larger clusters of networked GPUs than anyone in the industry had tried to build. But with Microsoft’s expertise in high-performance computing and the company’s commitment to pushing the boundaries of AI supercomputing, the partnership was able to overcome the technical challenges and build the infrastructure necessary to train and serve custom AI applications.
The result has been a revolution in AI capabilities. Thanks to Azure’s infrastructure, Microsoft and OpenAI have been able to build AI systems that generate images from plain-language descriptions, a chatbot that can write rap lyrics, draft emails, and plan entire menus from a handful of words, and much more.
The key to these breakthroughs has been the ability to build, operate, and maintain tens of thousands of co-located GPUs connected to each other on a high-throughput, low-latency InfiniBand network. The computational workload is partitioned across thousands of GPUs in a cluster: in data-parallel training, for example, each GPU computes gradients on its own slice of the data, and the InfiniBand network accelerates the phase where the GPUs exchange the results of the work they’ve done before taking the next training step.
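To make that communication phase concrete, here is a minimal sketch of data-parallel training using PyTorch’s DistributedDataParallel. The model, data, and hyperparameters are placeholders, not Azure’s or OpenAI’s actual training code; NCCL, the GPU communication backend, uses InfiniBand transports when that fabric is available.

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel.
# Illustrative only: the model and data are toy stand-ins for an LLM workload.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, and the rendezvous address.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A toy model standing in for a large language model.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for step in range(10):
        # Each rank trains on its own shard of the data (random here).
        inputs = torch.randn(32, 1024).cuda(local_rank)
        loss = ddp_model(inputs).pow(2).mean()
        # backward() triggers an all-reduce of gradients across all GPUs --
        # the communication phase that the InfiniBand fabric accelerates.
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with torchrun (for example, `torchrun --nproc_per_node=8 train.py` on an 8-GPU node), each process drives one GPU, and the gradient all-reduce inside backward() generates exactly the kind of bursty, latency-sensitive traffic a high-throughput interconnect is built for.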
Microsoft’s Azure infrastructure optimized for large language model training is now available through Azure’s AI supercomputing capabilities in the cloud. These resources provide the combination of GPUs, networking hardware, and virtualization software needed to power the next wave of AI innovation.
But Azure isn’t just about training large language models. Microsoft has also been deploying GPUs for inferencing throughout the company’s Azure data center footprint, which spans more than 60 regions around the world. This is the infrastructure customers use to power chatbots that schedule healthcare appointments and custom AI solutions that help keep airlines on schedule.
As trained AI models grow larger, inference will require GPUs networked together in the same way they are for model training: a model too large to fit in a single GPU’s memory must be split across several GPUs, which then exchange intermediate results on every request. Microsoft has been growing the ability to cluster GPUs with InfiniBand networking across the Azure data center footprint to speed up inferencing and make it more cost-effective.
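Here is a minimal sketch of that idea using tensor parallelism, in the style of Megatron-LM weight sharding; the layer sizes and random weights are illustrative, not any production system’s code. Each GPU holds only a slice of a layer’s weight matrix, so every forward pass requires an exchange of partial outputs across the interconnect, which is why inference clusters benefit from InfiniBand just as training does.

```python
# Minimal tensor-parallel inference sketch: one large weight matrix is split
# column-wise across GPUs, and an all-gather reassembles the full output.
# Illustrative only; assumes the hidden size divides evenly by the GPU count.
import os

import torch
import torch.distributed as dist


def sharded_linear(x, weight_shard):
    # Each rank multiplies by its own column-shard of the weight matrix...
    partial = x @ weight_shard
    # ...then all ranks exchange partial outputs over the interconnect.
    # With large models this exchange happens at every layer.
    gathered = [torch.empty_like(partial) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, partial)
    # Concatenating the per-shard outputs recovers x @ W for the full W.
    return torch.cat(gathered, dim=-1)


def main():
    dist.init_process_group(backend="nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    hidden, world = 4096, dist.get_world_size()
    # Each GPU holds only 1/world of the full (hidden x hidden) weight.
    weight_shard = torch.randn(hidden, hidden // world, device="cuda")
    x = torch.randn(1, hidden, device="cuda")

    with torch.no_grad():
        y = sharded_linear(x, weight_shard)
    if rank == 0:
        print(y.shape)  # (1, hidden): full output reassembled from shards

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```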
Microsoft continues to innovate on the design and optimization of purpose-built AI infrastructure, working with hardware suppliers and data center equipment manufacturers to build cloud computing infrastructure that is as performant, scalable, and cost-effective as possible.
Thanks to Azure’s infrastructure, AI capabilities that were once the stuff of science fiction are now a reality. As Scott Guthrie, executive vice president of the Cloud and AI group at Microsoft, notes, “Azure really is the place now to develop and run large transformational AI workloads.” With the power of Azure behind them, businesses and individuals can unlock the full potential of AI and usher in a new era of innovation.