Running Ray at Scale on Azure Kubernetes Service: New Guidance from Microsoft and Anyscale

As artificial intelligence workloads continue to grow in scale and complexity, cloud providers are refining how distributed computing frameworks run in production. Microsoft’s Azure Kubernetes Service (AKS) team has released new technical guidance on deploying Anyscale’s managed Ray platform at scale, focusing on operational challenges such as GPU availability, machine learning data storage, and secure authentication.

The guidance builds on earlier work around open-source KubeRay deployments on AKS. It introduces Anyscale’s enhanced runtime—formerly known as RayTurbo—which brings advanced autoscaling, improved monitoring and fault-tolerant model training to Ray-based environments running in Azure.

What Ray and Anyscale Bring to AI Infrastructure

Ray is a Python-native distributed computing framework widely used for scaling artificial intelligence and machine learning workloads. Developers can run Ray locally on a laptop and scale the same code to clusters with thousands of nodes.
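The local-to-cluster pattern can be sketched as follows, assuming the open-source `ray` package is installed. The names `square` and `run_with_ray` are illustrative only and do not come from Microsoft's guidance.

```python
def square(x: int) -> int:
    """Plain Python function; Ray can run it as a distributed task."""
    return x * x


def run_with_ray(n: int) -> list:
    """Fan `square` out as Ray tasks and collect the results in order."""
    # Imported here so this module still loads where Ray is not installed.
    import ray

    # With no address argument, Ray starts a local instance if it cannot
    # find a running one; ray.init(address="auto") attaches to a cluster.
    ray.init(ignore_reinit_error=True)

    # Equivalent to decorating `square` with @ray.remote.
    remote_square = ray.remote(square)
    futures = [remote_square.remote(i) for i in range(n)]
    # Blocks until every task has finished.
    return ray.get(futures)
```

Calling `run_with_ray(8)` on a laptop or from a driver attached to a cluster uses the same task code; only the `ray.init` target differs.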

Anyscale, the company founded by Ray’s original creators, provides a managed platform that adds production-grade features on top of the open-source framework. These include intelligent autoscaling, integrated observability tools and improved reliability for large training jobs.

Microsoft’s latest guidance reflects a deeper collaboration between Azure and Anyscale to streamline Ray deployments on AKS. The goal is to make it easier for organizations running large AI models—such as generative AI systems or recommendation engines—to manage distributed workloads reliably in the cloud.

Addressing GPU Capacity Limits Across Regions

One of the biggest operational hurdles in large-scale machine learning is GPU availability. High-performance accelerators, particularly NVIDIA GPUs, are often subject to regional quotas and supply constraints within cloud platforms.

When demand spikes, organizations can face delays provisioning clusters or scheduling training jobs.

To mitigate this, Microsoft recommends a multi-cluster, multi-region architecture. By distributing Ray clusters across multiple AKS deployments in different Azure regions, teams can:

Aggregate GPU Capacity

GPU quotas are granted per region, so spreading workloads across regions lets an organization draw on the combined quota instead of being capped by a single region's limit.

Improve Resilience

Workloads can automatically shift to another region if one experiences outages or capacity shortages.

Extend Compute Beyond Azure

Using Azure Arc with AKS, clusters can also integrate on-premises infrastructure or resources from other cloud providers, expanding the available compute pool.

Within the Anyscale console, these clusters appear in a unified view. Anyscale Workspaces can then schedule workloads based on available capacity, either automatically or through manual selection.
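The capacity-based placement described above can be illustrated with a small sketch. This is not Anyscale's scheduler. `RegionCluster`, `total_free_gpus` and `pick_cluster` are hypothetical names, and the quota figures are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegionCluster:
    """Hypothetical record of one AKS cluster's GPU headroom."""
    region: str
    gpu_quota: int      # GPUs the subscription may use in this region
    gpus_in_use: int    # GPUs already allocated
    healthy: bool = True

    @property
    def free_gpus(self) -> int:
        return max(self.gpu_quota - self.gpus_in_use, 0)


def total_free_gpus(clusters: List[RegionCluster]) -> int:
    """Aggregate free capacity across all healthy regions."""
    return sum(c.free_gpus for c in clusters if c.healthy)


def pick_cluster(clusters: List[RegionCluster],
                 gpus_needed: int) -> Optional[RegionCluster]:
    """Return the healthy cluster with the most free GPUs that fits the job."""
    candidates = [c for c in clusters
                  if c.healthy and c.free_gpus >= gpus_needed]
    return max(candidates, key=lambda c: c.free_gpus, default=None)


# Made-up example data: one nearly full region, one with headroom,
# and one that is currently unavailable.
clusters = [
    RegionCluster("eastus", gpu_quota=16, gpus_in_use=14),
    RegionCluster("westus3", gpu_quota=16, gpus_in_use=4),
    RegionCluster("swedencentral", gpu_quota=8, gpus_in_use=0, healthy=False),
]
```

In this sketch an 8-GPU job lands in `westus3`, because `eastus` has only 2 GPUs free and the unhealthy region is skipped. A job that fits in no single cluster returns `None`, even if the aggregate free capacity would cover it.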

Adding new regions is largely configuration-driven. Teams define resources in a cloud_resource.yaml manifest and apply it through the Anyscale CLI, enabling straightforward multi-region expansion.

Solving Data Movement in ML Pipelines

Another persistent challenge in machine learning operations involves transferring data between pipeline stages. Training datasets, model checkpoints and intermediate artifacts often need to move from pre-training to fine-tuning and eventually to inference environments.

Microsoft’s guidance recommends using Azure BlobFuse2, which mounts Azure Blob Storage directly into Ray worker pods as a POSIX-style filesystem.

From the Ray application’s perspective, the mounted storage behaves like a local directory. Ray tasks and actors read and write files using standard file I/O, while BlobFuse2 synchronizes those files with Azure Blob Storage.
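Because the mount looks like a directory, checkpoint code needs nothing Azure-specific. The sketch below assumes a hypothetical mount path and `CHECKPOINT_DIR` environment variable. The real path is whatever the pod spec mounts the BlobFuse2-backed volume at.

```python
import json
import os
from pathlib import Path
from typing import Optional

# Hypothetical mount point for the BlobFuse2-backed volume.
MOUNT_DIR = Path(os.environ.get("CHECKPOINT_DIR", "/mnt/blob/checkpoints"))


def save_checkpoint(step: int, state: dict, root: Path = MOUNT_DIR) -> Path:
    """Write a checkpoint with ordinary file I/O and return its path."""
    root.mkdir(parents=True, exist_ok=True)
    # Zero-padded step numbers keep lexicographic and numeric order aligned.
    path = root / f"step-{step:06d}.json"
    path.write_text(json.dumps(state))
    return path


def load_latest_checkpoint(root: Path = MOUNT_DIR) -> Optional[dict]:
    """Return the highest-numbered checkpoint, or None if there is none."""
    files = sorted(root.glob("step-*.json"))
    return json.loads(files[-1].read_text()) if files else None
```

The same functions work against a local directory during development, which is the portability the mounted-filesystem approach is meant to provide.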

This approach offers several advantages:

Shared Storage Across Nodes

Multiple Ray workers running on different nodes can access the same dataset simultaneously.

Local Caching for Performance

BlobFuse2 caches frequently accessed data locally, so repeated reads avoid a round trip to Blob Storage and GPUs spend less time idle during large training runs.

Compute-Storage Decoupling

Because data lives in object storage rather than on cluster nodes, Ray clusters can scale up or down without risking data loss.

Setting up the environment involves enabling the Blob CSI driver during cluster creation, defining a StorageClass that uses workload identity for authentication and creating a PersistentVolumeClaim with ReadWriteMany access.
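The last of those steps, the PersistentVolumeClaim, can be sketched as a manifest built in Python. The field names follow the standard Kubernetes PVC schema. The claim name, StorageClass name and size are placeholders, and the StorageClass itself is assumed to exist already.

```python
import json


def blob_pvc_manifest(name: str, storage_class: str,
                      size: str = "100Gi") -> dict:
    """Build a PVC that many Ray worker pods can mount at once."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on different nodes share the volume.
            "accessModes": ["ReadWriteMany"],
            # Placeholder: the StorageClass backed by the Blob CSI driver.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # "ray-shared-data" and "blobfuse2-wi" are hypothetical names.
    manifest = blob_pvc_manifest("ray-shared-data", "blobfuse2-wi")
    print(json.dumps(manifest, indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the printed output can be piped to `kubectl apply -f -`.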

The result is a portable architecture where Ray applications remain cloud-agnostic while still benefiting from Azure’s scalable storage infrastructure.

Improving Authentication and Credential Management

Authentication reliability has also been a concern in earlier integrations between Anyscale and Azure. Previous approaches relied on CLI tokens or API keys that expired every 30 days, requiring manual rotation and introducing the risk of service interruptions.

The new architecture replaces those mechanisms with Microsoft Entra service principals and AKS workload identity.

In this model, the Anyscale Kubernetes Operator pod uses a user-assigned managed identity. That identity automatically requests short-lived access tokens for the Anyscale service principal through Microsoft Entra ID.

Azure handles token renewal behind the scenes, eliminating the need for long-lived credentials or manual key rotation.
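The general pattern behind automatic renewal is to cache a short-lived token and fetch a new one shortly before it expires. The sketch below illustrates that pattern only. It is not how Azure or the Anyscale operator implements it, and `Token`, `RefreshingToken` and the `fetch` callback are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    value: str
    expires_at: float  # Unix timestamp, in seconds


class RefreshingToken:
    """Cache a short-lived token and renew it before it expires."""

    def __init__(self, fetch: Callable[[], Token],
                 skew_seconds: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch        # hypothetical call to the identity provider
        self._skew = skew_seconds  # renew this long before actual expiry
        self._clock = clock        # injectable so the logic can be tested
        self._token: Optional[Token] = None

    def get(self) -> str:
        """Return a valid token value, fetching a new one when needed."""
        due = (self._token is None
               or self._clock() >= self._token.expires_at - self._skew)
        if due:
            self._token = self._fetch()
        return self._token.value
```

Callers only ever ask for the current token, so no credential outlives its short validity window and nothing has to be rotated by hand.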

This approach becomes particularly valuable in multi-cluster deployments, where managing authentication across several clusters can quickly become an operational burden. The workload identity framework also enables granular role-based access control (RBAC) and produces detailed audit records through Azure Activity Logs.

Cloud Providers Compete Around Ray Ecosystems

The Anyscale-on-AKS integration is currently in private preview. Organizations interested in testing the setup can contact their Microsoft account team or submit a request through the AKS GitHub repository.

Sample configurations—including large language model fine-tuning workflows using DeepSpeed and LLaMA-Factory—are available in the Azure Samples repository, along with examples for deploying LLM inference endpoints.

Microsoft is not alone in investing in managed Ray infrastructure. Amazon Web Services announced its own partnership with Anyscale during Ray Summit 2024, connecting Amazon EKS clusters to the RayTurbo runtime. AWS emphasizes hardware flexibility by combining NVIDIA GPUs with proprietary accelerators such as Trainium and Inferentia, while SageMaker HyperPod can host long-running distributed training jobs.

Google Cloud has also contributed heavily to Ray’s open-source development. Engineers working on Google Kubernetes Engine collaborated with Anyscale to introduce label-based scheduling in Ray v2.49 and developed tools to reduce resource fragmentation in multi-chip TPU environments.

Across the industry, all three major hyperscalers have adopted the same managed Ray operator while layering their own infrastructure services on top.

The Growing Role of Kubernetes in AI Workloads

The convergence around Ray and Kubernetes reflects a broader shift in how large-scale AI infrastructure is built. Rather than competing over the distributed runtime itself, cloud providers are increasingly differentiating through surrounding services—compute availability, storage integration, networking and operational tooling.

For organizations deploying complex machine learning pipelines, the combination of Kubernetes orchestration and Ray’s distributed computing model is emerging as a common foundation for running AI workloads at scale.

Nolan Fraser