AI infrastructure has a networking problem, and zero-trust overlays can help


Gartner projects worldwide AI spending will hit $2.52 trillion in 2026. The five largest US cloud providers have collectively committed up to $690 billion in capital expenditure this year alone. Everyone’s talking about compute. But beneath all of this investment sits a problem that doesn’t get nearly enough attention: the network.

AI workloads are fundamentally distributed. Training runs span GPU clusters across providers. Inference happens at the edge. Agents call APIs across cloud boundaries. The networking approaches most organizations rely on (traditional VPNs, security groups, manual firewall rules) were designed for a simpler era.

Where things are breaking down

Ninety-two percent of enterprises now operate multi-cloud environments. AI teams spread GPU workloads across AWS, GCP, Azure, and specialized providers like CoreWeave or Lambda Labs, often because no single provider has the capacity they need. But each cloud comes with its own VPN tooling, addressing scheme, and firewall model. Connecting them securely requires bespoke configurations that are fragile and difficult to audit.

The security picture is worse. Between January 2025 and February 2026, at least 20 documented incidents exposed tens of millions of user records across AI-powered applications. Trend Micro found over 3,000 AI components publicly exposed online, including unprotected vector databases open to anyone. IBM reported that 97% of organizations that experienced AI-related breaches lacked proper access controls.

And only 49% of organizations say their current networks can even support the bandwidth and latency that AI requires. The network isn’t just plumbing anymore. It’s a bottleneck.

Why traditional approaches fall short

Perimeter security assumes a clear boundary. AI workloads don’t have one. A training job might span three cloud providers. An inference endpoint might serve requests from edge devices, internal apps, and partner APIs simultaneously.

VPNs create bottlenecks. Hub-and-spoke architectures route traffic through central gateways, adding latency and creating single points of failure. For latency-sensitive inference, that overhead matters.

IP-based trust is fragile. When GPU nodes spin up and down across clouds based on demand, tying security to IP addresses creates constant maintenance. A firewall rule that was supposed to be temporary becomes permanent, and suddenly an internal model endpoint is reachable from the public internet.

A different model: zero-trust overlay networking

Instead of relying on the underlying network for security, you create an encrypted overlay layer where every node must present a cryptographic certificate to communicate. Identity is tied to the certificate, not to an IP address or network location. Nodes can move between clouds, sit behind NATs, or join from edge locations, and the security model stays consistent.

This is the problem we built Nebula to solve. We originally created it at Slack, where it now powers 80,000+ production hosts. It uses the Noise Protocol Framework for mutual authentication and AES-256-GCM encryption, and establishes peer-to-peer tunnels via UDP hole punching, so nodes communicate directly rather than routing through a central gateway.
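To make this concrete, here's a minimal sketch of what a Nebula node's configuration might look like. The file paths, the lighthouse hostname, and the 192.168.100.0/24 overlay range are illustrative placeholders, not prescribed values:

```yaml
pki:
  # The node's identity is this certificate, not its IP or location.
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/host.crt
  key: /etc/nebula/host.key

static_host_map:
  # Map the lighthouse's overlay IP to a publicly reachable address.
  "192.168.100.1": ["lighthouse.example.com:4242"]

lighthouse:
  am_lighthouse: false
  hosts:
    - "192.168.100.1"

listen:
  host: 0.0.0.0
  port: 4242

punchy:
  punch: true  # keep NAT mappings alive so peers can connect directly

firewall:
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
```

Every node carries a config along these lines; whether it runs in AWS, GCP, or a colo rack behind a NAT, the overlay addressing and firewall model stay the same.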

With Managed Nebula, we handle the operational side: automated certificate management, SSO integration, key rotation, and a management UI. Running your own certificate authority across hundreds or thousands of nodes is exactly the kind of burden that leads to security shortcuts.

Where this meets AI infrastructure

This isn’t about networking within a single cluster. It’s about the space between: connecting distributed AI infrastructure across locations, clouds, and trust boundaries.

Distributed training across providers - GPU nodes in different clouds can join a shared overlay with a single certificate each. No per-cloud VPN setup, no cross-provider firewall negotiations. Provider-agnostic by design.

Private model serving - Instead of configuring Azure Private Link, AWS PrivateLink, and GCP Private Service Connect separately, a model endpoint on the overlay is simply unreachable from outside it. Nothing to misconfigure because the endpoint doesn’t exist on the public internet.
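As a sketch of how this looks in practice, Nebula's firewall rules can match on certificate groups rather than IP ranges. The port number and group name below are illustrative, not fixed conventions:

```yaml
firewall:
  inbound:
    # Only hosts whose certificates carry the "inference-clients" group
    # may reach the model server's port; all other traffic is dropped,
    # and the endpoint is never exposed outside the overlay at all.
    - port: 8000
      proto: tcp
      group: inference-clients
```

Because the rule keys on certificate identity, a GPU node can be replaced or rescheduled to a new IP without touching the policy.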

Edge inference fleets - Edge devices running inference models get a lightweight agent, a certificate, and encrypted tunnels back to your infrastructure, even behind restrictive NATs. Certificate rotation keeps it manageable at scale.
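For edge fleets behind restrictive NATs, a hedged sketch of the relevant config: hole punching handles most NATs, and a relay covers the cases where direct connectivity fails. The relay's overlay IP here is a placeholder:

```yaml
punchy:
  punch: true    # send periodic punches to keep NAT mappings open
  respond: true  # punch back toward peers that can't reach us directly

relay:
  # Fall back to a designated relay host when hole punching fails
  # (e.g., behind symmetric NATs); traffic stays end-to-end encrypted.
  relays:
    - 192.168.100.1
  use_relays: true
```

The relay host itself would set `am_relay: true`; everything else in the fleet's configs can stay identical across devices.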
The AI boom isn’t just a compute story. As workloads spread across providers and push to the edge, the network will increasingly determine what’s actually possible. Organizations that bolt on VPNs and firewall rules reactively will keep making headlines for the wrong reasons. Those that build identity-based, encrypted connectivity into their AI infrastructure from the start won’t.

Nebula, but easier

Take the hassle out of managing your private network with Defined Networking, built by the creators of Nebula.

Get started