In late 2016, Nate Brown and I were searching for ways to improve the security and reliability of our rapidly expanding production network at Slack. We evaluated every option available at the time by setting up and testing everything we could find. Along the way, we also looked into the underlying architecture of these options, hoping to find one that we could enhance to meet our performance and reliability requirements. None of them met our needs.
Ultimately, we created Nebula, and since mid-2017, it has been a core infrastructure service within Slack, encrypting and routing the majority of Slack’s network traffic.
I sometimes say “Nebula could only have been born inside of a company like Slack.” There are numerous reasons this is true, but scalability and reliability are the most important. Nebula had to work on a large, established network from day one. If we hadn’t been concerned about this, we would have gone about things entirely differently.
If we found ourselves trying to solve the same problems today, but Nebula didn’t exist, which network security solution would we choose? None of them. We would still build Nebula. Here’s why.
Nebula’s Architecture is Decentralized
Just like its cosmic namesake, a Nebula network has no explicit center. As a result, Nebula’s unique design doesn’t have single points of failure. It is extremely resilient, even in the case of failures within the underlying network and hardware.
When a Nebula host wants to connect to another Nebula host, it queries a set of Nebula nodes called “lighthouses” for information about the host it is trying to reach. Because these queries are lightweight, they are sent to every lighthouse simultaneously and then the answers are locally aggregated. If you have six lighthouses and five are down, the network will continue to work as normal.
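The fan-out-and-aggregate lookup described above can be sketched in a few lines of Python. This is an illustrative model only, not Nebula's actual code: the in-memory lighthouses, the `make_lighthouse` and `lookup` helpers, the host name, and the address are all hypothetical. It shows why five dead lighthouses out of six do not prevent a lookup from succeeding.

```python
from concurrent.futures import ThreadPoolExecutor


class LighthouseDown(Exception):
    """Raised by a simulated lighthouse that cannot answer."""


def make_lighthouse(table, up=True):
    """Return a query function backed by an in-memory table (hypothetical)."""
    def query(host):
        if not up:
            raise LighthouseDown()
        return set(table.get(host, ()))
    return query


def lookup(host, lighthouses):
    """Ask every lighthouse at once and merge whatever answers come back."""
    def safe_query(query):
        try:
            return query(host)
        except LighthouseDown:
            return set()  # a dead lighthouse simply contributes nothing

    workers = max(1, len(lighthouses))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(safe_query, lighthouses))
    return set().union(*answers)


# Six lighthouses, five of them down: the lookup still succeeds.
table = {"db1": ["203.0.113.7:4242"]}
lighthouses = [make_lighthouse(table, up=False) for _ in range(5)]
lighthouses.append(make_lighthouse(table, up=True))
addresses = lookup("db1", lighthouses)
```

Because the answers are merged locally, no lighthouse needs to know whether any other lighthouse replied.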
Lighthouse hosts are entirely independent and autonomous. In fact, lighthouses don’t communicate with each other at all. Early in the development of Nebula, we considered using “eventual consistency” or something like Raft to keep the lighthouse servers synchronized, but we quickly realized that by keeping them independent, we could ensure that a faulty or compromised lighthouse couldn’t adversely affect a Nebula network.
Most overlay (or mesh) networks depend on coordination servers for their uninterrupted operation. In Nebula’s design, these simply don’t exist. This is a key reason that Nebula has achieved superb uptime, even in large, complex networks.
Nebula’s trust model is based on certificates issued to hosts during provisioning or renewal, a form of PKI. When a Nebula host attempts to connect to another, it first sends its certificate and key information. If the host receiving these confirms they are valid and signed by a certificate authority it trusts, it replies with its own certificate and key, which the original host also validates before a connection is established.
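The order of checks in that exchange can be sketched as follows. This is a simplified model under loud assumptions, not Nebula's handshake. Real certificates carry asymmetric signatures that anyone holding the CA's public certificate can verify. To stay within the Python standard library, this sketch substitutes a keyed hash (HMAC) for the CA signature, which means the verifier here holds the CA secret, something that is not true in a real PKI. The `Cert`, `issue`, `is_trusted`, and `handshake` names are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    name: str         # host identity
    public_key: str   # placeholder for the host's public key
    issuer: str       # name of the CA that signed this certificate
    signature: bytes  # stand-in for the CA's signature


def _digest(ca_secret, name, public_key):
    # Stand-in for an asymmetric signature: a keyed hash over the cert fields.
    msg = f"{name}|{public_key}".encode()
    return hmac.new(ca_secret, msg, hashlib.sha256).digest()


def issue(ca_name, ca_secret, name, public_key):
    """Simulate a CA issuing a certificate during provisioning."""
    return Cert(name, public_key, ca_name, _digest(ca_secret, name, public_key))


def is_trusted(cert, trusted_cas):
    """True if the cert names a CA we trust and its signature checks out."""
    secret = trusted_cas.get(cert.issuer)
    if secret is None:
        return False
    expected = _digest(secret, cert.name, cert.public_key)
    return hmac.compare_digest(expected, cert.signature)


def handshake(initiator, responder, initiator_trust, responder_trust):
    """Responder validates first; only then does the initiator validate the reply."""
    if not is_trusted(initiator, responder_trust):
        return False  # responder never replies
    if not is_trusted(responder, initiator_trust):
        return False  # initiator rejects the reply
    return True


trust = {"example-ca": b"demo-secret"}
web = issue("example-ca", b"demo-secret", "web1", "pk-web1")
db = issue("example-ca", b"demo-secret", "db1", "pk-db1")
rogue = issue("other-ca", b"other-secret", "rogue", "pk-rogue")

accepted = handshake(web, db, trust, trust)
rejected = handshake(rogue, db, trust, trust)
```

The point of the sketch is the shape of the exchange: each side decides for itself, using only the presented certificate and its own list of trusted authorities.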
Aside from Nebula, the majority of overlay networking solutions depend on out-of-band channels to distribute keys and identity information about participants in the network. In these architectures, the actual encryption of network traffic is done by a traditional VPN, such as WireGuard, but important details like key management, identity, and access control are implemented separately.
This decoupling of identity and transport means that the underlying VPN is relegated to being a design detail. This is the primary reason we didn’t build on top of WireGuard or any other existing VPN. We wanted a protocol that directly addressed identity and transport instead of attaching a separate complex codebase.
Avoiding Overhead and Complexity
Using PKI has another very important advantage. As mentioned, Nebula hosts exchange their certificates and keys when they first communicate with each other. There is no need to distribute anything before that first handshake. When you provision a Nebula host on your network, there is no associated administrative overhead.
In overlay networks where the protocol doesn’t implement any form of identity, adding a new host has an associated cost that becomes significant at scale. This is something we generally refer to as the “n-1 key distribution problem”.
Imagine you have an overlay network with 10 hosts, and you want to add another. To add that new host, you must send an out-of-band signal to the existing 10 hosts to distribute key and identity information about the new host. That’s a relatively small number of hosts, so the overhead is manageable.
Now imagine you have an overlay network with 100,000 hosts. Adding another host now requires you to send messages to 100,000 hosts to inform them of the new member of the network. If you add just ten hosts, that’s a million out-of-band messages. It simply does not scale.
In a Nebula network, there is no need to share any information with peers, eliminating this issue entirely. Nebula works great on small networks, but is unmatched in scalability on large ones.
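The arithmetic behind that comparison is easy to check. The sketch below uses the same simplified model as the examples above, where each new host costs one out-of-band message per existing host; the function names are hypothetical and the model ignores the small extra cost of announcing new hosts to each other.

```python
def out_of_band_messages(existing_hosts, new_hosts):
    """Messages needed when every existing host must learn about each new host."""
    return existing_hosts * new_hosts


def nebula_messages(existing_hosts, new_hosts):
    """Certificates are exchanged at first contact, so nothing is pushed ahead of time."""
    return 0


small_network = out_of_band_messages(10, 1)        # one host joins 10
large_network = out_of_band_messages(100_000, 10)  # ten hosts join 100,000
with_nebula = nebula_messages(100_000, 10)
```

Under this model the cost grows with the size of the existing network for every host added, while the certificate-based approach stays flat.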
We did not want to create something new. “Not invented here” is something tech companies often struggle to avoid. Creating something new is, at least initially, fun. Starting a project like this from scratch may seem straightforward to excited engineers. Still, the reality of bug reports, enhancements, and the inevitable refactoring should not be ignored when deciding to create software.
Revisiting our primary question: would we still create Nebula today? Taking into account its design considerations, reliability, and how it meets our needs, we absolutely would. Even years later, nothing else offers the robustness, performance, and unmatched scalability we’ve designed into Nebula.