How Akamai Uses NATS to Power Distributed Applications at the Edge
Brian Apley shares how Akamai uses NATS to bring distributed applications closer to users. This allows Akamai to expand beyond content and security while delivering reliability, performance, and scale for their customers.
“Synadia and NATS are the perfect platform to express the ability to bring distributed applications closer to users and expand above and beyond just content or security; and deliver actual applications that can run in an edge native distributed environment that can deliver performance, security, reliability, and scale, and do so in a deeper, more rich way than we've ever done before.”
- Brian Apley, Principal Cloud, IoT & Messaging Architect, Akamai
Go Deeper
Full Transcript
Brian Apley:
Hi, I am Brian Apley. I am a Principal Cloud Architect with Akamai Technologies.
If you haven't heard of Akamai, I'm going to give a little bit of background, just because it makes sense in the overall Synadia discussion.
Akamai is the world's largest content delivery and security network. We run a distributed network of about 450,000 servers spread throughout the globe. We've been in business for about 27 years, and I have been there for 21 of the 27.
To talk about Synadia and NATS and why it's important to me, I have to step back a little bit and talk about what Akamai does from a solution standpoint, and what really the purpose and mission of Akamai is.
We've migrated to general purpose compute. So we now have the ability to offer virtual machines, Kubernetes, you know, other compute primitives across our network. And really, the aim is the same. What we're driven by is this belief that to solve the problem of an inherently unreliable Internet, an Internet where there's no single throat to choke or help desk or whatnot, you need to deploy resources out close to where the end users, where the eyeballs, are.
And if you take that concept, then you can see, in the beginning customers had to solve problems of content delivery and getting web content closer. As content evolved to more rich interactive applications, then it became an issue of how do you accelerate that application? How do you ensure reliability when you can't cache it? And then, of course, as more and more critical business processes migrated to the Internet, the question became, how do you secure it? And how do you secure it in a way that is scalable, that doesn't impact performance? I.e., assuming a DDoS attack happens, you're not bringing all of that traffic back to a single point of failure or a centralized endpoint.
So what's exciting now for me, having worked my entire 21 years with customers and solving problems, is now with general purpose compute we're outside of the boxes of delivering content or securing an application, and we can do anything that you can do with a virtual machine or with Kubernetes. This brings us to Synadia. Because, if you're at RethinkConn right now, you've seen the power of Synadia and NATS in a distributed environment.
For us at Akamai and for me personally, Synadia and NATS are the perfect platform to express this ability for us to bring distributed applications closer to users and really expand above and beyond just content or security, and deliver actual applications that can run in an edge native distributed environment that can deliver performance, security, reliability, and scale, but do so in a deeper, more rich way than we've ever done before.
So for RethinkConn, I wanted to show my Synadia rig, my NATS test rig and show how I use it in order to illustrate to customers what the power of our platform is. And I also use it for a few specific cases where we have to do very specific things like qualify a customer, or to tell a customer which one of our 40 compute regions makes sense for them to run their workloads in given where we're at. So I'm going to show a demo setting up a distributed, stretched NATS cluster across all of our compute regions and then show a little bit of what we do to make use of that data and that demo.
Okay, so this is the Akamai compute console. And I just wanted to show the virtual machines that I have set up to run my distributed NATS cluster. Not much to show here, but a couple things that I do want to point out. Firstly, I'm not using very big machines at all. These are what we call 4 gig Linodes, which are shared instances, meaning they're oversubscribed. So I think there's 2 cores and 4 gigs, but it's not dedicated. So again, really small virtual machines. One of the things that I point out to my customers that I love about NATS is the binary itself is small, and it's deployable almost anywhere. And certainly in a cloud environment you don't need a lot of horsepower to run that. And certainly not for a demo, and it scales up well if you do need that kind of firepower as well.
So when I set this particular cluster up, I was actually demonstrating GraphQL. So we were using NATS as a GraphQL resolver for both queries as well as subscriptions.
Another thing that I love about NATS, that I always point out to customers, is that because it's cloud native, it is really easy to add in sidecars, containers, and other things like protocol adapters that you want to interact with NATS. NATS has a lot of features in it, for sure. But if you need it to act as a resolver for an Apollo Server or Router front end, it's pretty easy to come up with that, deploy it, and have it operate.
There's something like 40 regions in here, or 40 nodes that I have set up. I actually just used all of the available regions that Akamai had at the time. We've expanded those regions since then, and we really aim to make this truly distributed edge compute when it comes to delivering our service. So that's really all I wanted to show from just a setup standpoint.
I've got all of this scripted in Terraform and Ansible. I'm running it on a virtual machine, but I've got a Kubernetes script and deployment YAMLs to run that. I can recreate this entire cluster with just a couple of clicks, and have this up and running in a couple of minutes.
So let me show you the Grafana that I have set up for this. One of the things that I really love about NATS, that's built right in, that allows me to immediately see value from what I'm doing without having to deploy a single NATS client, is the telemetry that comes with it. So you know, when you interconnect a NATS cluster, one of the things that it immediately reports on is latency in milliseconds from that cluster node to the other members of the cluster. What I've done here: I don't think this is available in the default Prometheus exporter, although that default exporter is terrific, so I just wrote a lightweight exporter to get this data out. It is available in the NATS HTTP monitoring interface, but I don't think it was exposed in Prometheus, so I had to do that on my own.
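A minimal sketch of that kind of lightweight exporter, in Python. It assumes the `routes`, `rtt`, and `remote_name` field names that recent NATS server versions return from the `/routez` monitoring endpoint; check your server version's actual `/routez` output before relying on them.

```python
import json


def routez_to_prometheus(routez_json: str, origin: str) -> list[str]:
    """Convert a NATS /routez monitoring payload into Prometheus
    exposition-format gauge lines, one per peer route.

    Assumes each route entry carries an `rtt` string like "12.3ms"
    and a `remote_name` identifying the peer; field names may vary
    by NATS server version.
    """
    doc = json.loads(routez_json)
    lines = []
    for route in doc.get("routes", []):
        rtt = route.get("rtt", "")
        if not rtt.endswith("ms"):
            continue  # skip routes that have not reported an RTT yet
        ms = float(rtt[:-2])
        peer = route.get("remote_name", "unknown")
        lines.append(
            f'nats_route_rtt_ms{{origin="{origin}",peer="{peer}"}} {ms}'
        )
    return lines


# Stand-in for an HTTP GET of http://<node>:8222/routez
sample = json.dumps({"routes": [
    {"remote_name": "us-west", "rtt": "7.1ms"},
    {"remote_name": "mumbai", "rtt": "285ms"},
]})
print(routez_to_prometheus(sample, origin="us-lax"))
```

In a real deployment you would poll each node's monitoring port on a timer and serve the resulting lines on a `/metrics` endpoint for Prometheus to scrape.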
But, anyways, what this is showing is, as I select Akamai regions, it shows the latency from that region to all of our other nodes. And this is important for us for a couple of reasons. The first is, customers want to understand, if they're running NATS globally, or even within a continent or region, what consistency time to expect. So in this case, if they're running a global stretch cluster as I am, I can show them, 'hey, in the very worst case scenario (I believe I've selected Los Angeles here, US West), it's going to take about 285 ms to reach Mumbai.' So in the very worst case you can achieve consistency on your data, say if you're writing into Los Angeles, in a little under a third of a second.
And you can see most of the other regions outside of India are below 200 ms, so we can show them, for instance, 'hey, if all you care about is North America, we can actually deliver data and make it consistent across North America in under 70 ms, and Europe from the US in about 150 ms or so.'
So that's exciting right off the bat. It also really helps Akamai illustrate our message: that we're the connected cloud, that our cloud computing resources are using our CDN backbone, our CDN connectivity. And that's a big differentiator for us: you have compute that's very well connected to a content delivery network that is the biggest in the world, as well as very, very proximate to most, if not nearly all, of the world's eyeballs on the Internet. And those are critically important to Akamai.
Another case that I've used this for: we don't really have the concept of availability zones with our compute product. Our customers just tend to pick region pairs that are close to each other. So this tool is awesome for showing, as an example, if a customer wants to host in LA, what region is the most appropriate for them, and what latency to expect if they're running cross-region replication. So in this case, from LA to San Francisco is 7 ms, and that would probably be a good region pair to select.
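That region-pair choice can be sketched as a nearest-neighbor lookup over a latency table. Only the 7 ms LA-to-San Francisco figure below comes from the demo; the other region slugs and values are made up for illustration.

```python
# Illustrative RTTs from Los Angeles to candidate peer regions, in ms.
# Only the San Francisco figure is from the demo; the rest are invented.
rtt_from_la_ms = {
    "us-sfo": 7.0,
    "us-sea": 25.0,
    "us-ord": 52.0,
    "us-iad": 62.0,
}


def best_pair(origin_rtts: dict[str, float]) -> tuple[str, float]:
    """Pick the lowest-RTT peer region as the replication pair."""
    return min(origin_rtts.items(), key=lambda kv: kv[1])


print(best_pair(rtt_from_la_ms))
```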
Another thing I want to point out is, again, native to NATS, you get that reporting all the way down to the individual client. And again, two of the tenets at Akamai are performance and security. So being able to show performance down to the individual client level (and hopefully it's good performance as well) is tremendous. So what I might do, if a customer has a bunch of test clients, is give them the endpoint to connect to, and then I can show them, 'this will show up in Grafana or Prometheus, and you can actually meter what your clients' latency is. You can be alerted if it exceeds certain levels, through normal Grafana alerting, or whatever your observability stack is.' That's awesome! And that really dovetails well into what we at Akamai try to do in terms of delivering a distributed cloud product.
Just a couple other things to point out here. This shows the latency over time. And again, one of those tenets that I talked about, alongside security, performance, and scale, is reliability; you want to be able to show latency that has very, very little jitter, very little, if any, variance from time period to time period. And this clearly shows that; in fact, it shows that there's hardly any jitter at all on that region-to-region latency.
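One simple way to put a number on that jitter is the standard deviation of the RTT samples over time. This is just one common definition (RFC 3550, for example, uses a smoothed mean-deviation estimator instead), and the samples below are illustrative.

```python
from statistics import pstdev


def jitter_ms(rtt_samples: list[float]) -> float:
    """Jitter as the population standard deviation of RTT samples.

    One common definition; RFC 3550 defines a smoothed mean-deviation
    estimator instead.
    """
    return pstdev(rtt_samples)


# Illustrative region-to-region RTT samples over time, in ms.
samples = [7.0, 7.1, 6.9, 7.0, 7.2]
print(f"jitter: {jitter_ms(samples):.3f} ms")
```

A flat line on the Grafana panel corresponds to a jitter value near zero here.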
And if they're not bored to tears by then, then I have the periodic table of the latencies. So here's a chart that just shows, in a graph, literally every latency from endpoint to endpoint throughout the world. You can see our worst case, going from São Paulo, Brazil to Mumbai, is 350 ms. Good to show that as well.
The last thing, and I don't have any data in there, is another case we get: a customer saying, 'hey, my clients are here, or my endpoints are here, here's the IP addresses that are going to talk to my cloud instance, or talk to my workload; what is the best region or regions to pick if I want a P95 of, like, 50 ms?' In other words, I want 95% of those users or clients to be able to reach my compute resource in 50 ms or less. We can actually send out a traceroute job through NATS to this cluster. So we'll distribute the job request out. Each node in the cluster will run an MTR, basically a traceroute, to those IP addresses. And then, using NATS, they'll return what the latency was, step by step, to that address. So we can take all of that data and say, if you want your P95 to be 50 ms, then you have to deploy into at least these regions. And that's super powerful as well.
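The region-selection step over those collected traceroute results can be sketched as a greedy cover: keep adding the region that brings the most still-uncovered clients under the latency target until the percentile goal is met. Both the data and the greedy heuristic here are illustrative (an exact minimum answer is a set-cover problem, which is NP-hard).

```python
def pick_regions(
    client_rtts: dict[str, dict[str, float]],
    target_ms: float = 50.0,
    percentile: float = 0.95,
) -> list[str]:
    """Greedily choose regions until at least `percentile` of clients
    can reach *some* chosen region within `target_ms`.

    client_rtts maps client id -> {region: measured RTT in ms},
    i.e. the per-node traceroute results gathered over NATS.
    """
    clients = set(client_rtts)
    need = percentile * len(clients)
    regions = {r for rtts in client_rtts.values() for r in rtts}
    covered: set[str] = set()
    chosen: list[str] = []
    while len(covered) < need and len(chosen) < len(regions):
        # Pick the region that covers the most still-uncovered clients.
        best = max(
            regions - set(chosen),
            key=lambda r: sum(
                1
                for c in clients - covered
                if client_rtts[c].get(r, float("inf")) <= target_ms
            ),
        )
        chosen.append(best)
        covered |= {
            c for c in clients
            if client_rtts[c].get(best, float("inf")) <= target_ms
        }
    return chosen


# Invented measurements: three US clients, one European client.
sample = {
    "c1": {"us-lax": 20.0, "eu-fra": 140.0},
    "c2": {"us-lax": 35.0, "eu-fra": 150.0},
    "c3": {"us-lax": 45.0, "eu-fra": 160.0},
    "c4": {"us-lax": 130.0, "eu-fra": 18.0},
}
print(pick_regions(sample))
```

With a 95% target over four clients, every client must be covered, so both regions are chosen; relaxing the percentile drops the Europe region.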
Again, it both demonstrates NATS and the promise and the power of that, as well as the value of Akamai computing and having distributed well-connected edge computing available to our customers.
I want to thank everybody for their time and for watching the demo. Today I'm really excited to share that Synadia is Akamai's newest qualified compute partner, certified to run on Akamai compute. And we're looking forward to working with the Synadia team on solving those customer problems, and bringing more relevance and more problem solving to our cloud platform. So have a great RethinkConn everybody, and we'll see you soon. Bye.