Using NATS to Implement Service Mesh Functionality, Part 1: Service Discovery
Can you use the power and simplicity of NATS to create a service mesh without all the heavy overhead and complexity of current service meshes? I have been asking myself that question for the last 9+ months. When NATS 2.0 came out and I saw the accounts, security, and supercluster features, I started putting down notes on this very topic for myself. The service mesh setup seems like a LOT of overhead, with many YAML files to create before it works right, even with all the awesomeness it claims to deliver. So is there a different way to do this via NATS? Not better but different…
A Very Quick Introduction to Service Mesh
Before we jump into NATS, let us level-set on the phrase service mesh. Your definition may or may not be the same as mine. A service mesh is an infrastructure layer designed for managing interactions between services/microservices. It helps your microservices run smoothly, securely, and stably (3 S’s) while telling you what is going on with them. It handles things such as discovery, load balancing, failure recovery, metrics, monitoring, rate limiting, access control, and authentication.
So how can you use NATS as a basis for a service mesh that covers the items above? This article begins to go over what most people recognize (as far as I can tell) in a service mesh and then compares it to NATS functionality. We start with the specific items below:
- Service Discovery (eventual consistency, distributed caching)
- Load Balancing (least request, consistent hashing, zone/latency aware)
- Communication Resiliency (retries, timeouts, circuit-breaking, rate limiting)
- Security (end-to-end encryption, authorization policies, service-to-service ACLs)
- Observability (Layer 7 metrics, tracing, alerting, Honeycomb.io)
- Routing Control (traffic shifting and mirroring)
- API (programmable interface, Kubernetes Custom Resource Definitions (CRD))
- Automated Canary rollouts (control percentage of canary rollout or blue/green)
- Fault Injection (adding a timeout or error to test resiliency)
That is a big subject so I am chopping it up into pieces. This specific article covers Service Discovery.
Keep in mind most service mesh designs today (Istio, Linkerd) depend on running in Kubernetes (k8s) or have all their documentation pointing to a k8s setup. You may or may not want the complexity of Kubernetes and of running your applications within containers. Consul by HashiCorp and the new Kuma from Kong add into this mix as well. This raises the question: do you need ALL the FUNCTIONALITY that these tools provide? Or do you need enough to get your job done? There is k3s and other “smaller” Kubernetes flavors and modifications, sure. I am saying you may not even need that!
Side Note: Christian Posta has a great article on implementing a service mesh in steps over time that talks to this as well.
Here is the other thing that got me thinking: the idea of NATS as a service mesh is not constrained to running in Kubernetes or Docker or containers. It could run on bare-metal systems, virtual machines, Raspberry Pi boards, local Docker instances, Kubernetes, and the like. Whether on premises, at the edge, hybrid, or all in the cloud, this model can work where you are. And with the NATS 2.0 cluster and supercluster ideas this can be worldwide and geo-aware to work with you as you scale out. So with that in mind let’s jump into one of the first things touted by a service mesh: Service Discovery!
Service Discovery, or “where the heck are you?”
When you think of microservices, whether they are in containers or functions or virtual machines, you think of all the little services loosely coupled to create a greater good. Or at least I do. In that setup, all the microservices that talk to other microservices (or services, or APIs or monolithic systems or messaging systems) must know where the other services are running so they are addressable. In a Kubernetes system the services inside the k8s cluster could have any number of internal IP addresses as they move around, spin up and down, and expand out horizontally.
So how do you know where the service is? Well you either hard code it, use environment variables, put it all into one namespace, or you can use a service mesh. There are many, many articles on service mesh you can research on how this works via Medium or YouTube or CNCF webinars. So I won’t go into the functionality of how it works here. Summarized, the service mesh keeps track of where the other services are with a registry so you know where to call them.
NATS and Service Discovery
When using NATS as your messaging infrastructure, the need to know where a service lives in order to call it goes away (in my opinion) when you are doing publish/subscribe or request/reply, including with queue groups. You have to know where your NATS cluster is for sure! However, the important structures for communication here are the message subjects (or topics if you like that word) you need to publish or request on, as well as the idea of Accounts and Service/Stream setup in NATS 2.0.
Your subjects are what you publish on (I think of this as asynchronous) or request on (I think of this as synchronous), and a subscription or a reply is what answers. Your service(s) need to know what subjects to listen for or what subjects to push out; however, they do not necessarily need to know where the other services listening for a given subject or range of subjects are. They only need to know the cluster they should be connected to. You have to assume they are there listening of course.
Or you use NATS Streaming (new JetStream coming) to make sure they have all the latest messages. The important piece is you do not need the IP or name of the microservice or API. You need to know the subject hierarchy and breakdown to send and receive data.
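To make the "address by subject, not by IP" idea concrete, here is a minimal sketch in Go. It does not use the NATS client and it is not how NATS is implemented. Bus, QueueSubscribe, and Request are hypothetical in-process stand-ins, and the subject app.booking.quote is made up for illustration. The sketch also rotates through queue group members in order so the output is repeatable, whereas a real NATS server chooses the member itself.

```go
package main

import "fmt"

// Handler processes a request payload and returns a reply.
type Handler func(payload string) string

// Bus is a hypothetical in-process stand-in for a NATS connection.
// It routes by exact subject only and is NOT the NATS client API.
type Bus struct {
	groups map[string][]Handler // subject -> members of one queue group
	next   map[string]int       // subject -> index of the next member to use
}

// NewBus returns an empty bus.
func NewBus() *Bus {
	return &Bus{groups: map[string][]Handler{}, next: map[string]int{}}
}

// QueueSubscribe registers a handler as one member of the queue group on a subject.
func (b *Bus) QueueSubscribe(subject string, h Handler) {
	b.groups[subject] = append(b.groups[subject], h)
}

// Request delivers the payload to exactly one member of the group.
// Assumption: members are used in rotation here purely for repeatability.
func (b *Bus) Request(subject, payload string) (string, error) {
	members := b.groups[subject]
	if len(members) == 0 {
		return "", fmt.Errorf("no responders on %q", subject)
	}
	i := b.next[subject] % len(members)
	b.next[subject] = i + 1
	return members[i](payload), nil
}

func main() {
	bus := NewBus()

	// Two instances of the same (hypothetical) service join the group.
	// The requester below never learns how many there are or where they run.
	bus.QueueSubscribe("app.booking.quote", func(p string) string { return "instance-1 quoted " + p })
	bus.QueueSubscribe("app.booking.quote", func(p string) string { return "instance-2 quoted " + p })

	for _, room := range []string{"room 101", "room 102", "room 103"} {
		reply, err := bus.Request("app.booking.quote", room)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		fmt.Println(reply)
	}
}
```

The requester only ever names a subject. Adding a third instance is one more QueueSubscribe call and needs no change on the requesting side, which is the part of service discovery that the subject hierarchy takes over.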
Using a publish/subscribe or request/reply setup in your system means the services must be smart about messaging platforms. And that people have to learn messaging versus the “this API calls this API via REST” that a lot of my colleagues are still stuck on. This is a different design paradigm to me and people have to try it and learn from it to make it work great for their scenarios. Same goes for service meshes and their functionality.
In the example below for part of a hypothetical hotel application, you can see there are a few services on the right in blue that generate data. There is a logging subscription that picks up on any message from the app that ends in a .error, .info, or .warn for example. Any of the “save” or “delete” messages published from the 3 blue services get picked up by the compliance subscription. And the persistence subscription can pick up everything generated in the “app” space. The functions here and what each service does are not the most important piece. This is a glimpse into a simple example of using NATS messaging as a hub.
The fact is when using NATS, all of these services can communicate without having to know the other services’ IP addresses. Or their addressable names. They do not even have to be developed in order actually! I can add on the ‘compliance’ service after the others are done while still in development. And this type of pattern lets you grow your application organically as required by adding on more topics or subjects into the hierarchy.
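The hotel example can be sketched with NATS subject wildcards, where * matches exactly one token and > matches one or more trailing tokens. The Go snippet below is a minimal, standard-library-only illustration of those matching rules, not the server's implementation. The three-token layout app.<service>.<event> and the service names booking, payment, and housekeeping are assumptions made for this sketch; the article does not name the real subjects.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether a literal subject matches a subscription
// pattern under NATS wildcard rules: "*" matches exactly one token, and ">"
// matches one or more trailing tokens and must be the pattern's last token.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// Valid only as the final token, with at least one token left to match.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false // the subject ran out of tokens
		}
		if tok != "*" && tok != s[i] {
			return false // literal token mismatch
		}
	}
	return len(p) == len(s)
}

func main() {
	// Hypothetical subscriptions for the hotel example.
	subs := []struct{ name, pattern string }{
		{"logging", "app.*.error"},
		{"logging", "app.*.info"},
		{"logging", "app.*.warn"},
		{"compliance", "app.*.save"},
		{"compliance", "app.*.delete"},
		{"persistence", "app.>"},
	}

	// Hypothetical subjects published by the three data-generating services.
	published := []string{"app.booking.save", "app.payment.error", "app.housekeeping.info"}

	for _, subject := range published {
		for _, s := range subs {
			if subjectMatches(s.pattern, subject) {
				fmt.Printf("%s -> %s (%s)\n", subject, s.name, s.pattern)
			}
		}
	}
}
```

Adding the compliance service later amounts to adding its two patterns to the list. None of the publishers change, because they only ever name a subject.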
To me, NATS just makes the service discovery concept simpler. And I like simple. I am a software engineer who understands complex things. That does not mean I have to make things complex. K.I.S.S.
Security in your Service Mesh with NATS
This is a big subject and will be a Part 2 of this article honestly. There is a lot in NATS 2.0 that lets you set up security. And it is important enough to dedicate a separate article.
Just using NATS as a “service discovery” type of mechanism does not control security (see below), as that means anything (app, hardware, people, etc.) connected to your message bus that has access can see and use the data in the message.
You must know the message subject or topic hierarchy per account and the relationship across accounts for service discovery and security to work hand-in-hand. Security in NATS is a first-class citizen and is built into the system. And I love the new ways to use services and streams in NATS 2.0 as well. For the service mesh frameworks such as Istio or Linkerd you need to configure these. From all I have tested and seen, they are usually set up with a sidecar proxy that you go through. It gives you more flexibility by adding complexity.
To Sum It All Up
When I think of service discovery in NATS, I think of subject hierarchy. And that all my subscribers and publishers and queues must know where my cluster is. Once I have that set, I am good.
If you have not learned by now, I like NATS. I thank Alex Ellis for introducing NATS to me via his OpenFaaS OSS application. Some of my friends and colleagues have dubbed me a “NATS fan boy”. And I am OK with that. I have used it in three different projects now and it just made things simple, whether I am using C# or Golang for my client.
I believe this “service discovery” idea with NATS works whether you are using APIs, subscription services (a microservice does not just equate to an API), publishing services, or even messaging within a monolithic application. Using NATS to help solve this allows you to move on to the next thing in your architecture or software engineering efforts.
I hope I have at least given you something to think about.