Using NATS to Implement Service Mesh Functionality, Part 4: Load Balancing and Routing Control

Dale Bingham
8 min read · Dec 16, 2019


This is part 4 in using the power of NATS for a service mesh type of implementation in your application(s). In this installment we talk about load balancing across APIs and services when you have more than one service that can answer requests or respond to events. We also discuss routing control: canary deployments, A/B testing, and mirroring of requests.

For a quick recap on this series covering NATS and service mesh concepts: Part 1 covered the service mesh in general and service discovery, Part 2 covered security, and Part 3 covered metrics, tracing, alerts, and observability.

Load Balancing (least request, weighted, zone/latency aware)
Routing Control (traffic shifting and mirroring)

Load Balancing in a Service Mesh

If you are like me, when you think of load balancing at a very high level you think of two or more things (APIs, servers, services, websites, etc.) being accessed through one name or URL. The load balancer figures out which one to send you to based on rules, a “who is next” listing, or which services are responding and healthy. Load balancing gives you fault tolerance when one or more services break. It also lets you scale out to handle a larger workload (if designed correctly) by horizontally scaling your services or applications.

Note: I am going to use the word services here to refer to microservices, services, messaging clients, web applications, and anything else with respect to load balancing just to make it less wordy.

In a service mesh implementation, the mesh helps you perform load balancing easily when two or more services are set up as replicas/copies to help handle load and maintain a healthy system. You can use an ingress (incoming) controller to help you set up and lay out the specifics of your load balancing: paths, rules, weights, and services. You request a URL or named path, and the service mesh applies the rules, connects to one of your services behind the scenes, and sends the response back to the requestor.

Courtesy of https://www.digitalocean.com/community/tutorials/what-is-load-balancing

You can have load balancing set up as round robin (each service gets its turn) or weighted based on locality (geography aware, latency aware). You can also load balance based on least request, so the services that are not as busy respond. There are links below that provide further details on Istio, Linkerd, and the Envoy proxy, which is used in a lot of service mesh implementations.
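To make these three strategies concrete, here is a minimal stdlib-only sketch of how a balancer might pick a backend under each policy. The backend names and in-flight counts are hypothetical, purely for illustration; a real mesh (e.g. Envoy) implements these inside the proxy.

```python
import itertools
import random

# Hypothetical backend names for illustration only.
backends = ["svc-a", "svc-b", "svc-c"]

# Round robin: each backend gets its turn in order.
_rotation = itertools.cycle(backends)

def pick_round_robin():
    return next(_rotation)

# Least request: pick the backend with the fewest in-flight requests.
in_flight = {"svc-a": 4, "svc-b": 1, "svc-c": 2}

def pick_least_request():
    return min(in_flight, key=in_flight.get)

# Weighted: svc-a takes ~70% of traffic, svc-b ~20%, svc-c ~10%.
weights = {"svc-a": 70, "svc-b": 20, "svc-c": 10}

def pick_weighted():
    return random.choices(list(weights), weights=list(weights.values()))[0]
```

The same idea extends to locality awareness by folding a latency or zone penalty into the weights.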

Load Balancing in NATS

When it comes to load balancing in NATS, I will refer you to queue subscriptions. You can use these to balance delivery across a group of NATS clients (subscribers), which gives you fault tolerance and lets you scale processing in a similar way to service meshes. At a surface level, this is similar to the round robin approach in a service mesh. (Your opinions may vary.)

Queue groups are formed when clients subscribe to a subject with the same queue group name. There is no server configuration needed to set up a queue group, which is a little different to me. The clients join the group through their applications. When messages come in on the registered subject, one member of the group is chosen to act on each message.
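The behavior is easy to see with a toy in-memory stand-in for the server (this is a stdlib-only simulation of the idea, not actual NATS client code; the subject and worker names are made up): subscribers sharing a queue group name split the messages, while plain subscribers would each get every message.

```python
import itertools
from collections import defaultdict

class TinyBroker:
    """Toy in-memory stand-in for a NATS server (illustration only)."""

    def __init__(self):
        self.plain_subs = defaultdict(list)    # subject -> [callback]
        self.queue_groups = defaultdict(dict)  # subject -> {group: rotation}
        self._members = defaultdict(list)      # (subject, group) -> [callback]

    def subscribe(self, subject, callback, queue=None):
        if queue is None:
            self.plain_subs[subject].append(callback)
        else:
            members = self._members[(subject, queue)]
            members.append(callback)
            # Rebuild the rotation whenever a member joins the group.
            self.queue_groups[subject][queue] = itertools.cycle(list(members))

    def publish(self, subject, data):
        for cb in self.plain_subs[subject]:
            cb(data)                      # every plain subscriber gets a copy
        for rotation in self.queue_groups[subject].values():
            next(rotation)(data)          # exactly one group member gets it

received = defaultdict(list)
broker = TinyBroker()
for name in ("worker-1", "worker-2"):
    broker.subscribe("orders.new",
                     lambda d, n=name: received[n].append(d),
                     queue="workers")

for i in range(4):
    broker.publish("orders.new", i)
```

After the four publishes, the two workers have split the messages between them, with no server-side configuration of the group.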

Courtesy https://www.slideshare.net/nats_io/the-zen-of-high-performance-messaging-with-nats-76985268

In fact, you can spread these queue subscription groups across geographic boundaries (think East Coast US, West Coast US, Europe, Asia Pacific, and Australia) for load balancing and fault tolerance that can survive a regional failure. And with the latest NATS you can connect them via gateways to form clusters and superclusters. The subscribers that are “closer” to the publisher will receive the messages; however, if they go offline the others can pick up the work. This effectively gives you disaster recovery as well.
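For a sense of what wiring clusters together looks like, a NATS server gateway configuration has roughly this shape (the cluster names and URLs here are hypothetical; check the NATS server documentation for the exact options in your version):

```
# Sketch of a gateway block in a NATS server config file.
# Names and URLs are placeholders for illustration.
gateway {
  name: "us-east"
  port: 7222
  gateways [
    { name: "us-west", url: "nats://west.example.com:7222" }
    { name: "eu",      url: "nats://eu.example.com:7222" }
  ]
}
```

Each regional cluster declares the others as gateways, and together they form the supercluster that queue groups can span.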

NATS currently does not have weighted load balancing, where you specify that X% of traffic goes to Service1 and Y% goes to Service2. However, in the NATS Slack channel they have said that is coming. So the functionality gap between NATS and service mesh architectures will continue to narrow.

Routing Control in a Service Mesh

When referring to routing control in service mesh architectures in this article, I am talking specifically about traffic shifting and mirroring. Mirroring means “take the request coming in and also perform it over here, so I can see what it does with this code.” It is also referred to as shadowing. An important note is that mirroring happens out of band and does not interfere with the primary call.

And traffic shifting is when you slowly migrate traffic from one service to another. I usually see this when implementing upgrades of services: turning the new one on like a “faucet,” slowly but surely, to replace an older service.

Mirroring in your service mesh lets you implement new functions with less risk to production, since it happens separately. You can define a routing policy in YAML and apply it to your services to mirror traffic. Then act on the mirrored requests: send them to a separate data store, or use their metrics and logs separately to see the new code’s functionality and impact. This lets your application run as normal while you watch what a new piece of code or service will do to the application and its data, live.
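As one example of such a policy, an Istio-style VirtualService can mirror traffic to a new subset while the old one keeps answering (the service name and subsets here are hypothetical; consult the Istio traffic management reference for the current field names):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: hotel-api            # hypothetical service name
spec:
  hosts:
  - hotel-api
  http:
  - route:
    - destination:
        host: hotel-api
        subset: v1           # all live traffic is still answered by v1
    mirror:
      host: hotel-api
      subset: v2             # v2 receives a fire-and-forget copy
    mirrorPercentage:
      value: 100.0           # mirror every request (lower this to sample)
```

The mirrored responses from v2 are discarded, which is what keeps the mirror out of band.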

For traffic shifting in your service mesh, you once again dip into the YAML editing routine. You in essence use a series of updates to a weighted routing policy to slowly move a new service toward 100% of traffic and turn off the old one. By upping the routing percentage in YAML and applying the updates as you go, you can move off the old service and onto the new service while watching its impact.
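A sketch of one step in that sequence, again in an Istio-style policy with hypothetical names: at this stage the old subset still takes 90% of traffic, and each subsequent update shifts the weights further.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: hotel-api            # hypothetical service name
spec:
  hosts:
  - hotel-api
  http:
  - route:
    - destination:
        host: hotel-api
        subset: v1
      weight: 90             # old version still takes most traffic
    - destination:
        host: hotel-api
        subset: v2
      weight: 10             # bump this toward 100 with each update
```

The weights must sum to 100, and reapplying the policy with 75/25, 50/50, and so on is the “faucet” being opened.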

Routing Control in NATS

As far as mirroring in NATS, that is fairly easy to do, in theory at least, on the subscriber side. You have one or more clients subscribe to the same subjects or wildcards, and you perform the new work in those new clients. I have done this very thing to test new functionality by “listening in” on subjects and seeing what the new code or functions will do. NATS is inherently set up for that.

You just need to make sure any permissions or data stores you use are also mirrored, so you do not mess up your test or production suite of services while you mirror traffic. If you want to test a new publisher, you will have to be a little more ingenious and ensure you are not improperly altering your data and system.
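The pattern looks something like this stdlib-only sketch (a toy in-memory dispatcher stands in for the NATS connection; the subject comes from the figure below, and the data stores and handler logic are hypothetical): the shadow subscriber sees every publish on the subject, but writes only to its own store.

```python
from collections import defaultdict

# Toy in-memory dispatcher standing in for a NATS connection (illustration only).
subscriptions = defaultdict(list)   # subject -> [callback]

def subscribe(subject, callback):
    subscriptions[subject].append(callback)

def publish(subject, data):
    for cb in subscriptions[subject]:
        cb(data)

production_db = []   # hypothetical primary data store
shadow_db = []       # separate store, so the mirror cannot corrupt production

# The existing service keeps handling saves exactly as before.
subscribe("app.hotel.save", lambda hotel: production_db.append(hotel))

# The new code under test "listens in" on the same subject, out of band.
subscribe("app.hotel.save", lambda hotel: shadow_db.append(hotel.upper()))

publish("app.hotel.save", "grand-plaza")
```

Both handlers fire on the one publish, and the new code’s output lands only in the shadow store, which is exactly the data separation the paragraph above is warning you to set up.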

New “app.hotel.save” mirroring traffic to test a new function when saving hotel data

Traffic shifting as we defined it above is not yet in NATS. According to their Slack channel conversations, they have a kind of canary deployment scheduled for the near future which may be similar to this: a weighted routing rule that you can apply to make certain clients answer requests, covering weighted balancing as well as a canary rollout. We will have to wait and see for that functionality.

You can watch for their timetable of updates on the NATS.io website. For now, if you require that capability, choose a service mesh implementation or roll your own way of doing this until NATS has the functionality.

To Sum It All Up…

When it comes to load balancing and routing traffic, service meshes and NATS share some similarities. You can do round robin types of load balancing across services with both. You can mirror traffic with both. Eventually NATS will allow canary-type deployments and weighted traffic management, but for now those work only with a service mesh. Both technologies have their strong functions, and both can fit into your architecture based on your needs. Not just the latest marketecture!

One important difference is that NATS (at least to me) is more event driven or message driven (pub/sub, request/reply), whereas most service mesh implementations I have seen or studied use REST or gRPC types of calls. Your needs, and quite frankly your knowledge of event driven architectures and messaging, may heavily influence the architecture decisions for you and your team. I personally had to study up on publish/subscribe and request/reply messaging to even understand where I could use NATS as an option.

The other thing that stands out to me is that Istio, Linkerd, and a lot of these service mesh systems require YAML files for configuration and setup. If you have ever edited them, they can be quite tedious. By comparison, some NATS features just work as you use them! They happen the way you need them to, without any configuration or YAML definitions: queue subscribers load balancing automatically across clients, or mirroring traffic by listening in on certain published messages and acting on them separately.

Where to go from Here

If your application architecture requires some of the common functions covered in these last four articles on NATS and service meshes, you may want to experiment with both to see which one solves your needs with the least complexity. If your application asks for more than NATS can give you, then you may lean toward a service mesh. These ideas are fairly new, and this area will keep expanding. I hope you are like me in that you will keep challenging both architecture types to see how they can help you solve your application needs.

There are other ideas and functions in service mesh architectures, such as circuit breaking (only let a call fail X times, then stop asking), timeouts, and retries, that have some crossover with how NATS clients can connect. I invite you to take a journey into the concepts and see more ways you can do with NATS what you can do with some of the more popular service mesh implementations.

And if you have ideas or requests, jump into the NATS Slack channel and just ask. They are very good at discussing ideas, alternatives, and what to do and what not to do. And you may give them an idea for NATS that helps you in the process.

More Links to Dive Further

This is in no way the ultimate deep dive on Istio, Linkerd, Kuma, Consul, and other mesh architectures. It also is not a deep dive on NATS. It is meant to get you started and give you options. The links below have some great information and websites to further your reading and educate you and your team.

Written by Dale Bingham

CEO of Soteria Software. Developer on OpenRMF. Software Geek by trade. Father of three daughters. Husband. Love new tech where it fits. Follow at @soteriasoft
