Using NATS to Implement Service Mesh Functionality, Part 3: Metrics, Tracing, Alerts, and Observability

Dale Bingham
9 min read · Dec 5, 2019


NATS 2.0 found at https://www.nats.io

This is part 3 in my quest to discuss “what can you do with NATS that is similar to service mesh ideas”. There are a lot of articles and talks on service mesh benefits, usage, and design. This write-up shows how to get similar features when using the NATS messaging system as your communication backbone. As with anything in software engineering, there are pros and cons to deciding on languages, infrastructure, and design models. It is always great to have alternatives and new ways of looking at a problem.

To recap, in my first post on this subject I listed the 7 main functions evangelized with respect to a service mesh setup, and then dove specifically into Service Discovery. The part 2 post on NATS and Service Mesh talked about security as it relates to NATS 2.0+ and service mesh ideas such as AuthN, AuthZ, and mutual TLS.

This third article goes over the observability piece, or at least part of observability if you follow and agree with Charity Majors (a.k.a. mipsytipsy) at honeycomb.io like I do. We will see how metrics, tracing, and alerting are done with NATS in comparison to the similar functions you get with the major service meshes such as Istio, Linkerd, and Consul. This is not a deep dive on metrics or tracing, or a “this is better than that!” argument. It is to show alternatives so you have options when designing your applications.

Observability (Layer 7 metrics, tracing, alerting, Honeycomb.io)

Metrics in a Service Mesh

When you think of metrics in your service mesh, you probably think of success rates, requests per second, latencies, error rates, and response times. Metrics are inherently tracked within service meshes for things such as API-to-API communication as well as for the components of the service mesh itself. For instance, Istio can show the Envoy resources used in the mesh (proxy-level metrics) as well as service-oriented metrics for the actual services running in your mesh.

Service mesh software such as Consul, Istio, and Linkerd can also export and store your metrics in Prometheus for querying or for display in tools such as Grafana. This can help you see the usage of your mesh, response codes, destinations, and security information for the APIs within your service mesh.

Service Mesh Dashboard in Grafana via https://istio.io/docs/tasks/observability/metrics/using-istio-dashboard/

Metrics in NATS

In comparison, NATS has metrics as well, as can be seen in the NATS Prometheus Exporter information. I used this in my own personal open source software (OSS) project, and I have Prometheus scrape the configured port on the exporter per the documentation linked just above. The Prometheus Exporter for NATS uses the :8222 monitoring port (run NATS with the ‘-m 8222’ option) to pull information from NATS and put it into a format that Prometheus can use. I also used the Grafana NATS Dashboard to show the metrics in graphical form (pictured below).

NATS Metrics via https://grafana.com/grafana/dashboards/2279

Keep in mind that these metrics are for the overall NATS system, not the individual message subjects you can dive into with service mesh metrics (e.g. requests per API call, responses per API call). To see more detailed metrics, you can go to the NATS monitoring endpoint (i.e. http://{ip-address}:8222/connz) using the NATS IP and see the per-connection information I am referring to. (Hint: name each connection so it is easy to differentiate them, as in the sketch below.)
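As a tiny sketch of that hint, assuming the Go client (nats.go): the connection name shows up in the /connz output, and the “orders-service” and “orders.created” names here are just made up for illustration.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Name the connection so it is easy to pick out in the /connz output.
	// "orders-service" and "orders.created" are made-up example names.
	nc, err := nats.Connect(nats.DefaultURL, nats.Name("orders-service"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	if err := nc.Publish("orders.created", []byte("hello")); err != nil {
		log.Fatal(err)
	}
	nc.Flush()
}
```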

To get more detailed NATS messaging metrics you would have to write your own exporter, or do something similar to poll the /connz NATS metrics endpoint for information per client connection. The documentation on the metrics endpoint for NATS is here. Out of the box, the NATS infrastructure and its metrics can be tracked and shown at a high level, as we can see below. They do not show per-client or per-subject detail by default, since dynamic clients and subjects could make the graphs and metrics unmanageable.
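If you did want to roll your own poller, a minimal sketch in Go might look like the program below. It assumes a NATS server started with ‘-m 8222’ on localhost, and the struct only mirrors a handful of fields; check your server’s /connz output for the full schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// connz mirrors a small subset of the JSON returned by the /connz
// monitoring endpoint; the real response has many more fields.
type connz struct {
	NumConnections int `json:"num_connections"`
	Connections    []struct {
		Name    string `json:"name"`
		Subs    int    `json:"subscriptions"`
		InMsgs  int64  `json:"in_msgs"`
		OutMsgs int64  `json:"out_msgs"`
	} `json:"connections"`
}

func main() {
	// Assumes the NATS server was started with '-m 8222' on localhost.
	resp, err := http.Get("http://localhost:8222/connz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var c connz
	if err := json.NewDecoder(resp.Body).Decode(&c); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("connections: %d\n", c.NumConnections)
	for _, conn := range c.Connections {
		fmt.Printf("%-20s in=%d out=%d subs=%d\n",
			conn.Name, conn.InMsgs, conn.OutMsgs, conn.Subs)
	}
}
```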

NATS metrics endpoint showing information per client connection

Both of these models provide metrics at some level. The service mesh designs come with more detailed metrics already integrated with visual tools. Out of the box, NATS can show overall system metrics with the Prometheus exporter add-on, and the data in the NATS metrics endpoint can show you more detailed information by client connection. If you want metrics per subject, you would have to create them yourself as far as I can tell.

You can also set up alerts in Prometheus based on rules against these metrics. This works regardless of whether it is NATS, a service mesh, or your own application sending metrics to a Prometheus server.

Special Note: The folks at NATS are working on a new monitoring setup called Surveyor. It allows you to gather metrics on all your NATS servers through the regular NATS port inside the system. To read up on it, check out the linked GitHub repo. It should eventually look something like the screenshot below.

NATS Surveyor (currently in Alpha as of this writing)

Tracing Introduction

When it comes to tracing and service meshes (or anything really), you want to see how a call goes from point A to B to C to D and back from D to C to B to A again before it finally finishes. With a microservices architecture there are plenty of hops in a request, and you want to know who did what, how long it took, which hop took the longest, whether there were errors along the way, and whether it did what you intended.

OpenTracing’s website has great background on this as a starting point, so I will not get into the details of tracing. In my mind, in very simple terms, you have a trace or “tracker” that starts with the first request. It creates something that travels through all of the subsequent calls and then gets back “home”. When the first service calls a second service, it passes the initial trace information along in some form in the header. That service in turn does the same for its subsequent calls. The header information gets passed back around until the end, when the trace is marked “complete”. You can see what started the call, the other places it went, and how it traversed back. Almost like a GPS for your API calls!

Your code has to be instrumented to do this, of course, regardless of NATS or a service mesh implementation. There are plenty of websites, blog posts, and YouTube videos (outside of this article) to show you how to do this if you are interested. For example, Istio explains the trace information pretty well on their documentation site.

Tracing in a Service Mesh

Within a service mesh you need to monitor and understand your API behaviors and relationships as your application flows and communicates. Istio does this through the Envoy proxies it uses inside the mesh and supports exporting that tracing data to backends such as Zipkin, Jaeger, LightStep, and Datadog. You can check out the Istio FAQ for more information on that specifically. Even Honeycomb can help in this realm.

Linkerd uses a collector that consumes the spans mentioned above. That is a similar model to Istio and other service meshes, and it is documented here pretty well.

An example of a trace and how it looks visually in tools such as Jaeger is pictured below. You can even use open source tools like Kiali to visually show the calls made in a service mesh. The view shows the initial call at the top of the listing, with subsequent calls below, along with the span of time each call actually took. It may even color-code errors and responses for HTTP API based calls.

Example of Jaeger for Tracing in Istio via https://istio.io/docs/tasks/observability/distributed-tracing/jaeger/

Tracing in NATS

When I first was looking into tracing in NATS I thought “man, I wish they had this as well! Crap!”. Turns out: they do! I was chatting with some folks in the NATS Slack channel and thankfully they pointed me to the examples on GitHub. There is a reference Golang architecture using OpenTracing in GitHub where they walk through how this works, and they also have one in Java. It was buried in their list of repositories. They even show how to use it with Jaeger in a Docker container.

This can be done with request/reply communication as well as publish/subscribe communication in your messaging clients. It follows the same model described above: you have a tracer, a span, and a span context, and you put all of that into the payload and encode it to send. On the receiving end you decode it, use the message accordingly, and the span (if there is one) is used. The examples in the two GitHub links above use Jaeger as the OpenTracing backend.
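Here is a minimal requestor-side sketch in Go of that idea, assuming the nats.go client and the opentracing-go API with a tracer (for example Jaeger’s) registered as the global tracer. The tracedMsg envelope and the “orders.create” subject are my own made-up illustrations; the reference repos linked above may encode the span context differently.

```go
package main

import (
	"encoding/json"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/opentracing/opentracing-go"
)

// tracedMsg is a hypothetical envelope: the span context travels with the
// payload, since in this model everything rides in the message body.
// (The official NATS OpenTracing reference repos may encode this differently.)
type tracedMsg struct {
	SpanCtx map[string]string `json:"span_ctx"`
	Data    []byte            `json:"data"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL, nats.Name("orders-requestor"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Assumes a real tracer (e.g. Jaeger's) has been set as the global tracer.
	tracer := opentracing.GlobalTracer()
	span := tracer.StartSpan("orders.request")
	defer span.Finish()

	// Inject the span context into a text-map carrier, then encode it
	// alongside the payload.
	carrier := opentracing.TextMapCarrier{}
	if err := tracer.Inject(span.Context(), opentracing.TextMap, carrier); err != nil {
		log.Fatal(err)
	}
	body, _ := json.Marshal(tracedMsg{SpanCtx: carrier, Data: []byte("new order")})

	// "orders.create" is just an illustrative subject name.
	reply, err := nc.Request("orders.create", body, 2*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("got reply: %s", reply.Data)
}
```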

When doing a request/reply message in NATS, the tracing information may look very similar to the chained API calls in a service mesh. The requestor sends out something, the replier sends something back, and then the requestor can use the response. That sounds very much like the API-to-API scenario mentioned above.

With a publish/subscribe message model, the tracing information may not look exactly like an API call’s. You will see the publisher showing its information with a timespan, and then you will see the subscriber later on, after it receives the message. The gap is the time between the publisher sending the message and the NATS server receiving it and routing it out to anyone listening for it.
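And a matching receiver-side sketch, reusing the same hypothetical envelope from the requestor example above: it extracts the span context and links the local span to it (again just an illustration of the pattern, not the exact code from the reference repos).

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/opentracing/opentracing-go"
)

// Same hypothetical envelope as in the requestor sketch above.
type tracedMsg struct {
	SpanCtx map[string]string `json:"span_ctx"`
	Data    []byte            `json:"data"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL, nats.Name("orders-subscriber"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	tracer := opentracing.GlobalTracer()

	_, err = nc.Subscribe("orders.create", func(m *nats.Msg) {
		var in tracedMsg
		if err := json.Unmarshal(m.Data, &in); err != nil {
			log.Printf("bad message: %v", err)
			return
		}

		// Extract the span context that traveled in the payload and link
		// this side's span to it. FollowsFrom fits publish/subscribe;
		// ChildOf is the usual choice for request/reply.
		wireCtx, err := tracer.Extract(opentracing.TextMap, opentracing.TextMapCarrier(in.SpanCtx))
		var span opentracing.Span
		if err != nil {
			span = tracer.StartSpan("orders.handle")
		} else {
			span = tracer.StartSpan("orders.handle", opentracing.FollowsFrom(wireCtx))
		}
		defer span.Finish()

		// Do the work, then answer on the reply subject if there is one.
		if m.Reply != "" {
			m.Respond([]byte("order created"))
		}
	})
	if err != nil {
		log.Fatal(err)
	}

	// Block forever so the subscription stays alive.
	select {}
}
```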

In the new NATS 2.0+ server setup, there is also now monitoring that shows latency for service imports used against a service export. This is the service observability piece in the 2.0 server for tracing and services. It will show the start time, service latency, NATS latency, and total latency.

Summing it all up

Both NATS and service mesh implementations inherently have metrics, tracing, and even logging functionality within them. The level of detail you get out of the box with service mesh implementations for metrics on your API calls is greater than the out-of-the-box metrics for NATS. With NATS you can get deeper metrics, down to the messaging client level, from the metrics endpoint (:8222), but you will need to poll for them and use them yourself in your application.

Tracing for both types of application models seems on par, to me at least. You instrument your code, you push to a server, and you display the spans and calls visually. You do have to do some work on the NATS side to make sure you pass the span information around with your messages, the same way header information is passed from API to API in your service mesh application design.

Of course, to get that level of detail you need to set up the service mesh and more than likely set up Kubernetes to use it. I know several of these vendors say “this works within Kubernetes or without it”. However, I see most of the examples and talks on service meshes only within Kubernetes.

The NATS implementations discussed above are just that: NATS implementations. Whether you run NATS inside Kubernetes, within Docker, using NGS, or just on your server or workstation, it works the same way.

Hopefully this has given you as much useful information as writing it gave me. There are a lot of links in here to the details of metrics and tracing, so please jump out and catch up on these technologies where you need to.

Now, I need to go work on Part IV of this since I am so far behind…
