NATS Weekly #11

Week of January 24 - 30, 2022

🗞 Announcements, writings, and projects

A short list of announcements, blog posts, project updates and other news.

⚡ Releases

💬 Discussions

GitHub Discussions from various NATS repos.

Are the metrics for in/out bytes and message-count that are reported by the nats-server 100% accurate or are they approximations?
jetstream: slow consumer (or what can you do to prevent this)
NATS Terminology Clarification
Background Service??? (using nats.net)

💡 Recently asked questions

Questions sourced from Slack, Twitter, or individuals. Responses and examples are in my own words, unless otherwise noted.

Can I increase replication factor on already existing stream?

Update: As of NATS Server v2.7.3, replicas on a stream can be dynamically scaled up or down through a stream configuration update. The approach below applies to earlier versions.

At the time of writing, this was not natively supported by editing the stream configuration. However, it can be done with a fairly straightforward backup and restore sequence (with a few caveats).

Let’s assume we have a stream called events and a consumer called db-writer that consumes from the stream.

The first step is to back up the stream, including the consumer state, to a non-existent or empty directory and then immediately delete the stream (please read the first caveat below before doing this).

nats stream backup --consumers events ./events

nats stream rm events

The directory will contain the files backup.json and stream.tar.s2. The backup.json file contains the configuration of the stream including the num_replicas field which can be changed to 3.
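As a minimal sketch of that edit, assuming the field appears in backup.json as a plain "num_replicas": <number> pair, a small shell helper can rewrite the value. The function name set_replicas is hypothetical and the sed pattern is only an illustration, not part of the nats tooling; editing the file by hand works just as well.

```shell
# set_replicas FILE COUNT
# Rewrites every "num_replicas": <n> pair in FILE to use COUNT.
# Hypothetical helper; assumes the value is a bare integer.
set_replicas() {
  file="$1"
  count="$2"
  tmp="${file}.tmp"
  # Write to a temp file first so a failed sed never truncates the backup metadata.
  sed -E 's/"num_replicas"[[:space:]]*:[[:space:]]*[0-9]+/"num_replicas": '"$count"'/g' "$file" > "$tmp" &&
    mv "$tmp" "$file"
}

# Example: set_replicas ./events/backup.json 3
```

The temp-file-then-move step avoids relying on sed's in-place flag, which differs between GNU and BSD sed.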

Once the file is changed and saved you can restore:

nats stream restore events ./events

This will recreate the stream with the same name and restore all the consumer state. Existing subscriptions for consumers will remain connected and (should) pick up where they left off.
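To confirm the restore did what was intended, a quick check along these lines can help. This is a sketch, not an official procedure: verify_restore is a hypothetical helper, and it assumes the JSON printed by nats stream info --json includes a num_replicas field for the stream.

```shell
# verify_restore STREAM CONSUMER
# Prints the restored stream's replica setting and confirms the consumer exists.
verify_restore() {
  stream="$1"
  consumer="$2"
  # --json prints the stream info as JSON; grep pulls out the replica setting.
  nats stream info "$stream" --json | grep -o '"num_replicas"[^,}]*' || return 1
  # A non-zero exit here means the consumer was not restored.
  nats consumer info "$stream" "$consumer" > /dev/null || return 1
}

# Example: verify_restore events db-writer
```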

The caveats to keep in mind include:

A backup is a snapshot in time, so if new events come in prior to deleting the stream, those events are lost. If you can pause publishers or otherwise guarantee that no events are received in the meantime, then you should be good. If you are really paranoid, you could create a mirror stream of the stream being deleted, delete the stream, and then do the backup off the mirror.
When the stream is deleted, publishers may have undesired behavior. The best case scenario is that a publisher is set up to expect an ack from the stream and will retry until it gets one. The worst case scenario is that it is publishing without requiring an ack and those messages will be sent into the abyss.
Doing this on a very large stream or highly active stream may be untenable. I don’t have any benchmark/max size in mind, but if you have a staging environment to test this approach on first (with near-production traffic), that would be wise to ensure it works. And again, setting up a mirror stream could be a great disaster recovery option if something goes awry.
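
For the mirror idea mentioned in the caveats, a rough sketch could look like the following. The helper names and the events-mirror stream name are hypothetical. It assumes the --mirror flag on nats stream add (the CLI will prompt for any remaining settings) and that the JSON from nats stream info --json reports a lag field for a mirror.

```shell
# create_mirror SOURCE MIRROR
# Creates MIRROR as a mirror of SOURCE.
create_mirror() {
  nats stream add "$2" --mirror "$1"
}

# mirror_caught_up MIRROR
# Succeeds only when the mirror reports a lag of 0 behind its origin.
mirror_caught_up() {
  lag=$(nats stream info "$1" --json | grep -o '"lag"[^,}]*' | head -n 1 | tr -cd '0-9')
  [ "$lag" = "0" ]
}

# Example:
#   create_mirror events events-mirror
#   mirror_caught_up events-mirror && echo "mirror is in sync"
```

Checking the lag before deleting the original stream is one way to reduce the chance of losing events that arrived after the mirror was created.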

In a NATS cluster, do all published messages propagate to all nodes?

No, each node keeps track of a list of subject interests based on the clients that are connected. This subject list is gossiped to the other nodes so an incoming message to a node is only forwarded to nodes that have interest in that message subject.

If this were not the case, every message would have to be forwarded to every node, and messaging would get more costly with every new node that is added to the cluster.

A related question has been asked before.

What is the right balance of streams and subjects?

I previously answered a related question (can I have thousands or millions of streams?) and went into further detail about the difference between a stream and a subject.

This recent question of balance is a bit more subtle. Ultimately, the only way to confirm you designed the streams well is to benchmark them.

However, there are a few guiding heuristics that can be used to inform a proper design.

Design your subjects first. Think about the desired hierarchy and desired access patterns (filters) on those subjects on the consumption side. Concurrent publishers from different contexts should always have a token differentiating these contexts (e.g. geographical region). This provides more information for subject interest and routing, but acknowledges that they are concurrent messages rather than treating them as one serial stream.
Once the subject hierarchy is designed, determine what the desired throughput of the publishers should be. It may be perfectly possible to use one (replicated) stream to satisfy your entire subject space. If not, use those “context tokens” as an opportunity to split up subjects across N streams to achieve the desired throughput. Replication (consensus) is managed on a per-stream basis, so streams will perform differently.
If there is a set of subjects that require extreme performance (on some axis), recall that a NATS cluster can have many JetStream nodes that streams are provisioned on. There could be a set of nodes that have better hardware (more memory/CPU) that can serve those specific streams. Currently, there is no stream-to-node affinity support, but this could likely be achieved using a leaf node cluster.
Since there can be N consumers associated with a single stream, each with overlapping or different subject filters, the number of consumers on the stream may be useful to consider as well. Consumer state is replicated on the same nodes that the stream itself is replicated on. So splitting up the streams will mean the consumers are split up as well. The trade-off is that a consumer is bound to a single stream, so cross-stream use cases would need to be managed manually in the application (which is not hard!).
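
To make the “context token” split concrete, here is a small sketch using the CLI. Everything named here is hypothetical: the events.<region>.> subject layout, the events-<region> stream names, and the orders token in the consumer filter. It assumes the --subjects flag on nats stream add and the --filter flag on nats consumer add; the CLI will prompt for any settings not given.

```shell
# add_regional_streams REGION...
# Creates one stream per region token, each capturing events.<region>.>
add_regional_streams() {
  for region in "$@"; do
    nats stream add "events-$region" --subjects "events.$region.>" || return 1
  done
}

# add_filtered_consumer REGION
# Adds a db-writer consumer that only sees order subjects in one region.
add_filtered_consumer() {
  nats consumer add "events-$1" db-writer --filter "events.$1.orders.*"
}

# Example:
#   add_regional_streams us eu
#   add_filtered_consumer us
```

Each regional stream replicates independently, and the consumer is bound to exactly one of them, which is the trade-off described above.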

These heuristics get more and more into the weeds, but as a rule of thumb: assume one stream, design your subject space, think about how many publishers and consumers you will have, and do some benchmarking if you are concerned about a specific bottleneck.