Building Real-Time Distributed Systems

With distributed systems, it's important to figure out how to propagate state across many devices.  Often, there will be many devices that produce data, and many consumers that wish to be notified of any state changes.

In a traditional client-server model, there is a direct correlation between a consumer of data, such as your phone or computer, and a server, which responds directly with the data you requested.  Any request must be addressed for a particular recipient, and you'd need a way to know where the other devices live somehow (likely using DNS or fixed addresses).  For all intents and purposes, the request and the response are tightly coupled.

This is effective for many use cases where only a single client is interacting with a given piece of data at a time, as with most websites using a traditional, synchronous client-server paradigm, but real-time applications need an asynchronous messaging paradigm for streaming updates to devices.  For instance, when you are on Google Docs, your computer keeps a connection open that allows it to get updates in real-time when other people edit a document.

Asynchronous-first

The main advantage of asynchronous messaging is the way it affects the impedance of data flow.  Rather than consumers synchronously making requests to producers any time they want new data, the producers can asynchronously publish new data whenever it is available.  This reduces the amount of data flowing over the network, and it means that the producers push data out in real time, rather than having the consumers ask for new data regularly, even when none is available.

The publisher/subscriber model is versatile, and we see this pattern in all sorts of systems.  Smart home systems use protocols like Z-Wave, where dimming lights or setting a thermostat will publish the new state to other devices on the network.  In server-side architecture, Redis is a popular in-memory key-value store that offers pub/sub functionality for observing changes to key-value pairs.

In a pub/sub model, applications respond to events asynchronously rather than with any form of polling.  Any state change is published as messages that reflect versioned state, meaning that consumers only need to update their own systems each time a device has a state change.  This minimizes down-stream effects that must cascade across multiple control planes, and it loosely couples producers and consumers of data so that they can dynamically communicate without fixed addresses or message synchronization.

Client-server architecture, in its best instantiation, requires some sort of router app anyway, something responsible for servicing requests, routing them to different systems, and persisting state in a database or in-memory cache.  Ideally, any consumers of state changes can receive an update to state asynchronously (via HTTP long-polling or WebSockets), but many systems may even use polling as an alternative to asynchronous interrupts.  Simplicity is the main advantage of this, but it makes the network significantly chattier with more devices.  This is why many devices these days use asynchronous messaging; with the amount of devices on a given network, they need to operate more efficiently.

Routed messaging

While WebSockets alone allows for sending asynchronous messages between two hosts over a persistent connection, WAMP extends WebSockets with a router for keeping track of which clients are interested in particular publications or remote procedure calls.  This means that devices don’t need to keep track of who to send updates to, and they only need to keep open a single WebSockets connection to the router.  It is especially powerful for sharing state across a distributed and dynamic group of observers, as in a smart home or control system, or even for chatrooms and games.

A producer can publish new data to a topic in real-time, and the router will pass the messages along to any clients in the realm that are subscribed to that topic.  It doesn’t matter when devices connect or disconnect, and we don’t need to worry about their IP addresses.  Everyone just connects to the router, subscribing to whichever topics are relevant.

In a distributed system, any changes you perform between a client and a device must be published such that any other consumers of that device's state will know about the state change.  You might perform an RPC to make a change on a device, but still only the caller would know that it happened.  RPC is a synchronous messaging paradigm, where a request and a response are tightly coupled.

This is where a publisher-subscriber model comes in.  Pub/sub allows us to make changes on a single device, and then have the state reflected on any other consumers of that state.  These systems are versatile, reducing the amount of configuration required on devices, and making it easy to handle on-the-fly connections with little synchronization.  

In this example, a touch panel subscribes to the Temperature topic.  Now any time a device publishes on that topic, they will receive the publication.

Pub/sub is very effective where there is an arbitrary number of devices consuming state from sensors or other systems.  A device can publish to a topic any time they have new data, and all subscribers will receive the new state.  

Here, a car’s A/C system registers an RPC, SetTemperature.  Next, all of the consumers of A/C temperature data in the car subscribe to a publication Temperature.  Any time the A/C system receives a SetTemperature RPC, it will publish on the Temperature topic, and all subscribed devices in the car will receive the new temperature data.

State changes are broadcasted, making it much easier to keep all devices in sync, and consumers can dynamically subscribe to topics without extensive network configuration.  The main downside, as with any fire-and-forget messaging paradigm, is you can’t necessarily verify who received the message; the router can acknowledge that it received a publication, but it doesn’t tell you who it broadcasted the message to.  You can architect around this by having consumers publish to a separate acknowledgement topic each time they receive a message, or by building synchronization primitives and sending out regular pings and keep-alives.

You also need a way to do feature and device discovery.  A pattern I like is probe/broadcast.  It's a bit like Marco Polo; when a new consumer joins the network, they subscribe to messages for a particular device type, and then publish a probe message.  Any devices of that type on the network would receive the probe message and then publish their state.

I recommend, at least, exploring how you can use WebSockets in your application for event-driven programming.  WAMP and other pub/sub servers are helpful when working with clusters of devices, but WebSockets in general gets you architecting around asynchronous programming and responding to events and data in real time, rather than performing synchronous operations.  Most applications that are already built around a REST API can be extended with WebSockets to support real-time updates that make applications feel more alive and more responsive to the user, especially when multiple people are working on the same page.