Achieving High Availability with Azure IoT Edge

Many partners ask me how to address High Availability to achieve resiliency with Azure IoT Edge.  Their concern is that they are running mission-critical workloads at the edge and want to eliminate single points of failure to match business needs.

While IoT Edge has many built-in features that help with uptime, such as automatically restarting failed modules, there are still risks associated with hardware, networking, and software that can reduce resiliency or uptime at the edge.

Message Loss:

Many partners are also concerned about the loss of individual messages during a failure.  While that topic is out of scope here, there are many factors to consider beyond the challenges of HA alone, which make IoT Edge HA just one component of a full solution.

IoT Edge as a Gateway:

If leveraging Azure IoT Edge as a gateway, in which client devices initiate communication to IoT Edge, high availability may be accomplished by deploying more than one active gateway behind a load balancer.  This works for many use cases, except certificate-based child device authentication and IoT Edge offline usage, as both require defining the parent-child relationship, which is currently a one-to-one mapping.  The load balancer pattern is out of scope for this document.

IoT Edge as a Client:

In addition to acting as a gateway, IoT Edge can also run modules that interact with downstream systems.  In this pattern, IoT Edge initiates the communication to downstream devices through either a custom or Marketplace module.

It should be noted that an IoT Edge device can act as gateway and client at the same time. For the rest of this document, we will focus only on IoT Edge as a client.

Define your goals:

Before choosing the correct HA solution, you need to consider your goals and objectives, specifically RTO (Recovery Time Objective), RPO (Recovery Point Objective), operational cost, productivity cost of downtime, and other single points of failure (network, cooling, building, power, etc.)  Lowering the recovery time often increases the cost or complexity of almost any solution.
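To make the trade-off concrete, a rough expected-downtime calculation can be sketched.  The failure rates and dollar figures below are purely illustrative assumptions, not product numbers:

```python
# Hypothetical back-of-the-envelope comparison of HA options.
# Every figure here is an illustrative assumption.

def annual_downtime_cost(failures_per_year: float,
                         rto_minutes: float,
                         cost_per_minute: float) -> float:
    """Expected yearly cost of downtime for a given recovery time."""
    return failures_per_year * rto_minutes * cost_per_minute

# Assume 2 failures/year and downtime costing $500/minute.
standby = annual_downtime_cost(2, rto_minutes=240, cost_per_minute=500)  # manual hardware swap
cluster = annual_downtime_cost(2, rto_minutes=0.5, cost_per_minute=500)  # hypervisor failover

print(f"standby hardware: ${standby:,.0f}/yr, clustering: ${cluster:,.0f}/yr")
```

Comparing the two totals against the extra operational cost of clustering is one way to decide whether a lower RTO is worth the added complexity.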

From a pattern perspective, there are at least six main approaches to HA with IoT Edge, as summarized below:

Approach | Cost | Complexity | RPO | RTO
Single IoT Edge with standby hardware | Lowest | Lowest | Unknown | Unknown
Clustering at the hypervisor using multiple hosts | Highest | Medium | ~0 | Seconds
Clustering with Kubernetes using multiple hosts | High | High | ~0 | Minutes
Clustering with 3rd party solutions using multiple hosts | Medium | Medium | ~0 | Minutes
Multiple IoT Edge hosts, each sending duplicate messages | High | Medium | 0 | 0
Edge module that is High Availability aware | Low | Medium | ~0 | Seconds

Single IoT Edge with standby hardware:

The simplest and lowest-cost availability approach is maintaining spare hardware to be deployed when needed.  Azure IoT Edge maintains its configuration in the cloud, so recovery can be as simple as deploying the standby hardware and restarting the edge workload.  Some downstream devices that IoT Edge communicates with queue up messages, so if data is queued and the standby hardware is brought up in time, all past data can be recovered.
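The backlog-recovery idea can be sketched as follows; `read_backlog` stands in for whatever history or buffer interface the downstream device exposes (a hypothetical name, not a real SDK call):

```python
# Illustrative sketch of backlog recovery after standby hardware comes online.
# `read_backlog` and `send_upstream` are hypothetical stand-ins for the
# downstream device's buffer API and the module's upstream send path.

def recover_backlog(read_backlog, send_upstream, last_acked_seq: int) -> int:
    """Replay every queued reading newer than the last acknowledged one."""
    recovered = 0
    for reading in read_backlog(after_sequence=last_acked_seq):
        send_upstream(reading)
        recovered += 1
    return recovered

# Simulated downstream buffer holding readings 1..5, with 1..3 already sent:
buffered = [{"sequence": n, "value": n * 1.5} for n in range(1, 6)]
sent = []
count = recover_backlog(
    read_backlog=lambda after_sequence: (r for r in buffered if r["sequence"] > after_sequence),
    send_upstream=sent.append,
    last_acked_seq=3,
)
print(count)  # 2 readings replayed
```

This only works if the downstream device actually buffers data and exposes a way to read from a known position; how long the buffer lasts bounds how quickly the standby hardware must be deployed.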

Clustering at the Hypervisor:

Clustering at the hypervisor, such as Microsoft Hyper-V, VMware, or similar, is a common pattern for larger edge gateways.  With this approach, Azure IoT Edge is not aware that it is running in a High Availability mode, and simply executes regardless of the host machine it is running on.  This approach assumes duplicated host hardware and supports multiple operating systems in the virtual machines.

The RPO for this approach is ~0, and the RTO is seconds.

Clustering with Kubernetes:

Azure IoT Edge includes High Availability awareness based on Kubernetes, an open-source orchestrator originally from Google.  More information about IoT Edge on Kubernetes can be found here:

This approach assumes an existing multi-node Kubernetes cluster running on Linux, with both persistent volumes and network management in place.  In this configuration, Kubernetes manages the deployment of both the IoT Edge runtime and IoT Edge modules, addressing availability by redeploying modules on other worker nodes in the event of failure.

The RPO for this approach is ~0, and the RTO is multiple minutes.

Clustering with 3rd party solutions:

For partners and customers concerned about support options for open-source clustering solutions, 3rd party providers, such as HashiCorp's Nomad, offer lower-complexity, supported alternatives.

With the goal of simplifying production-grade workload management, Nomad can simplify and lower the operating cost of running IoT Edge.  With both a free open-source and a paid enterprise offering, Nomad can lower the barrier to adoption.

The architecture is similar to Kubernetes, including multiple worker nodes and a scheduling node.

The RPO for this approach is ~0, and the RTO is multiple minutes.

Multiple IoT Edge, each sending duplicate messages:

Running parallel processing on multiple IoT Edge devices offers the lowest RTO and RPO of these patterns with relatively low complexity.  This pattern leverages a cloud-based solution, such as Azure Stream Analytics, to deduplicate messages based on a defined unique key.  As with many parallel-processing solutions, this increases availability but does not increase throughput, as each host must perform the exact same work.  When duplicating the processing, consideration should be given to where the data is derived from and any performance impact on those sources.
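The cloud-side deduplication step can be sketched as follows.  This is a minimal illustration of the idea, not Azure Stream Analytics syntax; the key fields are assumed to be a device id plus a sequence number stamped identically by both Edge hosts:

```python
# Minimal sketch of cloud-side deduplication for the dual-Edge pattern.
# Both Edge devices stamp each reading with the same logical key
# (here, deviceId + sequence); the first copy to arrive wins.

def deduplicate(messages, key_fields=("deviceId", "sequence")):
    """Yield each logical message once, keyed on the chosen unique fields."""
    seen = set()
    for msg in messages:
        key = tuple(msg[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            yield msg

# Interleaved stream from two parallel IoT Edge hosts:
stream = [
    {"deviceId": "sensor-1", "sequence": 1, "value": 20.1},
    {"deviceId": "sensor-1", "sequence": 1, "value": 20.1},  # duplicate from host B
    {"deviceId": "sensor-1", "sequence": 2, "value": 20.4},
]
unique = list(deduplicate(stream))
print(len(unique))  # 2 unique messages survive
```

In a real deployment the `seen` set would need a retention window (a streaming job cannot remember keys forever), which is exactly the kind of windowed state a service like Azure Stream Analytics manages for you.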

The RTO and RPO for this approach are 0.

Edge Module that is High Availability aware:

The final pattern is awareness at the custom module level.  In this architecture, two or more IoT Edge devices communicate with each other at the container/module level, performing their own election process.  The elected “Active” module communicates (sends and receives messages) with Azure IoT Hub.  If the “Passive” module has not received a heartbeat from the “Active” module, an election takes place, and the newly elected “Active” module starts communicating with the cloud, avoiding duplicated messages.

This approach does not require matching host hardware and is operating-system agnostic.  It is active/passive, meaning that processing is limited to the capacity of the “Active” host’s hardware.

While this approach can be accomplished with just two nodes, more common deployments include a third node, or witness, to address the case of “split brain”, where both nodes claim to be active; the third node/system/witness is used to resolve the conflict.  A witness can be a resource such as shared storage or cloud connectivity.
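The election logic described above can be sketched as a toy model; the class, method names, and timeout below are illustrative assumptions, not an actual module implementation:

```python
import time

class HaModule:
    """Toy active/passive election based on heartbeat age (illustrative only)."""

    def __init__(self, name: str, heartbeat_timeout: float = 2.0):
        self.name = name
        self.active = False
        self.heartbeat_timeout = heartbeat_timeout
        self.last_peer_heartbeat = time.monotonic()

    def on_peer_heartbeat(self) -> None:
        """Called whenever the Active peer's heartbeat arrives."""
        self.last_peer_heartbeat = time.monotonic()

    def check_peer(self, witness_agrees: bool) -> bool:
        """Promote to Active only if the peer is silent AND the witness concurs,
        which guards against the split-brain case."""
        peer_silent = (time.monotonic() - self.last_peer_heartbeat) > self.heartbeat_timeout
        if peer_silent and witness_agrees:
            self.active = True
        return self.active

# Passive node whose peer goes silent:
passive = HaModule("edge-b", heartbeat_timeout=0.05)
time.sleep(0.1)                                 # no heartbeat arrives in time
print(passive.check_peer(witness_agrees=True))  # promoted to Active
```

Requiring the witness's agreement is what turns a two-node heartbeat scheme into a three-party decision: a node that merely lost its network link to its peer cannot promote itself unless the witness also sees the peer as down.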

The RPO for this approach is ~0, and the RTO is seconds, configurable down to milliseconds.

Examples of this pattern are here:
