High Availability in Corda 5
October 06, 2022
By: James Higgs, Staff Software Engineer
One of the key features of Corda 5 is high availability (HA). A highly available system is one which guarantees some level of uptime by being resilient to faults that may occur during operation. In order to achieve this, the system needs to provide redundancy.
This article explores the kind of high availability that Corda 5 provides, in contrast to previous releases of Corda, and how it is implemented in the context of executing flows. Before considering high availability, however, let’s first discuss the Corda 5 architecture.
The worker architecture
Corda 5 uses a fundamentally different architecture to previous versions of Corda. The diagram below shows how a Corda 5 cluster is structured.
Figure: The worker architecture
Corda 5 consists of a number of worker processes responsible for different aspects of running Corda. The RPC worker handles requests to use the system on behalf of a particular virtual node, the flow worker runs business workflows, the DB worker handles interactions with the cluster database, and the crypto worker handles sensitive cryptographic operations such as signing. Additional workers are also available to handle cross-cluster communication and network membership.
These workers communicate via Kafka, which acts as a central message bus between the worker processes. Kafka is chosen here as an industry-standard solution for streaming messages between different processes, which is effectively what the Corda 5 workers are doing when they communicate. It also provides a way of load balancing those streams of data, which Corda relies on for providing HA.
What kind of availability does Corda 5 provide?
In order to provide HA, Corda must provide some redundancy for each of its components. This incurs a cost, however, as there is more infrastructure to run in order to provide a full cluster. The cost cannot be avoided, but the system should ensure that the extra infrastructure does useful work while the system is running normally.
To do this, Corda leverages the load balancing provided by Kafka. When a worker subscribes to its event streams, the Kafka broker assigns some portion of the stream to that worker. Each worker instance then becomes responsible for processing the messages in its portion. (This setup is sometimes referred to as hot/hot or active/active, meaning that all redundant parts of the system participate in processing the work.)
Figure: Load balancing across workers
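The load balancing described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Corda or Kafka code: the worker names, the partition count, the `crc32` hash (Kafka's default partitioner uses murmur2 for keyed records) and the round-robin strategy are all chosen for illustration, and the article does not say which assignment strategy a Corda cluster uses.

```python
from zlib import crc32

NUM_PARTITIONS = 6  # hypothetical partition count for one topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition with a stable hash.

    crc32 is a stand-in here; it is not the hash Kafka itself uses.
    """
    return crc32(key.encode("utf-8")) % num_partitions


def assign_round_robin(partitions, workers):
    """Spread partitions over workers so every worker takes a share of the load."""
    assignment = {worker: [] for worker in workers}
    for index, partition in enumerate(sorted(partitions)):
        assignment[workers[index % len(workers)]].append(partition)
    return assignment


# Hypothetical worker names; all three are active at the same time (hot/hot).
workers = ["flow-worker-1", "flow-worker-2", "flow-worker-3"]
assignment = assign_round_robin(range(NUM_PARTITIONS), workers)
print(assignment)
# {'flow-worker-1': [0, 3], 'flow-worker-2': [1, 4], 'flow-worker-3': [2, 5]}
```

Because the hash is stable, every message with the same key lands in the same partition, and therefore with whichever worker currently owns that partition.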
The advantage of this approach becomes apparent when a failure occurs in a worker instance. In this case, there is already another worker instance up and running that is able to pick up the slack. Kafka rebalances the distribution of the stream, assigning the partitions previously handled by the failed instance to the instances that are still running. This allows the system to fail over.
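The rebalancing step can be sketched in the same spirit. The function below is a hypothetical model of the idea, not Kafka's rebalance protocol: it hands each partition of a failed worker to whichever surviving worker currently holds the fewest partitions.

```python
def rebalance_after_failure(assignment, failed_worker):
    """Return a new assignment with the failed worker's partitions
    redistributed among the surviving workers."""
    survivors = sorted(w for w in assignment if w != failed_worker)
    if not survivors:
        raise RuntimeError("no surviving workers to take over")
    new_assignment = {w: list(assignment[w]) for w in survivors}
    for partition in assignment[failed_worker]:
        # Give the orphaned partition to the least-loaded survivor.
        target = min(survivors, key=lambda w: len(new_assignment[w]))
        new_assignment[target].append(partition)
    return new_assignment


# Hypothetical starting point: three active workers, two partitions each.
before = {"flow-worker-1": [0, 3], "flow-worker-2": [1, 4], "flow-worker-3": [2, 5]}
after = rebalance_after_failure(before, "flow-worker-2")
print(after)
# {'flow-worker-1': [0, 3, 1], 'flow-worker-3': [2, 5, 4]}
```

No partition is left unowned after the failure, which is what lets processing continue without waiting for the failed instance to be restarted.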
An aside: contrast to Corda 4
If you are familiar with Corda 4, you might be wondering why such a scheme is not possible under the old version of Corda. The Corda 4 node, rather than being broken into multiple worker processes, is a single monolithic process. Why not simply run multiple instances of this?
The problem here is one of load balancing and failover. It is possible to have a second Corda 4 instance available to some extent, but there is no mechanism to balance the load across multiple instances, and so this second piece of infrastructure is incurring all the costs of running the node with no benefit in terms of system performance. Architecturally there is no way of introducing this, so at best Corda 4 might be able to achieve an active/standby setup, where a second instance is ready but not processing anything.
In practice, Corda 4 cannot be run hot/hot for technical reasons, and so the best possible availability scenario is hot/cold (active/standby). This means that failover can be quite lengthy, and during the failover period the system is not available. Corda 5 represents a considerable step up for this reason.
Let’s change focus now and look at Corda flows, particularly how these can be executed properly in a highly available system. Flows are the foundation of any CorDapp (Corda Distributed Application). A flow represents a piece of business logic. The Corda platform provides an API that makes writing these flows look like simple, sequential code. Behind the scenes, Corda does a lot of work to ensure that execution of these flows is resilient. Essentially, the purpose of the flow API is to abstract away the challenges of communicating between counterparties, so that business logic can be written more clearly.
The solution Corda uses for this is a checkpointing mechanism. When a flow reaches suitable points in the code, the platform generates a checkpoint object by pausing the execution of the flow and recording the current state. This object can then be used to resume the flow again when it is ready to do so; perhaps because a message from a counterparty has arrived, or a database request has completed. If some failure occurs in the platform midway through flow execution, the flow can be restored from a previous checkpoint.
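A toy model can make the checkpointing idea concrete. It is a minimal sketch under loud assumptions: the real platform records the paused flow's execution state, whereas this example substitutes an explicit step counter, and the `Checkpoint` class, the step names and the JSON encoding are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Checkpoint:
    flow_id: str
    step: int = 0                             # index of the next step to run
    data: dict = field(default_factory=dict)  # values gathered so far

    def serialize(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def deserialize(raw: str) -> "Checkpoint":
        return Checkpoint(**json.loads(raw))


# Hypothetical two-step flow: each step waits for one event and records it.
STEPS = ["await_quote", "await_signature"]


def resume(raw_checkpoint: str, event_payload: str) -> str:
    """Restore the flow from a checkpoint, run it to its next suspension
    point, and return the new serialized checkpoint."""
    checkpoint = Checkpoint.deserialize(raw_checkpoint)
    if checkpoint.step >= len(STEPS):
        return raw_checkpoint  # flow already finished; nothing to do
    checkpoint.data[STEPS[checkpoint.step]] = event_payload
    checkpoint.step += 1
    return checkpoint.serialize()


saved = Checkpoint(flow_id="alice-1").serialize()
saved = resume(saved, "quote=100")
# The process could stop here: the string in `saved` is all that is
# needed to continue the flow later, possibly in a different process.
saved = resume(saved, "signed-by-bob")
final = Checkpoint.deserialize(saved)
print(final.step, final.data)
# 2 {'await_quote': 'quote=100', 'await_signature': 'signed-by-bob'}
```

The point of the sketch is that nothing about the flow lives in the running process between events; everything needed to continue is in the serialized checkpoint.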
Flows are also present in Corda 4, which uses a similar mechanism. However, Corda 5 has an additional challenge. Rather than a single node processing all requests, there are now potentially multiple flow workers. If one of these fails, we want the flow to be picked up by another one and continued. How does Corda 5 handle this? To understand the solution, a deeper dive into how Corda 5 uses Kafka is needed.
Corda 5 uses an abstraction layer over Kafka internally, called the message patterns library. This encapsulates common messaging patterns that Corda 5 uses to communicate between workers. For most patterns this is a straightforward application of Kafka configuration. Flows, however, present a unique challenge that requires some extra work.
Figure: Distribution of events across flow workers
A flow consists of the checkpoint state plus a stream of events that cause that checkpoint state to be modified. On receiving a flow event, the flow worker must also read in the checkpoint state for the flow and recreate the flow program state. It then runs the flow code until a suitable point is reached to generate a new checkpoint. At that point, a new checkpoint is generated alongside any output events. Both must be written back to Kafka in the same atomic operation that marks the input events as consumed.
For this to work, the two types of events (the input event stream and the checkpoints) must be kept in sync with each other. The patterns library provides a dedicated pattern for handling this type of use case. Appropriately, this is referred to as the state and event pattern. By using this, we can ensure that a worker is assigned exactly the right checkpoints (states) for the input events that it is going to process.
A future blog post will describe this in more detail.
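Until then, the core idea can be approximated with a small in-memory model. This is a sketch under stated assumptions and not the message patterns library itself: the `StateAndEventStore` class, the partition count and the `crc32` hash are hypothetical. What it shows is that events and states for a flow are routed by the same key, so the worker that owns an event partition also owns the matching state partition.

```python
from zlib import crc32

NUM_PARTITIONS = 4  # hypothetical; shared by the event and state topics


def partition_for(flow_key: str) -> int:
    """Route a flow's events and its state with the same stable hash."""
    return crc32(flow_key.encode("utf-8")) % NUM_PARTITIONS


class StateAndEventStore:
    """In-memory stand-in for a co-partitioned event topic and state topic."""

    def __init__(self):
        self.events = {p: [] for p in range(NUM_PARTITIONS)}
        self.states = {p: {} for p in range(NUM_PARTITIONS)}

    def publish_event(self, flow_key, payload):
        self.events[partition_for(flow_key)].append((flow_key, payload))


def process_partition(store, partition, processor):
    """Whoever owns event partition N also owns state partition N, so the
    state for every event it reads is guaranteed to be at hand."""
    states = store.states[partition]
    for flow_key, payload in store.events[partition]:
        states[flow_key] = processor(states.get(flow_key), payload)
    store.events[partition].clear()


def count_events(state, payload):
    """Toy processor: the 'checkpoint' is just the number of events seen."""
    return (state or 0) + 1


store = StateAndEventStore()
for payload in ["start", "reply", "finish"]:
    store.publish_event("alice-1", payload)
store.publish_event("bob-7", "start")

for partition in range(NUM_PARTITIONS):
    process_partition(store, partition, count_events)

print(store.states[partition_for("alice-1")]["alice-1"])  # 3
print(store.states[partition_for("bob-7")]["bob-7"])      # 1
```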
Executing flows across failover
With the state and event pattern in hand, it is now possible to see how failover works for flows. If a flow worker processing a flow fails, the event partitions that were assigned to that worker are redistributed by Kafka to the surviving flow workers. These workers assign themselves the corresponding checkpoints in the state topic. The flow workers are now able to pick up where the failed worker left off.
Figure: Failover of a flow worker
In the above diagram, the flow workers from before are shown on the left. Each worker is processing events for three virtual nodes: Alice, Bob and Charlie. (Note that flows for the same virtual node are not necessarily processed by the same worker, as with Alice in this case.) Each flow is identified by the virtual node and an identifier (a number in this case), as in the previous diagram. If one worker fails, then the state and event partitions are assigned, in sync, to another worker. This is the scenario shown on the right.
Importantly, the full state for the flow is contained within the flow checkpoint, and not within the worker itself. As a result, the newly assigned worker can pick up exactly where the previous one left off. By carefully configuring Kafka, we can ensure that marking an event as consumed is atomic with writing the outputs of processing that event back to Kafka. This ensures that we do not accidentally lose input events or incorrectly overwrite flow state.
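The effect of that atomicity can be demonstrated with a toy model. It is a sketch under assumptions, not Corda code: the `Partition` class, the `run_worker` function and the simulated crash are hypothetical. In Kafka itself, the transactional producer is the usual mechanism for committing consumed offsets together with produced records; the article does not spell out the exact configuration Corda uses.

```python
import copy


class WorkerCrashed(Exception):
    pass


class Partition:
    """Committed view of one partition: input offset, checkpoints, outputs."""

    def __init__(self, events):
        self.events = events       # list of (flow_key, payload)
        self.committed_offset = 0  # index of the next unconsumed event
        self.checkpoints = {}
        self.outputs = []

    def commit(self, new_offset, checkpoints, outputs):
        # Offset, checkpoints and outputs change together or not at all.
        self.committed_offset = new_offset
        self.checkpoints = checkpoints
        self.outputs = outputs


def run_worker(partition, crash_before_commit_at=None):
    while partition.committed_offset < len(partition.events):
        offset = partition.committed_offset
        flow_key, payload = partition.events[offset]
        # Work on private copies; nothing is visible until commit().
        checkpoints = copy.deepcopy(partition.checkpoints)
        outputs = list(partition.outputs)
        checkpoints[flow_key] = checkpoints.get(flow_key, []) + [payload]
        outputs.append(f"{flow_key} handled {payload}")
        if crash_before_commit_at == offset:
            raise WorkerCrashed(f"crashed while processing offset {offset}")
        partition.commit(offset + 1, checkpoints, outputs)


partition = Partition([("alice-1", "start"), ("alice-1", "reply"), ("bob-7", "start")])

try:
    run_worker(partition, crash_before_commit_at=1)  # first worker fails mid-event
except WorkerCrashed:
    pass
offset_after_crash = partition.committed_offset
checkpoints_after_crash = copy.deepcopy(partition.checkpoints)

run_worker(partition)  # a surviving worker takes over the same partition
print(partition.committed_offset, partition.checkpoints)
# 3 {'alice-1': ['start', 'reply'], 'bob-7': ['start']}
```

The half-finished work of the crashed worker never becomes visible, so the surviving worker reprocesses the event at offset 1 from the last committed checkpoint, and each event appears exactly once in the final state and outputs.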
In this article we discussed what high availability (HA) means for Corda 5 and how it is a significant step up from Corda 4. We also delved a little deeper into Corda flows, explaining how the new architecture continues to support resilient workflows while also allowing for failover of the processes that execute them.