A (Virtual) Node by any other name would compute as sweetly

August 22, 2022

This blog post looks to dig into one of the foundations of the new Corda 5 architecture, Virtual Nodes, and highlights the real-world impact of this architectural change. Let us first use a graph to lay out exactly what we were seeking to address with this change and deliver in Corda 5.

Firstly, we wanted to address issues around the growth of an Application Network through onboarding an ever greater number of identities. With older versions of Corda, the overheads grow linearly with each new identity added to the Application Network. This is for a number of reasons we’ll talk about later, but it will be familiar to anyone operating Corda at scale and doing so as a service for others (operating Nodes on behalf of customers). Corda 5 removes that direct link: identities can be onboarded with overhead tied only to the load they bring to the system, rather than paying a cost for simply existing.

Secondly, once we have identities onboarded within an Application Network, what about the overheads incurred through keeping them available to transact with other network members? With Corda 4 and below, this sits at a fixed level. There is no capacity to have identities be transient, only active when acting. Additionally, with the identity constrained by a single compute instance, at some point (if we set aside the ability to provision a new, larger, compute instance), we run out of ability to handle more throughput, becoming bottlenecked on the limits of a single physical or virtual resource.

Corda 5’s clustered, service-oriented architecture breaks that link. Identities are ephemeral until transacting. A “node” with nothing currently active can consume very few resources until needed, and resources previously committed to servicing higher loads can be released. Such processing loads include:

  • Firstly, where a single identity is executing a very large number of transactions.
  • Secondly, where a large number of identities are each executing a small number of transactions.
  • Thirdly, a combination of the two, i.e. where some identities transact very frequently while others transact infrequently.
  • Finally, where there is a variable, batch-processing style load, i.e. everybody is mostly idle for most of the day and then load suddenly arrives in bursts.

Enter the Virtual Node

How does Corda 5 achieve this? The answer is the introduction of Virtual Nodes, a pithy name for a simple concept but a complex reality that ends up confusing everyone, in part due to the apparent relationship that the names Node and Virtual Node conjure in one’s imagination. Don’t worry, if you’re using the term, you’re not wrong! It’s a great term that really captures a powerful difference between Corda 4 and 5, but there be details here! So let’s look at what’s really going on behind this complex topic.

First, let us start with “what is a node”

The Node! A stalwart of Corda’s past, and the thing we all think we know. But before moving on, let’s just check we know where we’ve been.

Thus, let us examine the definition from the Corda Whitepaper (revised version):

“The node acts as an application server which loads JARs containing CorDapps and provides them with access to the peer-to-peer network, signing keys, a relational database that may be used for any purpose, and a ‘vault’ that stores states.”

– Mike Hearn, Richard Gendal Brown – August 20, 2019

What’s in a name, a Node by any other name would smell as sweet.

A question posed recently by a colleague was “why are they called nodes?”. It is something I must admit I’ve not pondered since starting my own Corda journey, the name having been settled on long before. Looking at Wikipedia, we can see a definition of the term:

“A physical network node is an electronic device that is attached to a network, and is capable of creating, receiving, or transmitting information over a communication channel.”

Additionally, the blockchain concept of “a full node”, as present in Bitcoin, for example, matched the concept that Corda wanted to express: that each entity on “the network” was a sovereign identity that understood the rules of being on the network and could thus transact within it.

Concept vs Creation

The node is thus a physical collection of implemented concepts constrained within a single entity. That distinction between a concept and how and where it is realized is important, because it’s easy to look at things as they are and assume that is how they must be. The node is a good example of this. So, what concepts does it encapsulate?

1. Logic / Compute: What things can I do? Essentially, what applications are running on my behalf?
2. Storage: Where is my stuff? What’s in my vault? Let me find it for transacting.
3. Identity: Who am I, how do I prove it?
4. Location: Where am I, how do others contact me?
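As a rough illustration of that bundling, the four concepts can be pictured as fields of a single object. This is only a sketch: the `Node` class, its field names, and the example values are invented here for illustration and are not Corda types.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Node:
    """Illustrative only: four concerns bundled into one deployable unit."""
    apps: list = field(default_factory=list)    # 1. Logic / Compute: installed applications
    vault: dict = field(default_factory=dict)   # 2. Storage: the states this entity holds
    identity: str = ""                          # 3. Identity: who am I?
    address: str = ""                           # 4. Location: how do others contact me?


# Hypothetical example values; none of these come from a real network.
alice = Node(
    apps=["example-app"],
    vault={"state-1": "IOU"},
    identity="O=Alice, L=London, C=GB",
    address="alice.example.com:10002",
)

concerns = [f.name for f in fields(Node)]
print(concerns)
```

Because all four fields live and die with the same object, none of them can be scaled, moved, or paused independently of the others.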

This is a very maximalist view of the world, where a single thing captures the totality of what we may consider an entity.

The practical problem with this node-centric network view of Corda’s past is that scaling is linear, and thus so are costs. What do we mean? Well, if one identity is one node, then 100 identities mean 100 nodes, a thousand identities a thousand nodes, and so on.
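To make the arithmetic concrete, here is a minimal sketch comparing the two cost models. The function names and every number in it (flows per hour, capacity per worker) are assumptions chosen purely for illustration; they are not measured Corda figures.

```python
import math


def nodes_per_identity(identities: int) -> int:
    """Node-per-identity model: every identity needs its own node, busy or idle."""
    return identities


def shared_workers(identities: int, flows_per_identity_per_hour: int,
                   flows_per_worker_per_hour: int) -> int:
    """Hypothetical shared model: capacity follows total load, not head count."""
    total_load = identities * flows_per_identity_per_hour
    return max(1, math.ceil(total_load / flows_per_worker_per_hour))


# Assumed figures: 2 flows per identity per hour, 500 flows per worker per hour.
for n in (100, 1000):
    print(n, nodes_per_identity(n), shared_workers(n, 2, 500))
```

With these assumed numbers, a thousand mostly idle identities need a thousand nodes in the first model but only a handful of shared workers in the second.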

Clearly, this doesn’t scale in a way that allows for the general management of multiple identities. Indeed, the first versions of Corda, due to constraints in building the technology, chose not to focus on scalability, preferring to leave that to later iterations of the architecture, which is where Corda 5 steps in. Back then, we took a very maximalist view of decentralization, assuming all entities would run their own nodes for themselves. Under that assumption, the model of a node matching directly to a legal identity, and thus an entity, wasn’t a limitation.

Maximals (and Predicons)

This wasn’t done on a whim; Corda didn’t set out to be difficult to scale. To see the root of this design choice, we need to look at the original threat model for the distributed database concept that all DLTs can ultimately be viewed as. Corda was trying to answer the question:

“how do we build a shared database between multiple parties where some of those parties cannot be assumed to be non-malicious?”

The thinking at the time was that if there were a third party all participants could trust, then the solution would be to simply use them to run a standard database and avoid the complication and overheads incurred by introducing DLT constraints.

In hindsight, we can challenge this statement in a few ways.

Firstly, that just isn’t how the industry has evolved. There is a clear demand from solution providers to retain ownership of aspects of the system and of meta information, such as participant lists. Equally, the users of those services are not sophisticated enough to operate their own infrastructure to consume a service. All are willing to compromise certain aspects of the fully decentralized model to meet their current needs better.

Secondly, it didn’t draw the distinction that a shared database can never be safely detangled so that a subset of records can be handed to a member. Its representation of each identity’s view of their holdings is entwined within a single trust zone, and thus those views can never be considered sovereign. We can view this as taking each private view held by a Corda entity of their subset of the Application Network and collapsing those contexts into a single set subdivided through keys or schema, as shown below.

Just as a cryptographic key is only ever as secure as its least secure moment, the same is true of data. Extracting records from something everyone had access to is a very different proposition from holding records that were always segregated. That centralized database would have reduced its participants to accounts, rather than leaving them as sovereign entities that could choose to relocate and run their own infrastructure when the time came. Progressive decentralization is a clear demand from those investigating the technology.

Thirdly, it was considered that whilst there were technical solutions, such as sandboxing and encryption at rest, that allow untrusted third parties to host infrastructure, those were not mature enough at the time to embed within a solution. However, over the intervening years, work on techniques such as R3’s own sandboxing technology for Corda means we’re making strides toward delivering this vision and can thus alter the threat model accordingly.

Finally, as shown above, those maximalists with strong use cases can still deploy and use Corda 5 in the same manner as Corda 4. It is extremely important to note that just because you can use Corda 5 to scale locally for a number of identities, you do not have to. Corda 5 introduces the optionality to present prospective network members with a choice; it does not enforce one.

It’s all about that load (’bout that load)

Nodes in and of themselves really aren’t all that interesting; it’s how they’re connected by applications, and the transacting between them, that makes Corda actually useful to the wider world. Much like an operating system is somewhat pointless with no user-land applications to run, Corda’s value only emerges once those nodes are onboarded to a network (or networks) to transact.

Of course, following on from how networks have emerged, we’ve seen the need for Application Network operators to retain centralized management of infrastructure whilst delivering digital sovereignty to their users (Alice in our example below). Corda and DLT platforms in general are uniquely suited for allowing this model where future, further, decentralization can be easily achieved through simple identity migration.

The physical Corda 4 node is poorly suited to this as each identity requires management regardless of the computational load being put through it.

So what’s the Issue?

The issue with this is that we’ve all forgotten that this is just an implementation detail dating back to the first version of Corda. What matters isn’t how the concepts are packaged, but the concepts themselves and what they allow people to do. Nodes don’t matter that much, not beyond the convenient shorthand we all use the term for. What matters is onboarding identities to networks, executing transactions, and proving the value of DLT solutions.

Virtual Nodes Aren’t Virtual!

As discussed above, we’ve spent a long time with Corda talking about nodes and centering them within how we ask people to conceptualize DLT, so asking people to stop overnight would clearly be madness. The problem, of course, is that upon hearing the name virtual node it is logical to assume we’re somehow virtualizing nodes to allow for parallel operation within the same compute instance, allowing one “Corda Node” to operate on behalf of multiple identities.

We’re all familiar with virtualization technology and the abstractions it layers atop the physical instances being virtualized. It is important to take away that this isn’t what a virtual node is within Corda; the concept is far more radical and far-reaching.

What’s in a Flow

To really understand what a virtual node is, we must first dig into what a flow is.

At the highest level, it is a mechanism by which parties come to an agreement over some proposal. Such agreement often takes the form of a successful mutation of a distributed ledger, with consensus being achieved by all parties that the mutation occurred and was valid. It is a run-to-completion process that can be easily realized through the creation of a CorDapp. Under the hood, Corda parcels this up and distributes the execution to ensure the correct identity is actioning the correct portion of the proposal.

In this example, we are using a Corda 4 implementation where each identity and thus step is hosted by a Corda Node. I’m doing this to highlight how execution in the global network is subdivided across parties.

The State Machine

Digging deeper again, we can see that each node schedules various operations for each step, all orchestrated through a state machine.

It is the state machine that allows Corda to run multiple flows at once, scheduling tasks for different flows as they “wake up” and need servicing, very much like a traditional scheduler in a normal operating system.

Indeed, we can think of the flow framework as a distributed state machine designed to schedule consensus protocol steps. Flows are checkpointed at any stage where they would suspend; that checkpoint is persisted for potential recovery of the flow, and the next flow is scheduled for execution from the point at which it was previously suspended.
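The scheduling idea can be sketched in a few lines. This is a toy, single-process model, not Corda’s implementation: Python generators stand in for flows, each `yield` stands in for a suspension point, and an in-memory dictionary stands in for the persisted checkpoint store. All names (`toy_flow`, `run_state_machine`) are invented for illustration.

```python
from collections import deque


def toy_flow(name):
    """A toy flow: each `yield` marks a point where the flow would suspend."""
    yield f"{name}: proposal sent"
    yield f"{name}: signatures collected"
    return f"{name}: finalised"


def run_state_machine(flows):
    """Round-robin scheduler: resume a flow, checkpoint it when it suspends, move on."""
    ready = deque(flows.items())
    checkpoints, results, trace = {}, {}, []
    while ready:
        flow_id, flow = ready.popleft()
        try:
            suspended_at = next(flow)            # run until the next suspension point
            checkpoints[flow_id] = suspended_at  # stand-in for persisting a checkpoint
            trace.append(suspended_at)
            ready.append((flow_id, flow))        # reschedule behind the other flows
        except StopIteration as done:
            checkpoints.pop(flow_id, None)       # a finished flow needs no checkpoint
            results[flow_id] = done.value
    return results, trace


results, trace = run_state_machine({"a": toy_flow("a"), "b": toy_flow("b")})
print(trace)
print(results)
```

The trace interleaves the two flows: each one advances a single step, is checkpointed, and then yields the processor to the next, which is the behaviour the scheduler analogy describes.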

Virtual Nodes

Having gone around the houses, the question remains: “just what is a virtual node?” Our foray into understanding flows answers this.

Corda 5 virtualizes the execution of flow process steps, allowing flows for multiple identities and from multiple applications to be executed within the same compute environment at the same time. An instance of the virtual node is created for long enough to handle the execution steps needed and then allowed to dematerialize.

Thus, a Virtual Node is the combination of the context of a user and the ephemeral compute instances created to progress a transaction on that identity’s behalf.

At any point in time T, many instances of a virtual node may be executing, with the limit governed by the availability of compute units (Flow Workers). Further flows may be in flight at the same time, with their virtual node instances suspended in the checkpoint store.
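A minimal sketch of that lifecycle follows, under heavy assumptions. The `Cluster` and `VirtualNodeContext` classes, the step names, and the worker labels are all hypothetical and are not the Corda 5 API; the code is also single-threaded, so it shows the materialize, execute, checkpoint, dematerialize cycle rather than real parallelism.

```python
from dataclasses import dataclass

STEPS = ["propose", "collect_signatures", "finalise"]  # hypothetical flow steps


@dataclass(frozen=True)
class VirtualNodeContext:
    """What persists for an identity between bursts of activity (assumed shape)."""
    identity: str
    app: str


class Cluster:
    def __init__(self, contexts):
        self.contexts = {c.identity: c for c in contexts}  # stand-in for stored context
        self.checkpoints = {}     # stand-in for the checkpoint store
        self.live_instances = 0   # virtual node instances in memory right now

    def process(self, worker, identity, flow_id):
        """One flow worker runs one step, then lets the instance dematerialize."""
        context = self.contexts[identity]   # materialize the instance
        self.live_instances += 1
        step = self.checkpoints.get((identity, flow_id), 0)
        event = f"{worker} ran {STEPS[step]} for {context.identity}"
        if step + 1 < len(STEPS):
            self.checkpoints[(identity, flow_id)] = step + 1   # suspend at a checkpoint
        else:
            self.checkpoints.pop((identity, flow_id), None)    # flow completed
        self.live_instances -= 1            # dematerialize
        return event


cluster = Cluster([VirtualNodeContext("Alice", "example-app"),
                   VirtualNodeContext("Bob", "example-app")])
work = ["Alice", "Bob", "Alice", "Alice", "Bob"]
log = [cluster.process(f"worker-{i % 2}", who, "flow-1") for i, who in enumerate(work)]
print(log)
print(cluster.checkpoints)
```

Note that no worker is tied to an identity: Alice’s flow is progressed by whichever worker is free, and once her flow completes nothing of hers remains in memory, while Bob’s unfinished flow survives only as an entry in the checkpoint store.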