Demystifying Corda Serialisation Format

January 19, 2021

I am a software engineer in the research team of Richard G. Brown, the CTO of R3. A few months ago I started looking into the Corda Serialisation Format in the course of a research project. For those familiar with Corda, my goal was to write a simple mobile application in Swift that can deserialise the transaction summary component group of a filtered transaction on an iPhone. The interested reader can find information about the transaction summaries in the Corda Technical White Paper.

This was an interesting and challenging project because such an application cannot use the Corda serialisation engine, which is written in Kotlin. Please keep in mind that this was experimental work!

While trying to understand the details of the Corda Serialisation Format, I realised that there was little documentation available. I would therefore like to share what I learned, in the hope that it will be useful to anyone who wishes to deepen their knowledge of the topic.

Corda AMQP Serialisation Format

Serialisation is the process of converting an object to an array of bytes so that it can be stored in a database, sent to other peers over the network, and so on. The inverse process, deserialisation, reconstructs the original object from the byte array.

To do this, Corda uses an extended form of the Advanced Message Queuing Protocol (AMQP) 1.0 which allows blobs and messages to be self-described. Specifically, Corda relies on the qpid proton library, which provides an implementation of AMQP for Java, among other languages.

Corda takes this approach in order to provide a way to serialise/deserialise Java objects that is seamless — Corda does not rely on IDLs or predefined schemas — and safe — Corda attempts to eliminate all the tricks attackers can use to exploit serialisation frameworks that were not designed for adversarial environments.

Therefore, for security purposes, Corda requires all types that may appear in serialised streams to be marked as safe for deserialisation. It is the developer’s responsibility to annotate their data types with the CordaSerializable annotation. Conveniently, the developer does not need to provide schemas for those types, as the Corda framework generates them internally.

In Corda a serialised byte array, referred to as blob hereafter, consists of the Payload — which includes the actual data, the Schema — which includes descriptions of any non-primitive data types included in the Payload, and the Transformation Schema — which describes the evolution of the types, if applicable. Therefore, the classes that are not already in the class path can be reconstructed directly from the blob. This approach heavily relies on reflection which is available in a JVM environment.

In this article, I describe the main features of the Corda Serialisation Format by showing (1) the serialisation of a String, which maps directly to a primitive AMQP type, and (2) the serialisation of an object of a simple user-defined class. Let’s get started!

Serialising a String Object

How does Corda convert the following String object to a byte array?

val str: String = "Approve NEW state with trade id 1234 from party O=Alice Corp, L=Madrid, C=ES to counterparty O=Bob Plc, L=Rome, C=IT"

The serialised object is shown below. The initial String is 116 bytes long and the serialised byte array is 166 bytes long.

636F7264610100000080C562000000000001C09203A174417070726F7665204E4557207374617465207769746820747261646520696420313233342066726F6D207061727479204F3D416C69636520436F72702C204C3D4D61647269642C20433D455320746F20636F756E7465727061727479204F3D426F6220506C632C204C3D526F6D652C20433D49540080C562000000000002C00201450080C562000000000009C10100

The byte array starts with an eight-byte header (636F726461010000) followed by data encoded using the qpid proton library. The conversion of an object to an array of bytes happens in two phases. First, the object is mapped to a proton graph, which is then turned into the actual byte array. The entry point of the graph is the Envelope, which consists of the three components introduced above: the Payload, the Schema, and the Transformation Schema.

The Node Types of a Proton Graph

The proton graph is acyclic. Each node can have any of the following pointers: a parent pointer (level above), a previous pointer and a next pointer (same level). In the majority of cases, the nodes of the graph belong to one of the following four types:

  • Primitive: String, Float, Unsigned Long etc
  • Container: List, Map
  • Described: explained shortly
  • Symbol: describes the Fingerprint (explained a bit later)

How does Corda use these node types to serialise an object? Well, there is a main construct that is heavily used for this purpose. This construct is shown in the Figure below.

Figure: Described-Descriptor-Description construct

Looking at the left-hand side of the figure, the first node of the construct is a Described Type node which has two children: an Object Descriptor node and an Object Description node. The Object Descriptor node is an Unsigned Long which corresponds to one of the AMQP Descriptors in the AMQP Descriptor Registry shown below. The top 16 bits of this 64-bit value hold the AMQP enterprise number of R3 (0xC562), so the top 32 bits read 0xC5620000. This number is unique in the AMQP ecosystem and was registered by Mike Hearn. The lower 32 bits hold one of the AMQP descriptor IDs for the Corda Serialisation custom types, which are listed below.

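import org.apache.qpid.proton.amqp.UnsignedLong

// R3's enterprise number (0xC562) shifted into the top 16 bits of the 64-bit descriptor.
const val DESCRIPTOR_TOP_32BITS: Long = 0xc562L shl (32 + 16)

enum class AMQPDescriptorRegistry(val id: Long) {
    ENVELOPE(1),
    SCHEMA(2),
    OBJECT_DESCRIPTOR(3),
    FIELD(4),
    COMPOSITE_TYPE(5),
    RESTRICTED_TYPE(6),
    CHOICE(7),
    REFERENCED_OBJECT(8),
    TRANSFORM_SCHEMA(9),
    TRANSFORM_ELEMENT(10),
    TRANSFORM_ELEMENT_KEY(11)
    ;

    // Full AMQP descriptor: the R3 prefix combined with the Corda type id.
    val amqpDescriptor = UnsignedLong(id or DESCRIPTOR_TOP_32BITS)
}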

To make these concepts clearer, the right-hand side of the figure shows how Corda uses this construct in the case of the Envelope. The Described Type node is encoded as a single byte (0x00) and is followed by an Unsigned Long Element node (0xC562000000000001, preceded on the wire by the format code 0x80) which corresponds to the AMQP descriptor of the Envelope. As we mentioned previously, the Envelope consists of three components. Therefore, the Envelope’s Description node is a List with three elements.
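The arithmetic behind the descriptor, and the ten bytes it produces on the wire, can be reproduced without Corda or qpid proton on the classpath. The sketch below is only an illustration: `R3_PREFIX`, `amqpDescriptor`, `describedPrefix` and `toHex` are hypothetical names, and a plain Long stands in for proton's UnsignedLong.

```kotlin
import java.nio.ByteBuffer

// Same value as DESCRIPTOR_TOP_32BITS in the registry: 0xC562 in the top 16 bits.
const val R3_PREFIX: Long = 0xc562L shl (32 + 16)

// Hypothetical helper: combines the R3 prefix with a registry id (1 = ENVELOPE).
fun amqpDescriptor(id: Long): Long = id or R3_PREFIX

// Hypothetical helper: the described-type marker (0x00), the AMQP ulong
// format code (0x80), then the descriptor as 8 big-endian bytes.
fun describedPrefix(id: Long): ByteArray =
    ByteBuffer.allocate(10)
        .put(0x00.toByte())
        .put(0x80.toByte())
        .putLong(amqpDescriptor(id))
        .array()

fun toHex(bytes: ByteArray): String = bytes.joinToString("") { "%02X".format(it) }
```

For the Envelope (id 1), `toHex(describedPrefix(1))` yields 0080C562000000000001, which matches the ten bytes that follow the eight-byte header in the serialised String shown earlier.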

Building the Proton Graph

The code that builds the proton graph can be found in the SerializationOutput.kt file which is available for inspection in Corda OS and Corda Enterprise alike.

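// Phase 1: build the proton graph, starting with the Envelope described type.
val data = Data.Factory.create()
data.withDescribed(Envelope.DESCRIPTOR_OBJECT) {
    withList {
        // 1. Payload
        writeObject(obj, this, context)
        // 2. Schema, built from schemaHistory
        val schema = Schema(schemaHistory.toList())
        writeSchema(schema, this)
        // 3. Transform Schema
        writeTransformSchema(TransformsSchema.build(schema, serializerFactory), this)
    }
}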

As we discussed in the previous section, the graph starts with the Envelope, which consists of three parts: the Payload (writeObject), the Schema (writeSchema) and the Transformation Schema (writeTransformSchema). The qpid proton Data object which stores the proton graph is shown below:

Data[current=77f05eed, parent=0]
{
  DescribedTypeElement[77f05eed]{parent=0, prev=0, next=0}
  UnsignedLongElement[6b6a7ccb]{parent=77f05eed, prev=0, next=37a1f9c8}
  ListElement[37a1f9c8]{parent=77f05eed, prev=6b6a7ccb, next=0}
  StringElement[1dedb4f8]{parent=37a1f9c8, prev=0, next=689f869e}
  DescribedTypeElement[689f869e]{parent=37a1f9c8, prev=1dedb4f8, next=502b2362}
  UnsignedLongElement[68fcff16]{parent=689f869e, prev=0, next=18c6dc85}
  ListElement[18c6dc85]{parent=689f869e, prev=68fcff16, next=0}
  ListElement[cf91275]{parent=18c6dc85, prev=0, next=0}
  DescribedTypeElement[502b2362]{parent=37a1f9c8, prev=689f869e, next=0}
  UnsignedLongElement[5c010bb]{parent=502b2362, prev=0, next=acb5b5a}
  MapElement[acb5b5a]{parent=502b2362, prev=5c010bb, next=0}
}

We also show an illustration of the proton graph below. The three boxes with the dotted lines highlight instances of the three-node construct (Described-Descriptor-Description) that we discussed in the previous section. The Schema and the Transform Schema are empty because the serialised object is a String, which maps to a primitive AMQP type and needs no type description.

Figure: The Data graph
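To make the shape of this graph concrete, here is a minimal sketch that models the node kinds as a Kotlin sealed class and rebuilds the graph of the serialised String. The class names (`Node`, `Primitive`, `ListNode`, `MapNode`, `Described`) are invented for this illustration and are not qpid proton types; Symbol nodes are folded into `Primitive` for brevity, and the Payload text is shortened.

```kotlin
// Hypothetical model of the proton graph; not the qpid proton API.
sealed class Node
data class Primitive(val value: Any?) : Node()
data class ListNode(val items: List<Node>) : Node()
data class MapNode(val entries: Map<Node, Node>) : Node()
// The three-node construct: a Described node with a descriptor and a description child.
data class Described(val descriptor: Node, val description: Node) : Node()

// Same arithmetic as the descriptor registry: R3 prefix combined with the type id.
fun descriptor(id: Long): Node = Primitive((0xc562L shl 48) or id)

// Counts every node in the graph, the way the Data dump lists them.
fun countNodes(node: Node): Int = when (node) {
    is Primitive -> 1
    is ListNode -> 1 + node.items.map { countNodes(it) }.sum()
    is MapNode -> 1 + node.entries.map { (k, v) -> countNodes(k) + countNodes(v) }.sum()
    is Described -> 1 + countNodes(node.descriptor) + countNodes(node.description)
}

// The Envelope of the serialised String: Payload, empty Schema, empty Transform Schema.
val stringEnvelope: Node = Described(
    descriptor(1),                                                          // ENVELOPE
    ListNode(listOf(
        Primitive("Approve NEW state ..."),                                 // Payload (text shortened)
        Described(descriptor(2), ListNode(listOf(ListNode(emptyList())))),  // SCHEMA
        Described(descriptor(9), MapNode(emptyMap()))                       // TRANSFORM_SCHEMA
    ))
)
```

`countNodes(stringEnvelope)` returns 11, the number of elements in the Data dump above.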

Converting the Proton Graph to a Byte Array

Once the graph is in place, it is turned into a byte array during the second phase by the code shown below.

return SerializedBytes(byteArrayOutput {
    var stream: OutputStream = it
    try {
        // 1. write the magic bytes ("corda", 1, 0) to the ByteArray
        amqpMagic.writeTo(stream)
        val encoding = context.encoding
        if (encoding != null) {
            // optional encoding section; the stream is wrapped from here on
            SectionId.ENCODING.writeTo(stream)
            (encoding as CordaSerializationEncoding).writeTo(stream)
            stream = encoding.wrap(stream)
        }
        // the section id byte (0 for DATA_AND_STOP); with no encoding
        // this completes the CORDA100 preamble
        SectionId.DATA_AND_STOP.writeTo(stream)
        // 2. walk the acyclic graph and turn it into a ByteArray
        //    by invoking data::encode
        stream.alsoAsByteBuffer(data.encodedSize().toInt(), data::encode)
    } finally {
        stream.close()
    }
})

Each ByteArray starts with a preamble of 8 bytes which is CORDA10N: the ASCII bytes of "corda", then the bytes 1 and 0, then N, where N is picked from the enum below. In our case that value is DATA_AND_STOP, which corresponds to 0.

enum class SectionId : OrdinalWriter {
    DATA_AND_STOP,
    ALT_DATA_AND_STOP,
    ENCODING;
}
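A reader on the receiving side can check the preamble before handing the rest of the blob to an AMQP decoder. The following is a minimal sketch of such a check, not Corda's own code: `Section`, `Preamble` and `parsePreamble` are hypothetical names, and reading the two bytes after "corda" as a major and minor version is an assumption based on the CORDA10N notation.

```kotlin
// Hypothetical mirror of the SectionId enum above; the ordinal is the byte on the wire.
enum class Section { DATA_AND_STOP, ALT_DATA_AND_STOP, ENCODING }

// Assumption: the bytes 1 and 0 after "corda" are read as major and minor version.
data class Preamble(val major: Int, val minor: Int, val section: Section)

fun hexToBytes(hex: String): ByteArray =
    hex.chunked(2).map { it.toInt(16).toByte() }.toByteArray()

fun parsePreamble(blob: ByteArray): Preamble {
    require(blob.size >= 8) { "blob is shorter than the 8-byte preamble" }
    val magic = String(blob, 0, 5, Charsets.US_ASCII)
    require(magic == "corda") { "unexpected magic: $magic" }
    val section = Section.values().getOrNull(blob[7].toInt())
        ?: error("unknown section id: ${blob[7]}")
    return Preamble(blob[5].toInt(), blob[6].toInt(), section)
}
```

For the header used in this article, `parsePreamble(hexToBytes("636F726461010000"))` gives a `Preamble` with major 1, minor 0 and section DATA_AND_STOP.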

The most anticipated part is the actual ByteArray shown below! The ByteArray is organised into an Envelope which includes the Payload (String), the Schema and Transform Schema. We show the individual bytes along with their semantics below. The number in the black circle denotes the associated node in the proton graph.

Figure: Envelope (left) and Payload (right)

We have already discussed the Envelope, so let’s take a closer look at the Payload now. The length of the String is 0x74 (116 in decimal), which is encoded in byte[22], counting from zero (also highlighted in turquoise in the figure below). According to the AMQP specification this is a Small String (str8-utf8) because its length is less than 256 bytes, a number that can be encoded in one byte. The AMQP format code for Small Strings is 0xA1 and is stored in byte[21], just before the length (highlighted in yellow in the figure below).

For longer strings, AMQP uses the str32-utf8 format (code 0xB1), whose length field is four bytes wide and can therefore hold values larger than 255. Since byte[22] holds the value 116 (0x74) in our example, the subsequent 116 bytes, from byte[23] up to byte[138], contain the serialised String. Because the encoded object is a String made of ASCII characters, one can take these 116 bytes and convert them to their ASCII representation to get the original String (shown in purple below):

Figure: From bytes to ASCII (human readable)
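The same steps can be written down in a few lines of Kotlin. This is a sketch under stated assumptions rather than a general AMQP decoder: `readAmqpString` is a hypothetical helper, the blob is rebuilt from the hex prefix and suffix of the byte array shown earlier with the String's ASCII bytes in between, and only the two string format codes (0xA1 for str8-utf8, 0xB1 for str32-utf8) are handled.

```kotlin
import java.nio.ByteBuffer

fun hexToBytes(hex: String): ByteArray =
    hex.chunked(2).map { it.toInt(16).toByte() }.toByteArray()

val text = "Approve NEW state with trade id 1234 from party O=Alice Corp, " +
    "L=Madrid, C=ES to counterparty O=Bob Plc, L=Rome, C=IT"

// Rebuilds the serialised String: preamble, Envelope descriptor, list header and
// string header (0xA1, 0x74), then the text, then the empty Schema and Transform Schema.
val blob: ByteArray =
    hexToBytes("636F726461010000" + "0080C562000000000001" + "C09203" + "A174") +
        text.toByteArray(Charsets.US_ASCII) +
        hexToBytes("0080C562000000000002C0020145" + "0080C562000000000009C10100")

// Hypothetical helper: reads the AMQP string whose format code sits at `offset`.
fun readAmqpString(bytes: ByteArray, offset: Int): String =
    when (val code = bytes[offset].toInt() and 0xFF) {
        0xA1 -> {                                   // str8-utf8: one length byte
            val length = bytes[offset + 1].toInt() and 0xFF
            String(bytes, offset + 2, length, Charsets.UTF_8)
        }
        0xB1 -> {                                   // str32-utf8: four-byte big-endian length
            val length = ByteBuffer.wrap(bytes, offset + 1, 4).getInt()
            String(bytes, offset + 5, length, Charsets.UTF_8)
        }
        else -> error("not a string format code: 0x%02X".format(code))
    }
```

With these definitions `blob.size` is 166 and `readAmqpString(blob, 21)` returns the original String.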

Before concluding this section, let’s take a look at the byte arrays for the Schema and the Transform Schema, which rely on the Described-Descriptor-Description construct. As the Schema and the Transform Schema are both empty, the associated List Element and Map Element are empty as well.

Figure: Schema (left) and Transform Schema (right)
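The bytes of these two components are simple enough to decode by hand. The sketch below is a deliberately small recursive decoder that understands only the AMQP format codes that appear in this article. It is not the qpid proton API: `Described`, `decode`, `readText`, `readItems` and `unsignedByte` are hypothetical names, symbols are returned as plain Strings, and unsigned longs are kept in a signed Long.

```kotlin
import java.nio.ByteBuffer

// Hypothetical result type for a described value; not a qpid proton class.
data class Described(val descriptor: Any?, val description: Any?)

fun hexToBytes(hex: String): ByteArray =
    hex.chunked(2).map { it.toInt(16).toByte() }.toByteArray()

fun unsignedByte(buf: ByteBuffer): Int = buf.get().toInt() and 0xFF

fun readText(buf: ByteBuffer, length: Int): String {
    val bytes = ByteArray(length)
    buf.get(bytes)
    return String(bytes, Charsets.UTF_8)
}

// Reads `count` consecutive values.
fun readItems(buf: ByteBuffer, count: Int): List<Any?> = List(count) { decode(buf) }

// Decodes the value at the buffer's current position.
fun decode(buf: ByteBuffer): Any? = when (val code = unsignedByte(buf)) {
    0x00 -> Described(decode(buf), decode(buf))       // descriptor, then description
    0x40 -> null
    0x41 -> true
    0x42 -> false
    0x45 -> emptyList<Any?>()                         // list0: empty list
    0x71 -> buf.getInt()                              // int: 4 bytes, big-endian
    0x80 -> buf.getLong()                             // ulong: 8 bytes, kept as a signed Long
    0xA1, 0xA3 -> readText(buf, unsignedByte(buf))    // str8-utf8 / sym8
    0xC0 -> {                                         // list8: size byte, count byte, items
        unsignedByte(buf)
        readItems(buf, unsignedByte(buf))
    }
    0xD0 -> {                                         // list32: 4-byte size, 4-byte count, items
        buf.getInt()
        readItems(buf, buf.getInt())
    }
    0xC1 -> {                                         // map8: size, count (keys plus values)
        unsignedByte(buf)
        readItems(buf, unsignedByte(buf)).chunked(2).associate { it[0] to it[1] }
    }
    else -> error("format code 0x%02X is not handled in this sketch".format(code))
}
```

Decoding the Schema bytes 0080C562000000000002C0020145 with `decode(ByteBuffer.wrap(...))` yields a `Described` value whose descriptor is 0xC562000000000002 and whose description is a list holding one empty list; the Transform Schema bytes 0080C562000000000009C10100 yield a `Described` value holding an empty map.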

Serialising a Class Object

What does a serialised class object look like at the byte level? In this section, we attempt to answer this question by describing the serialised Example object shown below:

val name: String = "Approve NEW state with trade id 1234 from party O=Alice Corp, L=Madrid, C=ES to counterparty O=Bob Plc, L=Rome, C=IT"
val example = Example(name, 1234)

As Example is a custom data type, the payload includes additional information as compared to the String object, and the Schema includes the description of the Example object. The Transform Schema is empty as before. The resulting proton graph has 47 nodes. We split the graph into three sections and we describe them in the next sections.

The First Level of the Object Graph

The entry point of the graph is the Envelope, which consists of three components (the Payload, the Schema and the Transform Schema), just like the graph presented in the previous section.

Let’s now discuss what is new. The Payload includes a Symbol node which holds the Fingerprint, a unique identifier of the user-defined class Example, which corresponds to a unique class in the class path. The Symbol node is followed by a List which holds the actual values of the class fields. We will get back to the Fingerprint in the next section, where we discuss the Schema of the custom class in detail. The Schema is described by the children of node 12.

Figure: The object graph

The Schema

Let’s see the AMQP representation of the user-defined class Example in XML below. The class is described as a Composite Type. The class definition of CompositeType can be found in the Schema.kt file in the Corda codebase.

<type class="composite" name="net.corda.core.contracts.Example">
    <descriptor name="net.corda:smpl3xz1BbvlO0YdaAJXjA=="/>
    <field name="age" type="int" mandatory="true" multiple="false" default="0"/>
    <field name="name" type="string" mandatory="true" multiple="false"/>
</type>

Composite Types have five properties, shown in the figure below: the name, which is “net.corda.core.contracts.Example” in this case; the label and the providers, which are null and empty, respectively, for the Example class; the Object Descriptor, which includes the Fingerprint; and a list of Fields, where each Field corresponds to a property of the Example class.

The proton graph has one node for each of the properties of the Composite Type (nodes 16, 17, 18, 19 and 24). The Composite type itself is described with the “Described-Descriptor-Description” construct (nodes 13, 14 and 15). The Object Descriptor includes the Symbol node with the Fingerprint of the class. We describe the Fields, which are children of node 24, in the next section.

Figure: The subgraph for the Composite Type

The Fields

The class definition of the Field can be found in the Schema.kt file in the Corda codebase. Fields have seven properties: the name (“age”), the type (“int”), requires (an empty list), default (“0”), label (null), mandatory (true) and multiple (false). The values in parentheses correspond to the “age” property of the Example class.

The proton graph has one node for each of the properties of the Field class (nodes 28, 29, 30, 31, 32, 33 and 34). The Field itself is described with the “Described-Descriptor-Description” construct. The other property of the Example class, which is “name”, has a similar layout (not shown here).

Figure: The subgraph for the Field type
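To connect the seven properties with the XML shown in the Schema section, here is a small sketch that turns a decoded seven-element list back into a field description. The `Field` class below is a hypothetical mirror written for this illustration, not Corda's own class from Schema.kt, and `fieldFromItems` and `toXml` are invented helper names; the property order is the one given above.

```kotlin
// Hypothetical mirror of the seven Field properties; not Corda's own Field class.
data class Field(
    val name: String,
    val type: String,
    val requires: List<String>,
    val default: String?,
    val label: String?,
    val mandatory: Boolean,
    val multiple: Boolean
)

// Builds a Field from the seven decoded list items, in the order they appear in the blob.
@Suppress("UNCHECKED_CAST")
fun fieldFromItems(items: List<Any?>): Field {
    require(items.size == 7) { "a Field description has seven elements" }
    return Field(
        name = items[0] as String,
        type = items[1] as String,
        requires = items[2] as List<String>,
        default = items[3] as String?,
        label = items[4] as String?,
        mandatory = items[5] as Boolean,
        multiple = items[6] as Boolean
    )
}

// Renders the Field in the XML form used in the Schema section.
fun Field.toXml(): String {
    val defaultAttribute = if (default != null) " default=\"$default\"" else ""
    return "<field name=\"$name\" type=\"$type\" mandatory=\"$mandatory\" " +
        "multiple=\"$multiple\"$defaultAttribute/>"
}

// The values of the "age" and "name" fields of the Example class.
val ageField = fieldFromItems(listOf("age", "int", emptyList<String>(), "0", null, true, false))
val nameField = fieldFromItems(listOf("name", "string", emptyList<String>(), null, null, true, false))
```

Calling `ageField.toXml()` and `nameField.toXml()` reproduces the two field lines of the XML representation, including the default attribute that only the "age" field carries.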

The Complete ByteArray

In this final section we show the ByteArray of the serialised Example Object and the associated nodes of the proton graph.

The serialised Example object:

[0. CORDA100] 636F726461010000
[1. Described Type Element] 00
[2. Unsigned Long Element] 80C562000000000001 (Envelope)
[3. List Element] D0 (list32) 00000168 (size) 00000003 (three elements)

[4. Described Type Element] 00
[5. Symbol Element] A3 0x22 (length) 6E 65 74 2E 63 6F 72 64 61 3A 73 6D 70 6C 33 78 7A 31 42 62 76 6C 4F 30 59 64 61 41 4A 58 6A 41 3D 3D --> "net.corda:smpl3xz1BbvlO0YdaAJXjA=="
[6. List Element] C0 7C 02 (two elements)
[7. Integer Element] 71 000004D2 (0x4D2 equals 1234 in decimal)
[8. String Element] A1 74 (string length)
417070726F7665204E4557207374617465207769746820747261646520696420313233342066726F6D207061727479204F3D416C69636520436F72702C204C3D4D61647269642C20433D455320746F20636F756E7465727061727479204F3D426F6220506C632C204C3D526F6D652C20433D4954 --> "Approve NEW state with trade id 1234 from party O=Alice Corp, L=Madrid, C=ES to counterparty O=Bob Plc, L=Rome, C=IT" (the value of the name field)
[9. Described Type Element] 00
[10. Unsigned Long Element] 80C562000000000002 (Schema)
[11. List Element] C0 A8 01
[12. List Element] C0 A5 01
[13. Described Type Element] 00
[14. Unsigned Long Element] 80C562000000000005 (Composite Type)
[15. List Element] C0 98 05
[16. String Element] A1 0x20 (string length)
6E65742E636F7264612E636F72652E636F6E7472616374732E4578616D706C65 --> "net.corda.core.contracts.Example"
[17. Null Element] 40
[18. List Element] 45 (empty list)

[19. Described Type Element] 00
[20. Unsigned Long Element] 80C562000000000003 (Object Descriptor)
[21. List Element] C0 26 02
[22. Symbol Element] A3 0x22 (length)
6E65742E636F7264613A736D706C33787A314262766C4F30596461414A586A413D3D --> "net.corda:smpl3xz1BbvlO0YdaAJXjA=="
[23. Null Element] 40
[24. List Element] C0 3F 02 (two elements)
[25. Described Type Element] 00
[26. Unsigned Long Element] 80C562000000000004 (Field)
[27. List Element] C0 12 07 (seven elements)

[28. String Element] A1 (small string) 03 (3 byte-long) 616765 (age in ascii)
[29. String Element] A1 03 696E74 (int in ascii)
[30. List Element] 45 (0x45 stands for empty list)
[31. String Element] A1 01 30
[32. Null Element] 40
[33. Boolean Element] 41 (true)
[34. Boolean Element] 42 (false)

[35. Described Type Element] 00
[36. Unsigned Long Element] 80C562000000000004 (Field)
[37. List Element] C0 14 07 (seven elements)
[38. String Element] A1 04 6E616D65 (name in ascii)
[39. String Element] A1 06 737472696E67 (string in ascii)
[40. List Element] 45 (means empty list)
[41. Null Element] 40
[42. Null Element] 40
[43. Boolean Element] 41
[44. Boolean Element] 42

[45. Described Type Element] 00
[46. Unsigned Long Element] 80C562000000000009 (Transform Schema)
[47. Map Element] C1 01 00 (number of elements is zero)
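The size bytes in this listing can be cross-checked by hand, which is a useful sanity test when writing a decoder. In AMQP the size field of a list counts everything that follows it, that is, the count field plus the encoded items. The sketch below redoes that arithmetic; the helper and variable names are hypothetical and the byte counts are read off the listing above.

```kotlin
// Encoded sizes in bytes of the building blocks used in the listing.
fun str8(length: Int): Int = 2 + length   // format code + length byte + text (same for sym8)
const val DESCRIBED_ULONG = 1 + 9         // 0x00 marker + 0x80 + 8-byte descriptor
const val ONE = 1                         // null (0x40), true (0x41), false (0x42), list0 (0x45)

// Value of the size byte of a list8: the count byte plus all items.
fun list8Size(vararg items: Int): Int = 1 + items.sum()

// A described list8 in full: descriptor, format code, size byte, then `size` bytes.
fun describedList8(size: Int): Int = DESCRIBED_ULONG + 2 + size

val ageFieldSize = list8Size(str8(3), str8(3), ONE, str8(1), ONE, ONE, ONE)   // node 27
val nameFieldSize = list8Size(str8(4), str8(6), ONE, ONE, ONE, ONE, ONE)      // node 37
val fieldsSize = list8Size(describedList8(ageFieldSize), describedList8(nameFieldSize)) // node 24
val objectDescriptorSize = list8Size(str8(34), ONE)                           // node 21
val compositeTypeSize = list8Size(
    str8(32), ONE, ONE, describedList8(objectDescriptorSize), 2 + fieldsSize) // node 15
val schemaInnerSize = list8Size(describedList8(compositeTypeSize))            // node 12
val schemaOuterSize = list8Size(2 + schemaInnerSize)                          // node 11
val payloadValuesSize = list8Size(5, str8(116))                               // node 6 (int: 0x71 + 4 bytes)
val payloadTotal = 1 + str8(34) + 2 + payloadValuesSize   // 0x00 marker, fingerprint symbol, list8
val transformSchemaTotal = DESCRIBED_ULONG + 3            // map8: C1 01 00
// list32 (node 3): the size covers a four-byte count plus the three components.
val envelopeSize = 4 + payloadTotal + describedList8(schemaOuterSize) + transformSchemaTotal
```

The results are 0x12 and 0x14 for the two Field lists, 0x3F for the list of Fields, 0x26 for the Object Descriptor, 0x98 for the Composite Type, 0xA5 and 0xA8 for the two Schema lists, 0x7C for the Payload values and 0x168 for the Envelope, all matching the listing.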

— Sotiria Fytraki is a Software engineer at R3, an enterprise blockchain software firm working with a global ecosystem of more than 350 participants across multiple industries from both the private and public sectors to develop on Corda, its open-source blockchain platform, and Corda Enterprise, a commercial version of Corda for enterprise usage.


Demystifying Corda Serialisation Format was originally published in Corda on Medium.