Session Types for Hearty Codecs

Zack Pierce – P. Distributed Systems Engineer

blog-zack.png

In software systems where components talk to each other, data serialization is an essential challenge. A system’s encoding format has to hit the right balance of requirements around performance, expressiveness, tooling quality, and ease of correct usage.

At PolySync, we’ve found that balance for our use case in autonomous vehicle systems through building a safety-specialized Rust implementation of an established format: Simple Binary Encoding. But why? And how?

PICKING SIMPLE BINARY ENCODING

In general, high performance is almost always desirable for serialization. But in the world of autonomous vehicles, steady timing performance is even more important than peak throughput. This is because safe operation is sensitive to timing outliers. Nobody wants the system that decides when to slam on the brakes to occasionally take 100 times longer than usual to encode its commands. Enter : Simple Binary Encoding.

After a series of internal bake-offs, we found that Simple Binary Encoding (SBE) performed comparably or better than other well-known formats (like Cap’n Proto and FlatBuffers) in terms of encode and decode times. In particular, SBE stood out for its low timing variability.

In the context of automotive components, a serialization format’s expressiveness has to be able to support multiple subsystem-specific use cases. Different subsystems require different shapes and sizes of data, and there’s little point in a message that can’t properly transmit its domain needs. The implication being that the message format of choice has a sufficiently flexible schema language for defining messages, or goes the schemaless route. Most of the encoding formats surveyed hit this mark without much trouble, SBE included.

Code that will be used in production vehicles needs to have a higher bar of safety built into it than most other domains of software. Serialization is no exception. Writing code that uses the format correctly should be within the grasp of human developers, and the libraries that implement a format should work hard to minimize developer-introduced error. Unintentional data corruption, accidental nondeterminism, and incomplete error detection are all critical safety risks that engineers face when working with performance-oriented binary data formats.

Unfortunately, none of the formats that performed well met our safety-motivated expectations for ease of correct usage.

RUST AND SAFETY BY DEFAULT

PolySync is a Rust-first kind of company, because of Rust’s safe performance ethos. The language aligns closely with what we believe should be the biggest focus of the autonomous vehicle industry: safety.

But if Rust has low-effort interoperability with C/C++ code, and SBE can generate encoders and decoders in C/C++, what’s the problem? The answer is the unaddressed final aspect of our usability requirement: the risk of developer error when writing code.

In particular, I’m thinking about the kind of slip-ups that happen when using a format that requires the developer manually track byte offsets, iterator indices, and order-of-data-usage. The APIs for SBE, like other performance-optimized binary formats, require some degree of developer bookkeeping, with varying degrees of subtle (or loud) faults introduced when invariants are accidentally violated at runtime.

We need a way to mitigate the impact of this class of developer problems. The index and offset bookkeeping could be abstracted away at runtime, but wouldn’t it be best if we could catch and prevent these (and the order-of-operations issues) at compile time instead?

ENTER SESSION TYPES

The central concept of session types is that a series of specialized types (that represent a particular location in the encoding/decoding state machine) are linked by methods that cause state transitions.

/// Chain-of-transitions example
struct A;
impl A {
    fn go_to_b(self) -> B {
        // ... do some work on internal data, pass results along inside a new B
        B
    }
}
struct B;
impl B {
    fn go_to_c(self) -> C {
        C
    }
}
struct C;
impl C {
    fn all_done(self) {
        println!("made it to the end")
    }
}

fn main() {
    let a = A {};
    let b = a.go_to_b();
    let c = b.go_to_c();
    c.all_done();
}

So far, this is similar to the sort of state machine one might manually write using the “structures with transitions” option from Andrew Hobden’s excellent post on Rust state machines as a guide.

We deviate from Hobden’s method by providing additional barriers against developers accidently jumping into the wrong part of a chain of types. This is done by giving the structs private, non-defaultable fields (that are generated internally) at the start of a given invocation of the session type chain.

If we take the above methods and apply them in a sufficiently powerful type system, e.g. Rust, we can encode or decode data in a series of steps that has its order of operations enforced completely at compile time. This totally prevents out-of-order data handling at runtime, stopping that kind of data corruption bug several steps before it even gets close to production. The trick in Rust is that the type signature of the transition methods cause the dispatching struct instances to be moved and consumed each time, rendering each step irreversible and impossible to repeat out of order.

/// Double-transition prevention example.
/// Uses the structs from the chain-of-transitions above.
/// Should not compile
let a = A {};
let b = a.go_to_b();
let c = b.go_to_c();
let b_again = a.go_to_b(); // "value used here after move" error

SESSION TYPES APPLIED TO CODECS

The SBE codec format has a well-described order of desired encoding or decoding operations and a given message schema that clarifies the exact data types involved at each step. This gives us all of the information we need to fully generate session type encoding and decoding chains. Using code generation from a specification reduces the initial developer effort of creating long session type chains and increases the odds that complex chains are implemented correctly. But wait, it gets even better.

Because SBE is a streaming-oriented data format, if we place our type-chain’s access patterns in a particular order, we gain a performance benefit by encouraging cache-friendly data usage.

With these facts in mind, we’ve implemented preliminary Rust session type encoder/decoder generation for SBE and pushed it to SBE’s canonical upstream open source repository.

Before moving on, I’d like to point out a few key differences between a real session type codec and the toy chain shown in the previous section.

The most obvious is that SBE codecs work in a streaming fashion. This requires additional inputs or outputs at each state transition. When decoding, for example, each decoder state transition method outputs both the chunk of data that has just been decoded and the next decoder type in the chain.

let data: &[u8] = // get raw bytes data from somewhere
let header_decoder = start_decoding_example(data); // Start the session type chain by provided the first decoder

let (header, fixed_fields_decoder) = header_decoder.header()?;
// here, header is a MessageHeader struct with block length, ids, etc already decoded
// and fixed_fields_decoder is the decoder that will read off the next section

let (fixed_size_fields, _next_decoder) = fixed_fields_decoder.example_fields()?;
// now fixed_size_fields is a struct containing all of the fixed-sized data from the message,
// and _next_decoder continues the decoder type chain, and so on

Attentive readers may have noticed the trailing ? on each of the transition method calls. This is Rust syntax sugar for passing up errors that may occur over the course of encoding/decoding. After all, the transition methods are responsible for gleaning meaning out of a pile of bytes, and detectable runtime failures have to surface for developers to handle them. Since there is a performance cost for checking mid-codec failure scenarios (such as not having enough bytes left in a buffer to encode the rest of a message) and allocating more complex Result types to represent the possibility of failure, care must be given when designing the boundary of type-states to minimize duplicate or amortizable work.

For our implementation of SBE, transition methods are aligned to known-size blocks of data so we check buffer constraints as few times as possible. Here’s an example to drive the point home with explicit transition method type signatures.

/// Simplified SBE-like decoder session type chain highlighting error handling
extern crate core;

#[derive(Debug)]
enum CodecErr {
   NotEnoughBytes,
}
pub struct ExampleHeaderDecoder<'a> {
   data: &'a [u8],
   offset: usize,
}

/// Note the lack of a Result wrapper for this initial start method, since there
/// is no reasonable failure mode.
pub fn start_decoding_example<'a>(data: &'a [u8]) -> ExampleHeaderDecoder<'a> {
   ExampleHeaderDecoder {
       data: data,
       offset: 0,
   }
}

// Note the packed representation is essential to match SBE's format
#[repr(C, packed)]
#[derive(Debug)]
pub struct MessageHeader {
   id: u32,
}

impl<'a> ExampleHeaderDecoder<'a> {
    /// There is a chance for failure when reading data from bytes, thus a Result return type here
    fn header(mut self) -> Result<(&'a MessageHeader, ExampleDoneDecoder<'a>), CodecErr> {
       // implementation actually checking and reading data omitted
   }
}

ZERO-COPY WITH LIFETIME SAFETY

SBE is a format that enables zero-copy-by-default representation of a given message. This means that we should be able to access the heart of a message’s data without needlessly duplicating the message’s bytes. Zero-copy formats can be blazingly fast, but run the risk of misinterpretation or corruption from mishandling the timing of reading and writing shared byte-buffers backing data structures.

To make this risk manageable in Rust, we need to link the lifetime of the underlying byte-buffer used for encoding a SBE message with the lifetime of the developer-usable structs used to represent decoded or intended-to-be-encoded data. The implication for our session type implementation is that each type-chain carries the lifetime signature of the underlying byte buffer and expresses that lifetime for any related struct references that get coded.

Let’s work this out by expanding on the code from the previous example with a real implementation of decoding a struct overlay.

impl<'a> ExampleHeaderDecoder<'a> {
   fn header(mut self) -> Result<(&'a MessageHeader, ExampleDoneDecoder<'a>), CodecErr> {
       // Check there is enough room in the underlying to represent a header
       let num_bytes = ::core::mem::size_of::<MessageHeader>();
       let end = self.offset + num_bytes;
       if end <= self.data.len() {
           // Do some pointer-magic to produce a lifetime-aligned struct reference
           // to the decoded data
           let s = self.data[self.offset..end].as_ptr() as *mut MessageHeader;
           let v: &'a MessageHeader = unsafe { &*s };
           self.offset = end;
           Ok((v,
               ExampleDoneDecoder {
                   data: self.data,
                   offset: self.offset,
               }))
       } else {
           Err(CodecErr::NotEnoughBytes)
       }
   }
}

pub struct ExampleDoneDecoder<'a> {
   data: &'a [u8],
   offset: usize,
}

impl<'a> ExampleDoneDecoder<'a> {
   // Expose the internals of the decoding chain to confirm progress made
   fn all_done(self) -> (&'a [u8], usize) {
       (self.data, self.offset)
   }
}

fn main() {
   explicit_lifetimes(&[7, 0, 0, 0]).unwrap();
}

fn explicit_lifetimes<'a>(data: &'a [u8]) -> Result<(), CodecErr> {
   let header_decoder: ExampleHeaderDecoder<'a> = start_decoding_example(data);
   let (header, done_decoder): (&'a MessageHeader, ExampleDoneDecoder<'a>) = header_decoder.header()?;
   assert_eq!(7, header.id);
   let (underlying_data, final_offset): (&'a [u8], usize) = done_decoder.all_done();
   assert_eq!(4, final_offset);
   Ok(())
}

This example’s punchline function highlights how easy the decoded structs are to use—simply reference fields as usual at no runtime cost—as well as the mechanical fastidiousness necessary to pipe the correct lifetime constraints all the way through.

ROAD YET TO BE TRAVELED

By implementing a zero-copy streaming-oriented serialization format, we’ve been able to achieve the high and steady performance required for the timing-sensitive embedded system aspect of our domain. Layering session types into the format’s realization has greatly reduced the likelihood of several perennial sources of error sneaking into the autonomous vehicle runtime PolySync is creating. Since safety is a heavy part of our use case, and error reduction improves safety, we’re glad to have had Rust on our side in finding the right balance for a data serialization approach.


Share Zack's post: