The Builder Pattern for Complex Records

This is my first post in Part II of my series on Rust for data-intensive research applications. Part I dealt with individual correctness: getting a record out of input data by validating each field at the point where the raw bytes become typed data. In other words, validating our data at the boundary between the outside world and our system. In the first blog we concentrated on deserialisation as a key validation boundary where we check that the data is the type we expect: so a date is a date, a number is a number, and so on. In the second blog, we looked at creating newtypes to store our domain knowledge about our data and encoding rules about what makes a valid value and putting those rules in the newtype constructor.

Part II is going to be about compositional correctness. This is different from individual correctness. Individual correctness guarantees that each value is valid in isolation. A date must be in a valid format or a number must be in a particular range. Compositional correctness asks: can the data be assembled into something that makes sense? A field can be perfectly well-formed and still be wrong. For example, a date field present when it should be absent, or two columns that contradict each other. The record is wrong even though every part of it is right according to the individual correctness rules. What I found is that the builder pattern is a particularly good way of managing this type of compositional complexity, and Rust, with its strong type system and ownership model, is a great language to implement it in. In this post I will show you how to use the builder pattern to construct complex records while making contradictory ones impossible to build in the first place.

In this post I will use a single dataset from health data research: hospital episode data, where each row represents one patient spending some time in hospital. In epidemiology, we would call this an “episode of care”. In our scenario, some patients have been discharged and some are still admitted, and that distinction governs which fields a row should contain. A discharged episode has a discharge date and a discharge method; an ongoing one has neither. To keep the example concrete I will end by writing the constructed records out to a CSV, but that step is just a stand-in for whatever comes next; in a real pipeline you would more likely stream each record onward to the following stage rather than save it.

A note here. I am using health data as an example because that is what I am an expert in, but compositional correctness and the builder pattern apply to any domain with complex records and conditional or optional fields. The builder pattern is a general design pattern for constructing complex objects in a flexible and readable way, and this comes up constantly in data pipelines: the same approach applies just as well to customer data in sales, or in fintech, or anywhere a record only makes sense when its fields agree with one another. If you are coming from a different field that uses data, this post will still be useful to you. I find that working through a specific example is the best way to get my thoughts out on paper (so to speak!) and I hope you will be able to use the example of complexity here and apply it to your own domain.

The Builder Pattern

Let us first start with what the builder pattern is. The builder pattern is a creational design pattern that allows for step by step construction of a complex object. It is particularly helpful when you have an object that has many optional fields or when the order of construction matters.

Continuing with the example described above, hospital episode data has many potential fields, some of which are optional and some of which are conditional on the values, or at least the presence, of other fields. Data can be messy and we are trying to find only valid episodes of care. We actually have two different types of hospital episode: ongoing and discharged.

Ongoing vs Discharged

An ongoing episode has a patient ID, an admission date, and an admission method, and it might have a primary diagnosis. A discharged episode has all of those fields, plus a discharge date and a discharge method, and for a discharged episode the primary diagnosis is required rather than optional. Either episode might have a list of secondary diagnoses. The aim here is to use the builder pattern to construct these two different types of episodes in a way that ensures we cannot create an invalid episode, such as a discharged episode without a discharge date, or an ongoing episode with a discharge method.

If we were to try to construct a hospital episode without the builder pattern, we might end up with a struct that looks like this:

struct HospitalEpisode {
    patient_id: String,
    admission_date: NaiveDate,
    admission_method: AdmissionMethod,
    discharge_date: Option<NaiveDate>,
    discharge_method: Option<DischargeMethod>,
    primary_diagnosis: Option<IcdCode>,
    secondary_diagnoses: Vec<IcdCode>,
}

As you can see, we have a lot of optional fields here, and it is completely possible to create an instance of this struct that does not make sense:

let episode = HospitalEpisode {
    patient_id: "12345".to_string(),
    admission_date: NaiveDate::from_ymd_opt(2023, 1, 1).unwrap(),
    admission_method: AdmissionMethod::new("Emergency"),
    discharge_date: Some(NaiveDate::from_ymd_opt(2023, 1, 10).unwrap()),
    discharge_method: None,
    primary_diagnosis: None,
    secondary_diagnoses: vec![IcdCode::new("A00"), IcdCode::new("B00")],
};

(I am using from_ymd_opt(...).unwrap() here. In real pipeline code the date would arrive already parsed and validated from the deserialisation step in Part I, so the unwrap is just to keep these illustrative examples short. Likewise, IcdCode::new and the other newtype constructors are the validated, fallible ones from Part I; their error handling is omitted here for brevity.)

Here we have a discharged episode with a discharge date but no discharge method, which is invalid. We could create an outcome enum to represent the two different types of episodes in order to try to prevent this:

enum EpisodeOutcome {
    Ongoing,
    Discharged {
        discharge_date: NaiveDate,
        discharge_method: DischargeMethod,
        primary_diagnosis: IcdCode,
    },
}

struct HospitalEpisode {
    patient_id: String,
    admission_date: NaiveDate,
    admission_method: AdmissionMethod,
    secondary_diagnoses: Vec<IcdCode>,
    outcome: EpisodeOutcome,
}

Here the outcome field carries the two mutually exclusive states, and we can only create a discharged episode if we provide all three of the discharge date, the discharge method and the primary diagnosis. This is better but has its own awkwardness. The construction is now less flexible: every caller has to assemble the EpisodeOutcome variant by hand before they can build the episode, and any conditional logic about whether the discharge fields are present has to happen out in the caller’s code, before the type even exists. Primary diagnosis is now only a field on the discharged variant, but it could be present for ongoing episodes as well, so we potentially lose information if we only allow it on EpisodeOutcome::Discharged. We could add it to the ongoing variant, but then we have to duplicate the field in both variants, and we still have no place to put the logic that checks that the discharge fields are present or absent as appropriate.
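To make that awkwardness concrete, here is a rough sketch of the decision a caller would have to write before it could construct an episode with the enum design. The function name outcome_from_parts and the String stand-ins for NaiveDate and the Part I newtypes are my own assumptions, chosen so the sketch compiles with only the standard library.

```rust
// Stand-in types so the sketch needs only the standard library.
// In the post these are chrono's NaiveDate and the Part I newtypes.
type Date = String;
type Code = String;

#[derive(Debug, PartialEq)]
enum EpisodeOutcome {
    Ongoing,
    Discharged {
        discharge_date: Date,
        discharge_method: Code,
        primary_diagnosis: Code,
    },
}

// Hypothetical caller-side helper: every caller that wants an EpisodeOutcome
// has to repeat this decision before the episode can be constructed.
fn outcome_from_parts(
    discharge_date: Option<Date>,
    discharge_method: Option<Code>,
    primary_diagnosis: Option<Code>,
) -> Result<EpisodeOutcome, String> {
    match (discharge_date, discharge_method, primary_diagnosis) {
        (Some(discharge_date), Some(discharge_method), Some(primary_diagnosis)) => {
            Ok(EpisodeOutcome::Discharged {
                discharge_date,
                discharge_method,
                primary_diagnosis,
            })
        }
        // An ongoing episode's primary diagnosis has nowhere to go in this
        // design, so it is silently dropped.
        (None, None, _) => Ok(EpisodeOutcome::Ongoing),
        _ => Err("incomplete discharge information".to_string()),
    }
}
```

Notice that the second arm throws away a primary diagnosis that an admitted patient might legitimately have, which is exactly the information loss described above.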

It is a step in the right direction - the enum models the finished shape better than the original struct - but it is still not enough. We end up having to put the logic for assembling the episode in the caller’s code, which is not ideal. We really want this to be integral to the construction of the episode, so that we can ensure that the episode is always valid when it is created. This is where the builder pattern comes in.

How to Implement the Builder Pattern in Rust

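First, we will define our HospitalEpisode struct. This is a flat struct with every field needed to represent a hospital episode, whether ongoing or discharged. It has the same shape as the all-Option struct from earlier; the difference is in how instances get made. As long as the struct lives in its own module and its fields stay private, code outside that module cannot write a struct literal and has to go through the builder.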

struct HospitalEpisode {
    patient_id: String,
    admission_date: NaiveDate,
    admission_method: AdmissionMethod,
    primary_diagnosis: Option<IcdCode>, 
    secondary_diagnoses: Vec<IcdCode>,
    discharge_date: Option<NaiveDate>,
    discharge_method: Option<DischargeMethod>,
}

Then, we will create a HospitalEpisodeBuilder struct that will have the same fields, but with all the fields as Option types, except for secondary_diagnoses, which will be a Vec. It can be a plain Vec because an episode with no secondary diagnoses is just an empty list.

struct HospitalEpisodeBuilder {
    patient_id: Option<String>,
    admission_date: Option<NaiveDate>,
    admission_method: Option<AdmissionMethod>,
    primary_diagnosis: Option<IcdCode>,
    secondary_diagnoses: Vec<IcdCode>,
    discharge_date: Option<NaiveDate>,
    discharge_method: Option<DischargeMethod>,
}

Then we will implement methods on the HospitalEpisodeBuilder struct to set each field. Each method will return self so that we can chain the method calls together.

impl HospitalEpisodeBuilder {
    fn new() -> Self {
        HospitalEpisodeBuilder {
            patient_id: None,
            admission_date: None,
            admission_method: None,
            primary_diagnosis: None,
            secondary_diagnoses: Vec::new(),
            discharge_date: None,
            discharge_method: None,
        }
    }

    fn patient_id(mut self, patient_id: String) -> Self {
        self.patient_id = Some(patient_id);
        self
    }

    fn admission_date(mut self, admission_date: NaiveDate) -> Self {
        self.admission_date = Some(admission_date);
        self
    }

    fn admission_method(mut self, admission_method: AdmissionMethod) -> Self {
        self.admission_method = Some(admission_method);
        self
    }

    fn primary_diagnosis(mut self, primary_diagnosis: IcdCode) -> Self {
        self.primary_diagnosis = Some(primary_diagnosis);
        self
    }

    fn secondary_diagnosis(mut self, code: IcdCode) -> Self {
        self.secondary_diagnoses.push(code);
        self
    }

    fn discharge_date(mut self, discharge_date: NaiveDate) -> Self {
        self.discharge_date = Some(discharge_date);
        self
    }

    fn discharge_method(mut self, discharge_method: DischargeMethod) -> Self {
        self.discharge_method = Some(discharge_method);
        self
    }
}

So far, we haven’t implemented any real logic yet. We are just creating a way to set fields on the builder. The next step is to implement a build() method. This is where our domain knowledge comes in and we set up our two different types of hospital episodes. The function signature for build will return a Result<HospitalEpisode, DataError>. We will return an error if the fields are missing or inconsistent, and we will return a HospitalEpisode if they are consistent. I am using a custom DataError type from the start rather than a bare String; I will show its definition at the end of the post, but the short version is that it is a thiserror enum with a variant for missing fields and a variant for inconsistent ones. I have a two part series on error handling in Rust that goes into more detail on the fundamentals here and using custom error types here. The second blog here in particular looks at using thiserror to create a custom error type, which is what I am doing here.

// This method lives inside `impl HospitalEpisodeBuilder { ... }`.
fn build(self) -> Result<HospitalEpisode, DataError> {
    // Required fields: a missing one is an error no matter what else is set.
    let patient_id = self
        .patient_id
        .ok_or_else(|| DataError::MissingField("patient_id".into()))?;
    let admission_date = self
        .admission_date
        .ok_or_else(|| DataError::MissingField("admission_date".into()))?;
    let admission_method = self
        .admission_method
        .ok_or_else(|| DataError::MissingField("admission_method".into()))?;

    let (discharge_date, discharge_method, primary_diagnosis) = match (
        self.discharge_date,
        self.discharge_method,
        self.primary_diagnosis,
    ) {
        // Discharged: both discharge fields and a primary diagnosis.
        (Some(date), Some(method), Some(diagnosis)) => {
            (Some(date), Some(method), Some(diagnosis))
        }
        // Ongoing: no discharge fields. Primary diagnosis optional.
        (None, None, primary) => (None, None, primary),
        // Discharged but missing the primary diagnosis.
        (Some(_), Some(_), None) => {
            return Err(DataError::InconsistentFields(
                "discharged episode without primary_diagnosis".into(),
            ))
        }
        // Partial discharge information.
        (Some(_), None, _) => {
            return Err(DataError::InconsistentFields(
                "discharge_date set without discharge_method".into(),
            ))
        }
        (None, Some(_), _) => {
            return Err(DataError::InconsistentFields(
                "discharge_method set without discharge_date".into(),
            ))
        }
    };

    Ok(HospitalEpisode {
        patient_id,
        admission_date,
        admission_method,
        primary_diagnosis,
        secondary_diagnoses: self.secondary_diagnoses,
        discharge_date,
        discharge_method,
    })
}

Here I am taking advantage of Rust’s match expression to code up my logic for the two different types of hospital episodes. If both discharge_date and discharge_method are set, then we have a discharged episode, and we require a primary diagnosis to go with it: a finished episode of care that was never given a primary diagnosis is treated as inconsistent. If neither discharge field is set, then we have an ongoing episode, and here the primary diagnosis stays optional, because an admitted patient may not have one recorded yet, so it is not something the builder should force. If one discharge field is set and the other is not, then we return an error because that is an inconsistent state. Because match has to be exhaustive, the compiler also checks that no combination of the three fields has been left unhandled.

Now we have a builder that can construct a HospitalEpisode in a way that ensures that the fields are consistent and valid. We can use it like this:

let ongoing_episode = HospitalEpisodeBuilder::new()
    .patient_id("12345".to_string())
    .admission_date(NaiveDate::from_ymd_opt(2026, 1, 1).unwrap())
    .admission_method(AdmissionMethod::new("Emergency"))
    .secondary_diagnosis(IcdCode::new("B00"))
    .build();

let discharged_episode = HospitalEpisodeBuilder::new()
    .patient_id("12345".to_string())
    .admission_date(NaiveDate::from_ymd_opt(2026, 1, 1).unwrap())
    .admission_method(AdmissionMethod::new("Emergency"))
    .primary_diagnosis(IcdCode::new("A00"))
    .secondary_diagnosis(IcdCode::new("B00"))
    .discharge_date(NaiveDate::from_ymd_opt(2026, 1, 10).unwrap())
    .discharge_method(DischargeMethod::new("Home"))
    .build();

We can also take lots of different paths to get to the same valid state. Maybe we have an admitted patient with a primary diagnosis, and two secondary diagnoses:

let ongoing_episode = HospitalEpisodeBuilder::new()
    .patient_id("12345".to_string())
    .admission_date(NaiveDate::from_ymd_opt(2026, 1, 1).unwrap())
    .admission_method(AdmissionMethod::new("Emergency"))
    .primary_diagnosis(IcdCode::new("A00"))
    .secondary_diagnosis(IcdCode::new("B00"))
    .secondary_diagnosis(IcdCode::new("C00"))
    .build();

Our builder allows this as a valid state. It also errors if we try to create an invalid state:

let invalid_episode = HospitalEpisodeBuilder::new()
    .patient_id("12345".to_string())
    .admission_date(NaiveDate::from_ymd_opt(2026, 1, 1).unwrap())
    .admission_method(AdmissionMethod::new("Emergency"))
    .discharge_date(NaiveDate::from_ymd_opt(2026, 1, 10).unwrap())
    .build(); // This will return an error because discharge_date is set but discharge_method is missing.

There are many different paths of “incorrectness” that the builder will catch. I won’t write them all out here, but the builder will catch any combination of missing or inconsistent fields and return an error. Note that the examples above are minimal: they feed the builder straight from the Part I newtype constructors and omit the error handling, whereas in the real code those constructors are fallible and return a Result that we handle appropriately. I have a real example of what this looks like in the GitHub repo for this series, which you can find here.
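For the three fields that take part in the discharge rule, the paths can be enumerated mechanically. The sketch below is a simplified mirror of the match in build() that looks only at whether each field is present. The names Verdict, verdict and all_cases are hypothetical and exist only for this illustration; the required fields (patient id, admission date, admission method) are a separate check and are not modelled here.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Ongoing,
    Discharged,
    Inconsistent(&'static str),
}

// Mirrors the match in build(), but only looks at whether each field is present.
fn verdict(has_date: bool, has_method: bool, has_primary: bool) -> Verdict {
    match (has_date, has_method, has_primary) {
        (true, true, true) => Verdict::Discharged,
        (false, false, _) => Verdict::Ongoing,
        (true, true, false) => {
            Verdict::Inconsistent("discharged episode without primary_diagnosis")
        }
        (true, false, _) => {
            Verdict::Inconsistent("discharge_date set without discharge_method")
        }
        (false, true, _) => {
            Verdict::Inconsistent("discharge_method set without discharge_date")
        }
    }
}

// Walks all eight presence combinations and returns
// (has_date, has_method, has_primary, verdict) for each.
fn all_cases() -> Vec<(bool, bool, bool, Verdict)> {
    let mut cases = Vec::new();
    for date in [false, true] {
        for method in [false, true] {
            for primary in [false, true] {
                cases.push((date, method, primary, verdict(date, method, primary)));
            }
        }
    }
    cases
}
```

Of the eight combinations, three are accepted (ongoing with or without a primary diagnosis, and fully discharged) and five are rejected as inconsistent.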

Integrating the Builder Pattern into a Data Pipeline

Now that we have our builder, we can integrate it wherever we need to construct a HospitalEpisode in our data pipeline. Let’s imagine that our pipeline looks like this:

Hospital episode pipeline

1. We read in a CSV file with hospital episode data.
2. We deserialize each row into a RawEpisodeRow struct that has strongly typed fields where each field individually validates its own correctness. So a date is a date, a number is a number, and so on. This is the first part of the pipeline that we covered in Part I of this series.
3. We then validate the RawEpisodeRow to ensure that it meets our expectations for individual correctness. For example, we could check that the admission date is not in the future, or that each secondary diagnosis is a valid ICD code, using newtypes and a regex. This is the second part of the pipeline that we covered in Part I of this series. The output of this step is a ValidatedEpisode struct that has the same fields as the raw data, but with the individual correctness rules applied.
4. We then use the HospitalEpisodeBuilder to construct a HospitalEpisode from the ValidatedEpisode. This is where we apply our compositional correctness rules. The builder will ensure that the fields are consistent and valid, and will return an error if they are not. The output of this step is a HospitalEpisode struct that has been validated for both individual and compositional correctness.

What does this look like in code? Let’s say we have a method on ValidatedEpisode that returns a Result<HospitalEpisode, DataError>. We could implement it like this:

impl ValidatedEpisode {
    fn into_episode(self) -> Result<HospitalEpisode, DataError> {
        let mut builder = HospitalEpisodeBuilder::new()
            .patient_id(self.patient_id)
            .admission_date(self.admission_date)
            .admission_method(self.admission_method);

        if let Some(primary) = self.primary_diagnosis {
            builder = builder.primary_diagnosis(primary);
        }
        for code in self.secondary_diagnoses {
            builder = builder.secondary_diagnosis(code);
        }
        if let Some(date) = self.discharge_date {
            builder = builder.discharge_date(date);
        }
        if let Some(method) = self.discharge_method {
            builder = builder.discharge_method(method);
        }

        builder.build()
    }
}

This one method is doing a lot of heavy lifting. It is taking a ValidatedEpisode struct that has been validated for individual correctness, and it is using the builder to construct a HospitalEpisode struct that has been validated for both individual and compositional correctness. Notice that the optional fields are only fed to the builder when they are actually present. If any of the fields are inconsistent, the builder returns an error and we can handle it appropriately.

Let’s see how we can use it when we read in a CSV file with hospital episode data. We can use the csv crate to read in the data, and we can use the serde crate to deserialize each row into a RawEpisodeRow. We then validate it into a ValidatedEpisode, and use into_episode to construct a HospitalEpisode. If any of the rows are invalid, we can log an error and skip that row.

let mut reader = csv::Reader::from_path("episodes.csv")?;
let mut episodes = Vec::new();
let mut rejected = Vec::new();

for result in reader.deserialize::<RawEpisodeRow>() {
    let outcome = result
        .map_err(DataError::from)
        .and_then(|row| row.validate())
        .and_then(|validated| validated.into_episode());

    match outcome {
        Ok(episode) => episodes.push(episode),
        Err(reason) => rejected.push(reason),
    }
}
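The and_then chain short-circuits: each stage runs only if the previous one succeeded, and the first error is the one that ends up in rejected. Here is a small standard-library-only sketch of that behaviour. The stages parse, validate and assemble, and the "length of stay" value they pass along, are made-up stand-ins for the real RawEpisodeRow, ValidatedEpisode and HospitalEpisode steps, with String in place of DataError.

```rust
// Stand-in stages using plain strings and integers; the real pipeline uses
// RawEpisodeRow, ValidatedEpisode and HospitalEpisode with DataError.
fn parse(raw: &str) -> Result<i64, String> {
    raw.trim()
        .parse::<i64>()
        .map_err(|e| format!("parse error: {e}"))
}

fn validate(days: i64) -> Result<i64, String> {
    if days >= 0 {
        Ok(days)
    } else {
        Err(format!("invalid value: {days} is negative"))
    }
}

fn assemble(days: i64) -> Result<String, String> {
    Ok(format!("length of stay: {days} days"))
}

// Each stage runs only if the previous one succeeded; the first error is kept.
fn run(raw: &str) -> Result<String, String> {
    parse(raw).and_then(validate).and_then(assemble)
}

// Splits inputs into accepted values and rejection reasons, like the loop above.
fn split(rows: &[&str]) -> (Vec<String>, Vec<String>) {
    let mut accepted = Vec::new();
    let mut rejected = Vec::new();
    for row in rows {
        match run(row) {
            Ok(value) => accepted.push(value),
            Err(reason) => rejected.push(reason),
        }
    }
    (accepted, rejected)
}
```

A row that fails to parse never reaches validate, and a row that fails validate never reaches assemble, so every rejected row carries exactly one reason.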

Here I am skipping over the details of the RawEpisodeRow and ValidatedEpisode structs, as they are not the focus of this post. I have, however, written an example of them in the GitHub repo for this series, which you can find here. You will see that this includes the full pipeline, including an example messy data file, and the code to deserialize it into a RawEpisodeRow, validate it into a ValidatedEpisode, and then use the builder to construct a HospitalEpisode. Each of these processes is a separate step in the pipeline, and they each output a useful variant of the DataError type that we can use to log and skip invalid rows. The DataError type is defined as follows:

use thiserror::Error;

#[derive(Debug, Error)]
pub enum DataError {
    #[error("CSV read error: {0}")]
    Csv(#[from] csv::Error),
    #[error("missing required field: {0}")]
    MissingField(String),
    #[error("inconsistent fields: {0}")]
    InconsistentFields(String),
    #[error("invalid value: {0}")]
    InvalidValue(String),
}

The #[from] attribute generates the From<csv::Error> impl that map_err(DataError::from) relies on, the MissingField and InconsistentFields variants are the ones the builder returns, and InvalidValue is what the Part I newtype constructors return when a value fails its own validation. Using a custom error type like this, rather than a bare String, lets us provide more detailed error messages and makes it easier to handle different types of errors in the pipeline, for example counting how many rows were rejected for being inconsistent versus how many failed to parse at all.
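As a rough sketch of that counting idea, the code below tallies rejections by variant. To keep it standard-library-only I have used a cut-down stand-in for DataError without thiserror or the Csv variant, and the helper names kind and tally are hypothetical.

```rust
use std::collections::BTreeMap;

// Stand-in for the post's DataError, without thiserror or the Csv variant,
// so the sketch needs only the standard library.
#[derive(Debug)]
enum DataError {
    MissingField(String),
    InconsistentFields(String),
    InvalidValue(String),
}

// Hypothetical helper: a short label per variant, used as the tally key.
fn kind(error: &DataError) -> &'static str {
    match error {
        DataError::MissingField(_) => "missing field",
        DataError::InconsistentFields(_) => "inconsistent fields",
        DataError::InvalidValue(_) => "invalid value",
    }
}

// Counts rejected rows per error kind.
fn tally(rejected: &[DataError]) -> BTreeMap<&'static str, usize> {
    let mut counts = BTreeMap::new();
    for error in rejected {
        *counts.entry(kind(error)).or_insert(0) += 1;
    }
    counts
}
```

With a bare String error this grouping would mean parsing message text; with an enum it is a match on the variant.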

Let’s see how this works in practice. I have a small example CSV file with some valid and invalid rows, and I will run the pipeline on it. The valid rows will be constructed into HospitalEpisode structs, and the invalid rows will be logged and skipped. The output will be a vector of valid episodes and a vector of rejected rows with their reasons for rejection.

P0001,2023-01-04,emergency,J189,E119;I10,2023-01-11,home
P0002,2023-01-05,elective,M170,,2023-01-06,home
P0003,2023-02-12,emergency,I214,N179;E875,,

3 accepted, 5 rejected
  rejected: CSV read error: CSV deserialize error: record 8 (line: 9, byte: 443): unknown variant `teleport`, expected `emergency` or `elective`
  rejected: invalid value: admission date 2039-05-01 is in the future
  rejected: inconsistent fields: discharge_date set without discharge_method
  rejected: inconsistent fields: discharge_method set without discharge_date
  rejected: inconsistent fields: discharged episode without primary_diagnosis

Out of 8 rows in the input file, 3 were accepted and 5 were rejected. The rejected rows had a variety of reasons for rejection:

1 row had an unknown admission method, which was caught at deserialisation, the first individual correctness check from Part I of this series.
1 row had an admission date in the future, which was caught by the individual correctness validation through the newtype constructor.
1 row had a discharge date but no discharge method, which was caught by the builder’s compositional correctness validation.
1 row had a discharge method but no discharge date, which was also caught by the builder’s compositional correctness validation.
1 row had a discharge date and discharge method but no primary diagnosis, which was also caught by the builder’s compositional correctness validation.

All of these rejections were logged with a reason, and the valid rows were constructed into HospitalEpisode structs. Make sure you do check out the codebase for this post because it does have a lot more about the whole pipeline than I could put here.

In my opinion this is already a real improvement over the all-Option struct. The mutually dependent and mutually exclusive rules are now enforced in one place, and the code that enforces them is readable. The final chain of calls is easy to read and understand, yet it hides a lot of the complexity of the validation logic.

let outcome = result
    .map_err(DataError::from)
    .and_then(|row| row.validate())
    .and_then(|validated| validated.into_episode());

Pushing the check into the compiler

So far we have a lot of good checks but there is still a gap: build() can fail at runtime if a required field like the patient id is missing. The compiler will happily let you write a builder chain that omits it, and you only find out when build() returns an error:

// Nothing stops you from omitting a required field.
let missing_patient_id = HospitalEpisodeBuilder::new()
    // .patient_id(...)  <-- omitted, but the compiler doesn't care
    .admission_date(NaiveDate::from_ymd_opt(2026, 1, 1).unwrap())
    .admission_method(AdmissionMethod::new("Emergency"))
    .build();

match missing_patient_id {
    // to_csv_row() is not shown in this post.
    Ok(episode) => println!("built: {}", episode.to_csv_row()),
    Err(reason) => eprintln!("\nruntime failure: {reason}"),
}

This will compile but it will then fail at runtime with missing required field: patient_id. The way to think about this in my opinion is that the runtime build() check is fine when the requirement is actually conditional or when the missing-field error is something you want to handle gracefully. But for the fields that are always required, there is no reason to discover their absence at runtime. We can push that check into the compiler by using the typestate pattern. The typestate pattern is going to be the subject of the next post in this series, so I will only point to it here. The idea is to give the builder a type parameter that records which required fields have been supplied, and to make build() available only on the fully-populated state.
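As a taste only, here is a deliberately cut-down sketch of that idea with one required field and one optional field. It is an illustration under my own assumptions, not the implementation the next post will build: the names Missing, Present, EpisodeBuilder and Episode are hypothetical, and plain Strings stand in for the Part I newtypes.

```rust
// State markers: the builder's type parameter records whether the
// required patient id has been supplied yet.
struct Missing;
struct Present(String);

// Cut-down stand-in for the post's builder: one required field, one optional.
struct EpisodeBuilder<PatientId> {
    patient_id: PatientId,
    primary_diagnosis: Option<String>,
}

#[derive(Debug, PartialEq)]
struct Episode {
    patient_id: String,
    primary_diagnosis: Option<String>,
}

impl EpisodeBuilder<Missing> {
    fn new() -> Self {
        EpisodeBuilder {
            patient_id: Missing,
            primary_diagnosis: None,
        }
    }

    // Supplying the required field moves the builder into the Present state.
    fn patient_id(self, id: String) -> EpisodeBuilder<Present> {
        EpisodeBuilder {
            patient_id: Present(id),
            primary_diagnosis: self.primary_diagnosis,
        }
    }
}

// Optional setters are available in every state.
impl<PatientId> EpisodeBuilder<PatientId> {
    fn primary_diagnosis(mut self, code: String) -> Self {
        self.primary_diagnosis = Some(code);
        self
    }
}

// build() exists only on the state where the patient id is present,
// so it has nothing to check and cannot fail for this field.
impl EpisodeBuilder<Present> {
    fn build(self) -> Episode {
        Episode {
            patient_id: self.patient_id.0,
            primary_diagnosis: self.primary_diagnosis,
        }
    }
}
```

In this sketch, calling EpisodeBuilder::new().build() is a compile error because EpisodeBuilder<Missing> has no build method. The conditional discharge rule would still be checked at runtime, since it depends on the data rather than on which setters were called.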

Conclusion

If there’s one thing to take from this post, it’s that valid fields don’t make a valid record. You can check every value on its own and still end up with a row that makes no sense: a discharge date on a patient who was never discharged, a discharge method with no date to go with it. Those aren’t typos you can catch field by field because they only show up when you look at the whole record. The builder is where I put that check so it happens in one place instead of being scattered around the pipeline in assert or if statements.

This is the fifth post in my series on using Rust for data-intensive work. In Part I the job was getting each value right on its own, so a date is a date and a newtype encodes something about what we know about our domain. Here it’s about whether those values hang together. I keep the two separate on purpose. Serde checks the parts, the builder checks the combination, and not asking either one to do the other’s job is what keeps both of them simple.

One catch: the builder only complains when build() runs, so a missing field is still a runtime error. In the next post I’ll show the typestate pattern, which moves that check into the compiler. Leave out a required field and the code won’t compile at all. It’s the same goal as this post, just enforced earlier and harder!
