How Private is Synthetic Data? Understanding the Tradeoff with Utility

Recently, the UK Biobank found itself in the news after it came to light that real datasets had been uploaded by researchers to GitHub, including diagnoses and appointment dates. The data was pseudonymised, but it was easily demonstrable that records could be linked back to real individuals if you knew just a few details about someone in the dataset.

As someone working in the synthetic data space, I am often asked whether synthetic data is a safe alternative to real patient data. Incidents like this one explain why the question matters. No one wants their health record to be available online for anyone to see! Pseudonymisation is not enough to protect privacy because health records are so rich in information that they are essentially unique to each individual. Synthetic data is often proposed as a solution, but it is not automatically private either. In this blog, I will explore the tradeoff between privacy and utility in synthetic data, and what it means for researchers.
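To see why removing names is not enough, here is a minimal sketch of how quickly a few ordinary details single people out. The records, field names and values are invented for illustration and have nothing to do with the actual Biobank data.

```python
from collections import Counter

# Hypothetical pseudonymised records: names are gone, but details remain.
records = [
    {"pid": "a1", "birth_year": 1958, "postcode_area": "LS6", "appointment": "2021-03-04"},
    {"pid": "b2", "birth_year": 1958, "postcode_area": "LS6", "appointment": "2021-05-17"},
    {"pid": "c3", "birth_year": 1958, "postcode_area": "YO1", "appointment": "2021-03-04"},
    {"pid": "d4", "birth_year": 1972, "postcode_area": "LS6", "appointment": "2021-03-04"},
    {"pid": "e5", "birth_year": 1972, "postcode_area": "LS6", "appointment": "2021-03-04"},
]

def unique_fraction(rows, keys):
    """Share of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    singled_out = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return singled_out / len(rows)

# Each extra detail an attacker knows makes more records unique.
for keys in (
    ["birth_year"],
    ["birth_year", "postcode_area"],
    ["birth_year", "postcode_area", "appointment"],
):
    print(keys, unique_fraction(records, keys))
```

In this toy dataset nobody is unique on birth year alone, but most records become unique once a postcode area and an appointment date are known. Real health records carry far more fields than this, which is why uniqueness is the norm rather than the exception.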


The promise of synthetic data

Synthetic data is straightforward in concept: it is artificially generated rather than collected from actual patients. Since it contains no real patient records, it is widely considered safer to share. But synthetic data is often generated by training a model on real data, and if that model is too good at capturing the patterns in the real data, it can inadvertently reveal information about the individuals in the original dataset. “Artificially generated” does not automatically mean private, just as “pseudonymised” did not for the Biobank researchers.

In short, we shouldn’t assume that a synthetic dataset is safe to share just because it is synthetic. We need to understand the tradeoff between privacy and utility. I have discussed privacy metrics and utility metrics in previous blogs; this post focuses on the tradeoff between the two and what it means for researchers using synthetic data.

What do we mean by utility and privacy?

Before going deeper into the tradeoff, it’s worth being clear about what these two terms mean in the context of synthetic data.

Utility refers to how well synthetic data preserves the real properties needed for valid research. I explain the difference between broad and narrow utility in my previous blog, but in short, high utility means that researchers can draw the same conclusions from synthetic data that they would from real data.
Privacy refers to protection against re-identification, that is, preventing a specific individual’s data from being revealed.

	High	Low
Utility	Researchers can draw the same conclusions as from real data, as relationships between variables are reflected in the synthetic dataset	Statistical relationships between variables are not preserved, so the data does not reflect reality
Privacy	Synthetic data reveals as little as possible about real individuals	Re-identification or membership inference becomes possible

The challenge is that privacy goals and utility goals conflict with each other in a way that cannot be fully resolved, only managed.

The Tradeoff in Practice

To understand why these two goals pull against each other, it helps to think about what makes synthetic data realistic. To generate synthetic data, a model learns statistical patterns from real data, and then uses those patterns to produce new records. The better it captures those patterns, including rare events, combinations of variables, and relationships between clinical features (in other words, the more representative it is), the more useful the output is for research.
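As a rough illustration of "learn the patterns, then produce new records", here is a minimal sketch using a deliberately simple linear model in place of a GAN or Bayesian network. The two variables (age and systolic blood pressure), the numbers, and the `fit` and `sample` helpers are all assumptions made up for this example, not a method from any of the studies cited in this post.

```python
import random
import statistics

def fit(ages, bps):
    """Learn a tiny model: age ~ Normal, bp = intercept + slope * age + noise."""
    mean_age = statistics.fmean(ages)
    mean_bp = statistics.fmean(bps)
    slope = (sum((a - mean_age) * (b - mean_bp) for a, b in zip(ages, bps))
             / sum((a - mean_age) ** 2 for a in ages))
    intercept = mean_bp - slope * mean_age
    residuals = [b - (intercept + slope * a) for a, b in zip(ages, bps)]
    return {
        "mean_age": mean_age,
        "sd_age": statistics.stdev(ages),
        "intercept": intercept,
        "slope": slope,
        "sd_noise": statistics.stdev(residuals),
    }

def sample(model, n, rng):
    """Generate n new (age, bp) records from the learned patterns."""
    rows = []
    for _ in range(n):
        age = rng.gauss(model["mean_age"], model["sd_age"])
        bp = model["intercept"] + model["slope"] * age + rng.gauss(0, model["sd_noise"])
        rows.append((age, bp))
    return rows

rng = random.Random(0)
# Toy stand-in for real patient data (invented numbers, not clinical values).
real_ages = [rng.uniform(30, 80) for _ in range(500)]
real_bps = [100 + 0.6 * a + rng.gauss(0, 8) for a in real_ages]

model = fit(real_ages, real_bps)
synthetic = sample(model, 500, rng)

# Utility check: does refitting on synthetic data recover the same relationship?
syn_ages, syn_bps = zip(*synthetic)
refit = fit(syn_ages, syn_bps)
print(round(model["slope"], 2), round(refit["slope"], 2))
```

A researcher estimating the age effect from the synthetic records should land close to the estimate from the real ones, which is the sense of "utility" used in this post. A model this simple memorises very little about any one person; more flexible generators capture more detail, and that is where the privacy risk starts to grow.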

The issue is that these same patterns are also what make it possible, in principle, to trace synthetic records back to real individuals: the more realistic the data, the more it resembles the people it came from. There are several methods for generating synthetic data, and the table below highlights a few important examples and where each sits on the tradeoff.

Generation Method	Utility	Privacy	Reference
Standard GANs	High	Medium	Venugopal et al. (2022)
SDV (Synthetic Data Vault)	Medium-High	Medium-High	Hernandez et al. (2023)
Bayesian Networks	Medium	Medium-High	Kharya et al. (2022)
Differentially Private GANs (e.g. DP-CTGAN)	Medium	High	Sun et al. (2023)
Independent Marginals	Low	High	Goncalves et al. (2020)

What’s important to understand is that this tradeoff is not just theoretical: many studies comparing multiple methods across healthcare datasets conclude that no generation method performs well on both dimensions simultaneously. Studies by Goncalves, Appenzeller, and Hernandez all reach the same conclusion. Since you cannot optimise for both at once, the right method depends entirely on what the data needs to do. This is why implementation decisions benefit from specialist input. Choosing the right point on the privacy-utility spectrum for a specific research question requires technical knowledge of the available methods and a clear understanding of the research purpose, and getting it wrong in either direction has real consequences.
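To make the two ends of that spectrum concrete, the sketch below compares two extreme "generators" on invented data: one that simply resamples real rows, and one that draws each column independently, in the spirit of the independent marginals row in the table above. The utility measure (gap in correlation) and the privacy proxy (share of synthetic rows that are exact copies of real rows) are simplifying assumptions chosen for illustration, not the metrics used in the cited studies.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5

def evaluate(real, synth):
    """Return (correlation gap, share of synthetic rows copied from real data)."""
    real_corr = pearson(*zip(*real))
    synth_corr = pearson(*zip(*synth))
    real_rows = set(real)
    copied = sum(1 for row in synth if row in real_rows) / len(synth)
    return abs(real_corr - synth_corr), copied

rng = random.Random(1)
# Toy stand-in for real data: (age, blood pressure) pairs with a built-in relationship.
real = []
for _ in range(300):
    age = rng.uniform(30, 80)
    real.append((age, 100 + 0.6 * age + rng.gauss(0, 8)))
ages, bps = zip(*real)

# Extreme 1: "synthetic" data that simply resamples real rows.
resampled = [rng.choice(real) for _ in range(300)]

# Extreme 2: independent marginals, each column drawn on its own (Gaussian assumed).
age_mean, age_sd = statistics.fmean(ages), statistics.stdev(ages)
bp_mean, bp_sd = statistics.fmean(bps), statistics.stdev(bps)
independent = [(rng.gauss(age_mean, age_sd), rng.gauss(bp_mean, bp_sd)) for _ in range(300)]

gap_resampled, copied_resampled = evaluate(real, resampled)
gap_independent, copied_independent = evaluate(real, independent)
print(f"resampled:   correlation gap {gap_resampled:.2f}, copied rows {copied_resampled:.0%}")
print(f"independent: correlation gap {gap_independent:.2f}, copied rows {copied_independent:.0%}")
```

The resampled data keeps the relationship between the variables but every row is a real record, while the independent data copies nobody but loses the relationship entirely. Practical methods sit somewhere between these two, and an exact-copy check is only the crudest possible privacy test.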

How should I choose between privacy and utility for my research?


The most effective approach is to define the research purpose first, and then select a generation method calibrated to that purpose. The NHS’s Simulacrum dataset is a good example of this done well. As a synthetic version of the National Cancer Registration and Analysis Service data, it was built specifically to let researchers develop and test code before applying it to real patient data. A 2025 study by Kafatos et al. found that 18 projects used it in this way, reducing development time significantly while keeping real patient data entirely out of the analysis.

The contrast with the NHS Federated Data Platform is striking. For context, the £330 million contract awarded to Palantir in 2023 to centralise and process identifiable NHS patient data has met strong and sustained public and professional opposition, with the British Medical Association voting to oppose the rollout in 2025. This appears to reflect discomfort with what happens when real, identifiable patient data is concentrated in one place and its protection relies on the trustworthiness of a single commercial organisation. Synthetic data does not resolve this debate, but it does reframe the question. If a research workflow can be carried out using synthetic data calibrated to the right level of fidelity, fewer organisations need access to the real data, and by extension, the associated privacy risks are reduced.


In my Utility Metrics blog, I discussed the importance of evaluating synthetic data’s utility in the context of the task at hand. In general, you want to pick the generation method that gives you the highest utility while still meeting your privacy requirements. For example, if you are developing a new algorithm for predicting patient outcomes, you may want to prioritise utility over privacy, because the synthetic data will be used to train and validate your model. That might mean obtaining higher levels of consent, security or governance to counterbalance the privacy risk.

On the other hand, if you are developing a synthetic dataset simply to allow researchers to write and test their research code whilst they wait for access to the real data (which, as we know, can take many months if not years), then you may want to prioritise privacy over utility, because the synthetic data will not be used for any research conclusions. It only needs to have the general shape of the real data.
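The rule of thumb above (highest utility that still meets your privacy requirement) can be sketched as a simple selection over the table earlier in this post. The qualitative ratings are copied from that table; the numeric scale in `RATING` and the `choose_method` helper are assumptions introduced purely for illustration.

```python
# Assumed numeric scale for the qualitative ratings (illustrative only).
RATING = {"Low": 1.0, "Medium": 2.0, "Medium-High": 2.5, "High": 3.0}

# (method, utility, privacy), as listed in the table earlier in this post.
METHODS = [
    ("Standard GANs", "High", "Medium"),
    ("SDV (Synthetic Data Vault)", "Medium-High", "Medium-High"),
    ("Bayesian Networks", "Medium", "Medium-High"),
    ("Differentially Private GANs (e.g. DP-CTGAN)", "Medium", "High"),
    ("Independent Marginals", "Low", "High"),
]

def choose_method(min_privacy):
    """Return the highest-utility method whose privacy rating meets the requirement."""
    eligible = [m for m in METHODS if RATING[m[2]] >= RATING[min_privacy]]
    best = max(eligible, key=lambda m: RATING[m[1]])
    return best[0]

# Training a predictive model under strong governance: a lower privacy bar may be acceptable.
print(choose_method("Medium"))
# Sharing a dataset widely so researchers can test code: set a high privacy bar.
print(choose_method("High"))
```

The ratings in the table are coarse and come from different studies and datasets, so in practice you would measure utility and privacy on your own data rather than rely on a lookup like this. The point of the sketch is the order of the decision: fix the privacy requirement first, then maximise utility within it.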

How does representativeness fit into the tradeoff?

I have a whole other blog on representativeness in synthetic data, but it is worth mentioning here because it is another dimension to consider when evaluating synthetic data. It sits on the utility side of the tradeoff, but it is not the same thing as utility. Utility asks whether the data can answer a question or perform a task; representativeness asks how well it reflects the underlying population, regardless of task.

The way I think about this is that judging how representative a dataset needs to be is a decision that should be made when thinking about utility. Can the dataset answer your particular research question or perform the task at hand? And how close to the underlying population does it need to be? Once you have made that decision, you can then evaluate what effect it has on the privacy-utility tradeoff.

It might be possible to generate a dataset that has high utility for your research question, but is not particularly representative of the underlying population. For example, it might increase the number of records in a rare disease cohort or particular demographic group. This might give it more utility for answering your research question, but it would not be representative of the underlying population.

You would then need to make a judgement on what effect that has on the privacy of the dataset. It could make it more private, because the synthetic records are less likely to be linked back to real individuals, or it could make it less private, because the rare disease cohort or demographic group is now overrepresented in the synthetic dataset, making it easier to identify individuals who contributed to the training data. The key point is that representativeness is another dimension to consider when evaluating synthetic data, and it should be considered in the context of the research question or task at hand.
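One simple way to keep representativeness separate from utility is to measure it directly, by comparing each group's share of the synthetic dataset with its share of the underlying population. The sketch below uses invented cohort sizes, and the `cohort_shares` helper and the ratio itself are illustrative assumptions rather than an established metric.

```python
from collections import Counter

def cohort_shares(labels):
    """Share of records belonging to each cohort."""
    counts = Counter(labels)
    return {cohort: n / len(labels) for cohort, n in counts.items()}

# Hypothetical population: 2% of patients have a rare disease.
population = ["rare"] * 20 + ["common"] * 980
# Hypothetical synthetic dataset, deliberately boosted for rare-disease research.
synthetic = ["rare"] * 200 + ["common"] * 800

pop_shares = cohort_shares(population)
syn_shares = cohort_shares(synthetic)

# A ratio above 1 means the cohort is over-represented relative to the population.
representation_ratio = {c: syn_shares[c] / pop_shares[c] for c in pop_shares}
for cohort, ratio in representation_ratio.items():
    print(cohort, round(ratio, 2))
```

Here the rare cohort is about ten times over-represented, which may be exactly what a rare-disease study needs. Whether that boost helps or harms privacy is a separate question that this ratio does not answer and would need its own evaluation.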

Conclusion

Every synthetic data project involves a choice, whether explicit or not, about where the generated data should sit on the spectrum between maximum privacy and maximum utility. For researchers, the practical implication is that evaluating synthetic data on statistical similarity alone is not enough. A dataset that scores well on distribution matching but poorly on privacy is not a safe substitute for real data, and at the same time, a dataset that scores well on privacy but poorly on utility is not a useful one. As the evaluation by Hernandez has shown, these dimensions need to be assessed together, and the right balance will look different for a machine learning (ML) training dataset than it will for a clinical outcomes study. As synthetic data becomes more widely used, understanding this tradeoff is increasingly a baseline requirement for anyone working with health data.
