How Do We Measure the Utility of Synthetic Data?

How long a patient stays in hospital is one of the most useful things a ward can forecast, and in 2023 a team set out to predict it using synthetic data. The synthetic records passed their statistical checks and looked, by almost every measure, like real patient stays. Yet the model trained on them was weakest for the patients with the longest and least typical stays, the unpredictable outliers that put pressure on beds, staff, and resources. In this article, we are going to talk about the utility of synthetic data and how to measure it.


What is utility?

In my previous post on representativeness, I described how representativeness defines whether a synthetic dataset reflects the population it is supposed to model. Utility is a different property of synthetic data: it asks whether the data is good enough for the job you actually need it to do. The easiest way to keep the two apart is to remember that representativeness is measured against a population, while utility is always measured against a task. For example, a synthetic dataset might mirror the hospital’s population well but still fail to model long stays. Conversely, a dataset that deliberately oversamples long stays looks less representative as a whole but captures the edge cases better. Neither is the better dataset in the abstract; it depends entirely on what you are going to do with it.

Broad and narrow utility

Utility metrics fall into two families: broad utility and narrow utility. Broad utility assesses general statistical properties of the synthetic dataset without reference to any specific task. Narrow utility, on the other hand, tests whether the data actually supports a particular analysis.

A 2025 scoping review of 73 studies by Kaabachi and colleagues found that broad utility evaluations were used more than twice as often as narrow ones (153 vs 63), which partly explains why so many synthetic datasets pass their quality checks yet still underperform in practice. The table below sets out what each family does.

	Broad utility	Narrow utility
The question it asks	Are the statistical properties of the real data preserved well enough that analyses run on the synthetic data would produce similar results? Not tied to any specific task.	Does the data actually perform the downstream task? For example, does a model trained on it predict outcomes as well as one trained on real data?
Use case	Agnostic to the research question	Specific to the workload
Cost	Cheap to compute	Needs the real task run end-to-end
How far it gets you	Necessary, but not sufficient	Decisive, but only for the specific task you test
Methods	Kolmogorov-Smirnov test, Hellinger distance, propensity score MSE	Train on synthetic, test on real (TSTR), replicating a study analysis on both datasets and comparing conclusions

Broad utility is usually necessary, because you want to check that the synthetic data is not broken in some obvious way, but it is rarely sufficient.

So how do we actually measure utility?

Imagine you are the researcher designing the length-of-stay experiment. You have two synthetic copies of your ward’s records, each generated by a different method, and must choose between them. You run a Kolmogorov-Smirnov (KS) test on each column of each dataset to check whether its distribution matches that of the same column in the real data. The first dataset matches the real data on all twelve columns, while the second misses on two. In the absence of further tests, you would naturally select the first generation method (see the table below).

However, when you then train your prediction models on each synthetic dataset and test both against real patient records, the model trained on the first synthetic dataset performs poorly, while the model trained on the second performs well. In this example, the broad check pointed you to the worse generation method.

	Dataset A	Dataset B
KS test columns passing (broad utility)	12 of 12	10 of 12
Obvious choice based on broad utility?	Yes	No
Model accuracy on real patients (narrow utility)	Poor	Good

The KS test is one of a family of broad measures. Other methods compare distributions using distance scores such as the Hellinger distance, check whether relationships between variables survive generation, or train a classifier to distinguish real records from synthetic ones. All are cheap and worth running, but none of them tests the data against the task it is designed for. This does not mean broad metrics are useless. In a validation study across 30 health datasets, El Emam et al. found that broad metrics could predict which generation method would produce the most useful data for a real logistic regression task, without running the task first. A well-chosen broad metric can therefore be a useful shortcut, but only once it has been validated against a real task.
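
To make two of these broad measures concrete, here is a minimal sketch in plain Python of the two-sample KS statistic for a numeric column and the Hellinger distance for a categorical column. The function names and the toy length-of-stay values are my own illustrations, not taken from any of the studies mentioned. In practice you would more likely reach for a library routine such as SciPy's `scipy.stats.ks_2samp`, which also returns a p-value.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so check each one.
    for value in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, value) / len(synth_sorted)
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synth))
    return largest_gap


def hellinger_distance(real, synthetic):
    """Hellinger distance between two categorical columns (0 = identical, 1 = no overlap)."""
    real_counts = Counter(real)
    synth_counts = Counter(synthetic)
    total = 0.0
    for category in set(real_counts) | set(synth_counts):
        p = real_counts[category] / len(real)
        q = synth_counts[category] / len(synthetic)
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2)


# Hypothetical length-of-stay columns (days); the real one has a single long stay.
real_stays = [1, 2, 2, 3, 4, 5, 7, 21]
synthetic_stays = [1, 2, 2, 3, 3, 4, 5, 6]
print("KS statistic:", ks_statistic(real_stays, synthetic_stays))

# Hypothetical admission-type columns.
real_admission = ["emergency", "elective", "emergency", "emergency"]
synthetic_admission = ["emergency", "elective", "elective", "emergency"]
print("Hellinger distance:", hellinger_distance(real_admission, synthetic_admission))
```

Notice that the single 21-day stay barely moves the KS statistic, which is one way a column can pass a broad check while the synthetic data still misses the outliers.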

The standard test of narrow utility, and the one I see misunderstood most often, is train on synthetic, test on real (TSTR). You train your model on synthetic data, test it against real records, and compare its performance with that of a model trained on the real data and tested on the same records. The test data has to be real. One common mistake I’ve seen is that researchers test the model on held-back synthetic data instead, which produces a flattering score that means very little, because the model is being marked against data built to look like its own training set.
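
The TSTR comparison described above can be sketched as follows. This is a toy illustration under my own assumptions: the model is a one-feature least-squares line, the score is mean absolute error, and the (feature, length-of-stay) pairs are invented. The structure is the part that matters: both models are scored on the same held-out real records.

```python
def fit_line(rows):
    """Fit a least-squares line to (feature, target) pairs and return a predictor."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_absolute_error(predict, rows):
    """Average absolute difference between predictions and true targets."""
    return sum(abs(predict(x) - y) for x, y in rows) / len(rows)


def tstr_report(real_train, real_test, synthetic_train,
                fit=fit_line, score=mean_absolute_error):
    """Score a synthetic-trained model and a real-trained baseline on the SAME real test set."""
    tstr = score(fit(synthetic_train), real_test)  # train on synthetic, test on real
    trtr = score(fit(real_train), real_test)       # train on real, test on real (baseline)
    return {"tstr": tstr, "trtr": trtr, "gap": tstr - trtr}


# Invented (feature, length_of_stay) pairs, for illustration only.
real_train = [(1, 2), (2, 4), (3, 6)]
real_test = [(4, 8), (5, 10)]          # held-out REAL records, never synthetic
synthetic_train = [(1, 3), (2, 5), (3, 7)]

report = tstr_report(real_train, real_test, synthetic_train)
print(report)
```

Because the score here is an error, a positive gap means the synthetic-trained model does worse than the real-trained baseline. With a score where higher is better, such as accuracy, the sign flips.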

The TSTR approach is not specific to healthcare. Indeed, the wider machine learning world applies it to everything from synthetic bank transactions to synthetic images. For research, however, the bar for quality is exceptionally high: an analysis run on synthetic data should reach the same conclusion it would have reached on real data. In a study aptly titled “Spot the difference”, Foraker and colleagues found the synthetic results statistically indistinguishable from the real data.


More complications

There are a few complications worth considering, and each surfaces at a different point in the evaluation process.

Averages can hide subgroup failure. An overall score can look good even when the data fails for a small group of patients, and that small group is often the one the study cares about most. For example, patients with unusually long stays or rare diseases are often the most important to model.
Synthetic data can outperform its source. Ghosheh and colleagues generated synthetic records from just 364 intensive care patients, and models trained on the synthetic data outperformed models trained on the real data.
Different measures can disagree. One of the broad checks trains a classifier to tell real records from synthetic ones, and something as small as the order in which variables are generated can change how distinguishable the data is. A dataset can look obviously synthetic and still train a model that works, because looking real and working are different properties.
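
To illustrate the first of these complications, here is a small sketch that reports a model's error overall and per subgroup. The function name, the 14-day cut-off for a "long" stay, and the numbers are all hypothetical, chosen only to show how a reasonable-looking average can sit on top of a large error for the patients who matter most.

```python
from collections import defaultdict


def error_by_subgroup(actual, predicted, groups):
    """Return mean absolute error overall and for each subgroup label."""
    per_group = defaultdict(list)
    all_errors = []
    for y, y_hat, group in zip(actual, predicted, groups):
        error = abs(y - y_hat)
        per_group[group].append(error)
        all_errors.append(error)
    return {
        "overall": sum(all_errors) / len(all_errors),
        "by_group": {g: sum(errs) / len(errs) for g, errs in per_group.items()},
    }


# Hypothetical real test stays (days) and predictions from a synthetic-trained model.
actual_stays = [2, 3, 2, 4, 3, 3, 2, 30]
predicted_stays = [2, 3, 3, 4, 3, 2, 2, 12]

# Assumed cut-off: treat 14 days or more as a "long" stay.
groups = ["long" if stay >= 14 else "typical" for stay in actual_stays]

result = error_by_subgroup(actual_stays, predicted_stays, groups)
print("overall:", result["overall"])
for group, error in result["by_group"].items():
    print(group, error)
```

In this toy example the overall error is a couple of days, while the error for the single long stay is far larger, so the average on its own would have hidden the failure.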

Conclusion

One thing I hope to have highlighted in this article is that utility cannot be judged in the abstract, because the same dataset can be excellent for one task and useless for another. Utility is always measured with respect to a task.

The best piece of advice I could give, when considering utility, is to decide what the data is for, screen with broad checks, confirm with task-specific tests such as TSTR, then examine the subgroups the average hides. It is the same use-case-first logic that runs through my applications post. Representativeness, utility, and privacy should all be measured separately, but they are chosen together based on the task at hand. A utility score that was never tested on a real patient is essentially just the dataset marking its own homework.
