Synthetic data, simply put, is data artificially generated by an AI algorithm that has been trained on a real data set. The goal is to reproduce the statistical properties and patterns of the existing data set by modelling its probability distribution and sampling from it. The algorithm essentially creates new data that has the same characteristics as the original data, so analyses run on it lead to the same answers, but, crucially, it’s impossible for any of the original data to be reconstructed from either the algorithm or the synthetic data it has created. As a result, the synthetic data set has the same predictive power as the original data, but none of the privacy concerns that restrict the use of most original data sets.
Here’s an example: Imagine, as a simple exercise, that you are interested in creating synthetic data about athletes, specifically their height and speed. We can represent the relationship between these two variables as a simple linear function. To create synthetic data from this function, it’s easy enough to have a machine randomly generate a set of points that conform to the equation. This is our synthetic set: same equation, but different values.
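To make the athlete example concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular synthetic data product works: the six "real" athletes are made-up values, the function names fit_line and synthesize are hypothetical, and the random noise added around the line is our own assumption so that the synthetic points scatter the way real measurements would.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def synthesize(points, n, seed=0):
    """Draw n new (x, y) points that follow the line fitted to `points`."""
    rng = random.Random(seed)
    slope, intercept = fit_line(points)
    xs = [x for x, _ in points]
    # Spread of the real points around the fitted line (assumed Gaussian).
    residuals = [y - (slope * x + intercept) for x, y in points]
    noise_sd = statistics.pstdev(residuals)
    mean_x, sd_x = statistics.fmean(xs), statistics.pstdev(xs)
    synthetic = []
    for _ in range(n):
        x = rng.gauss(mean_x, sd_x)  # new height, not copied from any athlete
        y = slope * x + intercept + rng.gauss(0, noise_sd)
        synthetic.append((x, y))
    return synthetic

# Hypothetical athletes: (height in cm, sprint speed in m/s). Made-up values.
real_athletes = [(170, 8.1), (175, 8.4), (180, 8.6),
                 (185, 9.0), (190, 9.1), (195, 9.5)]
synthetic_athletes = synthesize(real_athletes, n=1000)
```

Fitting a line to synthetic_athletes should give roughly the same slope and intercept as fitting one to real_athletes, even though none of the individual values are shared.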
Now imagine you are interested in height, speed, blood pressure, blood oxygen level and so on. The data is much more complicated, representing it requires more complex non-linear equations, and we need the power of AI to help us determine the "pattern." Using the same thinking as in our simple example, one can now use the trained AI to create data points that approximate this new, more complex "pattern" it has learned, and thus create our synthetic data set.
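The same "learn the pattern, then sample from it" idea can be sketched for several variables at once. The Python below is a deliberately simplified stand-in for a trained AI model: it assumes the data follow a multivariate normal distribution, which captures only linear correlations between variables, whereas the models described above would learn non-linear patterns too. The eight records are made-up values, and the function names (mean_and_covariance, cholesky, sample_records) are hypothetical.

```python
import math
import random

def mean_and_covariance(rows):
    """Column means and sample covariance matrix of a list of records."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return means, cov

def cholesky(a):
    """Lower-triangular L with L * L^T = a (a must be positive definite)."""
    d = len(a)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(a[i][i] - s)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower

def sample_records(rows, n, seed=0):
    """Learn means and covariances from `rows`, then draw n new records."""
    rng = random.Random(seed)
    means, cov = mean_and_covariance(rows)
    lower = cholesky(cov)
    d = len(means)
    records = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(d)]  # independent standard normals
        # Multiplying by L gives the draws the learned correlations.
        records.append([means[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
                        for i in range(d)])
    return records

# Hypothetical records: height (cm), speed (m/s), systolic blood pressure
# (mmHg), blood oxygen saturation (%). Made-up values.
real_records = [
    [170, 8.1, 118, 97.0], [175, 8.4, 125, 98.5],
    [180, 8.6, 112, 96.5], [185, 9.0, 130, 99.0],
    [190, 9.1, 121, 97.5], [195, 9.5, 116, 98.0],
    [178, 8.2, 128, 96.0], [188, 9.3, 110, 98.8],
]
synthetic_records = sample_records(real_records, n=2000)
```

The synthetic records keep the averages of the originals and the relationships between columns, such as taller athletes tending to be faster, while no row is a copy of a real one. A toy model like this carries no formal privacy guarantee; it only illustrates the sampling step.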
Synthetic data is a boon for researchers. One example is what the National Institutes of Health (NIH) in the U.S. is doing with Syntegra, an IT services start-up. Syntegra is using its synthetic data engine to generate and validate a non-identifiable replica of the NIH’s database of COVID-19 patient records comprising more than 2.7 million screened individuals and more than 413,000 COVID-19-positive patients. The synthetic data set, which precisely duplicates the original data set’s statistical properties but with no links to the original information, can be shared and used by researchers across the globe to learn more about the disease and accelerate progress in treatments and vaccines.
While the pandemic has illustrated potential use cases for synthetic data in health research, we see potential for the technology across a range of other industries. For instance, in financial services, where restrictions around data usage and customer privacy are particularly limiting, companies are starting to use synthetic data to help them identify and eliminate bias in how they treat customers without contravening data privacy regulations. Retailers are beginning to recognize how they could create new revenue streams by selling synthetic copies of their customers’ purchasing behavior, which companies such as consumer goods manufacturers would find extremely valuable, all while keeping their customers’ personal details safely locked up.