We live in unpredictable times, whether in the global economy and business, the climate or geopolitics. Anticipating risks is important to humanity's future resilience. This involves making sense of increasingly complex information, aided by new tools. In this series we look at the work of researchers as they strive to make better predictions.
Across business, science and economics, including actuarial science, data has become an invaluable asset used to train algorithms and mathematical models. However, a bottleneck to innovation occurs where real data is scarce, sensitive, or biased. This is where synthetic data could make a difference.
Synthetic data is artificially generated data that mimics the characteristics of real-world information while preserving its statistical integrity. If the data contains no details identifying the insured party, it usually complies with privacy regulations and can be shared more easily.
“Synthetic data can be used to trial new systems that insurance providers may want to use prior to purchase, without disclosing confidential information. Secondly, synthetic data can augment real datasets that may be small in size, say when a provider is entering a new market and lacks sufficient data to train a model that reasonably predicts the frequency or severity of insurance claims,” explains Assistant Professor Yevhen Havrylenko from the Department of Actuarial Science at HEC Lausanne.
He adds: “An augmented dataset may help insurers better capture how different variables interact and better quantify the impact of single variables on the frequency or severity of risks. As a result, insurance companies may be able to price their products more accurately and fairly. However, augmentation does not automatically improve models; the benefit depends on the specific use case.”
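As a rough illustration of how an augmented portfolio might feed into frequency pricing, the sketch below combines a small real dataset with synthetic records and fits a Poisson claim-frequency model. The column names, the stand-in "generator" and all numbers are purely hypothetical and are not taken from the cited study.

```python
# Illustrative sketch only: the columns (age, vehicle_power, exposure,
# n_claims), the augmentation step and the figures are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import PoissonRegressor

# A small real portfolio (placeholder data).
real = pd.DataFrame({
    "age": [23, 45, 31, 52, 38],
    "vehicle_power": [6, 4, 7, 5, 6],
    "exposure": [1.0, 0.5, 1.0, 1.0, 0.8],   # policy-years
    "n_claims": [1, 0, 2, 0, 1],
})

# Suppose `synthetic` came from a generator trained on `real`;
# here we simply resample with a little noise as a stand-in.
synthetic = real.sample(n=20, replace=True, random_state=0)
synthetic["age"] += np.random.default_rng(0).integers(-2, 3, size=len(synthetic))

augmented = pd.concat([real, synthetic], ignore_index=True)

# Claim-frequency model: Poisson regression on the claim rate,
# with exposure handled through sample weights.
X = augmented[["age", "vehicle_power"]]
rate = augmented["n_claims"] / augmented["exposure"]
model = PoissonRegressor(alpha=1e-4).fit(X, rate, sample_weight=augmented["exposure"])

print(model.coef_, model.intercept_)
```

Whether such augmentation actually sharpens the fitted coefficients depends, as Havrylenko notes, on the quality of the generator and the specific use case.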
Generative AI models based on neural networks are increasingly used to create synthetic data. However, they are often ‘black box’ models, because it is difficult to understand how they generate results. Moreover, they usually require substantial preparatory work and fine-tuning for each new dataset.
Prof. Havrylenko and his co-authors found that the MICE-RF algorithm – Multiple Imputation by Chained Equations and Random Forests – is a competitive, more transparent and easier-to-use alternative to neural-network-based approaches [1].
“In our opinion, the MICE-RF methodology is less complicated, requires less preparatory work for new datasets, and is easier to use out of the box, which is especially relevant for practitioners. This is something the wider insurance community was not aware of,” details the Assistant Professor.
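For readers curious what an amputation-imputation workflow of this kind can look like in practice, here is a minimal Python sketch using scikit-learn's IterativeImputer with a random-forest estimator: real records are copied, a fraction of the entries is masked (amputation), and the gaps are then refilled by chained-equations imputation (the MICE-RF step). This is only an illustration under assumed settings, not the authors' implementation; the masking rate, estimator parameters and column names are placeholders.

```python
# Minimal sketch of the amputation-imputation idea behind MICE-RF style
# synthetic data generation. NOT the authors' implementation: the masking
# rate, estimator settings and column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Placeholder policyholder data (purely illustrative).
real = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "vehicle_power": rng.integers(3, 12, size=200),
    "claim_severity": rng.gamma(shape=2.0, scale=500.0, size=200),
})

# Step 1 (amputation): copy the real data and randomly mask a fraction
# of the entries.
amputed = real.copy().astype(float)
amputed = amputed.mask(rng.random(amputed.shape) < 0.3)

# Step 2 (imputation): refill the gaps with chained-equations imputation,
# using random forests as the conditional models (the "RF" in MICE-RF).
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
synthetic = pd.DataFrame(imputer.fit_transform(amputed), columns=real.columns)

print(synthetic.head())
```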
Havrylenko believes that the MICE-RF method may be adopted by other researchers and insurance companies over time.
“In the future, synthetic data could improve predictions in some scenarios. However, how data is generated matters. There is a debate in the insurance industry about the required explainability of models, including those that generate synthetic data. This depends on the level of model transparency. In general, regulators want more clarity to ensure insurance providers are doing the right thing and not discriminating against certain individuals,” he explains.
Havrylenko and his colleagues are studying ways to strengthen data generation, for example different data augmentation strategies, the impact of training data size, and how to encode business constraints. The aim is to help insurers better predict claim frequency and severity and thus set fairer insurance rates for customers.
References:
[1] Yevhen Havrylenko, Meelis Käärik, Artur Tuttar. Amputation-imputation based generation of synthetic tabular data for ratemaking. arXiv:2509.02171, 2 September 2025.