Synthetic Data
Learn about the applications of synthetic data, then create and evaluate synthetic data yourself, with a focus on tabular data and on GAN-like and copula techniques. You will learn best practices and how to identify situations that lead to overfitting.
Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, and InfoSpace. Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS).
Vincent has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.
In this module, we will introduce the concept of generating tabular synthetic data and discuss its use cases and methods. We will present an industry use case and demonstrate how copulas can be used to generate synthetic data.
To begin with, we will discuss the importance of synthetic data and its applications in industries such as healthcare and finance, as well as in digital twins. We will also explore different methods to generate synthetic data, including copulas.
Our industry use case will be based on an insurance dataset. We will provide a step-by-step guide to generating a synthetic version of the dataset using a copula. We will first perform an exploratory analysis of the dataset, which includes a mix of categorical, ordinal, and continuous data. We will normalize the data if necessary and identify the main groups within the dataset. Then, we will apply the copula method to each group separately.
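The copula step described above can be sketched as follows. This is a minimal illustration, not the course's implementation: it assumes a Gaussian copula with empirical marginals, handles continuous features only, and uses a hypothetical function name (`gaussian_copula_synthesize`) and made-up toy columns standing in for two features of one group.

```python
import numpy as np
from statistics import NormalDist


def gaussian_copula_synthesize(real, n_samples, seed=0):
    """Generate synthetic rows from a 2-D array of continuous features.

    Assumption: Gaussian copula plus empirical marginals (illustrative choice).
    """
    rng = np.random.default_rng(seed)
    real = np.asarray(real, dtype=float)
    n, d = real.shape
    std_normal = NormalDist()

    # Step 1: turn each feature into normal scores through its ranks.
    ranks = real.argsort(axis=0).argsort(axis=0) + 1   # ranks 1..n per column
    u = ranks / (n + 1)                                # uniforms strictly in (0, 1)
    z = np.vectorize(std_normal.inv_cdf)(u)

    # Step 2: estimate the correlation matrix in normal-score space.
    corr = np.corrcoef(z, rowvar=False)

    # Step 3: sample correlated Gaussian vectors with that correlation.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)

    # Step 4: map back to uniforms, then to each feature's empirical quantiles.
    u_new = np.vectorize(std_normal.cdf)(z_new)
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(d)]
    )


# Toy data (made up for illustration): two correlated continuous features.
rng = np.random.default_rng(1)
age = rng.uniform(18, 65, 500)
charges = 200 * age + rng.normal(0, 1500, 500)
real = np.column_stack([age, charges])
synth = gaussian_copula_synthesize(real, n_samples=1000)
```

To follow the per-group approach, the same function would be called once per group (for instance, once per category of a categorical feature) and the outputs concatenated. Because the marginals here are empirical quantiles, the synthetic values never leave the observed range of each feature.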
To further solidify your understanding of the copula method, we will provide a mini-project where you will use a different healthcare dataset to generate synthetic data.
By the end of this module, you will have a good understanding of one of the most common techniques of generative AI and how it can be used to replicate the data structure of a typical dataset. You will also be able to apply the copula method to other datasets using Python libraries.
The goal of this module is to enable students to evaluate the quality of synthetic data, refine the model, and minimize bias. We will take a deep dive into the methodology, specifically as it applies to tabular datasets.
We will produce multiple versions of the synthesized data and assess the quality of each version using metrics such as the Hellinger distance. We will discuss how to choose the most appropriate metric for the task at hand. Additionally, we will add parametric noise, both correlated and uncorrelated, to generate data outside the observed range.
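As an illustration of one such metric, the Hellinger distance between a real feature and its synthetic counterpart can be estimated from histograms. The sketch below is one possible way to do it; the function name, the choice of 20 bins, and the shared-bin-edges approach are assumptions made for this example.

```python
import numpy as np


def hellinger_distance(real, synth, bins=20):
    """Hellinger distance between two 1-D samples, using shared histogram bins.

    Close to 0 when the binned distributions match, close to 1 when they
    do not overlap at all.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)

    # Use the same bin edges for both samples so the histograms are comparable.
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    edges = np.linspace(lo, hi, bins + 1)

    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    p = p / p.sum()   # convert counts to probabilities
    q = q / q.sum()

    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


# Toy check: two samples from the same distribution versus a shifted one.
rng = np.random.default_rng(0)
a = rng.normal(0, 1, 5000)
b = rng.normal(0, 1, 5000)
c = rng.normal(10, 1, 5000)
print(hellinger_distance(a, b))   # small: same underlying distribution
print(hellinger_distance(a, c))   # near 1: almost no overlap
```

The estimate depends on the number of bins and on the sample size, so it is best used to compare several synthetic versions of the same dataset under identical settings rather than as an absolute score.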
To fine-tune model parameters, we will explore how to assess the impact of adding or removing features to reduce algorithmic bias. We will also cover automated detection of groups using an ensemble approach.
For the mini-project, students will extend the project from the previous module and apply the techniques learned here to evaluate the quality of the synthetic data generated and refine the model accordingly.
By the end of this module, students will be equipped to overcome limitations inherent to the technique and perform a full synthesis from start to finish, including assessing the quality of the results and refining the model to minimize bias.
The goal of this module is to enable students to generate synthetic data using GANs. We will begin by explaining what GANs are and why they are important in the context of generating synthetic data.
We will provide a detailed explanation of GANs, including their architecture and training process, and discuss how they can be used to generate synthetic data that closely resembles real data. Additionally, we will present an industry use case and provide all the relevant information students need to solve the use case using GANs.
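For reference, the training process of a standard GAN is usually written as a two-player minimax game. The notation below is the common textbook form and is an assumption here, since this description does not fix one: G is the generator, D is the discriminator, p_data is the distribution of real rows, and p_z is the noise distribution fed to the generator.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator is trained to raise V by telling real rows from generated ones, while the generator is trained to lower it by producing rows that the discriminator accepts as real.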
As with any technique, GANs have drawbacks that should be considered. We will discuss some of these limitations and explore how to mitigate them to better leverage GANs. We will also provide guidance on when to use GANs versus other techniques, such as copulas.
For the mini-project, students will use the same dataset as Mini-project 1 to compare the results between different techniques.
By the end of this module, students will have a good understanding of GANs and how they can be used to generate synthetic data. They will be able to apply GANs to a real-world industry use case and understand the limitations and drawbacks of using this technique. They will also have the ability to compare the results between different techniques, including copula.
The goal of this module is to provide students with an understanding of the best practices for generating synthetic data and introduce them to other useful techniques. We will begin by summarizing the methods discussed in previous modules.
We will cover data transformation and normalization, which are essential steps for generating high-quality synthetic data. We will also explore how to synthesize data outside the range of observations using quantiles of a "best fit" Gaussian or geometric distribution, depending on the feature. Additionally, we will discuss using a Gaussian mixture model (GMM) for the bimodal "charges" feature and working with non-Gaussian copulas such as the Frank copula.
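The idea of sampling from the quantiles of a fitted distribution can be sketched with the standard library alone. This example assumes a Gaussian fit by mean and standard deviation; the function name and the toy values are hypothetical and only serve the illustration.

```python
import random
from statistics import NormalDist, fmean, stdev


def sample_from_fitted_gaussian(values, n_samples, seed=0):
    """Draw synthetic values from the quantile function of a Gaussian
    fitted to `values` (fit by sample mean and standard deviation)."""
    rng = random.Random(seed)
    fitted = NormalDist(mu=fmean(values), sigma=stdev(values))
    samples = []
    for _ in range(n_samples):
        u = rng.random()
        # inv_cdf needs 0 < u < 1; random() can return exactly 0.0, so redraw.
        while u == 0.0:
            u = rng.random()
        samples.append(fitted.inv_cdf(u))
    return samples


# Toy feature values (made up for illustration).
observed = [22.1, 24.3, 25.0, 26.7, 27.5, 28.9, 30.2, 31.4, 33.0, 35.6]
synthetic = sample_from_fitted_gaussian(observed, n_samples=1000)
print(min(observed), max(observed))
print(min(synthetic), max(synthetic))
```

Empirical quantiles can only return values between the observed minimum and maximum, whereas the quantile function of a fitted parametric distribution has tails that extend beyond them. For a count-like feature, the same pattern would apply with a geometric distribution in place of the Gaussian.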
For the mini-project, students will extend their previous project to implement these best practices and techniques.
By the end of this module, students will be able to identify sources of bias and avoid situations leading to overfitting. They will have a good understanding of how to generate high-quality synthetic data by synthesizing outside the range of the real data and choosing methods appropriate to their data.
Bite-sized daily lessons that you can easily fit into your schedule. Each day, we release new lessons no longer than 15 minutes. Our lessons are carefully curated to ensure that they're both engaging and informative, allowing you to learn something new every day, and at your own pace.
Collaborate with other engineers from around the world, providing you with a unique opportunity to learn from others and build your professional network.
Our live learning sessions are designed to be interactive and engaging, giving you the opportunity to ask questions and interact with subject-matter experts.
Learn by solving real-world problems. Our courses are designed to get rid of the fluff and provide you with the most relevant information to help you apply your learning.