Synthetic data is artificially produced by algorithms to replicate the statistical characteristics of genuine data, while containing no details from actual real-world sources. Although precise figures are elusive, some estimates indicate that more than 60 percent of the data used for AI applications in 2024 was synthetic, and that share is expected to keep rising across sectors.
Because synthetic data contains no real-world information, it promises to protect privacy while lowering costs and speeding up the development of new AI models. However, using synthetic data requires careful evaluation, planning, and safeguards to prevent a loss of performance when AI models are deployed.
To delve into some advantages and disadvantages of utilizing synthetic data, MIT News interviewed Kalyan Veeramachaneni, a lead research scientist in the Laboratory for Information and Decision Systems and co-founder of DataCebo, whose open-core platform, the Synthetic Data Vault, assists users in generating and evaluating synthetic data.
Q: How is synthetic data produced?
A: Synthetic data is generated algorithmically and does not originate from real scenarios. Its value lies in its statistical resemblance to genuine data. In the case of language, for example, synthetic data reads as if a human had written the sentences. While synthetic data has been created for years, recent advances have improved our ability to build generative models from data and use them to produce realistic synthetic data. We can take a small sample of real data, build a generative model from it, and then generate as much synthetic data as we need. Moreover, the model creates synthetic data in a way that captures the underlying rules and patterns of the real data.
There are essentially four distinct data types: text, video or imagery, audio, and tabular data. Each type has slightly different methodologies for constructing generative models to create synthetic data. A large language model (LLM), for instance, serves as a generative model from which synthetic data is sampled when a question is posed.
An abundance of textual and visual data is readily accessible on the internet. Conversely, tabular data, which is gathered when we engage with physical and social systems, is often secured behind enterprise firewalls. Much of it involves sensitive or confidential information, such as customer transactions stored by a financial institution. For this category of data, platforms like the Synthetic Data Vault offer software tools that can aid in building generative models. These models subsequently produce synthetic data that safeguards customer privacy and can be shared more broadly.
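To make this concrete, here is a minimal sketch of the workflow described above, using the open-source Synthetic Data Vault (SDV) Python package. The file name, columns, and model choice are illustrative assumptions, and exact class names can differ between SDV versions.

```python
# Minimal sketch: fit a generative model to a real table, then sample
# synthetic rows. The file name and columns are hypothetical, and SDV
# class names may differ slightly between versions.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# A sample of real, sensitive tabular data that stays behind the firewall.
real_data = pd.read_csv("transactions.csv")

# Describe the table so the synthesizer knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a generative model to the real sample...
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# ...then sample as much synthetic data as needed for sharing or testing.
synthetic_data = synthesizer.sample(num_rows=10_000)
synthetic_data.to_csv("synthetic_transactions.csv", index=False)
```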
A significant advantage of this generative modeling approach for creating data is that organizations can develop a tailored, localized model specific to their own data. Generative AI automates what was previously a manual process.
Q: What are some advantages of utilizing synthetic data, and which applications is it especially well-suited for?
A: One primary application that has expanded significantly over the last decade is the use of synthetic data for testing software applications. Many software applications rely on data-driven logic, necessitating data to evaluate that software and its functionalities. In the past, individuals often relied on manually generating data, but now generative models allow us to produce as much data as we need.
Users can also generate specific datasets for application testing. For example, if I work for an online retail company, I can create synthetic data that mirrors actual customers located in Ohio who made transactions related to a particular product in February or March.
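As a hedged illustration of that kind of targeted generation, the sketch below continues the earlier SDV example and uses its conditional-sampling interface. The column names ('state', 'product_category', 'transaction_date') are hypothetical, and because conditions fix exact column values, the date range is applied as a filter afterward.

```python
# Sketch of targeted test-data generation via conditional sampling.
# Column names are hypothetical; 'synthesizer' is the fitted SDV model
# from the earlier sketch.
import pandas as pd
from sdv.sampling import Condition

ohio_electronics = Condition(
    num_rows=500,
    column_values={"state": "OH", "product_category": "electronics"},
)

test_data = synthesizer.sample_from_conditions(conditions=[ohio_electronics])

# Conditions fix exact values, so restrict to February/March with a filter.
months = pd.to_datetime(test_data["transaction_date"]).dt.month
test_data = test_data[months.isin([2, 3])]
```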
Since synthetic data is not sourced from actual events, it helps preserve privacy. A major issue in software testing has been gaining access to sensitive real data for testing in non-production environments, primarily due to privacy concerns. Another immediate advantage is in performance assessment. You can generate billions of transactions using a generative model and examine how efficiently your system handles them.
Another area where synthetic data shows considerable promise is in training machine-learning models. Occasionally, we want an AI model to assist us in predicting a less common event. For instance, a bank may wish to implement an AI model to forecast fraudulent transactions, but there may be insufficient real examples to train a model capable of accurately identifying fraud. Synthetic data provides data augmentation—additional examples that are akin to the real data—which can greatly enhance the precision of AI models.
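A hedged sketch of that augmentation idea follows: fit a synthesizer only on the scarce minority-class rows (here labeled by a hypothetical 'is_fraud' column), sample additional examples, and train on the combined data. Whether the extra rows actually improve accuracy should be checked against held-out real data.

```python
# Sketch of rare-event augmentation with synthetic minority-class rows.
# The label column, file name, and model choice are illustrative; feature
# columns are assumed numeric (real pipelines would encode categoricals).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("transactions_train.csv")
fraud = train[train["is_fraud"] == 1]  # the few real fraud examples

# Model only the minority class and sample extra, similar examples.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(fraud)
synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(fraud)
synthetic_fraud = synthesizer.sample(num_rows=5_000)

# Train on the real data plus the synthetic minority-class rows.
augmented = pd.concat([train, synthetic_fraud], ignore_index=True)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(augmented.drop(columns=["is_fraud"]), augmented["is_fraud"])
```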
Moreover, at times users lack the time or financial resources to gather all necessary data. Collecting data about customer intent, for example, would necessitate conducting numerous surveys. If you end up with insufficient data and then attempt to train a model, its performance will likely suffer. You can enhance training by incorporating synthetic data to improve model efficacy.
Q: What are some risks or potential challenges associated with synthetic data, and are there measures users can implement to mitigate those issues?
A: One of the primary concerns that people often have is: if the data is synthetically created, why should I rely on it? Assessing the reliability of synthetic data generally hinges on analyzing the overall system in which it is being used.
Several facets of synthetic data have been evaluated over time. For example, we have established methods to gauge how closely synthetic data resembles real data, as well as to assess its quality and how well it preserves privacy. However, when using synthetic data to train a machine-learning model for a novel application, additional considerations are essential. How can you ascertain that the data will yield models that still draw valid conclusions?
New efficacy metrics are emerging, with a focus on task-specific efficacy. It is crucial to analyze your workflow carefully to ensure that the synthetic data integrated into the system still enables you to draw valid conclusions. This should be approached cautiously on an application-by-application basis.
Bias is another concern. Since synthetic data is derived from a limited amount of real data, any existing bias in the original data may be replicated in the synthetic data. Just as with real data, it is necessary to intentionally eliminate any bias using varied sampling techniques that can result in balanced datasets. Careful planning is required, but you can fine-tune data generation to mitigate bias proliferation.
To assist with evaluation, our team developed the Synthetic Data Metrics Library. We were concerned that people might use synthetic data in their own settings and then reach conclusions that do not hold up in the real world, so we built a metrics and evaluation library to provide the necessary checks and balances. The machine-learning community has confronted numerous challenges in ensuring that models generalize to new scenarios. The use of synthetic data adds an entirely new dimension to that complexity.
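For readers who want to see what such checks look like in code, here is a minimal sketch that assumes the library in question is DataCebo's open-source SDMetrics package; report names, the metadata format, and method signatures may differ between versions, and the columns shown are hypothetical.

```python
# Sketch of evaluating synthetic data against real data, assuming the
# open-source SDMetrics package; columns and metadata are hypothetical.
import pandas as pd
from sdmetrics.reports.single_table import QualityReport

real_data = pd.read_csv("transactions.csv")
synthetic_data = pd.read_csv("synthetic_transactions.csv")

# SDMetrics expects a metadata dict describing each column's type.
metadata = {
    "columns": {
        "amount": {"sdtype": "numerical"},
        "state": {"sdtype": "categorical"},
        "is_fraud": {"sdtype": "categorical"},
    }
}

report = QualityReport()
report.generate(real_data, synthetic_data, metadata)
print(report.get_score())                    # overall similarity score, 0 to 1
print(report.get_details("Column Shapes"))   # per-column comparison details
```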
I anticipate that traditional methods of handling data—whether for software development, analytical inquiries, or model training—will undergo significant transformations as we become more adept at constructing these generative models. A multitude of tasks that have never before been achievable will now become possible.