Agencies Must Take an Authentic Approach to Synthetic Data

The Accenture Federal Technology Vision 2022 highlights four technology trends that will have a significant impact on how government operates in the near future. Today we look at Trend #3, The Unreal: Making Synthetic Authentic.

Artificial intelligence (AI) is one of the most strategic technologies impacting all parts of government. From protecting our nation to serving its citizens, AI has proven itself mission critical. However, at its core lies a growing paradox: the data-hungry methods that make AI powerful depend on data that is often too scarce, too sensitive, or too costly to collect.

Synthetic data, which is manufactured rather than collected but mimics the features of real-world data, is increasingly being used to meet AI’s demand for large volumes of training data. Gartner predicts that 60 percent of the data used for AI development and analytics projects will be synthetically generated by 2024.

At the same time, the growing use of synthetic data presents challenges. Bad actors are using these same technologies to create deepfakes and disinformation that undermine trust. For example, a deepfake was used to weaponize social media in the early days of the Russian-Ukrainian War, in an unsuccessful effort to sow confusion.

In our latest research, we found that by judging data based on its authenticity – instead of its “realness” – we can begin to put in place the safeguards needed to use synthetic data confidently.

Where Synthetic Data is Making a Difference Today

Government is already leveraging synthetic data to create meaningful outcomes.

During the height of the COVID crisis, for example, researchers needed extensive data about how the virus affected the human body and public health. Much of this data was being collected in patients’ electronic medical records, but researchers typically face barriers in obtaining such data due to privacy concerns.

Synthetic data made possible a wide array of COVID research: data that was artificially generated and informed by – though not directly derived from – actual patient data. For example, in 2021 the National Institutes of Health (NIH) partnered with the California-based startup Syntegra to generate and validate a nonidentifiable replica of the NIH’s extensive database of COVID-19 patient records, the National COVID Cohort Collaborative (N3C) Data Enclave. Today, N3C contains records from more than 5 million COVID-positive individuals. The synthetic data set precisely duplicates the original data set’s statistical properties but has no links to the original information, so it can be shared and used by researchers around the world working to develop insights, treatments, and vaccines.
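To make the idea concrete, the sketch below shows the core principle in miniature: fit only aggregate statistics (means, variances, correlations) to a sensitive table, then sample new records from those statistics. This is a simple multivariate-Gaussian illustration of our own, not the method Syntegra or NIH actually used, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical "real" patient table with a built-in correlation structure.
age = rng.normal(55, 15, n)
days_hospitalized = 2 + 0.08 * age + rng.normal(0, 2, n)    # older patients stay longer
oxygen_saturation = 98 - 0.05 * age + rng.normal(0, 3, n)   # and have lower saturation
real = pd.DataFrame({
    "age": age,
    "days_hospitalized": days_hospitalized,
    "oxygen_saturation": oxygen_saturation,
})

# Fit only aggregate statistics -- no individual record is retained.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample a synthetic cohort that shares the same statistical structure.
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=n), columns=real.columns)

print(real.corr().round(2))       # correlations in the real data...
print(synthetic.corr().round(2))  # ...are approximately reproduced in the synthetic data
```

Because only the fitted statistics are retained, no synthetic row corresponds to any real individual, yet the correlation structure researchers care about is approximately preserved.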

The U.S. Census Bureau has leveraged synthetic data as well. Its Survey of Income and Program Participation (SIPP) gives insight into national income distributions, the impacts of government assistance programs, and the complex relationships between tax policy and economic activity. But that data is highly detailed and could be used to identify specific individuals.

To make the data safe for public use, while also retaining its research value, the Census Bureau created synthetic data from the SIPP data sets.

A Framework for Synthetic Data

To create a framework for when the use of synthetic data is appropriate, agencies can start by considering potential use cases to see which ones align with their mission.

For example, a healthcare organization or financial institution might be particularly interested in leveraging synthetic data to protect personally identifiable information (PII).

Synthetic data could also be used to understand rare, or “edge,” events – for example, training a self-driving car to respond to an occurrence as infrequent as debris falling on a highway at night. There won’t be much real-world data on something that rare, but synthetic data can fill in the gaps.

Synthetic data could likewise be of interest to agencies looking to control for bias in their models. It can be used to improve fairness and remove bias in credit and loan decisions, for example, by generating training data that removes protected variables such as gender and race.
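As a minimal sketch of that debiasing step, the example below drops protected attributes before fitting a simple credit model. The data set and column names are hypothetical, and in practice agencies must also account for proxy variables that correlate with protected attributes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 5_000

# Hypothetical loan-application table standing in for real agency data.
applications = pd.DataFrame({
    "income": rng.normal(60_000, 20_000, n),
    "debt_ratio": rng.uniform(0, 1, n),
    "credit_history_years": rng.integers(0, 30, n),
    "gender": rng.choice(["F", "M"], n),
    "race": rng.choice(["A", "B", "C"], n),
})
applications["approved"] = (
    (applications["income"] > 50_000) & (applications["debt_ratio"] < 0.5)
).astype(int)

# Remove protected variables so the model cannot condition on them directly.
PROTECTED = ["gender", "race"]
features = applications.drop(columns=PROTECTED + ["approved"])
labels = applications["approved"]

model = LogisticRegression(max_iter=1000).fit(features, labels)
print("training accuracy:", round(model.score(features, labels), 3))
```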

In addition, many agencies can benefit from the reduced cost of synthetic data. Rather than having to collect and/or mine vast troves of real-life information, they could turn to machine-generated data to build models quickly and more cost-effectively.

In the near future, artificial intelligence “factories” could even be used to generate synthetic data. Generative AI refers to the use of AI to create synthetic data rapidly, accurately, and at great scale. It can enable computers to learn patterns from large amounts of real-world data – including text, visual data, and multimedia – and to generate new content that mimics those underlying patterns.

One common approach to generative AI is the generative adversarial network (GAN) – a modeling architecture that pits two neural networks, a generator and a discriminator, against each other. This creates a feedback loop in which the generator constantly learns to produce more realistic data, while the discriminator gets better at distinguishing fake data from real data. However, this same technology is also being used to enable deepfakes.
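The sketch below shows that adversarial feedback loop on a deliberately tiny problem: a generator learns to mimic a one-dimensional Gaussian “real” distribution while a discriminator learns to tell real samples from generated ones. It is an illustrative toy in PyTorch, not the architecture behind any production system or deepfake tool.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
LATENT_DIM, BATCH = 8, 128

# Generator maps random noise to a single synthetic value;
# discriminator outputs the probability that its input is real.
generator = nn.Sequential(nn.Linear(LATENT_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

def real_batch():
    # "Real" data: samples from N(4, 1.5) standing in for real-world measurements.
    return 4.0 + 1.5 * torch.randn(BATCH, 1)

for step in range(2000):
    # Train the discriminator: label real samples 1 and generated samples 0.
    real = real_batch()
    fake = generator(torch.randn(BATCH, LATENT_DIM)).detach()
    d_loss = bce(discriminator(real), torch.ones(BATCH, 1)) + \
             bce(discriminator(fake), torch.zeros(BATCH, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Train the generator: try to make the discriminator label fakes as real.
    fake = generator(torch.randn(BATCH, LATENT_DIM))
    g_loss = bce(discriminator(fake), torch.ones(BATCH, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# After training, the generator's output should approximate the real distribution.
with torch.no_grad():
    synthetic = generator(torch.randn(1000, LATENT_DIM))
print(f"synthetic mean={synthetic.mean():.2f}, std={synthetic.std():.2f}")  # roughly 4 and 1.5
```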

Principles of Authenticity

As this synthetic realness progresses, conversations about AI that align good and bad with real and fake will shift to focus instead on authenticity. Instead of asking “Is this real?” we’ll begin to evaluate “Is this authentic?” based on four primary tenets:

  • Provenance (what is its history?)
  • Policy (what are its restrictions?)
  • People (who is responsible?)
  • Purpose (what is it trying to do?)
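One way an agency might begin operationalizing these tenets is to require that every synthetic data set carry them as structured metadata. The schema below is an illustrative assumption of ours, not a standard defined in the Technology Vision, and the field values are examples only.

```python
from dataclasses import dataclass

@dataclass
class AuthenticityRecord:
    provenance: str       # what is its history? (source data, generation method, date)
    policy: list[str]     # what are its restrictions? (allowed uses, sharing, retention)
    people: str           # who is responsible? (owning office or data steward)
    purpose: str          # what is it trying to do? (the approved mission use case)

# Illustrative values only -- not an official record of the N3C synthetic data set.
n3c_synthetic = AuthenticityRecord(
    provenance="Generated in 2021 from N3C Data Enclave statistics; no record-level linkage",
    policy=["research use only", "no re-identification attempts"],
    people="NIH data stewardship team (example)",
    purpose="Enable worldwide COVID-19 research without exposing patient records",
)
```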

Many already understand the urgency here: 98% of U.S. federal government executives say their organizations are committed to authenticating the origin of their data as it pertains to AI.

With these principles, synthetic realness can push AI to new heights. By solving for issues of data bias and data privacy, it can bring next-level improvements to AI models in terms of both fairness and innovation. And synthetic content will enable customers and employees alike to have more seamless experiences with AI, not only saving valuable time and energy but also enabling novel interactions.

As AI progresses and models improve, enterprises are building the unreal world. But whether we use synthetic data in ways to improve the world or fall victim to malicious actors is yet to be determined. Most likely, we will land somewhere in the expansive in-between, and that’s why elevating authenticity within your organization is so important. Authenticity is the compass and the framework that will guide your agency to use AI in a genuine way – across mission sectors, use cases, and time – by considering provenance, policy, people, and purpose.

Learn more about synthetic data and how federal agencies can use it successfully and authentically in Trend 3 of the Accenture Federal Technology Vision 2022: The Unreal.

Authors:

  • Nilanjan Sengupta: Managing Director – Applied Intelligence Chief Technology Officer
  • Marc Bosch Ruiz, Ph.D.: Managing Director – Computer Vision Lead
  • Viveca Pavon-Harr, Ph.D.: Applied Intelligence Discovery Lab Director
  • David Lindenbaum: Director of Machine Learning
  • Shauna Revay, Ph.D.: Machine Learning Center of Excellence Lead
  • Jennifer Sample, Ph.D.: Applied Intelligence Growth and Strategy Lead