Agencies Must Take an Authentic Approach to Synthetic Data

The Accenture Federal Technology Vision 2022 highlights four technology trends that will have a significant impact on how government operates in the near future. Today we look at Trend #3, The Unreal: Making Synthetic Authentic.

Artificial intelligence (AI) is one of the most strategic technologies impacting all parts of government. From protecting our nation to serving its citizens, AI has proven itself mission critical. However, at its core lies a growing paradox: the data-hungry methods that make AI powerful depend on data that is often too scarce, too sensitive, or too costly to collect.

Synthetic data, which is manufactured rather than collected but mimics the features of real-world data, is increasingly being used to meet AI’s demand for large volumes of training data. Gartner predicts that 60 percent of the data used for AI development and analytics projects will be synthetically generated by 2024.

At the same time, the growing use of synthetic data presents challenges. Bad actors are using these same technologies to create deepfakes and disinformation that undermine trust. For example, a deepfake was used to weaponize social media in the early days of the Russian-Ukrainian War, in an unsuccessful effort to sow confusion.

In our latest research, we found that by judging data based on its authenticity – instead of its “realness” – we can begin to put in place the safeguards needed to use synthetic data confidently.

Where Synthetic Data is Making a Difference Today

Government is already leveraging synthetic data to create meaningful outcomes.

During the height of the COVID crisis, for example, researchers needed extensive data about how the virus affected the human body and public health. Much of this data was being collected in patients’ electronic medical records, but researchers typically face barriers in obtaining such data due to privacy concerns.

Synthetic data made possible a wide array of COVID research: data that was artificially generated and informed by – though not directly derived from – actual patient data. For example, in 2021 the National Institutes of Health (NIH) partnered with the California-based startup Syntegra to generate and validate a nonidentifiable replica of the NIH’s extensive database of COVID-19 patient records, the National COVID Cohort Collaborative (N3C) Data Enclave. Today, N3C contains records from more than 5 million COVID-positive individuals. The synthetic data set precisely duplicates the original data set’s statistical properties but has no links to the original information, so it can be shared and used by researchers around the world working to develop insights, treatments, and vaccines.
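To make the idea concrete, the sketch below shows the core principle in miniature: fit only aggregate statistics (means, variances, correlations) to a sensitive table, then sample new records from those statistics. This is a simple multivariate-Gaussian illustration of our own, not the method Syntegra or NIH actually used, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical "real" patient table with a built-in correlation structure.
age = rng.normal(55, 15, n)
days_hospitalized = 2 + 0.08 * age + rng.normal(0, 2, n)    # older patients stay longer
oxygen_saturation = 98 - 0.05 * age + rng.normal(0, 3, n)   # and have lower saturation
real = pd.DataFrame({
    "age": age,
    "days_hospitalized": days_hospitalized,
    "oxygen_saturation": oxygen_saturation,
})

# Fit only aggregate statistics -- no individual record is retained.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample a synthetic cohort that shares the same statistical structure.
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=n), columns=real.columns)

print(real.corr().round(2))       # correlations in the real data...
print(synthetic.corr().round(2))  # ...are approximately reproduced in the synthetic data
```

Because only the fitted statistics are retained, no synthetic row corresponds to any real individual, yet the correlation structure researchers care about is approximately preserved.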

The U.S. Census Bureau has leveraged synthetic data as well. Its Survey of Income and Program Participation (SIPP) gives insight into national income distributions, the impacts of government assistance programs, and the complex relationships between tax policy and economic activity. But that data is highly detailed and could be used to identify specific individuals.

To make the data safe for public use, while also retaining its research value, the Census Bureau created synthetic data from the SIPP data sets.

A Framework for Synthetic Data

To create a framework for when the use of synthetic data is appropriate, agencies can start by considering potential use cases to see which ones align with their mission.

For example, a healthcare organization or financial institution might be particularly interested in leveraging synthetic data to protect personally identifiable information (PII).

Synthetic data could also be used to understand rare, or “edge,” events – for example, training a self-driving car to respond to an occurrence as infrequent as debris falling on a highway at night. There won’t be much real-world data on something that rare, but synthetic data can fill in the gaps.

Synthetic data could likewise be of interest to agencies looking to control for bias in their models. It can be used to improve fairness and remove bias in credit and loan decisions, for example, by generating training data that removes protected variables such as gender and race.
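As a minimal sketch of that debiasing step, the example below drops protected attributes before fitting a simple credit model. The data set and column names are hypothetical, and in practice agencies must also account for proxy variables that correlate with protected attributes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 5_000

# Hypothetical loan-application table standing in for real agency data.
applications = pd.DataFrame({
    "income": rng.normal(60_000, 20_000, n),
    "debt_ratio": rng.uniform(0, 1, n),
    "credit_history_years": rng.integers(0, 30, n),
    "gender": rng.choice(["F", "M"], n),
    "race": rng.choice(["A", "B", "C"], n),
})
applications["approved"] = (
    (applications["income"] > 50_000) & (applications["debt_ratio"] < 0.5)
).astype(int)

# Remove protected variables so the model cannot condition on them directly.
PROTECTED = ["gender", "race"]
features = applications.drop(columns=PROTECTED + ["approved"])
labels = applications["approved"]

model = LogisticRegression(max_iter=1000).fit(features, labels)
print("training accuracy:", round(model.score(features, labels), 3))
```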

In addition, many agencies can benefit from the reduced cost of synthetic data. Rather than having to collect and/or mine vast troves of real-life information, they could turn to machine-generated data to build models quickly and more cost-effectively.

In the near future, artificial intelligence “factories” could even be used to generate synthetic data. Generative AI refers to the use of AI to create synthetic data rapidly, accurately, and at great scale. It can enable computers to learn patterns from large amounts of real-world data – including text, visual data, and multimedia – and to generate new content that mimics those underlying patterns.

One common approach to generative AI is the generative adversarial network (GAN) – a modeling architecture that pits two neural networks, a generator and a discriminator, against each other. This creates a feedback loop in which the generator constantly learns to produce more realistic data, while the discriminator gets better at distinguishing fake data from real data. However, this same technology is also being used to enable deepfakes.
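The sketch below shows that adversarial feedback loop on a deliberately tiny problem: a generator learns to mimic a one-dimensional Gaussian “real” distribution while a discriminator learns to tell real samples from generated ones. It is an illustrative toy in PyTorch, not the architecture behind any production system or deepfake tool.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
LATENT_DIM, BATCH = 8, 128

# Generator maps random noise to a single synthetic value;
# discriminator outputs the probability that its input is real.
generator = nn.Sequential(nn.Linear(LATENT_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

def real_batch():
    # "Real" data: samples from N(4, 1.5) standing in for real-world measurements.
    return 4.0 + 1.5 * torch.randn(BATCH, 1)

for step in range(2000):
    # Train the discriminator: label real samples 1 and generated samples 0.
    real = real_batch()
    fake = generator(torch.randn(BATCH, LATENT_DIM)).detach()
    d_loss = bce(discriminator(real), torch.ones(BATCH, 1)) + \
             bce(discriminator(fake), torch.zeros(BATCH, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Train the generator: try to make the discriminator label fakes as real.
    fake = generator(torch.randn(BATCH, LATENT_DIM))
    g_loss = bce(discriminator(fake), torch.ones(BATCH, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# After training, the generator's output should approximate the real distribution.
with torch.no_grad():
    synthetic = generator(torch.randn(1000, LATENT_DIM))
print(f"synthetic mean={synthetic.mean():.2f}, std={synthetic.std():.2f}")  # roughly 4 and 1.5
```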

Principles of Authenticity

As this synthetic realness progresses, conversations about AI that align good and bad with real and fake will shift to focus instead on authenticity. Instead of asking “Is this real?” we’ll begin to evaluate “Is this authentic?” based on four primary tenets:

  • Provenance (what is its history?)
  • Policy (what are its restrictions?)
  • People (who is responsible?)
  • Purpose (what is it trying to do?)
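One way an agency might begin operationalizing these tenets is to require that every synthetic data set carry them as structured metadata. The schema below is an illustrative assumption of ours, not a standard defined in the Technology Vision, and the field values are examples only.

```python
from dataclasses import dataclass

@dataclass
class AuthenticityRecord:
    provenance: str       # what is its history? (source data, generation method, date)
    policy: list[str]     # what are its restrictions? (allowed uses, sharing, retention)
    people: str           # who is responsible? (owning office or data steward)
    purpose: str          # what is it trying to do? (the approved mission use case)

# Illustrative values only -- not an official record of the N3C synthetic data set.
n3c_synthetic = AuthenticityRecord(
    provenance="Generated in 2021 from N3C Data Enclave statistics; no record-level linkage",
    policy=["research use only", "no re-identification attempts"],
    people="NIH data stewardship team (example)",
    purpose="Enable worldwide COVID-19 research without exposing patient records",
)
```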

Many already understand the urgency here: 98% of U.S. federal government executives say their organizations are committed to authenticating the origin of their data as it pertains to AI.

With these principles, synthetic realness can push AI to new heights. By solving for issues of data bias and data privacy, it can bring next-level improvements to AI models in terms of both fairness and innovation. And synthetic content will enable customers and employees alike to have more seamless experiences with AI, not only saving valuable time and energy but also enabling novel interactions.

As AI progresses and models improve, enterprises are building the unreal world. But whether we use synthetic data in ways to improve the world or fall victim to malicious actors is yet to be determined. Most likely, we will land somewhere in the expansive in-between, and that’s why elevating authenticity within your organization is so important. Authenticity is the compass and the framework that will guide your agency to use AI in a genuine way – across mission sectors, use cases, and time – by considering provenance, policy, people, and purpose.

Learn more about synthetic data and how federal agencies can use it successfully and authentically in Trend 3 of the Accenture Federal Technology Vision 2022: The Unreal.

Authors:

  • Nilanjan Sengupta: Managing Director – Applied Intelligence Chief Technology Officer
  • Marc Bosch Ruiz, Ph.D.: Managing Director – Computer Vision Lead
  • Viveca Pavon-Harr, Ph.D.: Applied Intelligence Discovery Lab Director
  • David Lindenbaum: Director of Machine Learning
  • Shauna Revay, Ph.D.: Machine Learning Center of Excellence Lead
  • Jennifer Sample, Ph.D.: Applied Intelligence Growth and Strategy Lead