When the Monetary Authority of Singapore (MAS)-led Veritas consortium wanted to test a new fairness assessment framework, it faced a key challenge.
The consortium knew that the most rigorous results would come from using real case studies with real data. But for obvious privacy and competitive reasons, the financial services companies involved could not share that data.
The solution? Synthetic and anonymized data. But this created a new problem: the synthetic data would need to be close enough to the real data to be useful, yet different enough to protect the privacy of the individuals involved. How could the consortium find the right balance?
The need for anonymity
This speaks to a broader problem that organizations face in many different contexts. The opportunities to get value from data are exploding, but so are the privacy concerns that come with them, especially when personal microdata are at stake.
Consider just how easy it is to identify an individual from their data. Studies have shown that 87% of the US population has characteristics that make them identifiable from only three data points: ZIP code, gender, and date of birth. The risks of disclosing personal data are therefore obvious, both for individuals and for the companies that must protect the personal data of others.
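The mechanics behind that statistic are easy to demonstrate. The toy records below are invented for illustration, but the check is the standard one: a record whose combination of quasi-identifiers is unique in the dataset can, in principle, be linked back to one person.

```python
from collections import Counter

# Toy records: (ZIP code, gender, date of birth). These values are
# invented; real microdata would carry many more attributes, but these
# three alone are often enough to single a person out.
records = [
    ("10001", "F", "1985-03-12"),
    ("10001", "F", "1985-03-12"),  # shares all three values with the row above
    ("10001", "M", "1985-03-12"),
    ("94105", "F", "1990-07-01"),
    ("94105", "M", "1962-11-23"),
]

counts = Counter(records)
unique = [r for r in records if counts[r] == 1]
print(f"{len(unique)} of {len(records)} records are unique on "
      "(ZIP, gender, DOB) and therefore potentially re-identifiable")
```

Only the first two rows protect each other; the other three stand alone and are exposed. At the scale of a national population, that is how 87% can end up identifiable from three innocuous-looking fields.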
One answer is to anonymize data. This is a good compliance strategy for companies, and it’s recommended by most data protection rules. By processing a collection of personal data for anonymization, it’s possible to irreversibly alter it to prevent straightforward identification of the individuals who contributed the data. Yet the anonymized data can still be used for larger statistical analysis. That keeps companies in compliance with data protection rules, and lets them drive value from the original data without directly “using” it.
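One common building block of such processing is generalization: coarsening the quasi-identifiers so each record blends into a crowd while aggregate statistics stay usable. The sketch below is illustrative only; production anonymization pipelines combine several techniques (generalization, suppression, noise addition) and formally verify properties such as k-anonymity.

```python
# Hedged sketch of generalization, one common anonymization step.
# Record layout (hypothetical): (ZIP code, gender, date of birth).

def generalize(record):
    zip_code, gender, dob = record
    return (
        zip_code[:3] + "**",  # truncate ZIP to its 3-digit prefix
        gender,               # keep gender as-is in this toy example
        dob[:4],              # keep only the birth year
    )

original = ("10001", "F", "1985-03-12")
print(generalize(original))  # → ('100**', 'F', '1985')
```

After this step, the record no longer points to one person in one ZIP code born on one day, but analyses by region, gender, and birth cohort still work, which is exactly the compliance-plus-value outcome described above.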
100% anonymity = 0% value?
However, anonymization creates one very particular challenge: there is always a trade-off between data privacy and data value. The "further away" from reality anonymized or synthetic data becomes, the less useful it is for analytics or for developing AI algorithms. What you gain in privacy, you lose in value.
You can’t eliminate this trade-off entirely. But where you land on the spectrum between privacy and utility will vary depending on the approach you take. It’s therefore crucial to think proportionately about anonymization strategy, considering both the nature of the data and what it will be used for, as well as the privacy risks.
This need for proportionality has been nicely summarized by Mark James Elliot:
“We can have secure houses or usable houses but not both … An absolutely secure house would lack doors and windows and therefore be unusable. But that does not mean that all actions to make one’s house more secure are pointless, and nor does it mean that proportional efforts to secure my house are not a good idea. The deadbolt on my door may not help if a burglar comes armed with a battering ram or simply smashes my living room window but that does not mean that my lock is useless, merely that it does not (and cannot) provide absolute security.”
APAT: finding a balance between privacy and utility
For businesses, the objective is to find the most acceptable balance between privacy and value when dealing with personal data. To do that effectively, they need to fully understand the extent of the trade-offs they’re making. But until now, there haven’t been any off-the-shelf assessment methods or audit tools to do this.
This is where Accenture Labs’ new Automated Privacy & Value Assessment Tool (APAT) comes in. APAT takes a dataset and evaluates both the privacy and utility of various anonymization strategies, providing a recommendation on the best option for a particular use case.
It’s fully automated and generalizable to any anonymization process.
After uploading the original dataset and the different anonymized versions they want to compare, the user is given a percentage score for each of three key metrics.
- Privacy. This shows the risk that individuals in anonymized data could be re-identified. If the risk is high, it suggests that the records have not been properly anonymized.
- Utility. This shows how useful the anonymized data will be for predictive tasks. For example, an organization may want to predict the creditworthiness of a credit applicant based on historical financial data. APAT judges utility by comparing the predictive performance of the anonymized data against the real dataset.
- Similarity. This shows how much statistical information has been preserved in the anonymized data. A high score means its analytics capabilities will be close to those of the original dataset.
The combination of these three metrics allows the user to make a more informed decision about which anonymization approach to take in each case.
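APAT’s exact methodology is not described here, so the sketch below uses deliberately simple stand-in proxies for the three metric families, on invented credit-style records, just to make the shape of the assessment concrete: an exact-match rate for privacy risk, a gap in predictive accuracy for utility, and a gap in column means for similarity.

```python
import statistics

# Hypothetical records: (income, age, defaulted). All values invented.
real      = [(50, 30, 0), (80, 45, 0), (20, 25, 1), (30, 50, 1), (90, 40, 0)]
synthetic = [(52, 31, 0), (78, 44, 0), (22, 26, 1), (31, 49, 1), (88, 41, 0)]

# Privacy proxy: share of synthetic rows that exactly reproduce a real row.
# Exact copies would leak individuals, so 0% is best on this axis.
privacy_risk = sum(row in real for row in synthetic) / len(synthetic)

# Utility proxy: does a trivial predictive rule ("default if income < 40")
# achieve the same accuracy on synthetic data as on the real data?
def accuracy(data):
    return sum((income < 40) == bool(label) for income, _, label in data) / len(data)

utility_gap = abs(accuracy(real) - accuracy(synthetic))

# Similarity proxy: how closely the column means of the two datasets match.
def col_means(data):
    return [statistics.mean(col) for col in zip(*data)]

similarity_gaps = [abs(a - b) for a, b in zip(col_means(real), col_means(synthetic))]

print(f"privacy risk: {privacy_risk:.0%}, utility gap: {utility_gap:.0%}")
print(f"per-column similarity gaps: {similarity_gaps}")
```

A real tool would use far stronger measures (for example, nearest-neighbor distances for privacy and trained-model performance for utility), but even these crude proxies show why all three numbers are needed: a dataset can score well on similarity while still copying individual rows, or protect privacy while destroying predictive signal.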
How did APAT help the Veritas consortium solve its data privacy problem?
Accenture Labs and Accenture Applied Intelligence helped the MAS-led Veritas consortium select a suitable synthetic dataset for one of its key use cases: predictive underwriting for life insurance.
To do this, different anonymized dataset versions were synthesized from the original data. This synthetic data was then tested using APAT, producing scores for the level of utility, privacy and similarity to the original dataset.
Armed with this information, the Veritas consortium was able to select the most appropriate anonymized versions of the original dataset. The result was a synthetic dataset with accuracy in line with marketplace benchmarks that could not be used to identify individuals in the seed portfolio it was generated from.
A strategic tool for all sensitive data use cases
Anonymization and synthetic data generation are promising solutions to the data privacy challenge. But it’s critical to understand the trade-offs between privacy and utility.
APAT offers a new audit capability that helps organizations make more informed decisions about anonymization strategies. It can be used in any industry use case that deals with sensitive data and the need to balance privacy with data value, and it can even be extended to questions of fairness and bias.
We’re excited about the potential of this tool to help companies manage the ethical and regulatory challenges of working with data. If you’d like to know more, or see the tool in action, contact Medb Corcoran, Managing Director, Accenture Labs. And for more on this topic, read our recent report on the business value of synthetic data, “Flipping the script on deepfake technologies.”
The authors would also like to acknowledge the efforts of Jer Hayes and Richard Vidal from Accenture Labs; and Dimitrios Vlitas, Henrietta Ridley, and Aurora Armiento from Accenture Applied Intelligence on this work.