Big tech is all in on differential privacy.
It’s a foundational concept within Google’s Privacy Sandbox; Apple applies it to the study of device diagnostic, health and web browsing data; and, just last week, Facebook used differential privacy to protect a trove of data it made available to researchers analyzing the effect of sharing misinformation on elections.
Uber employs differential privacy to detect statistical trends in its user base without exposing personal information. Amazon’s AI systems tap it to prevent data leakage. Snapchat has used differential privacy to train machine learning models. And Salesforce uses DP filters in its reporting logs.
But while differential privacy began as an academic notion that’s now being adopted by the biggies, ad tech companies need to know about it, too, and some even see it as the future of privacy protection.
What is DP?
Differential privacy was invented by Microsoft researchers in 2006 as a privacy-safe model for data analysis.
Rather than an algorithm itself, differential privacy is a mathematical guarantee that can be applied to data analysis and machine learning algorithms in order to set a limit on how much information can be extracted from data before it’s possible to draw inferences about individuals.
In other words, it introduces “plausible deniability” into a data set, said Aaron Roth, a professor of computer and information science at the University of Pennsylvania and co-author of “The Ethical Algorithm,” a treatise on the science of socially aware algorithm design.
In practice, that means the data owner purposely adds noise or randomness into a data set so that it’s possible to learn something about the population as a whole without identifying any of the individuals in it.
Consider a pollster gathering statistical information about embarrassing behavior, like drug use or cheating. To protect their privacy, respondents flip a coin before answering without revealing the result to the pollster. If the coin lands on tails, they are asked to respond truthfully. If it’s heads, they flip a second coin and answer “yes” for heads and “no” for tails. This introduces randomness, or plausible deniability, into the eventual outcomes of the study. But because the researcher knows how the errors were introduced, he or she can later work backward to systematically remove them in the aggregate and still glean something useful from the data, Roth explained.
“There is no way for me to know whether an answer is random or not,” he said. “But because I know the process by which noise is added to the response, it’s possible to subtract the noise and learn the average.”
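The coin-flip protocol Roth describes, known in the literature as randomized response, can be sketched in a few lines of Python. This is an illustration, not any company's production mechanism; the population size and 30% true rate are made-up numbers:

```python
import random

def randomized_response(truth: bool) -> bool:
    """One respondent's answer under the coin-flip protocol."""
    if random.random() < 0.5:      # first flip lands tails: answer truthfully
        return truth
    return random.random() < 0.5   # heads: a second flip decides the answer

def debias(reported_yes_rate: float) -> float:
    """Recover the true 'yes' rate from the noisy aggregate.

    Reported rate = 0.5 * true_rate + 0.25, so invert that relationship.
    """
    return 2 * reported_yes_rate - 0.5

# Simulate 100,000 respondents, 30% of whom would truthfully answer "yes".
random.seed(0)
n = 100_000
true_rate = 0.30
answers = [randomized_response(random.random() < true_rate) for _ in range(n)]
estimate = debias(sum(answers) / n)
print(f"estimated rate: {estimate:.3f}")  # lands close to 0.30
```

No individual answer can be trusted, but the aggregate estimate is accurate — exactly the tradeoff Roth describes.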
At scale, a machine learning algorithm could apply this principle to make estimates and gather information from a data set without compromising specific individuals. The caveat is that researchers need larger data sets to compensate for the deliberate randomness.
The most common use for differential privacy today is as a way to randomize large data sets so that they can be made available to researchers, such as in the Facebook misinformation example.
“In a sense, it’s about data owners protecting themselves from their partners,” said Zach Edwards, founder of analytics firm Victory Medium. “Differential privacy allows you to give people some access to data in a way that doesn’t reduce the value of your own organization – or create another mini Cambridge Analytica.”
Enter ad tech?
But why should ad tech companies care about an arcane academic concept like differential privacy?
Because it’s the future, Edwards said, whether the ad tech ecosystem wants to admit it or not. Online data collection and sharing will increasingly be mediated by browser APIs designed to limit both.
Many of the proposals within Google’s Privacy Sandbox are based on a differential privacy framework.
“There’s clearly no more room for workarounds,” Edwards said. “It’s a reality that only big companies really seem to be acknowledging, though.”
That said, differential privacy isn’t a blanket guarantee of privacy, and it doesn’t create privacy where none previously existed, Roth said. Nor can it necessarily stop privacy violations against groups of people.
For example, the fitness app Strava inadvertently revealed the locations of secret military bases when it released a seemingly benign heat map of popular running routes in 2018. No single person’s privacy was compromised, but it was still pretty damned awkward. Differential privacy wouldn’t help in a situation like that.
The level of privacy protection in an algorithm that uses differential privacy also depends on how strictly it’s deployed.
“You can dial up to perfect privacy, but then you can do almost nothing useful with the data, or you can go in the other direction and have no real protections,” Roth said. “It’s a tradeoff, because privacy protections always come with a cost.”
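In the most common formulation, the dial Roth describes is a parameter called epsilon: smaller epsilon means more noise and stronger privacy, at the cost of accuracy. A minimal sketch of the standard Laplace mechanism for releasing a noisy count, with made-up counts and epsilon values:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via the inverse CDF."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with noise scaled to sensitivity / epsilon.

    Smaller epsilon -> larger noise -> stronger privacy, worse accuracy.
    """
    return true_count + laplace_noise(sensitivity / epsilon)

# Dial epsilon up and down on the same underlying count of 1,000.
random.seed(1)
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: noisy count = {private_count(1000, eps):.1f}")
```

At epsilon near zero the released number is mostly noise (“perfect privacy, almost nothing useful”); at large epsilon it is nearly the raw count (“no real protections”).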
Still, it’s heartening to see differential privacy finally starting to be applied by the large tech firms to real-world scenarios, he said.
“For the first 10 years, differential privacy was an academic curiosity, and people like me would write papers about it that maybe five other people like me would read,” Roth said. “It’s not a silver bullet, but it’s a very good thing to see companies really starting to think about it.”