In conventional machine learning we train an algorithm on data/classes pairs (supervised learning) or only on raw data( Unsupervised learning) and make predi...
Differential Privacy Part-I: Introduction
Personal data is a personal valuable asset , it could be used for economic , social or even malicious benifits. Most internet companies survive on personal data of users in exchange of services. Usually this leads to distributing data to third party services such as advertisers which on a personal level could be a privacy violation. Data is ensured to be distributed with anonymization where identifying attributes are removed. But , this still does not ensure data is anonymous. The increasing available data online gives surface to linkage attacks which involves reidentifying a user from the dataset with common attributes across datasets. One such popular attack is the Netflix attack. Usually third party services such as advertisers don't need personal identifying information , but just enough statistical representation of information. This motivates analysis which reflects the global dataset population , but does not capture any personally identifying information. Differential Privacy (DP) is based on Fundamental Law of Information Recovery which states that "overly accurate answers to too many questions will destroy privacy in a spectacular way".
DP is one take on privacy which ensures that no single individual is affected for being a part of the dataset or data collection process. It not only protects the individuals from linkage attacks but also provides a plausible way of quanitifying privacy loss.
To make it more intuitive as to why privacy is important,let us take an small example. Say a small town of 10,000 people has a total income of 2000 million. If the next time you calculate this number and it ends up reducing to 1900 million because there is one person lesser than before. Using auxillary data sources you could find out who left the town from which isn't hard to infer that their income was 100 million. This could be discouraging for participants to participate in such statistics as it are leaks such sensitive information. These are known as differencing attacks. Note that in practice differential privacy is used for large datasets and does not work for smaller datasets.
Now let us formalize the above example in terms of querying datasets. The above example is a single example of two queries A summation query with all rows of a dataset and a summation query without a single participant. The difference resulting in leakage of a indvidual's data.The dataset with the indvidual is D1 and without individual is D2. One way to solve the above problem is to ensure that the output of a given algorithm or query such as above does not change with inclusion/exclusion of any individual. So the output of query sum(Income|D1)should be rougly equal or close to sum(Income|D2). That is the statistics or analyticals is not affected by an single individual but should reflect the overall population. So how could we possibly mitigate this? we could add random noise to both D1 and D2 such that the difference does not leak in individuals data that is the outputs are as close as possible. Its not necessary that the adversery specifically phrases queries in order to marginalize the individual. It can so happen that the resultant of certain queries could result in violation of privacy or that the analyst is already aware of some entries in the dataset.
Given the above background as to why we need to protect privacy when analyzing datasets , let us define what DP is. DP defines that the probability of a given output of a mechanism on D1 is almost equal or close to the output for a given mechanism of D2 by a factor of exp(epsilon).
The above equation is the foundation of Differential privacy. We charecterize this by DP(,0) The delta term gives the tail bound of probability distribution which can ignore for a while , if you are curious you can check this out.
To summarize, a user's privacy is violated if more information is learnt about a given user when the analysis is performed. If a general inference is made about the population and the same applies to the specific user the users privacy is not violated. For instance if in the previous example the average income would be 0.2 million. It is likely that anybody belonging to the town would have a income around the range. This is a general inference and does not violate any single persons privacy.
Data collection and analysis must be limited to representing statistics of entire population but in a way that individuals confidentiality is maintained. You can get complete privacy by absolutely outputting junk or get complete utility for a certain task by outputting raw data. So there is a tradeoff between the two. Differential Privacy is not a direct solution to privacy but a mathematical theory to quantify and guide development of private algorithms. Like Asymptotic analysis by itself is not an algorithm by itself but a theory to guide development of algorithms. The reason why it is important to quantify utility is that the adversery interacts with the database in the same manner as the analyst who could benifit from the database.
While , DP can seem very pessimistic or unnecessary , the issues DP can quantify and guide to solve scale beyond just basic population statistics. Even training Neural Networks. What are the possible harms of neural networks ? Certain approaches could be used to reconstruct neural network training data. Think about it this way , you are using a NLP service of predicting next word by a MLAAS(Machine Learning As a Service) provider such as Amazon AWS or Microsoft Azure. You could possibly get it to predict a next word that was supposed to confidential or a facial recognition algorithm service where you could show it images of similar people to get an idea of target people to know if they belonged to the training data.
Further , differential privacy does not aim to guarantee protection of most individuals data but designed to guarantee every individual's data. The few people with interesting data do form the most interesting insights in analysis and needs to be protected.This introduces another term used to charecteristic a database called sensitivity. This can be given
What it intuitively means is that how likely the value of outcome would change if given person was not in the database. Similar to epsilon definition privacy but here we are taking in terms absolute values and not probabilities.Sensitivity is a term that is dependent on the database.
A major guarantee of DP is that it protects individuals data from auxillary information. That is if the data from a given database is combined with data from other database , the data of the individuals still remains private.Check out Part II of the tutorials where I will cover some basic DP mechanisms
I thank my friend Shravan Sridhar for reviewing and helping me improve this article
Conventional machine learning required the practitioner to manually look at images/text and handcraft appropriate features. Deep Learning models are powerful...
Having gone through the importance of differential privacy and its definition , this article motivates the theory with a practical example to make it more in...
Personal data is a personal valuable asset , it could be used for economic , social or even malicious benifits. Most internet companies survive on personal d...