Analyze Private datasets using Pandas

Conventionally, pandas lets you analyze datasets that are available locally on your machine, that is, datasets you have been given direct access to. But there are valuable and informative datasets that contain Personally Identifiable Information (PII), which you won't be able to access or query directly. In some industries, such as banking and healthcare, privacy compliance requirements prevent data from being moved. With the GreyNSights framework you can still use pandas to analyze and transform sensitive datasets. GreyNSights is strictly based on a client-server implementation. The data owner hosts the dataset, which starts a server that listens for requests. The data analyst connects to the data owner using the GreyNSights client scripts and can interactively query the dataset remotely. With GreyNSights the data owner doesn't have to move the dataset off their premises, and the analyst can still analyze it. So it's a win-win situation for both: analysts get access to datasets that are otherwise difficult to obtain, and data owners maintain the privacy of the individuals in the dataset.

Repository: Link

Design Principles

Some other approaches to analyzing and querying privacy-sensitive datasets use anonymization techniques to obfuscate or remove PII and then share the dataset. While this seems like a promising approach, it can lead to linkage attacks: analysts can re-identify anonymized rows by linking them with other data sources. One well-known example is the Netflix attack [1]. The Netflix dataset was anonymized and posted publicly, yet the identities of users could be recovered by linking it with the publicly available IMDb dataset.
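To make the linkage idea concrete, here is a minimal sketch in plain Python. All records, names, and the `link_records` helper are made up for illustration; this is not the method from [1], only the basic intuition of joining an "anonymized" table with a public one on shared attributes (here, movie and date).

```python
# Toy linkage attack: every record below is invented for illustration.
anonymized_ratings = [
    {"user_id": "u1", "movie": "Movie A", "date": "2005-03-01", "rating": 5},
    {"user_id": "u1", "movie": "Movie B", "date": "2005-03-04", "rating": 2},
    {"user_id": "u2", "movie": "Movie A", "date": "2005-06-10", "rating": 3},
]

public_reviews = [
    {"name": "Jane Doe", "movie": "Movie A", "date": "2005-03-01"},
    {"name": "Jane Doe", "movie": "Movie B", "date": "2005-03-04"},
    {"name": "John Roe", "movie": "Movie C", "date": "2005-07-22"},
]


def link_records(anonymized, public):
    """Count, per (user_id, name) pair, how many (movie, date) pairs coincide."""
    # Index the public table by the attributes both tables share.
    public_index = {}
    for row in public:
        public_index.setdefault((row["movie"], row["date"]), []).append(row["name"])

    matches = {}
    for row in anonymized:
        for name in public_index.get((row["movie"], row["date"]), []):
            key = (row["user_id"], name)
            matches[key] = matches.get(key, 0) + 1
    return matches


matches = link_records(anonymized_ratings, public_reviews)
print(matches)
```

The more shared attributes coincide for a pair, the more likely the anonymous `user_id` belongs to the named person, even though no name was ever published in the anonymized table.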

Anonymization is therefore not the best way to attain privacy. This motivates the idea of keeping the PII in the dataset while making sure the analyst never directly sees the rows that contain it. Even when rows are hidden, however, analysts can still recover the value of an individual row using a differencing attack, in which they run two queries: one over the whole dataset and another that excludes a given data point. For example, one query calculates a sum that includes the data point and the other calculates the sum without it. Subtracting one result from the other gives the value of the sensitive row. This can be mitigated by returning differentially private answers rather than exact answers. Differentially private results are query results to which well-calibrated noise has been added, so that the result of a query is almost the same whether or not a given data point is present. I have explained Differential Privacy in detail in my blog posts DP Intro and DP Mechanisms.
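The differencing attack and its mitigation can be sketched in a few lines of plain Python. This is an illustrative sketch, not the GreyNSights or PyDP implementation: the data, the helper names `laplace_noise` and `private_sum`, the value of `epsilon`, and the assumed bound of 100 on each value are all assumptions made for the example.

```python
import random

carrots_eaten = [12, 53, 87, 4, 100]  # made-up sensitive column
target = 2                            # index of the row under attack

# Differencing attack on exact answers: two sums, one without the target row.
with_target = sum(carrots_eaten)
without_target = sum(v for i, v in enumerate(carrots_eaten) if i != target)
recovered = with_target - without_target  # exactly carrots_eaten[target]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_sum(values, epsilon, upper_bound, rng):
    """Noisy sum; assumes each value lies in [0, upper_bound].

    Adding or removing one row changes the sum by at most upper_bound,
    so the Laplace scale is upper_bound / epsilon.
    """
    clipped = [min(max(v, 0), upper_bound) for v in values]
    return sum(clipped) + laplace_noise(upper_bound / epsilon, rng)


rng = random.Random(0)
rest = [v for i, v in enumerate(carrots_eaten) if i != target]
noisy_with = private_sum(carrots_eaten, epsilon=0.5, upper_bound=100, rng=rng)
noisy_without = private_sum(rest, epsilon=0.5, upper_bound=100, rng=rng)

print("exact difference:", recovered)
print("noisy difference:", noisy_with - noisy_without)
```

With exact answers the subtraction reveals the row; with noisy answers the difference is dominated by the noise, so it no longer pins down the individual value.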

The fundamental design principles of the framework are:

  • No raw data is exposed, only aggregates

    The analyst can query and transform the dataset however they want, but can only get aggregate results back.

  • The aggregates and analysis do not leak any information about individual rows

    The aggregate results are differentially private, which protects individual rows from differencing attacks.

  • Pandas' capabilities for transforming and processing datasets are preserved

    The analyst might have to add a few lines of code to initialize the setup with the data owner, but otherwise they use the same pandas syntax, so anybody who already knows pandas can use the framework without having to learn anything more.

Pointers

Every query executed remotely gives the analyst a pointer object. This pointer is a reference to the actual object, which lives on the data owner's server. A pointer can be transformed and used in the same way as the underlying object. When an operation is performed on a pointer, the transformation occurs remotely on the object it points to. The analyst can fetch the actual result behind a pointer by calling get() on it, but the result is returned only if it is an aggregate. The analyst is restricted from looking at the rows or columns of the dataset.
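The pointer idea can be illustrated with a small in-process toy. This is only a sketch of the concept, not how GreyNSights implements it: `RemoteStore` and `ToyPointer` are hypothetical names, there is no networking, no differential privacy noise is added, and "aggregate" is crudely approximated as "a single number".

```python
class RemoteStore:
    """Stands in for the data owner's server; it holds the real objects."""

    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def register(self, obj):
        object_id = self._next_id
        self._next_id += 1
        self._objects[object_id] = obj
        return ToyPointer(self, object_id)

    def apply(self, object_id, func):
        # The operation runs where the data lives; only a new pointer is returned.
        return self.register(func(self._objects[object_id]))

    def fetch(self, object_id):
        obj = self._objects[object_id]
        # Release only scalar aggregates, never rows or columns.
        if isinstance(obj, (int, float)):
            return obj
        raise PermissionError("only aggregate results can be retrieved")


class ToyPointer:
    """Reference to an object that stays inside the RemoteStore."""

    def __init__(self, store, object_id):
        self._store = store
        self._object_id = object_id

    def __getitem__(self, column):
        return self._store.apply(
            self._object_id, lambda rows: [row[column] for row in rows]
        )

    def sum(self):
        return self._store.apply(self._object_id, sum)

    def get(self):
        return self._store.fetch(self._object_id)


store = RemoteStore()
rows = [
    {"animal": "rabbit", "carrots_eaten": 12},
    {"animal": "horse", "carrots_eaten": 53},
]
df_pt = store.register(rows)

total = df_pt["carrots_eaten"].sum().get()  # aggregate: allowed

try:
    df_pt["carrots_eaten"].get()            # raw column: refused
    column_exposed = True
except PermissionError:
    column_exposed = False

print(total, column_exposed)
```

Every intermediate step (`df_pt["carrots_eaten"]`, `.sum()`) yields another pointer; only the final `get()` decides whether a value may leave the store.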

Simple Example

Below is a simple example in which a data owner hosts a dataset on animals and carrots and the analyst queries it remotely. The example reproduces the official example of PyDP, the underlying differential privacy package. You can view the example in the official repository.

Data Owner

import pandas
from GreyNsights.analyst import Pointer
from GreyNsights.host import Dataset, DataOwner
from GreyNsights.config import Config

dataset = pandas.read_csv(
    "animals_and_carrots.csv", sep=",", names=["animal", "carrots_eaten"]
)


owner = DataOwner("Bob", port=6544, host="127.0.0.1")

config = Config(owner)
config.load("test_config.yaml")

dataset = Dataset(owner, "Sample Data", dataset, config, whitelist={"Alice": None})
dataset.listen() 

Analyst

The script below lets the analyst connect to the remote data source.

import GreyNsights
from GreyNsights.analyst import DataWorker, DataSource, Pointer, Command, Analyst
from GreyNsights.frameworks import framework

# The analyst identity. Currently just a placeholder and non-functional, but in
# the future it could allow the analyst to identify with an X.509 certificate.

identity = Analyst("Alice", port=65441, host="127.0.0.1")

#The details of remote datasource
worker = DataWorker(port=6544, host="127.0.0.1")

#The dataset hosted by remote datasource. Currently supports only 1 datasource. 
dataset = DataSource(identity,worker, "Sample Data")

#Initialization pointer
dataset_pt = dataset.init_pointer()

frameworks = framework()

#Initialize GreyNSights version of Pandas
pandas = frameworks.pandas

Once the data owner has started the dataset server and it is listening for requests, the data analyst can request the config from the data owner to understand the level of privacy the data owner has enforced.

config = dataset.get_config()
print(config)

To get an initialization pointer, the analyst first has to approve the config.

dataset_pt = config.approve().init_pointer()

The next line shows that the GreyNSights version of pandas can be used like ordinary pandas. The dataset is already a dataframe.

df = pandas.DataFrame(dataset_pt)

This command displays the columns of the dataset.

df.columns 

The next command executes describe remotely, which returns a pointer; calling get() fetches the underlying value. Describe is an aggregate query, so the analyst is permitted to see the result.

df.describe().get() 

The following commands calculate the mean, the sum, and the number of rows where carrots_eaten is greater than 70.

df['carrots_eaten'].mean().get() 
df['carrots_eaten'].sum().get()
(df['carrots_eaten']>70).sum().get()



Analyze Multiple Parties Datasets (Federated Analytics)

We can query multiple datasets as if they were a single dataset, while ensuring that the analyst is exposed only to aggregate values across all parties and never to any individual party's values. This method is known as Federated Analytics. Currently the framework supports only counts, approximate standard deviation, mean, and sum.

Data Owner

The code below launches a data owner server. Host several data owners in the same way, each with a different port and dataset name.

import pandas
from GreyNsights.host import Dataset, DataOwner
from GreyNsights.config import Config

dataset = pandas.read_csv("week_data.csv")

owner = DataOwner("Bob", port=65444, host="127.0.0.1")

config = Config(owner)
config.load("test_config.yaml")

dataset = Dataset(owner, "Sample Data1", dataset, config, whitelist={"Alice": None})
dataset.listen()

Analyst

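from GreyNsights.analyst import Analyst, DataSource, DataWorker

# WorkerGroup is used below, but the original example does not show its import.
# Assumption: it is importable from GreyNsights.analyst like the other classes.
from GreyNsights.analyst import WorkerGroup

identity = Analyst("Alice", port=65441, host="127.0.0.1")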

# Initialize DataOwner1
worker1 = DataWorker(port=65444, host="127.0.0.1")
dataset1 = DataSource(identity, worker1, "Sample Data1")
config1 = dataset1.get_config()
dataset1 = config1.approve().init_pointer()

# Initialize DataOwner2
worker2 = DataWorker(port=65446, host="127.0.0.1")
dataset2 = DataSource(identity, worker2, "Sample Data2")
config2 = dataset2.get_config()
dataset2 = config2.approve().init_pointer()

# Initialize DataOwner3
worker3 = DataWorker(port=65442, host="127.0.0.1")
dataset3 = DataSource(identity, worker3, "Sample Data3")
config3 = dataset3.get_config()
dataset3 = config3.approve().init_pointer()

# Create a workergroup to which commands to all dataowners are executed together
group = WorkerGroup(identity)
group.add(dataset1, worker1, config1)
group.add(dataset2, worker2, config2)
group.add(dataset3, worker3, config3)

pt = group.init_pointer()

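# Perform queries on all three workers together.
# Operations on pointers can be combined; nothing is fetched until get() is called.
er = pt["Money spent (euros)"].sum() + pt["Money spent (euros)"].sum()

# Fetch the aggregate sum across all three data owners
er = pt["Money spent (euros)"].sum().get()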

print(er)

When get() is called on the worker group pointer, the data owners communicate with each other and create shares. The analyst then adds the exchanged shares to obtain the overall sum. This ensures that the analyst never sees the individual values of each data owner and gets only the aggregate over all data owners. The underlying protocol is called Secure Aggregation [2]. Secure aggregation works only for queries that can be built from sums, such as sum, mean, and (approximate) standard deviation.
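The share-based idea can be sketched with plain additive secret sharing. This is a simplified illustration under stated assumptions, not the GreyNSights code and not the full protocol from [2] (which uses pairwise masks and handles dropped parties): the helper names, the modulus, and the local sums are all hypothetical.

```python
import random

MODULUS = 2**31 - 1  # assumed modulus; all shares are integers modulo this value


def make_shares(secret, n_parties, rng):
    """Split `secret` into n additive shares that sum to it modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def secure_sum(local_values, rng):
    """Return the total and the partial sums that the analyst actually sees."""
    n = len(local_values)
    # Row i holds the shares created by owner i; share j is sent to owner j.
    share_matrix = [make_shares(value, n, rng) for value in local_values]
    # Each owner adds up the shares it received and reports only that number.
    partial_sums = [
        sum(share_matrix[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # The analyst adds the reported partial sums to obtain the overall total.
    return sum(partial_sums) % MODULUS, partial_sums


rng = random.Random(42)
local_sums = [120, 75, 310]  # made-up per-owner sums of "Money spent (euros)"
total, partials = secure_sum(local_sums, rng)

print("what the analyst receives:", partials)
print("aggregate:", total)
```

Each reported partial sum is masked by random shares from the other owners, so on its own it says nothing about any single owner's value; only the final total is meaningful. Because the shares combine by addition, this only works for queries that reduce to sums.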

References

  1. A. Narayanan and V. Shmatikov. Robust De-anonymization of Large Sparse Datasets (How to Break Anonymity of the Netflix Prize Dataset). In Proceedings of the IEEE Symposium on Security and Privacy, 2008.
  2. K. A. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical Secure Aggregation for Federated Learning on User-Held Data. NIPS Workshop on Private Multi-Party Machine Learning, 2016.
