Scalable Machine Learning with Fully Anonymized Data
Note: This article will likely be revised and expanded before being submitted for review and publication. At the moment it is missing critical sections that will be added later. If you have suggestions for improvement, please send them to me directly.
Abstract
In this article I will discuss the well-known technique of feature hashing, with the modification of performing the hashing step on the client side before sending data to a server or daemon that performs model training and prediction. With this approach, the system performing the training can have no knowledge of the underlying data it receives, since the learning takes place using only the hashed representation of the data. Because the hash values are not reversible, data can be shared across business units and regulatory jurisdictions while maintaining privacy and security. Additionally, this approach has well-documented scalability properties, and in our testing it can perform model fitting and classification tasks at a rate of over 350,000 records per second on a commodity laptop.
Introduction
Sharing data between entities for any reason, including for the purposes of doing machine learning, presents a variety of ethical, organizational, and legal issues. Depending on the informed consent of the user, producer, or owner of the data, some of the data may not be shareable outside of the unit that has permission to collect it. Inside an organization there are often a variety of reasons why data cannot be shared between different business units, subsidiaries, or teams. Lastly, there are often legal frameworks, such as the General Data Protection Regulation (GDPR) in the EU, that prevent certain entities from sharing data with others or from sending that data to other countries. Similarly, in Indonesia, regulations restrict the transfer of data out of the country. For these reasons, it is necessary to consider methods of performing machine learning in situations where the underlying data cannot be shared with or accessed by the party actually training the models.
Consider a company headquartered in the EU, with a subsidiary or affiliate in Indonesia. The company is quite large and has a centralized Data Science department located in its EU office. The Indonesian office produces valuable data that must be analyzed and can be used for predictions, but it does not have staff with the background or requisite skills in machine learning or computing to perform such an analysis. Due to regulatory constraints, the data on the systems in Indonesia cannot be sent to the existing EU systems for data ingestion and model training.
We start with the realization that entities often hash parts of their data sets in order to provide some level of protection and compliance, but this partial hashing is frequently ineffective, due to problems such as data leakage and the ability to brute-force original data from partially anonymized data sets. A memorable example of the latter is the Netflix Prize competition, in which researchers were able to cross-reference non-anonymized portions of the Netflix data with publicly available data in order to uniquely identify users. They describe their approach in the paper Robust De-anonymization of Large Sparse Datasets.
The so-called hashing trick is an established way of processing data as part of training a machine learning model. The typical motivation for using the technique is a reduction in memory requirements or the ability to perform stateless feature extraction. The commonly cited disadvantage of this approach, namely that the hashing is not reversible and therefore the model cannot be interpreted from feature values alone, becomes an advantage in the case where we do not want any such interpretation to take place at all. While feature hashing is ideally suited to categorical features, it also works well empirically on continuous features.
In order to accomplish our goal of machine learning on fully anonymized data, we apply the hashing portion of the hashing trick on a separate system from the system ingesting the training examples, and then build the model and calculate predictions on the latter. The effect is that the system processing the data knows nothing about it: not the feature names, not the label names, and not what any of them represent. The processing system receives only a list of integers and a label of 0 or 1. By decoupling the hashing step from the training system, we obtain the following benefits:
- We can produce a model with the same accuracy as if it were trained on non-anonymized data.
- We reduce or eliminate the possibility of recovering the underlying data from the hashed training examples.
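To make the contract between the two systems concrete, a single anonymized training example arriving at the processing system might look like the sketch below. The field names and index values are hypothetical illustrations, not part of any particular wire format.

```python
# A hypothetical anonymized record as received by the training system:
# nothing but hashed feature indices and a binary label.
record = {
    "features": [620516, 87341, 994023],  # hashed feature name/value pairs
    "label": 1,                           # 0 or 1 for a binary classification task
}
```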
Sharing is caring
The problem of acquiring data for training machine learning models in a way that fits both ethical and legal standards is not new. However, recent changes in public perception after scandals and data breaches involving companies like Cambridge Analytica and Facebook, Equifax, and others have caused both individuals and regulatory agencies to examine the problem more closely.
In Europe, this close examination has been going on for an even longer period of time, and has started to have tangible, and many would argue hugely beneficial, effects in the form of the General Data Protection Regulation (GDPR).
With the growing awareness that collecting and sharing data is perhaps at least as much of a liability as a benefit, more work needs to be done to determine ways to safely share data. Data must pass between organizational units or regulatory environments in a way that is compatible with data protection legislation and also takes care to protect the privacy of those who supply the data.
This problem of data transfer and sharing can even take place between two different units of the same company. This was the case for one of my clients who had global offices, including some in Indonesia. They wanted to have their fraud detection work done in the EU, but could not move data out of their Indonesian subsidiary due to the data regulations in that country.
For reasons such as these, there is a critical need for a method of training machine learning models without access to any of the underlying data.
Rehashing old techniques
Hash functions and feature hashing
For our purposes, a hash function is a function that receives input data of varying size and produces output of a fixed size, with the additional properties that the output values are relatively resistant to collisions and computationally difficult to invert. In other words, if we hash someone’s name, we should receive some value that a) isn’t likely to be a value we would get from hashing a different name and b) cannot be reversed in order to obtain the name that was supplied as input to the hash function. Common hash functions include cryptographic hash functions such as MD5, SHA-1, and SHA-256. Cryptographic hash functions are designed in such a way that it is especially difficult to recover the original data by reversing the hash. For our purposes, cryptographic hash functions are not required. While they do have security benefits, they also come with additional computational costs.
As an example of turning a feature name and feature value into an integer representation of the underlying data, we might have a feature in our data like firstName that has a value of Adam. We can concatenate these two values and hash the result in order to obtain an integer representation of this unique feature name and feature value combination. This would map something like firstName-Adam to the resulting integer 620516.
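As a minimal sketch of this step, the following Python function hashes a feature name and value into a bucket index. MD5 is used here only because it is deterministic and widely available; the bucket count and the function name are illustrative assumptions rather than part of any particular library.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # size of the weight vector; an illustrative choice

def hash_feature(name: str, value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a feature name/value pair such as firstName-Adam to an index in [0, num_buckets)."""
    key = f"{name}-{value}"
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_buckets

# A deterministic integer index; the original value never needs to leave the client.
print(hash_feature("firstName", "Adam"))
```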
Now we have a reasonably unique integer representing our particular feature name and value combination. If we were also using a machine learning model that has a weight vector on a per-feature basis, then we could use this integer as the index to refer to that particular weight in the weight vector. We could then modify weights as appropriate as part of, for example, building a logistic regression model.
Taken together, this technique of applying a hash function to incoming features in order to go directly to their corresponding location in an array of weights is referred to as feature hashing or the hashing trick.
Additionally, since this process can be performed on individual records, and does not depend on any records that come before or after in the data set, it is ideally suited to a process that benefits from stateless feature extraction like Online Stochastic Gradient Descent (OSGD). Because the feature extraction process is stateless, it takes trivial effort to parallelize, resulting in extremely high-performing systems. For more detail on performance, see HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent.
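To make the weight-update step concrete, here is a minimal sketch of an online logistic regression update over hashed indices. It assumes binary features with an implicit value of 1.0, a fixed learning rate, and the same bucket count as the hashing sketch above; none of these specifics are prescribed by the technique itself.

```python
import math

NUM_BUCKETS = 2 ** 20          # must match the bucket count used on the client side
LEARNING_RATE = 0.1            # illustrative value
weights = [0.0] * NUM_BUCKETS  # one weight per hash bucket

def predict(hashed_features: list[int]) -> float:
    """Estimated probability that an example belongs to class 1."""
    score = sum(weights[idx] for idx in hashed_features)
    score = max(min(score, 35.0), -35.0)  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-score))

def train_one(hashed_features: list[int], label: int) -> None:
    """One online SGD step for logistic regression with log loss."""
    error = predict(hashed_features) - label      # gradient of log loss w.r.t. the score
    for idx in hashed_features:
        weights[idx] -= LEARNING_RATE * error     # each hashed feature has implicit value 1.0
```

Because each update touches only the buckets present in a single record, the trainer never needs feature names, and updates for different records can be applied independently, which is what makes HOGWILD!-style parallelization straightforward.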
Feature hashing is usually done on the same machine where the model is being trained, and the hashing is often part of the step wherein the weights of the model are updated. However, feature hashing is a completely discrete step and could be performed on a separate machine before the data is supplied to the machine where the model training is taking place!
Furthermore, as demonstrated in the paper Feature Hashing for Large Scale Multitask Learning, even when hashing introduces collisions between features, their effect on the resulting accuracy of the model is in most cases negligible, especially in multi-class classification scenarios. For this reason, using the hashed representation of the data does not result in an appreciable decrease in the effectiveness of our machine learning model.
Old dogs and new tricks
Consider again the example from the previous sections of two business units, which we will call Company A and Company B. Both belong to the same company but exist in different regulatory jurisdictions. Company A collects the data, while Company B has all of the analytics expertise and actually builds the models that detect fraud in the data collected by Company A.
By using the approach we outlined above, Company A can simply provide to Company B the hashed representation of the data along with a label like 0 or 1 (in the case of a binary classification problem). This allows Company B to do all the same machine learning work, model training, and so on, but without ever receiving any of the non-anonymized data.
Another way to accomplish the same goal would be for Company B to provide an API that accepts hashed data plus a label indicating its class. The API request would then be fed to the model for learning. Additionally, the API can expose an endpoint where Company A can provide only the hashed features and obtain a prediction of the resulting class membership.
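As an illustration only, such an API could be sketched with a small Flask application. The endpoint names and payload fields are assumptions, and train_one and predict refer to the online logistic regression sketch from the earlier section rather than to any existing service.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/train", methods=["POST"])
def train_endpoint():
    """Accept one anonymized example: hashed feature indices plus a 0/1 label."""
    payload = request.get_json()
    train_one(payload["features"], payload["label"])  # online update from the earlier sketch
    return jsonify({"status": "ok"})

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    """Return a class-membership probability for a list of hashed feature indices."""
    payload = request.get_json()
    return jsonify({"probability": predict(payload["features"])})
```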
Related work
Fully hashing data on the client side is not the only approach to the problem of securely sharing data for use and processing by others. There are other techniques that can provide better security protections, and still other approaches that do not anonymize the data per se, but rather add small, carefully chosen amounts of variation to the data so as to make it difficult to attribute any record to one individual.
Homomorphic Encryption for Statistical Machine Learning is one such promising approach. In a homomorphic encryption scenario, operations are performed on the data after it has been encrypted, without any decryption taking place. The benefit is that the data processor never sees unencrypted data, and the encryption schemes used can be more secure than anonymizing the data with a hash function. The drawback is that performing operations on homomorphically encrypted data can be very slow; the linked paper describes one example in which multiplication operations are limited to approximately 50 per second. Unfortunately, this performance limitation makes the approach impractical for workloads like ours at this time.
Differential Privacy is another interesting approach to the question of obtaining data in a privacy-friendly way. In this approach, data is queried from some kind of repository. The response from the repository has some amount of random noise produced by a chosen distribution. In this way, the data returned are, in some sense, lower resolution than their original representation. While differential privacy approaches are an interesting method for making progress on the problem of preserving privacy while still sharing data, they do not fully anonymize the data as does our decoupled hashing approach.
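For comparison, the core mechanism behind many differential-privacy systems can be sketched in a few lines: a counting query is answered with Laplace noise added. The epsilon value and the assumption that the query has sensitivity 1 are illustrative choices, not taken from any specific system.

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float = 0.1) -> float:
    """Answer a counting query with Laplace noise drawn at scale 1/epsilon."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)
```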
Mohassel and Zhang present an approach in their paper SecureML: A System for Scalable Privacy-Preserving Machine Learning that uses a specially-designed communication protocol to share data between two actors. However, this approach requires coordination between the parties in terms of data processing and availability that our approach does not.
Conclusion
The problems of sharing data across regulatory frameworks, and of the ethical use and privacy protection of data provided by users, are increasingly common in today's technology companies. By making use of feature hashing techniques and decoupling the hashing and training components of a machine learning system, we can provide hashed, and therefore fully anonymized, data to a third party for the training of machine learning models. The third party need have no knowledge of what the underlying data is or what it represents, to the great benefit of the privacy and security of our end users.