class: center, middle
# Machine Learning with Fully Anonymized Data

## Toronto Machine Intelligence - 20171214

Adam Drake

@aadrake

http://aadrake.com

---
class: center, middle
# @aadrake #hacker

# @aadrake #thoughtleader

---
class: middle
# Overview

1. Background
1. Problem setup
1. Learning process
1. Outcomes
1. Bonus round

---
class: middle
# Background

* 20 years in tech
* Advisor to growth-stage startups
* Management consulting and high-performance tech/ML

---
class: middle
# Data are everywhere, often useful, not always shareable

---
class: middle
# Issues

* Ethics
* Organizational boundaries
* Regulatory requirements

---
class: middle, center
# Data...finds a way.

# What to do?

---
class: middle
# Anonymize!

* Hash only PII
* Remove all Personally-Identifiable Information (PII)
* Netflix prize?
* Make **all** data non-human readable, irreversible, with zero context

---
class: middle
# Logistic regression + Online Stochastic Gradient Descent (OSGD)

# Train

* Receive list of features along with actual label
* Multiply list of features element-wise by list of weights and sum to get label prediction
* Use error to adjust weights so that prediction would have been closer to actual
* Repeat

# Predict

* Receive list of features
* Do multiplication and addition as above to get prediction
* Return prediction

---
class: middle
# Features

```
surname, address, sex
drake, 101 college st, male
```

One-hot encode:

```
surname_is_albert, surname_is_carlton, ..., surname_is_drake, ...
0, 0, ..., 1, ...
```

Drawback: we have to see the whole dataset to know all possible values

Result: a huge, sparse matrix and lots of memory usage

---
class: middle
# Feature Hashing

```
surname, address, sex
drake, 101 college st, male
```

```python
def h(record):
    x = []
    for key, value in record.items():
        # One index per key/value pair, e.g. the hash of 'surname_drake'
        index = abs(hash(key + '_' + value))
        x.append(index)
    return x

# e.g. [1895500209063583407, 4529498959038451508, 89271952858425798]
```

---
class: middle
# Learning

## For each example:

(assume hashed feature vector `x`, label `y`, weights `w`, learning rate `alpha` are provided)

```python
from math import exp

# Calculate the prediction p (wTx is clamped to [-20, 20] to avoid overflow)
wTx = sum(w[i] for i in x)
p = 1.0 / (1.0 + exp(-max(min(wTx, 20.0), -20.0)))

# Update the weights by stepping against the gradient
for i in x:
    w[i] -= (p - y) * alpha
```

Note: `p - y` is the gradient under log loss, so each update subtracts it

---
class: middle
# Anonymization

```
surname, address, sex
drake, 101 college st, male
```

```python
def h(record):
    x = []
    for key, value in record.items():
        # Python 3 salts str hashes per process unless PYTHONHASHSEED is fixed
        index = abs(hash(key + '_' + value))
        x.append(index)
    return x

# e.g. [1895500209063583407, 4529498959038451508, 89271952858425798]
```

This can be done client-side, and the raw data need not be saved on the server.

---
class: middle
# Feature interactions

```
surname, address, sex
drake, 101 college st, male
```

```python
def h(record):
    x = []
    for key, value in record.items():
        index = abs(hash(key + '_' + value))
        x.append(index)
    x = sorted(x)
    # Hash every pair of feature indices to get second-order interactions
    interactions = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            interactions.append(abs(hash(str(x[i]) + '_' + str(x[j]))))
    return x + interactions
```

```
[1895500209063583407, 4529498959038451508, 89271952858425798]
[89271952858425798, 1895500209063583407, 4529498959038451508, 4681361264041862069, 3347029972801325145, 237115545698493179]
```

---
# Drawbacks

* No mapping from hash value back to feature (anonymizes)?
* Hash collisions (regularizes)?
* Other?

---
class: middle
# Performance

* This method (correct 67.6%) vs.
Ensemble of 20 Field-aware Factorization Machine (FFM) models, with tuned parameters (correct 68.5%, 1st place)
* Deterministic memory usage
* 45.8 million records: 3m55s (~200k records per second)

---
class: middle
# Conclusion

* Unshareable data: bad
* Feature hashing: good
* Decoupling hashing from learning

---
class: middle
# Bonus round: constrained memory

(think IoT, edge analytics, etc.)

(`rec` is a dictionary for the record)

```python
D = 2**18  # Set length of weights array
w = [0.0] * D

def h(rec):
    x = []
    for k, v in rec.items():
        # The modulo keeps every index inside the weights array
        idx = abs(hash(k + '_' + v)) % D
        x.append(idx)
    return x

# e.g. [111113, 31362, 21336]
```

---
class: middle
# Bonus round: Ocean Protocol

* The future of data interchange
* Decentralized exchange protocol
* Blockchain, proof-of-stake, and tokens. Oh my!
* Creators and custodians can share, consumers can use.
* oceanprotocol.com

---
class: middle, center
# @aadrake

# http://aadrake.com

# Questions?
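
---
class: middle
# Bonus round: end-to-end sketch

A minimal sketch of how the hashing and learning steps from the earlier slides might fit together, not the exact code behind the reported results. Assumptions: `hashlib.md5` is swapped in for the built-in `hash()` so that indices stay stable across processes, the helper names (`stable_hash`, `hash_features`, `predict`, `train_one`) are hypothetical, and the two toy records and their labels are made up for illustration.

```python
import hashlib
from math import exp

D = 2 ** 18  # length of the weights array, as on the constrained-memory slide


def stable_hash(text):
    """Hash a string to a non-negative integer that is the same in every process.

    Assumption: MD5 stands in for the built-in hash(), whose str hashes
    are salted per process in Python 3.
    """
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def hash_features(rec, d=D):
    """Map each key/value pair of a record to an index in [0, d)."""
    return [stable_hash(k + "_" + v) % d for k, v in rec.items()]


def predict(x, w):
    """Logistic prediction from the weights at the hashed indices."""
    wTx = sum(w[i] for i in x)
    return 1.0 / (1.0 + exp(-max(min(wTx, 20.0), -20.0)))


def train_one(x, y, w, alpha=0.1):
    """One online SGD step; returns the prediction made before the update."""
    p = predict(x, w)
    for i in x:
        w[i] -= (p - y) * alpha  # step against the log-loss gradient
    return p


w = [0.0] * D

# Made-up toy records and labels, for illustration only.
data = [
    ({"surname": "drake", "sex": "male"}, 1.0),
    ({"surname": "albert", "sex": "female"}, 0.0),
]

# The learner only ever sees hashed indices, never the raw values.
for _ in range(50):
    for rec, y in data:
        train_one(hash_features(rec), y, w)

p_pos = predict(hash_features(data[0][0]), w)
p_neg = predict(hash_features(data[1][0]), w)
```

Because `hash_features` needs only the record, it could run client-side while `train_one` and `predict` run wherever the weights live, which is the decoupling of hashing from learning named in the conclusion.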