class: center, middle # Novel Results Considered Harmful ## University of Toronto ## 2018-07-10 Adam Drake @aadrake
--- class: center, middle # @aadrake #hacker # @aadrake #thoughtleader --- class: middle # Overview 1. Background 2. AI vs. IA 3. Novel vs. Useful 4. Applications --- class: middle # Background - Applied Mathematics - 20+ years in tech - Growth-stage startup advisor - Management consulting and high-performance tech/ML --- class: middle, center # AI vs. IA ## Artificial Intelligence ## Intelligence Amplification --- class: middle # Novel: New or unusual # Useful: Helpful toward advancing a purpose --- class: middle # Douglas Engelbart Augmenting Human Intellect: A Conceptual Framework (1962) ## MOAD: 1968 Windows, hypertext, graphics, video conferencing, computer mouse, word process, dynamic file linking, version control, collaborative real-time editing --- class: middle # Kaggle Often novel. Rarely useful. Solution A: 67.6% accuracy Ensemble of 20 Field-aware Factorization Machine (FFM) models with tuned parameters: 68.5% accuracy, 1st place --- class: middle # Kaggle Often novel. Rarely useful. Logistic regression: 67.6% accuracy, **453rd place**! Ensemble of 20 Field-aware Factorization Machine (FFM) models with tuned parameters: 68.5% accuracy, 1st place --- class: middle # Sorting Which sort algorithm is used for Python, Android, Java SE 7, Octave, ... ? --- class: middle # Timsort (Tim Peters, 2002) #### Hybrid mergesort and insertion sort #### Accounts for sequential runs in array and merges them as a block .credit[References "Optimistic Sorting and Information Theoretic Complexity" (McIlroy, 1993)] --- class: middle, center # Biggest problem in machine learning? --- class: middle, center # Count everything, but don't count anything twice. .credit[--Yaser Abu-Mustafa] --- class: middle # Typical Problems - counting - binary classification (fraud, churn, clicks, etc.) - anomaly detection - time series forecasting - segmentation .credit[For even further generalization, see Machine Learning Reductions] --- class: middle # Command-line Tools can be 235x Faster than your Hadoop Cluster (2014) --- class: middle # Scalability! But at what COST? ## McSherry et al. 2015 Configuration that Outperforms a Single Thread --- class: middle # COST = How many cores do you need before you outperform a single thread? .credit[Note: 25Gbps ~ 3125MB/s is commodity bandwidth] --- class: middle, center ![]() .credit[Plot by A. Colyer] --- class: middle # Practical Example: Binary classification - Logistic regression with per-coordinate learning rates and OSGD --- class: middle, center ![]() .credit[Wikipedia] --- class: middle # Aside: The Hashing Trick Assume data like fname,lname,location Adam,Drake,Toronto ```python features = ['fnameAdam', 'lnameDrake', 'locationToronto'] maxWeights = 2**20 def hashedFeatures(features): hashes = [hash(x) for x in features] return [x % maxWeights for x in hashes] print(hashedFeatures(features)) # [18445008, 8643786, 20445187] ``` These are the indices in the weights array .credit[Weinberger et al., Feature Hashing for Large Scale Multitask Learning, 2009] --- class: middle # Feature hashing benefits - security/privacy - bounded memory consumption - stateless feature extraction --- class: middle # Logistic regression, feature hashing, per-coordinate learning rates, OSGD #### Complicated framework? --- class: middle Example ```python # Turn the record into a list of hash values x = [0] # 0 is the index of the bias term for key, value in record.items(): index = int(value + key[1:], 16) % D x.append(index) # Get the prediction for the given record (now transformed to hash values) wTx = 0. for i in x: # do wTx wTx += w[i] # w[i] * x[i], but if i in x we got x[i] = 1. p = 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # Update the loss p = max(min(p, 1. - 10e-12), 10e-12) loss += -log(p) if y == 1. else -log(1. - p) # Update the weights for i in x: # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic # (p - y) * x[i] is the current gradient # note that in our case, if i in x then x[i] = 1 w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) n[i] += 1. ``` --- class: middle, center # Result: 74,000 RPS ##### (19MB/s) --- class: middle #### Ad Click Prediction: a View from the Trenches (McMahan et al., 2013) --- class: middle # Multicore ```go // ... housekeeping ... // Hash the record for i, v := range record { hashResult := hash([]byte(fields[i]+v)) % int(D) x[i+1] = int(math.Abs(float64(hashResult))) } // Make prediction wTx := 0.0 for _, v := range x { wTx += (*w)[v] } p := 1.0 / (1.0 + math.Exp(-math.Max(math.Min(wTx, 20.0), -20.0))) p = math.Max(math.Min(p, 1.-math.Pow(10, -12)), math.Pow(10, -12)) // Update the loss if y == 1 { *loss += -math.Log(p) } else { *loss += -math.Log(1.0 - p) } // Update the weights for _, v := range x { (*w)[v] = (*w)[v] - (p-float64(y))*alpha/(math.Sqrt((*n)[v])+1.0) (*n)[v]++ } ``` --- class: middle, center ### HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent i.e., Nevermind about locks .credit[Niu, Recht, et al. 2011] --- class: middle, center # Multicore version (Go): ## 366,000 RPS ###### (~95MB/s) .credit[Note: 25Gbps ~ 3125MB/s] --- class: middle # Application: Edge Analytics - privacy - resources (bandwidth, bettery, storage, RAM) - accuracy .credit[Resource-efficient Machine Learning in 2KB RAM for the Internet of Things, Kunmar et al. 2017] --- class: middle # Conclusion - Novel results often harmful - Useful methods often seen as boring - Old methods are often great for new problems - Troubling Trends in Machine Learning Scholarship (Lipton and Steinhardt, 9 JUL 2018) --- class: middle, center # @aadrake #
# Questions?