
Novel Results Considered Harmful

University of Toronto

2018-07-10

Adam Drake

@aadrake

https://adamdrake.com

1 / 32

@aadrake #hacker

@aadrake #thoughtleader

2 / 32

Overview

  1. Background

  2. AI vs. IA

  3. Novel vs. Useful

  4. Applications

3 / 32

Background

  • Applied Mathematics

  • 20+ years in tech

  • Growth-stage startup advisor

  • Management consulting and high-performance tech/ML

4 / 32

AI vs. IA

Artificial Intelligence

Intelligence Amplification

5 / 32

Novel: New or unusual

Useful: Helpful toward advancing a purpose

6 / 32

Douglas Engelbart

Augmenting Human Intellect: A Conceptual Framework (1962)

MOAD (the Mother of All Demos): 1968

Windows, hypertext, graphics, video conferencing, the computer mouse, word processing, dynamic file linking, version control, collaborative real-time editing

7 / 32

Kaggle

Often novel. Rarely useful.

Solution A: 67.6% accuracy

Ensemble of 20 Field-aware Factorization Machine (FFM) models with tuned parameters: 68.5% accuracy, 1st place

8 / 32

Kaggle

Often novel. Rarely useful.

Logistic regression: 67.6% accuracy, 453rd place!

Ensemble of 20 Field-aware Factorization Machine (FFM) models with tuned parameters: 68.5% accuracy, 1st place

9 / 32

Sorting

Which sorting algorithm is used by Python, Android, Java SE 7, Octave, ... ?

10 / 32

Timsort (Tim Peters, 2002)

Hybrid mergesort and insertion sort

Detects already-sorted (sequential) runs in the array and merges them as blocks

References "Optimistic Sorting and Information Theoretic Complexity" (McIlroy, 1993)

11 / 32

Biggest problem in machine learning?

12 / 32

Count everything, but don't count anything twice.

--Yaser Abu-Mostafa

13 / 32

Typical Problems

  • counting
  • binary classification (fraud, churn, clicks, etc.)
  • anomaly detection
  • time series forecasting
  • segmentation

For even further generalization, see Machine Learning Reductions
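
As one small illustration of the reduction idea (my example, not from the talk; make_supervised is a hypothetical helper): time series forecasting, from the list above, reduces to ordinary supervised learning once lagged values are used as features.

# Hypothetical sketch: reduce one-step-ahead forecasting to plain supervised
# learning by treating the previous k observations as the feature vector.
def make_supervised(series, k=3):
    X, y = [], []
    for i in range(k, len(series)):
        X.append(series[i - k:i])    # the last k values become the features
        y.append(series[i])          # the next value becomes the target
    return X, y

X, y = make_supervised([1, 2, 3, 4, 5, 6], k=3)
print(X)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(y)  # [4, 5, 6]
# Any off-the-shelf regressor (or classifier, after thresholding y) now applies.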

14 / 32

Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)

15 / 32

Scalability! But at what COST?

McSherry et al. 2015

Configuration that

Outperforms a

Single

Thread

16 / 32

COST = How many cores do you need before you outperform a single thread?

Note: 25Gbps ~ 3125MB/s is commodity bandwidth

17 / 32

Plot by A. Colyer

18 / 32

Practical Example: Binary classification

  • Logistic regression with per-coordinate learning rates and online stochastic gradient descent (OSGD)
19 / 32

Wikipedia

20 / 32

Aside: The Hashing Trick

Assume data like

fname,lname,location
Adam,Drake,Toronto

features = ['fnameAdam', 'lnameDrake', 'locationToronto']
maxWeights = 2**20

def hashedFeatures(features):
    hashes = [hash(x) for x in features]
    return [x % maxWeights for x in hashes]

print(hashedFeatures(features))
# [18445008, 8643786, 20445187]

These are the indices in the weights array

Weinberger et al., Feature Hashing for Large Scale Multitask Learning, 2009

21 / 32

Feature hashing benefits

  • security/privacy
  • bounded memory consumption
  • stateless feature extraction (sketched below)
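
A condensed version of the hashedFeatures idea from the previous slide shows the last two points (my illustration, not the talk's code): nothing is fitted or stored per feature, and memory stays bounded by maxWeights no matter which values arrive.

# Hypothetical illustration: no vocabulary is kept, so never-before-seen values
# need no refit and no extra memory; everything lands in the same fixed index space.
maxWeights = 2**20

def hashedFeatures(features):
    return [hash(x) % maxWeights for x in features]

print(hashedFeatures(['fnameAdam', 'lnameDrake', 'locationToronto']))
print(hashedFeatures(['locationReykjavik']))   # unseen value, same bounded space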
22 / 32

Logistic regression, feature hashing, per-coordinate learning rates, OSGD

Complicated framework?

23 / 32

Example

# Turn the record into a list of hash values
x = [0]  # 0 is the index of the bias term
for key, value in record.items():
    index = int(value + key[1:], 16) % D
    x.append(index)

# Get the prediction for the given record (now transformed to hash values)
wTx = 0.
for i in x:  # do wTx
    wTx += w[i]  # w[i] * x[i], but if i in x we got x[i] = 1.
p = 1. / (1. + exp(-max(min(wTx, 20.), -20.)))

# Update the loss
p = max(min(p, 1. - 10e-12), 10e-12)
loss += -log(p) if y == 1. else -log(1. - p)

# Update the weights
for i in x:
    # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
    # (p - y) * x[i] is the current gradient
    # note that in our case, if i in x then x[i] = 1
    w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
    n[i] += 1.
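
A minimal, self-contained harness around that per-record update might look like the following; the records, labels, and hyperparameters here are made-up assumptions for illustration, not the talk's actual setup.

# Hypothetical harness: tiny fake dataset, same per-record update as above.
from math import exp, log, sqrt

D = 2**20                # size of the hashed weight space
alpha = 0.1              # base learning rate
w = [0.] * D             # weights, indexed by hashed feature
n = [0.] * D             # per-coordinate update counts
loss = 0.

# Fake records shaped as assumed above: field -> hex-string value, plus a 0/1 label.
records = [({'c1': '1a2b', 'c2': 'ff00'}, 1.),
           ({'c1': '1a2b', 'c2': '00ff'}, 0.)]

for t, (record, y) in enumerate(records, 1):
    x = [0]                                      # index 0 is the bias term
    for key, value in record.items():
        x.append(int(value + key[1:], 16) % D)   # hashing trick
    wTx = sum(w[i] for i in x)
    p = 1. / (1. + exp(-max(min(wTx, 20.), -20.)))
    p = max(min(p, 1. - 10e-12), 10e-12)
    loss += -log(p) if y == 1. else -log(1. - p)
    for i in x:
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
        n[i] += 1.
    print('%d records seen, average logloss %.5f' % (t, loss / t))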
24 / 32

Result: 74,000 RPS

(19MB/s)
25 / 32

Ad Click Prediction: a View from the Trenches (McMahan et al., 2013)

26 / 32

Multicore

// ... housekeeping ...
// Hash the record
for i, v := range record {
    hashResult := hash([]byte(fields[i]+v)) % int(D)
    x[i+1] = int(math.Abs(float64(hashResult)))
}

// Make prediction
wTx := 0.0
for _, v := range x {
    wTx += (*w)[v]
}
p := 1.0 / (1.0 + math.Exp(-math.Max(math.Min(wTx, 20.0), -20.0)))
p = math.Max(math.Min(p, 1.-math.Pow(10, -12)), math.Pow(10, -12))

// Update the loss
if y == 1 {
    *loss += -math.Log(p)
} else {
    *loss += -math.Log(1.0 - p)
}

// Update the weights
for _, v := range x {
    (*w)[v] = (*w)[v] - (p-float64(y))*alpha/(math.Sqrt((*n)[v])+1.0)
    (*n)[v]++
}
27 / 32

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

i.e., never mind about locks

Niu, Recht, et al. 2011
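
A rough sketch of the idea (mine, not the paper's or the talk's code): several workers apply SGD updates to one shared weight array with no locking at all, tolerating the occasional lost update. Here Python processes share a RawArray; the records and labels are fake, just to exercise the update path.

# Hypothetical HOGWILD!-style sketch: lock-free updates to shared weights.
from multiprocessing import Process, RawArray
from math import exp
import random

D = 2**20            # hashed weight space, as in the earlier slides
ALPHA = 0.1

def worker(w, seed, steps=100000):
    rng = random.Random(seed)
    for _ in range(steps):
        x = [0] + [rng.randrange(D) for _ in range(10)]  # fake hashed record
        y = float(rng.randint(0, 1))                     # fake label
        wTx = sum(w[i] for i in x)
        p = 1. / (1. + exp(-max(min(wTx, 20.), -20.)))
        for i in x:
            w[i] -= (p - y) * ALPHA                      # no lock; races are tolerated

if __name__ == '__main__':
    w = RawArray('d', D)                                 # shared, unsynchronized weights
    procs = [Process(target=worker, args=(w, s)) for s in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()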

28 / 32

Multicore version (Go):

366,000 RPS

(~95MB/s)

Note: 25Gbps ~ 3125MB/s

29 / 32

Application: Edge Analytics

  • privacy
  • resources (bandwidth, battery, storage, RAM)
  • accuracy

Resource-efficient Machine Learning in 2KB RAM for the Internet of Things, Kumar et al., 2017

30 / 32

Conclusion

  • Novel results often harmful
  • Useful methods often seen as boring
  • Old methods are often great for new problems
  • Troubling Trends in Machine Learning Scholarship (Lipton and Steinhardt, 9 JUL 2018)
31 / 32

@aadrake

https://adamdrake.com

Questions?

32 / 32
