Novel Results Considered Harmful
Ravi Adve from the University of Toronto graciously invited me to give a public lecture at the university earlier this summer. I was very grateful for the opportunity.
The university’s facilities were fantastic, and Ravi did a wonderful job organizing everything. The audience was engaging and asked thoughtful questions, and the attendance was much higher than I had anticipated for a Tuesday morning lecture in the middle of summer! There was great representation from multiple departments, including Electrical and Computer Engineering, Computer Science, and Mathematics.
Many thanks again to Ravi and the University of Toronto for hosting me. Below are some main points from my slides along with brief thoughts on each.
The slides are also available for download.
My primary goal for this lecture was to emphasize the distinction between techniques or developments that are useful, and those that are simply novel. Additionally, I wanted to pose a question to the audience: what should we do in academic-like environments when results that are novel are valued much more highly than results that are useful?
In order to illustrate this distinction, I started with some definitions, then gave some examples.
Novel results are a requirement for publication in academic journals, but they are often useless when business problems need to be solved. Novelty can also be abused, in the form of a novel technology that is useful in some specific areas but is often just a solution searching for a problem (looking at you, blockchain…). How to handle this fact is still an open question in my mind. I would also say that practitioners misunderstand the academic publishing system, which leads to the implementation and large-scale deployment of some of these novel results. Deploying such results on production systems usually does more harm than good, hence Novel Results Considered Harmful.
There is a huge amount of hype right now surrounding Artificial Intelligence (AI); however, I would advocate placing the focus instead on Intelligence Amplification (IA). The basis of IA is that humans are the focus of systems, and that we should explore ways to make humans more productive by amplifying human intelligence. Instead of a machine engaging in some nebulously defined type of cognition to solve problems (colloquially, AI), IA takes the form of automation and/or providing access to more information, in concrete and measurable steps.
Viewing problem solving through the lens of IA is a great way to add immediate and incremental value to endeavors that humans are engaging in right now. The best-known example of IA in common use is probably Google. How much more are you able to accomplish with Google than without it? Do you consider Google’s search product to be AI? If not, why not? By developing technology that aids humans in this way, we’re able to add business value at a much faster rate, with lower risk.
There’s a big difference between no automation and mediocre automation, and comparatively zero difference between mediocre automation and better-than-mediocre automation. Go for the big gains in human output!
Academic journals are the vehicles by which researchers advance their careers. A general requirement for publication in an academic journal is that the work must contain a novel result, where novel is taken to mean new and interesting.
The problem is that something being new and interesting does not, by itself, mean that the thing is useful. In terms of solving real-world business problems, the usefulness of an approach or technique matters much more than how new or interesting it is. In fact, most business problems requiring technical solutions are best solved using techniques that most technologists would consider old and boring.
This misalignment of incentives causes problems for technologists who read academic journals. It misleads them to think that the novel, state-of-the-art result is the thing they should build. Instead, technologists should be building the most useful thing possible that solves the problem at hand, under the current time and materials constraints.
I like to ask audiences if they know of Douglas Engelbart. Most people don’t, which is unfortunate, since his lab was a great example of what can be accomplished with the goal of Intelligence Amplification. Probably the most notable result of their efforts was the Mother of All Demos (MOAD), which, in a single demonstration, showcased:
- Video conferencing
- The computer mouse
- Word processing
- Dynamic file linking
- Revision control
- Real-time collaborative editing (like Google Docs)
Engelbart’s lab and the MOAD are a great example of what happens when we focus on amplifying the intelligence of humans, rather than creating some kind of general machine intelligence.
Kaggle is a great example of a place where novel results abound, but useful results or approaches are not rewarded. Some years ago, I worked on a Kaggle competition. My solution had an accuracy within one or two percentage points of the winning solution. My standing? I came in at 453rd place.
For reference, my solution was modified logistic regression with improvements as described in Google’s paper “Ad Click Prediction: a View from the Trenches”. It had extremely high performance in the range of ~380,000 transactions per second on my laptop. It’s a model that is well understood, and easily deployable on even a modest virtual server. It was also an online algorithm, so there was very little needed in the way of training time/steps.
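To make the shape of that solution concrete, here is a minimal sketch of online logistic regression over sparse features. This is an illustration in the spirit of that approach, not my competition code; the paper’s full FTRL-Proximal algorithm adds per-coordinate learning rates and L1 regularization, which I omit here, and the feature names are made up.

```python
import math

def predict(w, x):
    """Predicted click probability for a sparse feature dict x."""
    z = sum(w.get(f, 0.0) * v for f, v in x.items())
    return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))

def update(w, x, y, lr=0.1):
    """One online SGD step on the log loss; w is mutated in place."""
    g = predict(w, x) - y
    for f, v in x.items():
        w[f] = w.get(f, 0.0) - lr * g * v

# Usage: train directly on the event stream; no separate batch phase.
w = {}
stream = [({"ad=42": 1.0, "hour=9": 1.0}, 1.0),
          ({"ad=7": 1.0, "hour=9": 1.0}, 0.0)] * 100
for x, y in stream:
    update(w, x, y)
```

Because the model is a plain dict of weights, it is trivially inspectable and deployable on even a modest server.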
By contrast, the winning solution used an ensemble of 20 separate Field-Aware Factorization Machine (FFM) models. I won’t go into the details in this article, but feel free to explore FFMs and ensemble techniques if you like.
Such a difference in solution complexity, in order to achieve an improvement of one or two percentage points, is endemic on platforms like Kaggle. It’s a great place to explore solutions and different problem areas, but it can train people to do nearly the opposite of what they should do if they want to add maximal value to a business.
If Google calls you for an interview, they’ll probably ask you about a sorting algorithm (probably quicksort) and its time complexity. Why do they ask about quicksort? It’s certainly useful, but it’s not always the best option. The Linux kernel, for example, uses merge sort in its list_sort function for sorting linked lists.
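As a hedged sketch of why merge sort suits that setting (illustrative Python, not the kernel’s C implementation): merge sort on a singly linked list needs only sequential traversal and pointer re-linking, never the random access quicksort relies on.

```python
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.next = val, nxt

def merge_sort(head):
    if head is None or head.next is None:
        return head
    # Split the list in two with slow/fast pointers.
    slow, fast = head, head.next
    while fast and fast.next:
        slow, fast = slow.next, fast.next.next
    mid, slow.next = slow.next, None
    left, right = merge_sort(head), merge_sort(mid)
    # Merge by re-linking nodes: no element copies, no random access.
    dummy = tail = Node(None)
    while left and right:
        if left.val <= right.val:
            tail.next, left = left, left.next
        else:
            tail.next, right = right, right.next
        tail = tail.next
    tail.next = left or right
    return dummy.next

def from_list(vals):
    head = None
    for v in reversed(vals):
        head = Node(v, head)
    return head

def to_list(head):
    out = []
    while head:
        out.append(head.val)
        head = head.next
    return out
```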
Timsort is one of, if not the, most widely-used sorting algorithms and is essentially a combination of merge sort and insertion sort. It’s extremely useful, but not necessarily novel or academic. In fact, I don’t know that a paper has ever been written on it, though it did incorporate some results from other papers. While Timsort works well and is thus extremely useful, it might be considered too derivative, and therefore not novel enough, to be a big paper in a prestigious journal.
Even if there has been a publication focusing on Timsort, the fact that it’s not commonly known despite its obvious usefulness is interesting.
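The hybrid idea itself is easy to sketch. The toy code below illustrates only the merge-sort-plus-insertion-sort combination; real Timsort additionally detects natural runs, computes a minimum run length, and uses galloping merges.

```python
SMALL = 32  # below this size, insertion sort beats the merge machinery

def insertion_sort(a, lo, hi):
    """Sort the slice a[lo:hi] in place."""
    for i in range(lo + 1, hi):
        v, j = a[i], i - 1
        while j >= lo and a[j] > v:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = v

def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def hybrid_sort(a, lo=0, hi=None):
    """Insertion sort for small slices, merge sort above that."""
    if hi is None:
        hi = len(a)
    if hi - lo <= SMALL:
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_sort(a, lo, mid)
    hybrid_sort(a, mid, hi)
    a[lo:hi] = merge(a[lo:mid], a[mid:hi])
```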
Beyond the common grouping of machine learning tasks into areas like classification or regression, there are a few areas where businesses typically apply machine learning techniques:
- counting (e.g., how many sales happened yesterday)
- binary classification (e.g., will a user click on this)
- time series forecasting (e.g., what will our sales or demand be)
- customer segmentation (e.g., churn prediction, personas, etc.)
In each of these areas, there are well-studied, high-performance, interpretable algorithms that can be used. These approaches, like logistic regression and online stochastic gradient descent, have extremely high success relative to their difficulty of implementation and support. Such algorithms are extremely useful.
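As one hedged example in the forecasting bucket, simple exponential smoothing is an interpretable baseline with a single tunable parameter (the function name and sample data here are illustrative):

```python
def exp_smooth_forecast(series, alpha=0.3):
    """One-step-ahead forecast via simple exponential smoothing."""
    level = series[0]
    for x in series[1:]:
        # The new level blends the latest observation with the old level;
        # alpha controls how quickly the forecast reacts to recent data.
        level = alpha * x + (1 - alpha) * level
    return level

daily_sales = [100, 102, 101, 105, 107]
print(exp_smooth_forecast(daily_sales))
```

A single parameter, a one-line update rule, and an answer any stakeholder can understand: that is the kind of usefulness-per-unit-complexity these well-studied methods deliver.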
However, most of the focus these days is on things like deep learning and AI (whatever people define that to be). These approaches are more novel.
While deep learning at least has a more specific definition, it suffers from very high training complexity and low interpretability relative to the usefulness of the models generated. In some cases, deep learning approaches can be very effective and the tradeoffs make sense, but this is not the norm.
Often in academic papers and industry alike, horizontal scalability is touted as a benefit of a given algorithm or system architecture. This capability to scale horizontally is often used to justify the development of a novel approach to solving a problem. This is especially true among papers touting a new approach to distributed computation. However, advocates for distributed approaches to data processing rarely compare their designs and systems with what could easily be achieved by a standard single-threaded solution on one machine. I’ve done similar comparisons in the past, showing that Command-line Tools can be 235x Faster than your Hadoop Cluster, or expanding on these thoughts in my lecture Big Data, Small Machine.
Frank McSherry does a great job of examining the tradeoffs between distributed processing and performance costs in his paper “Scalability! But at what COST?” where COST is an acronym for “Configuration that Outperforms a Single Thread.” In many cases, there is NO distributed computational platform (Hadoop, Spark, etc.) that will outperform a simple, single-threaded implementation running on a laptop. With more voices like Frank’s in the data community, we can prevent people from reaching for heavy tooling simply because it’s novel. Heavy tooling costs more, not only for maintenance, but for processing as well, and in many cases is completely unnecessary. For example, at the time of this article you can spin up an instance on AWS EC2 with 12 TB (yes, Terabytes) of RAM. Is your working set larger than 12TB? Is your total data processing volume larger than 12TB? Do you have more than 12TB of data to deal with in your organization? Probably not. If not, why do you need anything distributed?
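To make that concrete, here is a sketch of the shape of job that often gets thrown at a cluster: grouping and counting events. A single-threaded loop over a stream is frequently all that is needed (the row data below is made up):

```python
from collections import Counter

def count_by_key(rows):
    """rows: iterable of (key, value) pairs; returns occurrences per key."""
    counts = Counter()
    for key, _value in rows:
        counts[key] += 1
    return counts

# Usage: stream rows straight from a file or generator; no cluster required.
rows = [("ca", 1), ("ny", 1), ("ca", 1)]
print(count_by_key(rows))
```

No job scheduler, no shuffle phase, no serialization overhead: the entire "pipeline" is a loop and a hash table.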
You could argue that this is simply another way of saying use the right tool for the job, and you’d be right. Yet, it’s amazing how many people want to use a different tool just for the novelty of it, instead of deriving similar excitement from coming up with the best solution for the problem at hand.
As an extension of the point above, consider the paper from Niu, Recht, et al.: “HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”.
I mentioned above that many real-world problems are very amenable to logistic regression and SGD. Sometimes people think they need more performance than their LR/SGD approach can provide, or that they need a cluster of machines for their ML task. Interestingly, in some cases you can achieve an order of magnitude higher performance on a single multicore machine just from a simple change to your code!
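The core idea — multiple workers applying SGD updates to shared weights with no locking, tolerating the occasional race — can be sketched as follows. This is a toy illustration with made-up data, not the paper’s implementation (and note that CPython’s GIL means a real speedup needs a runtime without one):

```python
import math
import random
import threading

random.seed(0)

# Toy linearly separable data: label is 1 when x0 + x1 > 0.
data = []
for _ in range(2000):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    y = 1.0 if x[0] + x[1] > 0 else 0.0
    data.append((x, y))

w = [0.0, 0.0]  # shared weights; all threads update them with no locks

def sgd_worker(rows, lr=0.5):
    for x, y in rows:
        z = w[0] * x[0] + w[1] * x[1]
        p = 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))
        g = p - y  # gradient of the log loss with respect to z
        # Lock-free updates: concurrent writes may clobber each other,
        # and HOGWILD!'s insight is that sparse problems tolerate this.
        w[0] -= lr * g * x[0]
        w[1] -= lr * g * x[1]

# Four threads each work on an interleaved slice of the data.
threads = [threading.Thread(target=sgd_worker, args=(data[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

acc = sum((w[0] * x[0] + w[1] * x[1] > 0) == (y == 1.0)
          for x, y in data) / len(data)
print(f"training accuracy: {acc:.2f}")
```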
I demonstrate this in my lecture Big Data, Small Machine, wherein a processing rate of 366,000 records per second is achieved on my modest laptop. It’s almost never the case that people need to handle that kind of request volume, even in high-growth startups.
Given that we have great tools and techniques for high-performance data processing that do not require complicated distributed systems, why do people insist on using such data processing frameworks? My conjecture is that, generally, they simply enjoy the novelty of the tool more than solving the actual problem in the best possible way. Additionally, blaming the current tooling and claiming that new tooling will help solve the problem may be a way to avoid taking responsibility for the current state of affairs.
In the future, I think edge analytics will become more common due to factors such as data privacy concerns, bandwidth limitations, and processing limitations. In the latter case, we can benefit a lot by imposing constraints on ourselves for model size, training time, and power usage.
Consider “Resource-efficient Machine Learning in 2KB RAM for the Internet of Things” (Kumar et al., 2017). This might be a paper that nicely combines novelty and usefulness.
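As a hedged illustration of the kind of trick that helps models fit in such tiny memory budgets (this sketch is my own, not from the paper): quantizing 32-bit float weights down to 8-bit integer codes plus one shared scale factor cuts model storage roughly fourfold.

```python
def quantize(weights):
    """Map floats to int8 codes in [-127, 127] plus a shared scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original float weights."""
    return [c * scale for c in codes]

weights = [0.5, -1.2, 0.03, 0.9]
codes, scale = quantize(weights)
# Storage: 4 bytes per float32 weight vs. 1 byte per int8 code.
approx = dequantize(codes, scale)
```

The reconstruction error per weight is at most half the scale factor, a tradeoff that is easy to reason about on a constrained device.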
I expect a great deal of additional developments in the area of efficient tools and techniques that can allow us to do on-device machine learning. This will be critical for future developments in edge analytics, and also allow for much better privacy and control when it comes to things like training machine learning models directly on your phone instead of sending data into some cloud service. This is a very exciting area.
From a technical perspective, perhaps the most common failure mode I see in the startups I advise is a completely avoidable one. It’s the mistake of over-engineering by implementing novel algorithms and systems encountered in journals, or on the technical blogs of large software companies. The propagation of these solutions, most of which are wholly inappropriate for the needs of most companies, seems to be largely related to their novelty rather than their usefulness.
This is plain to see in the proliferation of companies that have labeled themselves with AI over the last few years. These companies are using the ambiguity, novelty, and hopefulness associated with the term in order to try to advance their business interests. The AI terminology (usually) has nothing to do with the products and services they offer to customers.
Our industry, and especially the growth-stage startups therein, needs to focus on ways technology can add practical value to the problems that most concern or absorb the time of humans. Eliminating time sinks is critically important but under-appreciated, and sits in the shadow of AI-as-marketing-term.
As a better, more practical way to make humans smarter and more productive, let us instead focus on IA: Intelligence Amplification.