On Big Data and Spurious Correlations
I didn’t have time to mention it last week, but even though I am happy that the New York Times wrote an article on big data, I think the most interesting part was at the end:
Big Data has its perils, to be sure. With huge data sets and fine-grained measurement, statisticians and computer scientists note, there is increased risk of false discoveries. The trouble with seeking a meaningful needle in massive haystacks of data, says Trevor Hastie, a statistics professor at Stanford, is that many bits of straw look like needles. Big Data also supplies more raw material for statistical shenanigans and biased fact-finding excursions. It offers a high-tech twist on an old trick: I know the facts, now let’s find ’em. That is, says Rebecca Goldin, a mathematician at George Mason University, one of the most pernicious uses of data.
The warning was buried on the last page of a three-page article, a mere three short paragraphs from the end. I understand that the piece was meant to be rather lighthearted and to focus on the job opportunities in a growing field, but more needs to be said about this peril of analyzing very large sets of data.
Humans already have a long list of cognitive biases, which I call brain failures, that come up in our daily lives. These brain failures have become increasingly problematic as access to information has increased. Humans love to find similarities between things, and those pattern recognition skills are one thing that has allowed us to survive this long. If Ug the caveman ate berries and then got sick, he would assume that the berries were the cause and avoid them from then on, potentially saving himself from illness and death. Being wrong about the berries cost him little, while ignoring a real danger could have cost him his life. In this way, by natural selection, we have evolved to be so sensitive to patterns that we see correlations that do not exist, or that exist but have no bearing on the situation under consideration. This is especially apparent in the financial sector, where spurious relationships abound. On the surface it can be pretty obvious that the correlations are spurious, but that doesn’t stop people from demonstrating that they are supported by data.
For example, consider the Super Bowl Indicator, which says that if an original NFL team wins the Super Bowl then stocks will rise in the coming year, and if not then they will fall. This already sounds pretty ridiculous, but consider that it also has a 79% accuracy rate. It is a perfect example of a spurious correlation. There are other crazy correlative indicators of stock market prices, like the Sports Illustrated Swimsuit Issue Indicator, which says that the stock market will have above-average returns in years when an American model is on the cover of the Sports Illustrated Swimsuit Issue. The point is that as humans gain the ability to amass and analyze larger and larger sets of data, they will increasingly discover correlations that are spurious, and data scientists or those who work with “Big Data” should be very aware of this problem.
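To make that last point concrete, here is a minimal sketch in Python. It is only an illustration under made-up assumptions, not an analysis of real market data: the "market" is 20 years of random noise, and each "indicator" is another, independent series of random noise. The function names, the 20-year window, and the choice of Gaussian noise are all my own inventions for the example. The sketch simply reports the strongest correlation found by chance as the number of indicators searched grows.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_candidates, n_years=20, seed=0):
    """Strongest |correlation| between a random 'market' series and
    n_candidates unrelated random 'indicator' series.

    Everything here is noise (an assumption of this sketch), so any
    correlation that turns up is spurious by construction.
    """
    rng = random.Random(seed)
    market = [rng.gauss(0, 1) for _ in range(n_years)]
    best = 0.0
    for _ in range(n_candidates):
        indicator = [rng.gauss(0, 1) for _ in range(n_years)]
        best = max(best, abs(pearson(market, indicator)))
    return best


if __name__ == "__main__":
    # Search more and more meaningless indicators against the same noise.
    for n in (10, 100, 1000, 10000):
        print(n, round(best_chance_correlation(n), 2))
```

Because the same seed is reused, each larger search includes every indicator from the smaller ones, so the best correlation found can only stay the same or go up as more candidates are examined. None of the indicators has anything to do with the "market"; looking at more of them just gives chance more opportunities to produce something that looks like a needle.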
At this point the discipline is still occupied mostly by people who have a strong scientific and mathematical background, and who therefore already have some critical-thinking-based immunity to spurious correlations from training or past exposure. But as tools and techniques for data analysis become more accessible to the average person, the problem of succumbing to spurious correlations will become more pronounced. I’m not scolding the New York Times for not putting that warning at the beginning of their article, but I think it would be good to put more emphasis on the careful analytical skills required in “Big Data” work.