A few days ago Nassim N. Taleb wrote an opinion piece for Wired claiming that we should “Be aware of the big errors of big data.” If you haven’t heard of it, “big data” has become a buzzword in the media and the sciences, particularly the social sciences, for the strategy of gathering massive amounts of data and then processing it with statistical tools. Taleb paints a picture of big data as being extremely manipulable, so much so that scientists cannot resist the urge to employ it uncritically in support of their favorite theories:
Big-data researchers have the option to stop doing their research once they have the right result. In options language: The researcher gets the “upside” and truth gets the “downside.” It makes him antifragile, that is, capable of benefiting from complexity and uncertainty — and at the expense of others.
But beyond that, big data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal). It’s a property of sampling: In real life there is no cherry-picking, but on the researcher’s computer, there is. Large deviations are likely to be bogus.
We used to have protections in place for this kind of thing, but big data makes spurious claims even more tempting. And fewer and fewer papers today have results that replicate: Not only is it hard to get funding for repeat studies, but this kind of research doesn’t make anyone a hero. Despite claims to advance knowledge, you can hardly trust statistically oriented sciences or empirical studies these days.
That’s right, he said you can hardly trust statistically oriented sciences or empirical studies these days. In making this claim, however, Taleb falls into his own trap: he provides little justification for the link between big data and spurious correlations. As the comments on the article point out, Taleb seems to assume that “big data” means looking at lots of different variables (columns) while collecting very little information about each of those variables (rows). Collecting more information about each variable reduces the worries Taleb raises.
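To see why many columns and few rows invite spurious findings, here is a minimal sketch in pure Python (the numbers are illustrative, not from Taleb’s article): generate columns of pure noise and pick out the largest pairwise correlation. With few rows, the “best” correlation looks impressive even though every relationship is bogus; with more rows per variable, it shrinks toward zero.

```python
import math
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def max_spurious_correlation(n_vars, n_rows):
    """Largest |r| among pure-noise columns: every 'relationship' is bogus."""
    data = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_vars)]
    return max(abs(pearson(data[i], data[j]))
               for i in range(n_vars) for j in range(i + 1, n_vars))

few_rows = max_spurious_correlation(n_vars=60, n_rows=25)    # cherry-picker's paradise
many_rows = max_spurious_correlation(n_vars=60, n_rows=400)  # noise averages out
```

With 60 noise columns and only 25 rows, the strongest pairwise correlation typically lands well above 0.4; the same 60 columns with 400 rows keep the maximum far smaller, which is exactly the “more rows” remedy described above.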
“Big data” can also be taken in a different way. Consider this TED talk by Deb Roy, who paints a much friendlier picture of big data (video below, news article here). To analyze his own child’s language acquisition, Roy installed cameras throughout his house to track the baby’s movements and record his voice. The trick was to record lots and lots of data on lots and lots of variables. The data was statistically analyzed and mapped to locations inside the house. Part of the advantage of big data gathered this way is that it reveals natural interactions free from the artificiality of laboratory setups. I doubt Roy would say that recording his child alone shows how language acquisition happens in general, but with enough data from enough children, gathered in the same way, something general might be said. What he does think it can do is turn up interesting relationships that call out for explanation.
Taleb’s real target seems to be lazy and irresponsible scientists. But saying that scientists can cook the books in favor of their pet theory is not new (see Alan Franklin’s section on Weber and gravity waves in “How to Prevent the Experimenters’ Regress”). Furthermore, Taleb says very little about the standards placed upon scientific investigators to prevent the spurious correlations he discusses. For example, physics requires a five-sigma result in order to claim a discovery (roughly a 1 in 3.5 million chance that the result was due to statistical fluctuation). This standard has grown steadily from one sigma to five sigma as the data sets used in physics studies have increased in size (Alan Franklin has a forthcoming book on this subject). Such standards exist to prevent the very thing Taleb discusses, and they are the reason some statistically oriented sciences and empirical investigations are trustworthy.
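The five-sigma figure can be checked directly against the normal distribution with Python’s standard library; the one-sided tail probability is the convention behind the 1-in-3.5-million figure:

```python
import math

def tail_odds(sigma):
    """One-sided tail probability of a normal deviation, expressed as 1-in-N odds."""
    p = 0.5 * math.erfc(sigma / math.sqrt(2))
    return 1 / p

one_sigma_odds = tail_odds(1.0)   # about 1 in 6: far too easy to hit by chance
five_sigma_odds = tail_odds(5.0)  # about 1 in 3.5 million
```

The contrast makes the history vivid: a one-sigma deviation occurs by chance about once in six tries, which is why the threshold had to climb as data sets grew.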
A consequence of big data (in the sense of lots of data on many variables) might actually be added constraint on scientific investigations, not, as Taleb suggests, the subjective freedom to confirm any theory. For example, in atmospheric science, the massive amounts of atmospheric data gathered every day are ingested by forecasting models to constrain model output and provide accurate initial conditions for future projections. This kind of data assimilation keeps model errors from accumulating, and its inclusion in meteorological forecast models has improved accuracy significantly. Such accuracy was likely only possible with large data sets, statistical analysis, and computing power.
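A toy sketch of that idea, with made-up numbers and only a caricature of real data assimilation: a free-running model’s error random-walks upward, while periodically nudging the model toward noisy observations keeps the error bounded.

```python
import random

random.seed(1)

def mean_error(assimilate, steps=500, obs_noise=0.5, nudge=0.3):
    """Average |model - truth| over a run, with or without assimilation."""
    truth, model = 0.0, 0.0
    errors = []
    for _ in range(steps):
        truth += random.gauss(0, 0.1)       # the real atmosphere drifts unpredictably
        if assimilate:
            obs = truth + random.gauss(0, obs_noise)  # imperfect observation
            model += nudge * (obs - model)  # nudge the model toward the observation
        errors.append(abs(model - truth))
    return sum(errors) / len(errors)

free_run_error = mean_error(assimilate=False)  # errors accumulate unchecked
assimilated_error = mean_error(assimilate=True)  # errors stay bounded
```

Even though each observation is noisy, feeding a steady stream of them into the model constrains it far better than letting it run free: more data acts as a check, not a blank check.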
If Taleb’s real target is lazy scientists though, big data might be a virtue and not a vice. After all, Taleb seems to be claiming that extremely large data sets entice scientists to support their favorite theories. Big data, in the sense that Roy uses it, could enable the data to “speak for itself.” It seems possible that, with a few good algorithms, we no longer need humans to posit theories for the data to support. The algorithms can find correlations that serve as hypotheses, and then specify the additional amounts of data necessary to reach some predefined statistical benchmark. While something like this may be implausible, or at least quite far off, the point is that big data has the potential to displace human scientists from the center of investigation and, in turn, remove the subjectivity behind the worries Taleb raises.
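The “predefined statistical benchmark” step can be sketched concretely. Assuming the benchmark is a standard significance-and-power target for a correlation test, the Fisher z-transform approximation gives the sample size a confirming study would need (the function name and defaults below are my own, not from the article):

```python
import math

def required_samples(r, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size to confirm correlation r at two-sided 5%
    significance with 80% power, via the Fisher z-transform."""
    return math.ceil(((z_alpha + z_beta) / math.atanh(abs(r))) ** 2 + 3)

n_strong = required_samples(0.5)  # a strong candidate needs modest data
n_weak = required_samples(0.1)    # a weak candidate needs hundreds of samples
```

An algorithm that surfaces a candidate correlation could emit this number as its data request: weak candidate relationships automatically demand far more confirming data, which is precisely the kind of built-in discipline a lazy human might skip.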
Big data isn’t going away. We can’t go back to the old days and “rely on the protections we used to have.” Instead, big data requires us to rethink the more traditional methods and standards of science and embrace the realities of big computation.