Big Data Has Big Problems

Joshua Klein 2013-11-07
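Supercomputing rests on models of every kind. Many of those models have inherent flaws, and when they fail in the age of Big Data, they can cause serious trouble that nobody saw coming.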

    Big Data and the cloud are putting supercomputer capabilities into everyone's hands. But what's getting lost in the mix is that the tools we use to interpret and apply this tidal wave of information often have a fatal flaw. Much of the data analysis we do rests on erroneous models, meaning mistakes are inevitable. And when our outsized expectations exceed our capacity, the consequences can be dire.

    This wouldn't be such a problem if Big Data weren't so very, very big. But the amount of data that we have access to is enabling us to use even flawed models to produce what are often useful results. The trouble is that we frequently mistake those results for omniscience. We're falling in love with our own technology, and when the models fail it can be pretty ugly, especially when the mistakes all that data produces are correspondingly large.

    Part of the issue is oversimplification of the models computer programs are based on, rather than actual errors in their programming. For example, in early April 2011, Peter Lawrence's "The Making of a Fly," a classic work in developmental biology that many biologists consult regularly, was listed on Amazon.com (AMZN) as having 17 copies for sale: 15 used from $35.54, and two new from $23,698,655.93 (plus $3.99 shipping).

    The book, last published in 1992, is now out of print, but that doesn't quite explain the multimillion-dollar price tag. What had happened was that two automated pricing programs, one run by seller "bordeebook" and one by seller "profnath," were engaged in an iterative and incremental pricing war. Once a day profnath would set their price to 0.9983 times bordeebook's listed price. Several hours later, bordeebook would increase their price to 1.270589 times profnath's latest amount.
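
    To see why those two rules run away, it helps to multiply them: 0.9983 times 1.270589 is about 1.268, so every daily round lifts both prices by roughly 27 percent. The sketch below replays that loop in Python. Only the two multipliers come from the account above. The starting price of $100, the function names, and the assumption that each seller reprices exactly once per day are hypothetical, chosen purely for illustration.

```python
import math

# Multipliers quoted in the article.
PROFNATH_FACTOR = 0.9983       # profnath's price relative to bordeebook's
BORDEEBOOK_FACTOR = 1.270589   # bordeebook's price relative to profnath's


def simulate(start_bordeebook: float, rounds: int) -> list[tuple[float, float]]:
    """Return (profnath_price, bordeebook_price) after each daily round.

    Assumes one repricing by each seller per round (a simplification).
    """
    history = []
    bordeebook = start_bordeebook
    for _ in range(rounds):
        profnath = PROFNATH_FACTOR * bordeebook    # slightly undercut the rival
        bordeebook = BORDEEBOOK_FACTOR * profnath  # reprice above the rival
        history.append((profnath, bordeebook))
    return history


def rounds_to_reach(start: float, target: float) -> int:
    """Smallest number of rounds after which bordeebook's price is at least target."""
    growth = PROFNATH_FACTOR * BORDEEBOOK_FACTOR
    return math.ceil(math.log(target / start) / math.log(growth))


if __name__ == "__main__":
    start = 100.0            # hypothetical starting price, not from the article
    peak = 23_698_655.93     # listed price quoted in the article
    print(f"growth per round: {PROFNATH_FACTOR * BORDEEBOOK_FACTOR:.4f}")
    print(f"rounds from ${start:,.2f} to the listed peak: {rounds_to_reach(start, peak)}")
    for day, (p, b) in enumerate(simulate(start, 5), start=1):
        print(f"day {day}: profnath ${p:,.2f}  bordeebook ${b:,.2f}")
```

    Neither rule is unreasonable on its own terms. The runaway comes from the interaction the two programs' authors never modeled, and nothing in either program caps the price or checks it against what a buyer might plausibly pay.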

    It's a classic example of how unanticipated factors can foil even the best-prepared computer models, and it's not an isolated incident.

    For example, does this sound anything like the subprime mortgage crisis? Before 2008, the best minds with the best technology running the most advanced hypothetical scenarios completely missed the looming crisis and then failed to understand its severity. The more broadly a model is scoped, the more possibilities for error it includes. It sounds obvious, but we often miss the fact that those models are not, and never will be, as accurate as reality itself.
