The Blind Spots of Big-Data Prediction

Kurt Wagner 2013-04-28
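American statistician Nate Silver is a math whiz with a knack for using big data to make predictions. During last year's U.S. presidential election, he correctly called the winner in all 50 states. But he argues that big data is no cure-all: in some fields, such as earthquakes and the stock market, predictions succeed only rarely.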

    Is it hard to keep your own political beliefs separate from your work predicting elections?

    It's always hard for us to be objective in any walk of life. None of us has a monopoly on reality; we all have rather jaded points of view. I do think the sports training helps, though: I can be a Detroit Tigers fan, as I am [and was] growing up, and I still thought Mike Trout [Los Angeles Angels] should have won the MVP award last year. What I think differentiates politics a bit is that you have an industry full of people who not only have views but are [also] used to manipulating public opinion. They're used to thinking they can create their own reality. That's why I think you have such trouble on the uptake there.

    People think that, well, if I can spin a fact a certain way or spin polls a certain way, [the problem] goes away. When you have a political press where some people are very good, but some other people are very compliant and happy to pass along spin from the campaigns, I think that's the issue. People aren't used to getting a reality check in politics as much as in sports.

    So how are you able to sift through that information then to pick out the BS?

    The idea is to ignore what the politicians say and stick with publicly available data. The record shows that in general, most political observers tend to overrate the importance of a gaffe or a debate -- there are always exceptions -- but in general the polls provide a pretty reliable benchmark. And the public, who have real lives and are not constantly consuming political news, are [sometimes] weighing things in a very sophisticated way where they're looking at things like the economy or are we involved in any stupid wars or major scandals from the administration. Those are the things that explain a lot about who wins the elections and not so much the petty stuff that the political pundits can focus on.

    There is more data now than ever before. How are you able to determine which information to pull in order to properly answer your question?

    Part of it is that you do need -- as Vegas might say -- you do need a system instead of an ad hoc way of doing it. So we have a model that we designed in 2008 that was updated for 2012 that was designed to account for every single poll. Some polls, if they're from a pollster that has a better track record, get more weight in the system. It doesn't mean that others are ignored. So it's not like we're just looking at a poll and sticking our fingers up in the air and saying, "Oh that poll is important, and that poll's not." Basically all the hard work and all the decision-making process comes from designing this model before the fact. Based on theory and practice and past experience, what are a good set of rules for processing this information? And then sticking to that. We don't make any alterations to the model once we launch it in June every year, unless there's a bug, which fortunately there hasn't been. But the principles are always the same, and then you have a disciplined way to analyze data in that context.
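    The weighting idea described above (every poll counts, but pollsters with a better track record count for more) can be illustrated with a small sketch. This is not the model discussed in the interview, whose details are not given here. The `Poll` class, the `rating` score, and the sample numbers are all hypothetical, and a plain weighted average stands in for whatever rules the real system applies.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Poll:
    pollster: str
    margin: float  # candidate A minus candidate B, in percentage points
    rating: float  # hypothetical track-record score; higher means more trusted


def weighted_margin(polls: list[Poll]) -> float:
    """Average the poll margins, weighting each poll by its pollster's rating.

    No poll is discarded: a low rating only reduces a poll's influence.
    """
    if not polls:
        raise ValueError("need at least one poll")
    total_weight = sum(p.rating for p in polls)
    if total_weight <= 0:
        raise ValueError("ratings must sum to a positive number")
    return sum(p.margin * p.rating for p in polls) / total_weight


# Made-up polls, for illustration only.
polls = [
    Poll("Pollster A", margin=4.0, rating=1.0),
    Poll("Pollster B", margin=1.0, rating=0.5),
    Poll("Pollster C", margin=-2.0, rating=0.5),
]

simple_average = sum(p.margin for p in polls) / len(polls)
print(f"simple average:   {simple_average:+.2f}")
print(f"weighted average: {weighted_margin(polls):+.2f}")
```

    The point of fixing the rule in code ahead of time is the one made in the answer: the judgment calls go into designing the weights, not into deciding poll by poll which result to believe.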
