Yahoo Releases a Massive Data Trove in Troubled Times
Chinese translation: 刘进龙/汪皓; reviewed by 任文科 (财富中文网)
In the era of big data, where researchers truly need massive amounts of information from many sources, more really is more. Extremely big data sets are needed to test new academic theories and to replicate the results of already-proposed theories. So Thursday’s announcement that Yahoo Labs is releasing 13.5 terabytes of data culled from 20 million readers of Yahoo News, Finance, Sports, and other sites over four months was a big deal for academics and big data heads, who will now be able to slice and dice it. But this data can also bring advantages to mere mortals who don’t care about big data or machine learning, a technology that enables computers to recognize patterns and use algorithms to “learn” from the data they examine. For example, research using this data could lead to a news page perfectly tailored to users’ own interests: one that shows their team’s scores and injury reports, reviews of their favorite author’s new book, and real estate postings for areas they’re interested in. As Suju Rajan, Yahoo Labs’ director of research for personalization science, puts it, making content more personally appealing is a good thing. “I’m in Austin and a Longhorn fan, my husband likes the Houston Rockets. When he goes to Yahoo he wants to see what is most useful to him, I want to see what’s most useful to me,” Rajan said. Yahoo News is already somewhat customized, based on a combination of user-provided preferences and inferred preferences gleaned from the user’s reading behavior. Because Yahoo hosts so many big sites (Yahoo News, Sports, Finance, and more), it has lots of content and many users viewing that content, and that’s a valuable combo. Rajan is careful to note that users had to opt in to participate in the data-gathering process and that all personally identifiable information (PII) was stripped out. In her post, Rajan called the data trove the “largest-ever machine learning data set” offered to researchers.
It comprises 110 billion “events” or records culled from reader interactions with Yahoo sites from February to May 2015. Yahoo Labs would like this data set to become the benchmark for gauging the performance of machine learning algorithms going forward, she said. The data is offered as part of Yahoo Labs’ existing Webscope program, which releases anonymized user data for non-commercial use, according to the post. Companies like Yahoo, Facebook, and Google all collect massive amounts of user data. Being able to claim leadership by providing the biggest and best public data at the very least gives Yahoo bragging rights. Two years ago, for example, Google offered up the GDELT data set, comprising a quarter of a billion records, to anyone wanting to run queries on it using Google’s BigQuery tool. When that happened, Google billed GDELT as the world’s largest data set. For all the business problems that chief executive Marissa Mayer is trying to solve, Yahoo has always had ambitious and cool technology. For instance, the company contributed mightily to Hadoop, the popular open-source framework for storing and processing distributed data. Projects and contributions like this data set may be one way for Yahoo to prove it still has the wherewithal to do great work.