

Katherine Noyes, July 3, 2014












    There are countless open source projects with crazy names in the software world today, but the vast majority of them never make it onto enterprises’ collective radar. Hadoop is an exception of pachydermic proportions.

    Named after a child’s toy elephant, Hadoop is now powering big data applications at companies such as Yahoo and Facebook; more than half of the Fortune 50 use it, providers say.

    The software’s “refreshingly unique approach to data management is transforming how companies store, process, analyze and share big data,” according to Forrester analyst Mike Gualtieri. “Forrester believes that Hadoop will become must-have infrastructure for large enterprises.”

    Globally, the Hadoop market was valued at $1.5 billion in 2012; by 2020, it is expected to reach $50.2 billion.

    It’s not often a grassroots open source project becomes a de facto standard in industry. So how did it happen?

    ‘A market that was in desperate need’

    “Hadoop was a happy coincidence of a fundamentally differentiated technology, a permissively licensed open source codebase and a market that was in desperate need of a solution for exploding volumes of data,” said RedMonk cofounder and principal analyst Stephen O’Grady. “Its success in that respect is no surprise.”

    Created by Doug Cutting and Mike Cafarella, the software—like so many other inventions—was born of necessity. In 2002, the pair were working on an open source search engine called Nutch. “We were making progress and running it on a small cluster, but it was hard to imagine how we’d scale it up to running on thousands of machines the way we suspected Google was,” Cutting said.

    Shortly thereafter, Google published a series of academic papers on its own Google File System and MapReduce infrastructure systems, and “it was immediately clear that we needed some similar infrastructure for Nutch,” Cafarella said.

    “The way Google was approaching things was different and powerful,” Cutting explained. Whereas until then “you had to build a special-purpose system for each distributed thing you wanted to do,” Google’s approach offered a general-purpose automated framework for distributed computing. “It took care of the hard part of distributed computing so you could focus just on your application,” Cutting said.

    Both Cutting and Cafarella (who are now chief architect at Cloudera and University of Michigan assistant professor of computer science and engineering, respectively) knew they wanted to make a version of their own—not just for Nutch, but for the benefit of others as well—and they knew they wanted to make it open source.

    “I don’t enjoy the business aspects,” Cutting said. “I’m a technical guy. I enjoy working on the code, tackling the problems with peers and trying to improve it, not trying to sell it. I’d much rather tell people, ‘It’s kind of OK at this; it’s terrible at that; maybe we can make it better.’ To be able to be brutally honest is really nice—it’s much harder to be that way in a commercial setting.”

    But the pair knew that the potential upside of success could be staggering. “If I was right and it was useful technology that lots of people wanted to use, I’d be able to pay my rent—and without having to risk my shirt on a startup,” Cutting said.

    For Cafarella, “Making Nutch open source was part of a desire to see search engine technology outside the control of a few companies, but also a tactical decision that would maximize the likelihood of getting contributions from engineers at big companies. We specifically chose an open source license that made it easy for a company to contribute.”
