Meta launched a new web crawler last month to scrape data from the internet for AI training

KALI HAYS

2024-08-24

Meta CEO Mark Zuckerberg is betting big on artificial intelligence.

Photo credit: JASON HENRY/BLOOMBERG VIA GETTY IMAGES

Chinese translation by 中慧言-王芳

Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI models.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, such as the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

Meta, the parent company of Facebook, Instagram, and WhatsApp, updated a corporate website for developers in late July with a tab disclosing the existence of the new scraper, according to a version history found using the Internet Archive. Besides updating the page, Meta has not publicly announced the new crawler.

A Meta spokesman said the company has had a crawler under a different name, Facebook External Hit, “for years,” although this crawler “has been used for different purposes over time, like sharing link previews.”

“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesman said. “We recently updated our guidance regarding the best way for publishers to exclude their domains from being crawled by Meta’s AI-related crawlers.”

Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others, who say AI companies used their content and intellectual property without their consent. Some AI companies like OpenAI and Perplexity have struck deals in recent months that pay content providers for access to their data (Fortune was among several news providers that announced a revenue-sharing deal with Perplexity in July).

Flying under the radar

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

To try to block a web scraper, a website must deploy robots.txt, a plain-text file placed at the root of the site that signals to a scraper bot that it should ignore some or all of that site’s content. Typically, however, the file must name the specific scraper bot for the rule to apply to it, which is difficult if the name has not been openly disclosed. An operator of a scraper bot can also simply choose to ignore robots.txt, which is not enforceable or legally binding in any way.
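As a rough illustration of how such a block is expressed and checked, the sketch below parses a hypothetical robots.txt with Python's standard-library `urllib.robotparser` and asks whether a given crawler may fetch a page. The file contents, the example.com URL, and the `is_allowed` helper are illustrative assumptions. The user-agent token `meta-externalagent` is also an assumption, since the article does not give the exact string the bot announces; site owners should confirm it against Meta's own crawler documentation. The check only models what a compliant crawler would do, since a bot can simply ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only. The token
# "meta-externalagent" is an assumption; "GPTBot" is named in the article.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant crawler with this
    user agent would be permitted to fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/article.html"
    for agent in ("GPTBot", "meta-externalagent", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "blocked"
        print(f"{agent}: {verdict}")
```

Under these assumptions, the two named bots are blocked while a crawler not listed by name falls through to the wildcard rule and is allowed, which is why a bot whose name has not been disclosed is hard to exclude individually.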

Such scrapers are used to pull massive amounts of data and written text from the web, to be used as training data for generative AI models, also referred to as large language models or LLMs, and related tools. Meta’s Llama is one of the largest LLMs available, and it powers things like Meta AI, an AI chatbot that now appears on various Meta platforms. While the company did not disclose the training data used for the latest version of the model, Llama 3, its initial version of the model used large data sets put together by other sources, like Common Crawl.

Earlier this year, Mark Zuckerberg, Meta’s co-founder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

The existence of the new crawler suggests, however, that Meta’s vast trove of data may no longer be enough as the company continues to work on updating Llama and expanding Meta AI. LLMs typically need new, high-quality training data to keep improving in functionality. Meta is on track to spend up to $40 billion this year, mostly on AI infrastructure and related costs.
