Disgraceful: OpenAI accused of cheating on an AI math benchmark

David Meyer

2025-01-25

OpenAI has not yet said whether it used its access to FrontierMath when training its o3 model, but its critics aren’t holding back.

OpenAI CEO Sam Altman. Photo: Stefano Guidi/Getty Images

Translator: 朴成奎

OpenAI may or may not be about to release something big and agentic.

According to a rather breathless Axios article on Sunday, an unidentified company is preparing “Ph.D.-level super-agents” that would be “a true replacement for human workers.” No names are named, but the article prominently notes that OpenAI CEO Sam Altman will give Trump administration officials a closed-door briefing at the end of the month.

It goes on to add: “Sources say this coming advancement is significant. Several OpenAI staff have been telling friends they are both jazzed and spooked by recent progress.” Those sources apparently come from “the U.S. government and leading AI companies.”

There’s more than a whiff of hype about all this. But Altman is no fan of such things, he claims. Addressing the separate but perhaps connected issue of OpenAI’s efforts to achieve “artificial general intelligence” (definitions differ, but this usually means AI with human- or superhuman-level capabilities), the CEO tweeted yesterday that “Twitter hype is out of control again” and “we are not gonna deploy AGI next month, nor have we built it.”

If he’s so anti-hype, Altman might want to take himself aside for tweeting, less than three weeks ago: “I have always wanted to write a six-word story. Here it is: Near the singularity; unclear which side.” A story, sure, but it also came across as a strong hint. (“The singularity” is a term referring to the inflection point where AI surpasses human intelligence.)

In yesterday’s tweet, Altman promised “We have some very cool stuff for you.” I’ve asked OpenAI whether it is the company that’s about to reveal “Ph.D.-level super-agents” and have received no response. But The Information reports that OpenAI will launch an agentic system called Operator, which can autonomously execute tasks on the user’s behalf, as soon as this month.

Whatever OpenAI does release, people should scrutinize it very closely, because the company has in recent days been caught up in a bit of a benchmarking scandal that raises questions about its performance claims.

The benchmark in question is FrontierMath, which was used in the demonstration of OpenAI’s flagship o3 model a month back. Curated by Epoch AI, FrontierMath contains only “new and unpublished” math problems, which is supposed to avoid the issue of a model being asked to solve problems that were included in its training dataset. Epoch AI says models such as OpenAI’s GPT-4 and Google’s Gemini only manage scores of less than 2%. In its demo, o3 scored a shade over 25%.

Problem is, it turns out that OpenAI funded the development of FrontierMath and apparently instructed Epoch AI not to tell anyone about this until the day of o3’s unveiling. After an Epoch AI contractor used a LessWrong post to complain that mathematicians contributing to the dataset had been kept in the dark about the link, Epoch associate director Tamay Besiroglu apologized, saying OpenAI’s contract had left the company unable to disclose the funding earlier.

“We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities,” Besiroglu wrote. “However, we have a verbal agreement that these materials will not be used in model training.”

OpenAI has not yet responded to a question about whether it nonetheless used its FrontierMath access when training o3—but its critics aren’t holding back. “The public presentation of o3 from a scientific perspective was manipulative and disgraceful,” the notable AGI skeptic Gary Marcus told my colleague Jeremy Kahn in Davos yesterday, adding that the presentation was “deliberately structured to make it look like they were closer to AGI than they actually are.

“OpenAI should be more transparent about what the business arrangements were [with Epoch AI] and the extent to which they were given a competitive advantage and the extent to which they trained directly or indirectly on materials they had access to and the extent to which they used data augmentation techniques on information they had access to,” Marcus said. “If they are not transparent, we should not take them seriously.”

That’s something to bear in mind over the coming weeks. And with that, here’s more on what has been a very busy few days on the AI news front.

AI IN THE NEWS

Trump scraps Biden’s AI order. On his first day back in office, President Donald Trump scrapped dozens of his predecessor’s policies, among them Biden’s 2023 Executive Order on Safe, Secure, and Trustworthy Development and Use of AI. Much of that particular order has already been carried out, such as the creation of an AI Safety Institute within the National Institute of Standards and Technology (NIST). But Trump’s move does mean that AI companies will no longer have to give the U.S. government safety-test results before releasing new models. It also means that the U.S. now has no significant federal AI rules, creating an enormous disparity with the EU in particular, and perhaps setting the stage for future EU-U.S. clashes over the issue of AI safety.

Whistleblower targets Amazon’s Covariant acquihire. An unnamed shareholder and former employee of Covariant AI, a company that makes AI for logistics robots, has complained to the U.S. authorities about Amazon’s recent deal with the company. As it announced last August, Amazon hired three Covariant founders and a quarter of its staff, while taking a nonexclusive license for its models. Per the Washington Post, the whistleblower claims the acquihire deal was worth $380 million (over three times the threshold for giving antitrust regulators a heads-up, which never happened) and also that its terms limited the licenses that Covariant could sell to others. An Amazon spokesperson responded: “Covariant continues to serve its dozens of customers, and because Amazon is licensing Covariant technology on a non-exclusive basis, Covariant is free to license its technology to other companies.”

Metropolis buys Oosto. Oosto, the Israeli AI facial recognition firm formerly known as AnyVision, has found a buyer. Metropolis, an AI company that helps parking operators provide checkout-free payment experiences, will pay $125 million of its stock in exchange for Oosto, according to TechCrunch. Oosto had raised some $380 million from investors. Oosto/AnyVision was a controversial outfit, partly because many people are generally uneasy about facial recognition, but also because the Israeli government used its software to surveil West Bank Palestinians.

British government details extensive AI plans. The U.K.’s Labour government said last week that it would “mainline AI into the veins” of the country’s economy, and now it’s detailed how the country’s public services will embrace the new technology. As part of an announcement around the digitization of services and better sharing of data between agencies, the government announced an AI toolkit for civil servants. The package is dubbed “Humphrey,” a witty reference to the classic TV show Yes Minister. The kit includes tools for rapidly parsing responses to public consultations, drawing on decades of parliamentary debate to “better manage bills” (reportedly by predicting how legislation will be received by lawmakers), and summarizing policies and laws.

EYE ON AI RESEARCH

Google pits Titans against transformers. There’s a lot of buzz around a new neural-network architecture that Google researchers have just announced. The Titans architecture provides the possibility of long-term, persistent neural memory that can act in concert with more short-term memory, of the sort that is associated with the transformer architecture that underpins today’s LLMs. This would be useful for building agents. According to Google’s researchers, the new architecture is “more effective” than transformers at “common-sense reasoning” and other tasks, specifically those that involve handling large amounts of information. However, the big question now is what the compute requirements look like.

Meta claims Babel Fish breakthrough. Meta’s researchers have announced a system called Massively Multilingual and Multimodal Machine Translation, or SEAMLESSM4T, that can translate spoken words into other languages without the need to convert the recording to text and back again (though it can do that too). They suggest this is a big step towards the creation of something like the Babel Fish, a universal translator (and fish) that makes it possible for characters in Douglas Adams’s Hitchhiker’s Guide to the Galaxy to communicate with other species. According to the researchers, SEAMLESSM4T is far better at rejecting background noise than comparable systems.

AI CALENDAR

Feb. 10-11: AI Action Summit, Paris, France

March 3-6: MWC, Barcelona

March 7-15: SXSW, Austin

March 10-13: Human [X] conference, Las Vegas

March 17-20: Nvidia GTC, San Jose

April 9-11: Google Cloud Next, Las Vegas

BRAIN FOOD

Reasoning models flourish in China. In the push for better AI “reasoning” models, all eyes are currently on China thanks to a couple of notable announcements.

First up: DeepSeek-R1. Hangzhou-based DeepSeek released its V3 model, currently considered by some to be the best open-source AI model out there (sorry, Meta), just before Christmas. R1 was used to train V3, and DeepSeek claims R1 can just about match OpenAI’s o1 “across math, code, and reasoning tasks.” Benchmarking suggests this is true, providing a serious competitor to o1 that is much cheaper to run.

DeepSeek has now open-sourced a version of R1 called R1-Zero, which it says “encounters challenges such as endless repetition, poor readability, and language mixing,” as well as R1 itself, which apparently doesn’t. Perhaps because both are enormous, it has also transferred (or “distilled”) knowledge from them to versions of Meta’s Llama and Alibaba’s Qwen models, and open-sourced those too.

Meanwhile, China’s Moonshot AI just announced Kimi k1.5, a model that can reason over both text and vision modalities, and that Moonshot also claims is comparable to o1. It says the new version of the model will soon power its popular Kimi chatbot.
