
ChatGPT is getting too much wrong, and it's causing real harm

Jeremy Kahn
2023-03-03

Engines built on large language models remain far from perfect, because they have a tendency to make things up.


Microsoft CEO Satya Nadella. The company has had to limit the number of exchanges users can have with its new OpenAI-powered Bing chat feature to keep the chatbot from going off the rails and turning into an unsettling persona that calls itself Sydney. Image credit: SEONGJOON CHO—BLOOMBERG VIA GETTY IMAGES

City News Bureau of Chicago, a now-defunct news outfit once legendary as a training ground for tough-as-nails, shoe-leather reporters, famously had as its unofficial motto: "If your mother says she loves you, check it out." Thanks to the advent of ChatGPT, the new Bing Search, Bard, and a host of copycat search chatbots based on large language models, we are all going to have to start living by City News' old shibboleth.

Researchers already knew that large language models were imperfect engines for search queries, or any fact-based request really, because of their tendency to make stuff up (a phenomenon A.I. researchers call "hallucination"). But the world's largest technology companies have decided that the appeal of dialogue as a user interface—and the ability of these large language models to perform a vast array of natural language-based tasks, from translation to summarization, along with the potential to couple these models with access to other software tools that will enable them to perform tasks (whether it is running a search or booking you theater tickets)—trumps the potential downsides of inaccuracy and misinformation.

Except, of course, there can be real victims when these systems hallucinate—or even when they don't, but merely pick up something that is factually wrong from their training data. Stack Overflow had to ban users from submitting answers to coding questions that were produced using ChatGPT after the site was flooded with code that looked plausible but was incorrect. The science fiction magazine Clarkesworld had to stop taking submissions because so many people were submitting stories crafted not by their own creative genius, but by ChatGPT. Now a German company called OpenCage—which offers an application programming interface that does geocoding, converting physical addresses into latitude and longitude coordinates that can be placed on a map—has said it has been dealing with a growing number of disappointed users who have signed up for its service because ChatGPT erroneously recommended its API as a way to look up the location of a mobile phone based solely on the number. ChatGPT even helpfully wrote Python code for users allowing them to call on OpenCage's API for this purpose.

But, as OpenCage was forced to explain in a blog post, this is not a service it offers, nor one that is even feasible using the company's technology. OpenCage says that ChatGPT seems to have developed this erroneous belief because it picked up on YouTube tutorials in which people also wrongly claimed OpenCage's API could be used for reverse mobile phone geolocation. But whereas those erroneous YouTube tutorials only convinced a few people to sign up for OpenCage's API, ChatGPT has driven people to OpenCage in droves. "The key difference is that humans have learned to be skeptical when getting advice from other humans, for example via a video coding tutorial," OpenCage wrote. "It seems though that we haven't yet fully internalized this when it comes to AI in general or ChatGPT specifically." I guess we better start internalizing.
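To make the OpenCage confusion concrete, here is a minimal sketch of what the company's geocoding API is actually for: you send it an address, and it returns coordinates you can place on a map. The endpoint URL, query parameters, and response fields below follow OpenCage's public documentation as I understand it, and the API key and address are placeholders, so treat the details as an illustrative assumption rather than verified integration code.

    import requests  # third-party HTTP client

    API_KEY = "YOUR_OPENCAGE_API_KEY"  # placeholder; a real key comes with an OpenCage account

    def geocode(address):
        """Forward-geocode a street address to a (latitude, longitude) pair, or None."""
        resp = requests.get(
            "https://api.opencagedata.com/geocode/v1/json",  # documented geocoding endpoint (assumed)
            params={"q": address, "key": API_KEY},
            timeout=10,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            return None  # nothing matched the query text
        geometry = results[0]["geometry"]
        return geometry["lat"], geometry["lng"]

    if __name__ == "__main__":
        # The query is free text for an address or place name. There is no parameter
        # that accepts a phone number: a geocoder maps written locations to
        # coordinates, it does not locate handsets.
        print(geocode("10 Downing Street, London"))

Nothing in that interface takes a phone number as input, which is exactly why the users who signed up on ChatGPT's recommendation came away disappointed.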

Meanwhile, after a slew of alarming publicity about the dark side of its new, OpenAI-powered Bing chat feature—where the chatbot calls itself Sydney, becomes petulant, and at times even downright hostile and menacing—Microsoft has decided to restrict the length of conversations users can have with Bing chat. But as I, and many others, have found, while this arbitrary restriction on the length of a dialogue apparently makes the new Bing chat safer to use, it also makes it a heck of a lot less useful.

For instance, I asked Bing chat about planning a trip to Greece. I was in the process of trying to get it to detail timings and flight options for an itinerary it had suggested when I suddenly hit the "Oops, I think we've reached the end of this conversation. Click 'New topic,' if you would!"

The length restriction is clearly a kluge that Microsoft has been forced to implement because it didn't do rigorous enough testing of its new product in the first place. And there are huge outstanding questions about exactly what Prometheus, the name Microsoft has given to the model that powers the new Bing, really is, and what it is really capable of (no one is claiming the new Bing is sentient or self-aware, but there's been some very bizarre emergent behavior documented with the new Bing, even beyond the Sydney personality, and Microsoft ought to be transparent about what it understands and doesn't understand about this behavior, rather than simply pretending it doesn't exist). Microsoft has been cagey in public about how it and OpenAI created this model. No one outside of Microsoft is exactly sure why it is so prone to taking on the petulant Sydney persona, especially when ChatGPT, based on a smaller, less capable large language model, seems so much better behaved—and again, Microsoft is saying very little about what it does know.

(Earlier research from OpenAI had found that it was often the case that smaller models, trained with better quality data, produced results that human users much preferred even though they were less capable when measured on a number of benchmark tests than larger models. That has led some to speculate that Prometheus is OpenAI's GPT-4, a model believed to be many times more massive than any it has previously debuted. But if that is the case, there is still a real question about why Microsoft opted to use GPT-4 rather than a smaller, but better-behaved system to power the new Bing. And frankly, there is also a real question about why OpenAI might have encouraged Microsoft to use the more powerful model if it in fact realized it had more potential to behave in ways that users might find disturbing. The Microsoft folks may have, like many A.I. researchers before them, become blinded by stellar benchmark performance that can convey bragging rights among other A.I. developers, but which are a poor proxy for what real human users want.)

What is certain is that if Microsoft doesn't fix this soon—and if someone else, such as Google, which is hard at work trying to hone its search chatbot for imminent release, or any of the others, including startups such as Perplexity and You.com, that have debuted their own chatbots, shows that their chatbot can hold long dialogues without it turning into Damien—then Microsoft risks losing its first mover advantage in the new search wars.

Also, let's just take a moment to appreciate the irony that it's Microsoft, a company that once prided itself, not without reason, on being among the most responsible of the big technology companies, which has now tossed us all back to the bad old "move fast and break things" days of the early social media era—with perhaps even worse consequences. (But I guess when your CEO is obsessed with making his arch-rival "dance" it is hard for the musicians in the band to argue that maybe they shouldn't be striking up the tune just yet.) Beyond OpenCage, Clarkesworld, and Stack Overflow, people could get hurt from incorrect advice on medicines, from abusive Sydney-like behavior that drives someone to self-harm or suicide, or from reinforcement of hateful stereotypes and tropes.

I've said this before, but I'll say it again: Given these potential harms, now is the time for governments to step in and lay down some clear regulation about how these systems need to be built and deployed. The idea of a risk-based approach, such as that broached in the original draft of the European Union's proposed A.I. Act, is a potential starting point. But the definitions of risk and those risk assessments should not be left entirely up to the companies themselves. There need to be clear external standards and clear accountability if those standards aren't met.

