AI's focus on English puts many countries at a disadvantage

David Meyer
2025-02-13

A new EU project aims to fix that for 32 languages

Image credit: Jakub Porzycki/NurPhoto via Getty Images

An ambitious new AI project has begun to take shape in Europe, with the aim of developing open-source AI models that support the region’s 24 official languages and more—while also complying as much as possible with its thicket of digital legislation.

The OpenEuroLLM project, which commenced work at the start of the month, has a budget of just €37.4 million ($38.6 million): a pittance compared with the sums being invested in other AI-related projects like the $100 billion first tranche of the U.S.’s Stargate AI infrastructure project. Although participating companies such as Germany’s Aleph Alpha and Finland’s Silo AI are also contributing their researchers’ time to an equivalent value, the bulk of the funding comes from the European Commission.

EU-funded projects don’t tend to move fast, and this one has a three-year road map in a sector that’s currently undergoing significant evolution each month. But organizers and participants tell Fortune that it could be possible to deliver an intermediate model within a year—and the effort will be worth it.

Speaking in tongues

“Most model development efforts that have worldwide visibility focus on the English language,” said Yasser Jadidi, chief research officer at Aleph Alpha. “It’s a consequence of most of the internet text data that is available and accessible being in English, and it puts other languages at a disadvantage.”

For people in places like Sweden or Turkey, the lack of AI models that understand the intricacies of their languages can be a serious problem. (The OpenEuroLLM project is also targeting the languages of eight countries that have applied for EU membership, so it encompasses 32 languages in total.) For a start, the gap makes it harder for local companies and public authorities to adopt the technology and start providing new services.

“It’s first and foremost a commercial question,” said Peter Sarlin, the CEO of Silo AI, Europe’s largest private AI lab, which was acquired by AMD last year and is participating in OpenEuroLLM. “Are there models that are performant in that specific low-resource language, be it Albanian or Finnish or Swedish or some other, that allows companies within that region to eventually build services on top?”

The issue also has consequences for evaluating the accuracy and safety of AI models in the local context, Jadidi said. Indeed, Aleph Alpha’s role in the project is chiefly to provide AI-model evaluation benchmarks that aren’t simply machine-translated from English, as most are.

The OpenEuroLLM project may have relatively meager funding, but it isn’t starting from scratch.

Most of its participants have already been involved in a separate scheme called High Performance Language Technologies (HPLT), which started two years ago with a budget of just €6 million. The original proposal was for HPLT to deliver AI models, but then OpenAI’s ChatGPT changed the AI landscape and the organizers pivoted to creating a high-quality dataset that can be used to train multilingual models. The HPLT dataset is currently being “cleaned” of errors, and it will form the basis of OpenEuroLLM’s work.

OpenEuroLLM will create a base model trained on a dataset of all the European languages. Once that’s done, yet another EU-funded project, called LLMs4EU, will fine-tune it for various applications. Apart from cash, the EU is also providing computational resources to all these schemes.

Sticking to the rules

Europe is not the easiest place for AI companies to do business. Quite apart from the AI Act that is gradually coming into force, placing all sorts of reporting responsibilities on model providers and their customers, there’s also copyright and competition law to consider—and the General Data Protection Regulation (GDPR), which places strict limits on the personal data that AI companies can use.

These laws have had real effects on AI’s European progress, with Meta delaying the rollout of Meta AI because of GDPR limits, and Apple also delaying the deployment of Apple Intelligence because of unspecified antitrust issues. (Apple Intelligence will come to EU iPhones in limited form in April, while Meta has started offering some Meta AI features to European wearers of its smart glasses.)

As far as OpenEuroLLM’s organizers are concerned, these laws are manageable. “We believe we can live with all of them,” said Jan Hajič of Charles University in Czechia, who is co-leading the project with Sarlin.

Hajič said the participants had already dealt with the copyright and most privacy issues when developing the HPLT dataset. “The GDPR could be a problem, but that’s something we are trying to get around with pseudonymizing the data, meaning that if we encounter people’s names it gets deleted,” he said, while acknowledging that the necessary automation in this process may not have a 100% success rate.

“Our goal is to do things in such a way that they will not clash with the European regulation in any way,” Hajič said, adding that this could be a draw for companies wanting to target EU markets. For high-risk use cases that will require a lot of reporting to the EU authorities under the AI Act, the open-source approach will be essential for the transparency it allows, he argued.

The OpenEuroLLM project has 20 participants, including companies, research institutions, and high-performance computing clusters like Finland’s Lumi. Such a large consortium could be seen as a liability, given the potential for diverging priorities, but Aleph Alpha’s Jadidi argued that open-source projects often include a wide array of participants without being dragged down.

“We have all the opportunity to ensure that a high amount of contributors is not a hindrance but an opportunity,” he said.

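The intellectual property rights to content published by Fortune China (财富中文网) are exclusively owned or held by Fortune Media IP Limited and/or the relevant rights holders. Reproduction, excerpting, copying, mirroring, or any other use without permission is prohibited.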