Google Unit Releases New Video Generator That Tops Sora on Resolution

DAVID MEYER

2024-12-19

DeepMind's new Veo 2 AI video generator outdoes OpenAI's Sora model with 4K resolution.

Translator: 中慧言-王芳

Just seven months after it unveiled its Veo AI video generator, Alphabet division Google DeepMind has announced Veo 2.

The new tool can generate videos of up to 4K resolution, whereas the first Veo could only handle up to 1080p. Google is claiming improvements in the physics of the scenes that the upgraded Veo generates, as well as better “camera control” (there is no real camera involved, but the user can prompt the model for specific camera shots and angles, from close-ups to pans to “establishing shots”).

DeepMind also announced an updated version of its Imagen 3 text-to-image model, though the changes—like “more compositionally balanced” images and improved adherence to artistic styles—clearly aren’t big enough to warrant a full new version number. Imagen 3 first rolled out in August.

Veo 2’s step up to 4K suggests DeepMind is pulling ahead of rival AI labs in video generation.

OpenAI finally released its Sora video generator a week ago, after having unveiled it all the way back in February, but the output of Sora (specifically, the Sora Turbo version that is now available to ChatGPT Plus and Pro users) remains limited to a maximum resolution of 1080p. Runway, which is perhaps the most popular of the current AI video generators, can only export at an even fuzzier 720p.

“Low resolution video is great for mobile, but creators want to see their work shine on the big screen,” Google said in a presentation on Veo 2.

Veo 2’s 4K clips are limited to eight seconds by default, but they can be extended to two minutes or more, said a Google spokesperson. Sora’s 1080p clips are capped at 20 seconds.

DeepMind claims that, when comparing Veo 2 to Sora Turbo, 59% of human raters preferred Google’s service, with 27% opting for Sora Turbo. It also claims similar victories against Minimax and Meta’s Movie Gen, with Veo 2 preference only slipping slightly below 50% when the rival was Kling v1.5, a service from China’s Kuaishou Technology.

When it comes to “prompt adherence”—i.e. doing what it was asked to do—Veo 2 was preferred at similar rates, according to DeepMind.

The Google unit also claims to have made significant strides in combating “hallucinated” details, like bonus fingers, and in demonstrating “a better understanding of real-world physics and the nuances of human movement and expression.”

The physics issue is one that continues to bedevil video generators. Sora, for example, struggles to generate plausible footage of gymnasts and their complex movements. It remains to be seen how much better Veo 2 will prove in this regard.

Some, like Stanford professor and World Labs co-founder Fei-Fei Li, argue that issues like physics and object permanence can only really be solved with so-called world models that have the “spatial intelligence” to understand and generate 3D environments. Google unveiled its own Genie 2 world model earlier this month, but with a focus on generating environments that can be used to train and evaluate AI “agents” that operate in virtual environments.

The more plausible the output of image and video generators, the greater the risk of them being used for nefarious purposes. DeepMind applies invisible SynthID watermarks to Veo 2 clips, which should make it more difficult to use them for political disinformation, if people are checking videos for such telltale signs of AI origins. The same may not hold true for more mundane fraudulent applications, where victims would be less likely to check the file for invisible watermarks.

By way of contrast, OpenAI’s Sora embeds a visible animation in the bottom right corner of its videos. Sora also uses the open-source C2PA watermarking protocol, an alternative system to SynthID (though Google also joined the C2PA initiative in February).

Veo 2 is now powering Google Labs’ VideoFX generation tool (which has a resolution cap of 720p), while the revised Imagen 3 model can now be used in the ImageFX tool. VideoFX is currently only rolling out in the U.S., but ImageFX is available in over 100 countries.

Google DeepMind has not said what data was used to train Veo 2 or the new version of Imagen 3, though it previously hinted that YouTube videos made up some of the training data for the original Veo (both Google DeepMind and YouTube fall under the Alphabet umbrella).

Many artists, photographers, creators and filmmakers are concerned their copyrighted works have been used to train such systems without their consent. OpenAI has refused to say what data was used to train Sora, but the New York Times, citing sources familiar with Sora’s training, has reported that the company used videos from Google’s YouTube service to train the AI model. 404 Media has previously reported that Runway also seems to have used YouTube videos to train Gen 3 Alpha.

ImageFX is not available in Germany, where this writer is based. However, a Google DeepMind spokesperson denied that this had anything to do with the EU’s new AI Act, which demands that Big Tech firms provide a detailed summary of what copyright-protected data they use to train their AI models. “We often ramp up experiments in one or limited markets before expanding more broadly,” they said.
