When Will Sora Be Released? Study Notes on OpenAI's Sora

When will Sora be released?

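Sora is a model that OpenAI announced only recently. It is not yet open to all users; so far OpenAI has invited only some industry professionals, artists, and similar testers to try it. Judging by how quickly OpenAI has rolled out new ChatGPT features in the past, though, it will probably be opened to everyone fairly soon.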

Teacher Wan's study summary:

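(1) Sora is a diffusion transformer. Sora is a diffusion model: it generates a video by starting from one that looks like static noise and gradually transforming it by removing the noise over many steps. Sora can generate an entire video at once or extend a generated video to make it longer. By giving the model foresight of many frames at a time, OpenAI addressed a challenging problem: keeping a subject the same even when it temporarily goes out of view. Like the GPT models, Sora uses a transformer architecture, which unlocks superior scaling performance.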

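(2) Judging from the report, Sora does not rely on many fancy tricks; it appears to be mainly a combination of a VAE encoder, a ViT, DDPM, and a VAE decoder. Sora builds on past research on DALL·E and the GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model follows the user's text instructions in the generated video more faithfully. Besides generating a video solely from text instructions, the model can take an existing still image and generate a video from it, animating the image's contents accurately and with attention to small detail. The model can also take an existing video and extend it or fill in missing frames. Sora uses visual patches as its tokens, rather than the text tokens used by large language models. Patches, which earlier work showed to be effective in vision models, turn out to be highly scalable and efficient for training generative models on diverse videos and images.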

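(3) "Emergence", that is, quantitative change leading to qualitative change. When trained at scale, Sora can simulate some aspects of people, animals, and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, and so on; they are purely phenomena of scale.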

Takeaways

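(1) In science, engineering, and programming, "learning" cannot just mean drilling exercises and memorizing formulas, theorems, and code, and it cannot stop at surface-level theory and knowledge. The real task is more advanced: to understand, one level deeper, the methods and principles behind them, and the rules and models behind AI. Only then can we know AI's characteristics and weaknesses and stay above AI rather than beneath it. So in the future we need to study science, engineering, and programming even more, and study them at a higher level.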

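(2) I always remember what a student once said to me: "Do you know why we respect you so much? Because you never treat any student differently." In truth there are no "poor students" or "top students" in this world, only those who know how to learn and those who do not. "Those who make use of carriages and horses do not have faster feet, yet they travel a thousand li; those who make use of boats and oars are not good swimmers, yet they cross great rivers. The gentleman is not different by birth; he is good at making use of things." Sora, released on February 15, is quite astonishing. Its text-to-video prompts follow the principles of prompt engineering. In this era, only people who use AI well will not be replaced by others. You may want to look at our book 《AI提示工程基础·应用·实例》 (AI Prompt Engineering: Fundamentals, Applications, Examples), which is also available on WeChat Read.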

Appendix: full text of the technical report

https://openai.com/research/video-generation-models-as-world-simulators

Video generation models as world simulators

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks, generative adversarial networks, autoregressive transformers, and diffusion models. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

Turning visual data into patches

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data. The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches.

Video compression network

We train a network that reduces the dimensionality of visual data. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
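
Reader's note, not part of the report: the report gives no compression ratios, so the sketch below only illustrates what "compressed both temporally and spatially" means for tensor shapes. The names `VideoShape` and `latent_shape` and the factors 4 and 8 are assumptions made up for this example.

```python
from typing import NamedTuple


class VideoShape(NamedTuple):
    frames: int
    height: int
    width: int


def latent_shape(video: VideoShape, time_factor: int, space_factor: int) -> VideoShape:
    """Shape of a latent after compressing time and space by fixed factors.

    Sizes that do not divide evenly are rounded up, as if the input were padded.
    """
    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    return VideoShape(
        frames=ceil_div(video.frames, time_factor),
        height=ceil_div(video.height, space_factor),
        width=ceil_div(video.width, space_factor),
    )


# Hypothetical factors: 4x in time, 8x along each spatial axis.
clip = VideoShape(frames=120, height=1080, width=1920)
print(latent_shape(clip, time_factor=4, space_factor=8))
```

Under these made-up factors the generator would work on a grid with 4 x 8 x 8 = 256 times fewer positions than the raw pixels, which is the practical reason to diffuse in latent space.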

Spacetime Latent Patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.
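
Reader's note, not part of the report: a minimal sketch of cutting a latent video into spacetime patches, assuming non-overlapping blocks and a single channel. The function name `spacetime_patches` and the patch sizes are hypothetical; the report does not state how Sora's patches are actually formed.

```python
from typing import List

Video = List[List[List[float]]]  # indexed as video[t][y][x]


def spacetime_patches(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Cut a (T, H, W) array into non-overlapping pt x ph x pw blocks.

    Each block is flattened into one token. T, H and W must be multiples of
    pt, ph and pw here; a real system would need padding or another policy.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video size must be a multiple of the patch size")
    tokens = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                tokens.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return tokens


# A 2-frame 4x4 "latent video" whose values encode their own position.
video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)] for t in range(2)]
tokens = spacetime_patches(video, pt=2, ph=2, pw=2)
print(len(tokens), len(tokens[0]))

# An image is a one-frame video, so the same function applies with pt=1.
image = [[[float(y * 4 + x) for x in range(4)] for y in range(4)]]
print(len(spacetime_patches(image, pt=1, ph=2, pw=2)))
```

Because the output is just a sequence of tokens, inputs of different durations and aspect ratios only change the sequence length, not the model interface.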

Scaling transformers for video generation

Sora is a diffusion model; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision and image generation.
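
Reader's note, not part of the report: the training target described above (noisy patch in, clean patch out) can be sketched as follows. The DDPM-style mixing rule with `alpha_bar`, the mean-squared-error loss, and all function names are assumptions; the report does not give the noise schedule or loss, and text conditioning is left out for brevity.

```python
import math
import random
from typing import Callable, List

Patch = List[float]
Denoiser = Callable[[Patch, float], Patch]


def add_noise(clean: Patch, alpha_bar: float, rng: random.Random) -> Patch:
    """Mix a clean patch with Gaussian noise.

    alpha_bar = 1 keeps the patch unchanged; alpha_bar = 0 gives pure noise.
    """
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in clean]


def clean_patch_loss(model: Denoiser, clean: Patch, alpha_bar: float,
                     rng: random.Random) -> float:
    """Mean squared error between the model's prediction and the clean patch."""
    noisy = add_noise(clean, alpha_bar, rng)
    predicted = model(noisy, alpha_bar)
    return sum((p - c) ** 2 for p, c in zip(predicted, clean)) / len(clean)


def identity_model(noisy: Patch, alpha_bar: float) -> Patch:
    """Toy stand-in for the network: returns its input unchanged."""
    return noisy


rng = random.Random(0)
clean = [0.5, -1.0, 2.0, 0.0]
for alpha_bar in (1.0, 0.9, 0.1):
    print(alpha_bar, clean_patch_loss(identity_model, clean, alpha_bar, rng))
```

The identity model scores a loss of zero only when no noise was added; a trained network would be judged by how far it pushes the loss down at the noisier settings.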

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop or trim videos to a standard size – e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.

Sampling flexibility

Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution, all with the same model.
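
Reader's note, not part of the report: one way to picture "the same model at different sizes" is that only the shape of the patch grid changes. The sketch below assumes each patch covers a fixed pixel footprint; `patch_frames=4` and `patch_px=30` are made-up values chosen so the arithmetic is clean, not Sora's real settings.

```python
from typing import Tuple

Grid = Tuple[int, int, int]  # (time steps, rows, columns) of patches


def patch_grid(frames: int, height: int, width: int,
               patch_frames: int = 4, patch_px: int = 30) -> Grid:
    """Grid of patches needed to cover a video of the given size.

    Partial patches at the edges are rounded up.
    """
    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    return (ceil_div(frames, patch_frames),
            ceil_div(height, patch_px),
            ceil_div(width, patch_px))


def token_count(grid: Grid) -> int:
    t, rows, cols = grid
    return t * rows * cols


sizes = {"widescreen": (1080, 1920), "vertical": (1920, 1080), "preview": (270, 480)}
for name, (height, width) in sizes.items():
    grid = patch_grid(frames=60, height=height, width=width)
    print(name, grid, token_count(grid))
```

With these assumed sizes the widescreen and vertical grids hold the same number of tokens in a different arrangement, while the low-resolution preview needs far fewer, which is consistent with the report's point about prototyping at lower sizes first.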

Improved framing and composition

We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.

Language understanding

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.

Prompting with images and videos

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2 and DALL·E 3 images.

Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.

We can use this method to extend a video both forward and backward to produce a seamless infinite loop.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit, to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.
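
Reader's note, not part of the report: the general idea behind SDEdit is to add noise to the input up to an intermediate level and then run the denoising process from there, so the result keeps the input's coarse structure while the (text-conditioned) denoiser fills in new detail. The sketch below assumes a simple additive noise rule and a linear noise schedule; `sdedit`, `denoise_step`, `strength` and `toy_denoiser` are hypothetical names, and nothing here reflects how Sora applies the method internally.

```python
import random
from typing import Callable, List, Optional

Latent = List[float]
DenoiseStep = Callable[[Latent, float, float], Latent]


def sdedit(source: Latent, denoise_step: DenoiseStep, strength: float,
           num_steps: int = 10, sigma_max: float = 1.0,
           rng: Optional[random.Random] = None) -> Latent:
    """Noise `source` part of the way, then denoise from that point.

    strength = 0 returns the source unchanged; strength = 1 starts from the
    highest noise level. denoise_step(x, sigma, sigma_next) is expected to
    move a sample from noise level sigma down to sigma_next.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    rng = rng or random.Random()
    start = round(strength * num_steps)  # number of denoising steps to run
    if start == 0:
        return list(source)  # no noise added, nothing to undo
    levels = [sigma_max * i / num_steps for i in range(start, -1, -1)]
    # Perturb the input up to the starting noise level...
    x = [v + levels[0] * rng.gauss(0.0, 1.0) for v in source]
    # ...then walk the noise level back down to zero.
    for sigma, sigma_next in zip(levels, levels[1:]):
        x = denoise_step(x, sigma, sigma_next)
    return x


def toy_denoiser(x: Latent, sigma: float, sigma_next: float) -> Latent:
    """Stand-in for a prompted model: pulls every value toward 1.0."""
    return [v + 0.3 * (1.0 - v) for v in x]


source = [0.0, 0.2, 0.4]
for strength in (0.0, 0.3, 0.8):
    print(strength, sdedit(source, toy_denoiser, strength, rng=random.Random(0)))
```

The `strength` knob captures the trade-off in this family of methods: low values stay close to the input video, high values give the prompt more room to change it.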

Connecting videos

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

Image generation capabilities

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.
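
Reader's note, not part of the report: "a spatial grid with a temporal extent of one frame" can be pictured as the same noise-initialization routine used for video, called with a single time step. The function `noise_patches` and the sizes used here are assumptions for illustration only.

```python
import random
from typing import List

PatchGrid = List[List[List[List[float]]]]  # grid[t][row][col] -> patch values


def noise_patches(t_patches: int, rows: int, cols: int, patch_dim: int,
                  rng: random.Random) -> PatchGrid:
    """Grid of patches filled with independent standard Gaussian noise."""
    return [
        [
            [[rng.gauss(0.0, 1.0) for _ in range(patch_dim)] for _ in range(cols)]
            for _ in range(rows)
        ]
        for _ in range(t_patches)
    ]


rng = random.Random(0)
# An image is the special case with exactly one step along the time axis.
image_grid = noise_patches(t_patches=1, rows=4, cols=4, patch_dim=8, rng=rng)
video_grid = noise_patches(t_patches=6, rows=4, cols=4, patch_dim=8, rng=rng)
print(len(image_grid), len(image_grid[0]), len(image_grid[0][0]))
print(len(video_grid), len(video_grid[0]), len(video_grid[0][0]))
```

Changing `rows` and `cols` is what would select the output size in this picture, so images of different resolutions need no separate model.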

Close-up portrait shot of a woman in autumn, extreme detail, shallow depth of field

Vibrant coral reef teeming with colorful fish and sea creatures

Digital art of a young tiger under an apple tree in a matte painting style with gorgeous details

A snowy mountain village with cozy cabins and a northern lights display, high detail and photorealistic dslr, 50mm f/1.2

Emerging simulation capabilities

We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.

3D consistency

Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

Long-range coherence and object permanence

A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.

Interacting with the world

Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

Simulating digital worlds

Sora is also able to simulate artificial processes–one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.

Discussion

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model—such as incoherencies that develop in long duration samples or spontaneous appearances of objects—in our landing page.

We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.
