[Translation] Road to Sora: An Introduction to the Prior Research Behind OpenAI's Sora (feat. Oxen.AI)

(discuss.pytorch.kr)

Posted by ninebow on 2024-03-26

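Oxen.AI, which builds tooling for high-quality AI datasets, runs ArXiv Dives, a weekly Friday session for reading AI papers and sharing insights.
This post is a translation, shared with permission, of Road to Sora, an article covered in ArXiv Dives in early March.
Road to Sora starts from the technical report OpenAI published for its video generation model Sora and walks through the background needed to understand the model.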

"Road to Sora" Paper Reading List

by Greg Schoeninger, Mar 5, 2024

This post is an effort to put together a reading list for our Friday paper club, ArXiv Dives. Since no official paper has been released for Sora yet, the goal is to follow the bread crumbs from OpenAI's technical report on Sora. We plan to go over a few of the fundamental papers during our Friday paper club in the coming weeks, to help paint a better picture of what is going on behind the curtain of Sora.

What is Sora?

Sora has taken the generative AI space by storm with its ability to generate high-fidelity videos from natural language prompts. If you haven't seen an example yet, here's a generated video of a turtle swimming in a coral reef for your enjoyment.

While the team at OpenAI has not released an official research paper on the technical details of the model itself, they did release a technical report that covers some high-level details of the techniques they used, along with some qualitative results.

https://openai.com/research/video-generation-models-as-world-simulators

Sora Architecture Overview

After reading the papers below, the architecture here should start to make sense. The technical report is a 10,000-foot view, and my hope is that each paper will zoom in on a different aspect and that together they paint the full picture. There is a nice literature review called "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models" that gives a high-level diagram of a reverse-engineered architecture.

The team at OpenAI states that Sora is a "Diffusion Transformer", which combines many of the concepts from the papers listed below but is applied to latent spacetime patches generated from video.

This combines the style of patches used in the Vision Transformer (ViT) paper with latent spaces similar to those in the Latent Diffusion paper, put together in the style of the Diffusion Transformer. The patches span not only the width and height of the image but also extend along the time dimension of the video.

It's hard to say exactly how they collected the training data for all of this, but it seems to be a combination of the techniques in the DALL·E 3 paper and the use of GPT-4 to elaborate on textual descriptions of images, which they then turn into videos. Training data is likely the main secret sauce here, which is probably why it gets the least detail in the technical report.

Use Cases

There are many interesting use cases and applications for video generation technologies like Sora. Whether it be movies, education, gaming, healthcare, or robotics, there is no doubt that generating realistic videos from natural language prompts is going to shake up multiple industries.

The note at the bottom of this diagram rings true for us at Oxen.ai. If you are not familiar with Oxen.ai, we are building open source tools to help you collaborate on and evaluate the data that comes in and out of machine learning models. We believe that many people need visibility into this data, and that it should be a collaborative effort. AI is touching many different fields and industries, and the more eyes there are on the data that trains and evaluates these models, the better.

Check us out here: https://oxen.ai

Paper Reading List

There are many papers linked in the references section of the OpenAI technical report, but it is a bit hard to know which ones to read first or which are important background knowledge. We've sifted through them, selected the ones we think are the most impactful and interesting to read, and organized them by type.

Background Papers

The quality of generated images and video has been steadily increasing since 2015. The biggest gains that caught the general public's eye began in 2022 with Midjourney, Stable Diffusion, and DALL·E. This section contains some foundational papers and model architectures that are referenced over and over again in the literature. While not all of these papers are directly involved in the Sora architecture, they are all important context for how the state of the art has improved over time.

We have covered many of the papers below in previous ArXiv Dives. If you want to catch up, all of the write-ups are on the Oxen.ai blog.

https://www.oxen.ai/community/arxiv-dives

U-Net

"U-Net: Convolutional Networks for Biomedical Image Segmentation" is a great example of a paper written for a task in one domain (biomedical imaging) whose method ended up being applied across many different use cases. Most notably, it is the backbone of many diffusion models, such as Stable Diffusion, where it learns to predict and remove noise at each step. While it is not directly used in the Sora architecture, it is important background on the previous state of the art.

https://arxiv.org/abs/1505.04597

Language Transformers

"Attention Is All You Need" is another paper that proved itself on a single task, machine translation, but ended up being a seminal paper for all of natural language processing research. Transformers are now the backbone of many LLM applications such as ChatGPT. Transformers turned out to be extensible to many modalities and are used as a component of the Sora architecture.

https://arxiv.org/abs/1706.03762
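
To make the core operation concrete, here is a minimal NumPy sketch of the scaled dot-product attention described in "Attention Is All You Need". It is single-head and unmasked, and it leaves out the learned query/key/value projections. The function name and the toy shapes are our own illustration, not code from the paper.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d), k: (n_k, d), v: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                          # similarity of every query to every key
    scores = scores - scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v, weights

# Toy self-attention: 4 tokens with 8 features each attend to one another.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, weights.shape)
```

Each output row is a weighted average of the value rows, which is why the same operation works whether the tokens come from text, image patches, or video patches.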

Vision Transformer (ViT)

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" was one of the first papers to apply Transformers to image recognition, showing that they can outperform ResNets and other convolutional neural networks when trained on large enough datasets. It takes the architecture from the "Attention Is All You Need" paper and makes it work for computer vision tasks. Instead of text tokens, ViT takes 16x16 image patches as input.

https://arxiv.org/abs/2010.11929
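
The patch idea is easy to sketch. The NumPy snippet below cuts an image into non-overlapping 16x16 patches and flattens each one into a vector, which is the step that happens before ViT's learned linear projection. The function name and the 224x224 input size are assumptions chosen for illustration.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh * gw, patch * patch * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = image_to_patches(image)
print(tokens.shape)  # (196, 768)
```

A 224x224 RGB image becomes a sequence of 196 "words" of 768 numbers each, which a Transformer can then process much like a sentence.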

Latent Diffusion Models

"High-Resolution Image Synthesis with Latent Diffusion Models" describes the technique behind many image generation models, such as Stable Diffusion. The authors show how image generation can be reformulated as a sequence of denoising autoencoders operating on a latent representation. They use the U-Net architecture referenced above as the backbone of the generative process. These models can generate photorealistic images from a text prompt.

https://arxiv.org/abs/2112.10752
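
As a rough sketch of what "denoising in latent space" means, the snippet below applies the standard DDPM-style forward noising step to a stand-in latent and shows the target a denoiser would be trained to predict. The linear beta schedule, the latent shape, and all function names are assumptions for illustration; the paper's exact schedule and autoencoder are not reproduced here.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factor for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.normal(size=(4, 32, 32))   # stand-in for an autoencoder latent, not a real encoding
zt, eps = add_noise(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# The denoising network is trained to predict eps from (zt, t).
# With a perfect prediction, the clean latent can be recovered exactly:
z0_rec = (zt - np.sqrt(1.0 - alpha_bar[500]) * eps) / np.sqrt(alpha_bar[500])
```

Working on a small latent instead of full-resolution pixels is what makes the diffusion process cheap enough for high-resolution synthesis.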

CLIP

"Learning Transferable Visual Models From Natural Language Supervision", often referred to as Contrastive Language-Image Pre-training (CLIP), is a technique for embedding text data and image data into the same latent space. It helps connect the language-understanding half of a generative model to the visual-understanding half by making sure that the cosine similarity between the text and image representations is high for matching text-image pairs.

https://arxiv.org/abs/2103.00020
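
The contrastive objective can be sketched in a few lines of NumPy: normalize both sets of embeddings, compute all pairwise cosine similarities, and apply a symmetric cross-entropy that rewards the matching pairs on the diagonal. The random vectors below stand in for real encoder outputs, and the fixed temperature of 0.07 is an assumption (CLIP learns its temperature during training).

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matching pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (n, n) cosine similarities, scaled
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> correct text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> correct image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))                        # stand-in text embeddings
images = text + 0.01 * rng.normal(size=(4, 16))        # stand-in image embeddings close to their captions
aligned = clip_style_loss(images, text)
shuffled = clip_style_loss(images[::-1], text)         # deliberately mismatched pairs
print(aligned < shuffled)
```

The loss is low when each image sits closest to its own caption and high when the pairs are mismatched, which is the pressure that pulls the two modalities into one shared space.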

VQ-VAE

According to the technical report, they reduce the dimensionality of the raw video with a Vector Quantised Variational AutoEncoder (VQ-VAE). VAEs have been shown to be a powerful unsupervised pre-training method for learning latent representations.

https://arxiv.org/abs/1711.00937
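
The "vector quantised" part of VQ-VAE is a nearest-neighbor lookup: each continuous latent vector from the encoder is replaced by its closest entry in a learned codebook. The sketch below shows only that lookup, with a tiny hand-written codebook as an assumption; the encoder, decoder, and the training losses are left out.

```python
import numpy as np

def quantize(latents, codebook):
    """latents: (n, d), codebook: (k, d) -> nearest code indices (n,), quantized vectors (n, d)."""
    # Squared Euclidean distance from every latent to every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])   # toy codebook with 3 codes
latents = np.array([[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]])   # toy encoder outputs
indices, quantized = quantize(latents, codebook)
print(indices)  # [1 0 2]
```

After quantization, the input is represented by a short list of integer indices, which is a far more compact description than the raw pixels.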

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

The Sora technical report describes how they take in videos of any aspect ratio, and how this allows them to train on a much larger set of data. The more data they can feed the model without having to crop it, the better the results. This paper uses the same technique for images, and Sora extends it to video.

https://arxiv.org/abs/2307.06304
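
The packing idea can be sketched as follows: images of different resolutions produce token sequences of different lengths, several of which are placed into one fixed-length sequence, with an attention mask that keeps each token attending only to tokens from its own image. The greedy first-fit strategy, the helper names, and the toy lengths are assumptions for illustration and are not taken from the NaViT implementation.

```python
import numpy as np

def pack_sequences(lengths, max_len):
    """Greedy first-fit: put each example into the first pack that still has room."""
    packs = []  # each pack is a list of example indices
    free = []   # remaining token budget of each pack
    for i, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"example {i} has {n} tokens, more than max_len={max_len}")
        for p, room in enumerate(free):
            if n <= room:
                packs[p].append(i)
                free[p] -= n
                break
        else:  # no existing pack had room, so open a new one
            packs.append([i])
            free.append(max_len - n)
    return packs

def attention_mask(pack, lengths, max_len):
    """True where a query token may attend to a key token (same example only)."""
    ids = np.full(max_len, -1)  # -1 marks padding positions
    pos = 0
    for i in pack:
        ids[pos:pos + lengths[i]] = i
        pos += lengths[i]
    return (ids[:, None] == ids[None, :]) & (ids[:, None] >= 0)

lengths = [6, 3, 4, 2]   # tokens per image; varies with resolution and aspect ratio
packs = pack_sequences(lengths, max_len=8)
mask = attention_mask(packs[0], lengths, max_len=8)
print(packs)  # [[0, 3], [1, 2]]
```

Because nothing has to be cropped or resized to a fixed square, every input keeps its native aspect ratio while the batch still has a uniform shape.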

Video Generation Papers

Several video generation papers are cited as inspiration for Sora; they take the generative models above to the next level by applying them to video.

ViViT: A Video Vision Transformer

This paper goes into detail on how to chop a video into the "spatio-temporal tokens" needed for video tasks. The paper focuses on video classification, but the same tokenization can be applied to video generation.

https://arxiv.org/abs/2103.15691
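
One way to picture spatio-temporal tokens is as small "tubelets": blocks that span a few frames in time as well as a patch in space. The NumPy sketch below cuts a video into such blocks and flattens each one. The 2x16x16 tubelet size, the clip shape, and the function name are assumptions for illustration.

```python
import numpy as np

def video_to_tubelets(video, t=2, p=16):
    """Split a (T, H, W, C) video into flattened, non-overlapping t x p x p tubelets."""
    frames, h, w, c = video.shape
    if frames % t or h % p or w % p:
        raise ValueError("video dimensions must be divisible by the tubelet size")
    gt, gh, gw = frames // t, h // p, w // p
    # (gt, t, gh, p, gw, p, C) -> (gt, gh, gw, t, p, p, C) -> (gt * gh * gw, t * p * p * C)
    x = video.reshape(gt, t, gh, p, gw, p, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(gt * gh * gw, t * p * p * c)

video = np.zeros((16, 224, 224, 3), dtype=np.float32)   # 16 RGB frames of 224x224
tokens = video_to_tubelets(video)
print(tokens.shape)  # (1568, 1536)
```

This is the same reshaping trick as image patching, extended by one axis, so each token carries a little motion information along with appearance.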

Imagen Video: High Definition Video Generation with Diffusion Models

Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models. It uses convolutions in the temporal direction and super-resolution to generate high-quality videos from text.

https://arxiv.org/abs/2210.02303

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

This paper takes the latent diffusion models from the image generation papers above and introduces a temporal dimension to the latent space. The authors apply some interesting techniques in the temporal dimension by aligning the latent spaces, but the results do not quite have the temporal consistency of Sora yet.

https://arxiv.org/abs/2304.08818

Photorealistic video generation with diffusion models

The authors introduce W.A.L.T, a transformer-based approach to photorealistic video generation via diffusion modeling. As far as I can tell, this is the closest technique to Sora in the reference list. It was released in December 2023 by teams at Google, Stanford, and Georgia Tech.

https://arxiv.org/abs/2312.06662

Vision-Language Understanding

In order to generate videos from text prompts, they need to collect a large dataset. It is not feasible to have humans label that many videos, so it seems they use synthetic data techniques similar to those described in the DALL·E 3 paper.

DALL·E 3

Training text-to-video generation systems requires a large number of videos with corresponding text captions. They apply the re-captioning technique introduced in DALL·E 3 to videos. As in DALL·E 3, they also leverage GPT to turn short user prompts into longer, detailed captions that are sent to the video model.

https://openai.com/dall-e-3

LLaVA

In order for the model to follow user instructions, they likely did some instruction fine-tuning similar to that in the LLaVA paper. This paper also shows some synthetic data techniques for creating a large instruction dataset, which could be interesting in combination with the DALL·E methods above.

https://arxiv.org/abs/2304.08485

Make-A-Video & Tune-A-Video

Papers like Make-A-Video and Tune-A-Video have shown how prompt engineering leverages a model's natural language understanding to decode complex instructions and render them into cohesive, lively, high-quality video narratives. For example, a simple user prompt can be extended with adjectives and verbs to flesh out the scene more fully.

https://arxiv.org/abs/2209.14792

https://arxiv.org/abs/2212.11565

Conclusion

We hope this gives you a jumping-off point for all the important components that could make up a system like Sora! If you think we missed anything, feel free to email us at hello@oxen.ai.

This is by no means a light set of reading. That is why on Fridays we take one paper at a time, slow down, and break down the topics in plain speak so anyone can understand. We believe anyone can contribute to building AI systems, and the more you understand the fundamentals, the more patterns you will spot and the better the products you will build.

https://www.oxen.ai/community

Join us on a learning journey by signing up for ArXiv Dives or simply joining the Oxen.ai Discord community.

https://discord.com/invite/s3tBEn7Ptg

Original article

https://www.oxen.ai/blog/road-to-sora-reading-list

