18] 今週の主要ML論文（Top ML Papers of the Week）

(discuss.pytorch.kr)

2 ポイント投稿者 ninebow 2024-02-19 | まだコメントはありません。 | WhatsAppで共有

概要

DAIR.AI が毎週公開している ML 論文記事を自動翻訳してみました。
今週選ばれた論文は、自然言語処理、深層ニューラルネットワーク、強化学習分野の最新研究動向を反映しています。また、自然言語処理（NLP）関連技術に関する論文が今週注目を集めました。さらに、「World Model」「neural network trainability」という用語は、強化学習や深層ニューラルネットワークの理論的側面と関連しているようです。
最近の人工知能分野は、大規模言語モデルの発展に大きな注目を集めています。これは GPT-3 のようなモデルがさまざまな言語ベース作業で驚異的な性能を示したため、自然言語処理技術が理論研究と実用応用の双方で主要トピックとして浮上したことが主な理由です。大規模言語モデルは翻訳、要約、質問応答、生成的な文章作成など多様な NLP 作業に利用でき、これらのモデルの理解と改善に関する研究が活発です。
また、「neural network trainability」や「World Model」といった概念は、深層ニューラルネットワークをより効果的に学習させ、より複雑な環境をモデリングできる新しい技術研究を示唆しています。強化学習分野では、より洗練された環境モデルを通じてエージェントがより複雑な問題を解決できる能力を開発することに焦点を当てており、これも現代 AI 研究における重要な潮流です。
この文章は GPT モデルでまとめたものであり、誤りがある可能性があります。下の原文もあわせてご確認ください。読んでいる中で不自然または誤った内容を見つけた場合は、コメントでお知らせください。

OpenAIのSora

論文紹介

テキスト指示が与えられると、最大1分間のリアルで想像力豊かなシーンを動画として作成できるテキスト対動画 AI モデルで、複数のキャラクター、さまざまな動作タイプ、背景を含む複雑なシーンを生成し、互いの関係性を理解できます。キャラクターとビジュアルスタイルを一貫して維持しながら、単一動画内で複数のショットを生成するなどの機能も備えています。

A text-to-video ai model that can create videos of up to a minute of realistic and imaginative scenes given text instructions; it can generate complex scenes with multiple characters, different motion types, and backgrounds, and understand how they relate to each other; other capabilities include creating multiple shots within a single video with persistence across characters and visual style.

論文リンク

https://openai.com/research/…

さらに読む

https://discuss.pytorch.kr/t/gn-openai-sora-ai/3519

https://x.com/OpenAI/status/1758192957386342435

ジェミニ 1.5 / Gemini 1.5

論文紹介

長形式コンテンツの想起と推論などの能力に重点を置いた、計算効率の高いマルチモーダル混合エキスパートモデルです。数時間分の動画と音声を含む数百万トークンの長文書を推論でき、長文書 QA、長動画 QA、長コンテキスト ASR で最先端性能を向上させます。Gemini 1.5 Pro は、標準ベンチマークで Gemini 1.0 Ultra と同等またはそれ以上の性能を示し、他の長コンテキスト LLM と比較して、少なくとも 1000 万トークンまでほぼ完全な検索（>99%）を達成しました。

A compute-efficient multimodal mixture-of-experts model that focuses on capabilities such as recalling and reasoning over long-form content; it can reason over long documents potentially containing millions of tokens, including hours of video and audio; improves the state-of-the-art performance in long-document qa, long-video qa, and long-context asr. gemini 1.5 pro matches or outperforms gemini 1.0 ultra across standard benchmarks and achieves near-perfect retrieval (>99%) up to at least 10 million tokens, a significant advancement compared to other long-context llms.

論文リンク

https://storage.googleapis.com/deepmind-media/gemini/…

さらに読む

https://discuss.pytorch.kr/t/gn-gemini-1-5/3518

https://x.com/omarsar0/status/1758151923612483839

V-JEPA

論文紹介

200万本の動画を用いて特徴予測目的で学習された一連のビジョンモデルであり、自己教師あり学習に依存し、事前学習済み画像エンコーダ、テキスト、ネガティブサンプル、再構成、または他の教師信号を使用しません。モデルのパラメータを適応させることなく、動作ベースのタスクと外観ベースのタスクの両方で高い性能を示す汎用的な視覚表現を実現すると主張しています。

A collection of vision models trained on a feature prediction objective using 2 million videos; relies on self-supervised learning and doesn’t use pretrained image encoders, text, negative examples, reconstruction, or other supervision sources; claims to achieve versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model’s parameters.

論文抄録（Abstract）

この論文では、ビデオに対する教師なし学習の独立した目的として特徴予測を検討し、事前学習済み画像エンコーダ、テキスト、負例、再構成、または他の教師ありソースを使用せず、特徴予測目的のみで学習された視覚モデル群である V-JEPA を紹介します。モデルは公開データセットから収集した2,000,000本のビデオで学習され、ダウンストリームの画像およびビデオタスクで評価されました。結果として、ビデオ特徴を予測することで学習すると、モデルのパラメータを適応させることなく、モーションと外観ベースのタスクの両方で優れた性能を発揮する汎用的な視覚表現を得られることが示されています。例えば、ビデオのみで訓練された最大モデルの ViT-H/16 は、フリーズしたバックボーンを用いて、Kinetics-400で81.9%、Something-Something-v2で72.2%、ImageNet1Kで77.9%の精度を記録しました。

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model’s parameters; e.g., using a frozen backbone, our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

論文リンク

https://ai.meta.com/research/publications/…

もっと読む

https://ai.meta.com/blog/…

https://github.com/facebookresearch/jepa

https://x.com/AIatMeta/status/1758176023588577326

LWM（Large World Model）：リングアテンションを使用して100万長の動画と言語を扱うモデル / World Model on Million-Length Video And Language With RingAttention

論文紹介

リングアテンションを使用し、長尺動画と書籍で学習された汎用の1Mコンテキスト・マルチモーダルモデル。難易度の高い検索タスクと長尺動画理解で新たなベンチマークを確立し、長いシーケンスチャットで異なるシーケンス長、損失重み、モデル生成QAデータセットを混在させるためにマスク付きシーケンスパッキングを利用。1Mトークン以上の長文テキストと動画を処理できる7Bパラメータのモデル群をオープンソース化。

A general-purpose 1m context multimodal model trained on long videos and books using ringattention; sets new benchmarks in difficult retrieval tasks and long video understanding; uses masked sequence packing for mixing different sequence lengths, loss weighting, and model-generated qa dataset for long sequence chat; open-sources a family of 7b parameter models that can process long text and videos of over 1m tokens.

論文要約（Abstract）

現在の言語モデルは、言葉で簡単に説明できない世界の側面を理解する点が不足しており、複雑で長い形式のタスクには苦労しています。ビデオシーケンスは、言語や静止画像にはない貴重な時間情報を提供するため、言語との共同モデリングに適しており、魅力的です。この種のモデルは、人間のテキスト知識と物理的世界の理解の両方を発展させることで、人間を支援するより広範なAI機能を実現できる可能性があります。しかし、数百万件のビデオおよび言語シーケンスから学習することは、メモリ制約、計算の複雑さ、データセットの制約のため困難です。これらの課題に対処するため、様々なビデオと書籍で構成された大規模データセットをキュレーションし、長いシーケンスをスケーラブルに学習するためにリングアテンション技術を活用し、コンテキスト長を4Kから100万トークンへ段階的に拡大します。論文の主な貢献は次のとおりです。(a) 最も大きなコンテキストサイズのニューラルネット: 長いビデオと言語シーケンス向けに、最長コンテキストサイズを持つトランスフォーマーの1つを学習し、難解な検索タスクと長尺ビデオ理解で新たなベンチマークを設定。(b) 異なるシーケンス長を混合するためのマスク付きシーケンスパッキングの利用、言語と視覚のバランスを取るための損失重み、長いシーケンスチャット向けのモデル生成QAデータセットなど、視覚言語学習の課題を克服するための解決策。(c) 数百万の長さを持つマルチモーダルシーケンスの学習に向けて、リングアテンション、マスク付きシーケンスパッキング、およびその他の主要機能を用いた高度に最適化された実装。(d) 100万トークンを超える長文テキストドキュメント（LWM-Text, LWM-Text-Chat）と動画（LWM, LWM-Chat）を処理できる7Bパラメータモデル群を完全にオープンソース化しました。本研究は、長い動画と言語の大量データセットを用いた学習により、人間の知識とマルチモーダル世界への理解を発展させ、より広い能力を育成する道を拓きます。

Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including using masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and model-generated QA dataset for long sequence chat. (c) A highly-optimized implementation with RingAttention, masked sequence packing, and other key features for training on millions-length multimodal sequences. (d) Fully open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.

論文リンク

https://arxiv.org/abs/2402.08268

さらに読む

https://largeworldmodel.github.io/

https://huggingface.co/LargeWorldModel

https://x.com/haoliuhl/status/1757828392362389999

ニューラルネットワークの学習可能性の境界はフラクタルです / The boundary of neural network trainability is fractal

論文紹介

学習可能なニューラルネットワークのハイパーパラメータ構成と学習不可能なハイパーパラメータ構成の間の境界がフラクタルであることを見出し、すべてのニューラルネットワーク構成と深層線形ネットワークのフラクタルなハイパーパラメータランドスケープを観察し、最も性能が高いハイパーパラメータが安定性の終点にあることを観察します。

Finds that the boundary between trainable and untrainable neural network hyperparameter configurations is fractal; observes fractal hyperparameter landscapes for every neural network configuration and deep linear networks; also observes that the best-performing hyperparameters are at the end of stability.

論文要約(Abstract)

たとえば、マンデルブロ集合や二次ジュリア集合に関連する一部のフラクタルは、関数を反復して、結果の数列が発散するか有界なままであるかを識別し、その境界となるハイパーパラメータを特定することで計算されます。ニューラルネットワークの学習も同様に、更新関数を反復的に適用（例: 勾配降下の反復ステップ）し、収束あるいは発散挙動を示すことがあり、ハイパーパラメータの小さな変化に対して非常に敏感に反応します。これらの類似性に着目し、安定した学習と発散的な学習に至るニューラルネットワークのハイパーパラメータ間の境界を実験的に調べました。テストしたすべての構成で、この境界が十以上のオーダーにわたりフラクタル形状で存在することを発見しました。

Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations.

論文リンク

https://arxiv.org/abs/2402.06184

さらに読む

https://x.com/jaschasd/status/1756930242965606582

OS-Copilot: 自己改善を通じた汎用コンピュータエージェントへ / OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

論文紹介

LinuxやmacOSのようなオペレーティングシステムの主要要素とインターフェースする汎用コンピュータエージェントを構築するフレームワークとして、一般コンピュータタスクを自動化するために自己改善するエンボディードエージェントを提案します。さらに、このエージェントは、一般 AI アシスタント（GAIA）ベンチマークで従来手法より35%高い性能を示します。

a framework to build generalist computer agents that interface with key elements of an operating system like linux or macos; it also proposes a self-improving embodied agent for automating general computer tasks; this agent outperforms the previous methods by 35% on the general ai assistants (gaia) benchmark.

論文要約(Abstract)

コンピュータとの自律的な相互作用は長年の課題であり、近年の大規模言語モデル（LLM）の普及により、デジタルエージェント構築の進展は著しく加速しました。しかしこれらのエージェントの大半は、特定のソフトウェアやウェブサイトなど狭いドメインと相互作用するように設計されています。この狭い焦点は、一般的なコンピュータタスクへの適用性を制限します。これに対して、我々はオペレーティングシステム（OS）の包括的な要素、すなわちウェブ、コード端末、ファイル、マルチメディア、さまざまなサードパーティ製アプリケーションに対応可能な汎用エージェントを構築するフレームワークOS-Copilotを紹介します。OS-Copilotを用いて、一般的なコンピュータ作業を自動化する自己改善型エンボディードエージェントFRIDAYを作成しました。一般的なAIアシスタントベンチマークであるGAIAにおいて、FRIDAYは過去の方法より35%高い性能を示し、以前のタスクで蓄積されたスキルにより未観測のアプリケーションへも強い汎化能力を示しています。さらに、最小限の監督でExcelとPowerPointを制御し自己改善する方法をFRIDAYが学習したことについて、数値的・定量的な証拠も示します。OS-Copilotフレームワークと経験的結果は、より有能で汎用的なコンピュータエージェントに向けた今後の研究のためのインフラと示唆を提供します。

Autonomous interaction with the computer has been a longstanding challenge with great potential, and the recent proliferation of large language models (LLMs) has markedly accelerated progress in building digital agents. However, most of these agents are designed to interact with a narrow domain, such as a specific software or website. This narrow focus constrains their applicability for general computer tasks. To this end, we introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks. We also present numerical and quantitative evidence that FRIDAY learns to control and self-improve on Excel and Powerpoint with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.

論文リンク

https://arxiv.org/abs/2402.07456

さらに読む

https://x.com/omarsar0/status/1757443594976206885

TestGen-LLM: Metaで大規模言語モデルを用いた自動化された単体テスト改善 / Automated Unit Test Improvement using Large Language Models at Meta

論文紹介

InstagramのReelsとStoriesの評価後、TestGen-LLMのテストケースの75%が正しく構築され、57%が安定して通過し、カバレッジが25%向上したと報告されています。

Uses llms to automatically improve existing human-written tests; reports that after an evaluation on reels and stories products for instagram, 75% of testgen-llm's test cases were built correctly, 57% passed reliably, and 25% increased coverage.

論文要約(Abstract)

この論文では、LLM を用いて既存の人手で作成されたテストを自動的に改善する Meta の TestGen-LLM ツールについて説明します。TestGen-LLM は、生成されたテストクラスが元のテストスイートより測定可能な改善を保証する一連のフィルターを問題なく通過することを確認することで、LLM の幻覚に起因する問題を除去します。Instagram および Facebook プラットフォーム向けの Meta テストにおける TestGen-LLM の導入方法を説明します。Instagram の Reels および Stories 製品に対する評価では、TestGen-LLM のテストケースのうち 75% が正しくビルドされ、57% が安定して合格し、25% がカバレッジを増加させました。Meta の Instagram および Facebook の test-a-thon において、このソリューションは適用されたすべてのクラスのうち 11.5% を改善し、Meta のソフトウェアエンジニアが推奨した内容の 73% が本番デプロイのために採用されました。この報告は、このようなコード改善の担保がある LLM 生成コードの産業規模でのデプロイについての初の報告だと考えています。

This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers. We believe this is the first report on industrial scale deployment of LLM-generated code backed by such assurances of code improvement.

論文リンク

https://arxiv.org/abs/2402.09171

さらに読む

https://x.com/nathanbenaich/status/1758036247115608317

ChemLLM: 化学分野の大規模言語モデル / ChemLLM: A Chemical Large Language Model

論文紹介

名前変換、分子キャプション、反応予測などの主要タスクで GPT-3.5 より高い性能を示し、そのうち 2 つのタスクでは GPT-4 を上回ると主張する、化学関連タスク向けに学習された専用 LLM です。

A dedicated llm trained for chemistry-related tasks; claims to outperform gpt-3.5 on principal tasks such as name conversion, molecular caption, and reaction prediction; it also surpasses gpt-4 on two of these tasks.

論文要約(Abstract)

大規模言語モデル（LLM）は、分子特性予測、分子生成、実験プロトコル設計など、化学分野で目覚ましい進歩を遂げています。しかし、コミュニティには化学向けに特別に設計された対話型モデルが不足しています。この課題は、ほとんどの化学データと科学知識が主に構造化データベースに保存されているため、これらの構造化データを直接使用すると、モデルの一貫した対話を維持する能力が低下するという事実に起因します。これを解決するため、構造化された知識を一般対話に変換して言語モデル学習に適した新しいテンプレートベースの命令構成手法を開発しました。このアプローチを活用し、化学分野全般のさまざまなタスクをスムーズな対話インタラクションで実行できる、初の化学専用大規模言語モデルであるChemLLMを開発しました。ChemLLMは、名前変換、分子キャプション、反応予測という化学の主要3タスクすべてでGPT-3.5を上回り、うち2つのタスクでGPT-4をも上回ります。驚くべきことに、ChemLLMは化学中心のコーパスで主に学習されているにもかかわらず、関連する数学的および物理的タスクへの優れた適応能力を示します。さらに、ChemLLMは文献翻訳やケモインフォーマティック・プログラミングなど、化学分野の専門的なNLPタスクにも熟達しています。ChemLLMは化学研究に新しい探求の道を開き、構造化された化学知識を対話システムに統合するこの手法は、様々な科学分野でLLMを開発する新たな地平を切り拓きます。コード、データセット、およびモデル重みはhf.co/AI4Chem/ChemLLM-7B-Chatで公開アクセスできます。

Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model's ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model weights are publicly accessible at hf.co/AI4Chem/ChemLLM-7B-Chat.

論文リンク

https://arxiv.org/abs/2402.06852

参考リンク

https://hf.co/AI4Chem/ChemLLM-7B-Chat

https://x.com/omarsar0/status/1757246740539773165

大規模言語モデル: サーベイ論文 / Large Language Models: A Survey

論文紹介

GPT、Llama、PaLMの3つの人気LLMファミリーとその特徴、貢献、制約を検討し、LLMを構築・拡張するために開発された機能および技術を要約するとともに、LLMの学習、ファインチューニング、評価で広く使用されるデータセットとLLM評価指標についても論じ、未解決の課題と今後の研究方向を提示して締めくくります。

Reviews three popular families of llms (gpt, llama, palm), their characteristics, contributions, and limitations; includes a summary of capabilities and techniques developed to build and augment llm; it also discusses popular datasets for llm training, fine-tuning, and evaluation, and llm evaluation metrics; concludes with open challenges and future research directions.

論文要旨（Abstract）

大規模言語モデル（LLM）は、2022年11月にChatGPTが公開されて以来、幅広い自然言語タスクでの高い性能により多大な注目を集めています。LLM の汎用的な言語理解および生成能力は、膨大なテキストデータで数十億のモデルパラメータを学習することで獲得され、これはスケーリング法則 \cite{kaplan2020scaling,hoffmann2022training} によって予測されます。LLM の研究分野は誕生から日が浅いものの、様々な方向で急速に発展しています。本論文では、広く使われている3つのLLMファミリー（GPT、LLaMA、PaLM）を含む、最も注目されるLLMをレビューし、その特徴、貢献、および制約について論じます。また、LLMの構築と拡張のために開発された技術の概要も示します。次に、LLMの学習、ファインチューニング、評価のために準備された人気のあるデータセットを調査し、広く使われているLLM評価指標をレビューし、代表的なベンチマークセットでいくつかの主要LLMの性能を比較します。最後に、未解決の課題と将来の研究方向について議論し、論文を締めくくります。

Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws \cite{kaplan2020scaling,hoffmann2022training}. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.

論文リンク

https://arxiv.org/abs/2402.06196

さらに読む

https://x.com/omarsar0/status/1757049645119799804

LLMエージェントがウェブサイトを自律的にハッキングできます / LLM Agents can Autonomously Hack Websites

論文紹介

人間のフィードバックや脆弱性に関する明示的な事前知識なしに、ウェブサイトを自動的にハッキングし、SQLインジェクションのようなタスクを実行できることを示しています。これはLLMのツール使用と長いコンテキスト機能によって可能となり、GPT-4が実環境でウェブサイトの脆弱性を見つけるなどのハッキングが可能であることが示されていますが、オープンソースモデルでは同等の機能が示されませんでした。

Shows that llm agents can automatically hack websites and perform tasks like sql injections without human feedback or explicit knowledge about the vulnerability beforehand; this is enabled by an llm’s tool usage and long context capabilities; shows that gpt-4 is capable of such hacks, including finding vulnerabilities in websites in the wild; open-source models did not show the same capabilities.

論文要旨（Abstract）

近年、LLMの能力は向上を続けており、ツールを使って相互作用できる（すなわち、関数呼び出し）、ドキュメントを読み、自らを再帰的に呼び出すことが可能になっています。その結果、これらのLLMはエージェントとして自律的に機能できるようになりました。これらのエージェントの能力向上に伴い、最近の研究ではLLMエージェントがサイバーセキュリティへ与える影響が推測されています。しかし、LLMエージェントの攻撃能力についてはよく知られていません。この研究では、LLMエージェントが人間のフィードバックなしでウェブサイトを自律的にハッキングし、盲目的なデータベーススキーマ抽出やSQLインジェクションなどの複雑な作業を実行できることを示します。重要な点は、エージェントが脆弱性を事前に知る必要がないことです。この能力は、拡張されたコンテキストを活用し、ツールの使用に高度に長けたフロンティアモデルでのみ有効化されます。つまり、GPT-4はこのようなハッキングが可能である一方、既存のオープンソースモデルはそうではありません。最後に、GPT-4が実環境でウェブサイトの脆弱性を自律的に発見できることを示します。我々の発見は、LLMの広範な展開について疑問を提起します。

In recent years, large language models (LLMs) have become increasingly capable and can now interact with tools (i.e., call functions), read documents, and recursively call themselves. As a result, these LLMs can now function autonomously as agents. With the rise in capabilities of these agents, recent work has speculated on how LLM agents would affect cybersecurity. However, not much is known about the offensive capabilities of LLM agents. In this work, we show that LLM agents can autonomously hack websites, performing tasks as complex as blind database schema extraction and SQL injections without human feedback. Importantly, the agent does not need to know the vulnerability beforehand. This capability is uniquely enabled by frontier models that are highly capable of tool use and leveraging extended context. Namely, we show that GPT-4 is capable of such hacks, but existing open-source models are not. Finally, we show that GPT-4 is capable of autonomously finding vulnerabilities in websites in the wild. Our findings raise questions about the widespread deployment of LLMs.

⚠️広告⚠️: PyTorch韓国ユーザーコミュニティがまとめたこの記事は役に立ちましたか？会員登録すると、主要な記事をメールでお届けします！ (デフォルトはWeeklyですが、Dailyへ変更も可能です。)

[2024/02/12 ~ 02/18] 今週の主要ML論文（Top ML Papers of the Week）

概要

OpenAIのSora

論文紹介

論文リンク

さらに読む

ジェミニ 1.5 / Gemini 1.5

論文紹介

論文リンク

さらに読む

V-JEPA

論文紹介

論文抄録（Abstract）

論文リンク

もっと読む

LWM（Large World Model）：リングアテンションを使用して100万長の動画と言語を扱うモデル / World Model on Million-Length Video And Language With RingAttention

論文紹介

論文要約（Abstract）

論文リンク

さらに読む

ニューラルネットワークの学習可能性の境界はフラクタルです / The boundary of neural network trainability is fractal

論文紹介

論文要約(Abstract)

論文リンク

さらに読む

OS-Copilot: 自己改善を通じた汎用コンピュータエージェントへ / OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

論文紹介

論文要約(Abstract)

論文リンク

さらに読む

TestGen-LLM: Metaで大規模言語モデルを用いた自動化された単体テスト改善 / Automated Unit Test Improvement using Large Language Models at Meta

論文紹介

論文要約(Abstract)

論文リンク

さらに読む

ChemLLM: 化学分野の大規模言語モデル / ChemLLM: A Chemical Large Language Model

論文紹介

論文要約(Abstract)

論文リンク

参考リンク

大規模言語モデル: サーベイ論文 / Large Language Models: A Survey

論文紹介

論文要旨（Abstract）

論文リンク

さらに読む

LLMエージェントがウェブサイトを自律的にハッキングできます / LLM Agents can Autonomously Hack Websites

論文紹介

論文要旨（Abstract）

論文リンク

さらに読む

原文

関連記事

まだコメントはありません。