22] Top ML Papers of the Week

(discuss.pytorch.kr)

5 points | posted by ninebow on 2024-09-23 | 3 comments

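This is a machine translation of the weekly ML paper roundup published by DAIR.AI.
This week's selection shows two clear trends. First, research on large language models (LLMs) dominates. Papers such as "Training LLMs to Self-Correct via RL", "Qwen2.5 Coder", and "A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs" address improving or applying LLMs, which indicates that LLMs remain one of the central themes of current AI research.
Second, several papers study how models reason. "Diagram of Thought (DoT)", "Iteration of Thought", and "To CoT or not to CoT?" examine reasoning styles and reasoning processes in depth, with the aim of improving the accuracy and efficiency of AI systems.
Several factors may explain these trends. LLMs attract strong interest from both industry and academia because of their broad applicability and high performance, and methods for self-correction and performance improvement are an especially active area. Research on reasoning processes is also tied to the long-term goal of building AI with human-like reasoning ability, which is widely regarded as essential for automating more complex, knowledge-intensive work.
In short, the main trends this week are improving LLM performance and studying how models reason, which gives a good picture of where current AI research is heading.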

Moshi

Paper overview

Introduces a speech-text foundation model and full-duplex spoken dialogue framework, and presents several components of the system: Helium, a 7B-parameter text LLM; Mimi, a semantic-acoustic neural audio codec with state-of-the-art audio quality; and a hierarchical multi-stream architecture that can generate arbitrary conversation in a speech-to-speech manner.

Abstract

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning— such as emotion or non-speech sounds— is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this “Inner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.
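As a rough illustration of the parallel-stream idea in the abstract, the sketch below lays out one model time step as a time-aligned text token followed by audio codec tokens for Moshi's own speech and for the user's speech. The codebook count (`NUM_CODEBOOKS = 8`), the token values, and every name here are hypothetical placeholders chosen for illustration; none of them are taken from the Moshi code base.

```python
from dataclasses import dataclass
from typing import List

NUM_CODEBOOKS = 8  # assumption for illustration only, not a value from the abstract


@dataclass
class Frame:
    """One time step of a full-duplex conversation (hypothetical layout)."""
    text_token: int         # "Inner Monologue" text token, predicted as a prefix
    moshi_audio: List[int]  # codec tokens for the model's own speech
    user_audio: List[int]   # codec tokens for the user's speech


def flatten(frame: Frame) -> List[int]:
    """Order the tokens of one frame: text prefix first, then both audio streams."""
    if len(frame.moshi_audio) != NUM_CODEBOOKS or len(frame.user_audio) != NUM_CODEBOOKS:
        raise ValueError("each audio stream needs NUM_CODEBOOKS tokens per frame")
    return [frame.text_token, *frame.moshi_audio, *frame.user_audio]


frame = Frame(text_token=42,
              moshi_audio=list(range(8)),
              user_audio=list(range(100, 108)))
tokens = flatten(frame)
print(len(tokens))  # 1 text token + 8 + 8 audio tokens
```

Because both speakers' streams are present at every step, nothing in this layout marks whose turn it is, which is the property the abstract describes as removing explicit speaker turns.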

Paper link

https://kyutai.org/Moshi.pdf

Further reading

https://github.com/kyutai-labs/moshi

https://x.com/kyutai_labs/status/1836427396959932492

Training Language Models to Self-Correct via Reinforcement Learning

Paper overview

Develops a multi-turn online reinforcement learning approach to improve an LLM's ability to self-correct, based entirely on self-generated data. SFT is shown to be ineffective at learning self-correction and suffers from a distribution mismatch between training data and model responses. The paper proposes a two-stage approach that first optimizes correction behavior and then uses a reward bonus to amplify self-correction during training. Applied to Gemini 1.0 Pro and 1.5 Flash models, it achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% on the MATH and HumanEval benchmarks, respectively.

Abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
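The reward bonus mentioned in the abstract can be pictured as a shaped reward that favors improvement between two attempts. The sketch below is one possible form, assuming binary correctness rewards and a hypothetical bonus weight `alpha`; the abstract does not state the exact formula, so treat the function and its default value as illustrative assumptions.

```python
def shaped_reward(first_correct: bool, second_correct: bool, alpha: float = 2.0) -> float:
    """Reward for the second attempt plus a bonus for progress over the first.

    Assumed form (not from the abstract): r2 + alpha * (r2 - r1).
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    return r2 + alpha * (r2 - r1)


for first, second in [(False, True), (True, True), (True, False), (False, False)]:
    print(first, second, shaped_reward(first, second))
```

Under this assumed shaping, fixing a wrong answer scores higher than repeating a correct one, and breaking a correct answer is penalized, which discourages the policy from collapsing into "never change the first answer".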

Paper link

https://arxiv.org/abs/2409.12917

Further reading

https://x.com/omarsar0/status/1837228446839361984

Qwen2.5-Coder Technical Report

Paper overview

A series of code models with 1.5B and 7B parameters, built on the Qwen2.5 architecture and further pretrained on over 5.5 trillion tokens. It achieves state-of-the-art performance across more than 10 benchmarks, with strong capabilities in code generation, completion, reasoning, and repair.

Abstract

In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility. The model has been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will not only push the boundaries of research in code intelligence but also, through its permissive licensing, encourage broader adoption by developers in real-world applications.

Paper link

https://arxiv.org/abs/2409.12186

Further reading

https://x.com/huybery/status/1837170643563073960

On the Diagram of Thought

Paper overview

Enhances the reasoning capabilities of LLMs through mathematical rigor. DoT models iterative reasoning in an LLM as the construction of a directed acyclic graph (DAG), integrating propositions, critiques, refinements, and verifications into a unified DAG structure. This lets DoT capture complex logical deduction beyond linear or tree-based approaches.

Abstract

We introduce Diagram of Thought (DoT), a framework that models iterative reasoning in large language models (LLMs) as the construction of a directed acyclic graph (DAG) within a single model. Unlike traditional approaches that represent reasoning as linear chains or trees, DoT organizes propositions, critiques, refinements, and verifications into a cohesive DAG structure, allowing the model to explore complex reasoning pathways while maintaining logical consistency. Each node in the diagram corresponds to a proposition that has been proposed, critiqued, refined, or verified, enabling the LLM to iteratively improve its reasoning through natural language feedback. By leveraging auto-regressive next-token prediction with role-specific tokens, DoT facilitates seamless transitions between proposing ideas and critically evaluating them, providing richer feedback than binary signals. Furthermore, we formalize the DoT framework using Topos Theory, providing a mathematical foundation that ensures logical consistency and soundness in the reasoning process. This approach enhances both the training and inference processes within a single LLM, eliminating the need for multiple models or external control mechanisms. DoT offers a conceptual framework for designing next-generation reasoning-specialized models, emphasizing training efficiency, robust reasoning capabilities, and theoretical grounding. The code is available at https://github.com/diagram-of-thought/diagram-of-thought.
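To make the DAG view of reasoning concrete, here is a small sketch that stores role-tagged reasoning steps as nodes and uses the standard-library `graphlib` module to check that they form a valid DAG and to read them back in dependency order. The node names, role labels, and example statements are invented for illustration and do not come from the DoT repository.

```python
from graphlib import TopologicalSorter

# Each node: (role, statement). Roles mirror the abstract's node types.
nodes = {
    "p1": ("proposition", "x^2 = 4 implies x = 2"),
    "c1": ("critique", "x = -2 also satisfies x^2 = 4"),
    "p2": ("refinement", "x^2 = 4 implies x = 2 or x = -2"),
    "v1": ("verification", "both values satisfy the equation"),
}

# Each node maps to the set of nodes it builds on (its predecessors).
depends_on = {
    "p1": set(),
    "c1": {"p1"},
    "p2": {"p1", "c1"},
    "v1": {"p2"},
}

# static_order() raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(depends_on).static_order())
for name in order:
    role, statement = nodes[name]
    print(f"{name} [{role}]: {statement}")
```

Note that `p2` depends on both the original proposition and its critique, which a linear chain or a tree with single parents could not express directly.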

Paper link

https://arxiv.org/abs/2409.10038

Further reading

https://github.com/diagram-of-thought/diagram-of-thought

https://x.com/omarsar0/status/1835882277563179512

Agents in Software Engineering: Survey, Landscape, and Vision

Paper overview

Provides a comprehensive overview of frameworks for LLM-based agents in software engineering.

Abstract

In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the tasks of the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents either explicitly or implicitly. However, there is a lack of an in-depth survey to sort out the development context of existing works, analyze how existing works combine the LLM-based agent technologies to optimize various tasks, and clarify the framework of LLM-based agents in SE. In this paper, we conduct the first survey of the studies on combining LLM-based agents with SE and present a framework of LLM-based agents in SE which includes three key modules: perception, memory, and action. We also summarize the current challenges in combining the two fields and propose future opportunities in response to existing challenges. We maintain a GitHub repository of the related papers at: https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE.

Paper link

https://arxiv.org/abs/2409.09030

Further reading

https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE

https://x.com/omarsar0/status/1835705359723319702

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Paper overview

Investigates which kinds of tasks benefit most from chain-of-thought (CoT) prompting. After a meta-analysis of 100+ papers and several evaluations, it finds that CoT yields strong performance gains primarily on tasks involving math and logic. Most of the CoT gain comes from improved symbolic execution, yet a symbolic solver still outperforms it.

Abstract

Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
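The suggestion that CoT can be applied selectively could be sketched as a simple prompt router. The version below uses the presence of an equals sign in the question as the trigger, loosely inspired by the MMLU observation in the abstract. This is a deliberately crude heuristic of our own: the paper reports a correlation over questions and model responses, not a routing rule, and the function names and prompt wording are hypothetical.

```python
def should_use_cot(question: str) -> bool:
    """Crude illustrative heuristic: treat an equals sign as a hint of symbolic work."""
    return "=" in question


def build_prompt(question: str) -> str:
    """Append a CoT cue only when the heuristic fires; otherwise ask for a direct answer."""
    if should_use_cot(question):
        return f"{question}\nLet's think step by step."
    return f"{question}\nAnswer directly."


print(build_prompt("If 3x + 2 = 11, what is x?"))
print(build_prompt("Which country is Kyoto in?"))
```

Skipping the step-by-step cue on the second kind of question is where the inference-cost saving would come from, since the model is not asked to generate intermediate reasoning tokens.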

Paper link

https://arxiv.org/abs/2409.12183

Further reading

https://x.com/omarsar0/status/1836599280477299013

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

Paper overview

Evaluates the performance of instruction-tuned LLMs across various quantization methods on models ranging from 7B to 405B. The key findings are: 1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks; 2) performance varies significantly with quantization method, model size, and bit-width, with weight-only methods often yielding better results in larger models; and 3) task difficulty does not significantly affect the accuracy degradation caused by quantization.

Abstract

Prior research works have evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks and old datasets. Additionally, recent large-scale models such as Llama 3.1 with up to 405B have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
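Finding (1) compares a quantized larger model with a smaller FP16 model of similar size. A back-of-the-envelope way to see when two configurations are "similar in size" is to estimate weight storage as parameters times bits per weight. The model sizes below are hypothetical examples, and the estimate ignores quantization metadata (scales, zero points), activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


small_fp16 = weight_memory_gb(7e9, 16)   # hypothetical 7B model kept in FP16
large_8bit = weight_memory_gb(13e9, 8)   # hypothetical 13B model quantized to 8 bits

print(f"7B @ FP16:   {small_fp16:.1f} GB")
print(f"13B @ 8-bit: {large_8bit:.1f} GB")
```

In this example the two footprints land within about 1 GB of each other, which is the kind of matched-size comparison the finding refers to.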

Paper link

https://arxiv.org/abs/2409.11055

Further reading

https://x.com/omarsar0/status/1836479309390995790

Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning

Paper overview

Proposes the Iteration of Thought (IoT) framework to enhance LLM responses and reasoning capabilities with adaptive reasoning paths. It uses an inner dialogue agent, acting as a guide, to adjust reasoning paths dynamically, which allows adaptive cross-path exploration and improves response accuracy. It differs from CoT and ToT (both rigid processes) in that its prompt generation is a dynamic process that can adapt.

Abstract

Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well-structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses. Motivated by this insight, we propose the Iteration of Thought (IoT) framework for enhancing LLM responses by generating "thought"-provoking prompts vis a vis an input query and the current iteration of an LLM's response. Unlike static or semi-static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically, based on evolving context, and without generating alternate explorative thoughts which are ultimately discarded. The three components of the IoT framework are (1) an Inner Dialogue Agent (IDA) responsible for generating instructive, context-specific prompts; (2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and (3) an iterative prompting loop that implements a conversation between the former two components. We introduce two variants of our framework: Autonomous Iteration of Thought (AIoT), where an LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number of iterations. We investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset. Our results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.
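The three components named in the abstract (IDA, LLMA, and the prompting loop) can be sketched as a small control loop. In the sketch below both agents are plain callables so the loop runs without any model. The function signatures, the stop signal returned by the IDA, and the toy agents are all assumptions made for illustration; the abstract only says that an LLM decides when to stop in AIoT and that GIoT forces a fixed number of iterations.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: the IDA returns (guiding prompt, stop flag);
# the LLMA returns a response given the query, a prompt, and its previous response.
IDA = Callable[[str, str], Tuple[str, bool]]
LLMA = Callable[[str, Optional[str], Optional[str]], str]


def run_iot(query: str, ida: IDA, llma: LLMA,
            max_iters: int = 5, guided: bool = False) -> Tuple[str, int]:
    """Iterate IDA -> LLMA. guided=True ignores the stop flag (GIoT-like);
    guided=False stops as soon as the stop flag is set (AIoT-like)."""
    response = llma(query, None, None)  # initial answer without guidance
    iterations = 0
    for _ in range(max_iters):
        prompt, stop = ida(query, response)
        if stop and not guided:
            break
        response = llma(query, prompt, response)
        iterations += 1
    return response, iterations


# Toy stand-ins so the loop is runnable without a model.
def toy_llma(query: str, prompt: Optional[str], previous: Optional[str]) -> str:
    return query if previous is None else previous + "!"


def toy_ida(query: str, response: str) -> Tuple[str, bool]:
    return "be more emphatic", response.endswith("!!")


print(run_iot("hi", toy_ida, toy_llma))                            # AIoT-like run
print(run_iot("hi", toy_ida, toy_llma, max_iters=3, guided=True))  # GIoT-like run
```

Unlike a tree search, the loop keeps a single response and refines it in place, so no alternative branches are generated and later discarded.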

Paper link

https://arxiv.org/abs/2409.12618

Further reading

https://x.com/omarsar0/status/1836977595847692671

Schrodinger's Memory: Large Language Models

Paper overview

Uses the Universal Approximation Theorem (UAT) to explain the memory mechanism of LLMs, and proposes a new way to evaluate LLM performance by comparing the memory capacities of different models. The Transformer architecture functions as a dynamic fitting UAT model with a strong ability to adaptively fit inputs, which enables LLMs to recall entire content from minimal input information.

Abstract

Memory is the foundation of all human activities; without memory, it would be nearly impossible for people to perform any task in daily life. With the development of Large Language Models (LLMs), their language capabilities are becoming increasingly comparable to those of humans. But do LLMs have memory? Based on current performance, LLMs do appear to exhibit memory. So, what is the underlying mechanism of this memory? Previous research has lacked a deep exploration of LLMs' memory capabilities and the underlying theory. In this paper, we use the Universal Approximation Theorem (UAT) to explain the memory mechanism in LLMs. We also conduct experiments to verify the memory capabilities of various LLMs, proposing a new method to assess their abilities based on these memory abilities. We argue that LLM memory operates like Schrödinger's memory, meaning that it only becomes observable when a specific memory is queried. We can only determine if the model retains a memory based on its output in response to the query; otherwise, it remains indeterminate. Finally, we expand on this concept by comparing the memory capabilities of the human brain and LLMs, highlighting the similarities and differences in their operational mechanisms.

Paper link

https://arxiv.org/abs/2409.10482

Further reading

https://x.com/omarsar0/status/1835882330323554321

Jailbreaking Large Language Models with Symbolic Mathematics

Paper overview

Uses GPT-4o to generate mathematically encoded prompts that serve as an effective jailbreaking technique, with an average attack success rate of 73.6% across 13 state-of-the-art LLMs. This highlights the inability of existing safety training mechanisms to generalize to mathematically encoded inputs.

Abstract

Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack's success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.

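This post was compiled with a GPT model and may contain errors, so please also refer to the original article linked at the bottom of the post. If you notice anything unnatural or incorrect while reading, please let us know in the comments. 🤗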


3 comments

savvykang 2024-09-23

The title says June, but the linked post is from September. Was this caused by autocomplete?

ninebow 2024-09-23

Oh, you're right;;; Thank you for letting me know. T_T
The title should have been "[2024/09/16 ~ 09/22] Top ML Papers of the Week", but I made a mistake while using a template. xguru, if you see this, please change it. 🙇‍♂️

ninebow 2024-09-23

Thank you!!

[2024/09/16 ~ 09/22] Top ML Papers of the Week
