[2024/03/11 ~ 03/17] Top ML Papers of the Week

(discuss.pytorch.kr)

4 points | posted by ninebow on 2024-03-19 | 6 comments

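I tried machine-translating the weekly ML papers post published by DAIR.AI.
Large language models (LLMs) were the dominant trend in this week's papers. Several papers focus on LLMs, either applying them to new problems or trying to understand their behavior. For example, "SIMA", "Retrieval Augmented Thoughts", "LMs Can Teach Themselves to Think Before Speaking", "Knowledge Conflicts for LLMs", and "LLMs Predict Neuroscience Results" use large language models or address questions about their performance. "Stealing Part of a Production Language Model", meanwhile, studies language models from a security perspective.
This trend appears to reflect the transformative changes that large language models have brought to the AI research community over the past few years, and their influence. LLMs have established themselves as foundation models that are useful not only in natural language processing (NLP) but across many domains. They perform well on a wide range of language understanding and generation tasks and are being explored extensively in applied research. In addition, papers such as "Multimodal LLM Pre-training" show a recent research direction in which LLMs are combined with other forms of data, such as images and audio, to strengthen multimodal learning.
Based on this analysis, research on LLMs is expected to keep improving natural language understanding, extend into new application areas, and play an important role in the development of AI technology. Beyond raw performance, a broad range of topics will likely be explored, from applied research to security and ethical issues.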

SIMA

Paper introduction

A generalist AI agent for 3D virtual environments that follows natural-language instructions in a broad range of 3D virtual environments and video games. SIMA is evaluated across 600 basic skills, spanning navigation, object interaction, and menu use. Language seems to be a huge factor in performance.

Abstract

Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
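To make the interface described in the abstract a bit more concrete, here is a minimal sketch of an agent that takes an image observation plus a language instruction and returns a keyboard-and-mouse action. All class and field names (`Observation`, `Action`, `Agent`, `ForwardAgent`) are hypothetical and are not taken from the SIMA paper; the toy policy only illustrates the shape of the inputs and outputs.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class Observation:
    image: List[List[int]]  # stand-in for screen pixels (hypothetical format)
    instruction: str        # free-form language instruction


@dataclass
class Action:
    keys: List[str] = field(default_factory=list)  # keys pressed this step
    mouse_dx: int = 0
    mouse_dy: int = 0
    click: bool = False


class Agent(Protocol):
    def act(self, obs: Observation) -> Action: ...


class ForwardAgent:
    """Toy agent: presses 'w' when the instruction mentions 'forward'."""

    def act(self, obs: Observation) -> Action:
        if "forward" in obs.instruction.lower():
            return Action(keys=["w"])
        return Action()  # otherwise do nothing


agent: Agent = ForwardAgent()
action = agent.act(Observation(image=[[0]], instruction="Go forward"))
print(action)
```

Because the agent only sees pixels and text and only emits key and mouse events, the same loop could in principle be pointed at a different game without a game-specific API, which is the point the abstract makes about running agents in new environments.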

Paper link

https://storage.googleapis.com/deepmind-media/DeepMind.com/…

Further reading

https://discuss.pytorch.kr/t/gn-google-sima-3d-ai/3764

https://x.com/GoogleDeepMind/status/1767918515585994818

RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation

Paper introduction

Shows that iteratively revising a chain of thoughts with information retrieval can significantly improve LLM reasoning and generation in long-horizon generation tasks. The key idea is that each thought step is revised with retrieved information relevant to the task query and to the current and past thought steps. Retrieval Augmented Thoughts (RAT) can be applied to different models, such as GPT-4 and CodeLLaMA-7b, to improve long-horizon generation tasks (e.g., creative writing and embodied task planning). RAT is a zero-shot prompting approach and provides significant improvements over baselines that include zero-shot CoT prompting and vanilla RAG.

Abstract

We explore how iterative revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while hugely mitigating hallucination. In particular, the proposed method -- retrieval-augmented thoughts (RAT) -- revises each thought step one by one with retrieved information relevant to the task query, the current and the past thought steps, after the initial zero-shot CoT is generated. Applying RAT to GPT-3.5, GPT-4, and CodeLLaMA-7b substantially improves their performances on various long-horizon generation tasks; on average of relatively increasing rating scores by 13.63% on code generation, 16.96% on mathematical reasoning, 19.2% on creative writing, and 42.78% on embodied task planning. The demo page can be found at https://craftjarvis.github.io/RAT
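The revision loop described above can be sketched roughly as follows. This is only an illustration under assumptions: `draft_thoughts`, `retrieve`, and `revise` are hypothetical callables standing in for an LLM and a retriever, and the toy stubs below do not call any real model or search API.

```python
from typing import Callable, List


def rat(
    task: str,
    draft_thoughts: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    revise: Callable[[str, str, List[str]], str],
) -> List[str]:
    """Revise an initial zero-shot chain of thought one step at a time."""
    thoughts = draft_thoughts(task)  # initial zero-shot CoT draft
    revised: List[str] = []
    for step in thoughts:
        # The retrieval query combines the task, the already revised past
        # steps, and the current step.
        query = "\n".join([task, *revised, step])
        docs = retrieve(query)
        revised.append(revise(query, step, docs))
    return revised


# Toy stand-ins (hypothetical): a real setup would call an LLM and a retriever.
def toy_draft(task: str) -> List[str]:
    return ["step A", "step B"]


def toy_retrieve(query: str) -> List[str]:
    return ["doc about " + query.splitlines()[-1]]


def toy_revise(query: str, step: str, docs: List[str]) -> str:
    return f"{step} (revised using {len(docs)} doc)"


result = rat("write a sorting function", toy_draft, toy_retrieve, toy_revise)
print(result)
```

Note that each later step is revised with the earlier revised steps in its query, so a correction made early on can influence what is retrieved for the steps that follow.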

Paper link

https://arxiv.org/abs/2403.05313

Further reading

https://x.com/omarsar0/status/1767251740443746435

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

Paper introduction

Presents Quiet-STaR, a generalization of STaR that enables language models (LMs) to learn to reason in more general and scalable ways. Quiet-STaR enables LMs to generate rationales at each token to explain future text. It proposes a token-wise parallel sampling algorithm that helps improve LM predictions by efficiently generating internal thoughts. The rationale generation is improved using REINFORCE.

Abstract

When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting -- ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9% → 10.9%) and CommonsenseQA (36.3% → 47.2%) and observe a perplexity improvement of difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way.
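The summary says rationale generation is improved using REINFORCE but does not spell out the objective, so the following is only a generic REINFORCE sketch with a mean baseline, not the paper's exact formulation. The reward used here (how well the future text is predicted after a sampled thought) and all numbers are assumptions made for illustration.

```python
from typing import List


def reinforce_advantages(logp_future: List[float]) -> List[float]:
    """Reward of each sampled thought minus the mean reward (the baseline)."""
    baseline = sum(logp_future) / len(logp_future)
    return [lp - baseline for lp in logp_future]


def reinforce_loss(logp_thoughts: List[float], logp_future: List[float]) -> float:
    """Surrogate loss -sum(advantage * log pi(thought)).

    Minimizing it raises the probability of thoughts that made the future
    text more likely than average and lowers it for the others. In a real
    training framework the advantages would be treated as constants.
    """
    advantages = reinforce_advantages(logp_future)
    return -sum(a * lp for a, lp in zip(advantages, logp_thoughts))


# Three hypothetical thoughts sampled at the same token position.
logp_future = [-2.0, -1.0, -3.0]    # log p(future text | thought), made up
logp_thoughts = [-5.0, -4.0, -6.0]  # log p(thought) under the LM, made up

advantages = reinforce_advantages(logp_future)
loss = reinforce_loss(logp_thoughts, logp_future)
print(advantages, loss)
```

The second thought helps prediction more than average, so it gets a positive advantage, while the third gets a negative one.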

Paper link

https://arxiv.org/abs/2403.09629

Further reading

https://x.com/omarsar0/status/1768681638009975088

Knowledge Conflicts for LLMs: A Survey

Paper introduction

An overview of the common issue of knowledge conflict when working with LLMs. The survey categorizes these conflicts into context-memory, inter-context, and intra-memory conflict. It also provides insights into their causes and potential ways to mitigate them.

Abstract

This survey provides an in-depth analysis of knowledge conflicts for large language models (LLMs), highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of knowledge conflicts: context-memory, inter-context, and intra-memory conflict. These conflicts can significantly impact the trustworthiness and performance of LLMs, especially in real-world applications where noise and misinformation are common. By categorizing these conflicts, exploring the causes, examining the behaviors of LLMs under such conflicts, and reviewing available solutions, this survey aims to shed light on strategies for improving the robustness of LLMs, thereby serving as a valuable resource for advancing research in this evolving area.

Paper link

https://arxiv.org/abs/2403.08319

Further reading

https://x.com/omarsar0/status/1768288774532858003

Stealing Part of a Production Language Model

Paper introduction

Presents the first model-stealing attack that extracts information from production language models like ChatGPT or PaLM-2. Shows that it is possible to recover the embedding projection layer of a transformer-based model through typical API access. As an example, the entire projection matrix was extracted from the OpenAI Ada and Babbage models for under $20.

Abstract

We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under $20 USD, our attack extracts the entire projection matrix of OpenAI's Ada and Babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack.
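One way to build intuition for why the hidden dimension can leak at all: if the final layer maps a hidden state of size h to vocabulary logits through a linear projection, then every logit vector lies in an h-dimensional subspace, so a stack of many logit vectors has rank at most h. The toy below simulates this with small integer matrices and exact arithmetic. It is a sketch under assumptions, not the paper's procedure: the model, the sizes (`HIDDEN`, `VOCAB`, `QUERIES`), and the idea that full logit vectors are directly observable are all made up for illustration, and a real API returns noisy, partial outputs that would need a numerical method instead of an exact rank.

```python
from fractions import Fraction
from typing import List


def matrix_rank(rows: List[List[int]]) -> int:
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


HIDDEN, VOCAB, QUERIES = 3, 8, 6  # toy sizes (hypothetical)

# Secret projection matrix W (VOCAB x HIDDEN), unknown to the "attacker".
W = [[i ** j for j in range(HIDDEN)] for i in range(VOCAB)]
# Hidden states produced by QUERIES different prompts (also unknown).
H = [[k ** j for j in range(HIDDEN)] for k in range(1, QUERIES + 1)]

# What the attacker observes: one full logit vector per query.
logits = [
    [sum(w * h for w, h in zip(W[i], H[k])) for i in range(VOCAB)]
    for k in range(QUERIES)
]

recovered_hidden_dim = matrix_rank(logits)
print(recovered_hidden_dim)
```

Even though each observed vector has `VOCAB` entries and more vectors are collected than there are hidden units, the rank of the stacked observations stays at the hidden size of the toy model.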

Paper link

https://arxiv.org/abs/2403.06634

Further reading

https://x.com/omarsar0/status/1767641831079067694

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Paper introduction

Proposes mixing expert LLMs into a Mixture-of-Experts LLM as a more compute-efficient approach for training LLMs. It is shown to be more efficient than training a larger generalist LLM or several separate specialized LLMs. The approach, BTX, first trains (in parallel) multiple copies of a seed LLM specialized in different domains (i.e., expert LLMs) and merges them into a single LLM using MoE feed-forward layers, followed by fine-tuning of the overall unified model.

Abstract

We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
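The merge step in the abstract (feed-forward parameters kept as separate experts, remaining parameters averaged) can be sketched as below. This is a toy illustration under assumptions: the parameter names, the `ffn.` prefix convention, and the flat lists of floats are hypothetical, and the router that the MoE fine-tuning stage would learn is not modeled.

```python
from typing import Dict, List, Union

Params = Dict[str, List[float]]
Merged = Dict[str, Union[List[float], List[List[float]]]]


def btx_merge(experts: List[Params]) -> Merged:
    """Combine separately trained copies of a seed model into one MoE-style model."""
    merged: Merged = {}
    n = len(experts)
    for name in experts[0]:
        if name.startswith("ffn."):
            # Feed-forward weights are kept side by side, one copy per expert.
            merged[name] = [e[name] for e in experts]
        else:
            # All remaining weights (e.g. attention) are averaged element-wise.
            merged[name] = [sum(vals) / n for vals in zip(*(e[name] for e in experts))]
    return merged


# Two hypothetical experts branched from the same seed model.
math_expert = {"attn.w": [1.0, 2.0], "ffn.w": [10.0, 20.0]}
code_expert = {"attn.w": [3.0, 4.0], "ffn.w": [30.0, 40.0]}

merged = btx_merge([math_expert, code_expert])
print(merged)
```

After this merge, a token-level router would still have to be trained so that each token is sent to a suitable feed-forward expert, which is what the MoE fine-tuning stage in the abstract is for.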

Paper link

https://arxiv.org/abs/2403.07816

Further reading

https://x.com/jaseweston/status/1767727740952682667

Large language models surpass human experts in predicting neuroscience results

Paper introduction

Proposes a benchmark, BrainBench, for evaluating the ability of LLMs to predict neuroscience results. Finds that LLMs surpass experts in predicting experimental outcomes. An LLM tuned on neuroscience literature was shown to perform even better.

Abstract

Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. To evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs were confident in their predictions, they were more likely to be correct, which presages a future where humans and LLMs team together to make discoveries. Our approach is not neuroscience-specific and is transferable to other knowledge-intensive endeavors.

Paper link

https://arxiv.org/abs/2403.03230

Further reading

https://x.com/ProfData/status/1765689739682754824

C4AI Command-R

Paper introduction

A 35B parameter model with a context length of 128k, optimized for use cases that include reasoning, summarization, and question answering. Command-R has multilingual generation capabilities evaluated in 10 languages, as well as performant tool use and RAG capabilities. It has been released for research purposes.

Paper link

https://huggingface.co/CohereForAI/c4ai-command-r-v01

Further reading

https://x.com/CohereForAI/status/1767275927505977455

Is Cosine-Similarity of Embeddings Really About Similarity?

Paper introduction

Studies embeddings derived from regularized linear models and derives analytically how cosine-similarity can yield arbitrary and meaningless similarities. Also finds that for some linear models the similarities are not even unique, while for others they are controlled by the regularization. The authors caution against blindly using cosine similarity and present considerations and alternatives.

Abstract

Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. A popular application is to quantify semantic similarity between high-dimensional objects by applying cosine-similarity to a learned low-dimensional feature embedding. This can work better but sometimes also worse than the unnormalized dot-product between embedded vectors in practice. To gain insight into this empirical observation, we study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless 'similarities.' For some linear models the similarities are not even unique, while for others they are implicitly controlled by the regularization. We discuss implications beyond linear models: a combination of different regularizations are employed when learning deep models; these have implicit and unintended effects when taking cosine-similarities of the resulting embeddings, rendering results opaque and possibly arbitrary. Based on these insights, we caution against blindly using cosine-similarity and outline alternatives.
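A small numeric toy can show how cosine similarities can fail to be unique. Assume (this is my own simplified setup, not the paper's exact model) that predictions depend only on dot products between item and user embeddings. Rescaling each latent dimension of the item embeddings by some factor and the user embeddings by its inverse then leaves every prediction unchanged, yet the cosine similarity between two items changes. The vectors and the scaling factors `d` below are arbitrary values chosen for illustration.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: List[float], v: List[float]) -> float:
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def rescale(v: List[float], s: List[float]) -> List[float]:
    return [x * y for x, y in zip(v, s)]


item1, item2 = [1.0, 1.0], [1.0, -1.0]  # two item embeddings
user = [2.0, 1.0]                       # one user embedding

d = [8.0, 0.125]                        # arbitrary per-dimension scaling
d_inv = [1.0 / x for x in d]

item1_s, item2_s = rescale(item1, d), rescale(item2, d)
user_s = rescale(user, d_inv)

# Model predictions (item-user dot products) before and after rescaling.
scores_before = (dot(item1, user), dot(item2, user))
scores_after = (dot(item1_s, user_s), dot(item2_s, user_s))

# Item-item cosine similarity before and after rescaling.
cos_before = cosine(item1, item2)
cos_after = cosine(item1_s, item2_s)
print(scores_before, scores_after, cos_before, cos_after)
```

Both sets of embeddings describe the same predictions, but the two items go from orthogonal to almost parallel, so which cosine value is "right" is decided by something other than the data fit, such as the regularization.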

Paper link

https://arxiv.org/abs/2403.05440

Further reading

https://x.com/_reachsumit/status/1767045820384477575

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Paper introduction

Provides a comprehensive overview of methods, analysis, and insights into multimodal LLM pre-training. Studies different architecture components and finds that carefully mixing image-caption, interleaved image-text, and text-only data is key for state-of-the-art performance. It also proposes a family of multimodal models of up to 30B parameters that achieve SOTA in pre-training metrics and have properties such as enhanced in-context learning and multi-image reasoning, which enable few-shot chain-of-thought prompting.

Abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.

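This post was compiled with a GPT model and may contain errors, so please also refer to the original text at the bottom of the post. If you notice anything unnatural or incorrect while reading, please let us know in the comments.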

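6 comments

prelude9903 2024-03-19

Could you tell me which machine translation tool you used?

ninebow 2024-03-19

Sure, I'm using DeepL (lol)
It recently became possible to create a translation glossary for Korean too, so I tried it, but it has some problems orz...

libner 2024-03-19

In the paper introduction for RAT, it looks like "rat" and "rag" were translated as the animal and a cleaning rag, respectively. I suspect the model read the lowercase words literally.

ninebow 2024-03-20

I've revised it as follows. Thank you! :D

Shows that iteratively revising a chain of thought (CoT) through information retrieval can significantly improve LLM reasoning and generation on long-form generation tasks. The core idea is that each thought step is revised with information retrieved in relation to the task query and the current and past thought steps. Retrieval Augmented Thoughts (RAT) can be applied to different models such as GPT-4 and CodeLlama-7b to improve long-form generation tasks (e.g., creative writing and embodied task planning). RAT is a zero-shot prompting method and substantially outperforms baselines including zero-shot CoT prompting and basic RAG.

ninebow 2024-03-19

Oh, you're right; I'll fix the original post lol
Thank you!

ninebow 2024-03-19

Oh, the title... please change it to "This Week's Major ML Papers" ;;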
