[2023/11/06 ~ 11/12] Top ML Papers of the Week

(discuss.pytorch.kr)

3 points, posted by ninebow, 2023-11-13

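Overview

This is a machine translation of the ML paper roundup that DAIR.AI publishes every week.
Most of this week's selected papers concern Transformer models and large language models (LLMs).
Titles such as "Simplifying Transformer Blocks", "Understanding In-Context Learning Abilities in Transformers", and "S-LoRA" suggest a focus on better understanding the architecture and learning mechanisms of Transformer models.
"Hallucination in LLMs", "On the Road with GPT-4V(ision)", and "GPT4All" deal with the performance and applications of GPT-style large language models, reflecting a strong emphasis on the development and application of LLMs.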

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Paper introduction

A comprehensive survey (50+ pages) on hallucination in LLMs; it covers the principles, taxonomy, challenges, and open questions related to hallucination in LLMs. #survey-paper #hallucination

Abstract

The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of LLMs in real-world scenarios, which attracts increasing attention to detect and mitigate these hallucinations. In this survey, we aim to provide a thorough and in-depth overview of recent advances in the field of LLM hallucinations. We begin with an innovative taxonomy of LLM hallucinations, then delve into the factors contributing to hallucinations. Subsequently, we present a comprehensive overview of hallucination detection methods and benchmarks. Additionally, representative approaches designed to mitigate hallucinations are introduced accordingly. Finally, we analyze the challenges that highlight the current limitations and formulate open questions, aiming to delineate pathways for future research on hallucinations in LLMs.

Paper link

https://arxiv.org/abs/2311.05232

Further reading

https://x.com/omarsar0/status/1722985251129966705

Simplifying Transformer Blocks

Paper introduction

Explores simplifying the transformer block and finds that many block components can be removed with no loss of training speed; across different architectures, such as autoregressive decoder-only and BERT encoder-only models, the simplified blocks match the per-update training speed and performance of standard transformers, and even achieve 15% faster training throughput with 15% fewer parameters.

Abstract

A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters.

Paper link

https://arxiv.org/abs/2311.01906

Further reading

https://x.com/maksym_andr/status/1722235666724192688

Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models

Paper introduction

Investigates how effectively transformers can bridge between the task families in their pretraining data mixture to identify and learn new tasks in-context, both inside and outside the pretraining distribution; in the regimes studied, there is limited evidence that the models' in-context learning behavior can generalize beyond their pretraining data.

Abstract

Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL) -- to perform new tasks when prompted with unseen input-output examples without any explicit model training. In this work, we study how effectively transformers can bridge between their pretraining data mixture, comprised of multiple distinct task families, to identify and learn new tasks in-context which are both inside and outside the pretraining distribution. Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of $(x, f(x))$ pairs rather than natural language. Our empirical results show transformers demonstrate near-optimal unsupervised model selection capabilities, in their ability to first in-context identify different task families and in-context learn within them when the task families are well-represented in their pretraining data. However when presented with tasks or functions which are out-of-domain of their pretraining data, we demonstrate various failure modes of transformers and degradation of their generalization for even simple extrapolation tasks. Together our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases that create fundamental generalization capabilities.
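To make the controlled setting more concrete, here is a minimal sketch of how a training sequence of $(x, f(x))$ pairs might be sampled from a mixture of task families. The two families (linear and sinusoidal functions), the mixture weight `p_linear`, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math
import random

def sample_linear(rng):
    """Draw a random linear function f(x) = a*x + b (hypothetical task family)."""
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    return lambda x: a * x + b

def sample_sinusoid(rng):
    """Draw a random sinusoid f(x) = sin(w*x) (hypothetical task family)."""
    w = rng.uniform(0.5, 2.0)
    return lambda x: math.sin(w * x)

FAMILIES = {"linear": sample_linear, "sinusoid": sample_sinusoid}

def sample_sequence(rng, n_pairs=8, p_linear=0.5):
    """Pick a family from the mixture, then emit n_pairs (x, f(x)) pairs."""
    family = "linear" if rng.random() < p_linear else "sinusoid"
    f = FAMILIES[family](rng)
    xs = [rng.uniform(-2, 2) for _ in range(n_pairs)]
    return family, [(x, f(x)) for x in xs]

rng = random.Random(0)
family, pairs = sample_sequence(rng)
```

In this framing, an in-distribution test draws `f` from one of the families seen in training, while an out-of-distribution test would use a function from neither family.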

Paper link

https://arxiv.org/abs/2311.00871

Further reading

https://x.com/abacaj/status/1721223737729581437

Simple and Controllable Music Generation

Paper introduction

A single-stage transformer-based LM that operates over several streams of compressed discrete music representation; it can generate high-quality samples (mono and stereo) while conditioning on textual descriptions or melodic features.

Abstract

We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft
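The abstract mentions token interleaving patterns without spelling one out. As a rough illustration, the sketch below shows a delay-style interleaving, in which stream k is shifted right by k steps so that a single model step emits one token per stream. The pattern details, the padding value, and the function names are assumptions for illustration and should not be read as MusicGen's actual code.

```python
PAD = -1  # hypothetical padding token id

def delay_interleave(streams, pad=PAD):
    """Shift stream k right by k steps so each step holds one token per stream.

    streams: list of K equal-length token lists (one per codebook).
    Returns a list of T + K - 1 steps, each a list of K tokens.
    """
    k_streams = len(streams)
    t_len = len(streams[0])
    steps = []
    for s in range(t_len + k_streams - 1):
        step = []
        for k, stream in enumerate(streams):
            t = s - k  # time index that stream k emits at step s
            step.append(stream[t] if 0 <= t < t_len else pad)
        steps.append(step)
    return steps

def delay_deinterleave(steps):
    """Invert delay_interleave, recovering the K original streams."""
    k_streams = len(steps[0])
    t_len = len(steps) - k_streams + 1
    return [[steps[t + k][k] for t in range(t_len)] for k in range(k_streams)]
```

With K streams of length T, this layout needs T + K - 1 model steps, compared with K × T steps if all tokens were flattened into one sequence.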

Paper link

https://arxiv.org/abs/2306.05284

Further reading

https://x.com/AIatMeta/status/1723043913638810025

Alternating Updates for Efficient Transformers

Paper introduction

A method that makes it possible to take advantage of increasing scale and capacity in transformer models without increasing the computational cost; it works on a sub-block of the widened representation at each layer and uses a predict-and-correct mechanism to update the inactivated blocks, widening the learned representation while incurring only a negligible increase in latency.

Abstract

It has been well established that increasing scale in deep transformer networks leads to improved quality and performance. However, this increase in scale often comes with prohibitive increases in compute cost and inference latency. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of the learned representation, i.e., the token embedding, while only incurring a negligible increase in latency. AltUp achieves this by working on a subblock of the widened representation at each layer and using a predict-and-correct mechanism to update the inactivated blocks. We present extensions of AltUp, such as its applicability to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity. Our experiments on benchmark transformer models and language tasks demonstrate the consistent effectiveness of AltUp on a diverse set of scenarios. Notably, on SuperGLUE and SQuAD benchmarks, AltUp enables up to 87% speedup relative to the dense baselines at the same accuracy.
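The predict-and-correct idea can be sketched in a few lines. The code below is one possible reading of the mechanism, not the paper's implementation. It assumes the widened representation is split into K sub-blocks, that predictions are scalar-weighted mixes of the blocks (weights `p`), and that corrections scale the active block's prediction error by per-block gains `g`. All names and values are illustrative.

```python
def altup_layer(blocks, layer_fn, active, p, g):
    """One AltUp-style step over K sub-blocks (each a list of floats).

    Predict: every block is estimated as a scalar-weighted mix of all blocks.
    Compute: only the active block goes through the expensive layer_fn.
    Correct: every prediction is nudged by the active block's prediction error.
    """
    k = len(blocks)
    width = len(blocks[0])
    predicted = [
        [sum(p[i][j] * blocks[j][d] for j in range(k)) for d in range(width)]
        for i in range(k)
    ]
    computed = layer_fn(blocks[active])  # the only call to the full layer
    error = [computed[d] - predicted[active][d] for d in range(width)]
    return [
        [predicted[i][d] + g[i] * error[d] for d in range(width)]
        for i in range(k)
    ]
```

The point of the design is that `layer_fn` runs on a single sub-block of the original width, so the per-layer cost stays close to that of the unwidened model, while the remaining blocks are updated with cheap scalar arithmetic.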

Paper link

https://arxiv.org/abs/2301.13310

Further reading

https://x.com/GoogleAI/status/1722004366201418132

Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves

Paper introduction

An effective prompting method that uses LLMs to rephrase and expand questions posed by humans to improve overall performance; it can improve the performance of different models across a wide range of tasks, and it can be combined with chain-of-thought to improve performance further.

Abstract

Misunderstandings arise not only in interpersonal communication but also between humans and Large Language Models (LLMs). Such discrepancies can make LLMs interpret seemingly unambiguous questions in unexpected ways, yielding incorrect responses. While it is widely acknowledged that the quality of a prompt, such as a question, significantly impacts the quality of the response provided by LLMs, a systematic method for crafting questions that LLMs can better comprehend is still underdeveloped. In this paper, we present a method named 'Rephrase and Respond' (RaR), which allows LLMs to rephrase and expand questions posed by humans and provide responses in a single prompt. This approach serves as a simple yet effective prompting method for improving performance. We also introduce a two-step variant of RaR, where a rephrasing LLM first rephrases the question and then passes the original and rephrased questions together to a different responding LLM. This facilitates the effective utilization of rephrased questions generated by one LLM with another. Our experiments demonstrate that our methods significantly improve the performance of different models across a wide range of tasks. We further provide a comprehensive comparison between RaR and the popular Chain-of-Thought (CoT) methods, both theoretically and empirically. We show that RaR is complementary to CoT and can be combined with CoT to achieve even better performance. Our work not only contributes to enhancing LLM performance efficiently and effectively but also sheds light on a fair evaluation of LLM capabilities. Data and codes are available at https://github.com/uclaml/Rephrase-and-Respond.
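The one-step and two-step variants described above can be sketched as plain prompt construction. In the sketch below, the `LLM` callable type is a hypothetical stand-in for whatever model client is used, and the prompt wording is illustrative, not quoted from the paper.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns a completion

def rar_one_step(question: str, llm: LLM) -> str:
    """Single prompt: ask the model to rephrase, expand, and answer at once."""
    prompt = f"{question}\nRephrase and expand the question, and respond."
    return llm(prompt)

def rar_two_step(question: str, rephraser: LLM, responder: LLM) -> str:
    """Two prompts: one model rephrases, another answers seeing both versions."""
    rephrased = rephraser(
        f"{question}\nRephrase and expand the question above to make it "
        "easier to answer. Keep all information from the original question."
    )
    return responder(
        f"(original) {question}\n(rephrased) {rephrased}\n"
        "Use the rephrased question to help answer the original question."
    )
```

Because the two-step variant passes both the original and the rephrased question to the responder, a stronger model's rephrasing can be reused by a different model, which is the cross-model use the abstract describes.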

Paper link

https://arxiv.org/abs/2311.04205

Further reading

https://x.com/QuanquanGu/status/1722364144379396513

On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

Paper introduction

Provides an exhaustive evaluation of the latest state-of-the-art visual language model, GPT-4V(ision), and its application in autonomous driving; the model demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems.

Abstract

The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been hindered by their inability to grasp the nuance of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in the development of common sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving. The advent of Visual Language Models (VLM) represents a novel frontier in realizing fully autonomous vehicle driving. This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios. We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our comprehensive tests span from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems. It showcases the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. Project is now available on GitHub for interested parties to access and utilize: https://github.com/PJLab-ADG/GPT4V-AD-Exploration

Paper link

https://arxiv.org/abs/2311.05332

Further reading

https://x.com/arankomatsuzaki/status/1722795897359139057

GPT4All: An Ecosystem of Open Source Compressed Language Models

Paper introduction

Outlines technical details of the GPT4All model family along with the open-source repository that aims to democratize access to LLMs.

Abstract

Large language models (LLMs) have recently achieved human-level performance on a range of professional and academic benchmarks. The accessibility of these models has lagged behind their performance. State-of-the-art LLMs require costly infrastructure; are only accessible via rate-limited, geo-locked, and censored web interfaces; and lack publicly available code and technical reports. In this paper, we tell the story of GPT4All, a popular open source repository that aims to democratize access to LLMs. We outline the technical details of the original GPT4All model family, as well as the evolution of the GPT4All project from a single model into a fully fledged open source ecosystem. It is our hope that this paper acts as both a technical overview of the original GPT4All models as well as a case study on the subsequent growth of the GPT4All open source ecosystem.

Paper link

https://arxiv.org/abs/2311.04931

Further reading

https://x.com/_akhaliq/status/1722833378590793915

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Paper introduction

An approach that enables the scalable serving of many LoRA adapters; it stores all adapters in main memory and fetches the adapters of currently running queries into GPU memory; it employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation; it improves throughput by up to 4x compared to other solutions and increases the number of served adapters by several orders of magnitude.

Abstract

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA
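As a toy illustration of the data flow described above, the sketch below keeps every adapter in a host-side dictionary, copies only the adapters needed by the current batch into a "device" dictionary, and computes each request as the shared base projection plus its own low-rank update `x A B`. It is pure Python and deliberately leaves out Unified Paging, tensor parallelism, and the custom CUDA kernels. The class and function names are hypothetical and are not S-LoRA's API.

```python
def matvec(x, m):
    """Row vector x (len n) times matrix m (n rows, k cols) -> list of len k."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]

class AdapterCache:
    """Toy stand-in for 'all adapters in main memory, active ones on the GPU'."""

    def __init__(self, host_adapters):
        self.host = host_adapters   # adapter_id -> (A, B), the full collection
        self.device = {}            # adapters needed by the current batch

    def fetch(self, adapter_ids):
        wanted = set(adapter_ids)
        # Evict adapters no running request uses, then load the missing ones.
        self.device = {k: v for k, v in self.device.items() if k in wanted}
        for k in wanted:
            self.device.setdefault(k, self.host[k])

def serve_batch(base_w, cache, batch):
    """batch: list of (x, adapter_id). Returns one output row per request."""
    cache.fetch(aid for _, aid in batch)
    outputs = []
    for x, aid in batch:
        a, b = cache.device[aid]           # A: n x r, B: r x k; r varies per adapter
        base = matvec(x, base_w)           # shared base-model computation
        delta = matvec(matvec(x, a), b)    # low-rank update x A B
        outputs.append([u + v for u, v in zip(base, delta)])
    return outputs
```

Requests in the same batch may use adapters of different rank, which is what makes the batching heterogeneous; a real system would fuse these per-request updates into batched GPU kernels and would not loop in Python.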

Paper link

https://arxiv.org/abs/2311.03285v2

Further reading

https://x.com/ai_database/status/1722190708797592013

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

Paper introduction

Proposes a dynamic QA benchmark (FreshQA) to test the factuality of LLM-generated text; proposes FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt; finds that instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers.

Abstract

Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
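To illustrate how retrieved evidence could be folded into a prompt, here is a minimal sketch. It assumes that evidence arrives as dated snippets and that listing them oldest-first, with the freshest closest to the question, is a reasonable ordering. The abstract only says that the number and order of evidences matter, so that ordering, the dictionary keys, the prompt wording, and the function name are all assumptions, not the paper's FreshPrompt format.

```python
from datetime import date

def build_fresh_prompt(question, evidences, max_evidences=5, today=None):
    """Assemble a prompt from dated search snippets.

    evidences: list of dicts with hypothetical keys 'source', 'date', 'snippet'.
    Keeps the most recent max_evidences items and lists them oldest-first,
    so the freshest evidence sits closest to the question.
    """
    today = today or date.today()
    newest = sorted(evidences, key=lambda e: e["date"], reverse=True)[:max_evidences]
    ordered = sorted(newest, key=lambda e: e["date"])
    lines = [
        f"source: {e['source']}\ndate: {e['date'].isoformat()}\nsnippet: {e['snippet']}"
        for e in ordered
    ]
    lines.append(f"question: {question}")
    lines.append(f"As of {today.isoformat()}, answer concisely and directly.")
    return "\n\n".join(lines)
```

The closing instruction asks for a concise, direct answer, in line with the abstract's finding that such answers hallucinate less than verbose ones.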
