31] 今週の主要ML論文 (Top ML Papers of the Week)

(discuss.pytorch.kr)

7 ポイント投稿者 ninebow 2024-01-01 | まだコメントはありません。 | WhatsAppで共有

概要

DAIR.AIで毎週公開されるML論文に関する記事を自動翻訳しました。
今週選定された論文を見ると、全体としてはGPT-4のような大規模言語モデル（Large Language Models, LLMs）を中心とした研究が主流を成している傾向があるようです。特に、これらの研究はGPT-4の新しいAPIの活用、LLMにおける事実想起能力、そしてLLMをどのようにより優れた高密度な検索能力へと発展させるかに焦点を当てています。また、言語モデルベースの数学的問題解決や、こうしたモデルがどのように推論を行うかというテーマも含まれています。
このような傾向が見られるのは、LLMが人工知能分野において依然として主要な研究テーマであり続けているためかもしれません。GPT-4のようなモデルは、優れた言語理解および生成能力をもとに多様な応用分野で活用可能性を広げており、この能力を改善し、新たな形で活用する研究が継続的に行われています。実際の性能向上に向けた具体的な方法論の研究や適用事例の分析が重要な研究領域として浮上しており、今週選定された論文もこうした傾向を反映しているようです。
一方で、LLMの理解度と推論能力を評価し、向上させようとする研究も人気のあるテーマのようです。これを通じて、知的エージェントが人間とより自然かつ効果的に相互作用できる方法を探ることは、人工知能分野において非常に重要な課題でしょう。これによって改善された効率性や実生活への適用可能性を探求することは、今後の研究動向において重要な位置を占めると予想されます。

CogAgent: GUIエージェントのための視覚言語モデル / CogAgent: A Visual Language Model for GUI Agents

論文紹介

GUIの理解とナビゲーションに特化した180億パラメータの視覚言語モデルを提供し、高解像度入力（1120x1120）をサポート、視覚的質問応答、視覚的グラウンディング、GUIエージェントのようなタスクで能力を発揮し、テキストが豊富な5つのベンチマークと4つの一般VQAベンチマークで最先端を達成しました。

Presents an 18 billion parameter visual language model specializing in gui understanding and navigation; supports high-resolution inputs (1120x1120) and shows abilities in tasks such as visual q&a, visual grounding, and gui agent; achieves state of the art on 5 text-rich and 4 general vqa benchmarks.

論文要旨(Abstract)

人々は、コンピュータやスマートフォンの画面のようなグラフィカルユーザーインターフェース（GUI）を通じて、デジタルデバイス上で膨大な時間を過ごしています。ChatGPTのような大規模言語モデル（LLM）は、メール作成のような作業を支援できますが、GUIを理解して相互作用することには苦戦するため、自動化レベルを高める潜在力が制限されています。本論文では、GUIの理解とナビゲーションに特化した180億パラメータの視覚言語モデル（VLM）であるCogAgentを紹介します。低解像度および高解像度の画像エンコーダの両方を活用することで、CogAgentは1120*1120解像度の入力をサポートし、小さなページ要素やテキストも認識できます。汎用視覚言語モデルとして、CogAgentはVQAv2、OK-VQA、Text-VQA、ST-VQA、ChartQA、infoVQA、DocVQA、MM-Vet、POPEを含む、テキストが豊富な5つのベンチマークと4つの一般VQAベンチマークで最先端の性能を達成しました。スクリーンショットのみを入力として使用するCogAgentは、抽出されたHTMLテキストを用いるLLMベースの手法であるMind2WebおよびAITWを、PCとAndroidのGUIナビゲーションタスクの両方で上回り、最先端技術をさらに前進させました。モデルとコードは https://github.com/THUDM/CogVLM で公開されています。

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120*1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW, advancing the state of the art. The model and codes are available at https://github.com/THUDM/CogVLM .

論文リンク

https://arxiv.org/abs/2312.08914

さらに読む

https://x.com/cenyk1230/status/1739916469272789222

Google GeminiからOpenAI Q（Q-Star）まで: 生成AI研究環境の再編に関するサーベイ / From Google Gemini to OpenAI Q (Q-Star): A Survey of Reshaping the Generative Artificial Intelligence (AI) Research Landscape

論文紹介

300本以上の論文を調査し、生成AI分野で注目すべき研究開発事項を要約したこのレポートは、計算上の困難、スケーラビリティ、実世界での適用可能性、そして医療、金融、教育などの分野で進展を牽引しうる生成AIの可能性を扱っています。

Surveys 300+ papers and summarizes research developments to look at in the space of generative ai; it covers computational challenges, scalability, real-world implications, and the potential for gen ai to drive progress in fields like healthcare, finance, and education.

論文要旨(Abstract)

この包括的なサーベイは、進化する生成AIの動向を探り、特にMixture of Experts（MoE）、マルチモーダル学習、Artificial General Intelligence（AGI）に向けた推測上の進展がもたらす変革的な影響に焦点を当てています。このレポートは、生成AIの現状と将来の軌道を批判的に検討し、GoogleのGeminiや期待されるOpenAIのQ*プロジェクトのようなイノベーションが、生成AI研究のタクソノミーへの影響分析を含め、さまざまな領域における研究優先順位と応用をどのように再編しているかを考察しました。また、これらの技術の計算上の課題、スケーラビリティ、実世界への影響を評価すると同時に、医療、金融、教育などの分野で大きな進展を促進しうる潜在力を強調しています。さらに、AIをテーマにした論文とAIが生成した論文の双方のプレプリント増加によって生じる新たな学術的課題を扱い、査読プロセスと学術コミュニケーションへの影響を検討しました。この研究は、AI開発において倫理的かつ人間中心の方法を取り入れ、社会規範と福祉との整合性を確保する重要性を強調するとともに、生成AIにおけるMoE、マルチモーダル性、AGIのバランスの取れた良識的な活用に焦点を当てた将来のAI研究戦略を示しました。

This comprehensive survey explored the evolving landscape of generative Artificial Intelligence (AI), with a specific focus on the transformative impacts of Mixture of Experts (MoE), multimodal learning, and the speculated advancements towards Artificial General Intelligence (AGI). It critically examined the current state and future trajectory of generative Artificial Intelligence (AI), exploring how innovations like Google's Gemini and the anticipated OpenAI Q* project are reshaping research priorities and applications across various domains, including an impact analysis on the generative AI research taxonomy. It assessed the computational challenges, scalability, and real-world implications of these technologies while highlighting their potential in driving significant progress in fields like healthcare, finance, and education. It also addressed the emerging academic challenges posed by the proliferation of both AI-themed and AI-generated preprints, examining their impact on the peer-review process and scholarly communication. The study highlighted the importance of incorporating ethical and human-centric methods in AI development, ensuring alignment with societal norms and welfare, and outlined a strategy for future AI research that focuses on a balanced and conscientious use of MoE, multimodality, and AGI in generative AI.

論文リンク

https://arxiv.org/abs/2312.10868

さらに読む

https://x.com/omarsar0/status/1740119485011390558

PromptBench: 大規模言語モデル評価のための統合ライブラリ / PromptBench: A Unified Library for Evaluation of Large Language Models

論文紹介

プロンプト構築、プロンプトエンジニアリング、データセットおよびモデルの読み込み、敵対的プロンプト攻撃、動的評価プロトコル、分析ツールなどの機能で構成された統合ライブラリで、LLMの包括的な評価と分析を支援します。

A unified library that supports comprehensive evaluation and analysis of llms; it consists of functionalities for prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools.

論文要旨(Abstract)

大規模言語モデル（LLM）の評価は、その性能を評価し、潜在的なセキュリティリスクを軽減するうえで極めて重要です。本論文では、LLMを評価するための統合ライブラリであるPromptBenchを紹介します。このライブラリは、研究者が容易に利用・拡張できるいくつかの主要コンポーネント、すなわちプロンプト構築、プロンプトエンジニアリング、データセットおよびモデルの読み込み、敵対的プロンプト攻撃、動的評価プロトコル、分析ツールから構成されています。PromptBenchは、新しいベンチマークの作成、ダウンストリームアプリケーションの展開、新たな評価プロトコルの設計に関する独創的な研究を促進できる、研究目的のためのオープンで汎用的かつ柔軟なコードベースとして設計されています。コードは https://github.com/microsoft/promptbench で公開されており、継続的にサポートされる予定です。

The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that are easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed to be an open, general, and flexible codebase for research purposes that can facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: https://github.com/microsoft/promptbench and will be continuously supported.

論文リンク

https://arxiv.org/abs/2312.07910v1

さらに読む

https://x.com/omarsar0/status/1739360426134028631

新しいGPT-4 APIを活用する / Exploiting Novel GPT-4 APIs

論文紹介

GPT-4 APIで公開されている3つの機能、すなわちファインチューニング、関数呼び出し、知識検索に対してレッドチーム評価を実施し、主な結果として次を導きました。1) 有害な例15件または無害な例100件でのファインチューニングにより、GPT-4の中核的な安全装置を取り除ける、2) GPT-4アシスタントに関数呼び出しスキーマを開示させ、任意の関数呼び出しを実行させられる、3) 検索文書に命令を注入することで知識検索を乗っ取れる。

Performs red-teaming on three functionalities exposed in the gpt-4 apis: fine-tuning, function calling, and knowledge retrieval; main findings: 1) fine-tuning on as few as 15 harmful examples or 100 benign examples can remove core safeguards from gpt-4, 2) gpt-4 assistants divulge the function call schema and can be made to execute arbitrary function calls, and 3) knowledge retrieval can be hijacked by injecting instructions into retrieval documents.

論文要旨(Abstract)

言語モデルへの攻撃は通常、モデル重みへの完全なホワイトボックスアクセス、またはテキスト生成APIに限定されたブラックボックスアクセスという、2つの極端な脅威モデルのいずれかを前提としています。しかし、実世界のAPIは単なるテキスト生成よりも柔軟であることが多く、こうしたAPIは新たな脅威ベクトルにつながる「グレーボックス」アクセスを露出させています。これを調べるため、私たちはGPT-4 APIで公開されている3つの新機能、すなわちファインチューニング、関数呼び出し、知識検索をレッドチーム評価しました。その結果、有害な例15件または無害な例100件でモデルをファインチューニングするだけで、GPT-4の中核的な安全装置を取り除き、さまざまな有害出力を可能にできることが分かりました。さらに、GPT-4アシスタントは関数呼び出しスキーマを容易に開示し、任意の関数呼び出しを実行するよう誘導できることも分かりました。最後に、検索文書に命令を挿入することで知識検索を乗っ取れることも確認しました。これらの脆弱性は、APIが公開する機能に追加があるたびに新たな脆弱性が生まれ得ることを浮き彫りにしています。

Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API. However, real-world APIs are often more flexible than just text generation: these APIs expose ``gray-box'' access leading to new threat vectors. To explore this, we red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents. These vulnerabilities highlight that any additions to the functionality exposed by an API can create new vulnerabilities.

論文リンク

https://arxiv.org/abs/2312.14302

さらに読む

https://x.com/omarsar0/status/1739677995747450964

LLMにおける事実想起 / Fact Recalling in LLMs

論文紹介

事実記憶のためのルックアップテーブルをMLP層がどのように実装しているかを調査し、Pythia 2.8Bの初期MLPが、さまざまなアスリートが3種類のスポーツのうちどれをしているかをどのように参照しているかへと研究範囲を広げ、初期のMLP層がルックアップテーブルとして機能していると提案するとともに、モデル内の事実知識の想起をマルチトークン埋め込みとして捉えることを推奨しています。

Investigates how mlp layers implement a lookup table for factual recall; scopes the study on how early mlps in pythia 2.8b look up which of 3 different sports various athletes play; suggests that early mlp layers act as a lookup table and recommends thinking about the recall of factual knowledge in the model as multi-token embeddings.

論文リンク

https://www.alignmentforum.org/s/hpWHhjvjn67LJ4xXX/p/iGuwZTHWb6DFY3sKB

さらに読む

https://x.com/NeelNanda5/status/1738559368361349122

数学のための生成AI: 第1部 - MathPile: 10億トークン規模の数学事前学習コーパス / Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

論文紹介

基盤モデルを学習するために、約95億トークンで構成される、多様で高品質な数学中心のコーパスを提示します。

Presents a diverse and high-quality math-centric corpus comprising of ~9.5 billion tokens to train foundation models.

論文要旨(Abstract)

高品質で大規模なコーパスは、ファウンデーションモデル構築の礎です。本研究では、約95億トークンから成る、多様で高品質な数学中心コーパス ${MathPile}$ を紹介します。このコーパスの構築にあたって、私たちは「少ないほど豊かである」という原則を貫き、事前学習段階においてさえデータの量より質が優先されるという確固たる信念を持っていました。前処理、事前フィルタリング、言語識別、クリーニング、フィルタリング、重複除去などの複雑な処理を通じて、コーパスの高品質を保証するために細心のデータ収集・処理の取り組みを行いました。さらに、ダウンストリームのベンチマークテストセットに対してデータ汚染検出を実施し、重複を除去しました。テキストを通じた数学的推論が、言語モデルの数学的推論能力の向上に役立つことを期待しています。今後、この分野の発展を促進するため、処理に使用したスクリプトとともに複数バージョンの $MathPile$ をオープンソース化する予定です。

High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce ${MathPile}$, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of {less is more}, firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our ${MathPile}$ can help to enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of \mathpile with the scripts used for processing, to facilitate future developments in this field.

論文リンク

https://arxiv.org/abs/2312.17120

さらに読む

https://x.com/arankomatsuzaki/status/1740564961032556942

原則に基づく指針だけでLLaMA-1/2、GPT-3.5/4に質問できる / Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4

論文紹介

大規模言語モデルへの問い合わせとプロンプト作成のプロセスを簡素化するために設計された26の指針原則を紹介し、これらの原則を適用して llama-1/2(7b, 13b, 70b)、gpt-3.5/4 に関する広範な実験を行い、指示およびプロンプト設計に対する有効性を検証します。

Introduces 26 guiding principles designed to streamline the process of querying and prompting large language models; applies these principles to conduct extensive experiments on llama-1/2 (7b, 13b and 70b), gpt-3.5/4 to verify their effectiveness on instructions and prompts design.

論文要旨(Abstract)

本論文では、大規模言語モデルへの問い合わせとプロンプト作成のプロセスを簡素化するために設計された26の基本原則を紹介します。私たちの目的は、さまざまな規模の大規模言語モデルに対する質問の定式化、その能力の検証、そして異なるプロンプトを与えた際のさまざまな規模の大規模言語モデルの振る舞いについてのユーザー理解を高めるための基礎概念を単純化することです。命令およびプロンプト設計に対する提案原則の有効性を検証するため、LLaMA-1/2(7B, 13B, 70B)、GPT-3.5/4 で広範な実験を実施しました。この研究が、大規模言語モデルのプロンプティングを研究する研究者により良いガイドを提供できることを願っています。プロジェクトページは https://github.com/VILA-Lab/ATLAS で公開されています。

This paper introduces 26 guiding principles designed to streamline the process of querying and prompting large language models. Our goal is to simplify the underlying concepts of formulating questions for various scales of large language models, examining their abilities, and enhancing user comprehension on the behaviors of different scales of large language models when feeding into different prompts. Extensive experiments are conducted on LLaMA-1/2 (7B, 13B and 70B), GPT-3.5/4 to verify the effectiveness of the proposed principles on instructions and prompts design. We hope that this work provides a better guide for researchers working on the prompting of large language models. Project page is available at https://github.com/VILA-Lab/ATLAS.

論文リンク

https://arxiv.org/abs/2312.16171v1

さらに読む

https://x.com/_akhaliq/status/1739857456161759455

ファウンデーションモデルを用いた推論に関するサーベイ / A Survey of Reasoning with Foundation Models

論文紹介

さまざまな推論タスク、手法、ベンチマーク、将来の可能性のある方向性における最新の進展を強調しつつ、推論における重要なファウンデーションモデルの包括的なサーベイを提供し、さらにマルチモーダル学習、自律エージェント、スーパーアラインメントのような他の発展がどのように推論研究を加速・拡張するかについても議論します。

Provides a comprehensive survey of seminal foundational models for reasoning, highlighting the latest advancements in various reasoning tasks, methods, benchmarks, and potential future directions; also discusses how other developments like multimodal learning, autonomous agents, and super alignment accelerate and extend reasoning research.

論文要旨(Abstract)

複雑な問題解決において重要な能力である推論は、交渉、医療診断、犯罪捜査といったさまざまな実世界の場面で中核的な役割を果たします。これは人工汎用知能（AGI）分野の基本的方法論として用いられています。ファウンデーションモデルの開発が継続するにつれて、推論タスクにおけるファウンデーションモデルの能力を探ることへの関心が高まっています。本論文では、推論のために提案された、あるいは適用可能な重要なファウンデーションモデルを紹介し、多様な推論タスク、手法、ベンチマークにおける最新の進展を強調します。続いて、ファウンデーションモデル内で推論能力が出現する背景にある潜在的な今後の方向性を考察します。また、推論の文脈におけるマルチモーダル学習、自律エージェント、スーパーアラインメントの関連性についても議論します。これらの将来の研究方向を議論することで、研究者がこの分野を探求するうえで着想を得て、ファウンデーションモデルによる推論のさらなる発展が促進され、AGIの発展に貢献できることを期待しています。

Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, there is a growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed or adaptable for reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into the potential future directions behind the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers in their exploration of this field, stimulate further advancements in reasoning with foundation models, and contribute to the development of AGI.

論文リンク

https://arxiv.org/abs/2312.11562v4

さらに読む

https://x.com/omarsar0/status/1740729489661874632

高密度検索のより良い基盤となる大規模言語モデルの構築 / Making Large Language Models A Better Foundation For Dense Retrieval

論文紹介

高密度検索向けにLLMを適応させるLLaRAを提案します。これは2つの事前タスクであるEBAE（Embedding-Based Auto-Encoding）とEBAR（Embedding-Based Auto-Regression）で構成され、それぞれLLMのテキスト埋め込みを用いて入力文のトークンを再構成し、次の文のトークンを予測します。llama-2-7bはMSMARCOやBEIRのようなベンチマークで改善されました。

Proposes llara which adapts an llm for dense retrieval; it consists of two pretext tasks: ebae (embedding-based auto-encoding) and ebar (embedding-based auto-regression), where the text embeddings from llm are used to reconstruct the tokens for the input sentence and predict the tokens for the next sentence, respectively; a llama-2-7b was improved on benchmarks like msmarco and beir.

論文要旨(Abstract)

高密度検索では、クエリと文書の間の意味的関係を表現するために、識別的なテキスト埋め込みを学習する必要があります。意味理解に優れた能力を持つLLM（大規模言語モデル）の活用は有益である可能性があります。しかし、LLMはテキストを埋め込みとして表現することとは作業パターンがまったく異なるテキスト生成タスクによって事前学習されています。そのため、高密度検索のバックボーンエンコーダとして効果的に初期化できるよう、LLMをどのように適切に適応させるかを研究することが不可欠です。本論文では、高密度検索アプリケーションのためにLLMを事後的に適応させる新しいアプローチLLaRA（LLM adapted for dense RetrievAl）を提案します。LLaRAは2つの事前タスクで構成されます。すなわち、LLMのテキスト埋め込みを用いて入力文のトークンを再構成するEBAE（Embedding-Based Auto-Encoding）と、次の文のトークンを予測するEBAR（Embedding-Based Auto-Regression）です。LLaRAはシンプルで軽量でありながら、非常に高い有効性を示しました。本手法はWikipediaコーパス上でLLaMA-2-7B（base）を適応させるために適用され、MSMARCOやBEIRのようなさまざまな高密度検索ベンチマークにおいて、モデルのファインチューニング後の性能を大幅に向上させました。モデルとコードはBGEリポジトリで公開される予定です。

Dense retrieval needs to learn discriminative text embeddings to represent the semantic relationship between query and document. It may benefit from the using of large language models (LLMs), given LLMs' strong capability on semantic understanding. However, the LLMs are pre-trained by text generation tasks, whose working pattern is completely different from representing texts as embeddings. As a result, it is imperative to study how to adapt LLMs properly so that they can be effectively initialized as the backbone encoder for dense retrieval. In this paper, we propose a novel approach, called LLaRA (LLM adapted for dense RetrievAl), which works as a post-hoc adaptation of LLM for the dense retrieval application. LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the text embeddings from LLM are used to reconstruct the tokens for the input sentence and predict the tokens for the next sentence, respectively. LLaRA turns out to be simple, lightweight, and highly effective. It is applied to adapt LLaMA-2-7B (base) on the Wikipedia corpus, where it substantially improves the model's fine-tuned performances on a variety of dense retrieval benchmarks, like MSMARCO and BEIR. Our model and code will be made publicly available at BGE repository.

論文リンク

https://arxiv.org/abs/2312.15503v1

GeminiとGPT-4V: 質的事例を通じた視覚言語モデルの予備比較と組み合わせ / Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

論文紹介

いくつかの定性的な事例を通じて、GeminiやGPT-4Vのような視覚言語モデルを包括的に予備比較し、組み合わせた結果、GPT-4Vは正確で簡潔な回答を提供する一方、Geminiは関連画像やリンクを添えた詳細で幅広い回答の提供に優れていることが分かりました。

Provides a comprehensive preliminary comparison and combination of vision-language models like gemini and gpt-4v through several qualitative cases; finds that gpt-4v is precise and succinct in responses, while gemini excels in providing detailed, expansive answers accompanied by relevant imagery and links.

論文要旨(Abstract)

急速に発展しているマルチモーダル大規模言語モデル（MLLM）の分野は、人工知能における言語処理と視覚処理の統合を先導しています。本論文では、2つの先駆的モデル、GoogleのGeminiとOpenAIのGPT-4V(ision)について、詳細な比較研究を提示します。本研究では、Vision-Language Capability、人間との相互作用、時間的理解、さらに知能指数および感情指数の評価といった主要な観点から、両モデルを多面的に評価します。分析の中核は、各モデルが持つ独自の視覚理解能力の探究にあります。さまざまな産業応用シナリオにおける性能を評価するため、一連の構造化された実験を実施し、実用的有用性について包括的な視点を提示しました。単純な性能比較にとどまらず、均衡が取れた公正な分析を行うために、プロンプトやシナリオの調整も含めています。今回の調査結果は、両モデルの固有の強みと適したニッチを明らかにしています。GPT-4Vは正確で簡潔な回答によって際立つ一方、Geminiは関連画像やリンクを伴う詳細で広範な回答の提供に優れています。これらの知見は、GeminiとGPT-4Vの比較上の長所を明らかにするだけでなく、マルチモーダル基盤モデルを取り巻く進化する状況を浮き彫りにし、この分野の今後の発展への道を開くものです。比較後、より良い結果を得るために両モデルを組み合わせる試みも行いました。最後に、この分野に先駆的な貢献を果たしたGPT-4VおよびGeminiのチームメンバーに深い感謝を表します。また、画像サンプル、プロンプト、GPT-4V関連の結果を幅広く収集し、分析の基盤を提供したYangらの『Dawn』で提示された包括的な定性分析にも謝意を表します。

The rapidly evolving sector of Multi-modal Large Language Models (MLLMs) is at the forefront of integrating linguistic and visual processing in artificial intelligence. This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision). Our study involves a multi-faceted evaluation of both models across key dimensions such as Vision-Language Capability, Interaction with Humans, Temporal Understanding, and assessments in both Intelligence and Emotional Quotients. The core of our analysis delves into the distinct visual comprehension abilities of each model. We conducted a series of structured experiments to evaluate their performance in various industrial application scenarios, offering a comprehensive perspective on their practical utility. We not only involve direct performance comparisons but also include adjustments in prompts and scenarios to ensure a balanced and fair analysis. Our findings illuminate the unique strengths and niches of both models. GPT-4V distinguishes itself with its precision and succinctness in responses, while Gemini excels in providing detailed, expansive answers accompanied by relevant imagery and links. These understandings not only shed light on the comparative merits of Gemini and GPT-4V but also underscore the evolving landscape of multimodal foundation models, paving the way for future advancements in this area. After the comparison, we attempted to achieve better results by combining the two models. Finally, We would like to express our profound gratitude to the teams behind GPT-4V and Gemini for their pioneering contributions to the field. Our acknowledgments are also extended to the comprehensive qualitative analysis presented in 'Dawn' by Yang et al. This work, with its extensive collection of image samples, prompts, and GPT-4V-related results, provided a foundational basis for our analysis.

[2023/12/25 ~ 12/31] 今週の主要ML論文 (Top ML Papers of the Week)

概要

CogAgent: GUIエージェントのための視覚言語モデル / CogAgent: A Visual Language Model for GUI Agents

論文紹介

論文要旨(Abstract)

論文リンク

さらに読む

Google GeminiからOpenAI Q*（Q-Star）まで: 生成AI研究環境の再編に関するサーベイ / From Google Gemini to OpenAI Q* (Q-Star): A Survey of Reshaping the Generative Artificial Intelligence (AI) Research Landscape

論文紹介

論文要旨(Abstract)

論文リンク

さらに読む

PromptBench: 大規模言語モデル評価のための統合ライブラリ / PromptBench: A Unified Library for Evaluation of Large Language Models

論文紹介

論文要旨(Abstract)

論文リンク

さらに読む

新しいGPT-4 APIを活用する / Exploiting Novel GPT-4 APIs

論文紹介

論文要旨(Abstract)

論文リンク

さらに読む

LLMにおける事実想起 / Fact Recalling in LLMs

論文紹介

論文リンク

さらに読む

数学のための生成AI: 第1部 - MathPile: 10億トークン規模の数学事前学習コーパス / Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

論文紹介

論文要旨(Abstract)

論文リンク

さらに読む

原則に基づく指針だけでLLaMA-1/2、GPT-3.5/4に質問できる / Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4

論文紹介

論文要旨(Abstract)

論文リンク

さらに読む

ファウンデーションモデルを用いた推論に関するサーベイ / A Survey of Reasoning with Foundation Models

論文紹介

論文要旨(Abstract)

論文リンク

さらに読む

高密度検索のより良い基盤となる大規模言語モデルの構築 / Making Large Language Models A Better Foundation For Dense Retrieval

論文紹介

論文要旨(Abstract)

論文リンク

GeminiとGPT-4V: 質的事例を通じた視覚言語モデルの予備比較と組み合わせ / Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

論文紹介

論文要旨(Abstract)

論文リンク

さらに読む

原文

関連記事

まだコメントはありません。

Google GeminiからOpenAI Q（Q-Star）まで: 生成AI研究環境の再編に関するサーベイ / From Google Gemini to OpenAI Q (Q-Star): A Survey of Reshaping the Generative Artificial Intelligence (AI) Research Landscape