OpenAI audio models

(openai.fm)

13 points | posted by GN⁺ on 2025-03-21 | 2 comments

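An interactive demo where developers can try the new text-to-speech model in the OpenAI API
Prompts can specify voice affect, tone, pacing, emotions, pronunciation, pauses, and more in detail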

Demo

Choose a voice: 11 options, including Alloy, Ash, Ballad, Coral, and Echo
Choose a vibe: Sincere, Friendly, Noir Detective, Robot, Auctioneer, and others

Example: Sincere

Voice Affect: Calm, composed, and reassuring. Competent and in control, instilling trust.  
Tone: Sincere, empathetic, with genuine concern for the customer and understanding of the situation.  
Pacing: Slower during the apology to allow for clarity and processing. Faster when offering solutions to signal action and resolution.  
Emotions: Calm reassurance, empathy, and gratitude.  
Pronunciation: Clear, precise: Ensures clarity, especially with key details. Focus on key words like "refund" and "patience."   
Pauses: Before and after the apology to give space for processing the apology.

Example: Medieval Knight

Voice Affect: Deep, commanding, and slightly dramatic, reflecting the grandeur of ancient English storytelling.  
Tone: Noble, heroic, and formal, capturing the essence of a medieval knight and epic adventure.  
Emotions: A blend of excitement, anticipation, mystery, and the gravity of fate and duty.  
Pronunciation: Clear, deliberate, with a slightly formal cadence; words like "hast", "thou", and "doth" are slowly emphasized to reflect archaic English pronunciation patterns.  
Pauses: Pause after archaic English phrases like "Lo!" and "Hark!", and between clauses such as "Choose thy path" to emphasize the importance of the decision and allow the listener to reflect on the seriousness of the quest.
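
The two examples share a labelled-field layout (Voice Affect, Tone, Pacing, and so on). A minimal sketch of how such a prompt might be assembled and sent is shown here. It assumes the official openai Python package, its audio.speech endpoint with an instructions parameter, and an OPENAI_API_KEY in the environment. The helper names build_instructions and synthesize are hypothetical and do not come from the demo.

```python
from pathlib import Path

# Field labels as they appear in the demo's example prompts.
FIELD_ORDER = ["Voice Affect", "Tone", "Pacing", "Emotions", "Pronunciation", "Pauses"]


def build_instructions(fields: dict) -> str:
    """Join labelled style fields into one instruction string, one field per line."""
    unknown = set(fields) - set(FIELD_ORDER)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    # Emit fields in a fixed order, skipping any that were not supplied.
    return "\n".join(f"{name}: {fields[name]}" for name in FIELD_ORDER if name in fields)


def synthesize(text: str, instructions: str, out_path: Path, voice: str = "coral") -> None:
    """Request speech and write the audio to out_path (assumes OPENAI_API_KEY is set)."""
    # Imported lazily so build_instructions works without the SDK installed.
    from openai import OpenAI

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice=voice,
        input=text,
        instructions=instructions,
    ) as response:
        response.stream_to_file(out_path)
```

A call might look like synthesize("I'm sorry for the delay.", build_instructions({"Tone": "Sincere, empathetic", "Pauses": "Before and after the apology"}), Path("sincere.mp3")). Keeping the prompt builder separate from the network call lets the style text be inspected or tested without spending API credits.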

2 comments

GN⁺ 2025-03-21

Hacker News comments

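These models are priced well below ElevenLabs
- The "gpt-4o-mini-tts" model costs $0.015 per minute of audio, 85% cheaper than ElevenLabs
- ElevenLabs' "Business" plan costs $1,100 per month for 11,000 minutes of TTS, which works out to 10 cents per minute
- OpenAI would charge $165 for the same 11,000 minutes
- The commenter asks others to check the math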
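
The arithmetic in this comment can be checked with a short script. The prices below are the figures quoted by the commenter, not independently verified, and the function names are illustrative only.

```python
# Figures as quoted in the comment; they are not independently verified prices.
OPENAI_USD_PER_MIN = 0.015        # gpt-4o-mini-tts
ELEVENLABS_PLAN_USD = 1100.0      # "Business" plan, per month
ELEVENLABS_PLAN_MINUTES = 11_000  # TTS minutes included in that plan


def usd_per_minute(plan_usd: float, plan_minutes: int) -> float:
    """Effective per-minute price of a flat-rate plan."""
    return plan_usd / plan_minutes


def relative_saving(cheaper: float, pricier: float) -> float:
    """Fraction saved by paying `cheaper` instead of `pricier`."""
    return 1.0 - cheaper / pricier


if __name__ == "__main__":
    eleven = usd_per_minute(ELEVENLABS_PLAN_USD, ELEVENLABS_PLAN_MINUTES)
    openai_total = OPENAI_USD_PER_MIN * ELEVENLABS_PLAN_MINUTES
    print(f"ElevenLabs: ${eleven:.2f} per minute")                   # $0.10
    print(f"OpenAI, 11,000 minutes: ${openai_total:.2f}")            # $165.00
    print(f"Saving: {relative_saving(OPENAI_USD_PER_MIN, eleven):.0%}")  # 85%
```

With the quoted numbers, the commenter's figures hold: 10 cents per minute, $165 in total, and an 85% saving.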
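Jeff from OpenAI announces the release of the new audio models
- Two speech-to-text models and a new TTS model were released
- The Agents SDK now has support that makes it easy to turn text agents into voice agents
- He invites questions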
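A comment raises reliability concerns about the text-to-speech and speech-to-text models
- The commenter is unsure how much of a problem this will be in real-world applications
- Links to related notes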
A question about how to get "speech marks" along with the generated audio
- Explains the "speech marks" used by AWS's Polly TTS service
- They are useful for text highlighting and lip sync
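
For context on what the commenter is asking for: Polly returns speech marks as newline-delimited JSON, one object per mark, with a time offset in milliseconds and the character range in the input text. The sketch below parses that format and looks up the word being spoken at a given playback time, which is the core of text highlighting. The sample data is hand-written for illustration rather than captured from Polly, and the names WordMark, parse_word_marks, and word_at are hypothetical. The thread does not say whether the OpenAI models offer an equivalent.

```python
import json
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional

# Hand-written sample in the Polly speech-mark layout (not real service output).
SAMPLE = """\
{"time":0,"type":"sentence","start":0,"end":11,"value":"Hello world"}
{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":373,"type":"word","start":6,"end":11,"value":"world"}
"""


@dataclass
class WordMark:
    time_ms: int  # offset from the start of the audio, in milliseconds
    start: int    # start offset of the word in the input text
    end: int      # end offset of the word in the input text
    value: str


def parse_word_marks(raw: str) -> List[WordMark]:
    """Parse newline-delimited JSON and keep only marks of type 'word'."""
    marks = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if obj.get("type") == "word":
            marks.append(WordMark(obj["time"], obj["start"], obj["end"], obj["value"]))
    return marks


def word_at(marks: List[WordMark], t_ms: int) -> Optional[WordMark]:
    """Return the last word whose start time is <= t_ms, or None before the first word."""
    times = [m.time_ms for m in marks]  # assumes marks are in chronological order
    i = bisect_right(times, t_ms) - 1
    return marks[i] if i >= 0 else None
```

A player would call word_at on each playback tick and highlight the returned character range in the displayed text.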
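Recent progress in large text-to-speech and speech-to-text models
- Points to the need for an offline, multilingual text-to-speech solution
- Finds that Tortoise TTS frequently distorts words
- Acapela SDK is the only plug-in solution for desktop apps
- Hopes the new neural-network-based models will run efficiently on ordinary computers
The text entered in the "vibe" box produces a wide range of intonation and personality
- Surprised by the level of intelligent prosody and intonation
- The technology has advanced to the point where only celebrities will be needed for audiobook recordings
- Shares examples of various funny voices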
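Reactions to feeding in the Navy Seal copypasta
- The safety controls behave differently depending on the "vibe" instructions
- The NYC taxi driver vibe works without issue and is funny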
The new models' voices have a subtle wobble and sound worse than Siri
This official OpenAI tool is tied to the announcement of the new models
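Key quotes from the official announcement
- Developers can instruct the model not only on what to say but also on how to say it
- The "vibes" are instructions exposed in the UI
- The new models pick up on nuance better
- Audio output from gpt-4o-mini-tts costs $0.015 per minute, which is practical
- More testing is planned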

sylee999 2025-03-21

Japanese works perfectly too.
