diff --git a/docs/assets/openaudio.jpg b/docs/assets/openaudio.jpg new file mode 100644 index 0000000..f23b7aa Binary files /dev/null and b/docs/assets/openaudio.jpg differ diff --git a/docs/assets/openaudio.png b/docs/assets/openaudio.png new file mode 100644 index 0000000..80e300c Binary files /dev/null and b/docs/assets/openaudio.png differ diff --git a/docs/en/index.md b/docs/en/index.md index 38db6db..d1baf4b 100644 --- a/docs/en/index.md +++ b/docs/en/index.md @@ -1,4 +1,14 @@ -# Introduction +# OpenAudio (formerly Fish-Speech) + +
+ +
+ +OpenAudio + +
+ +Advanced Text-to-Speech Model Series

@@ -12,39 +22,123 @@
-!!! warning - We assume no responsibility for any illegal use of the codebase. Please refer to the local laws regarding DMCA (Digital Millennium Copyright Act) and other relevant laws in your area.
- This codebase is released under Apache 2.0 license and all models are released under the CC-BY-NC-SA-4.0 license. +Try it now: Fish Audio Playground | Learn more: OpenAudio Website -## Requirements +
-

- GPU Memory: 12GB (Inference) -- System: Linux, Windows +---

-## Setup

+!!! warning "Legal Notice" + We assume no responsibility for any illegal use of the codebase. Please refer to the local laws regarding DMCA (Digital Millennium Copyright Act) and other relevant laws in your area. + + **License:** This codebase is released under Apache 2.0 license and all models are released under the CC-BY-NC-SA-4.0 license.

-First, we need to create a conda environment to install the packages.

+## **Introduction**

-```bash

+We are excited to announce that we have rebranded as **OpenAudio**, introducing a brand-new series of advanced Text-to-Speech models that builds upon the foundation of Fish-Speech with significant improvements and new capabilities.

-conda create -n fish-speech python=3.12 -conda activate fish-speech

+**OpenAudio S1-mini**: [Video](To Be Uploaded); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);

-pip install sudo apt-get install portaudio19-dev # For pyaudio -pip install -e . # This will download all rest packages.

+**Fish-Speech v1.5**: [Video](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);

-apt install libsox-dev ffmpeg # If needed.

+## **Highlights** ✨ + +### **Emotion Control** +OpenAudio S1 **supports a variety of emotion, tone, and special markers** to enhance speech synthesis: + +- **Basic emotions**: +```
+(angry) (sad) (excited) (surprised) (satisfied) (delighted)
+(scared) (worried) (upset) (nervous) (frustrated) (depressed)
+(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
+(grateful) (confident) (interested) (curious) (confused) (joyful) ```

-!!! warning - The `compile` option is not supported on windows and macOS, if you want to run with compile, you need to install trition by yourself.

+- **Advanced emotions**: +```
+(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
+(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
+(keen) (disapproving) (negative) (denying) (astonished) (serious)
+(sarcastic) (conciliative) (comforting) (sincere) (sneering)
+(hesitating) (yielding) (painful) (awkward) (amused)
+```

-## Acknowledgements

+- **Tone markers**: +```
+(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
+```

-- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2) -- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2) -- [GPT VITS](https://github.com/innnky/gpt-vits) -- [MQTTS](https://github.com/b04901014/MQTTS) -- [GPT Fast](https://github.com/pytorch-labs/gpt-fast) -- [Transformers](https://github.com/huggingface/transformers) -- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)

+- **Special audio effects**: +```
+(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
+(groaning) (crowd laughing) (background laughter) (audience laughing)
+```
+
+You can also write `Ha,ha,ha` directly in the text to control laughter; many other cases are waiting to be explored. A combined example is shown in the **Marker Usage Example** section below.
+
+### **Excellent TTS quality**
+
+We use the Seed TTS Eval metrics to evaluate model performance, and the results show that OpenAudio S1 achieves **0.008 WER** and **0.004 CER** on English text, which is significantly better than previous models.
(English, auto eval, based on OpenAI gpt-4o-transcribe, speaker distance using Revai/pyannote-wespeaker-voxceleb-resnet34-LM)
+
+| Model | Word Error Rate (WER) | Character Error Rate (CER) | Speaker Distance |
+|-------|----------------------|---------------------------|------------------|
+| **S1** | **0.008** | **0.004** | **0.332** |
+| **S1-mini** | **0.011** | **0.005** | **0.380** |
+
+### **Two Types of Models**
+
+| Model | Size | Availability | Features |
+|-------|------|--------------|----------|
+| **S1** | 4B parameters | Available on [fish.audio](https://fish.audio) | Full-featured flagship model |
+| **S1-mini** | 0.5B parameters | Available on the [Hugging Face Space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini) | Distilled version with core capabilities |
+
+Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
+
+## **Features**
+
+1. **Zero-shot & Few-shot TTS:** Input a 10 to 30-second vocal sample to generate high-quality TTS output. **For detailed guidelines, see [Voice Cloning Best Practices](https://docs.fish.audio/text-to-speech/voice-clone-best-practices).**
+
+2. **Multilingual & Cross-lingual Support:** Simply copy and paste multilingual text into the input box—no need to worry about the language. Currently supports English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish.
+
+3. **No Phoneme Dependency:** The model has strong generalization capabilities and does not rely on phonemes for TTS. It can handle text in any language script.
+
+4. **Highly Accurate:** Achieves a low CER (Character Error Rate) of around 0.4% and WER (Word Error Rate) of around 0.8% for Seed-TTS Eval.
+
+5. **Fast:** With fish-tech acceleration, the real-time factor is approximately 1:5 on an Nvidia RTX 4060 laptop and 1:15 on an Nvidia RTX 4090.
+
+6. **WebUI Inference:** Features an easy-to-use, Gradio-based web UI compatible with Chrome, Firefox, Edge, and other browsers.
+
+7. **GUI Inference:** Offers a PyQt6 graphical interface that works seamlessly with the API server. Supports Linux, Windows, and macOS. [See GUI](https://github.com/AnyaCoder/fish-speech-gui).
+
+8. **Deploy-Friendly:** Easily set up an inference server with native support for Linux and Windows (macOS coming soon), minimizing speed loss.
+
+## **Disclaimer**
+
+We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
+
+## **Media & Demos**
+
+#### 🚧 Coming Soon
+Video demonstrations and tutorials are currently in development.
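+
+## **Marker Usage Example**
+
+To make the marker system above concrete, here is a minimal, hypothetical input showing how emotion, tone, and effect markers can be combined in one request (the exact prosody you get depends on the model version and the reference voice):
+
+```
+(excited) We finally shipped the new release! (laughing) Ha,ha,ha!
+(whispering) Please keep it quiet until the official announcement.
+```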
+
+## **Documentation**
+
+### Quick Start
+- [Build Environment](install.md) - Set up your development environment
+- [Inference Guide](inference.md) - Run the model and generate speech
+
+
+## **Community & Support**
+
+- **Discord:** Join our [Discord community](https://discord.gg/Es5qTB9BcN)
+- **Website:** Visit [OpenAudio.com](https://openaudio.com) for the latest updates
+- **Try Online:** [Fish Audio Playground](https://fish.audio)
diff --git a/docs/en/inference.md b/docs/en/inference.md index a1eb81d..4d9cf17 100644 --- a/docs/en/inference.md +++ b/docs/en/inference.md @@ -34,9 +34,7 @@ python fish_speech/models/text2semantic/inference.py \ --text "The text you want to convert" \ --prompt-text "Your reference text" \ --prompt-tokens "fake.npy" \ - --checkpoint-path "checkpoints/openaudio-s1-mini" \ - --num-samples 2 \ - --compile # if you want a faster speed + --compile ``` This command will create a `codes_N` file in the working directory, where N is an integer starting from 0. @@ -50,15 +48,12 @@ This command will create a `codes_N` file in the working directory, where N is a ### 3. Generate vocals from semantic tokens: -#### VQGAN Decoder - !!! warning "Future Warning" We have kept the interface accessible from the original path (tools/vqgan/inference.py), but this interface may be removed in subsequent releases, so please change your code as soon as possible. ```bash python fish_speech/models/dac/inference.py \ - -i "codes_0.npy" \ - --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth" + -i "codes_0.npy" ``` ## HTTP API Inference diff --git a/docs/en/install.md b/docs/en/install.md new file mode 100644 index 0000000..6830156 --- /dev/null +++ b/docs/en/install.md @@ -0,0 +1,30 @@ +## Requirements + +- GPU Memory: 12GB (Inference) +- System: Linux, WSL + +## Setup + +First, you need to install pyaudio and sox, which are used for audio processing. + +``` bash +apt install portaudio19-dev libsox-dev ffmpeg +``` + +### Conda + +```bash +conda create -n fish-speech python=3.12 +conda activate fish-speech + +pip install -e . +``` + +### UV + +```bash +uv sync --python 3.12 +``` + +!!! warning + The `compile` option is not supported on Windows and macOS. If you want to run with compile, you need to install triton yourself. diff --git a/docs/ja/index.md b/docs/ja/index.md index bd937d7..bbb66c7 100644 --- a/docs/ja/index.md +++ b/docs/ja/index.md @@ -1,4 +1,14 @@ -# 紹介 +# OpenAudio (旧 Fish-Speech) + +
+ +
+ +OpenAudio + +
+ +先進的なText-to-Speechモデルシリーズ
@@ -12,39 +22,113 @@
-!!! warning - このコードベースの違法な使用について、当方は一切の責任を負いません。お住まいの地域のDMCA(デジタルミレニアム著作権法)およびその他の関連法規をご参照ください。
- このコードベースはApache 2.0ライセンスの下でリリースされ、すべてのモデルはCC-BY-NC-SA-4.0ライセンスの下でリリースされています。 +今すぐ試す: Fish Audio Playground | 詳細情報: OpenAudio ウェブサイト -## システム要件 +
-

- GPU メモリ:12GB(推論) -- システム:Linux、Windows +---

-## セットアップ

+!!! warning "法的通知" + このコードベースの違法な使用について、当方は一切の責任を負いません。お住まいの地域のDMCA(デジタルミレニアム著作権法)およびその他の関連法規をご参照ください。 + + **ライセンス:** このコードベースはApache 2.0ライセンスの下でリリースされ、すべてのモデルはCC-BY-NC-SA-4.0ライセンスの下でリリースされています。

-まず、パッケージをインストールするためのconda環境を作成する必要があります。

+## **紹介**

-```bash

+私たちは **OpenAudio** への改名を発表できることを嬉しく思います。Fish-Speechを基盤とし、大幅な改善と新機能を加えた、新しい先進的なText-to-Speechモデルシリーズを紹介します。

-conda create -n fish-speech python=3.12 -conda activate fish-speech

+**OpenAudio S1-mini**: [動画](アップロード予定); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);

-pip install sudo apt-get install portaudio19-dev # pyaudio用 -pip install -e . # これにより残りのパッケージがすべてダウンロードされます。

+**Fish-Speech v1.5**: [動画](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);

-apt install libsox-dev ffmpeg # 必要に応じて。

+## **ハイライト** ✨ + +### **感情制御** +OpenAudio S1は**多様な感情、トーン、特殊マーカーをサポート**して音声合成を強化します: + +- **基本感情**: +```
+(angry) (sad) (excited) (surprised) (satisfied) (delighted)
+(scared) (worried) (upset) (nervous) (frustrated) (depressed)
+(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
+(grateful) (confident) (interested) (curious) (confused) (joyful) ```

-!!! warning - `compile`オプションはWindowsとmacOSでサポートされていません。compileで実行したい場合は、tritionを自分でインストールする必要があります。

+- **高度な感情**: +```
+(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
+(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
+(keen) (disapproving) (negative) (denying) (astonished) (serious)
+(sarcastic) (conciliative) (comforting) (sincere) (sneering)
+(hesitating) (yielding) (painful) (awkward) (amused)
+```

-## 謝辞

+- **トーンマーカー**: +```
+(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
+```

-- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2) -- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2) -- [GPT VITS](https://github.com/innnky/gpt-vits) -- [MQTTS](https://github.com/b04901014/MQTTS) -- [GPT Fast](https://github.com/pytorch-labs/gpt-fast) -- [Transformers](https://github.com/huggingface/transformers) -- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)

+- **特殊音響効果**: +```
+(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
+(groaning) (crowd laughing) (background laughter) (audience laughing)
+```
+
+テキスト中に `Ha,ha,ha` と直接書いて笑いを制御することもできます。他にも多くの使い方があなた自身の探索を待っています。
+
+### **優秀なTTS品質**
+
+Seed TTS評価指標を使用してモデルのパフォーマンスを評価した結果、OpenAudio S1は英語テキストで**0.008 WER**と**0.004 CER**を達成し、以前のモデルより大幅に改善されました。(英語、自動評価、OpenAI gpt-4o-transcribe に基づく、話者距離はRevai/pyannote-wespeaker-voxceleb-resnet34-LM使用)
+
+| モデル | 単語誤り率 (WER) | 文字誤り率 (CER) | 話者距離 |
+|-------|----------------------|---------------------------|------------------|
+| **S1** | **0.008** | **0.004** | **0.332** |
+| **S1-mini** | **0.011** | **0.005** | **0.380** |
+
+### **2つのモデルタイプ**
+
+| モデル | サイズ | 利用可能性 | 特徴 |
+|-------|------|--------------|----------|
+| **S1** | 40億パラメータ | [fish.audio](https://fish.audio) で利用可能 | 全機能搭載のフラッグシップモデル |
+| **S1-mini** | 5億パラメータ | [Hugging Face Space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini) で利用可能 | コア機能を備えた蒸留版 |
+
+S1とS1-miniの両方にオンライン人間フィードバック強化学習(RLHF)が組み込まれています。
+
+## **機能**
+
+1. **ゼロショット・フューショットTTS:** 10〜30秒の音声サンプルを入力するだけで高品質なTTS出力を生成します。**詳細なガイドラインについては、[音声クローニングのベストプラクティス](https://docs.fish.audio/text-to-speech/voice-clone-best-practices)をご覧ください。**
+
+2. **多言語・言語横断サポート:** 多言語テキストを入力ボックスにコピー&ペーストするだけで、言語を気にする必要はありません。現在、英語、日本語、韓国語、中国語、フランス語、ドイツ語、アラビア語、スペイン語をサポートしています。
+
+3. 
**音素依存なし:** このモデルは強力な汎化能力を持ち、TTSに音素に依存しません。あらゆる言語スクリプトのテキストを処理できます。
+
+4. **高精度:** Seed-TTS Evalで低い文字誤り率(CER)約0.4%と単語誤り率(WER)約0.8%を達成します。
+
+5. **高速:** fish-tech加速により、Nvidia RTX 4060ラップトップでリアルタイム係数約1:5、Nvidia RTX 4090で約1:15を実現します。
+
+6. **WebUI推論:** Chrome、Firefox、Edge、その他のブラウザと互換性のあるGradioベースの使いやすいWebUIを備えています。
+
+7. **GUI推論:** APIサーバーとシームレスに連携するPyQt6グラフィカルインターフェースを提供します。Linux、Windows、macOSをサポートします。[GUIを見る](https://github.com/AnyaCoder/fish-speech-gui)。
+
+8. **デプロイフレンドリー:** Linux、Windows、macOSのネイティブサポートで推論サーバーを簡単にセットアップでき、速度低下を最小限に抑えます。
+
+## **免責事項**
+
+コードベースの違法な使用について、当方は一切の責任を負いません。お住まいの地域のDMCAやその他の関連法律をご参照ください。
+
+## **メディア・デモ**
+
+#### 🚧 近日公開
+動画デモとチュートリアルは現在開発中です。
+
+## **ドキュメント**
+
+### クイックスタート
+- [環境構築](install.md) - 開発環境をセットアップ
+- [推論ガイド](inference.md) - モデルを実行して音声を生成
+
+## **コミュニティ・サポート**
+
+- **Discord:** [Discordコミュニティ](https://discord.gg/Es5qTB9BcN)に参加
+- **ウェブサイト:** 最新アップデートは[OpenAudio.com](https://openaudio.com)をご覧ください
+- **オンライン試用:** [Fish Audio Playground](https://fish.audio)
diff --git a/docs/ja/inference.md b/docs/ja/inference.md index db4132e..8cbde0d 100644 --- a/docs/ja/inference.md +++ b/docs/ja/inference.md @@ -34,9 +34,7 @@ python fish_speech/models/text2semantic/inference.py \ --text "変換したいテキスト" \ --prompt-text "参照テキスト" \ --prompt-tokens "fake.npy" \ - --checkpoint-path "checkpoints/openaudio-s1-mini" \ - --num-samples 2 \ - --compile # より高速化を求める場合 + --compile ``` このコマンドは、作業ディレクトリに `codes_N` ファイルを作成します(Nは0から始まる整数)。 @@ -50,15 +48,12 @@ python fish_speech/models/text2semantic/inference.py \ ### 3. セマンティックトークンから音声を生成: -#### VQGANデコーダー - !!! warning "将来の警告" 元のパス(tools/vqgan/inference.py)からアクセス可能なインターフェースを維持していますが、このインターフェースは後続のリリースで削除される可能性があるため、できるだけ早くコードを変更してください。 ```bash python fish_speech/models/dac/inference.py \ - -i "codes_0.npy" \ - --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth" + -i "codes_0.npy" ``` ## HTTP API推論 @@ -103,5 +98,3 @@ python -m tools.run_webui !!! note `GRADIO_SHARE`、`GRADIO_SERVER_PORT`、`GRADIO_SERVER_NAME` などのGradio環境変数を使用してWebUIを設定できます。 - -お楽しみください! diff --git a/docs/ja/install.md b/docs/ja/install.md new file mode 100644 index 0000000..5d815ab --- /dev/null +++ b/docs/ja/install.md @@ -0,0 +1,30 @@ +## システム要件 + +- GPU メモリ:12GB(推論) +- システム:Linux、WSL + +## セットアップ + +まず、音声処理に使用される pyaudio と sox をインストールする必要があります。 + +``` bash +apt install portaudio19-dev libsox-dev ffmpeg +``` + +### Conda + +```bash +conda create -n fish-speech python=3.12 +conda activate fish-speech + +pip install -e . +``` + +### UV + +```bash +uv sync --python 3.12 +``` + +!!! warning + `compile` オプションは Windows と macOS でサポートされていません。compile で実行したい場合は、triton を自分でインストールする必要があります。 diff --git a/docs/ko/index.md b/docs/ko/index.md index 612d7b8..15cf280 100644 --- a/docs/ko/index.md +++ b/docs/ko/index.md @@ -1,4 +1,14 @@ -# 소개 +# OpenAudio (구 Fish-Speech) + +
+ +
+ +OpenAudio + +
+ +고급 텍스트-음성 변환 모델 시리즈
@@ -12,39 +22,113 @@
-!!! warning - 코드베이스의 불법적인 사용에 대해서는 일체 책임을 지지 않습니다. 귀하의 지역의 DMCA(디지털 밀레니엄 저작권법) 및 기타 관련 법률을 참고하시기 바랍니다.
- 이 코드베이스는 Apache 2.0 라이선스 하에 배포되며, 모든 모델은 CC-BY-NC-SA-4.0 라이선스 하에 배포됩니다. +지금 체험: Fish Audio Playground | 자세히 알아보기: OpenAudio 웹사이트 -## 시스템 요구사항 +
-

- GPU 메모리: 12GB (추론) -- 시스템: Linux, Windows +---

-## 설치

+!!! warning "법적 고지" + 코드베이스의 불법적인 사용에 대해서는 일체 책임을 지지 않습니다. 귀하의 지역의 DMCA(디지털 밀레니엄 저작권법) 및 기타 관련 법률을 참고하시기 바랍니다. + + **라이선스:** 이 코드베이스는 Apache 2.0 라이선스 하에 배포되며, 모든 모델은 CC-BY-NC-SA-4.0 라이선스 하에 배포됩니다.

-먼저 패키지를 설치하기 위한 conda 환경을 만들어야 합니다.

+## **소개**

-```bash

+저희는 **OpenAudio**로의 브랜드 변경을 발표하게 되어 기쁩니다. Fish-Speech를 기반으로 하여 상당한 개선과 새로운 기능을 추가한 새로운 고급 텍스트-음성 변환 모델 시리즈를 소개합니다.

-conda create -n fish-speech python=3.12 -conda activate fish-speech

+**OpenAudio S1-mini**: [동영상](업로드 예정); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);

-pip install sudo apt-get install portaudio19-dev # pyaudio용 -pip install -e . # 나머지 모든 패키지를 다운로드합니다.

+**Fish-Speech v1.5**: [동영상](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);

-apt install libsox-dev ffmpeg # 필요한 경우.

+## **주요 특징** ✨ + +### **감정 제어** +OpenAudio S1은 **다양한 감정, 톤, 특수 마커를 지원**하여 음성 합성을 향상시킵니다: + +- **기본 감정**: +```
+(angry) (sad) (excited) (surprised) (satisfied) (delighted)
+(scared) (worried) (upset) (nervous) (frustrated) (depressed)
+(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
+(grateful) (confident) (interested) (curious) (confused) (joyful) ```

-!!! warning - `compile` 옵션은 Windows와 macOS에서 지원되지 않습니다. compile로 실행하려면 trition을 직접 설치해야 합니다.

+- **고급 감정**: +```
+(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
+(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
+(keen) (disapproving) (negative) (denying) (astonished) (serious)
+(sarcastic) (conciliative) (comforting) (sincere) (sneering)
+(hesitating) (yielding) (painful) (awkward) (amused)
+```

-## 감사의 말

+- **톤 마커**: +```
+(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
+```

-- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2) -- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2) -- [GPT VITS](https://github.com/innnky/gpt-vits) -- [MQTTS](https://github.com/b04901014/MQTTS) -- [GPT Fast](https://github.com/pytorch-labs/gpt-fast) -- [Transformers](https://github.com/huggingface/transformers) -- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)

+- **특수 음향 효과**: +```
+(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
+(groaning) (crowd laughing) (background laughter) (audience laughing)
+```
+
+텍스트에 `Ha,ha,ha`를 직접 써서 웃음을 제어할 수도 있으며, 여러분 스스로 탐구할 수 있는 다른 많은 사용법이 있습니다.
+
+### **뛰어난 TTS 품질**
+
+Seed TTS 평가 지표를 사용하여 모델 성능을 평가한 결과, OpenAudio S1은 영어 텍스트에서 **0.008 WER**과 **0.004 CER**을 달성하여 이전 모델보다 현저히 향상되었습니다. (영어, 자동 평가, OpenAI gpt-4o-transcribe 기반, 화자 거리는 Revai/pyannote-wespeaker-voxceleb-resnet34-LM 사용)
+
+| 모델 | 단어 오류율 (WER) | 문자 오류율 (CER) | 화자 거리 |
+|-------|----------------------|---------------------------|------------------|
+| **S1** | **0.008** | **0.004** | **0.332** |
+| **S1-mini** | **0.011** | **0.005** | **0.380** |
+
+### **두 가지 모델 유형**
+
+| 모델 | 크기 | 가용성 | 특징 |
+|-------|------|--------------|----------|
+| **S1** | 40억 매개변수 | [fish.audio](https://fish.audio)에서 이용 가능 | 모든 기능을 갖춘 플래그십 모델 |
+| **S1-mini** | 5억 매개변수 | [Hugging Face Space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini)에서 이용 가능 | 핵심 기능을 갖춘 경량화 버전 |
+
+S1과 S1-mini 모두 온라인 인간 피드백 강화 학습(RLHF)이 통합되어 있습니다.
+
+## **기능**
+
+1. **제로샷 및 퓨샷 TTS:** 10~30초의 음성 샘플을 입력하여 고품질 TTS 출력을 생성합니다. **자세한 가이드라인은 [음성 복제 모범 사례](https://docs.fish.audio/text-to-speech/voice-clone-best-practices)를 참조하세요.**
+
+2. **다국어 및 교차 언어 지원:** 다국어 텍스트를 입력 상자에 복사하여 붙여넣기만 하면 됩니다. 언어에 대해 걱정할 필요가 없습니다. 현재 영어, 일본어, 한국어, 중국어, 프랑스어, 독일어, 아랍어, 스페인어를 지원합니다.
+
+3. **음소 의존성 없음:** 이 모델은 강력한 일반화 능력을 가지고 있으며 TTS에 음소에 의존하지 않습니다. 어떤 언어 스크립트의 텍스트도 처리할 수 있습니다.
+
+4. **높은 정확도:** Seed-TTS Eval에서 약 0.4%의 낮은 문자 오류율(CER)과 약 0.8%의 단어 오류율(WER)을 달성합니다.
+
+5. **빠른 속도:** fish-tech 가속을 통해 Nvidia RTX 4060 노트북에서 실시간 계수 약 1:5, Nvidia RTX 4090에서 약 1:15를 달성합니다.
+
+6. **WebUI 추론:** Chrome, Firefox, Edge 및 기타 브라우저와 호환되는 사용하기 쉬운 Gradio 기반 웹 UI를 제공합니다.
+
+7. **GUI 추론:** API 서버와 원활하게 작동하는 PyQt6 그래픽 인터페이스를 제공합니다. Linux, Windows, macOS를 지원합니다. [GUI 보기](https://github.com/AnyaCoder/fish-speech-gui).
+
+8. **배포 친화적:** Linux, Windows, macOS의 네이티브 지원으로 추론 서버를 쉽게 설정하여 속도 손실을 최소화합니다.
+
+## **면책 조항**
+
+코드베이스의 불법적인 사용에 대해서는 일체 책임을 지지 않습니다. 귀하 지역의 DMCA 및 기타 관련 법률을 참고하시기 바랍니다.
+
+## **미디어 및 데모**
+
+#### 🚧 곧 출시 예정
+동영상 데모와 튜토리얼이 현재 개발 중입니다.
+
+## **문서**
+
+### 빠른 시작
+- [환경 구축](install.md) - 개발 환경 설정
+- [추론 가이드](inference.md) - 모델 실행 및 음성 생성
+
+## **커뮤니티 및 지원**
+
+- **Discord:** [Discord 커뮤니티](https://discord.gg/Es5qTB9BcN)에 참여하세요
+- **웹사이트:** 최신 업데이트는 [OpenAudio.com](https://openaudio.com)을 방문하세요
+- **온라인 체험:** [Fish Audio Playground](https://fish.audio)
diff --git a/docs/ko/inference.md b/docs/ko/inference.md index b32eaad..268f107 100644 --- a/docs/ko/inference.md +++ b/docs/ko/inference.md @@ -34,9 +34,7 @@ python fish_speech/models/text2semantic/inference.py \ --text "변환하고 싶은 텍스트" \ --prompt-text "참조 텍스트" \ --prompt-tokens "fake.npy" \ - --checkpoint-path "checkpoints/openaudio-s1-mini" \ - --num-samples 2 \ - --compile # 더 빠른 속도를 원한다면 + --compile ``` 이 명령은 작업 디렉토리에 `codes_N` 파일을 생성합니다. 여기서 N은 0부터 시작하는 정수입니다. @@ -50,15 +48,12 @@ python fish_speech/models/text2semantic/inference.py \ ### 3. 의미 토큰에서 음성 생성: -#### VQGAN 디코더 - !!! warning "향후 경고" 원래 경로(tools/vqgan/inference.py)에서 액세스 가능한 인터페이스를 유지하고 있지만, 이 인터페이스는 향후 릴리스에서 제거될 수 있으므로 가능한 한 빨리 코드를 변경해 주세요. ```bash python fish_speech/models/dac/inference.py \ - -i "codes_0.npy" \ - --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth" + -i "codes_0.npy" ``` ## HTTP API 추론 @@ -103,5 +98,3 @@ python -m tools.run_webui !!! note `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME`과 같은 Gradio 환경 변수를 사용하여 WebUI를 구성할 수 있습니다. - -즐기세요! diff --git a/docs/ko/install.md b/docs/ko/install.md new file mode 100644 index 0000000..6cddc5f --- /dev/null +++ b/docs/ko/install.md @@ -0,0 +1,30 @@ +## 시스템 요구사항 + +- GPU 메모리: 12GB (추론) +- 시스템: Linux, WSL + +## 설정 + +먼저 오디오 처리에 사용되는 pyaudio와 sox를 설치해야 합니다. + +``` bash +apt install portaudio19-dev libsox-dev ffmpeg +``` + +### Conda + +```bash +conda create -n fish-speech python=3.12 +conda activate fish-speech + +pip install -e . +``` + +### UV + +```bash +uv sync --python 3.12 +``` + +!!! warning + `compile` 옵션은 Windows와 macOS에서 지원되지 않습니다. compile로 실행하려면 triton을 직접 설치해야 합니다. diff --git a/docs/pt/index.md b/docs/pt/index.md index 5477c4d..2f611ba 100644 --- a/docs/pt/index.md +++ b/docs/pt/index.md @@ -1,4 +1,14 @@ -# Introdução +# OpenAudio (anteriormente Fish-Speech) + +
+ +
+ +OpenAudio + +
+ +Série Avançada de Modelos Text-to-Speech

@@ -12,39 +22,122 @@
-!!! warning - Não assumimos nenhuma responsabilidade pelo uso ilegal da base de código. Consulte as leis locais sobre DMCA (Digital Millennium Copyright Act) e outras leis relevantes em sua área.
- Esta base de código é lançada sob a licença Apache 2.0 e todos os modelos são lançados sob a licença CC-BY-NC-SA-4.0. +Experimente agora: Fish Audio Playground | Saiba mais: Site OpenAudio -## Requisitos +
-

- Memória GPU: 12GB (Inferência) -- Sistema: Linux, Windows +---

-## Configuração

+!!! warning "Aviso Legal" + Não assumimos nenhuma responsabilidade pelo uso ilegal da base de código. Consulte as leis locais sobre DMCA (Digital Millennium Copyright Act) e outras leis relevantes em sua área. + + **Licença:** Esta base de código é lançada sob a licença Apache 2.0 e todos os modelos são lançados sob a licença CC-BY-NC-SA-4.0.

-Primeiro, precisamos criar um ambiente conda para instalar os pacotes.

+## **Introdução**

-```bash

+Estamos empolgados em anunciar que mudamos nossa marca para **OpenAudio**, introduzindo uma nova série de modelos avançados de Text-to-Speech que se baseia na fundação do Fish-Speech com melhorias significativas e novas capacidades.

-conda create -n fish-speech python=3.12 -conda activate fish-speech

+**OpenAudio S1-mini**: [Vídeo](A ser carregado); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);

-pip install sudo apt-get install portaudio19-dev # Para pyaudio -pip install -e . # Isso baixará todos os pacotes restantes.

+**Fish-Speech v1.5**: [Vídeo](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);

-apt install libsox-dev ffmpeg # Se necessário.

+## **Destaques** ✨ + +### **Controle Emocional** +O OpenAudio S1 **suporta uma variedade de marcadores emocionais, de tom e especiais** para aprimorar a síntese de fala: + +- **Emoções básicas**: +```
+(angry) (sad) (excited) (surprised) (satisfied) (delighted)
+(scared) (worried) (upset) (nervous) (frustrated) (depressed)
+(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
+(grateful) (confident) (interested) (curious) (confused) (joyful) ```

-!!! warning - A opção `compile` não é suportada no Windows e macOS, se você quiser executar com compile, precisa instalar o trition por conta própria.

+- **Emoções avançadas**: +```
+(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
+(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
+(keen) (disapproving) (negative) (denying) (astonished) (serious)
+(sarcastic) (conciliative) (comforting) (sincere) (sneering)
+(hesitating) (yielding) (painful) (awkward) (amused)
+```

-## Agradecimentos

+- **Marcadores de tom**: +```
+(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
+```

-- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2) -- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2) -- [GPT VITS](https://github.com/innnky/gpt-vits) -- [MQTTS](https://github.com/b04901014/MQTTS) -- [GPT Fast](https://github.com/pytorch-labs/gpt-fast) -- [Transformers](https://github.com/huggingface/transformers) -- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)

+- **Efeitos sonoros especiais**: +```
+(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
+(groaning) (crowd laughing) (background laughter) (audience laughing)
+```
+
+Você também pode escrever `Ha,ha,ha` diretamente no texto para controlar o riso; há muitos outros casos esperando para serem explorados. Um exemplo combinado é mostrado na seção **Exemplo de Uso de Marcadores** abaixo.
+
+### **Qualidade TTS Excelente**
+
+Utilizamos as métricas Seed TTS Eval para avaliar o desempenho do modelo, e os resultados mostram que o OpenAudio S1 alcança **0.008 WER** e **0.004 CER** em texto inglês, o que é significativamente melhor que os modelos anteriores.
(Inglês, avaliação automática, baseada no OpenAI gpt-4o-transcribe, distância do falante usando Revai/pyannote-wespeaker-voxceleb-resnet34-LM)
+
+| Modelo | Taxa de Erro de Palavras (WER) | Taxa de Erro de Caracteres (CER) | Distância do Falante |
+|-------|----------------------|---------------------------|------------------|
+| **S1** | **0.008** | **0.004** | **0.332** |
+| **S1-mini** | **0.011** | **0.005** | **0.380** |
+
+### **Dois Tipos de Modelos**
+
+| Modelo | Tamanho | Disponibilidade | Características |
+|-------|------|--------------|----------|
+| **S1** | 4B parâmetros | Disponível em [fish.audio](https://fish.audio) | Modelo principal com todas as funcionalidades |
+| **S1-mini** | 0.5B parâmetros | Disponível no [Hugging Face Space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini) | Versão destilada com capacidades principais |
+
+Tanto o S1 quanto o S1-mini incorporam Aprendizado por Reforço Online com Feedback Humano (RLHF).
+
+## **Características**
+
+1. **TTS Zero-shot e Few-shot:** Insira uma amostra vocal de 10 a 30 segundos para gerar saída TTS de alta qualidade. **Para diretrizes detalhadas, veja [Melhores Práticas de Clonagem de Voz](https://docs.fish.audio/text-to-speech/voice-clone-best-practices).**
+
+2. **Suporte Multilíngue e Cross-lingual:** Simplesmente copie e cole texto multilíngue na caixa de entrada—não precisa se preocupar com o idioma. Atualmente suporta inglês, japonês, coreano, chinês, francês, alemão, árabe e espanhol.
+
+3. **Sem Dependência de Fonemas:** O modelo tem fortes capacidades de generalização e não depende de fonemas para TTS. Pode lidar com texto em qualquer script de idioma.
+
+4. **Altamente Preciso:** Alcança uma baixa Taxa de Erro de Caracteres (CER) de cerca de 0,4% e Taxa de Erro de Palavras (WER) de cerca de 0,8% para Seed-TTS Eval.
+
+5. **Rápido:** Com aceleração fish-tech, o fator de tempo real é aproximadamente 1:5 em um laptop Nvidia RTX 4060 e 1:15 em um Nvidia RTX 4090.
+
+6. **Inferência WebUI:** Apresenta uma interface web fácil de usar baseada em Gradio, compatível com Chrome, Firefox, Edge e outros navegadores.
+
+7. **Inferência GUI:** Oferece uma interface gráfica PyQt6 que funciona perfeitamente com o servidor API. Suporta Linux, Windows e macOS. [Ver GUI](https://github.com/AnyaCoder/fish-speech-gui).
+
+8. **Amigável para Deploy:** Configure facilmente um servidor de inferência com suporte nativo para Linux, Windows e macOS, minimizando a perda de velocidade.
+
+## **Isenção de Responsabilidade**
+
+Não assumimos nenhuma responsabilidade pelo uso ilegal da base de código. Consulte suas leis locais sobre DMCA e outras leis relacionadas.
+
+## **Mídia e Demos**
+
+#### 🚧 Em Breve
+Demonstrações em vídeo e tutoriais estão atualmente em desenvolvimento.
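+
+## **Exemplo de Uso de Marcadores**
+
+Para tornar concreto o sistema de marcadores acima, aqui está uma entrada mínima e hipotética mostrando como marcadores de emoção, tom e efeito podem ser combinados em uma única requisição (a prosódia exata depende da versão do modelo e da voz de referência):
+
+```
+(excited) Finalmente lançamos a nova versão! (laughing) Ha,ha,ha!
+(whispering) Por favor, mantenha segredo até o anúncio oficial.
+```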
+ +## **Documentação** + +### Início Rápido +- [Configurar Ambiente](install.md) - Configure seu ambiente de desenvolvimento +- [Guia de Inferência](inference.md) - Execute o modelo e gere fala + +## **Comunidade e Suporte** + +- **Discord:** Junte-se à nossa [comunidade Discord](https://discord.gg/Es5qTB9BcN) +- **Site:** Visite [OpenAudio.com](https://openaudio.com) para as últimas atualizações +- **Experimente Online:** [Fish Audio Playground](https://fish.audio) diff --git a/docs/pt/inference.md b/docs/pt/inference.md index d8b9b7f..10b129d 100644 --- a/docs/pt/inference.md +++ b/docs/pt/inference.md @@ -34,9 +34,7 @@ python fish_speech/models/text2semantic/inference.py \ --text "O texto que você quer converter" \ --prompt-text "Seu texto de referência" \ --prompt-tokens "fake.npy" \ - --checkpoint-path "checkpoints/openaudio-s1-mini" \ - --num-samples 2 \ - --compile # se você quiser uma velocidade mais rápida + --compile ``` Este comando criará um arquivo `codes_N` no diretório de trabalho, onde N é um inteiro começando de 0. @@ -50,15 +48,12 @@ Este comando criará um arquivo `codes_N` no diretório de trabalho, onde N é u ### 3. Gerar vocais a partir de tokens semânticos: -#### Decodificador VQGAN - !!! warning "Aviso Futuro" Mantivemos a interface acessível do caminho original (tools/vqgan/inference.py), mas esta interface pode ser removida em versões subsequentes, então por favor altere seu código o mais breve possível. ```bash python fish_speech/models/dac/inference.py \ - -i "codes_0.npy" \ - --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth" + -i "codes_0.npy" ``` ## Inferência com API HTTP @@ -103,5 +98,3 @@ python -m tools.run_webui !!! note Você pode usar variáveis de ambiente do Gradio, como `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME` para configurar o WebUI. - -Divirta-se! diff --git a/docs/pt/install.md b/docs/pt/install.md new file mode 100644 index 0000000..005237a --- /dev/null +++ b/docs/pt/install.md @@ -0,0 +1,30 @@ +## Requisitos + +- Memória GPU: 12GB (Inferência) +- Sistema: Linux, WSL + +## Configuração + +Primeiro você precisa instalar pyaudio e sox, que são usados para processamento de áudio. + +``` bash +apt install portaudio19-dev libsox-dev ffmpeg +``` + +### Conda + +```bash +conda create -n fish-speech python=3.12 +conda activate fish-speech + +pip install -e . +``` + +### UV + +```bash +uv sync --python 3.12 +``` + +!!! warning + A opção `compile` não é suportada no Windows e macOS, se você quiser executar com compile, precisa instalar o triton por conta própria. diff --git a/docs/zh/index.md b/docs/zh/index.md index 64e373b..bde91b5 100644 --- a/docs/zh/index.md +++ b/docs/zh/index.md @@ -1,4 +1,14 @@ -# 简介 +# OpenAudio (原 Fish-Speech) + +
+ +
+ +OpenAudio + +
+ +先进的文字转语音模型系列
@@ -12,39 +22,113 @@
-!!! warning - 我们不对代码库的任何非法使用承担责任。请参考您所在地区有关 DMCA(数字千年版权法)和其他相关法律的规定。
- 此代码库在 Apache 2.0 许可证下发布,所有模型在 CC-BY-NC-SA-4.0 许可证下发布。 +立即试用: Fish Audio Playground | 了解更多: OpenAudio 网站 -## 系统要求 +
-

- GPU 内存:12GB(推理) -- 系统:Linux、Windows +---

-## 安装

+!!! warning "法律声明" + 我们不对代码库的任何非法使用承担责任。请参考您所在地区有关 DMCA(数字千年版权法)和其他相关法律的规定。 + + **许可证:** 此代码库在 Apache 2.0 许可证下发布,所有模型在 CC-BY-NC-SA-4.0 许可证下发布。

-首先,我们需要创建一个 conda 环境来安装包。

+## **介绍**

-```bash

+我们很高兴地宣布,我们已经更名为 **OpenAudio**,推出全新的先进文字转语音模型系列,在 Fish-Speech 的基础上进行了重大改进并增加了新功能。

-conda create -n fish-speech python=3.12 -conda activate fish-speech

+**OpenAudio S1-mini**: [视频](即将上传); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);

-pip install sudo apt-get install portaudio19-dev # 用于 pyaudio -pip install -e . # 这将下载所有其余的包。

+**Fish-Speech v1.5**: [视频](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);

-apt install libsox-dev ffmpeg # 如果需要的话。

+## **亮点** ✨ + +### **情感控制** +OpenAudio S1 **支持多种情感、语调和特殊标记**来增强语音合成效果: + +- **基础情感**: +```
+(angry) (sad) (excited) (surprised) (satisfied) (delighted)
+(scared) (worried) (upset) (nervous) (frustrated) (depressed)
+(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
+(grateful) (confident) (interested) (curious) (confused) (joyful) ```

-!!! warning - `compile` 选项在 Windows 和 macOS 上不受支持,如果您想使用 compile 运行,需要自己安装 trition。

+- **高级情感**: +```
+(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
+(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
+(keen) (disapproving) (negative) (denying) (astonished) (serious)
+(sarcastic) (conciliative) (comforting) (sincere) (sneering)
+(hesitating) (yielding) (painful) (awkward) (amused)
+```

-## 致谢

+- **语调标记**: +```
+(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
+```

-- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2) -- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2) -- [GPT VITS](https://github.com/innnky/gpt-vits) -- [MQTTS](https://github.com/b04901014/MQTTS) -- [GPT Fast](https://github.com/pytorch-labs/gpt-fast) -- [Transformers](https://github.com/huggingface/transformers) -- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)

+- **特殊音效**: +```
+(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
+(groaning) (crowd laughing) (background laughter) (audience laughing)
+```
+
+您还可以直接在文本中写入 `Ha,ha,ha` 来控制笑声,还有许多其他用法等待您自己探索。
+
+### **卓越的 TTS 质量**
+
+我们使用 Seed TTS 评估指标来评估模型性能,结果显示 OpenAudio S1 在英文文本上达到了 **0.008 WER** 和 **0.004 CER**,明显优于以前的模型。(英语,自动评估,基于 OpenAI gpt-4o-transcribe,说话人距离使用 Revai/pyannote-wespeaker-voxceleb-resnet34-LM)
+
+| 模型 | 词错误率 (WER) | 字符错误率 (CER) | 说话人距离 |
+|-------|----------------------|---------------------------|------------------|
+| **S1** | **0.008** | **0.004** | **0.332** |
+| **S1-mini** | **0.011** | **0.005** | **0.380** |
+
+### **两种模型类型**
+
+| 模型 | 规模 | 可用性 | 特性 |
+|-------|------|--------------|----------|
+| **S1** | 40亿参数 | 在 [fish.audio](https://fish.audio) 上可用 | 功能齐全的旗舰模型 |
+| **S1-mini** | 5亿参数 | 在 [Hugging Face Space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini) 上可用 | 具有核心功能的蒸馏版本 |
+
+S1 和 S1-mini 都集成了在线人类反馈强化学习 (RLHF)。
+
+## **功能特性**
+
+1. **零样本和少样本 TTS:** 输入 10 到 30 秒的语音样本即可生成高质量的 TTS 输出。**详细指南请参见 [语音克隆最佳实践](https://docs.fish.audio/text-to-speech/voice-clone-best-practices)。**
+
+2. **多语言和跨语言支持:** 只需复制粘贴多语言文本到输入框即可——无需担心语言问题。目前支持英语、日语、韩语、中文、法语、德语、阿拉伯语和西班牙语。
+
+3. **无音素依赖:** 该模型具有强大的泛化能力,不依赖音素进行 TTS。它可以处理任何语言文字的文本。
+
+4. **高度准确:** 在 Seed-TTS Eval 中实现低字符错误率 (CER) 约 0.4% 和词错误率 (WER) 约 0.8%。
+
+5. **快速:** 通过 fish-tech 加速,在 Nvidia RTX 4060 笔记本电脑上实时因子约为 1:5,在 Nvidia RTX 4090 上约为 1:15。
+
+6. **WebUI 推理:** 具有易于使用的基于 Gradio 的网络界面,兼容 Chrome、Firefox、Edge 和其他浏览器。
+
+7. 
**GUI 推理:** 提供与 API 服务器无缝配合的 PyQt6 图形界面。支持 Linux、Windows 和 macOS。[查看 GUI](https://github.com/AnyaCoder/fish-speech-gui)。
+
+8. **部署友好:** 轻松设置推理服务器,原生支持 Linux、Windows 和 macOS,最小化速度损失。
+
+## **免责声明**
+
+我们不对代码库的任何非法使用承担责任。请参考您当地关于 DMCA 和其他相关法律的规定。
+
+## **媒体和演示**
+
+#### 🚧 即将推出
+视频演示和教程正在开发中。
+
+## **文档**
+
+### 快速开始
+- [构建环境](install.md) - 设置您的开发环境
+- [推理指南](inference.md) - 运行模型并生成语音
+
+## **社区和支持**
+
+- **Discord:** 加入我们的 [Discord 社区](https://discord.gg/Es5qTB9BcN)
+- **网站:** 访问 [OpenAudio.com](https://openaudio.com) 获取最新更新
+- **在线试用:** [Fish Audio Playground](https://fish.audio)
diff --git a/docs/zh/inference.md b/docs/zh/inference.md index 50ac2a7..de821ad 100644 --- a/docs/zh/inference.md +++ b/docs/zh/inference.md @@ -1,6 +1,6 @@ # 推理 -由于声码器模型已更改,您需要比以前更多的显存,建议使用12GB显存以便流畅推理。 +由于声码器模型已更改,您需要比以前更多的 VRAM,建议使用 12GB 进行流畅推理。 我们支持命令行、HTTP API 和 WebUI 进行推理,您可以选择任何您喜欢的方法。 @@ -17,7 +17,7 @@ huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/ope !!! note 如果您计划让模型随机选择音色,可以跳过此步骤。 -### 1. 从参考音频获取VQ tokens +### 1. 从参考音频获取 VQ 令牌 ```bash python fish_speech/models/dac/inference.py \ @@ -27,38 +27,33 @@ python fish_speech/models/dac/inference.py \ 您应该会得到一个 `fake.npy` 和一个 `fake.wav`。 -### 2. 从文本生成语义tokens: +### 2. 从文本生成语义令牌: ```bash python fish_speech/models/text2semantic/inference.py \ --text "您想要转换的文本" \ --prompt-text "您的参考文本" \ --prompt-tokens "fake.npy" \ - --checkpoint-path "checkpoints/openaudio-s1-mini" \ - --num-samples 2 \ - --compile # 如果您想要更快的速度 + --compile ``` -此命令将在工作目录中创建一个 `codes_N` 文件,其中N是从0开始的整数。 +此命令将在工作目录中创建一个 `codes_N` 文件,其中 N 是从 0 开始的整数。 !!! note - 您可能想要使用 `--compile` 来融合CUDA内核以获得更快的推理速度(约30 tokens/秒 -> 约500 tokens/秒)。 - 相应地,如果您不打算使用加速,可以删除 `--compile` 参数的注释。 + 您可能希望使用 `--compile` 来融合 CUDA 内核以实现更快的推理(~30 令牌/秒 -> ~500 令牌/秒)。 + 相应地,如果您不计划使用加速,可以去掉 `--compile` 参数。 !!! info - 对于不支持bf16的GPU,您可能需要使用 `--half` 参数。 + 对于不支持 bf16 的 GPU,您可能需要使用 `--half` 参数。 -### 3. 从语义tokens生成人声: - -#### VQGAN 解码器 +### 3. 从语义令牌生成声音: !!! warning "未来警告" - 我们保留了从原始路径(tools/vqgan/inference.py)访问的接口,但此接口可能在后续版本中被移除,请尽快更改您的代码。 + 我们保留了从原始路径(tools/vqgan/inference.py)访问接口的能力,但此接口可能在后续版本中被删除,因此请尽快更改您的代码。 ```bash python fish_speech/models/dac/inference.py \ - -i "codes_0.npy" \ - --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth" + -i "codes_0.npy" ``` ## HTTP API 推理 diff --git a/docs/zh/install.md b/docs/zh/install.md new file mode 100644 index 0000000..be82665 --- /dev/null +++ b/docs/zh/install.md @@ -0,0 +1,30 @@ +## 系统要求 + +- GPU 内存:12GB(推理) +- 系统:Linux、WSL + +## 安装 + +首先需要安装 pyaudio 和 sox,用于音频处理。 + +``` bash +apt install portaudio19-dev libsox-dev ffmpeg +``` + +### Conda + +```bash +conda create -n fish-speech python=3.12 +conda activate fish-speech + +pip install -e . +``` + +### UV + +```bash +uv sync --python 3.12 +``` + +!!! warning + `compile` 选项在 Windows 和 macOS 上不受支持,如果您想使用 compile 运行,需要自己安装 triton。 diff --git a/mkdocs.yml b/mkdocs.yml index 214c4e3..f2f62f9 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,4 +1,4 @@ -site_name: Fish Speech +site_name: OpenAudio site_description: Targeting SOTA TTS solutions.
site_url: https://speech.fish.audio @@ -12,7 +12,7 @@ copyright: Copyright © 2023-2025 by Fish Audio theme: name: material - favicon: assets/figs/logo-circle.png + favicon: assets/openaudio.png language: en features: - content.action.edit @@ -25,8 +25,7 @@ theme: - search.highlight - search.share - content.code.copy - icon: - logo: fontawesome/solid/fish + logo: assets/openaudio.png palette: # Palette toggle for automatic mode @@ -56,7 +55,8 @@ theme: code: Roboto Mono nav: - - Installation: en/index.md + - Introduction: en/index.md + - Installation: en/install.md - Inference: en/inference.md # Plugins @@ -80,25 +80,29 @@ plugins: name: 简体中文 build: true nav: - - 安装: zh/index.md + - 介绍: zh/index.md + - 安装: zh/install.md - 推理: zh/inference.md - locale: ja name: 日本語 build: true nav: - - インストール: ja/index.md + - はじめに: ja/index.md + - インストール: ja/install.md - 推論: ja/inference.md - locale: pt name: Português (Brasil) build: true nav: - - Instalação: pt/index.md + - Introdução: pt/index.md + - Instalação: pt/install.md - Inferência: pt/inference.md - locale: ko name: 한국어 build: true nav: - - 설치: ko/index.md + - 소개: ko/index.md + - 설치: ko/install.md - 추론: ko/inference.md markdown_extensions: