forked from griptape-ai/griptape
Text to Speech Support (griptape-ai#755)
1 parent a7fa3c7, commit c98b26d
Showing 51 changed files with 955 additions and 32 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
## Overview | ||
|
||
[Text to Speech Drivers](../../reference/griptape/drivers/text_to_speech/index.md) are used by [Text to Speech Engines](../engines/audio-engines.md) to build and execute API calls to audio generation models. | ||
|
||
Provide a Driver when building an [Engine](../engines/audio-engines.md), then pass it to a [Tool](../tools/index.md) for use by an [Agent](../structures/agents.md): | ||
|
||
### Eleven Labs | ||
|
||
The [Eleven Labs Text to Speech Driver](../../reference/griptape/drivers/text_to_speech/elevenlabs_text_to_speech_driver.md) provides support for text-to-speech models hosted by Eleven Labs. This Driver supports configurations specific to Eleven Labs, like voice selection and output format. | ||
|
||
```python | ||
import os | ||
|
||
from griptape.drivers import ElevenLabsTextToSpeechDriver | ||
from griptape.engines import TextToSpeechEngine | ||
from griptape.tools.text_to_speech_client.tool import TextToSpeechClient | ||
from griptape.structures import Agent | ||
|
||
|
||
driver = ElevenLabsTextToSpeechDriver( | ||
api_key=os.getenv("ELEVEN_LABS_API_KEY"), | ||
model="eleven_multilingual_v2", | ||
voice="Matilda", | ||
) | ||
|
||
tool = TextToSpeechClient( | ||
engine=TextToSpeechEngine( | ||
text_to_speech_driver=driver, | ||
), | ||
) | ||
|
||
Agent(tools=[tool]).run("Generate audio from this text: 'Hello, world!'") | ||
``` | ||
|
||
### OpenAI | ||
|
||
The [OpenAI Text to Speech Driver](../../reference/griptape/drivers/text_to_speech/openai_text_to_speech_driver.md) provides support for text-to-speech models hosted by OpenAI. This Driver supports configurations specific to OpenAI, like voice selection and output format. | ||
|
||
```python | ||
from griptape.drivers import OpenAiTextToSpeechDriver | ||
from griptape.engines import TextToSpeechEngine | ||
from griptape.tools.text_to_speech_client.tool import TextToSpeechClient | ||
from griptape.structures import Agent | ||
|
||
driver = OpenAiTextToSpeechDriver() | ||
|
||
tool = TextToSpeechClient( | ||
engine=TextToSpeechEngine( | ||
text_to_speech_driver=driver, | ||
), | ||
) | ||
|
||
Agent(tools=[tool]).run("Generate audio from this text: 'Hello, world!'") | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
## Overview | ||
|
||
[Audio Generation Engines](../../reference/griptape/engines/audio/index.md) facilitate audio generation. Each Engine provides a `run` method that accepts the inputs required for its mode and passes the request to the configured [Driver](../drivers/text-to-speech-drivers.md). | ||
|
||
### Text to Speech Engine | ||
|
||
This Engine facilitates synthesizing speech from text inputs. | ||
|
||
```python | ||
import os | ||
|
||
from griptape.drivers import ElevenLabsTextToSpeechDriver | ||
from griptape.engines import TextToSpeechEngine | ||
|
||
|
||
driver = ElevenLabsTextToSpeechDriver( | ||
api_key=os.getenv("ELEVEN_LABS_API_KEY"), | ||
model="eleven_multilingual_v2", | ||
voice="Rachel", | ||
) | ||
|
||
engine = TextToSpeechEngine( | ||
text_to_speech_driver=driver, | ||
) | ||
|
||
engine.run( | ||
prompts=["Hello, world!"], | ||
) | ||
``` |
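The `engine.run(...)` call in the example above returns an `AudioArtifact`, but the snippet discards it. A minimal sketch of persisting the result, assuming only the `value` (bytes) and `format` fields that the drivers in this commit populate. `FakeAudioArtifact` and `save_audio` are hypothetical names, used so the snippet runs without griptape or an API key:

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FakeAudioArtifact:
    """Hypothetical stand-in exposing the two fields the drivers populate."""

    value: bytes
    format: str


def save_audio(artifact: FakeAudioArtifact, directory: Path, stem: str = "speech") -> Path:
    """Write the audio bytes to <directory>/<stem>.<format> and return the path."""
    path = directory / f"{stem}.{artifact.format}"
    path.write_bytes(artifact.value)
    return path


# With griptape installed, `artifact` would be the return value of engine.run(...).
artifact = FakeAudioArtifact(value=b"\x00\x01fake-audio", format="mp3")
out_dir = Path(tempfile.mkdtemp())
saved = save_audio(artifact, out_dir)
print(saved.name)
```

Using the artifact's `format` as the file extension keeps the file name consistent with whatever output format the driver was configured with.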
27 changes: 27 additions & 0 deletions
docs/griptape-tools/official-tools/text-to-speech-client.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# TextToSpeechClient | ||
|
||
This tool enables LLMs to synthesize speech from text using [Text to Speech Engines](../../reference/griptape/engines/audio/text_to_speech_engine.md) and [Text to Speech Drivers](../../reference/griptape/drivers/text_to_speech/index.md). | ||
|
||
```python | ||
import os | ||
|
||
from griptape.drivers import ElevenLabsTextToSpeechDriver | ||
from griptape.engines import TextToSpeechEngine | ||
from griptape.tools.text_to_speech_client.tool import TextToSpeechClient | ||
from griptape.structures import Agent | ||
|
||
|
||
driver = ElevenLabsTextToSpeechDriver( | ||
api_key=os.getenv("ELEVEN_LABS_API_KEY"), | ||
model="eleven_multilingual_v2", | ||
voice="Matilda", | ||
) | ||
|
||
tool = TextToSpeechClient( | ||
engine=TextToSpeechEngine( | ||
text_to_speech_driver=driver, | ||
), | ||
) | ||
|
||
Agent(tools=[tool]).run("Generate audio from this text: 'Hello, world!'") | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
from __future__ import annotations | ||
|
||
from attr import define | ||
|
||
from griptape.artifacts import MediaArtifact | ||
|
||
|
||
@define | ||
class AudioArtifact(MediaArtifact): | ||
    """AudioArtifact is a type of MediaArtifact representing audio.""" | ||
|
||
    media_type: str = "audio" |
Empty file.
44 changes: 44 additions & 0 deletions
griptape/drivers/text_to_speech/base_text_to_speech_driver.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
from __future__ import annotations | ||
|
||
from abc import ABC, abstractmethod | ||
from typing import TYPE_CHECKING, Optional | ||
|
||
from attr import define, field | ||
|
||
from griptape.artifacts.audio_artifact import AudioArtifact | ||
from griptape.events.finish_text_to_speech_event import FinishTextToSpeechEvent | ||
from griptape.events.start_text_to_speech_event import StartTextToSpeechEvent | ||
from griptape.mixins import ExponentialBackoffMixin, SerializableMixin | ||
|
||
if TYPE_CHECKING: | ||
    from griptape.structures import Structure | ||
|
||
|
||
@define | ||
class BaseTextToSpeechDriver(SerializableMixin, ExponentialBackoffMixin, ABC): | ||
    model: str = field(kw_only=True, metadata={"serializable": True}) | ||
    structure: Optional[Structure] = field(default=None, kw_only=True) | ||
|
||
    def before_run(self, prompts: list[str]) -> None: | ||
        if self.structure: | ||
            self.structure.publish_event(StartTextToSpeechEvent(prompts=prompts)) | ||
|
||
    def after_run(self) -> None: | ||
        if self.structure: | ||
            self.structure.publish_event(FinishTextToSpeechEvent()) | ||
|
||
    def run_text_to_audio(self, prompts: list[str]) -> AudioArtifact: | ||
        for attempt in self.retrying(): | ||
            with attempt: | ||
                self.before_run(prompts) | ||
                result = self.try_text_to_audio(prompts) | ||
                self.after_run() | ||
|
||
                return result | ||
|
||
        else: | ||
            raise Exception("Failed to run text to audio generation") | ||
|
||
    @abstractmethod | ||
    def try_text_to_audio(self, prompts: list[str]) -> AudioArtifact: | ||
        ... |
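The base class splits work between a public `run_text_to_audio`, which wraps each attempt with events and retries, and an abstract `try_text_to_audio` that subclasses implement. The sketch below illustrates that split with the standard library only. It is not the griptape implementation: `SketchDriver` and `FlakyDriver` are hypothetical, a plain loop stands in for the mixin's `retrying()` helper, there is no backoff delay, and a list of strings stands in for published events.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Optional


class SketchDriver(ABC):
    """Hypothetical stand-in for the run/try split in BaseTextToSpeechDriver."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.events: list[str] = []  # stands in for structure.publish_event(...)

    def run_text_to_audio(self, prompts: list[str]) -> bytes:
        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):
            self.events.append("start")  # recorded on every attempt
            try:
                result = self.try_text_to_audio(prompts)
            except Exception as error:  # broad on purpose: any failure triggers a retry
                last_error = error
                continue
            self.events.append("finish")  # recorded only when the attempt succeeds
            return result
        raise RuntimeError("Failed to run text to audio generation") from last_error

    @abstractmethod
    def try_text_to_audio(self, prompts: list[str]) -> bytes: ...


class FlakyDriver(SketchDriver):
    """Fails twice, then succeeds, to exercise the retry loop."""

    def __init__(self) -> None:
        super().__init__(max_attempts=3)
        self.calls = 0

    def try_text_to_audio(self, prompts: list[str]) -> bytes:
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("transient failure")
        return ". ".join(prompts).encode()


driver = FlakyDriver()
audio = driver.run_text_to_audio(["Hello", "world"])
print(driver.calls, driver.events, audio)
```

As in the diff, the start hook sits inside the retried block, so it fires once per attempt, while the finish hook fires only on success. A caller that invokes `try_text_to_audio` directly bypasses both the retries and the events.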
13 changes: 13 additions & 0 deletions
griptape/drivers/text_to_speech/dummy_text_to_speech_driver.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
from attr import define, field | ||
from griptape.artifacts.audio_artifact import AudioArtifact | ||
from griptape.drivers import BaseTextToSpeechDriver | ||
from griptape.exceptions import DummyException | ||
|
||
|
||
@define | ||
class DummyTextToSpeechDriver(BaseTextToSpeechDriver): | ||
    model: str = field(init=False) | ||
|
||
    def try_text_to_audio(self, prompts: list[str]) -> AudioArtifact: | ||
        raise DummyException(__class__.__name__, "try_text_to_audio") |
42 changes: 42 additions & 0 deletions
griptape/drivers/text_to_speech/elevenlabs_text_to_speech_driver.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
from __future__ import annotations | ||
|
||
from typing import TYPE_CHECKING, Optional, Any | ||
|
||
from attr import define, field, Factory | ||
|
||
from griptape.artifacts.audio_artifact import AudioArtifact | ||
from griptape.drivers import BaseTextToSpeechDriver | ||
from griptape.utils import import_optional_dependency | ||
|
||
if TYPE_CHECKING: | ||
    from elevenlabs.client import ElevenLabs | ||
|
||
|
||
@define | ||
class ElevenLabsTextToSpeechDriver(BaseTextToSpeechDriver): | ||
    api_key: str = field(kw_only=True, metadata={"serializable": True}) | ||
    client: Any = field( | ||
        default=Factory( | ||
            lambda self: import_optional_dependency("elevenlabs.client").ElevenLabs(api_key=self.api_key), | ||
            takes_self=True, | ||
        ), | ||
        kw_only=True, | ||
        metadata={"serializable": True}, | ||
    ) | ||
    voice: str = field(kw_only=True, metadata={"serializable": True}) | ||
    output_format: str = field(default="mp3_44100_128", kw_only=True, metadata={"serializable": True}) | ||
|
||
    def try_text_to_audio(self, prompts: list[str]) -> AudioArtifact: | ||
        audio = self.client.generate( | ||
            text=". ".join(prompts), voice=self.voice, model=self.model, output_format=self.output_format | ||
        ) | ||
|
||
        content = b"" | ||
        for chunk in audio: | ||
            content += chunk | ||
|
||
        # All ElevenLabs audio format strings have the following structure: | ||
        # {format}_{sample_rate}_{bitrate} | ||
        artifact_format = self.output_format.split("_")[0] | ||
|
||
        return AudioArtifact(value=content, format=artifact_format) |
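The driver above does two small pieces of post-processing: it concatenates the streamed chunks into one bytes object, and it takes the text before the first underscore of `output_format` as the artifact format. A standalone sketch of both steps follows. `parse_output_format`, `ParsedFormat`, and `collect_chunks` are hypothetical helpers rather than part of the driver, and the two-part form (for example `pcm_16000`, with no bitrate) is an assumption about strings the service may accept:

```python
from __future__ import annotations

from typing import Iterable, NamedTuple, Optional


class ParsedFormat(NamedTuple):
    codec: str
    sample_rate: int
    bitrate: Optional[int]


def parse_output_format(output_format: str) -> ParsedFormat:
    """Split a '{format}_{sample_rate}[_{bitrate}]' string into its parts."""
    parts = output_format.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected output format: {output_format!r}")
    bitrate = int(parts[2]) if len(parts) == 3 else None
    return ParsedFormat(codec=parts[0], sample_rate=int(parts[1]), bitrate=bitrate)


def collect_chunks(chunks: Iterable[bytes]) -> bytes:
    """Concatenate streamed audio chunks in a single pass."""
    return b"".join(chunks)


default_format = parse_output_format("mp3_44100_128")
content = collect_chunks(iter([b"ab", b"cd"]))
print(default_format.codec, len(content))
```

`b"".join(...)` produces the same bytes as the `+=` loop in the driver while avoiding repeated copying as the buffer grows.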
36 changes: 36 additions & 0 deletions
griptape/drivers/text_to_speech/openai_text_to_speech_driver.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
from __future__ import annotations | ||
|
||
from typing import Optional, Literal | ||
|
||
import openai | ||
from attr import define, field, Factory | ||
|
||
from griptape.artifacts.audio_artifact import AudioArtifact | ||
from griptape.drivers import BaseTextToSpeechDriver | ||
|
||
|
||
@define | ||
class OpenAiTextToSpeechDriver(BaseTextToSpeechDriver): | ||
    model: str = field(default="tts-1", kw_only=True, metadata={"serializable": True}) | ||
    voice: Literal["alloy", "echo", "fable", "onyx", "nova", "shimmer"] = field( | ||
        default="alloy", kw_only=True, metadata={"serializable": True} | ||
    ) | ||
    format: Literal["mp3", "opus", "aac", "flac"] = field(default="mp3", kw_only=True, metadata={"serializable": True}) | ||
    api_type: str = field(default=openai.api_type, kw_only=True) | ||
    api_version: Optional[str] = field(default=openai.api_version, kw_only=True, metadata={"serializable": True}) | ||
    base_url: Optional[str] = field(default=None, kw_only=True, metadata={"serializable": True}) | ||
    api_key: Optional[str] = field(default=None, kw_only=True) | ||
    organization: Optional[str] = field(default=openai.organization, kw_only=True, metadata={"serializable": True}) | ||
    client: openai.OpenAI = field( | ||
        default=Factory( | ||
            lambda self: openai.OpenAI(api_key=self.api_key, base_url=self.base_url, organization=self.organization), | ||
            takes_self=True, | ||
        ) | ||
    ) | ||
|
||
    def try_text_to_audio(self, prompts: list[str]) -> AudioArtifact: | ||
        response = self.client.audio.speech.create( | ||
            input=". ".join(prompts), voice=self.voice, model=self.model, response_format=self.format | ||
        ) | ||
|
||
        return AudioArtifact(value=response.content, format=self.format) |
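Both drivers combine the prompt list into a single input string with `". ".join(prompts)`. One consequence is that a prompt that already ends with a period yields a doubled period in the joined text. The sketch below shows the effect and one possible normalization. `join_prompts` is a hypothetical helper, not something the commit provides, and whether the doubled period affects the synthesized audio is not something the source states:

```python
from __future__ import annotations


def join_prompts(prompts: list[str]) -> str:
    """Join prompts with '. ' after trimming whitespace and trailing periods."""
    cleaned = [p.strip().rstrip(".") for p in prompts if p.strip()]
    return ". ".join(cleaned)


prompts = ["Hello, world.", "  How are you?  ", ""]

# What the drivers in this commit send: a plain join of the raw prompts.
naive = ". ".join(prompts)

# The normalized variant drops empty prompts and stray punctuation first.
normalized = join_prompts(prompts)

print(repr(naive))
print(repr(normalized))
```

Note that `rstrip(".")` also removes a trailing ellipsis, so a caller that wants to keep "..." would need a narrower rule.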
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
from __future__ import annotations | ||
|
||
from attr import define, field | ||
|
||
from griptape.artifacts.audio_artifact import AudioArtifact | ||
from griptape.drivers import BaseTextToSpeechDriver | ||
|
||
|
||
@define | ||
class TextToSpeechEngine: | ||
    text_to_speech_driver: BaseTextToSpeechDriver = field(kw_only=True) | ||
|
||
    def run(self, prompts: list[str], *args, **kwargs) -> AudioArtifact: | ||
        return self.text_to_speech_driver.try_text_to_audio(prompts=prompts) |