feat: Add Speech to Text feature (#25)

osl-incubator · Mar 13, 2024 · 122dc7c
1 parent af3c6d1

Showing 17 changed files with 1,904 additions and 1,240 deletions.
diff --git a/README.md b/README.md
@@ -16,7 +16,7 @@ order to have everything well installed, create a conda/mamba environment and
 install `artbox` there.
 
 ```bash
-$ mamba create --name artbox "python>=3.8.1,<3.12" pygobject pip
+$ mamba create --name artbox "python>=3.8.1,<3.12" "pygobject==3.48.1" pip
 $ conda activate artbox
 $ pip install artbox
 ```
@@ -31,17 +31,17 @@ $ mkdir /tmp/artbox
 
 ### Convert text to audio
 
-By default, the `artbox voice` uses
+By default, the `artbox speech` uses
 [`edge-tts`](https://pypi.org/project/edge-tts/) engine, but if you can also
 specify [`gtts`](https://github.com/pndurette/gTTS) with the flag
 `--engine gtts`.
 
 ```bash
 $ echo "Are you ready to join Link and Zelda in fighting off this unprecedented threat to Hyrule?" > /tmp/artbox/text.md
-$ artbox voice text-to-speech \
+$ artbox speech from-text \
     --title artbox \
-    --text-path /tmp/artbox/text.md \
-    --output-path /tmp/artbox/voice.mp3 \
+    --input-path /tmp/artbox/text.md \
+    --output-path /tmp/artbox/speech.mp3 \
     --engine edge-tts
 ```
 
@@ -50,10 +50,10 @@ If you need to generate the audio for different language, you can use the flag
 
 ```bash
 $ echo "Bom dia, mundo!" > /tmp/artbox/text.md
-$ artbox voice text-to-speech \
+$ artbox speech from-text \
     --title artbox \
-    --text-path /tmp/artbox/text.md \
-    --output-path /tmp/artbox/voice.mp3 \
+    --input-path /tmp/artbox/text.md \
+    --output-path /tmp/artbox/speech.mp3 \
     --lang pt
 ```
 
@@ -62,10 +62,10 @@ locale for that language, for example:
 
 ```bash
 $ echo "Are you ready to join Link and Zelda in fighting off this unprecedented threat to Hyrule?" > /tmp/artbox/text.md
-$ artbox voice text-to-speech \
+$ artbox speech from-text \
     --title artbox \
-    --text-path /tmp/artbox/text.md \
-    --output-path /tmp/artbox/voice.mp3 \
+    --input-path /tmp/artbox/text.md \
+    --output-path /tmp/artbox/speech.mp3 \
     --engine edge-tts \
     --lang en-IN
 ```
@@ -75,17 +75,42 @@ and `--pitch`, for example:
 
 ```bash
 $ echo "Do you want some coffee?" > /tmp/artbox/text.md
-$ artbox voice text-to-speech \
+$ artbox speech from-text \
     --title artbox \
-    --text-path /tmp/artbox/text.md \
-    --output-path /tmp/artbox/voice.mp3 \
+    --input-path /tmp/artbox/text.md \
+    --output-path /tmp/artbox/speech.mp3 \
     --engine edge-tts \
     --lang en \
     --rate +10% \
     --volume -10% \
     --pitch -5Hz
 ```
 
+### Convert audio to text
+
+ArtBox uses `speechrecognition` to convert from audio to text. Currently, ArtBox
+just support the `google` engine.
+
+For this example, let's first create our audio:
+
+```bash
+$ echo "Are you ready to join Link and Zelda in fighting off this unprecedented threat to Hyrule?" > /tmp/artbox/text.md
+$ artbox speech from-text \
+    --title artbox \
+    --input-path /tmp/artbox/text.md \
+    --output-path /tmp/artbox/speech.mp3 \
+    --engine edge-tts
+```
+
+Now we can convert it back to text:
+
+```bash
+$ artbox speech to-text \
+    --input-path /tmp/artbox/speech.mp3 \
+    --output-path /tmp/artbox/text-from-speech.md \
+    --lang en
+```
+
 ### Download a youtube video
 
 If you want to download videos from the youtube, you can use the following

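Note on the README hunks above: they rename the command group (`voice` to `speech`), the action (`text-to-speech` to `from-text`), and one flag (`--text-path` to `--input-path`). The sketch below shows how an existing invocation could be rewritten to the new spelling. It is a minimal illustration, not part of the commit: `migrate_args` and the two lookup tables are hypothetical names, and the action rename is taken from the README hunks only (the `docs/index.md` hunks in this same commit still show `speech text-to-speech`).

```python
from typing import Dict, List

# Rename tables read off the README.md hunks of this commit.
# Assumption: these are the only renames a caller has to handle.
GROUP_RENAMES: Dict[str, str] = {"voice": "speech"}
ACTION_RENAMES: Dict[str, str] = {"text-to-speech": "from-text"}
FLAG_RENAMES: Dict[str, str] = {"--text-path": "--input-path"}


def migrate_args(argv: List[str]) -> List[str]:
    """Return a copy of argv with the old names replaced by the new ones.

    argv[0] is expected to be the program name, argv[1] the command group
    and argv[2] the action. Flags are renamed wherever they appear; flag
    values such as file paths are left untouched.
    """
    out = list(argv)
    if len(out) > 1:
        out[1] = GROUP_RENAMES.get(out[1], out[1])
    if len(out) > 2:
        out[2] = ACTION_RENAMES.get(out[2], out[2])
    # Only tokens starting with "--" are candidates for a flag rename.
    return [
        FLAG_RENAMES.get(arg, arg) if arg.startswith("--") else arg
        for arg in out
    ]


if __name__ == "__main__":
    old_call = [
        "artbox", "voice", "text-to-speech",
        "--title", "artbox",
        "--text-path", "/tmp/artbox/text.md",
        "--output-path", "/tmp/artbox/voice.mp3",
    ]
    print(" ".join(migrate_args(old_call)))
```

The helper deliberately leaves the output path alone, so a script that still writes to `/tmp/artbox/voice.mp3` keeps doing so; switching to `speech.mp3` as in the README examples is a separate, optional change.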
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -58,7 +58,7 @@
 
 ### Features
 
-- Add engine options for Voice class. ([#6](https://github.com/ggpedia/artbox/issues/6)) ([d4381f7](https://github.com/ggpedia/artbox/commit/d4381f781a98ffb51fb103d671c5a9115bb3f6d1))
+- Add engine options for Speech class. ([#6](https://github.com/ggpedia/artbox/issues/6)) ([d4381f7](https://github.com/ggpedia/artbox/commit/d4381f781a98ffb51fb103d671c5a9115bb3f6d1))
 
 # [0.2.0](https://github.com/ggpedia/artbox/compare/0.1.0...0.2.0) (2023-08-29)
 
@@ -69,4 +69,4 @@
 
 ### Features
 
-- Add the flag `--lang` for the voice command ([#2](https://github.com/ggpedia/artbox/issues/2)) ([cb937e9](https://github.com/ggpedia/artbox/commit/cb937e9e7a9de5a19b3dc4dc8d34f6daf4ba6304))
+- Add the flag `--lang` for the speech command ([#2](https://github.com/ggpedia/artbox/issues/2)) ([cb937e9](https://github.com/ggpedia/artbox/commit/cb937e9e7a9de5a19b3dc4dc8d34f6daf4ba6304))
diff --git a/docs/index.md b/docs/index.md
@@ -31,17 +31,17 @@ $ mkdir /tmp/artbox
 
 ### Convert text to audio
 
-By default, the `artbox voice` uses
+By default, the `artbox speech` uses
 [`edge-tts`](https://pypi.org/project/edge-tts/) engine, but if you can also
 specify [`gtts`](https://github.com/pndurette/gTTS) with the flag
 `--engine gtts`.
 
 ```bash
 $ echo "Are you ready to join Link and Zelda in fighting off this unprecedented threat to Hyrule?" > /tmp/artbox/text.md
-$ artbox voice text-to-speech \
+$ artbox speech text-to-speech \
     --title artbox \
-    --text-path /tmp/artbox/text.md \
-    --output-path /tmp/artbox/voice.mp3 \
+    --input-path /tmp/artbox/text.md \
+    --output-path /tmp/artbox/speech.mp3 \
     --engine edge-tts
 ```
 
@@ -50,10 +50,10 @@ If you need to generate the audio for different language, you can use the flag
 
 ```bash
 $ echo "Bom dia, mundo!" > /tmp/artbox/text.md
-$ artbox voice text-to-speech \
+$ artbox speech text-to-speech \
     --title artbox \
-    --text-path /tmp/artbox/text.md \
-    --output-path /tmp/artbox/voice.mp3 \
+    --input-path /tmp/artbox/text.md \
+    --output-path /tmp/artbox/speech.mp3 \
     --lang pt
 ```
 
@@ -62,10 +62,10 @@ locale for that language, for example:
 
 ```bash
 $ echo "Are you ready to join Link and Zelda in fighting off this unprecedented threat to Hyrule?" > /tmp/artbox/text.md
-$ artbox voice text-to-speech \
+$ artbox speech text-to-speech \
     --title artbox \
-    --text-path /tmp/artbox/text.md \
-    --output-path /tmp/artbox/voice.mp3 \
+    --input-path /tmp/artbox/text.md \
+    --output-path /tmp/artbox/speech.mp3 \
     --engine edge-tts \
     --lang en-IN
 ```
@@ -75,10 +75,10 @@ and `--pitch`, for example:
 
 ```bash
 $ echo "Do you want some coffee?" > /tmp/artbox/text.md
-$ artbox voice text-to-speech \
+$ artbox speech text-to-speech \
     --title artbox \
-    --text-path /tmp/artbox/text.md \
-    --output-path /tmp/artbox/voice.mp3 \
+    --input-path /tmp/artbox/text.md \
+    --output-path /tmp/artbox/speech.mp3 \
     --engine edge-tts \
     --lang en \
     --rate +10% \