
Community : Add audio-parser "faster-whisper" in audio.py #20012

Merged
merged 10 commits into langchain-ai:master on Apr 18, 2024

Conversation

hulitaitai
Contributor

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

It can automatically detect the following 14 languages and transcribe the text into their respective languages: en, zh, fr, de, ja, ko, ru, es, th, it, pt, vi, ar, tr.

The GitHub repository for faster-whisper is:
https://github.com/SYSTRAN/faster-whisper
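
For quick reference, a usage sketch of the parser this PR adds (imports reflect langchain_community after the merge; "example.mp3" is a placeholder):

    # A minimal usage sketch; "example.mp3" is a placeholder audio file.
    from langchain_community.document_loaders.blob_loaders import Blob
    from langchain_community.document_loaders.parsers.audio import FasterWhisperParser

    parser = FasterWhisperParser(device="cpu")  # force CPU inference
    for doc in parser.lazy_parse(Blob.from_path("example.mp3")):
        print(doc.page_content, doc.metadata)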

Add the URL of the embedding tool "text2vec".
Fix minor mistakes in the doc-string.
@hulitaitai hulitaitai closed this Apr 4, 2024
@hulitaitai hulitaitai reopened this Apr 4, 2024
@hulitaitai
Contributor Author

I am sorry, I have never passed the lint test successfully. I really don't know what is wrong.

faster-whisper is an open-source, low-cost parser that can run with less than 5 GB of GPU memory.

@eyurtsev eyurtsev self-assigned this Apr 4, 2024

    def __init__(
        self,
        device: Optional[str] = "0",
Collaborator

Suggested change
    - device: Optional[str] = "0",
    + *,
    + device: Optional[str] = "0",
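
For context, the bare * in the suggestion makes every parameter after it keyword-only; a minimal illustration with a hypothetical class:

    class Parser:  # hypothetical stand-in for the parser under review
        def __init__(self, *, device: str = "cpu") -> None:
            self.device = device

    Parser(device="cuda")  # OK: passed by keyword
    # Parser("cuda")       # TypeError: takes 1 positional argument but 2 were given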

Contributor Author

Yes, here I presumed I would reassign the definite value later, so I casually placed a placeholder value.

Additionally, I admit I copied a lot of code and ideas from the parsers above in this file. The code I've written runs well on my PC. I will meticulously re-examine the various arguments and variables you point out, for the sake of the overall system. I'll come back to you after I've finished checking.


    def __init__(
        self,
        device: Optional[str] = "0",
Collaborator

device "0" doesn't look like correct value given that it can only end up being cuda or cpu with logic below.

    )

    # Determine the device to use
    if device == "cpu":
Collaborator

The logic looks like it's ignoring the user's input.
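
A sketch of device resolution that honors the user's choice while still falling back when CUDA is absent (illustrative helper, not the merged code):

    import torch

    def resolve_device(device: str = "cpu") -> str:
        """Return "cpu" when explicitly requested; otherwise prefer CUDA."""
        if device == "cpu":
            return "cpu"
        return "cuda" if torch.cuda.is_available() else "cpu"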


    def __init__(
        self,
        device: Optional[str] = "0",
Collaborator

Could you document the arguments via an `__init__` docstring?
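
A sketch of what the requested documentation might look like (parameter names follow the diff; the descriptions are illustrative):

    from typing import Optional

    def __init__(
        self,
        *,
        device: Optional[str] = "cuda",
        model_size: Optional[str] = None,
    ) -> None:
        """Initialize the parser.

        Args:
            device: "cpu" forces CPU inference; any other value selects
                CUDA when a GPU is available and falls back to CPU otherwise.
            model_size: faster-whisper model size to load, e.g. "large-v3";
                if None, a default is chosen based on the device.
        """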


    The GitHub repository for faster-whisper is:
    https://github.com/SYSTRAN/faster-whisper

Collaborator

If possible it would be great to include a usage example with the full import, using a `.. code-block:: python`.
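
For instance, a class-level docstring in that style might look like this (illustrative; BaseBlobParser is langchain_core's parser base class, and the file path is a placeholder):

    from langchain_core.document_loaders import BaseBlobParser

    class FasterWhisperParser(BaseBlobParser):
        """Transcribe and parse audio files with faster-whisper.

        Example:

            .. code-block:: python

                from langchain_community.document_loaders.blob_loaders import Blob
                from langchain_community.document_loaders.parsers.audio import (
                    FasterWhisperParser,
                )

                parser = FasterWhisperParser(device="cpu")
                docs = list(parser.lazy_parse(Blob.from_path("audio.mp3")))
        """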

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""

        import io
Collaborator

This can be in the global namespace.

    )

    # Audio file from disk
    audio = AudioSegment.from_file(blob.path)
Collaborator

Could you check that a .path is set on the blob? Blobs do not have to be associated with a file in the file system; they can live just in memory.

Is it possible to handle this using blob.as_bytes()?

Contributor Author

In the blob sent by ffmpeg, there is only the path of the audio file and nothing in its metadata.

Faster-whisper does not transcribe raw bytes; it transcribes only mp3 files. So the parser should ensure that the file sent to it is an mp3 file.

Collaborator

The parsing is decoupled from the code that loads the blob. To be correct, the parser cannot assume how the blob was generated; instead it should program against the blob API (see the link below). The blob may be associated with metadata, and it may not exist on file but instead in memory.

Here's the API for the blob

https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/document_loaders/blob_loaders.py#L40-L40

Let me know if you have any questions!

So things that need to be done here (see the sketch after this list):

  1. Propagate metadata from the blob
  2. Make sure that the parser keeps working if the blob was only specified as binary data (using the data attribute) rather than generated from an existing file
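
A sketch covering both points, assuming pydub for decoding and that self.model is an already-loaded faster-whisper WhisperModel; the names mirror the PR's lazy_parse, but this is illustrative rather than the merged code:

    import io
    from typing import Iterator

    from langchain_core.documents import Document
    from pydub import AudioSegment

    def lazy_parse(self, blob) -> Iterator[Document]:
        """Lazily parse a blob backed by either bytes or a file path."""
        # 2. Accept in-memory bytes as well as a file on disk.
        if blob.data is not None:
            audio = AudioSegment.from_file(io.BytesIO(blob.data))
        elif blob.path is not None:
            audio = AudioSegment.from_file(blob.path)
        else:
            raise ValueError(f"Unable to get audio from blob {blob}")

        # Export to mp3 in memory so faster-whisper receives a format it accepts.
        file_obj = io.BytesIO(audio.export(format="mp3").read())
        segments, _info = self.model.transcribe(file_obj)
        for segment in segments:
            # 1. Propagate the blob's existing metadata alongside the source.
            yield Document(
                page_content=segment.text,
                metadata={"source": blob.source, **(blob.metadata or {})},
            )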

Collaborator

@hulitaitai if you're willing, I can make the needed changes, but I'll need your help testing them.

    for segment in segments:
        yield Document(
            page_content=segment.text,
            metadata={"source": blob.source},
Collaborator

    {
        'source': blob.source,
        **blob.metadata,  # To use existing metadata
    }

        import torch
    except ImportError:
        raise ImportError(
            "torch package not found, please install it with " "`pip install torch`"
Collaborator

Suggested change
    - "torch package not found, please install it with " "`pip install torch`"
    + "torch package not found, please install it with `pip install torch`"

@eyurtsev
Collaborator

eyurtsev commented Apr 4, 2024

I can help with the linter -- I mostly care about some of the other functional changes.

https://python.langchain.com/docs/contributing/code/#formatting-and-linting

`make format` and `make lint` will help.

In the blob sent by ffmpeg, there is only the path of the audio file and nothing in its metadata.

Faster-whisper transcribes only mp3 files, so it should ensure that the file sent to it is an mp3 file.

I also added timestamps in the metadata returned to other LangChain functions.
@hulitaitai
Contributor Author

I can help with the linter -- I mostly care about some of the other functional changes.

https://python.langchain.com/docs/contributing/code/#formatting-and-linting

`make format` and `make lint` will help.

I asked ChatGPT how to use `make format` and `make lint`. I will use them before my next pull request.

Ensure that the variable self.model_size is always assigned a correct value.
    )

    # Audio file from disk
    audio = AudioSegment.from_file(blob.path)

  1. The audio can now come from the data attribute, in the form of bytes, or from the path.
  2. More information is included in the metadata, and the original metadata is propagated.
@hulitaitai
Contributor Author

I have tested the changes with a blob built with "Blob.from_data(data=m4a_bytes)", and the parser works.
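
For reference, a sketch of that in-memory check ("sample.m4a" is a placeholder for any local audio file):

    # Build a blob from raw bytes and make sure the parser still works.
    from langchain_community.document_loaders.blob_loaders import Blob
    from langchain_community.document_loaders.parsers.audio import FasterWhisperParser

    with open("sample.m4a", "rb") as f:
        m4a_bytes = f.read()

    parser = FasterWhisperParser(device="cpu")
    docs = list(parser.lazy_parse(Blob.from_data(data=m4a_bytes)))
    assert docs and docs[0].page_content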

@eyurtsev
Collaborator

eyurtsev commented Apr 18, 2024

Looks great! I'll resolve the linting issue tomorrow!


If you end up working with blobs again, there is a convenience interface called blob.as_bytes_io that will take care of the logic of checking whether the data is in memory or on file:

    @contextlib.contextmanager
    def as_bytes_io(self) -> Generator[Union[BytesIO, BufferedReader], None, None]:
        """Read data as a byte stream."""
        if isinstance(self.data, bytes):
            yield BytesIO(self.data)
        elif self.data is None and self.path:
            with open(str(self.path), "rb") as f:
                yield f
        else:
            raise NotImplementedError(f"Unable to convert blob {self}")
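
Inside the parser, that interface would collapse the bytes-vs-path branching to something like this (illustrative; AudioSegment is pydub's):

    # as_bytes_io yields a file-like object whether the blob holds raw
    # bytes or points at a file on disk.
    with blob.as_bytes_io() as f:
        audio = AudioSegment.from_file(f)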

@eyurtsev eyurtsev enabled auto-merge (squash) April 18, 2024 20:44
@eyurtsev eyurtsev merged commit 7d0a008 into langchain-ai:master Apr 18, 2024
58 of 59 checks passed
hinthornw pushed a commit that referenced this pull request Apr 26, 2024