Community : Add audio-parser "faster-whisper" in audio.py #20012
Conversation
Add that URL of the embedding tool "text2vec". Fix minor mistakes in the doc-string.
faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU. It can automatically detect the following 14 languages and transcribe the text into their respective languages: en, zh, fr, de, ja, ko, ru, es, th, it, pt, vi, ar, tr. The GitHub repository for faster-whisper is: https://github.com/SYSTRAN/faster-whisper
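As a hedged illustration of the description above, a minimal transcription helper using the faster-whisper library might look like this; the model size, file path, and helper names are illustrative, not part of the PR:

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one transcription segment as a "[start -> end] text" line."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


def transcribe_file(path: str, model_size: str = "large-v3"):
    """Transcribe an audio file with faster-whisper.

    Returns a list of formatted segment lines, or None when the
    faster-whisper library is not installed.
    """
    try:
        from faster_whisper import WhisperModel  # pip install faster-whisper
    except ImportError:
        return None
    # int8 quantization lowers memory use on both CPU and GPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() yields segments plus info about the detected language.
    segments, info = model.transcribe(path)
    print(f"Detected language: {info.language}")
    return [format_segment(s.start, s.end, s.text) for s in segments]
```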
I am sorry, I have never passed the lint test successfully, and I really don't know what is wrong. faster-whisper is an open-source parser, and a low-cost one: it can run with less than 5 GB of GPU memory.
```python
def __init__(
    self,
    device: Optional[str] = "0",
```
```diff
- device: Optional[str] = "0",
+ *,
+ device: Optional[str] = "0",
```
Yes, here I presumed I would reassign the definite value later, so I casually placed a value there.
Additionally, I admit I copied a lot of code and ideas from the parsers above in this file. The code I've written runs well on my PC. I will meticulously re-examine how the values are passed and the variables, as you suggested, for the sake of the overall system. I'll come back to you after I've finished checking.
```python
def __init__(
    self,
    device: Optional[str] = "0",
```
The device `"0"` doesn't look like a correct value, given that it can only end up being `"cuda"` or `"cpu"` with the logic below.
```python
# Determine the device to use
if device == "cpu":
```
The logic looks like it's ignoring the input from the user.
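A sketch of device resolution that honors an explicit user choice; the function name and the auto-detect policy are assumptions, not the PR's code, and CUDA availability is passed in as a flag (normally `torch.cuda.is_available()`) so the logic is testable without a GPU:

```python
from typing import Optional


def resolve_device(device: Optional[str], cuda_available: bool) -> str:
    """Resolve the requested device while honoring explicit user input."""
    if device is None:
        # Only auto-detect when the user expressed no preference.
        return "cuda" if cuda_available else "cpu"
    if device == "cuda" and not cuda_available:
        raise ValueError("CUDA was requested but no GPU is available.")
    if device not in ("cpu", "cuda"):
        raise ValueError(f"Unsupported device: {device!r}")
    return device
```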
```python
def __init__(
    self,
    device: Optional[str] = "0",
```
Could you document the arguments via an `__init__` docstring?
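A hedged sketch of what such documented arguments could look like; the class name comes from the PR title, while the exact signature and defaults are assumptions:

```python
from typing import Optional


class FasterWhisperParser:
    """Sketch of the parser class from this PR (signature is illustrative)."""

    def __init__(
        self,
        *,
        device: Optional[str] = None,
        model_size: Optional[str] = None,
    ) -> None:
        """Initialize the parser.

        Args:
            device: Device to run the model on, "cpu" or "cuda".
                Defaults to "cuda" when a GPU is available, otherwise "cpu".
            model_size: Whisper model size to load, e.g. "large-v3".
                Defaults to a size chosen by the parser.
        """
        self.device = device
        self.model_size = model_size
```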
```python
The GitHub repository for faster-whisper is:
https://github.com/SYSTRAN/faster-whisper
```
If possible, it would be great to include a usage example with the full import, using a `.. code-block:: python` directive.
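A sketch of the kind of docstring example the reviewer is asking for; the import paths follow `langchain_community`'s usual layout but are assumptions here, not text from the PR:

```python
class FasterWhisperParser:
    """Transcribe audio with the faster-whisper model.

    Example:

        .. code-block:: python

            from langchain_community.document_loaders.generic import GenericLoader
            from langchain_community.document_loaders.parsers.audio import (
                FasterWhisperParser,
            )

            loader = GenericLoader.from_filesystem(
                "./audio-files",
                parser=FasterWhisperParser(),
            )
            documents = loader.load()
    """
```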
```python
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
    """Lazily parse the blob."""

    import io
```
The `import io` can be moved to the global namespace.
```python
# Audio file from disk
audio = AudioSegment.from_file(blob.path)
```
Could you check that a `.path` is set on the blob? Blobs do not have to be associated with a file in the file system, as they can live just in memory.
Is it possible to handle this using `blob.as_bytes()`?
In the blob sent by ffmpeg, there is only the path of the audio file and nothing in its metadata.
faster-whisper does not transcribe raw bytes; it transcribes only mp3 files, so the parser should ensure that the file sent to it is an mp3 file.
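If an mp3 payload really is required, pydub can re-encode entirely in memory; this sketch assumes pydub's `AudioSegment.export` API (which shells out to ffmpeg), and the helper name is illustrative:

```python
import io


def to_mp3_buffer(audio) -> io.BytesIO:
    """Re-encode a pydub AudioSegment into an in-memory mp3 buffer.

    No temporary file is written to disk; the buffer is rewound so it
    can be read immediately by a downstream consumer.
    """
    buffer = io.BytesIO()
    audio.export(buffer, format="mp3")
    buffer.seek(0)
    return buffer
```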
The parsing is decoupled from the code that loads the blob. To be correct, the parser cannot assume how the blob was generated; instead it should program against the blob API (see link below). The blob may be associated with metadata, and it may not exist on file but instead in memory.
Here's the API for the blob
Let me know if you have any questions!
So things that need to be done here:
- Propagate metadata from the blob
- Make sure that the parser keeps working if the blob was only specified as binary data (using the data attribute) rather than generated from an existing file
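Both requested changes can be sketched together; the helper name is hypothetical, `model.transcribe` follows the faster-whisper signature of returning `(segments, info)`, and documents are shown as plain dicts to keep the sketch self-contained:

```python
import io


def parse_blob(blob, model):
    """Sketch of a parse loop addressing both review points.

    It reads audio via blob.as_bytes(), so purely in-memory blobs work,
    and it merges blob.metadata into each yielded document's metadata.
    """
    segments, _info = model.transcribe(io.BytesIO(blob.as_bytes()))
    for segment in segments:
        yield {
            "page_content": segment.text,
            # Propagate existing blob metadata alongside the source.
            "metadata": {**(blob.metadata or {}), "source": blob.source},
        }
```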
@hulitaitai if you're willing, I can make the needed changes, but I'll need your help to test them.
```python
for segment in segments:
    yield Document(
        page_content=segment.text,
        metadata={"source": blob.source},
```
```python
{
    "source": blob.source,
    **blob.metadata,  # To use existing metadata
}
```
```python
try:
    import torch
except ImportError:
    raise ImportError(
        "torch package not found, please install it with " "`pip install torch`"
```
```diff
- "torch package not found, please install it with " "`pip install torch`"
+ "torch package not found, please install it with `pip install torch`"
```
I can help with the linter -- I mostly care about some of the other functional changes: https://python.langchain.com/docs/contributing/code/#formatting-and-linting
In the blob sent by ffmpeg, there is only the path of the audio file and nothing in its metadata. faster-whisper transcribes only mp3 files, so the parser should ensure that the file sent to it is an mp3 file. I also added timestamps to the metadata returned to other LangChain functions.
I asked ChatGPT how to use `make format` and `make lint`. I will run them before my next pull request.
Ensure that the variable self.model_size is always assigned a correct value.
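One way to guarantee that `self.model_size` always holds a correct value is to validate on assignment; the function name and the list of standard Whisper sizes here are illustrative assumptions:

```python
from typing import Optional

# Standard Whisper model sizes accepted by faster-whisper (illustrative list).
_MODEL_SIZES = {"tiny", "base", "small", "medium", "large-v1", "large-v2", "large-v3"}


def resolve_model_size(
    model_size: Optional[str] = None, default: str = "large-v3"
) -> str:
    """Fall back to a default and reject unknown model names early."""
    if model_size is None:
        return default
    if model_size not in _MODEL_SIZES:
        raise ValueError(f"Unknown model size: {model_size!r}")
    return model_size
```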
1. The audio can come from the data in the form of bytes, or from the path. 2. More information is included in the metadata, and the original metadata is propagated.
I have tested the changes with a blob built with `Blob.from_data(data=m4a_bytes)`, and the parser works.
Looks great! I'll resolve the linting issue tomorrow! If you end up working with
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>