
Community : Add audio-parser "faster-whisper" in audio.py #20012

Merged
merged 10 commits into langchain-ai:master on Apr 18, 2024

Conversation

hulitaitai
Contributor

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

It can automatically detect the following 14 languages and transcribe the text into their respective languages: en, zh, fr, de, ja, ko, ru, es, th, it, pt, vi, ar, tr.

The GitHub repository for faster-whisper is:
https://github.com/SYSTRAN/faster-whisper
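
For quick reference, a usage sketch of the parser this PR adds (imports reflect langchain_community after the merge; "example.mp3" is a placeholder):

    # A minimal usage sketch; "example.mp3" is a placeholder audio file.
    from langchain_community.document_loaders.blob_loaders import Blob
    from langchain_community.document_loaders.parsers.audio import FasterWhisperParser

    parser = FasterWhisperParser(device="cpu")  # force CPU inference
    for doc in parser.lazy_parse(Blob.from_path("example.mp3")):
        print(doc.page_content, doc.metadata)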

Add the URL of the embedding tool "text2vec".
Fix minor mistakes in the doc-string.
@hulitaitai hulitaitai closed this Apr 4, 2024
@hulitaitai hulitaitai reopened this Apr 4, 2024
@hulitaitai
Contributor Author

I am sorry, I have never passed the lint test successfully. I really don't know what is wrong.

faster-whisper is an open-source, low-cost parser that can run with less than 5 GB of GPU memory.

@eyurtsev eyurtsev self-assigned this Apr 4, 2024

    def __init__(
        self,
        device: Optional[str] = "0",
Collaborator

Suggested change
    - device: Optional[str] = "0",
    + *,
    + device: Optional[str] = "0",
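
For context, the bare * in the suggestion makes every parameter after it keyword-only; a minimal illustration with a hypothetical class:

    class Parser:  # hypothetical stand-in for the parser under review
        def __init__(self, *, device: str = "cpu") -> None:
            self.device = device

    Parser(device="cuda")  # OK: passed by keyword
    # Parser("cuda")       # TypeError: takes 1 positional argument but 2 were given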

Contributor Author

Yes, here I presumed I would reassign the definite value later, so I casually placed a placeholder value.

Additionally, I admit I copied a lot of code and ideas from the parsers above in this file. The code I've written runs well on my PC. I will meticulously re-examine the various arguments and variables you point out, for the sake of the overall system. I'll come back to you after I've finished checking.


    def __init__(
        self,
        device: Optional[str] = "0",
Collaborator

device "0" doesn't look like correct value given that it can only end up being cuda or cpu with logic below.

    )

    # Determine the device to use
    if device == "cpu":
Collaborator

The logic looks like it's ignoring the user's input.
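
A sketch of device resolution that honors the user's choice while still falling back when CUDA is absent (illustrative helper, not the merged code):

    import torch

    def resolve_device(device: str = "cpu") -> str:
        """Return "cpu" when explicitly requested; otherwise prefer CUDA."""
        if device == "cpu":
            return "cpu"
        return "cuda" if torch.cuda.is_available() else "cpu"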


    def __init__(
        self,
        device: Optional[str] = "0",
Collaborator

Could you document the arguments via an `__init__` docstring?
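
A sketch of what the requested documentation might look like (parameter names follow the diff; the descriptions are illustrative):

    from typing import Optional

    def __init__(
        self,
        *,
        device: Optional[str] = "cuda",
        model_size: Optional[str] = None,
    ) -> None:
        """Initialize the parser.

        Args:
            device: "cpu" forces CPU inference; any other value selects
                CUDA when a GPU is available and falls back to CPU otherwise.
            model_size: faster-whisper model size to load, e.g. "large-v3";
                if None, a default is chosen based on the device.
        """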


    The GitHub repository for faster-whisper is:
    https://github.com/SYSTRAN/faster-whisper

Collaborator

If possible it would be great to include a usage example with the full import, using a `.. code-block:: python`.
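
For instance, a class-level docstring in that style might look like this (illustrative; BaseBlobParser is langchain_core's parser base class, and the file path is a placeholder):

    from langchain_core.document_loaders import BaseBlobParser

    class FasterWhisperParser(BaseBlobParser):
        """Transcribe and parse audio files with faster-whisper.

        Example:

            .. code-block:: python

                from langchain_community.document_loaders.blob_loaders import Blob
                from langchain_community.document_loaders.parsers.audio import (
                    FasterWhisperParser,
                )

                parser = FasterWhisperParser(device="cpu")
                docs = list(parser.lazy_parse(Blob.from_path("audio.mp3")))
        """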

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""

        import io
Collaborator

This can be in the global namespace.

    )

    # Audio file from disk
    audio = AudioSegment.from_file(blob.path)
Collaborator

Could you check that a .path is set on the blob? Blobs do not have to be associated with a file in the file system; they can live just in memory.

Is it possible to handle this using blob.as_bytes()?

Contributor Author

In the blob sent by ffmpeg, there is only the path of the audio file and nothing in its metadata.

Faster-whisper does not transcribe raw bytes; it transcribes only mp3 files. So the parser should ensure that the file sent to it is an mp3 file.

Collaborator

The parsing is decoupled from the code that loads the blob. To be correct, the parser cannot assume how the blob was generated; instead it should program against the blob API (see the link below). The blob may be associated with metadata, and it may not exist on file but instead in memory.

Here's the API for the blob

https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/document_loaders/blob_loaders.py#L40-L40

Let me know if you have any questions!

So things that need to be done here (see the sketch after this list):

  1. Propagate metadata from the blob
  2. Make sure that the parser keeps working if the blob was only specified as binary data (using the data attribute) rather than generated from an existing file
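
A sketch covering both points, assuming pydub for decoding and that self.model is an already-loaded faster-whisper WhisperModel; the names mirror the PR's lazy_parse, but this is illustrative rather than the merged code:

    import io
    from typing import Iterator

    from langchain_core.documents import Document
    from pydub import AudioSegment

    def lazy_parse(self, blob) -> Iterator[Document]:
        """Lazily parse a blob backed by either bytes or a file path."""
        # 2. Accept in-memory bytes as well as a file on disk.
        if blob.data is not None:
            audio = AudioSegment.from_file(io.BytesIO(blob.data))
        elif blob.path is not None:
            audio = AudioSegment.from_file(blob.path)
        else:
            raise ValueError(f"Unable to get audio from blob {blob}")

        # Export to mp3 in memory so faster-whisper receives a format it accepts.
        file_obj = io.BytesIO(audio.export(format="mp3").read())
        segments, _info = self.model.transcribe(file_obj)
        for segment in segments:
            # 1. Propagate the blob's existing metadata alongside the source.
            yield Document(
                page_content=segment.text,
                metadata={"source": blob.source, **(blob.metadata or {})},
            )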

Collaborator

@hulitaitai if you're willing, I can make the needed changes, but I'll need your help testing them.

    for segment in segments:
        yield Document(
            page_content=segment.text,
            metadata={"source": blob.source},
Collaborator

    {
        'source': blob.source,
        **blob.metadata,  # To use existing metadata
    }

        import torch
    except ImportError:
        raise ImportError(
            "torch package not found, please install it with " "`pip install torch`"
Collaborator

Suggested change
    - "torch package not found, please install it with " "`pip install torch`"
    + "torch package not found, please install it with `pip install torch`"

@eyurtsev
Collaborator

eyurtsev commented Apr 4, 2024

I can help with the linter -- I mostly care about some of the other functional changes.

https://python.langchain.com/docs/contributing/code/#formatting-and-linting

`make format` and `make lint` will help.

In the blob sent by ffmpeg, there is only the path of the audio file and nothing in its metadata.

Faster-whisper transcribes only mp3 files, so it should ensure that the file sent to it is an mp3 file.

I also added timestamps in the metadata returned to other LangChain functions.
@hulitaitai
Contributor Author

I can help with the linter -- I mostly care about some of the other functional changes.

https://python.langchain.com/docs/contributing/code/#formatting-and-linting

`make format` and `make lint` will help.

I asked ChatGPT how to use `make format` and `make lint`. I will use them before my next pull request.

Ensure that the variable self.model_size is always assigned a correct value.
    )

    # Audio file from disk
    audio = AudioSegment.from_file(blob.path)

  1. The audio can now come from the data attribute, in the form of bytes, or from the path.
  2. More information is included in the metadata, and the original metadata is propagated.
@hulitaitai
Contributor Author

I have tested the changes with a blob built with "Blob.from_data(data=m4a_bytes)", and the parser works.
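
For reference, a sketch of that in-memory check ("sample.m4a" is a placeholder for any local audio file):

    # Build a blob from raw bytes and make sure the parser still works.
    from langchain_community.document_loaders.blob_loaders import Blob
    from langchain_community.document_loaders.parsers.audio import FasterWhisperParser

    with open("sample.m4a", "rb") as f:
        m4a_bytes = f.read()

    parser = FasterWhisperParser(device="cpu")
    docs = list(parser.lazy_parse(Blob.from_data(data=m4a_bytes)))
    assert docs and docs[0].page_content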

@eyurtsev
Collaborator

eyurtsev commented Apr 18, 2024

Looks great! I'll resolve the linting issue tomorrow!


If you end up working with blobs again, there is a convenience interface called blob.as_bytes_io that will take care of the logic of checking whether the data is in memory or on file:

    @contextlib.contextmanager
    def as_bytes_io(self) -> Generator[Union[BytesIO, BufferedReader], None, None]:
        """Read data as a byte stream."""
        if isinstance(self.data, bytes):
            yield BytesIO(self.data)
        elif self.data is None and self.path:
            with open(str(self.path), "rb") as f:
                yield f
        else:
            raise NotImplementedError(f"Unable to convert blob {self}")
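
Inside the parser, that interface would collapse the bytes-vs-path branching to something like this (illustrative; AudioSegment is pydub's):

    # as_bytes_io yields a file-like object whether the blob holds raw
    # bytes or points at a file on disk.
    with blob.as_bytes_io() as f:
        audio = AudioSegment.from_file(f)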

@eyurtsev eyurtsev enabled auto-merge (squash) April 18, 2024 20:44
@eyurtsev eyurtsev merged commit 7d0a008 into langchain-ai:master Apr 18, 2024
58 of 59 checks passed
hinthornw pushed a commit that referenced this pull request Apr 26, 2024