Stars
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Music repair method to convert lossy MP3 compressed music to lossless music.
Training a music tagging model with the Accelerate framework on multi-node, multi-GPU setups
Text-to-Music Generation with Rectified Flow Transformers
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI
Open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output conversational capabilities.
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
Repository for training models for music source separation.
Utilities intended for use with Llama models.
This repository contains metadata of the WavCaps dataset and code for downstream tasks.
Ultra-low bitrate neural audio codec (0.31~1.40 kbps) with better semantics in the latent space.
Encoder of BAT (Learning to Reason about Spatial Sounds with Large Language Models)
Code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
MiniSora: A community that aims to explore the implementation path and future development direction of Sora.
Python library for downloading, loading & working with sound datasets
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
MARS5 speech model (TTS) from CAMB.AI
Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable…
A large-scale dataset of caption-annotated MIDI files.
A generative speech model for daily dialogue.
A multi-voice TTS system trained with an emphasis on quality
This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to this project.
Clustering routines for the unit sphere
Speech, Language, Audio, Music Processing with Large Language Model
Models and code for RepCodec: A Speech Representation Codec for Speech Tokenization
Lumina-T2X is a unified framework for Text to Any Modality Generation