Stars
Aligning LMMs with Factually Augmented RLHF
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output conversational capabilities.
[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | A bilingual Chinese-English multimodal large model series based on the CPM foundation model
ImageBind: One Embedding Space to Bind Them All
[ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
A book for getting started with Phi-3, a family of open AI models developed by Microsoft. Phi-3 models are the most capable and cost-effective small language models (SLMs) avai…
A flexible and efficient codebase for training visually-conditioned language models (VLMs)
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
Python packaging and dependency management made easy
Talk to any LLM with hands-free voice interaction, voice interruption, Live2D talking face, and long-term memory, running locally across platforms
Open-Waifu: an open-sourced, finetunable, customizable, simpable AI waifu inspired by Neuro-sama
A project that extracts character conversations from Genshin Impact
A project that extracts Honkai: Star Rail text corpus
1 minute of voice data can also be used to train a good TTS model! (few-shot voice cloning)
Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
litagin02 / Style-Bert-VITS2
Forked from fishaudio/Bert-VITS2
Style-Bert-VITS2: Bert-VITS2 with more controllable voice styles.
Inference and training library for high-quality TTS models.
Zero-Shot Speech Editing and Text-to-Speech in the Wild