Stars
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
mPLUG-HalOwl: Multimodal Hallucination Evaluation and Mitigation
PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
iBOT 🤖: Image BERT Pre-Training with Online Tokenizer (ICLR 2022)
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI
Official implementation of the Law of Vision Representation in MLLMs
⚡️HivisionIDPhotos: a lightweight and efficient AI tool and algorithm for making ID photos.
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images (AAAI 2023)
MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
Fast and memory-efficient exact attention
Official Repository of MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations
Stack Solver is an app for the optimisation of palletizing and shipping items.
🏞️ PicX is an image-hosting tool built on the GitHub API, providing image upload and hosting, image link generation, and a toolbox of common image utilities.
Animated sprite editor & pixel art tool (Windows, macOS, Linux)
Code for Fast Training of Diffusion Models with Masked Transformers
Official Repo for the paper: VCR: Visual Caption Restoration. Check arxiv.org/pdf/2406.06462 for details.
Anole: An Open, Autoregressive and Native Multimodal Model for Interleaved Image-Text Generation
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
A PyTorch implementation of MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis