Starred repositories
[ECCV2024] Learning Video Context as Interleaved Multimodal Sequences
Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
g1: Using Llama-3.1 70b on Groq to create o1-like reasoning chains
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Meshed-Memory Transformer for Image Captioning. CVPR 2020
Code for experiments for "ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy"
MiniCPM3-4B: An edge-side LLM that surpasses GPT-3.5-Turbo.
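A collection of industry-practice articles on search, recommendation, advertising, user growth, and related topics (sources: Zhihu, Datafuntalk, technical WeChat official accounts)
PyTorch Implementation for CVPR 2024 paper: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation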
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Aura is like Siri, but in your browser. An AI voice assistant optimized for low latency responses.
A simple, easy-to-hack GraphRAG implementation
[ECCV2024] ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery 🧑‍🔬
The code of the paper "Negative Pre-aware for Noisy Cross-modal Matching" in AAAI 2024.
Visualizing the attention of vision-language models
The official PyTorch implementation of the CVPR 2023 paper "Contrastive Grouping with Transformer for Referring Image Segmentation".
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making
Projects based on SigLIP (Zhai et al., 2023) and Hugging Face transformers integration 🤗
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
[ICLR 2024] Official repository for "Vision-by-Language for Training-Free Compositional Image Retrieval"
[ICCV2023] UniVTG: Towards Unified Video-Language Temporal Grounding
[MICCAI-2022] This is the official implementation of Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training.
[ICCV 2021- Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-…
This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models"