@EvolvingLMMs-Lab

LMMs-Lab

Feeling and building multimodal intelligence.

LMMs-Lab: Building Multimodal Intelligence

We are a group of researchers focused on large multimodal models (LMMs). We aim to share insights from our research with the community.

Here are a few of our projects.

We're on an exciting journey toward Artificial General Intelligence (AGI), pursued with an enthusiasm much like that surrounding the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.

To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks help us understand the capabilities of these models and show how close we are to achieving AGI. To make this kind of evaluation consistent and efficient, we introduce lmms-eval, an evaluation framework meticulously crafted for LMMs.
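As a rough illustration of what "consistent" evaluation means, the sketch below scores one model on several benchmarks with a single shared normalization and scoring rule. It is not the lmms-eval API: the names `Example`, `Model`, `normalize`, and `evaluate`, the toy benchmarks, and the exact-match metric are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; this is NOT the lmms-eval API.
Example = Tuple[str, str]      # (prompt, reference answer)
Model = Callable[[str], str]   # maps a prompt to a predicted answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so every benchmark is scored the same way."""
    return " ".join(text.lower().split())


def evaluate(model: Model, benchmarks: Dict[str, List[Example]]) -> Dict[str, float]:
    """Return exact-match accuracy per benchmark, using one shared scoring rule."""
    scores: Dict[str, float] = {}
    for name, examples in benchmarks.items():
        if not examples:
            # An empty benchmark gets a score of 0.0 rather than dividing by zero.
            scores[name] = 0.0
            continue
        correct = sum(
            normalize(model(prompt)) == normalize(answer)
            for prompt, answer in examples
        )
        scores[name] = correct / len(examples)
    return scores


if __name__ == "__main__":
    # Toy data, invented for this sketch.
    toy_benchmarks = {
        "toy_counting": [
            ("How many legs does a cat have?", "4"),
            ("How many wheels does a bike have?", "2"),
        ],
        "toy_colors": [("What color is a clear daytime sky?", "Blue")],
    }

    def toy_model(prompt: str) -> str:
        # A stand-in "model" that only knows two answers.
        return "4" if "cat" in prompt else "blue"

    print(evaluate(toy_model, toy_benchmarks))
```

Keeping normalization and scoring in one place is the design point here: every model and benchmark passes through the same code path, so scores stay comparable.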

We expanded the LLaVA-NeXT series with recent, stronger open LLMs and report our findings on these more capable language models. We maintain an efficient training strategy like that of previous LLaVA models, and we performed supervised fine-tuning on the same data used for the previous LLaVA-NeXT 7B/13B/34B models. Our current largest model, LLaVA-NeXT-110B, was trained on 128 H800-80G GPUs for 18 hours.
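To put the training figure in more familiar units, the short calculation below converts it to GPU-hours. It assumes all 128 GPUs were in use for the full 18 hours; the variable names are ours.

```python
# Figures taken from the text; full utilization of every GPU is an assumption.
num_gpus = 128          # H800-80G GPUs
wall_clock_hours = 18   # reported training time

gpu_hours = num_gpus * wall_clock_hours
print(gpu_hours)  # 2304
```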

With support from stronger LLMs, LLaVA-NeXT achieves consistently better performance than prior open-source LMMs simply by increasing LLM capability. It catches up to GPT4-V on selected benchmarks.

We report detailed ablations, including architectural modifications, enlarged visual tokens, and varied training strategies, to explore potential improvements in LLaVA-NeXT's performance.

We explore LLaVA-NeXT's capabilities in video understanding tasks, highlighting its strong performance. Key improvements include:

SoTA performance: Without seeing any video data, LLaVA-NeXT demonstrates strong zero-shot modality transfer, outperforming all existing open-source LMMs (e.g., LLaMA-VID) that were specifically trained on videos. Compared with proprietary models, it achieves performance comparable to Gemini Pro on NextQA and ActivityNet-QA.

Strong length generalization: Despite being trained under a 4096-token sequence length limit, LLaVA-NeXT generalizes to longer sequences. This keeps performance robust even on long-frame content that exceeds the original token limit.

DPO pushes performance: DPO with AI feedback on videos yields significant performance gains.
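For readers unfamiliar with DPO (Direct Preference Optimization), the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It follows the general published DPO objective, not necessarily the exact setup used for LLaVA-NeXT. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state one
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are total log-probabilities of each response under the policy
    being trained and under a frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a form that avoids overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy prefers the chosen response more than the reference does,
    # so the loss falls below log(2), its value at zero margin.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The loss shrinks as the policy raises the preferred response's likelihood relative to the rejected one, measured against the reference model. In the setting above, the preference pairs would come from AI feedback on video responses.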

Pinned

  1. lmms-eval Public

    Accelerating the development of large multimodal models (LMMs) with lmms-eval

    Python 1k 55

Repositories

  • LongVA Public

    Long Context Transfer from Language to Vision

    Python 172 Apache-2.0 10 7 0 Updated Jul 3, 2024
  • lmms-eval Public

    Accelerating the development of large multimodal models (LMMs) with lmms-eval

  • .github Public
    0 0 0 0 Updated Jun 22, 2024
  • sglang Public Forked from sgl-project/sglang

    SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.

    Python 1 Apache-2.0 181 0 0 Updated May 23, 2024
