Merge pull request #31 from lm-sys/longchat-update
update
DachengLi1 authored Jul 1, 2023
2 parents 1e9237f + 949d9c8 commit d70ab64
Showing 3 changed files with 9 additions and 9 deletions.
blog/2023-06-29-longchat.md (18 changes: 9 additions & 9 deletions)
@@ -6,7 +6,7 @@ previewImg: /images/blog/longchat/topic_retrieval_preview.png
---

In this blogpost, we introduce our latest series of chatbot models, LongChat-7B and LongChat-13B, featuring a new level of extended context length up to 16K tokens.
- Evaluation results show that the long-range retrieval accuracy of LongChat-13B is up to 2x higher than other long context open models such as MPT-7B-storywriter (65K), MPT-30B-chat (8K), and ChatGLM2-6B (32k).
+ Evaluation results show that the long-range retrieval accuracy of LongChat-13B is up to 2x higher than other long-context open models such as MPT-7B-storywriter (84K), MPT-30B-chat (8K), and ChatGLM2-6B (8k).
LongChat shows promising results in closing the gap between open models and proprietary long context models such as Claude-100K and GPT-4-32K.

<img src="/images/blog/longchat/topic_retrieval.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 1000px;"></img>
@@ -25,9 +25,9 @@ python3 -m fastchat.serve.cli --model-path lmsys/longchat-7b-16k
There has been a significant surge of interest within the open-source community in developing language models with longer context or extending the context length of existing models like LLaMA.
This trend has led to interesting observations and extensive discussions in various sources, such as [Kaiokendev’s blog](https://kaiokendev.github.io/context) and this [arXiv manuscript](https://arxiv.org/pdf/2306.15595.pdf);
meanwhile, several models have been released claiming to support much longer context than LLaMA; notable ones include:
- - [MPT-7B-storywriter](https://huggingface.co/mosaicml/mpt-7b-storywriter) supports 65K context length and extrapolates to 80K.
+ - [MPT-7B-storywriter](https://huggingface.co/mosaicml/mpt-7b-storywriter) supports 65K context length and extrapolates to 84K.
- [MPT-30B-chat](https://huggingface.co/spaces/mosaicml/mpt-30b-chat) supports 8K context length.
- - [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b) supports 32K context.
+ - [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b) supports 8K context.
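
The extension techniques discussed in these sources generally work by rescaling ("interpolating" or "condensing") the rotary position embedding so that a longer sequence maps back into the position range the model saw during pre-training. The sketch below illustrates only that general idea; the function names, head dimension, and the 2048-to-16384 ratio are illustrative assumptions rather than the exact recipe behind any particular model.

```python
# Minimal sketch of RoPE position interpolation / condensing.
# All names and the 8x ratio here are illustrative assumptions.
import torch

def rope_inverse_frequencies(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for an attention head of size `dim`.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def rope_angles(seq_len: int, dim: int, condense_ratio: float = 1.0) -> torch.Tensor:
    # With condense_ratio > 1, position t is treated as t / condense_ratio,
    # so a 16K-token context reuses the 0..2K position range that a model
    # pre-trained on 2048 tokens has already seen (16384 / 2048 = 8 here).
    positions = torch.arange(seq_len).float() / condense_ratio
    return torch.outer(positions, rope_inverse_frequencies(dim))

# The condensed angle at position 8*t matches the original angle at position t.
original = rope_angles(2048, dim=128)
condensed = rope_angles(16384, dim=128, condense_ratio=8.0)
assert torch.allclose(condensed[8 * 100], original[100])
```
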

At LMSYS Org, we have been concurrently exploring various techniques to lengthen the context of our models like [Vicuna](https://huggingface.co/lmsys/vicuna-13b-v1.3).
In this blogpost, alongside the release of the LongChat series, we share our [evaluation tools](https://github.com/DachengLi1/LongChat) to verify the long-context capability of LLMs.
@@ -73,8 +73,8 @@ We finetune the model using standard next-token prediction loss. We fine-tune th
To save memory, we use PyTorch FSDP and Flash Attention. Assuming an A100 costs $3/hour on the cloud, the 7B model costs ~$300 to fine-tune (roughly 100 A100-hours), and the 13B model costs ~$700.
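
For concreteness, a minimal sketch of this recipe (standard next-token prediction loss, with the model wrapped in PyTorch FSDP) is shown below. The checkpoint name, learning rate, and data loader are placeholders, and enabling Flash Attention is left to the model's attention implementation; treat this as an assumption-laden illustration rather than the exact training script.

```python
# Hypothetical sketch of long-context fine-tuning with next-token prediction
# loss and PyTorch FSDP. Checkpoint name, hyperparameters, and the data
# loader are placeholders, not the actual LongChat training configuration.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

def finetune(rank: int, world_size: int, data_loader, num_steps: int = 1000):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = AutoModelForCausalLM.from_pretrained("path/to/base-model")  # placeholder
    model = FSDP(model.cuda())  # shard parameters, gradients, optimizer state
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    for step, batch in enumerate(data_loader):
        if step >= num_steps:
            break
        input_ids = batch["input_ids"].cuda()  # (batch, up to 16K tokens)
        # For causal LMs, passing labels=input_ids yields the standard
        # shifted next-token cross-entropy loss.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```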

## Evaluation toolkits: LongEval
- Recently, commercial and open-source models have continued to tout their abilities to support expanded context length (from 8K, 32K, to 100K) in their latest releases, but how can we verify these claims?
- The term "long-context capability" can mean different things for different model providers. For instance, does [MPT-7B-StoryWriter's](https://huggingface.co/mosaicml/mpt-7b-storywriter) advertised 65K context length operate at the same capacity as OpenAI’s ChatGPT at 16K?
+ Recently, commercial and open-source models have continued to tout their abilities to support expanded context length (from 8K, 32K, 84K, to 100K) in their latest releases, but how can we verify these claims?
+ The term "long-context capability" can mean different things for different model providers. For instance, does [MPT-7B-StoryWriter's](https://huggingface.co/mosaicml/mpt-7b-storywriter) advertised 84K context length operate at the same capacity as OpenAI’s ChatGPT at 16K?
This issue is also prevalent in the development of our LongChat models: how do we swiftly and effectively confirm whether a freshly trained model can handle the intended context length?

To address this, we can base our evaluations on tasks that require LLMs to process lengthy contexts, such as text generation, retrieval, summarization, and information association in long text sequences.
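
As a toy illustration of what such a retrieval-style test can look like, the sketch below buries a list of topics in filler text and asks the model for the first one, in the spirit of the coarse-grained topic retrieval test discussed below. The prompt wording, filler, and containment-based scoring are assumptions made for illustration; the exact task formats live in the linked evaluation repo.

```python
# Toy long-context retrieval test: hide many topics in a long prompt and ask
# the model to recall the first one. The prompt format and filler text are
# illustrative assumptions only, not LongEval's exact format.
import random

def build_topic_retrieval_prompt(topics, filler_words=400):
    """Return (prompt, expected_answer); the first topic is the answer."""
    chunks = []
    for topic in topics:
        filler = " ".join(random.choices(["alpha", "beta", "gamma", "delta"], k=filler_words))
        chunks.append(f"Topic: {topic}\n{filler}\n")
    question = "What is the first topic mentioned above? Answer with the topic only."
    return "\n".join(chunks) + "\n" + question, topics[0]

def is_correct(model_output, expected):
    # Simple containment check; a real harness may normalize more carefully.
    return expected.lower() in model_output.lower()

prompt, answer = build_topic_retrieval_prompt([f"topic-{i}" for i in range(25)])
print(len(prompt.split()), "words; expected answer:", answer)
```
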
@@ -145,13 +145,13 @@ In particular, we consider four open-sourced models and two proprietary models,

### LongEval results

- From the coarse-grained topic retrieval test results (Figure 2 at the beginning), we already observe the questionable performance of open-source long-context models. For instance, Mpt-7b-storywriter claims to have a context length of 84K but barely achieves 50% accuracy even at one-fourth of its claimed context length (16K).
- chatglm2-6B cannot reliably retrieve the first topic even at the length of 6K (46% accuracy). Its accuracy falls to almost 0% when tested on > 10K context length. On the other hand, we observed that our LongChat-13B-16K model reliably retrieves the first topic, with comparable accuracy to gpt-3.5-turbo.
+ From the coarse-grained topic retrieval test results (Figure 2 at the beginning), we already observe the problematic performance of open-source long-context models. For instance, Mpt-7b-storywriter claims to have a context length of 84K but barely achieves 50% accuracy even at one-fifth of its claimed context length (16K).
+ chatglm2-6B cannot reliably retrieve the first topic at the length of 6K (46% accuracy). On the other hand, we observed that our LongChat-13B-16K model reliably retrieves the first topic, with comparable accuracy to gpt-3.5-turbo.

<img src="/images/blog/longchat/line_retrieval.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 1000px;"></img>
<p style="color:gray; text-align: center;">Figure 3: Accuracy on the long-range line retrieval task.</p>

- In the finer-grained line retrieval test, Mpt-7b-storywriter performs even worse than in the coarse-grained cases, dropping accuracy from ~50% to ~30%. Chatglm2-6B also observes degradation and does not perform well at the shortest length we test (5K context length). In contrast, we observe that LongChat-13B-16K performs reliably, achieving near gpt-3.5/Anthropic-claude ability within 12K context length (we also find the preview version is not perfect at 12K-16K, see discussion section).
+ In the finer-grained line retrieval test, Mpt-7b-storywriter performs even worse than in the coarse-grained cases, dropping accuracy from ~50% to ~30%. Chatglm2-6B also observes degradation and does not perform well at 5K context length (32%). We notice that ChatGLM2-6B states that it has not yet been fully optimized for single-turn long document understanding, which could explain its current weak performance on LongEval. In contrast, we observe that LongChat-13B-16K performs reliably, achieving near gpt-3.5/Anthropic-claude ability within 12K context length (we also find the preview version is not perfect at 12K-16K, see the discussion section).
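
The accuracies reported here (and plotted in Figures 2 and 3) reduce to a simple aggregate: the fraction of prompts at a given context length for which the model's output contains the expected item. A schematic scorer, assuming a generic generate(prompt) callable as a stand-in for the model under test, might look like:

```python
# Schematic per-length accuracy computation for a retrieval-style test.
# `generate` is a placeholder for whatever API serves the model under test;
# `cases` maps a target context length to (prompt, expected_answer) pairs.
from typing import Callable, Dict, List, Tuple

def accuracy_by_length(
    generate: Callable[[str], str],
    cases: Dict[int, List[Tuple[str, str]]],
) -> Dict[int, float]:
    results = {}
    for length, pairs in cases.items():
        correct = sum(expected.lower() in generate(prompt).lower()
                      for prompt, expected in pairs)
        results[length] = correct / len(pairs)
    return results
```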


**Disentangle irrelevant LLM abilities in LongEval**
@@ -213,7 +213,7 @@ In our evaluations, commercial long-context models always fulfill their promise:
<tr> <td><a target="_blank" href="https://huggingface.co/lmsys/longchat-13b-16k">LongChat-13B-16K </a> <td>16K</td> <td>⭐⭐⭐</td> <td>⭐⭐⭐</td> <td>⭐⭐</td></tr>
<tr> <td><a target="_blank" href="https://huggingface.co/mosaicml/mpt-30b-chat">MPT-30B-chat</a></td> <td>8K</td> <td>⭐⭐⭐</td> <td>⭐⭐⭐</td> <td>⭐⭐</td></tr>
<tr> <td><a target="_blank" href="https://huggingface.co/mosaicml/mpt-7b-storywriter">MPT-7B-storywriter</a></td> <td>80K</td> <td>⭐⭐⭐</td> <td>⭐⭐</td> <td>⭐</td></tr>
- <tr> <td><a target="_blank" href="https://huggingface.co/THUDM/chatglm2-6b">ChatGLM2-6B</a></td> <td>8K</td> <td>⭐⭐⭐</td> <td>⭐</td> <td>⭐</td></tr>
+ <tr> <td><a target="_blank" href="https://huggingface.co/THUDM/chatglm2-6b">ChatGLM2-6B</a></td> <td>8K</td> <td>⭐⭐⭐</td> <td>⭐</td> <td>⭐</td></tr>
<tr> <td><a target="_blank" href="https://chat.openai.com/">GPT-3.5-turbo</a></td> <td>16K</td> <td>⭐⭐⭐</td> <td>⭐⭐⭐</td> <td>⭐⭐⭐</td></tr>
<tr> <td><a target="_blank" href="https://www.anthropic.com/index/introducing-claude">Anthropic Claude-1.3</a></td> <td>100K</td> <td>⭐⭐⭐</td> <td>⭐⭐⭐</td> <td>⭐⭐⭐</td></tr>
</tbody>
Binary file modified public/images/blog/longchat/line_retrieval.png
Binary file modified public/images/blog/longchat/topic_retrieval.png
