{"pageProps":{"frontmatter":{"title":"Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings","author":"Lianmin Zheng*, Ying Sheng*, Wei-Lin Chiang, Hao Zhang, Joseph E. Gonzalez, Ion Stoica","date":"May 3, 2023","previewImg":"/images/blog/arena/cover.png"},"content":"\r\nWe present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in chess and other competitive games. We invite the entire community to join this effort by contributing new models and evaluating them by asking questions and voting for your favorite answer.\r\n\r\n<style>\r\nth {text-align: left}\r\ntd {text-align: left}\r\n</style>\r\n\r\n<br>\r\n<p style=\"color:gray; text-align: center;\">Table 1. LLM Leaderboard (Timeframe: April 24 - May 1, 2023). The latest and detailed version <a href=\"https://chat.lmsys.org/?leaderboard\" target=\"_blank\">here</a>.</p>\r\n<table style=\"display: flex; justify-content: center;\" align=\"left\" >\r\n<tbody>\r\n<tr>\r\n<th>Rank</th> <th>Model</th> <th>Elo Rating</th> <th>Description</th>\r\n</tr>\r\n<tr>\r\n<td>1</td> <td>🥇 <a href=\"https://lmsys.org/blog/2023-03-30-vicuna/\" target=\"_blank\">vicuna-13b</a></td> <td>1169</td> <td>a chat assistant fine-tuned from LLaMA on user-shared conversations by LMSYS</td>\r\n</tr>\r\n<tr>\r\n<td>2</td> <td>🥈 <a href=\"https://bair.berkeley.edu/blog/2023/04/03/koala\" target=\"_blank\">koala-13b</a></td> <td>1082</td> <td>a dialogue model for academic research by BAIR</td>\r\n</tr>\r\n<tr>\r\n<td>3</td> <td>🥉 <a href=\"https://open-assistant.io\" target=\"_blank\">oasst-pythia-12b</a></td> <td>1065</td> <td>an Open Assistant for everyone by LAION</td>\r\n</tr>\r\n<tr>\r\n<td>4</td> <td><a href=\"https://crfm.stanford.edu/2023/03/13/alpaca.html\" target=\"_blank\">alpaca-13b</a></td> <td>1008</td> <td>a model fine-tuned from LLaMA on instruction-following demonstrations by Stanford</td>\r\n</tr>\r\n<tr>\r\n<td>5</td> <td><a href=\"https://chatglm.cn/blog\" target=\"_blank\">chatglm-6b</a></td> <td>985</td> <td>an open bilingual dialogue language model by Tsinghua University</td>\r\n</tr>\r\n<tr>\r\n<td>6</td> <td><a href=\"https://huggingface.co/lmsys/fastchat-t5-3b-v1.0\" target=\"_blank\">fastchat-t5-3b</a></td> <td>951</td> <td>a chat assistant fine-tuned from FLAN-T5 by LMSYS</td>\r\n</tr>\r\n<tr>\r\n<td>7</td> <td><a href=\"https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm\" target=\"_blank\">dolly-v2-12b</a></td> <td>944</td> <td>an instruction-tuned open large language model by Databricks</td>\r\n</tr>\r\n<tr>\r\n<td>8</td> <td><a href=\"https://arxiv.org/abs/2302.13971\" target=\"_blank\">llama-13b</a></td> <td>932</td> <td>open and efficient foundation language models by Meta</td>\r\n</tr>\r\n<tr>\r\n<td>9</td> <td><a href=\"https://github.com/stability-AI/stableLM\" target=\"_blank\">stablelm-tuned-alpha-7b</a></td> <td>858</td> <td>Stability AI language models</td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n\r\n&shy;\r\n\r\nTable 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this [notebook](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing). 
You can also try the voting [demo](https://arena.lmsys.org).

![The side-by-side chatting and voting interface](/images/blog/arena/chat_demo.png)
*Figure 1. The side-by-side chatting and voting interface.*

Please note that we periodically release blog posts to update the leaderboard. Feel free to check the following updates:
- [May 10 Updates](https://lmsys.org/blog/2023-05-10-leaderboard/)
- [May 25 Updates](https://lmsys.org/blog/2023-05-25-leaderboard/)
- [June 22 Updates](https://lmsys.org/blog/2023-06-22-leaderboard/)

## Introduction

Following the great success of ChatGPT, there has been a proliferation of open-source large language models that are fine-tuned to follow instructions. These models can provide valuable assistance in response to users' questions and prompts. Notable examples include Alpaca and Vicuna, based on LLaMA, and OpenAssistant and Dolly, based on Pythia.

Despite the constant release of new models every week, the community faces a challenge in benchmarking them effectively. Benchmarking LLM assistants is extremely challenging because the problems can be open-ended, and it is very difficult to write a program that automatically evaluates response quality. In this case, we typically have to resort to human evaluation based on pairwise comparison.

A good benchmark system based on pairwise comparison should have the following properties:
- **Scalability**. The system should scale to a large number of models when it is not feasible to collect sufficient data for all possible model pairs.
- **Incrementality**. The system should be able to evaluate a new model using a relatively small number of trials.
- **Unique order**. The system should provide a unique order for all models. Given any two models, we should be able to tell which ranks higher or whether they are tied.

Existing LLM benchmark systems rarely satisfy all of these properties. Classical LLM benchmark frameworks, such as [HELM](https://crfm.stanford.edu/helm/latest/) and [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), provide multi-metric measurements for tasks commonly used in academic research. However, they are not based on pairwise comparison and are not effective at evaluating open-ended questions. OpenAI also launched the [evals](https://github.com/openai/evals) project to collect better questions, but it does not provide a ranking mechanism for all participating models. When we launched our [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) model, we used a GPT-4-based evaluation pipeline, but it does not offer scalable and incremental ratings.

In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous, randomized battles in a crowdsourced manner. Chatbot Arena adopts the [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system), a widely used rating system in chess and other competitive games. The Elo rating system is a promising way to provide the desired properties listed above. We note that the [Anthropic LLM paper](https://arxiv.org/pdf/2204.05862.pdf) also adopted the Elo rating system.

To collect data, we launched the arena with several popular open-source LLMs one week ago.
In the arena, a user can chat with two anonymous models side-by-side and vote for the better one. This crowdsourced data collection reflects some real-world use cases of LLMs. A comparison between several evaluation methods is shown in Table 2.

*Table 2. Comparison between different evaluation methods.*

| | HELM / lm-evaluation-harness | OpenAI/eval | Alpaca Evaluation | Vicuna Evaluation | Chatbot Arena |
| --- | --- | --- | --- | --- | --- |
| **Question Source** | Academic datasets | Mixed | Self-instruct evaluation set | GPT-4 generated | User prompts |
| **Evaluator** | Program | Program/Model | Human | GPT-4 | User |
| **Metrics** | Basic metrics | Basic metrics | Win rate | Win rate | Elo ratings |

## Data Collection

We hosted the arena at [https://arena.lmsys.org](https://arena.lmsys.org) with our multi-model serving system, [FastChat](https://github.com/lm-sys/FastChat). When a user enters the arena, they can chat with two anonymous models side-by-side, as shown in Figure 1. After getting responses from the two models, users can continue chatting or vote for the model they think is better. Once a vote is submitted, the model names are revealed. Users can then continue chatting or start a new battle with two new, randomly chosen anonymous models. The platform logs all user interactions; in our analysis, we only use the votes cast while the model names were hidden.

The arena was launched about one week ago and we have collected 4.7K valid anonymous votes since then. We share some exploratory analysis in this [notebook](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing) and present a short summary here.

![Battle count of each combination of models](/images/blog/arena/battle_counts.png)
*Figure 2. Battle count of each combination of models.*

Figure 2 shows the battle count for each combination of models. When we initially launched the tournament, we had prior information on the likely ranking based on our benchmarks and chose to pair models accordingly, giving preference to what we believed would be strong pairings. However, we later switched to uniform sampling to get better overall coverage of the rankings, as illustrated in the sketch below. Towards the end of the tournament, we also introduced a new model, `fastchat-t5-3b`. All of this results in non-uniform model frequencies in the battle log.
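For concreteness, here is a minimal sketch of the uniform pairing step, assuming a hypothetical list of registered model names; it is an illustration of the idea, not the production sampler in FastChat.

```python
import random

# Hypothetical list of models registered in the arena (names for illustration only).
MODELS = [
    "vicuna-13b", "koala-13b", "oasst-pythia-12b", "alpaca-13b", "chatglm-6b",
    "fastchat-t5-3b", "dolly-v2-12b", "llama-13b", "stablelm-tuned-alpha-7b",
]

def sample_battle_pair(models: list[str]) -> tuple[str, str]:
    """Draw two distinct models uniformly at random for one anonymous battle."""
    model_a, model_b = random.sample(models, 2)
    return model_a, model_b

if __name__ == "__main__":
    print(sample_battle_pair(MODELS))
```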
![Battle counts for the top-15 languages](/images/blog/arena/lang_counts.png)
*Figure 3. Battle counts for the top-15 languages.*

Figure 3 plots the language distribution and shows that most user prompts are in English.

## Elo Rating System

The [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system) is a method for calculating the relative skill levels of players that has been widely adopted in competitive games and sports. The difference in the ratings between two players serves as a predictor of the outcome of a match. The Elo rating system works well for our case because we have multiple models and we run pairwise battles between them.

If player A has a rating of `Ra` and player B a rating of `Rb`, the exact formula (using the logistic curve with base 10) for the probability of player A winning is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}.$$

The ratings of players can be linearly updated after each battle. Suppose player A (with rating `Ra`) was expected to score `Ea` points but actually scored `Sa` points. The formula for updating that player's rating is

$$R'_A = R_A + K \cdot (S_A - E_A),$$

where `K` is a constant that controls how strongly a single battle moves the rating.

Using the collected data, we compute the Elo ratings of the models in this [notebook](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing) and put the main results in Table 1. You are welcome to try the notebook and play with the voting data yourself. The data only contains voting results without conversation histories, because releasing the conversation histories would raise concerns such as privacy and toxicity.

## Pairwise Win Rates

As a basis for calibration, we also present the pairwise win rates for each model in the tournament (Figure 4) as well as the pairwise win rates predicted from the Elo ratings (Figure 5). Comparing the two figures, we find that the Elo ratings can predict win rates relatively well; a short code sketch of this computation follows Figure 5.

![Fraction of Model A wins for all non-tied A vs. B battles](/images/blog/arena/win_fraction.png)
*Figure 4. Fraction of Model A wins for all non-tied A vs. B battles.*

![Predicted win rate using Elo ratings for Model A in an A vs. B battle](/images/blog/arena/predicted_win_fraction.png)
*Figure 5. Predicted win rate using Elo ratings for Model A in an A vs. B battle.*
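To make the computation concrete, below is a minimal sketch of how Elo ratings and predicted win rates could be derived from a battle log, assuming records of the form `(model_a, model_b, winner)`. The `K` factor, the initial rating of 1000, and the toy battle records are illustrative assumptions, not the exact settings of the released notebook.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the base-10 logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def compute_elo(battles, k: float = 32.0, init: float = 1000.0) -> dict[str, float]:
    """One online pass of Elo updates over (model_a, model_b, winner) records."""
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        e_b = 1.0 - e_a
        # Actual scores: 1 for a win, 0 for a loss, 0.5 each for a tie.
        if winner == "model_a":
            s_a, s_b = 1.0, 0.0
        elif winner == "model_b":
            s_a, s_b = 0.0, 1.0
        else:
            s_a, s_b = 0.5, 0.5
        ratings[model_a] += k * (s_a - e_a)
        ratings[model_b] += k * (s_b - e_b)
    return dict(ratings)

# Toy battle log (illustrative only, not the released voting data).
battles = [
    ("vicuna-13b", "alpaca-13b", "model_a"),
    ("koala-13b", "alpaca-13b", "model_a"),
    ("vicuna-13b", "koala-13b", "tie"),
]
ratings = compute_elo(battles)
print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))

# Predicted win rate of one model over another (as in Figure 5).
print(expected_score(ratings["vicuna-13b"], ratings["koala-13b"]))
```

Because the updates are sequential, the final ratings depend mildly on the order of battles and on the choice of `K`; averaging ratings over many random shufflings of the battle log is one common way to reduce this sensitivity.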
## Future Plans

We plan to work on the following items:
- Add more closed-source models (ChatGPT-3.5, ChatGPT-4, and Claude-v1 are available now in the anonymous Arena)
- Add more open-source models
- Release periodically updated leaderboards (e.g., monthly)
- Implement better sampling algorithms, tournament mechanisms, and serving systems to support a much larger number of models
- Provide fine-grained rankings for different task types

We appreciate any feedback that helps make the arena better.

## Join Us

We invite the entire community to join this benchmarking effort by contributing your models and by voting for the anonymous models you think provide better answers. You can visit [https://arena.lmsys.org](https://arena.lmsys.org) to vote for better models. If you want to see a specific model in the arena, you can follow this [guide](https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model) to help us add it.

## Acknowledgment

We thank the other members of the Vicuna team for valuable feedback and MBZUAI for donating compute resources. We also thank Tianjun Zhang and Eric Wallace for their insightful discussions.

## Links

- Demo: [https://arena.lmsys.org](https://arena.lmsys.org)
- Leaderboard: [https://leaderboard.lmsys.org](https://leaderboard.lmsys.org)
- GitHub: [https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat)
- Colab notebook: [https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing](https://colab.research.google.com/drive/1lAQ9cKVErXI1rEYq7hTKNaCQ5Q8TzrI5?usp=sharing)

## Citation

Chatbot Arena is part of the effort described in the [paper](https://arxiv.org/abs/2306.05685) below. Please cite it if you find our work useful.

```
@misc{zheng2023judging,
  title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
  author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric P. Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
  year={2023},
  eprint={2306.05685},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
