fix leaderboard table for mobile display (#33)
* fix

* update all tables

---------

Co-authored-by: zhisbug <zhisbug@gmail.com>
infwinston and zhisbug authored Jul 4, 2023
Parent: 9f2d27d · Commit: ad3c743
Showing 3 changed files with 51 additions and 36 deletions.
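For context, the recurring change in this commit is to move the flex-centering style off the table element itself and onto a wrapping div, and to drop fixed pixel widths from images so they scale with the viewport. Below is a minimal before/after sketch of that wrapper pattern; the rows shown are placeholders, not content from the blog posts.

<!-- Before: flex styles sit directly on the table, overriding its normal table layout -->
<table style="display: flex; justify-content: center;">
  <tbody>
    <tr> <th>Model</th> <th>Score</th> </tr>
    <tr> <td>Placeholder-Model</td> <td>0.00</td> </tr>
  </tbody>
</table>

<!-- After: a wrapper div handles the centering; the table keeps its default table layout -->
<div style="display: flex; justify-content: center;">
  <table>
    <tbody>
      <tr> <th>Model</th> <th>Score</th> </tr>
      <tr> <td>Placeholder-Model</td> <td>0.00</td> </tr>
    </tbody>
  </table>
</div>

Setting display: flex on a table replaces its default display: table behavior, which is presumably why these tables rendered poorly on narrow screens; the wrapper keeps the centering while letting the table lay out its columns normally (the arena post's wrapper also gains a min-width so that comparison table keeps its column spacing).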
blog/2023-05-03-arena.md (10 changes: 6 additions & 4 deletions)
@@ -80,22 +80,24 @@ To collect data, we launched the arena with several popular open-source LLMs one

<br>
<p style="color:gray; text-align: center;">Table 2: Comparison between different evaluation methods.</p>
-<table style="display: flex; justify-content: center;">
+<div style="display: flex; justify-content: center; min-width: 700px;">
+<table>
<tbody>
<tr>
<th></th> <th>HELM / lm-evaluation-harness</th> <th>OpenAI/eval</th> <th>Alpaca Evaluation</th> <th>Vicuna Evaluation</th> <th>Chatbot Arena</th>
</tr>
<tr>
-<td>Question Source</td> <td>Academic datasets</td> <td>Mixed</td> <td>Self-instruct evaluation set</td> <td>GPT-4 generated</td> <td>User prompts</td>
+<td><strong>Question Source</strong></td> <td>Academic datasets</td> <td>Mixed</td> <td>Self-instruct evaluation set</td> <td>GPT-4 generated</td> <td>User prompts</td>
</tr>
<tr>
-<td>Evaluator</td> <td>Program</td> <td>Program/Model</td> <td>Human</td> <td>GPT-4</td> <td>User</td>
+<td><strong>Evaluator</strong></td> <td>Program</td> <td>Program/Model</td> <td>Human</td> <td>GPT-4</td> <td>User</td>
</tr>
<tr>
-<td>Metrics</td> <td>Basic metrics </td> <td>Basic metrics</td> <td>Win rate</td> <td>Win rate</td> <td>Elo ratings</td>
+<td><strong>Metrics</strong></td> <td>Basic metrics </td> <td>Basic metrics</td> <td>Win rate</td> <td>Win rate</td> <td>Elo ratings</td>
</tr>
</tbody>
</table>
+</div>

## Data Collection
We hosted the arena at [https://arena.lmsys.org](https://arena.lmsys.org) with our multi-model serving system, [FastChat](https://github.com/lm-sys/FastChat). When a user enters the arena, they can chat with two anonymous models side-by-side, as shown in Figure 1.
blog/2023-06-22-leaderboard.md (15 changes: 9 additions & 6 deletions)
@@ -133,7 +133,8 @@ th:nth-child(1) .arrow-down {

<br>
<p style="color:gray; text-align: center;">Table 1. LLM Leaderboard (Timeframe: April 24 - June 19, 2023). The latest and detailed version <a href="https://chat.lmsys.org/?leaderboard" target="_blank">here</a>.</p>
-<table id="Table1" style="display: flex; justify-content: center;" align="left" >
+<div style="display: flex; justify-content: center;">
+<table id="Table1" >
<tbody>

<tr> <th>Model</th> <th onclick="sortTable(1, 'Table1')">MT-bench (score) <span class="arrow arrow-down"></span></th> <th onclick="sortTable(2, 'Table1')">Arena Elo Rating <span class="arrow"></span></th> <th onclick="sortTable(3, 'Table1')">MMLU <span class="arrow"></span></th> <th>License</th> </tr>
@@ -170,9 +171,9 @@ th:nth-child(1) .arrow-down {
<tr> <td><a target="_blank" href="https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b"> StableLM-Tuned-Alpha-7B </a></td> <td>2.75</td> <td>871</td> <td>24.4</td> <td>CC-BY-NC-SA-4.0</td> </tr>
<tr> <td><a target="_blank" href="https://arxiv.org/abs/2302.13971"> LLaMA-13B </a></td> <td>2.61</td> <td>826</td> <td>47.0</td> <td>Non-commercial</td> </tr>


</tbody>
</table>
+</div>

&shy;

@@ -203,7 +204,7 @@ MT-Bench serves as a **quality-controlled complement** to our crowd-sourced base
Through running the Chatbot Arena for 2 months and analyzing our users' prompts, we've identified 8 primary categories of user prompts: Writing, Roleplay, Extraction, Reasoning, Math, Coding, Knowledge I (STEM), and Knowledge II (humanities/social science).
We crafted 10 multi-turn questions per category, yielding a set of 160 questions in total. We display some sample questions below in Figure 1. You can find more [here](https://huggingface.co/spaces/lmsys/mt-bench).

-<img src="/images/blog/leaderboard_week8/sample_question.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 1500px;"></img>
+<img src="/images/blog/leaderboard_week8/sample_question.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto;"></img>
<p style="color:gray; text-align: center;">Figure 1: Sample questions from the MT-Bench.</p>

### But Still, How to Grade Chatbots' Answers?
@@ -242,7 +243,7 @@ To delve deeper into the distinguishing factors among chatbots, we select a few
GPT-4 shows superior performance in Coding and Reasoning compared to GPT-3.5/Claude, while Vicuna-13B lags significantly behind in several specific categories: Extraction, Coding, and Math.
This suggests there is still ample room for improvement for open-source models.

-<img src="/images/blog/leaderboard_week8/ability_breakdown.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 1000px;"></img>
+<img src="/images/blog/leaderboard_week8/ability_breakdown.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto;"></img>
<p style="color:gray; text-align: center;">Figure 2: The comparison of 6 representative LLMs regarding their abilities in 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities.</p>


@@ -252,7 +253,8 @@ We next analyze the multi-turn scores of selected models, presented in Table 2.

<br>
<p style="color:gray; text-align: center;">Table 2. The breakdown of LLMs' MT-bench scores in the 1st and 2nd turn of a dialogue. Full score is 10.</p>
-<table style="display: flex; justify-content: center;" align="left" >
+<div style="display: flex; justify-content: center;">
+<table>
<tbody>
<tr> <th>Model</th> <th>Average 1st Turn Score</th> <th>Average 2nd Turn Score</th> <th>Score Difference</th>

@@ -285,6 +287,7 @@ We next analyze the multi-turn scores of selected models, presented in Table 2.
<tr><td><a href="https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-13b" target="_blank">H2OGPT-Oasst-Open-LLaMA-13B</a></td> <td>5.51</td> <td>3.74</td> <td>-1.78</td> </tr>
</tbody>
</table>
+</div>

&shy;

@@ -301,7 +304,7 @@ GPT-4 provides thorough and logical feedback to support its judgment.
Our [study](https://arxiv.org/abs/2306.05685) found that such reviews are beneficial in guiding humans to make better-informed decisions (refer to Section 4.2 for more details).
All the GPT-4 judgments can be found on our [demo site](https://huggingface.co/spaces/lmsys/mt-bench).

-<img src="/images/blog/leaderboard_week8/explainability_sample.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 1000px;"></img>
+<img src="/images/blog/leaderboard_week8/explainability_sample.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto;"></img>
<p style="color:gray; text-align: center;">Figure 3: MT-bench provides more explainability in evaluating LLMs' human preferences.</p>

In conclusion, we have shown that MT-Bench effectively differentiates between chatbots of varying capabilities.