Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, Ion Stoica

Introduction

Recent advancements in large language models (LLMs) have significantly expanded their capabilities beyond traditional natural language processing boundaries, addressing a broad array of general tasks (OpenAI, 2023; Gemini et al., 2023; Touvron et al., 2023). These developments underscore the potential of LLMs but also have raised concerns with respect to performance evaluation. Current benchmarks often fail to capture the nuanced and diverse aspects of these models, particularly in assessing their alignment with human preferences in real-world, open-ended tasks.

To assess the performance of LLMs, the research community has introduced a variety of benchmarks. These benchmarks can be categorized based on two factors: the source of questions (either static or live) and the evaluation metric (either ground truth or human preference). According to these factors, benchmarks can be classified into four categories, as shown in Figure 1. While a range of benchmarks is beneficial, the most prevalent current method for evaluating LLMs remains a static, ground-truth-based evaluation, partly because such evaluations are inexpensive and reproducible.

However, these static, ground-truth-based benchmarks exhibit several limitations. Firstly, the questions within these benchmarks are not open-ended, hindering the ability to capture the flexible and interactive use found in real-world settings (Zheng et al., 2023b). Secondly, the test sets in these benchmarks are static, meaning they can become contaminated over time, which undermines the reliability of the evaluation results (Yang et al., 2023). Furthermore, for many complex tasks, establishing a definitive ground truth is not only challenging but sometimes unattainable. Consequently, current benchmarks fail to adequately address the needs of state-of-the-art LLMs, particularly in evaluating user preferences. Thus, there is an urgent necessity for an open, live evaluation platform based on human preference that can more accurately mirror real-world usage.

Creating such a benchmark platform entails significant challenges. It requires the collection of live, fresh, and diverse user questions to accurately represent real-world scenarios. Additionally, developing scalable, incremental, and efficient ranking systems is essential for evaluating a large number of models. Moreover, ensuring the quality of human evaluations is crucial given the noisy nature of human preferences.

To this end, we introduce Chatbot Arena, a benchmarking platform for LLMs that features anonymous, randomized battles in a crowdsourced setting. Chatbot Arena is a free website open to all users (https://chat.lmsys.org). On this website, a user can ask a question and get answers from two anonymous LLMs. Afterward, the user casts a vote for the model that delivers the preferred response, with the models’ identities revealed only after voting. This crowdsourced method effectively gathers a diverse array of fresh user prompts, accurately reflecting real-world LLM applications. Armed with this data, we employ a suite of powerful statistical techniques, ranging from the statistical model of Bradley & Terry (1952) to the E-values of Vovk & Wang (2021), to estimate the ranking over models as reliably and sample-efficiently as possible. With these tools in hand, we have designed efficient sampling algorithms specifically to select model pairs in a way that accelerates the convergence of rankings while retaining statistical validity.

We conduct a thorough analysis of the collected data to ensure the credibility of our platform. We demonstrate that the user-generated questions are sufficiently diverse to encompass a wide range of LLM use cases and are sufficiently challenging to differentiate between models. Furthermore, we confirm that the crowd-sourced votes are highly consistent with expert evaluations.

We have been running our system since Apr 2023 and have received over 240K votes from about 90K users in over 100 different languages as of Jan 2024. To encourage user engagement, we have made over 50 state-of-the-art models available for free. We also collaborate with leading model developers such as OpenAI, Google, Anthropic, Mistral, Hugging Face, and various universities, incorporating their latest models into our platform. We keep the community engaged by routinely updating the leaderboard, publishing analytical blogs, releasing datasets, and sharing information via tweets. Because of its unique and significant value, our leaderboard has emerged as one of the most referenced in the LLM field and has become a benchmark for the industry. We commit to making our data and code available, ensuring that this platform is open-source and open-accessible.

We build the first large-scale crowd-sourced live LLM evaluation platform with over 1M user visits. (The number was estimated by Google Analytics as of March 2024. Note that a user visit may not convert to votes, as our website also offers a “direct chat” mode.)

We conduct an in-depth analysis of the collected data, including prompt diversity, quality, vote quality, and insights on human feedback.

We will publicly release a human preference dataset with over 100K pairwise votes collected from Chatbot Arena.

We design an efficient sampling algorithm that actively chooses which model pairs to show, such that our sample efficiency improves, sometimes to a large degree.

Related Work

LLM Benchmarks. We briefly review the common LLM benchmarks, following the classification presented in Figure 1. The most prevalent benchmarks are static, ground-truth-based ones, typically in the form of multiple-choice questions or question-answering tasks with predefined answers and test cases. These benchmarks encompass a range of topics including language understanding, mathematics, coding, and logical reasoning. Prominent examples in this category are MMLU (Hendrycks et al., 2020), HellaSwag (Zellers et al., 2019), GSM-8K (Cobbe et al., 2021), BigBench (Srivastava et al., 2023), AGIEval (Zhong et al., 2023), and HumanEval (Chen et al., 2021). Benchmarks focusing on safety, such as ToxicChat (Lin et al., 2023), and comprehensive suites like HELM (Liang et al., 2022), also exist. In addition to closed-ended questions, benchmarks can include open-ended questions that are evaluated by human judgment, which can be rated by experts or crowd workers such as Amazon Mechanical Turk (Karpinska et al., 2021; Geng et al., 2023; Wang et al., 2023). The recent trend includes utilizing GPT-4 for approximating human judgment (Chiang & Lee, 2023), with notable instances being MT-Bench (Zheng et al., 2023b) and AlpacaEval (Li et al., 2023). In addition to static benchmarks, live benchmarks that include fresh questions are also available. These questions can be obtained from annual exams or weekly online contests such as Codeforces (Li et al., 2022; Huang et al., 2023). They can also be sourced from human interaction. Some studies have explored using live human interaction for reinforcement learning from human preference (Bai et al., 2022; Ouyang et al., 2022; Touvron et al., 2023). However, these studies are typically limited to specific organizations. In this paper, we introduce Chatbot Arena, the first open, large-scale, and crowdsourced benchmark platform that utilizes live human interaction.

Risks of Static Benchmarks. Static benchmarks have certain issues, including contamination, saturation, overfitting, and a lack of human alignment (Yang et al., 2023; Oren et al., 2023). DynaBench (Kiela et al., 2021) identifies these challenges and recommends the use of a live benchmark that incorporates a human-in-the-loop approach for classical NLP benchmarks. Our system adopts a similar spirit. However, our focus is on chatting with LLMs, and we implement this on a significantly larger user scale.

Ranking System. Ranking systems have been a well-studied topic in statistics. Related topics include probability models (Hunter, 2004; Rao & Kupper, 1967), rank elicitation (Szörényi et al., 2015; Busa-Fekete et al., 2014a, b), and online experiment design (Chernoff, 1992; Karimi et al., 2021). The Elo rating system has also been used for LLMs (Bai et al., 2022; Boubdir et al., 2023). Contributing to this literature, we introduce techniques for accelerating ranking convergence and detecting abnormalities, specifically applied to large-scale, real-world settings of LLMs.

Human Preference Dataset. Owing to the significance of human preferences, several datasets and analyses exist that incorporate human preferences. These include OpenAssistant (Köpf et al., 2023), HH-RLHF (Bai et al., 2022), LMSYS-Chat-1M (Zheng et al., 2023a), and synthetic approximations of human preferences like UltraFeedback (Cui et al., 2023) and Nectar (Zhu et al., 2023). Our prior data release, LMSYS-Chat-1M (Zheng et al., 2023a), is similarly collected via crowdsourcing. However, LMSYS-Chat-1M comprises solely conversations and lacks human preference data, rendering it unsuitable for direct use in ranking studies. This paper focuses on the analysis of preference data for ranking purposes.

Human Preference Data Collection

In this section, we discuss our interface design to collect human preferences and present summary statistics.

Chatbot Arena crowd-sources feedback from users for model evaluation. Our goal is to design an ease-of-use interface to reduce friction for users to contribute data. Since we collect feedback from many users, it is difficult to set a consistent grading rubric across different people. Hence, we adopt a pairwise comparison mechanism where users only need to compare two model responses and vote for the better one, instead of requiring users to provide an absolute score.

In each battle, two anonymous models are sampled. To encourage data diversity, we do not preset any input prompt on the website. Users are free to input any prompt to the two models. We believe this creates incentives for user engagement, particularly given that we offer a free service. It also helps us collect a diverse set of inputs representing real-world usage. After the models provide their answers, users compare them side-by-side and vote for the preferred answer. If a user cannot choose in the first turn, the user can continue chatting until identifying a winner. For those who are unsure, we also present two buttons, “tie” or “both are bad.” Figure 8 shows a screenshot of our interface. Before using our service, users are required to accept terms of use, which gives us their consent to release the data publicly.

3.2 Data Statistics

We began collecting data in April 2023. As of Jan 2024, we have received around 240K votes from over 90K users. Our data involves more than 50 models, including both proprietary models like GPT-4, Claude, and Gemini, as well as open models such as LLaMA and Mistral. These conversations cover more than 100 languages, with 77% being in English, 5% in Chinese, and the remaining languages, such as Russian, German, Spanish, French, and Japanese, each representing less than 2% of the total. Each data point includes multi-turn conversations between the user and two LLMs, and a vote to indicate which model the user prefers. We summarize statistics in Table 1 along with other existing human preference datasets.

Figure 10 in the Appendix shows the vote count per model. On average, 8K votes are collected for each model. In Figure 2, we select a set of representative models and present their win rate and the number of battles. Note that we employ non-uniform sampling to concentrate votes on model pairs that have similar performance due to higher uncertainty. This helps us reduce the number of votes required to reach stable results. We later develop an adaptive sampling method and demonstrate its effectiveness against random sampling. See Section 5 for further analysis.

To ensure anonymity, we use keywords to filter out conversations containing model identity such as model name (e.g., GPT, Claude) or companies (e.g., OpenAI, Anthropic). To avoid misuse, we adopt OpenAI moderation API to flag conversations that contain unsafe content. The flagged user requests account for 3% of the total requests. Figure 9 in the Appendix shows the number of valid user votes over time, where we get 1-2K votes per day in recent months and spikes as we introduce new models or leaderboard updates.
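As a rough illustration of the keyword-based anonymity filter described above, the following sketch flags a conversation if any message mentions one of a few identity keywords. The keyword list (taken from the examples in the text), the case-insensitive substring matching, and the function name `reveals_identity` are assumptions for illustration only, not the platform's actual filter.

```python
import re

# Illustrative keywords only; the real list used on the platform is not given here.
IDENTITY_KEYWORDS = ["gpt", "claude", "openai", "anthropic"]
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in IDENTITY_KEYWORDS), re.IGNORECASE
)


def reveals_identity(conversation):
    """Return True if any message in the conversation contains a keyword.

    `conversation` is a list of message strings (user and model turns).
    """
    return any(_PATTERN.search(message) for message in conversation)


battle = ["Who are you?", "I am Claude, a model made by Anthropic."]
if reveals_identity(battle):
    print("drop this battle from the ranking data")
```

Plain substring matching is deliberately blunt: it can over-filter (any word containing "gpt" matches), which errs on the side of preserving anonymity.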

From Pairwise Comparisons to Rankings

Our data consists of pairwise comparisons, but how can we use these comparisons to recover a ranking over all $M$ models? This is a well-studied topic in the literature on learning to rank (Liu et al., 2009), and we present our perspective here. We let $\mathcal{A}=\{(m,m^{\prime}):m<m^{\prime}\text{ and }m,m^{\prime}\in[M]\}$ denote our comparative data set.

We consider a sequential setting, where at time $t\in\mathbb{N}$, we serve the human a pair of models $A_{t}\in\mathcal{A}$ (which we pick), and in turn we observe the human’s response $H_{t}\in[0,1]$. As an example, we might have that $A_{t}=(1,2)$ and $H_{t}=1$, indicating that the human prefers model 2 over model 1. In the ensuing text, we will primarily focus on the binary case, where $H_{t}\in\{0,1\}$, but our approach will generalize to any form of feedback, including the possibility of allowing the human to express different degrees of preference or to say the models are tied.

One critical goal is to estimate the win matrix: $\theta^{*}(a)=\mathbb{E}[H_{t}\mid A_{t}=a]$, for all $a\in\mathcal{A}$; see the left panel of Figure 2 for an illustration of the (empirical) win matrix. In the binary case, the $a$ entry in the win matrix corresponds to the probability the human prefers model $a_{2}$ to $a_{1}$ when shown the pair $a$. Finding the win matrix is a relatively straightforward mean-estimation problem; we will provide details in Section 5.

Formally, consider a score $s(\mathbb{P})\in\mathbb{R}^{M}$, where $\mathbb{P}$ is a joint distribution over $A$ and $H$ (by default, we will target a uniform distribution over $\mathcal{A}$). Each model has a true score $s(\mathbb{P})_{m}$, and better models will have higher scores. In particular, we have the rank of model $m$:

\[ \operatorname{rank}(\mathbb{P})_{m}=1+\sum_{m^{\prime}\in[M]}\mathbb{1}\left\{s(\mathbb{P})_{m^{\prime}}>s(\mathbb{P})_{m}\right\}. \tag{1} \]

The best model has rank $1$. If there is another model tied for best, they will both get assigned rank $1$.

Picking a score. A standard score function in this setting is the vector of Bradley-Terry (BT) coefficients (Bradley & Terry, 1952). In the Bradley-Terry model, $H_{t}\in\{0,1\}$, and the probability model $m$ beats model $m^{\prime}$ is modeled via a logistic relationship:

\[ \mathbb{P}(H_{t}=1)=\frac{1}{1+e^{\xi_{m^{\prime}}-\xi_{m}}}, \tag{2} \]

where $\xi$ is an $M$-length vector of so-called BT coefficients. Without loss of generality, we take $\xi_{1}=0$ (since the model is invariant to addition in $\xi$). Our goal is to estimate the population Bradley-Terry coefficients, i.e., those that minimize the binary cross-entropy:

\[ s(\mathbb{P})=\operatorname*{argmin}_{\xi}\mathbb{E}_{(A,H)\sim\mathbb{P}}\left[\ell\left(H,\frac{1}{1+e^{\xi_{A_{2}}-\xi_{A_{1}}}}\right)\right], \tag{3} \]

where $\ell$ is the binary cross-entropy loss, $\ell(h,p)=-(h\log(p)+(1-h)\log(1-p))$.
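To make the logistic form in (2) and the loss in (3) concrete, the following minimal sketch evaluates both for a hypothetical coefficient vector. The function names `bt_prob` and `bce` and the numeric values are illustrative assumptions; models are 0-indexed here, so `xi[0]` plays the role of $\xi_{1}=0$, and the probability is computed exactly as written in (3) for a pair `(a1, a2)`.

```python
import math


def bt_prob(xi, pair):
    """Probability term from (3): 1 / (1 + exp(xi[a2] - xi[a1]))."""
    a1, a2 = pair
    return 1.0 / (1.0 + math.exp(xi[a2] - xi[a1]))


def bce(h, p):
    """Binary cross-entropy: -(h*log(p) + (1-h)*log(1-p))."""
    return -(h * math.log(p) + (1 - h) * math.log(1 - p))


xi = [0.0, 1.0, -0.5]              # hypothetical BT coefficients
p = bt_prob(xi, (0, 1))
print(round(p, 4))                 # 0.2689

# Adding a constant to every coefficient leaves the probability unchanged,
# which is why one coefficient can be pinned to zero.
shifted = [x + 3.0 for x in xi]
print(abs(bt_prob(shifted, (0, 1)) - p) < 1e-12)   # True
```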

Although the BT model technically assumes a parametric form for the model win rates, the seminal results of Huber et al. (1967) and White (1982) show that maximum likelihood estimators are still asymptotically normal even when these assumptions do not hold, so long as the so-called “sandwich” covariance matrix is used; see Section 5 for details, and see Appendix B for a nonparametric extension of the Bradley-Terry model. Finally, we remark that earlier versions of our online interface reported different ranking scores, such as the Elo score (Elo, 1967), instead of the BT coefficients. We made this change because the BT coefficients are better for the purpose of statistical estimation.

Efficient Approximate Ranking

In Section 4 we described how to calculate the win matrix, score, and rank. Now we describe our estimation procedures.

Win matrix estimation. Estimation of the win matrix is relatively straightforward. Define $X_{t}(a)=\frac{1}{P_{t}(a)}H_{t}\mathbb{1}\left\{A_{t}=a\right\}$, where $P_{t}(a)$ is the probability of sampling pair $a$ at time $t$, and $X_{t}$ as the corresponding vector. Then the estimator is

\[ \hat{\theta}_{T}=\frac{1}{T}\sum_{t=1}^{T}X_{t}. \tag{4} \]

Note that $\mathbb{E}[X_{t}(a)]=\theta^{*}(a)$ for all $t$, and thus $\hat{\theta}_{T}$ is an unbiased estimator of $\theta^{*}$. We will furthermore estimate the covariance matrix as

\[ \hat{\Sigma}_{T}=\frac{1}{T}\sum_{t=1}^{T}(X_{t}-\hat{\theta}_{T})(X_{t}-\hat{\theta}_{T})^{\top}. \tag{5} \]

Under the appropriate regularity conditions, we have that

\[ \sqrt{T}\,\hat{\Sigma}_{T}^{-1/2}(\hat{\theta}_{T}-\theta^{*})\rightarrow\mathcal{N}(0,I), \tag{6} \]

and we construct confidence intervals accordingly. For an understanding of the appropriate regularity conditions, see Durrett (2019), Theorem 8.2.8, where condition (ii) is trivially satisfied so long as $P_{t}(a)>\epsilon>0$, and condition (i) is implied by the almost-sure convergence of $P_{t}(a)$ to a limiting distribution $P(a)$.
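The inverse-probability-weighted estimator defined above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the record layout `(pair, h, prob)`, the function name `ipw_win_rates`, and the toy votes are hypothetical, and only the diagonal of the covariance is used, giving per-pair normal-approximation intervals rather than a joint confidence set.

```python
import math


def ipw_win_rates(records, pairs, z=1.96):
    """Estimate theta(a) with X_t(a) = H_t * 1{A_t = a} / P_t(a).

    records: list of (pair, h, prob), where prob is the probability with
             which `pair` was sampled at the time the vote was collected.
    Returns {pair: (estimate, lower, upper)} using a per-pair normal
    approximation (diagonal of the sample covariance only).
    """
    T = len(records)
    result = {}
    for a in pairs:
        x = [h / prob if pair == a else 0.0 for pair, h, prob in records]
        theta = sum(x) / T
        var = sum((v - theta) ** 2 for v in x) / T
        half_width = z * math.sqrt(var / T)
        result[a] = (theta, theta - half_width, theta + half_width)
    return result


# Toy data: two pairs, each served with probability 0.5.
votes = [((0, 1), 1, 0.5), ((0, 1), 0, 0.5), ((0, 2), 1, 0.5), ((0, 2), 1, 0.5)]
estimates = ipw_win_rates(votes, pairs=[(0, 1), (0, 2)])
print(estimates[(0, 1)][0], estimates[(0, 2)][0])   # 0.5 1.0
```

Because of the reweighting, a finite-sample estimate is not guaranteed to lie in $[0,1]$; it is unbiased rather than constrained.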

Estimating the BT scores. To estimate the BT coefficients, mirroring (3), we perform (reweighted) maximum likelihood estimation on our data points:

\[ s(\hat{\mathbb{P}})=\operatorname*{argmin}_{\xi}\sum_{t=1}^{T}\frac{1}{P(A_{t})}\ell\left(H_{t},\frac{1}{1+e^{\xi_{A_{t,2}}-\xi_{A_{t,1}}}}\right), \tag{7} \]

where $A_{t}\sim P$. We perform the inverse weighting by $P(A_{t})$ because this allows us to target a score with a uniform distribution over $\mathcal{A}$.
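As an illustration of the reweighted estimation in (7), the following sketch fits BT coefficients by plain gradient descent in pure Python. The solver choice, the record layout `((a1, a2), h, prob)`, the function name `fit_bt`, and the toy votes are assumptions made for illustration and do not describe the deployed implementation. Models are 0-indexed, so `xi[0]` plays the role of $\xi_{1}=0$, and the sketch follows the sign convention of (7) exactly as written.

```python
import math


def fit_bt(records, num_models, lr=1.0, steps=2000):
    """Minimize sum_t (1/P(A_t)) * BCE(H_t, 1/(1+exp(xi[a2]-xi[a1]))).

    records: list of ((a1, a2), h, prob) with h in {0, 1} and prob the
             sampling probability of the pair. xi[0] is held at 0.
    """
    xi = [0.0] * num_models
    total_weight = sum(1.0 / prob for _, _, prob in records)
    for _ in range(steps):
        grad = [0.0] * num_models
        for (a1, a2), h, prob in records:
            p = 1.0 / (1.0 + math.exp(xi[a2] - xi[a1]))
            g = (p - h) / prob          # d(loss)/d(xi[a1]); negated for xi[a2]
            grad[a1] += g
            grad[a2] -= g
        for m in range(1, num_models):  # skip index 0 to fix the offset
            xi[m] -= lr * grad[m] / total_weight
    return xi


# Toy data: one pair, h = 1 in three of four votes, so the fitted
# probability is 3/4 and xi[1] = -log(3).
votes = [((0, 1), 1, 0.5)] * 3 + [((0, 1), 0, 0.5)]
xi_hat = fit_bt(votes, num_models=2)
print([round(v, 3) for v in xi_hat])    # [0.0, -1.099]
```

The objective is convex, so simple gradient descent suffices for a small example; a Newton-type or off-the-shelf logistic regression solver would be the natural choice at scale.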

To compute confidence intervals on the BT coefficients, we employ two strategies: (1) the pivot bootstrap (DiCiccio & Efron, 1996), and (2) the “sandwich” robust standard errors outlined in Huber et al. (1967) (see also Freedman (2006) for an outline of the necessary technical assumptions). Ultimately, based on the results of a simulation study described in Appendix A, we choose to deploy the sandwich intervals due to their smaller size in large samples.

Approximate rankings. Finally, we report an approximate ranking for each model that accounts for the uncertainty in the estimation of the score. Given an $M$-dimensional confidence set $\mathcal{C}$ satisfying

\[ \mathbb{P}(s(\mathbb{P})\in\mathcal{C})\geq 1-\alpha, \tag{8} \]

we extract an approximate ranking $R_{m}=1+\sum_{m^{\prime}\in[M]}\mathbb{1}\left\{\inf\mathcal{C}_{m^{\prime}}>\sup\mathcal{C}_{m}\right\}$. The uniform validity of $\mathcal{C}$ directly implies that $\mathbb{P}(\exists m:R_{m}>\operatorname{rank}(\mathbb{P})_{m})\leq\alpha$; that is, with high probability, no model’s performance is understated. A guarantee on the other side (that no model’s performance is overstated) is possible by interchanging the $\inf$ and $\sup$. To get the uniform confidence set, we construct the chi-squared interval implied by the central limit theorem using the sandwich estimate of the variance. In other words, we construct the interval $\{\xi:T\left\|\hat{V}^{-1/2}(\hat{\xi}-\xi)\right\|\leq\chi^{2}_{1-\alpha,M-1}\}$, where $\hat{\xi}$ is our MLE of the BT coefficients and $\hat{V}$ is the sandwich variance of the logistic regression.
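The rule for $R_{m}$ above reduces to counting, for each model, how many other models have an interval lying entirely above its own. A minimal sketch, assuming the per-model bounds $(\inf\mathcal{C}_{m},\sup\mathcal{C}_{m})$ have already been computed; the function name and the interval values are hypothetical.

```python
def approximate_ranks(intervals):
    """R_m = 1 + #{m' : lower bound of m' > upper bound of m}.

    intervals: list of (lower, upper) bounds, one per model.
    """
    return [
        1 + sum(other_lower > upper for other_lower, _ in intervals)
        for _, upper in intervals
    ]


intervals = [(0.9, 1.3), (0.8, 1.1), (0.2, 0.5), (-0.4, 0.1)]
print(approximate_ranks(intervals))   # [1, 1, 3, 4]
```

The first two models have overlapping intervals, so neither is provably worse than the other and both receive rank 1; the third model sits strictly below both and receives rank 3.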

Active sampling rule. Our sampling rule was to choose the model pair $a\in\mathcal{A}$ proportionally to the reduction in confidence interval size by sampling that pair:

\[ P_{t}(a)\propto\sqrt{\frac{\hat{\Sigma}_{t,a,a}}{|\{t:A_{t}=a\}|}}-\sqrt{\frac{\hat{\Sigma}_{t,a,a}}{|\{t:A_{t}=a\}|+1}}. \tag{9} \]

5.1 Detecting Anomalous Users

On a different note, we take a first step towards identifying anomalous IP addresses in our dataset. In a dataset of $U$ unique IPs, we let $\mathsf{IP}=\{1,\ldots,U\}$ be the set of all IP addresses. Consider a “test” user, outside this database, who gives ratings $H^{\prime}_{1},\ldots,H^{\prime}_{n}$ when presented actions $A^{\prime}_{1},\ldots,A^{\prime}_{n}$. The idea of our procedure is to compare the distribution of ratings for the new user to the historical distribution of ratings for a given action. We let $\mathcal{H}_{a}=\{H_{t}:A_{t}=a\}$ and every time a user submits a vote, we calculate the following number:

\[ p_{i}=\frac{1}{|\mathcal{H}_{A^{\prime}_{i}}|+1}\left(1+\sum_{h\in\mathcal{H}_{A^{\prime}_{i}}}\mathbb{1}\left\{h\geq H^{\prime}_{i}\right\}\right). \tag{10} \]

Under the null hypothesis that $\mathcal{H}_{A^{\prime}_{i}}$ is exchangeable with $H^{\prime}_{i}$, $p_{i}$ is a valid p-value (see Appendix C for a proof). Furthermore, the dependence of these p-values asymptotically is negligible.

With this p-value in hand, we can test against this null hypothesis sequentially by using Fisher’s combination test (Fisher, 1928) along with a variant of the Bonferroni correction. In particular, for each user, after their $j$th vote, we compute $M_{j}=-2\sum_{i=1}^{j}\log(p_{i})$. At 5 randomly chosen values of $j$ between 1 and 100, we identify a user as anomalous if $M_{j}\geq\chi^{2}_{2j,1-\alpha/5}$. (The times are randomly chosen so as to prevent anomalous users from strategizing to hack this p-value.) Despite the heuristic application of this procedure, it seems to work well in our small-scale tests reported in Table 5.
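The procedure above can be sketched with the standard library alone. The function names, the default `alpha`, and the example inputs are illustrative assumptions. Rather than looking up the quantile $\chi^{2}_{2j,1-\alpha/5}$, the sketch uses the equivalent check that the chi-squared tail probability of $M_{j}$ is at most $\alpha/5$, relying on the closed-form tail of a chi-squared distribution with an even number of degrees of freedom.

```python
import math
import random


def vote_p_value(history, h_new):
    """Equation (10): (1 + #{h in history : h >= h_new}) / (|history| + 1)."""
    return (1 + sum(h >= h_new for h in history)) / (len(history) + 1)


def chi2_tail_even(x, dof):
    """P(chi2_dof >= x) for even dof: exp(-x/2) * sum_{i<dof/2} (x/2)^i / i!."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def is_anomalous(p_values, checkpoints, alpha=0.1):
    """Fisher's combination test at each checkpoint j, Bonferroni-corrected.

    Flags the user if M_j = -2 * sum_{i<=j} log(p_i) exceeds the
    chi-squared quantile with 2j degrees of freedom at level
    1 - alpha / len(checkpoints), i.e. if its tail probability is small.
    """
    level = alpha / len(checkpoints)
    for j in checkpoints:
        if j > len(p_values):
            continue
        m_j = -2.0 * sum(math.log(p) for p in p_values[:j])
        if chi2_tail_even(m_j, 2 * j) <= level:
            return True
    return False


checkpoints = sorted(random.sample(range(1, 101), 5))  # 5 random vote indices
print(vote_p_value([0, 1, 1, 0, 1], h_new=1))          # 0.6666666666666666
```

With five checkpoints, `alpha / len(checkpoints)` matches the $\alpha/5$ correction in the text.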

Data Analysis

To examine whether Arena’s crowdsourced data reflects real-world use cases, we conduct topic modeling on the user prompts. We then show how effective these prompts are at distinguishing models. Lastly, we validate the vote quality by relabeling data with experts.

To study the prompt diversity, we build a topic modeling pipeline with BERTopic (Grootendorst, 2022; https://github.com/MaartenGr/BERTopic). We start by transforming user prompts into representation vectors using OpenAI’s text embedding model (text-embedding-3-small). To mitigate the curse of dimensionality for data clustering, we employ UMAP (Uniform Manifold Approximation and Projection) (McInnes et al., 2020) to reduce the embedding dimension from 1,536 to 5. We then use the hierarchical density-based clustering algorithm, HDBSCAN, to identify topic clusters with minimum cluster size 32. Finally, to obtain topic labels, we sample 10 prompts from each topic cluster and feed them into GPT-4-Turbo for topic summarization.

The pipeline identifies 600 clusters covering a wide range of topics including poetry writing, coding, math, and medical queries. We present the top-16 topic clusters in Figure 3. We observe that the largest cluster only accounts for 1% of the entire set and the rest quickly drop to <0.5%, and the similarity between clusters is small, showing a long-tail and diverse distribution. Due to space limits, we present the similarity matrix and cluster hierarchy of the top-64 clusters in Figures 11 and 12 in the Appendix.

6.2 Can Arena Prompts Distinguish Models?

Next, we study how effective these topic clusters are at distinguishing models’ strengths. Constructing challenging prompts has become increasingly difficult due to LLMs’ fast-growing capabilities. For example, open models such as Llama-2-70b-chat can likely answer inquiries about movie or travel recommendations as well as GPT-4, but not in other domains such as reasoning or coding. To demonstrate, we sample 30 prompts from seven topic clusters and compare the performance of Llama-2-70b-chat and GPT-4. To control variables, we factor out user votes and use LLM-as-judge (Zheng et al., 2023b) to evaluate model responses. Results are shown in Table 2, where we see that GPT-4 has a significantly higher win rate (up to 97%) in clusters that require coding and reasoning skills. On the other hand, for clusters with fewer problem-solving tasks, the GPT-4 win rate drops to below 60%. We show examples in Appendix D.1. This result shows that models may exhibit varying strengths in different areas, and also highlights that some of the topic clusters in Chatbot Arena are effective in differentiating models.

Building Challenging Benchmark. To further demonstrate the prompt quality, we show it is possible to construct a challenging benchmark with crowd-sourced user prompts. To ensure both topic coverage and quality, we first run the topic modeling pipeline and follow a procedure similar to that of Zheng et al. (2023a) to select challenging questions sampled from each topic cluster. Example prompts and evaluation procedures can be found in Appendix D.2 and Appendix D.3, respectively. We observe that the selected prompts are highly effective in differentiating models. In Figure 4, we compare Arena Bench against a widely used LLM benchmark, MT-Bench (Zheng et al., 2023b). We can see that Arena Bench effectively reveals a significant gap in performance between proprietary models and the strongest open models.

6.3 Validating Vote Quality

To assess the quality of crowdsourced votes, we randomly selected 160 battles between GPT-4-Turbo and Llama-2-13B, as well as GPT-4-Turbo and GPT-3.5-Turbo-0613. We then asked experts (graduate students at UC Berkeley) to label their preference per comparison. The experts were given the prompts and answers blindly, and asked to carefully fact-check the models’ answers with external resources such as search engines. Manually labeling each data point took on average 3-5 minutes. For reference, we also use GPT-4 as a judge for pairwise comparisons. The agreement rates between crowd users, experts, and the GPT-4 judge are presented in Table 3. The corresponding win rates are shown in Table 4.

To summarize, we observe high agreement rates (72% to 83%) between Arena crowd users and experts in both setups. Note that agreement rates between two experts are at similar levels (79.4% and 89.8%). The 10%-20% disagreement between experts arises mostly because some user prompts do not have a ground-truth answer. Depending on the preference of the evaluator, sometimes either answer can be argued to be better than the other, as in the examples in Appendix D.4. The gap between the crowd-vs-expert agreement rate and the expert-vs-expert agreement rate (5%-10%) is mostly attributed to crowd users making mistakes or overlooking factual errors in a model’s response. Overall, the agreement rates presented in Table 3 validate the decent quality of crowd-sourced votes in Chatbot Arena.
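For concreteness, an agreement rate between two sets of labels can be computed as the fraction of battles on which the two labelers cast the same vote. How tie votes enter the figures in Table 3 is not specified here, so the sketch below simply counts exact label matches; the label strings and the function name are hypothetical.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of battles on which two labelers gave the same vote."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same battles")
    if not labels_a:
        raise ValueError("need at least one battle")
    matches = sum(x == y for x, y in zip(labels_a, labels_b))
    return matches / len(labels_a)


crowd = ["model_a", "model_b", "model_a", "tie", "model_a"]
expert = ["model_a", "model_b", "model_b", "tie", "model_a"]
print(agreement_rate(crowd, expert))   # 0.8
```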

Experiments

Computing the rank on real data. In this section, we report results from our experiments on approximate ranking. For this experiment, we ran a replay of $T=213{,}576$ historical votes from our online platform and calculated the BT coefficients using our earlier-described estimation algorithm with confidence intervals; see Figure 5 for these intervals (with and without multiplicity correction; the formal notion of approximate ranking technically requires multiplicity correction, but it makes the intervals looser).

Evaluating the coverage of the intervals. A natural follow-up question is whether or not the intervals are doing their job correctly: whether they cover the true BT coefficients with probability at least (and almost exactly) $1-\alpha$. Of course, this cannot be evaluated on real data, so we run a simulation. A vector of BT coefficients is drawn, with each coordinate sampled i.i.d. from a distribution $\mathsf{beta}(1/\gamma,1/\gamma)$; we take $\gamma=2$ in Figure 6 (and we vary $\gamma$ in Appendix A). Given these coefficients, a dataset is synthesized, and the coverage and average width are computed for each of 20 trials. The results can be seen in Figure 6 for the uncorrected intervals. The coverage of the intervals behaves as expected, centering around $1-\alpha$, regardless of the number of models. Meanwhile, the more models are included, the larger the intervals become.

Evaluating the active sampling rule. Next, we discuss the evaluation of our active sampling rule in Equation (9) for win matrix estimation. We evaluate this sampling rule by taking the best-fit BT coefficients to our 213,576-point holdout set, and then sampling from that distribution using our active sampling algorithm. The results are displayed in Figure 7. It is hard to tell by looking at plots, but the improvement is substantial: to estimate $\theta^{*}$ to a precision of 0.2, random sampling needs 6,800 samples and adaptive sampling needs 4,400 samples; meanwhile, to estimate the score to a precision of 0.3, random sampling needs 17,200 samples and adaptive sampling needs 16,400 samples. Thus, the random baseline requires 54% and 5% more data to achieve the same level of precision, respectively. One can see from the plots in Figure 7 that these results are not cherry-picked: the sample efficiency of our method is better at all values on the horizontal axis.
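The sampling rule in Equation (9) can be sketched as follows. The dictionary layout, the function name, the requirement that every pair has been shown at least once, and the uniform fallback when all gains are zero are assumptions made for illustration; the variance and count values are hypothetical.

```python
import math


def active_sampling_probs(variances, counts):
    """Sampling probabilities proportional to the expected shrinkage in
    interval size: sqrt(var/n) - sqrt(var/(n+1)) for each pair.

    variances: {pair: estimated variance (diagonal covariance entry)}
    counts:    {pair: number of times the pair has been shown so far (>= 1)}
    """
    gains = {}
    for pair, var in variances.items():
        n = counts[pair]
        if n < 1:
            raise ValueError("each pair needs at least one observation")
        gains[pair] = math.sqrt(var / n) - math.sqrt(var / (n + 1))
    total = sum(gains.values())
    if total == 0.0:                       # degenerate case: fall back to uniform
        return {pair: 1.0 / len(gains) for pair in gains}
    return {pair: g / total for pair, g in gains.items()}


variances = {(0, 1): 0.25, (0, 2): 0.25}
counts = {(0, 1): 10, (0, 2): 1000}
probs = active_sampling_probs(variances, counts)
print(probs[(0, 1)] > probs[(0, 2)])       # True
```

With equal variances, nearly all of the probability mass goes to the pair with only 10 observations, since one more vote shrinks its interval far more than it would for a pair already seen 1,000 times. As a check on the percentages quoted above, 6,800 / 4,400 is about 1.545 and 17,200 / 16,400 is about 1.049.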

7.2 Anomalous Users Detection

We evaluate the outlier detection method in Section 5.1. We construct the evaluation set by manually identifying 25 anomalous users whose inputs are highly repetitive or meaningless (e.g., asking “hi” 100 times or inputting garbled text). We randomly sample 25 normal users with at least 50 votes, and inspect their input prompts to ensure no abnormal behaviors. As mentioned in Section 5.1, per user we compute five $M_{j}$ and identify the user as anomalous if $M_{j}\geq\chi^{2}_{2j,1-\alpha/5}$. We present results for two different values of $\alpha$ (i.e., the significance level) in Table 5. We find the detection method effective (e.g., reaching a 90% true positive rate and a 60-70% true negative rate). We inspect the false negative errors and find that they come from users who do not always behave abnormally, making them harder to detect.

Discussion

Limitations. Although our user base is extensive, we anticipate that it will primarily consist of LLM hobbyists and researchers who are eager to experiment with and evaluate the latest LLMs. This inclination may result in a biased distribution of users. Additionally, despite the wide array of topics encompassed by the prompts discussed in previous sections, the data predominantly comes from our online chat interface. This source might not accurately reflect the real-world usage of LLMs in production environments or specialized domains, potentially leading to a skewed prompt distribution. Moreover, our study concentrates on assessing the helpfulness of LLMs but overlooks their safety aspects. We recognize the possibility and necessity of a parallel mechanism to evaluate the safety of these models.

Future Directions. In our future work, we plan to develop comprehensive topic leaderboards and establish a dedicated section for multimodal and agent-based LLMs in more dynamic, gamified settings, catering to more complex tasks. We also believe our approach to detecting harmful users could be improved and made more formally rigorous by using the theory of nonnegative supermartingales and E-values (Howard et al., 2020; Waudby-Smith & Ramdas, 2020; Vovk & Wang, 2021; Ramdas et al., 2023); this would deal with the dependence, but the variants we tried did not perform well in terms of power.

Conclusion

In this paper, we present Chatbot Arena, an open platform for evaluating LLMs through crowdsourced, pairwise human preferences. We conduct an in-depth analysis of the crowdsourced user prompts and preference votes to validate the diversity and quality. We develop an efficient model sampling and ranking algorithm. Our dataset including 100K pairwise preference votes will be released for future research.

Acknowledgments

This project is supported by sponsorship from Kaggle, MBZUAI, a16z, Together AI, Anyscale, and HuggingFace. This project is also partly supported by Accenture, AMD, Google, IBM, Intel, Microsoft, Samsung SDS, SAP, Uber, and VMware. The authors would like to thank Siyuan Zhuang for insightful discussion and Tijana Zrnić for helpful feedback on the manuscript.

References

Appendix A Confidence Interval Simulation Study

We conduct a simulation study to evaluate the bootstrap confidence intervals versus the sandwich estimator. To a large extent, both intervals are the same—indeed, their intervals are often identical to the naked eye. Nonetheless, in our experiments, there are some differences. First, in Figure 13, we conduct a replay study using the same 213576 data points mentioned in the main text.

We also do a suite of experiments in simulation using the same $\mathsf{beta}$ generating process as in the main text, with $\gamma=2$. The result is shown in Figure 14; results are similar across many choices of the parameter $\gamma$ and the model strength, which indicates that both intervals will have good coverage and width in the practical conditions we would expose them to.

Appendix B The Nonparametric Bradley-Terry Model

Nonparametric Bradley-Terry. We next consider a nonparametric extension of the Bradley-Terry (BT) model (Bradley & Terry, 1952) to the case where the ranking is not necessarily transitive. Let $\mathcal{G}(m)$ denote the set of all paths to the model $m$, i.e.,

where $\mathcal{B}=\mathcal{A}\cup\{(a_{2},a_{1}):a\in\mathcal{A}\}$. Each element of $\mathcal{G}(m)$ is a chain of model pairings that leads to $m$; for example, if $m=5$ and $M=6$, one element of $\mathcal{G}(m)$ is $((1,2),(2,4),(4,3),(3,6),(6,5))$. Our score function is given by the average path-sum of the log odds of the second model winning, over the entirety of $\mathcal{G}(m)$:

$$s(\theta)_{m}=\frac{1}{|\mathcal{G}(m)|}\sum_{g\in\mathcal{G}(m)}\left(\log\frac{\theta^{\prime}((1,g_{1,1}))}{1-\theta^{\prime}((1,g_{1,1}))}+\sum_{a\in g}\log\frac{\theta^{\prime}(a)}{1-\theta^{\prime}(a)}\right), \tag{12}$$

where $\theta^{\prime}(a)=\theta(a)\mathds{1}\left\{a\in\mathcal{A}\right\}+(1-\theta((a_{2},a_{1})))\mathds{1}\left\{a\notin\mathcal{A}\right\}$, with the convention that $\theta((m,m))=1/2$ for all $m$. Note that for any $g\in\mathcal{G}(m)$ where $a\in g$ and $m\notin a$, we also have some $g^{\prime}\in\mathcal{G}(m)$ such that $(a_{2},a_{1})\in g^{\prime}$. Meanwhile, if $a\in g$ and $m\in a$, then $a=(m^{\prime},m)$ for some $m^{\prime}$. Thus, we can compute

\begin{align}
s(\theta)_{m} &= \sum_{\substack{a\in\mathcal{A}\\ m\notin a}}\frac{1}{2}\left(\log\frac{\theta^{\prime}(a)}{1-\theta^{\prime}(a)}+\log\frac{\theta^{\prime}((a_{2},a_{1}))}{1-\theta^{\prime}((a_{2},a_{1}))}\right)+\sum_{m^{\prime}\in[M]\setminus\{m\}}\left(\log\frac{\theta^{\prime}((m^{\prime},m))}{1-\theta^{\prime}((m^{\prime},m))}+\log\frac{\theta^{\prime}((1,m^{\prime}))}{1-\theta^{\prime}((1,m^{\prime}))}\right) \tag{13}\\
&= \sum_{\substack{a\in\mathcal{A}\\ m\notin a}}\frac{1}{2}\left(\log\frac{\theta(a)}{1-\theta(a)}+\log\frac{1-\theta(a)}{\theta(a)}\right)+\sum_{m^{\prime}\in[M]\setminus\{m\}}\left(\log\frac{\theta^{\prime}((m^{\prime},m))}{1-\theta^{\prime}((m^{\prime},m))}+\log\frac{\theta^{\prime}((1,m^{\prime}))}{1-\theta^{\prime}((1,m^{\prime}))}\right) \tag{14}\\
&= \sum_{m^{\prime}\in[M]\setminus\{m\}}\left(\log\frac{\theta^{\prime}((m^{\prime},m))}{1-\theta^{\prime}((m^{\prime},m))}+\log\frac{\theta^{\prime}((1,m^{\prime}))}{1-\theta^{\prime}((1,m^{\prime}))}\right) \tag{15}\\
&= \sum_{m^{\prime}\in[M]\setminus\{m\}}\left((1-2\mathds{1}\left\{m^{\prime}>m\right\})\log\frac{\theta((m^{\prime},m))}{1-\theta((m^{\prime},m))}+\log\frac{\theta((1,m^{\prime}))}{1-\theta((1,m^{\prime}))}\right). \tag{16}
\end{align}

This score is always well-defined, and is a simple, smooth function of $\theta$. Its derivative is, for all $a\in\mathcal{A}$,

$$\frac{\partial}{\partial\theta(a)}s(\theta)_{m}=\mathds{1}\left\{a_{2}=m\right\}(1-2\mathds{1}\left\{a_{1}>m\right\})\frac{1}{\theta(a)(1-\theta(a))}+\mathds{1}\left\{a_{1}=1,\ a_{2}\neq m\right\}\frac{1}{\theta(a)(1-\theta(a))}. \tag{17}$$

How is the BT score related to the original Bradley-Terry model? In the original Bradley-Terry model, $H_{t}\in\{0,1\}$, and the probability of model $m$ beating model $m^{\prime}$ is assumed to be given by

$$\theta((m^{\prime},m))=\frac{e^{\xi_{m}}}{e^{\xi_{m}}+e^{\xi_{m^{\prime}}}}, \tag{18}$$

for some unknown parameters $\xi_{1},\ldots,\xi_{M}$, the Bradley-Terry coefficients. The basic goal of the Bradley-Terry model is to estimate these parameters from the observed outcomes. In our setting, however, we use the outcomes to get a CLT on $\theta$, and then can immediately recover the coefficients. Taking without loss of generality $\xi_{1}=0$, we have that

\begin{align}
\log\frac{\theta((1,m^{\prime}))}{1-\theta((1,m^{\prime}))}+\log\frac{\theta((m^{\prime},m))}{1-\theta((m^{\prime},m))} &= \log\frac{\theta((1,m^{\prime}))}{\theta((m^{\prime},1))}+\log\frac{\theta((m^{\prime},m))}{\theta((m,m^{\prime}))} \tag{19}\\
&= \log\frac{e^{\xi_{m^{\prime}}}(e^{\xi_{m^{\prime}}}+1)}{e^{\xi_{m^{\prime}}}+1}+\log\frac{e^{\xi_{m}}(e^{\xi_{m^{\prime}}}+e^{\xi_{m}})}{e^{\xi_{m^{\prime}}}(e^{\xi_{m^{\prime}}}+e^{\xi_{m}})} \tag{20}\\
&= \xi_{m^{\prime}}+\xi_{m}-\xi_{m^{\prime}}=\xi_{m}. \tag{21}
\end{align}

Thus, all the sums over paths in (12) are equal to $\xi_{m}-\xi_{g_{1,1}}$.

\begin{align}
& \log\frac{\theta^{\prime}((1,g_{1,1}))}{1-\theta^{\prime}((1,g_{1,1}))}+\sum_{a\in g}\log\frac{\theta^{\prime}(a)}{1-\theta^{\prime}(a)} \tag{22}\\
={} & \xi_{g_{1,1}}+\xi_{g_{1,2}}-\xi_{g_{1,1}}+\xi_{g_{2,2}}-\xi_{g_{2,1}}+\cdots+\xi_{g_{M-1,2}}-\xi_{g_{M-1,1}} \tag{23}\\
={} & \xi_{g_{M-1,2}}=\xi_{m}. \tag{24}
\end{align}

Thus, if the parametric BT model is well-specified, the nonparametric version will exactly recover the Bradley-Terry coefficients. However, our nonparametric analogue of the BT model has major advantages over the original: it will retain statistical validity even if $H_{t}$ is not binary, if the win rate is non-transitive, and if the logistic model assumed by the BT model is misspecified. In practice, the nonparametric BT coefficient can be easily computed by (16).
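The recovery property can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the sum in (16) on win probabilities generated from (18) and compares the result with the coefficients that generated them. The function names, the example coefficients, and the division by $M-1$ (used here so that the output is on the scale of $\xi_{m}$) are illustrative assumptions, and the win rates are passed as a function of ordered pairs that plays the role of $\theta^{\prime}$.

```python
import math


def log_odds(p):
    """Log odds log(p / (1 - p)) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def nonparametric_bt_score(theta_prime, m, num_models):
    """Evaluate the sum in (16) for model m, averaged over its M - 1 terms.

    theta_prime(a1, a2) is assumed to return the probability that the second
    model a2 beats the first model a1, with theta_prime(k, k) == 0.5.
    Models are labeled 1..num_models. Dividing by (num_models - 1) is an
    assumption made here so that the result is on the scale of xi_m.
    """
    total = 0.0
    for other in range(1, num_models + 1):
        if other == m:
            continue
        total += log_odds(theta_prime(other, m)) + log_odds(theta_prime(1, other))
    return total / (num_models - 1)


def make_bt_theta(xi):
    """Win probabilities from (18) for hypothetical coefficients xi (model -> xi)."""
    def theta_prime(a1, a2):
        return math.exp(xi[a2]) / (math.exp(xi[a2]) + math.exp(xi[a1]))
    return theta_prime


# Hypothetical coefficients with xi_1 = 0, matching the normalization above.
xi = {1: 0.0, 2: 0.7, 3: -0.4, 4: 1.3}
theta_prime = make_bt_theta(xi)
scores = {m: nonparametric_bt_score(theta_prime, m, len(xi)) for m in xi}
for m in xi:
    print(m, round(scores[m], 6), xi[m])
```

Under a well-specified BT model the two printed columns should agree up to floating-point error; with win rates that do not follow (18), the same function still returns a finite score as long as every probability lies strictly between 0 and 1.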

Appendix C Valid P-Value

$$p_{i}=\frac{1}{|\mathcal{H}_{A^{\prime}_{i}}|+1}\left(1+\sum_{h\in\mathcal{H}_{A^{\prime}_{i}}}\mathds{1}\left\{h\geq H^{\prime}_{i}\right\}\right). \tag{25}$$

We will prove that this p-value is valid, i.e., that $\mathbb{P}(p_{i}\leq t)\leq t$, under the null hypothesis that the vector $\mathcal{H}^{\prime}=(H_{t}:A_{t}=A^{\prime}_{i})\,\|\,(H^{\prime}_{i})$ is exchangeable, where $\|$ denotes the concatenation operator. First, notice that we can equivalently write

We also have that $\mathbb{P}(p_{i}\leq t)\leq\mathbb{P}(p_{i}\leq\frac{k}{|\mathcal{H}^{\prime}|})$, where $k=\lfloor t|\mathcal{H}^{\prime}|\rfloor/|\mathcal{H}^{\prime}|$. Then, since the data points are exchangeable, we have that $H^{\prime}_{i}$ is uniformly distributed among the ranks of $\mathcal{H}^{\prime}$, so $\mathbb{P}(p_{i}\leq\frac{k}{|\mathcal{H}^{\prime}|})\leq\frac{k}{|\mathcal{H}^{\prime}|}\leq t$, completing the proof.
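For concreteness, the p-value in (25) can be computed in a few lines. This is a minimal sketch under stated assumptions: the function name and the example votes are hypothetical, `history` stands for the historical outcomes observed for the same model pair, and `new_vote` stands for the new user's outcome.

```python
def exchangeability_p_value(history, new_vote):
    """Compute (25): (1 + #{h in history : h >= new_vote}) / (len(history) + 1).

    history is a list of past outcomes for the same model pair and new_vote is
    the outcome under test; both names are illustrative, not from the paper.
    """
    count_at_least = sum(1 for h in history if h >= new_vote)
    return (1 + count_at_least) / (len(history) + 1)


# Hypothetical binary votes: the new vote is matched by 2 of 4 past votes.
p = exchangeability_p_value([0, 0, 1, 1], 1)
print(p)
```

The added 1 in the numerator and denominator counts the new vote itself, so the p-value is never zero, and an empty history returns 1.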

Appendix D Sample Prompts

Example 1: An example user prompt from the cluster "Python Game Programming Challenge". This prompt distinctively separates GPT-4-0613 performance from Llama-2-70b-chat. The user requests "higher quality code" from each assistant. While GPT-4-0613 provides a correct implementation along with a helpful explanation, Llama-2-70b-chat fails to provide any program and gives a less helpful response overall.

<|User Prompt|>
Develop a snowman game similar to the Hangman game but draws a snowman figure standing on the snow instead of a hangman. It also downloads (or uses library) the complete English dictionary to select words for three levels of difficulty. Think step-by-step, take a deep breath and choose a higher quality code for each step.

<|GPT-4-0613’s Answer|>
Sure, here’s a step-by-step guide on how to create a Snowman game similar to Hangman. We’ll use Python for this example.

Step 1: Import necessary libraries
```python
import random
import urllib.request
```

Step 2: Download the English dictionary
```python
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"
urllib.request.urlretrieve(url, 'words.txt')
```

Step 3: Load the words from the dictionary and categorize them into three levels of difficulty
```python
with open('words.txt') as word_file:
    english_words = word_file.read().split()

easy_words = [word for word in english_words if 4 <= len(word) <= 6]
medium_words = [word for word in english_words if 7 <= len(word) <= 9]
hard_words = [word for word in english_words if len(word) >= 10]
```

Step 4: Create a function to randomly select a word based on the difficulty level
```python
def get_random_word(difficulty):
    if difficulty == 'easy':
        word = random.choice(easy_words)
    elif difficulty == 'medium':
        word = random.choice(medium_words)
    else:
        word = random.choice(hard_words)
    return word.upper()
```

Step 5: Create a function to print the snowman
```python
def print_snowman(guesses):
    snowman = [' _____ ', ' ( ) ', ' ( ( ) ) ', '(___(___)_)', ' ( ) ', ' ( ) ', ' ( ) ', ' ( ) ', '( )', ' _______ ']
    for i in range(guesses):
        print(snowman[i])
```

Step 6: Create the main function for the game
```python
def play_snowman():
    difficulty = input("Choose difficulty (easy, medium, hard): ")
    word = get_random_word(difficulty)
    guessed_letters = []
    guesses = 0
    print('The word contains', len(word), 'letters.')
    while guesses < 10:
        guess = input("Guess a letter: ").upper()
        if guess in guessed_letters:
            print("You already guessed that letter.")
        elif guess in word:
            guessed_letters.append(guess)
            print("Good job! The word contains that letter.")
        else:
            guesses += 1
            print_snowman(guesses)
            print("Sorry, the word does not contain that letter.")
        print(' '.join(letter if letter in guessed_letters else '_' for letter in word))
        if set(guessed_letters) == set(word):
            print("Congratulations! You've successfully guessed the word.")
            break
    else:
        print("Sorry, you didn't guess the word. The word was", word)
```

Step 7: Run the game
```python
play_snowman()
```

This is a simple version of the game. You can enhance it by adding more features like scoring, time limit, etc.

<|Llama-2-70b-chat’s Answer|>
Sure! Here’s a step-by-step guide to developing a snowman game similar to Hangman, using the complete English dictionary for three levels of difficulty.

Step 1: Setting up the project
* Create a new Python project directory and initialize it with Git.
* Install the necessary libraries:
  + `numpy` for generating random numbers
  + `nltk` for accessing the English dictionary
  + `pygame` for creating the game graphics
  + `pydub` for playing sound effects (optional)
* Download the English dictionary dataset (e.g., from Kaggle) and store it in a file named `dictionary.txt`.

Step 2: Loading the dictionary and generating words
* Use the `nltk` library to read the dictionary file and store the words in a list.
* Split the list into three parts: easy, medium, and hard words, based on their length and frequency of use.
* Shuffle the lists to randomize the order of the words.

Step 3: Creating the game graphics
* Use `pygame` to create a window for the game, with a white background to represent the snow.
* Draw a snowman figure on the background, using black rectangles for the body, hat, and scarf.
* Add buttons for the player to submit their guesses.

Step 4: Implementing the game logic
* Create a function to generate a random word from the appropriate difficulty level and display it on the screen with missing letters (e.g., _ _ _ _ _ _).
* Create a function to check if a guessed letter appears in the word, and update the display accordingly (e.g., _ _ _ X _ _).
* Create a function to handle incorrect guesses, such as displaying a warning message or playing a sound effect.
* Create a function to win the game, such as displaying a congratulations message or playing a victory sound effect.

Step 5: Adding sound effects (optional)
* Use `pydub` to play sound effects for various actions in the game, such as guessing a letter correctly or incorrectly.
* Consider using a library like `pygame_mixer` to simplify the process of playing sound effects in pygame.

Step 6: Testing and refining the game
* Test the game thoroughly to ensure that it works as expected, with no bugs or errors.
* Refine the game mechanics and graphics as needed, based on player feedback or personal preference.
* Consider adding additional features, such as keeping track of the player’s score or implementing a time limit for each guess.

Step 7: Packaging and distributing the game
* Package the game as a standalone executable file using a tool like PyInstaller.
* Distribute the game to friends, family, or online communities, and gather feedback for future improvements.

This is a high-level overview of the steps involved in creating a snowman game similar to Hangman. Of course, the details will depend on your specific implementation and the tools you choose to use. Good luck with your project!

<|GPT-4-Turbo’s Judgment Verdict|>
GPT-4-0613 Wins

Program Output: Below is the terminal output of the snowman game programmed by GPT-4-0613, demonstrating its correctness. It is clear that GPT-4-0613’s response is more helpful and relevant to the user’s request.

>>> python snowman.py
Choose difficulty (easy, medium, hard): easy
The word contains 6 letters.
Good job! The word contains that letter.
_ A _ _ _ _
Guess a letter: e
Good job! The word contains that letter.
_ A _ _ _ E
Guess a letter: d
_____ ( )
Sorry, the word does not contain that letter.
_ A _ _ _ E
Guess a letter: g
_____ ( ) ( ( ) )

Example 2: An example user prompt from the cluster "Python Game Programming Challenge". This prompt distinctively separates GPT-4-0613 performance from Llama-2-70b-chat. Due to limited space, the code from each assistant’s response is omitted.

<|User Prompt|>
Write a python function that prints a christmas tree in the form of a triangle of asterisks, where the only input is size.

<|GPT-4-0613’s Code Output|>
>>> print_christmas_tree(5)
*
***
*****
*******
*********

<|Llama-2-70b-chat’s Code Output|>
>>> print_christmas_tree(5)
* * * * * * * * * * * * * * *

<|GPT-4-Turbo Verdict|>
GPT-4-0613 Wins

Example 3: An example user prompt from the cluster "Movie Recommendations & Rating". Both assistants retrieve the correct movie information and give almost identical explanations. Therefore, this prompt cannot distinguish GPT-4-0613’s performance from Llama-2-70b-chat.

<|User Prompt|>
Which film came out first, Blind Shaft or The Mask Of Fu Manchu?

<|GPT-4-0613’s Answer|>
The Mask of Fu Manchu was released in 1932, and Blind Shaft was released in 2003. Therefore, The Mask of Fu Manchu came out first.

<|Llama-2-70b-chat’s Answer|>
The Mask Of Fu Manchu came out first. It was released in 1932, while Blind Shaft was released in 2003.

<|GPT-4-Turbo Verdict|>
Tie

Example 4: An example user prompt from the cluster "Travel Itinerary Planning". Both assistants provide an equally helpful itinerary. Therefore, this prompt cannot distinguish GPT-4-0613’s performance from Llama-2-70b-chat.

D.2 Arena Bench Prompt

Example 1: A prompt from Arena Bench requiring a very high level of complex problem-solving skill and reasoning while adhering to real-world application.

Create a flutter app for habit tracking that tracks daily habits for a user. The user should be able create multiple daily tasks that he wants to track. And he should be able to group the habits into a set of groups. One the user completes a task, he is rewarded a number of points per task. The app should have a page that summarize the total score per group. And the aggregate score of all the groups of habits. This flutter app needs to be compilable for both andriod and iOS.

Example 2: A prompt from Arena Bench requiring a very high level of complex problem-solving skill and reasoning while adhering to real-world application.

D.3 Arena Bench System Prompt

The novel evaluation procedure is as follows: we prompt GPT-4-Turbo with the system prompt displayed below alongside a user prompt, a reference answer, and two assistants’ answers. To obtain the reference answer, we present the user prompt together with the answers of three assistants (GPT-4-Turbo, GPT-4-0314, and Claude-1) to GPT-4-Turbo and ask it to generate an answer to the prompt. To ensure consistent pairwise judgment, we set up GPT-3.5-Turbo-0301 as the baseline answer for all models to be compared against. To avoid positional bias, we conduct two judgments per prompt: the first judgment presents the baseline answer as Assistant A, while the second judgment presents the baseline answer as Assistant B. In total, we conduct 700 pairwise comparisons between each model and GPT-3.5-Turbo-0301 across 350 user prompts to calculate a win-rate against the baseline. Then we project the win-rate onto a scale from 0 to 10 by assigning wins a score of 10, ties a score of 5, and losses a score of 0. Further, we count a significant win or loss as 3 wins or 3 losses, respectively, and keep every other verdict as a single win, loss, or tie. Finally, we calculate the final score by averaging across the wins, losses, and ties.
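The scoring rule at the end of this procedure can be sketched as follows. This is an illustration under assumptions rather than the released evaluation code: the verdict labels (`significant_win`, `win`, `tie`, `loss`, `significant_loss`) and the function name are hypothetical, and a significant verdict is expanded into three copies of the corresponding score before averaging.

```python
# Hypothetical verdict labels mapped to the scores they contribute.
# A significant win or loss counts as three wins or three losses.
VERDICT_SCORES = {
    "significant_win": [10, 10, 10],
    "win": [10],
    "tie": [5],
    "loss": [0],
    "significant_loss": [0, 0, 0],
}


def arena_bench_score(verdicts):
    """Average the per-verdict scores on the 0-to-10 scale described above."""
    expanded = [score for verdict in verdicts for score in VERDICT_SCORES[verdict]]
    if not expanded:
        raise ValueError("at least one verdict is required")
    return sum(expanded) / len(expanded)


# Example: one significant win, one tie, and one loss against the baseline.
example = arena_bench_score(["significant_win", "tie", "loss"])
print(example)
```

In this example the expanded scores are 10, 10, 10, 5, and 0, so the average is 7.0. In the setting described above, the list passed to the function would hold the 700 verdicts of one model against the baseline.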

D.4 Arena Battles with No Ground Truth

Example 1: An example of an Arena Battle between GPT-4-Turbo and Llama-2-13B-chat, where depending on different preferences, either answer can be argued as better than the other. While both answers are correct, GPT-4-Turbo’s answer is concise and appropriate to the multiple-choice format of the question. On the other hand, Llama-2-13B-chat provides a helpful explanation for the multiple-choice answer.

<|User Prompt|>
Multiple Choice Question
All plants have both a haploid and a diploid form. As the plant changes from one form to another, this is called:
Multiple choice question.
alternative life cycle
succession
an alternation of generations

<|GPT-4-Turbo’s Answer|>
an alternation of generations

<|Llama-2-13B-chat’s Answer|>
The correct answer is: an alternation of generations. Plants, like other organisms, have different life cycle stages, and they can exist in two different forms: haploid and diploid. The haploid form is a single set of chromosomes, while the diploid form is two sets of chromosomes. As the plant changes from one form to another, this is called an alternation of generations. This process involves the alternation of haploid and diploid phases, and it is a fundamental aspect of the life cycle of plants. Alternative life cycle and succession are not correct descriptions of this process, as they refer to different concepts. Alternative life cycle refers to the different stages of life an organism may go through, such as larva, pupa, and adult in insects, while succession refers to the process of one species replacing another in a particular environment or niche.

Example 2: An example of an Arena Battle between GPT-4-Turbo and Llama-2-13B-chat, where depending on different preferences, either answer can be argued as better than the other.

<|User Prompt|>
What are critical elements of a Product Mindset for startup founders?

<|GPT-4-Turbo’s Answer|>
Adopting a Product Mindset is crucial for startup founders as it helps them focus on creating products that deliver value to customers while achieving business objectives. Here are some critical elements of a Product Mindset for startup founders:

1. Customer Centricity:
   - Understand and empathize with your customers’ needs, problems, and motivations.
   - Continuously gather and incorporate customer feedback into the product development process.
   - Prioritize features and improvements that directly address customer pain points and enhance their experience.
2. Value Proposition:
   - Clearly define the unique value your product offers to the market.
   - Align your product’s features and capabilities with its value proposition to ensure it meets the expectations of your target audience.
3. Problem-Solving Focus:
   - Identify and focus on solving real problems for users rather than just creating features for the sake of innovation.
   - Validate assumptions and hypotheses through experimentation and user testing.
4. Iterative Development:
   - Embrace an agile development approach, where you build, measure, and learn in iterative cycles.
   - Be open to pivoting or making significant changes based on feedback and market response.
5. Data-Driven Decision Making:
   - Use data to inform decisions about product development, marketing, and growth strategies.
   - Establish key performance indicators (KPIs) to measure the success of the product and track progress over time.

…

By internalizing these elements, startup founders can develop a strong Product Mindset that guides their decisions and strategies, ultimately leading to a product that customers love and that contributes to the startup’s success.

<|Llama-2-13B-chat’s Answer|>
A product mindset is a way of thinking and approaching product development that is critical for startup founders. Here are some essential elements of a product mindset for startup founders:

1. Customer-centricity: A product mindset starts with understanding the customer’s needs, desires, and pain points. Founders must be obsessed with their customers and be willing to put in the effort to understand their needs.
2. Iteration and experimentation: A product mindset is all about experimentation and iteration. Founders must be willing to try new things, test hypotheses, and iterate on their product based on customer feedback.
3. Focus on value: A product mindset is focused on delivering value to the customer. Founders must be able to identify and prioritize the most important features and functionality that deliver the most value to their customers.
4. Collaboration: A product mindset requires collaboration between cross-functional teams, including engineering, design, and product management. Founders must be able to work effectively with these teams to develop a product that meets customer needs.
5. Data-driven decision making: A product mindset is data-driven, and founders must be comfortable using data to make informed decisions about their product. This includes analyzing customer feedback, usage data, and other metrics to inform product development.

…

By embracing these elements of a product mindset, startup founders can develop a product that meets customer needs, delivers value, and sets their company up for long-term success.