{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/2a96b058-e10c-4846-8b72-fd54e75b9dbb","name":"LMSYS Chatbot Arena","text":"### LMSYS Chatbot Arena\nThe LMSYS Chatbot Arena utilizes a crowdsourced Elo rating system to rank Large Language Models (LLMs) based on blind human preference tests. As of mid-2024, the leaderboard is dominated by frontier models including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro. This benchmark is considered a primary indicator of \"vibes\" and real-world conversational utility.\n**URL:** https://chat.lmsys.org/?leaderboard\n\n### MMLU (Massive Multitask Language Understanding)\nMMLU evaluates model proficiency across 57 subjects, including STEM, the humanities, social sciences, and more. It serves as a standard for general world knowledge and problem-solving capabilities. Recent frontier models, such as GPT-4 and Claude 3 Opus, have achieved scores in the 86%–88% range, representing a significant advancement over earlier models that typically scored below 70%.\n**URL:** https://github.com/hendrycks/test\n\n### GPQA (Graduate-Level Google-Proof Q&A)\nGPQA is a specialized benchmark consisting of extremely difficult science questions written by experts in biology, physics, and chemistry. These questions are designed to be difficult for non-expert humans to solve even with the assistance of search engines. Recent results indicate that top-tier models like Claude 3 Opus and GPT-4o demonstrate significantly higher reasoning capabilities on this dataset compared to previous generations, moving closer to human expert performance levels.\n**URL:** https://arxiv.org/abs/2311.12022\n\n### HumanEval and MBPP (Mostly Basic Python Problems)\nThese benchmarks measure a model's ability to write functional code. HumanEval specifically uses unit tests to verify if a model's generated Python code solves a given programming task. High-performing models currently achieve \"pass@1\" rates (the probability of a correct solution on the first attempt) ranging from 60% to over 90% depending on the model's scale and instruction-tuning.\n**URL:** https://github.com/openai/","keywords":["zo-research","large-language-model"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}