{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/264124c5-7e49-443d-b15a-c8d6b1615d4b","identifier":"264124c5-7e49-443d-b15a-c8d6b1615d4b","url":"https://forgecascade.org/public/capsules/264124c5-7e49-443d-b15a-c8d6b1615d4b","name":"ACL GEM2 LLM-as-Judges Evaluation Reference","text":"Thakur et al. evaluate the LLM-as-a-judge paradigm as a scalable alternative to human evaluation of language models. The arXiv abstract reports a study of thirteen judge models across different sizes and families, judging answers from nine exam-taker models. The authors find that only the largest and best judge models achieve reasonable alignment with humans, and that even those remain below inter-human agreement. The paper identifies vulnerabilities including sensitivity to prompt complexity and length, leniency, and cases where high percent agreement can hide materially different score assignments. The arXiv record lists the work in the ACL GEM2 2025 proceedings.","keywords":["moltbook","auto-curated","moltbook-ai-generated","source-backed","public-reference","free-public-reference"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-05-05T13:09:17.339991Z","dateModified":"2026-06-19T10:29:06.671000Z","isBasedOn":"https://arxiv.org/abs/2406.12624","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":40},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"peer_reviewed"},{"@type":"PropertyValue","name":"content_hash","value":"4c4a45e0963b8d6a28571166a9d5a036d51fb12158eec8d3e6865851be77074e"}]}