【2025-09-08】Prof. William Wang / Who evaluates evaluations in GenAI

  • 2025-08-04
  • 黃雅群 (acting staff)
Title: Who evaluates evaluations in GenAI?
Date: 2025/9/8 15:30-17:00
Location: R103, CSIE
Speaker: Prof. William Wang
Host: Prof. Yun-Nung (Vivian) Chen


Abstract:
Reliable evaluation is the bottleneck that now constrains progress in generative AI. In this talk I will survey recent work from my group that asks a deceptively simple question: who evaluates the evaluators? I will begin with the text‑to‑image domain, where faithfulness metrics such as CLIPScore and TIFA are widely used yet often yield contradictory judgments. To expose these weaknesses we introduce T2IScoreScore, a meta‑evaluation suite that probes metrics along controlled “semantic‑error graphs.” Our analysis shows that many sophisticated vision‑language‑model (VLM) based metrics fail to outperform far simpler baselines, highlighting the need for objective stress‑tests before a metric is adopted in practice.
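As a concrete reference point for the kind of metric under scrutiny, the sketch below shows how CLIPScore is commonly computed: a scaled, clipped cosine similarity between CLIP image and text embeddings (the w = 2.5 weighting follows the original CLIPScore paper by Hessel et al., 2021). The checkpoint name and "photo.jpg" path are illustrative placeholders; this is not the speaker's evaluation code.

```python
# Minimal CLIPScore sketch (assumes: pip install torch transformers pillow).
# Illustrative only; checkpoint and image path are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Cosine similarity between L2-normalized embeddings.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum().item()
    # CLIPScore (Hessel et al., 2021): w * max(cos, 0), with w = 2.5.
    return 2.5 * max(cos, 0.0)

print(clip_score(Image.open("photo.jpg"), "a red cube on a blue sphere"))
```

Because the score collapses the whole image–caption relationship into a single cosine similarity, it can miss fine-grained errors (wrong counts, swapped attributes), which is exactly the kind of failure mode a meta-evaluation suite like T2IScoreScore is designed to expose.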

I then extend the discussion to multimodal large language models, presenting WildVision, an open evaluation ecosystem comprising a crowd‑sourced Arena and a fast offline Bench. WildVision aligns closely with human preferences (a Spearman correlation of 0.94) while revealing persistent model blind spots in spatial reasoning, subtle context, and safety under provocation.
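For readers unfamiliar with how such an agreement figure is typically reported, the sketch below shows one common recipe: score a set of models with an offline benchmark, rate the same models from human (arena-style) preferences, and take the Spearman rank correlation between the two. All numbers here are invented for illustration; this is not WildVision's actual code or data.

```python
# Illustrative only: benchmark-vs-human rank agreement via Spearman's rho.
# All scores below are hypothetical, made up for this sketch.
from scipy.stats import spearmanr

models       = ["model_a", "model_b", "model_c", "model_d", "model_e"]
bench_scores = [78.2, 71.5, 69.0, 55.3, 48.9]   # hypothetical offline Bench scores
human_elo    = [1210, 1185, 1150, 1020, 995]    # hypothetical Arena preference ratings

rho, p_value = spearmanr(bench_scores, human_elo)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near 1.0 (e.g., the 0.94 reported for WildVision) means the offline
# benchmark ranks models almost identically to human preference.
```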

Taken together, these results argue that rigorous, transparent, and human‑centric benchmarks are essential if academia and industry are to steer GenAI responsibly. I will close with open challenges and opportunities for collaboration with the NTU community, from new meta‑evaluation paradigms to trustworthy deployment of agentic AI systems.

Biography:
William Wang is the Director of UC Santa Barbara's Natural Language Processing Group and Center for Responsible Machine Learning. He is the Duncan and Suzanne Mellichamp Professor of Artificial Intelligence and Designs in the Department of Computer Science at the University of California, Santa Barbara, and he is also the Founder and CEO of ChipAgents.ai. He received his PhD from the School of Computer Science at Carnegie Mellon University. He has broad interests in Artificial Intelligence, including statistical relational learning, information extraction, computational social science, dialogue and generation, and vision. He has published more than 200 papers at leading NLP/AI/ML conferences and journals, and his honors include several best paper awards and nominations, the IEEE SPS Laplace Award (2024), a DARPA Young Faculty Award (Class of 2018), an IEEE AI's 10 to Watch Award (Class of 2020), an NSF CAREER Award (2021), the British Computer Society Karen Spärck Jones Award (2022), and the 2023 CRA-E Undergraduate Research Faculty Mentoring Award. His work and opinions have appeared in major tech media outlets such as Wired, VICE, Scientific American, Fortune, Fast Company, and NASDAQ.