【2025-12-19】Prof. Lily Weng / Towards Trustworthy AI: Automated Interpretability, Adversarial Robustness, and AI safety

  • Posted 2025-11-11 by 黃雅群 (acting)
Title: Towards Trustworthy AI: Automated Interpretability, Adversarial Robustness, and AI Safety
Date: 2025/12/19 10:20
Location: R107, CSIE
Speaker: Prof. Lily Weng
Host: Prof. Hsuan-Tien Lin


Abstract:
Deep learning models have become remarkably powerful, but they often operate as black boxes. In this talk, I will share how my lab is making these systems more transparent, reliable, and trustworthy. I'll highlight three research directions that bring interpretability into deep learning: (1) automated tools [1-4] that reveal, at scale, what neural networks learn internally; (2) inherently interpretable model architectures [5-8] that make a model's decision process more understandable and controllable; and (3) evaluation frameworks [9-12] that quantify interpretability and enable trust. I'll also touch on our recent work [13-16] on jailbreak attacks against LLMs, robustness verification, and robust learning for safer AI deployment. Together, these efforts aim to move modern AI beyond accuracy, toward systems we can truly understand, align, and trust. For more details, please visit https://lilywenglab.github.io/.
Reference:
[1] Oikarinen and Weng, CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks, ICLR 23 (spotlight)
[2] Oikarinen and Weng, Linear Explanations for Individual Neurons, ICML 24
[3] Bai, Iyer, et al., Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models, TMLR 25
[4] Wu, Lin, et al., AND: Audio Network Dissection for Interpreting Deep Acoustic Models, ICML 24
[5] Oikarinen et al., Label-Free Concept Bottleneck Models, ICLR 23
[6] Srivastava, Yan, et al., VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance, NeurIPS 24
[7] Sun et al., Concept Bottleneck Large Language Models, ICLR 25
[8] Kulkarni et al., Interpretable Generative Models through Post-hoc Concept Bottlenecks, CVPR 25
[9] Oikarinen et al., Evaluating Neuron Explanations: A Unified Framework with Sanity Checks, ICML 25
[10] Oikarinen et al., Rethinking Crowd-Sourced Evaluation of Neuron Explanations, arXiv preprint 25
[11] Li et al., Effective Skill Unlearning through Intervention and Abstention, NAACL 25
[12] Sun et al., ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models, EMNLP 25
[13] Sun et al., Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities, NAACL 25 (oral)
[14] Kulkarni et al., Interpretability-Guided Test-Time Adversarial Defense, ECCV 24
[15] Sun et al., Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents, ICML 24
[16] Yan et al., Provably Robust Conformal Prediction with Improved Efficiency, ICLR 24

Biography:
Lily Weng is an Assistant Professor in the Halıcıoğlu Data Science Institute at UC San Diego, with an affiliation in the CSE department. She received her PhD in Electrical Engineering and Computer Science (EECS) from MIT in August 2020, and her Bachelor's and Master's degrees, both in Electrical Engineering, from National Taiwan University. Prior to UCSD, she spent a year at the MIT-IBM Watson AI Lab and held research internships at Google DeepMind, IBM Research, and Mitsubishi Electric Research Lab. Her research interests are in machine learning and deep learning, with a primary focus on trustworthy AI. Her vision is to make the next generation of AI systems and deep learning algorithms more robust, reliable, explainable, trustworthy, and safe. Her work has been recognized and supported by several NSF awards, an ARL award, the Intel Rising Star Faculty Award, a Hellman Fellowship, and an NVIDIA Academic Award. For more details, please see https://lilywenglab.github.io/.