Generative AI has been transforming the way we interact with technology and consume content. In this talk, I will introduce two of my recent research projects to showcase how generative models can be applied to music and audio. First, I will present the Multitrack Music Transformer project, where we aimed to generate orchestral music using a multi-dimensional transformer model and a new compact representation for multi-instrument music. Second, I will present the CLIPSonic project, where we tackled text-to-audio generation without using any text-audio pairs, leveraging the visual modality as a bridge to learn the desired text-audio correspondence with pretrained vision-language models and diffusion models. Finally, I will close the talk by sharing my view on the future of generative AI in music and audio.
Hao-Wen (Herman) Dong is a PhD candidate in Computer Science at the University of California San Diego, where he works on Music x AI research with Julian McAuley and Taylor Berg-Kirkpatrick. He is broadly interested in music generation, audio synthesis, and machine learning for music and audio. He has collaborated with researchers at Adobe, Dolby, Amazon, Sony, and Yamaha through internships. Prior to his PhD, he was a research assistant at Academia Sinica, working with Yi-Hsuan Yang. He received his bachelor's degree in Electrical Engineering from National Taiwan University and his master's degree in Computer Science from UC San Diego. Herman's research has been recognized by the ICASSP Rising Stars in Signal Processing, UCSD GPSA Interdisciplinary Research Award, Taiwan Government Scholarship to Study Abroad, J. Yang Scholarship, and UCSD ECE Department Fellowship. For more information, please visit his personal website (https://salu133445.github.io/).