
# [2018-12-21] Prof. Jason Lee, University of Southern California, "On the Foundations of Deep Learning: SGD, Overparametrization, and Generalization"

Seminar Announcement

Posted by: 白師瑜 / Posted on: 2018-11-20

**Title:** On the Foundations of Deep Learning: SGD, Overparametrization, and Generalization

**Date:** 2018-12-21, 2:20pm–3:30pm

**Location:** R103, CSIE

**Speaker:** Prof. Jason Lee, University of Southern California

**Hosted by:** Prof. Chih-Jen Lin

**Abstract:**

We provide new results on the effectiveness of SGD and overparametrization in deep learning.

a) SGD: We show that SGD converges to stationary points for general nonsmooth, nonconvex functions, and that stochastic subgradients can be efficiently computed via automatic differentiation. For smooth functions, we show that gradient descent, coordinate descent, ADMM, and many other algorithms avoid saddle points and converge to local minimizers. For a large family of problems, including matrix completion and shallow ReLU networks, this guarantees that gradient descent converges to a global minimum.
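A toy sketch of the saddle-avoidance phenomenon (my own illustration, not the talk's proof technique): take f(x, y) = (x^2 - 1)^2 + y^2, which has a saddle point at the origin and minimizers at (+1, 0) and (-1, 0). Gradient descent from a small random initialization escapes the saddle:

```python
import numpy as np

# Toy nonconvex objective: f(x, y) = (x^2 - 1)^2 + y^2.
# The origin is a saddle point; the minimizers are (+1, 0) and (-1, 0).
def grad(p):
    x, y = p
    return np.array([4.0 * x * (x**2 - 1.0), 2.0 * y])

rng = np.random.default_rng(0)
p = 0.1 * rng.normal(size=2)   # small random initialization near the saddle
for _ in range(2000):
    p = p - 0.01 * grad(p)
# p ends up near (+1, 0) or (-1, 0), not at the saddle (0, 0)
```

With probability one over the random initialization, the iterates are repelled from the unstable saddle direction, matching the "avoid saddle points, converge to local minimizers" statement for this toy case.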

b) Overparametrization: For a shallow network with k hidden nodes, quadratic activations, and n training data points, we show that as long as k ≥ √(2n), overparametrization enables local search algorithms to find a *globally* optimal solution. Further, even though the number of parameters may exceed the sample size, we show that with weight decay the solution also generalizes well.

For general neural networks, we establish a margin-based theory. The minimizer of the cross-entropy loss with weak regularization is a max-margin predictor and enjoys stronger generalization guarantees as the amount of overparametrization increases.

c) Next, we analyze the implicit regularization effects of various optimization algorithms on overparametrized networks. In particular, we prove that for least squares with mirror descent, the algorithm converges to the closest solution in terms of the Bregman divergence.
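A small sketch of the simplest special case (my own toy example): when the mirror map is (1/2)‖w‖², mirror descent reduces to plain gradient descent and the Bregman divergence to squared Euclidean distance, so gradient descent on an underdetermined least-squares problem, started at the origin, converges to the minimum-norm interpolating solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 3, 8                          # underdetermined: infinitely many interpolants
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# Mirror map = (1/2)||w||^2: mirror descent becomes gradient descent, and the
# Bregman divergence becomes squared Euclidean distance to the initialization.
w = np.zeros(d)
lr = 0.01
for _ in range(50000):
    w -= lr * A.T @ (A @ w - b)      # gradient step for (1/2)||Aw - b||^2

w_star = np.linalg.pinv(A) @ b       # minimum-Euclidean-norm interpolant
```

The iterates never leave the row space of A, so the limit is the interpolant closest to the initialization; with a different mirror map (e.g. the entropic one), the analogous argument selects the solution closest in the corresponding Bregman divergence, though that case is not shown here.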

For linearly separable classification problems, we prove that steepest descent with respect to a norm solves the SVM with respect to the same norm. For overparametrized nonconvex problems such as matrix sensing or neural nets with quadratic activations, we prove that gradient descent converges to the minimum nuclear norm solution, which allows for both meaningful optimization and generalization guarantees.

**Biography:**

Jason Lee is an assistant professor in Data Sciences and Operations at the University of Southern California. Prior to that, he was a postdoctoral researcher at UC Berkeley working with Michael Jordan.

Jason received his PhD at Stanford University advised by Trevor Hastie and Jonathan Taylor. His research interests are in statistics, machine learning, and optimization. Lately, he has worked on high dimensional statistical inference, analysis of non-convex optimization algorithms, and theory for deep learning.

Last modified: 2018-11-20, 4:20 PM