Optimization Methods for Deep Learning

Instructor: Chih-Jen Lin, Room 413, CSIE Building.
The best way to contact me is by email.
TA: He-Zhe Lin (email: r11922027 at ntu.edu.tw) and Jie-Jyun Liu (email: d11922012 at ntu.edu.tw). TA hour: Tue. 14:00 ~ 15:00 online
Your HW/exam scores will be here
Time: Tuesday 10:20am-1pm
We will take two 10-minute breaks, at around 11:10am and 12:10pm. The class ends at 1pm.

Place: room 101, CSIE building
FAQ of this course
We will pre-record most lectures and play them in class. During class I will give additional comments while the video is playing.

Course Outline

Deep learning involves a difficult non-convex optimization problem. The goal of this course is to study the implementation of optimization methods for deep learning. We will run this course in the following formats:

lectures (by the instructor)
project/paper presentations (by students)

For prospective students: please make sure that you are interested in optimization for deep learning.

Slides and recordings

This section (and slides) will be continuously updated.

Optimization problems for deep learning
- Linear classification ( part 1: slides video )
- Fully-connected networks ( part 1: slides video )
- Convolutional networks ( part 1: slides video part 2: slides video part 3: slides video part 4: slides video )
Stochastic gradient methods for deep learning
- Gradient descent ( part 1: slides video )
- Stochastic gradient methods
  - part 1: slides video: part 1 part 2
  - part 2: slides video: part 1 part 2
- A note on different momentum update rules: slides
- Convergence of stochastic gradient methods ( part 1: slides video part 2: slides video )
Gradient calculation
- Vector form ( part 1: slides video )
- Gradient calculation ( part 1: slides video part 2: slides video part 3: slides video )
Implementation
- part 1: slides video
- We will partially cover two sets of slides from the course "numerical methods" ( part 1: slides video part 2: slides video )
- part 2: slides video 1 video 2 video 3
- part 3: slides video
Automatic differentiation
- part 1: slides video
- part 2: slides video
Newton method
- Basic ( part 1: slides video part 2: slides video )
- Algorithms ( part 1: slides video part 2: slides video part 3: slides video )
- Gauss Newton matrix-vector product (optional)
  - slides
  - Using only backward process ( videos: part1, part2, part3, part4, part5 )
  - Using forward and backward processes ( videos: part1, part2, part3 )

Projects

Project 1: Training a linear classifier on the LEDGAR set. Due on ~~23:59, March 7~~ 23:59, March 14.
Project 2: Training a CNN on the LEDGAR set and profiling it. Due on 23:59, April 4.
Project 3: Compare the running time of CNN operations using MATLAB and PyTorch. Due on 23:59, May 30.

Final Project

You must choose the project topic before the end of week 7 (April 4) and let the TAs know. See the following possible topics. If you want to do something else, you must discuss it with the teacher first. The progress presentation is on April 25 (tentative). Each team should prepare a 15-minute presentation (including Q&A) on its progress on the final project. Please upload the slides to NTU COOL before April 24, 23:59.

The final project is research oriented. We expect you to conduct an in-depth investigation. You are welcome to discuss the project with the teacher while working on it.

A study on convergence results of stochastic gradient methods
You may survey existing work and point out possible future directions. See an earlier course project report at UIUC for an example.
A study on autodiff in PyTorch
In our lecture, we only introduced the basic concept of automatic differentiation. Can you investigate the implementation in PyTorch?
A study on GPU for matrix-matrix products
In our lecture, we discussed block algorithms for reducing the cost of memory access in matrix-matrix products. It's a general illustration. For GPU, some architectural properties make it very efficient for matrix-matrix products. Can you investigate what these properties are and conduct experiments to confirm them?
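As a reminder of the block idea mentioned in this topic, here is a minimal NumPy sketch. The function name `blocked_matmul` and the default block size are hypothetical choices for illustration, not taken from the lecture slides. It only shows the loop structure: each small tile is reused while it is still in fast memory. It is not meant to be fast (NumPy's `@` already calls an optimized library), and it says nothing about GPU-specific properties, which are the actual subject of the project.

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Compute A @ B tile by tile (illustrative sketch, not optimized)."""
    n, k = A.shape
    k2, m = B.shape
    if k != k2:
        raise ValueError("inner dimensions of A and B must match")
    C = np.zeros((n, m), dtype=np.result_type(A, B))
    for i in range(0, n, block):          # rows of the C tile
        for j in range(0, m, block):      # columns of the C tile
            for p in range(0, k, block):  # walk along the inner dimension
                # Slices that run past the edge are truncated by NumPy,
                # so sizes need not be multiples of the block size.
                C[i:i + block, j:j + block] += (
                    A[i:i + block, p:p + block] @ B[p:p + block, j:j + block]
                )
    return C
```

A quick check is to compare the result with `A @ B` using `np.allclose` on matrices whose sizes are not multiples of the block size.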
A study on Adam and AdamW
In our discussion we saw that many issues of Adam/AdamW remain to be investigated. For example, the original implementation in BERT does not include bias correction. Can you confirm the importance of bias correction? Also, the AdamW paper decouples the weight decay step from the gradient-based update. Can you design experiments to show that the setting in AdamW is better?
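To make the two questions in this topic concrete, here is a minimal sketch of one update on a single scalar weight. The function `adam_step` and its flags `bias_correction` and `decoupled` are hypothetical names for illustration, and the update follows the commonly stated form of Adam and AdamW rather than any particular library's code. It is a toy for reasoning about the two settings, not an experiment design.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, bias_correction=True, decoupled=True):
    """One Adam-style update on a scalar weight w (illustrative sketch)."""
    if not decoupled:
        # L2-regularization style: the decay term is added to the gradient,
        # so it is later rescaled by the adaptive denominator.
        g = g + weight_decay * w
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    if bias_correction:
        m_hat = m / (1 - beta1 ** t)       # t is the step count, from 1
        v_hat = v / (1 - beta2 ** t)
    else:
        m_hat, v_hat = m, v
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW style: the decay is applied directly to the weight.
        w_new = w_new - lr * weight_decay * w
    return w_new, m, v
```

With `w=1`, `g=1`, and zero initial moments, the first step moves the weight by about `lr` when bias correction is on and by about 3.16 times `lr` when it is off, because the uncorrected ratio is 0.1 / sqrt(0.001). With `g=0` and `weight_decay=0.1`, the decoupled variant shrinks the weight by `lr * 0.1`, while the coupled variant takes a step of about `lr`, because the adaptive scaling normalizes the decay term.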
BERT versus linear methods:
In this talk, for some document sets, linear methods are highly competitive with BERT. However, in Christian et al. (2021), the linear results are worse than those of BERT on the IMDb Reviews set. Can you investigate the task-specific behavior of BERT, and in what situations linear methods are more robust than large pretrained language models for text classification?
Newton for KimCNN
According to Wang et al. (2020), Newton methods may not be able to handle networks with many layers. However, some applications need only a few layers. An example is the CNN for texts discussed in our project 2. Can you implement a Newton method and compare it with stochastic gradient methods? For this project you must know how to do the Jacobian-vector and transposed Jacobian-vector products in PyTorch.
Text classification for Wilson et al. (2017)
Though Adam is widely used in deep learning, Wilson et al. (2017) showed that adaptive methods (i.e., AdaGrad, RMSProp, and Adam) are worse than non-adaptive methods (i.e., SGD) in some image classification tasks. Does the same thing happen in text classification?

Course Schedule

The schedule is subject to change.

Week 4 (March 14): project 1 due
Week 7 (April 4): project 2 due
Week 10 (April 25): final project progress presentation - I
Week 13 (May 16): final project progress presentation - II
Week 15 (May 30): project 3 due
Week 16 (June 6): final project presentation

Exams

No exam

Grading

100% Projects and presentations.

Issues related to COVID-19

According to the school's regulations, all students must wear masks.
If the COVID-19 situation becomes serious, we will move the course online.
If you are sick, please do not come to the class.

Acknowledgements: we thank many people for helping to prepare materials for this course.
