Optimization Methods for Deep Learning
This is a tentative page copied from a past offering of the course. The materials are being updated.
Deep learning involves a difficult non-convex optimization
problem. The goal of this course is to study the implementation
of optimization methods for deep learning.
We will run this course in the following formats:
- lectures (by the instructor)
- project/paper presentations (by students)
For potential students: make sure that you are interested in optimization for deep learning before enrolling.
Slides and recordings
This section (and slides) will be continuously updated.
- Optimization problems for deep learning
- Stochastic gradient methods for deep learning
  - Gradient descent
  - Stochastic gradient methods
  - A note on different momentum update rules (see the sketch below)
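As a quick illustration of the note on momentum update rules, the sketch below compares the classical heavy-ball form with the form used in PyTorch's SGD on a toy quadratic. This is only a sketch with made-up names and hyperparameters; with a fixed learning rate the two rules produce the same iterates, and they start to differ once the learning rate is changed over time.

```python
# Minimal sketch: two common momentum update rules on f(w) = 0.5 * w**2,
# so grad f(w) = w. Names and hyperparameters are illustrative only.

def heavy_ball(w, v, lr=0.1, mu=0.9):
    """Classical heavy-ball: v <- mu*v - lr*grad, then w <- w + v."""
    g = w                       # gradient of 0.5 * w**2
    v = mu * v - lr * g
    return w + v, v

def pytorch_style(w, v, lr=0.1, mu=0.9):
    """PyTorch-style SGD momentum: v <- mu*v + grad, then w <- w - lr*v."""
    g = w
    v = mu * v + g
    return w - lr * v, v

w1 = w2 = 1.0
v1 = v2 = 0.0
for step in range(5):
    w1, v1 = heavy_ball(w1, v1)
    w2, v2 = pytorch_style(w2, v2)
    # With a constant learning rate the two trajectories coincide.
    print(f"step {step}: heavy-ball w = {w1:.4f}, pytorch-style w = {w2:.4f}")
```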
You must choose the project topic before the end of week 4 and let the TAs know. See
the following possible topics. If you want to do something else,
you must discuss it with the instructor first.
The final project is research oriented. We expect that you conduct a study such as one of the following:
- A study on convergence results of stochastic gradient methods
You may survey existing works and point out possible future
directions. You may see an earlier course project report at UIUC as an example.
- A study on autodiff in PyTorch
In our lecture, we only introduced the basic concept of automatic
differentiation. Can you investigate the implementation in PyTorch? (See the autograd sketch after this list.)
- A study on GPU for matrix-matrix products
In our lecture, we discussed block algorithms for reducing the
cost of memory access in matrix-matrix products, but only as a general
illustration. For GPUs, some architectural properties make them very
efficient for matrix-matrix products. Can you investigate what
these properties are and conduct experiments to confirm them? (See the blocked matrix product sketch after this list.)
- A study on Adam and AdamW
In our discussion, we saw that many issues of Adam/AdamW remain to be investigated. For example, the original implementation in BERT does not include bias correction. Can you confirm the importance of bias correction? Also, the AdamW paper decouples the weight decay step. Can you design experiments to
show that the setting in AdamW is better? Moreover, in our notes we mentioned an issue in the Hugging Face implementation of AdamW. Is this implementation really "wrong"? Can the Hugging Face implementation still work well in
practice? And what about its convergence? (See the Adam/AdamW sketch after this list.)
- Newton for KimCNN
From ?? (Wang's paper), Newton methods may not be able to handle
networks with many layers. However, some applications need only
a few layers. An example is the CNN for texts discussed in our
project ??. Can you implement a Newton method and compare it with
stochastic gradient methods? (You may need to discuss this with Yaxu.)
- BERT versus linear methods: In this talk (?? link to Bloomberg
talk), for some document sets, linear methods are highly competitive
with BERT. However, in ??, their linear results are worse than
BERT on ?? sets. Can you conduct some investigation? (?? need
to say more)
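For the autodiff topic, one possible starting point is to check PyTorch's reverse-mode autograd against a gradient derived by hand on a small least-squares expression. This is only a sketch (the data and the expression are made up for illustration); the actual implementation details to investigate live in torch.autograd.

```python
import torch

# Sketch: verify reverse-mode autodiff on f(w) = sum((x @ w - y)**2).
torch.manual_seed(0)
x = torch.randn(5, 3)
y = torch.randn(5)
w = torch.randn(3, requires_grad=True)

loss = ((x @ w - y) ** 2).sum()
loss.backward()                      # reverse-mode autodiff fills w.grad

analytic = 2 * x.t() @ (x @ w - y)   # gradient derived by hand
print(torch.allclose(w.grad, analytic, atol=1e-6))  # expected: True
```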
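For the GPU matrix-matrix product topic, the block (tiled) algorithm from the lecture can first be prototyped on the CPU with NumPy before looking into GPU-specific features such as shared memory and specialized matrix units. The block size and matrix shapes below are arbitrary; this is a sketch of the idea, not an optimized implementation.

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Compute C = A @ B by accumulating bs-by-bs blocks.

    Working on blocks that fit in fast memory (cache on CPUs, shared
    memory on GPUs) reduces the traffic to slow memory."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

A = np.random.rand(300, 200)
B = np.random.rand(200, 250)
print(np.allclose(blocked_matmul(A, B), A @ B))  # expected: True
```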
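For the Adam/AdamW topic, it may help to write out one step of the update explicitly. The NumPy sketch below (the function name, hyperparameters, and toy objective are ours, for illustration only) shows where bias correction enters and where AdamW's decoupled weight decay differs from adding an L2 term to the gradient.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0, bias_correction=True):
    """One AdamW-style step (sketch). With weight_decay=0 it is plain Adam."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    if bias_correction:
        m_hat = m / (1 - beta1 ** t)         # corrects the zero initialization
        v_hat = v / (1 - beta2 ** t)
    else:
        m_hat, v_hat = m, v                  # e.g. the original BERT code omits this
    w = w * (1 - lr * weight_decay)          # decoupled weight decay (AdamW)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# One toy step on f(w) = 0.5 * ||w||^2, so the gradient is w itself.
w = np.ones(3)
m = np.zeros(3)
v = np.zeros(3)
w, m, v = adamw_step(w, g=w.copy(), m=m, v=v, t=1, weight_decay=0.01)
print(w)
```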
The schedule is subject to change.
- Week 9: final project progress presentation
- Weeks 14-15: research paper presentation
- Week 16: final project presentation
Grading: 100% projects and presentations.
Issues related to COVID-19
- According to the school's regulations, all students must wear masks in class.
- If the COVID-19 situation becomes serious, we will move the course online.
- If you are sick, please do not come to class.
Acknowledgements: we thank many people for helping to prepare
materials for this course.