Optimization Methods for Deep Learning
This is a tentative page copied from a past offering of the course. The materials are being updated.
Deep learning involves a difficult non-convex optimization
problem. The goal of this course is to study the implementation
of optimization methods for deep learning.
We will run this course in the following formats:
- lectures (by the instructor)
- project/paper presentations (by students)
For potential students: make sure that you are interested in optimization for deep learning before enrolling.
Slides and recordings
This section (and slides) will be continuously updated.
- Optimization problems for deep learning
- Stochastic gradient methods for deep learning
  - Gradient descent
  - Stochastic gradient methods
  - A note on different momentum update rules (see the sketch below)
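As a quick illustration of the note on momentum update rules, the sketch below compares the classical heavy-ball form with the form used in PyTorch's SGD on a toy quadratic. This is only a sketch with made-up names and hyperparameters; with a fixed learning rate the two rules produce the same iterates, and they start to differ once the learning rate is changed over time.

```python
# Minimal sketch: two common momentum update rules on f(w) = 0.5 * w**2,
# so grad f(w) = w. Names and hyperparameters are illustrative only.

def heavy_ball(w, v, lr=0.1, mu=0.9):
    """Classical heavy-ball: v <- mu*v - lr*grad, then w <- w + v."""
    g = w                       # gradient of 0.5 * w**2
    v = mu * v - lr * g
    return w + v, v

def pytorch_style(w, v, lr=0.1, mu=0.9):
    """PyTorch-style SGD momentum: v <- mu*v + grad, then w <- w - lr*v."""
    g = w
    v = mu * v + g
    return w - lr * v, v

w1 = w2 = 1.0
v1 = v2 = 0.0
for step in range(5):
    w1, v1 = heavy_ball(w1, v1)
    w2, v2 = pytorch_style(w2, v2)
    # With a constant learning rate the two trajectories coincide.
    print(f"step {step}: heavy-ball w = {w1:.4f}, pytorch-style w = {w2:.4f}")
```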
You must choose the project topic before the end of week 4 and let the TAs know. See
the following possible topics. If you want to do something else,
you must discuss it with the instructor first.
The final project is research oriented. We expect that you conduct a study such as one of the following:
- A study on convergence results of stochastic gradient methods
You may survey existing works and point out possible future
directions. You may see an earlier course project report at UIUC as an example.
- A study on autodiff in PyTorch
In our lecture, we only introduced the basic concept of automatic
differentiation. Can you investigate the implementation in PyTorch? (See the autograd sketch after this list.)
- A study on GPU for matrix-matrix products
In our lecture, we discussed block algorithms for reducing the
cost of memory access in matrix-matrix products, but only as a general
illustration. For GPUs, some architectural properties make them very
efficient for matrix-matrix products. Can you investigate what
these properties are and conduct experiments to confirm them? (See the blocked matrix product sketch after this list.)
- A study on Adam and AdamW
In our discussion, we saw that many issues of Adam/AdamW remain to be investigated. For example, the original implementation in BERT does not include bias correction. Can you confirm the importance of bias correction? Also, the AdamW paper decouples the weight decay step. Can you design experiments to
show that the setting in AdamW is better? Moreover, in our notes we mentioned an issue in the Hugging Face implementation of AdamW. Is this implementation really "wrong"? Can the Hugging Face implementation still work well in
practice? And what about its convergence? (See the Adam/AdamW sketch after this list.)
- Newton for KimCNN
From ?? (Wang's paper), Newton methods may not be able to handle
networks with many layers. However, some applications need only
a few layers. An example is the CNN for texts discussed in our
project ??. Can you implement a Newton method and compare it with
stochastic gradient methods? (You may need to discuss this with Yaxu.)
- BERT versus linear methods: In this talk (?? link to Bloomberg
talk), for some document sets, linear methods are highly competitive
with BERT. However, in ??, their linear results are worse than
BERT on ?? sets. Can you conduct some investigation? (?? need
to say more)
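For the autodiff topic, one possible starting point is to check PyTorch's reverse-mode autograd against a gradient derived by hand on a small least-squares expression. This is only a sketch (the data and the expression are made up for illustration); the actual implementation details to investigate live in torch.autograd.

```python
import torch

# Sketch: verify reverse-mode autodiff on f(w) = sum((x @ w - y)**2).
torch.manual_seed(0)
x = torch.randn(5, 3)
y = torch.randn(5)
w = torch.randn(3, requires_grad=True)

loss = ((x @ w - y) ** 2).sum()
loss.backward()                      # reverse-mode autodiff fills w.grad

analytic = 2 * x.t() @ (x @ w - y)   # gradient derived by hand
print(torch.allclose(w.grad, analytic, atol=1e-6))  # expected: True
```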
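For the GPU matrix-matrix product topic, the block (tiled) algorithm from the lecture can first be prototyped on the CPU with NumPy before looking into GPU-specific features such as shared memory and specialized matrix units. The block size and matrix shapes below are arbitrary; this is a sketch of the idea, not an optimized implementation.

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Compute C = A @ B by accumulating bs-by-bs blocks.

    Working on blocks that fit in fast memory (cache on CPUs, shared
    memory on GPUs) reduces the traffic to slow memory."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

A = np.random.rand(300, 200)
B = np.random.rand(200, 250)
print(np.allclose(blocked_matmul(A, B), A @ B))  # expected: True
```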
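For the Adam/AdamW topic, it may help to write out one step of the update explicitly. The NumPy sketch below (the function name, hyperparameters, and toy objective are ours, for illustration only) shows where bias correction enters and where AdamW's decoupled weight decay differs from adding an L2 term to the gradient.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0, bias_correction=True):
    """One AdamW-style step (sketch). With weight_decay=0 it is plain Adam."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    if bias_correction:
        m_hat = m / (1 - beta1 ** t)         # corrects the zero initialization
        v_hat = v / (1 - beta2 ** t)
    else:
        m_hat, v_hat = m, v                  # e.g. the original BERT code omits this
    w = w * (1 - lr * weight_decay)          # decoupled weight decay (AdamW)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# One toy step on f(w) = 0.5 * ||w||^2, so the gradient is w itself.
w = np.ones(3)
m = np.zeros(3)
v = np.zeros(3)
w, m, v = adamw_step(w, g=w.copy(), m=m, v=v, t=1, weight_decay=0.01)
print(w)
```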
The schedule is subject to change.
- Week 9: final project progress presentation
- Weeks 14-15: research paper presentation
- Week 16: final project presentation
Grading: 100% projects and presentations.
Issues related to COVID-19
- According to the school's regulations, all students must wear masks in class.
- If the COVID-19 situation becomes serious, we will move the course online.
- If you are sick, please do not come to class.
Acknowledgements: we thank many people for helping to prepare
materials for this course.