A study on Automatic Differentiation
In our lecture, we have covered the basic concept of automatic differentiation.
You might be curious how a real-world library computes derivatives of various operators during neural network training.
Please complete the following steps:
- We have implemented the forward mode of automatic differentiation in a previous course project,
simpleautodiff.
Please implement the reverse mode in simpleautodiff.
- Expand the implementation document to give a similar illustration for the reverse mode (ask the TA for the LaTeX source).
- Study the implementation of Autograd, a numpy-based automatic differentiation library.
Run and trace the code starting from autograd/core.py::make_vjp(fun, x).
Describe in your own words how reverse mode of automatic differentiation is implemented in Autograd.
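Before tracing Autograd, it may help to see the whole idea in a few lines. The following is a minimal, self-contained sketch of graph-based reverse mode in Python; the names Node, topo_order, and backward are illustrative and do not correspond to simpleautodiff or Autograd.

    # Minimal reverse-mode sketch (illustrative only; not simpleautodiff or Autograd).
    import math

    class Node:
        def __init__(self, value, parents=(), local_grads=()):
            self.value = value              # primal value from the forward pass
            self.parents = parents          # nodes this node depends on
            self.local_grads = local_grads  # d(self)/d(parent) for each parent
            self.grad = 0.0                 # accumulated adjoint

    def add(a, b):
        return Node(a.value + b.value, (a, b), (1.0, 1.0))

    def mul(a, b):
        return Node(a.value * b.value, (a, b), (b.value, a.value))

    def log(a):
        return Node(math.log(a.value), (a,), (1.0 / a.value,))

    def topo_order(output):
        # Depth-first post-order gives a topological order of the graph.
        order, visited = [], set()
        def visit(node):
            if id(node) not in visited:
                visited.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)
        visit(output)
        return order

    def backward(output):
        # Reverse sweep: propagate adjoints in reverse topological order, so each
        # node's adjoint is complete before it is passed on to its parents.
        output.grad = 1.0
        for node in reversed(topo_order(output)):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local   # chain rule accumulation

    # Example: f(x1, x2) = log(x1 * x2) + x2
    x1, x2 = Node(2.0), Node(5.0)
    f = add(log(mul(x1, x2)), x2)
    backward(f)
    print(x1.grad, x2.grad)   # df/dx1 = 1/x1 = 0.5, df/dx2 = 1/x2 + 1 = 1.2

Roughly speaking, Autograd's make_vjp follows the same pattern at a larger scale: the forward call records the computation graph while evaluating fun(x), and the returned function replays that graph backward, accumulating vector-Jacobian products for each recorded operation.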
A study on GPU for matrix-matrix products (Last update on Oct. 15)
In our lecture, we discussed why GPUs are very effective for massively parallel operations.
We also discussed block algorithms on CPUs for reducing the cost of memory access in matrix-matrix products.
In this project, we study techniques for accelerating matrix products on GPUs in more depth.
- First, please implement three CUDA kernels for the matrix multiplication A = BC, following the instructions in the slides.
- Kernel 1: Explore how the mapping of matrix elements to threads affects performance.
- Kernel 2: Explore how tiled matrix multiplication speeds up matrix products. (Note: while you can find many tutorials on "tiled matrix multiplication", this project requires you to do more than just copy-and-paste. Imagine you are going to teach your classmates in the progress presentation.)
- Kernel 3: Investigate additional strategies for acceleration.
- Compare the running time of your implementation with one (or more) existing GPU BLAS libraries (e.g., cuBLAS), or even a library that utilizes the GPU's tensor cores.
Are the results for your implementation and the optimized library similar? If not, can you investigate what other optimizations they may have done?
If possible, provide a simple implementation to illustrate the idea. (A minimal timing sketch is given after this list.)
- Explore how GPU BLAS libraries are used by deep learning tools.
For example, does PyTorch utilize cuBLAS for matrix multiplication?
Try to trace the code to gain some understanding. (You may need to build PyTorch from source to get more information from the code.)
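For the comparison with an existing GPU BLAS, the rough timing sketch below may serve as a starting point (Python; it assumes CuPy is installed with a CUDA GPU and uses CuPy's dense matmul, which is backed by cuBLAS; the matrix size is an arbitrary choice). Your own kernels can be timed the same way with CUDA events.

    # Rough timing sketch for a GPU BLAS baseline (cuBLAS via CuPy).
    import cupy as cp

    n = 4096
    B = cp.random.rand(n, n, dtype=cp.float32)
    C = cp.random.rand(n, n, dtype=cp.float32)

    # Warm up so one-time initialization is not counted in the measurement.
    A = B @ C
    cp.cuda.Stream.null.synchronize()

    start, stop = cp.cuda.Event(), cp.cuda.Event()
    start.record()
    A = B @ C                                    # dense matmul, backed by cuBLAS
    stop.record()
    stop.synchronize()
    ms = cp.cuda.get_elapsed_time(start, stop)   # elapsed time in milliseconds
    print(f"{ms:.2f} ms, {2 * n**3 / (ms * 1e-3) / 1e12:.2f} TFLOP/s")

Counting 2n^3 floating-point operations gives a rough throughput number to compare against your three kernels and against the GPU's peak.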
Understanding and Improving GPT-2 Implementations in NanoGPT
In this project, you will explore the implementation of GPT-2 through the lightweight framework NanoGPT, with the goal of connecting mathematical definitions to practical code and improving the implementation.
- Your first task is to connect the mathematical definition of multi-head attention (see Section Transformer blocks and other details in the slides) with its implementation in NanoGPT (see model.py in the NanoGPT repository).
- Locating the computation of Query, Key, and Value: Find where queries, keys, and values are generated in the NanoGPT code, and compare this process with how they are introduced in the slides. What differences do you notice? What efficiency or memory benefits do these differences bring?
- Head Splitting and Tensor Shapes: Identify where the code separates the attention heads and reshapes tensors. How does the reshaping in the code relate to the mathematical definition in the slides? What reasons might explain the reshaping? (A short reshape example is given after this project description.)
- Your second task is to implement and evaluate KV caching in NanoGPT, with a focus on inference efficiency.
- KV Caching: We introduce KV caching in the slides (see Section Prediction in the slides). Implement KV caching for the autoregressive prediction in NanoGPT (see sample.py in the NanoGPT repository) and compare inference time with the original NanoGPT. For reference, you can check the implementation of KV caching in the official GPT-2 repository.
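For the head-splitting question, the relation between the code's reshaping and the per-head formulas in the slides can be seen on a toy example (illustrative shapes only; NanoGPT additionally carries a batch dimension):

    # Illustrative head splitting (not NanoGPT code): one packed (T, d_model) tensor
    # is viewed as n_head matrices of shape (T, d_head), with d_model = n_head * d_head.
    import torch
    T, n_head, d_head = 10, 4, 16
    q = torch.randn(T, n_head * d_head)                  # all heads in one matrix
    q_heads = q.view(T, n_head, d_head).transpose(0, 1)  # (n_head, T, d_head): one slice per head

For the KV-caching task, the sketch below shows the core idea on a toy single-head attention layer (illustrative PyTorch; it is not NanoGPT's CausalSelfAttention, and it ignores batching, multiple heads, and the mask needed when several prompt tokens are fed at once):

    # Toy single-head attention step with a KV cache (illustrative; not NanoGPT code).
    import torch

    class CachedSelfAttention(torch.nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.qkv = torch.nn.Linear(d_model, 3 * d_model)
            self.k_cache = torch.empty(0, d_model)   # keys of all tokens seen so far
            self.v_cache = torch.empty(0, d_model)   # values of all tokens seen so far

        def forward(self, x_new):
            # x_new: (t_new, d_model), only the tokens not processed before.
            q, k, v = self.qkv(x_new).chunk(3, dim=-1)
            # Append new keys/values instead of recomputing them for the whole prefix.
            self.k_cache = torch.cat([self.k_cache, k], dim=0)
            self.v_cache = torch.cat([self.v_cache, v], dim=0)
            att = (q @ self.k_cache.T) / (q.shape[-1] ** 0.5)
            # With one token per step, no causal mask is needed: the cache holds
            # only past and current positions.
            att = torch.softmax(att, dim=-1)
            return att @ self.v_cache

    # During generation, feed only the newest token at each step.
    attn = CachedSelfAttention(d_model=8)
    with torch.no_grad():
        for step in range(5):
            x_new = torch.randn(1, 8)   # embedding of one new token
            y = attn(x_new)             # attends over all cached positions
            print(step, y.shape)        # torch.Size([1, 8])

When comparing inference time with the original sample.py, make sure both versions generate the same number of tokens and that the timing excludes model loading.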
Understand FlashAttention and Implement Its Forward Algorithms
In this project, you will study the paper
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".
Your goal is to understand the key contributions of this work by implementing the algorithms it presents.
- Task 1: Understand and Compare
- Please read our slides and explain why FlashAttention-2, as presented there, achieves higher efficiency than NaiveAttention. Focus on memory access and computational complexity, and submit a written explanation.
- Task 2: Implement and Benchmark with Different Sequence Lengths (T)
- Implement the following two algorithms, NaiveAttention and FlashAttention-2, in C++ (for intuition, a short NumPy sketch of the tiled forward pass is given after this project description):
- Find your computer's actual cache size (M) and configure FlashAttention-2 with the M you found.
On Linux systems, you can find the cache size with the following shell command:
lscpu | grep "cache"
Run FlashAttention-2 and verify its correctness by comparing its output with NaiveAttention's using the allclose function (provided by the TA).
- Compare the runtime of FlashAttention-2 against NaiveAttention under different sequence lengths (T).
Check what happens when the T-by-d matrices Q, K, and V (see their definitions in our slides) cannot fit into your computer's cache, and analyze why.
- Utility functions provided by the TA: We provide the utility functions row_max, row_sum, random_matrix, and allclose for your convenience. Their implementations are in this sample code.
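The deliverables above are in C++, but the tiled, online-softmax idea behind the FlashAttention forward pass is easy to prototype first. Below is a short NumPy sketch that tiles only over K and V (the full algorithm also tiles Q and chooses the block size from the cache size M); the allclose used here is NumPy's, not the TA-provided one.

    # NumPy sketch of naive attention vs. a tiled, online-softmax forward pass.
    import numpy as np

    def naive_attention(Q, K, V):
        # Standard attention: softmax(Q K^T / sqrt(d)) V, storing the full T x T matrix.
        d = Q.shape[1]
        S = Q @ K.T / np.sqrt(d)
        P = np.exp(S - S.max(axis=1, keepdims=True))
        return (P / P.sum(axis=1, keepdims=True)) @ V

    def tiled_attention(Q, K, V, block=64):
        # Process K and V in blocks, keeping a running row max (m), normalizer (l),
        # and unnormalized output (O); the T x T score matrix is never materialized.
        T, d = Q.shape
        O = np.zeros((T, d))
        m = np.full((T, 1), -np.inf)
        l = np.zeros((T, 1))
        for start in range(0, T, block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            S = Q @ Kb.T / np.sqrt(d)                     # scores for this block only
            m_new = np.maximum(m, S.max(axis=1, keepdims=True))
            P = np.exp(S - m_new)
            scale = np.exp(m - m_new)                     # rescale earlier partial results
            l = scale * l + P.sum(axis=1, keepdims=True)
            O = scale * O + P @ Vb
            m = m_new
        return O / l                                      # single normalization at the end

    rng = np.random.default_rng(0)
    T, d = 256, 32
    Q, K, V = rng.standard_normal((3, T, d))
    print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))

The point to observe is that each block updates m, l, and O using only O(T x block) extra memory, and the division by l happens once at the end, as in FlashAttention-2.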
Understand FlashAttention and Derive Corresponding Backward Algorithms
In this project, you will study the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". Your goal is to understand the key contributions of this work by deriving the corresponding backward algorithms.
- Task 1: The same as Task 1 in "Understand FlashAttention and Implement Its Forward Algorithms".
- Task 2: Derive Backward Algorithms
- Derive the backward algorithms corresponding to FlashAttention-2 in our slides.
In addition, compute the memory access and computational complexity of the two backward algorithms, and explain why the backward algorithm of FlashAttention is more efficient.
Add this explanation to your report from the first task.
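As a starting point for the derivation, recall the backward pass of standard attention, written with O = P V, P = softmax(S), and S = Q K^T / sqrt(d) as in our slides, where dX denotes the gradient of the loss with respect to X:

    dV = P^T dO
    dP = dO V^T
    dS_ij = P_ij (dP_ij - sum_k dP_ik P_ik)     (row-wise softmax Jacobian)
    dQ = dS K / sqrt(d)
    dK = dS^T Q / sqrt(d)

The FlashAttention backward algorithms evaluate these block by block, recomputing P from Q and K instead of storing it. A useful identity for the derivation is sum_k dP_ik P_ik = sum_j dO_ij O_ij, so this row-wise term can be computed from O and dO alone.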
Redo experiments from Liu et al. (2023) that exploit sparsity to accelerate LLM inference
Liu et al. (2023) introduce the hypothesis of contextual sparsity, where only a small subset of weights is needed for computation.
In this way, they leverage sparsity to accelerate inference without sacrificing model performance.
To examine this idea, please investigate their implementation and complete the following two tasks:
- First, we begin with a simple multi-class classification task.
Please use our provided code to verify the hypothesis of contextual sparsity in this setting.
Specifically, you need to perform two forward passes: the first using all weights, and the second using only the top-k weights with the largest outputs.
Then, evaluate how the proportion of these k weights affects accuracy. (A small sketch of the two passes is given after this project description.)
- Next, please extend this verification process to LLMs.
For example, you can use GPT-2 (or other models your device can handle) and examine contextual sparsity in the MLP or attention layers.
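For the first task, the two forward passes might look roughly like the sketch below (illustrative PyTorch for a one-hidden-layer classifier on random data; the provided course code and the exact top-k criterion in Liu et al. (2023) may differ in details):

    # Sketch of the two forward passes: full vs. top-k hidden units (illustrative).
    import torch

    def full_forward(x, W1, b1, W2, b2):
        h = torch.relu(x @ W1 + b1)          # all hidden units participate
        return h @ W2 + b2

    def topk_forward(x, W1, b1, W2, b2, k):
        h = torch.relu(x @ W1 + b1)
        # Keep only the k hidden units with the largest outputs per example;
        # the others are zeroed, so their rows of W2 are effectively skipped.
        idx = h.topk(k, dim=-1).indices
        mask = torch.zeros_like(h).scatter_(-1, idx, 1.0)
        return (h * mask) @ W2 + b2

    torch.manual_seed(0)
    n, d_in, d_hidden, n_cls = 512, 20, 256, 5
    x = torch.randn(n, d_in)
    W1, b1 = torch.randn(d_in, d_hidden), torch.zeros(d_hidden)
    W2, b2 = torch.randn(d_hidden, n_cls), torch.zeros(n_cls)
    y_full = full_forward(x, W1, b1, W2, b2).argmax(-1)
    for k in (16, 64, 128, 256):
        y_k = topk_forward(x, W1, b1, W2, b2, k).argmax(-1)
        print(k, (y_k == y_full).float().mean().item())   # agreement with the full pass

In the project you would replace the random data and weights with the provided classifier and dataset, and report accuracy against the labels for several ratios k / hidden_dim rather than agreement with the full pass.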
A study on GPU for sparse matrix multiplication
In the deep learning era, model predictions rely heavily on matrix operations.
However, modern models are usually large, which makes these operations expensive.
To reduce this computational cost, many studies explore sparsity as a way to skip unnecessary computations and speed up the inference process.
When matrices become sparse, how to perform sparse matrix multiplication efficiently is a key aspect of improving computational performance.
Therefore, in this project, you will benchmark different libraries for sparse matrix multiplication.
- cuSPARSE is a GPU library that provides optimized APIs for efficient sparse matrix operations.
In Python, CuPy offers access to cuSPARSE, making it possible to run sparse matrix multiplications on GPUs.
On the other hand, SciPy provides sparse matrix operations on CPUs.
Can you use CuPy to perform sparse matrix multiplication and compare its runtime with SciPy's CPU-based implementation to evaluate the benefit of GPU acceleration? (A minimal benchmarking sketch is given after this project description.)
Please make sure to include all necessary setup costs in the timing when benchmarking, such as format conversions or initialization, to ensure a fair comparison between CPU and GPU.
In addition, run the experiments in a clean environment (i.e., no other processes occupying the same machine).
You may also use a profiler to inspect the running time of each statement in detail.
- Sparse matrix multiplication mainly involves two common operations:
- Sparse matrix x Dense vector
- Sparse matrix x Sparse matrix
In this project, you are expected to benchmark the performance of both settings separately, and analyze how the results vary with different matrix densities.
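A minimal benchmarking sketch for the sparse matrix x dense vector case is given below (Python; it assumes CuPy with cuSPARSE support and a CUDA GPU, and the size, density, and CSR format are illustrative choices). The sparse matrix x sparse matrix case can be timed the same way by replacing the dense vector with a second sparse matrix.

    # Sketch: sparse matrix x dense vector on CPU (SciPy) vs. GPU (CuPy / cuSPARSE).
    import time
    import numpy as np
    import scipy.sparse as sp
    import cupy as cp
    import cupyx.scipy.sparse as cusp

    n, density = 100_000, 1e-3
    A_cpu = sp.random(n, n, density=density, format="csr", dtype=np.float32)
    x_cpu = np.random.rand(n).astype(np.float32)

    # CPU timing (SciPy).
    t0 = time.perf_counter()
    y_cpu = A_cpu @ x_cpu
    t_cpu = time.perf_counter() - t0

    # GPU timing (CuPy). Setup costs such as the host-to-device copy are included,
    # and the device is synchronized before reading the clock.
    t0 = time.perf_counter()
    A_gpu = cusp.csr_matrix(A_cpu)        # copies the CSR matrix to the GPU
    x_gpu = cp.asarray(x_cpu)
    y_gpu = A_gpu @ x_gpu
    cp.cuda.Stream.null.synchronize()
    t_gpu = time.perf_counter() - t0

    print(f"SciPy (CPU): {t_cpu:.4f} s   CuPy (GPU, incl. transfer): {t_gpu:.4f} s")
    print("results match:", np.allclose(y_cpu, cp.asnumpy(y_gpu), atol=1e-3))

Repeating the measurement for several densities, and separating the transfer/conversion time from the multiplication itself with a profiler, should make the CPU/GPU trade-off visible.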