A study on Automatic Differentiation
In our lecture, we have covered the basic concept of automatic differentiation.
You might be curious how a real-world library computes derivatives of various operators during neural network training.
Please complete the following steps:
- We have implemented the forward mode of automatic differentiation in a previous course project, simpleautodiff.
Please implement the reverse mode in simpleautodiff (a minimal tape-based sketch is given at the end of this section for reference).
- Expand the implementation document to give a similar illustration for the reverse mode (ask the TA for the LaTeX source).
- Study the implementation of Autograd, a NumPy-based automatic differentiation library.
Run and trace the code starting from autograd/core.py::make_vjp(fun, x).
Describe in your own words how the reverse mode of automatic differentiation is implemented in Autograd.
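For reference, the following is a minimal tape-based sketch of reverse mode for scalar operations. It is not the simpleautodiff or Autograd API; all class and function names are our own, and the code only illustrates the two ingredients you will need: recording local derivatives during the forward pass and accumulating adjoints in reverse topological order.

import math

class Node:
    """A scalar value that records how it was produced."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value                # forward value
        self.parents = parents            # Nodes this value depends on
        self.local_grads = local_grads    # d(self)/d(parent) for each parent
        self.grad = 0.0                   # adjoint, filled in by the reverse sweep

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def log(a):
    return Node(math.log(a.value), (a,), (1.0 / a.value,))

def backward(output):
    """Reverse sweep: propagate adjoints from the output back to all inputs."""
    order, visited = [], set()
    def visit(node):                      # topological order by depth-first search
        if id(node) not in visited:
            visited.add(id(node))
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += local * node.grad   # chain rule, accumulated over all uses

# Example: f(x1, x2) = log(x1) + x1 * x2
x1, x2 = Node(2.0), Node(5.0)
y = add(log(x1), mul(x1, x2))
backward(y)
print(y.value, x1.grad, x2.grad)   # 10.693..., 1/2 + 5 = 5.5, 2.0

Autograd follows the same idea, but its primitives record vector-Jacobian products (VJPs) for array-valued operations: roughly speaking, make_vjp(fun, x) traces fun at x to build such a graph and returns a function that replays it backward.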
A study on GPU for matrix-matrix products
In our lecture, we discussed why GPUs are very effective for massively parallel operations.
We also discussed block algorithms on CPUs for reducing the cost of memory access in matrix-matrix products.
In this project, we want to study techniques for accelerating matrix products on GPUs in more depth.
-
The first task is to implement a memory-efficient CUDA kernel for matrix multiplication C = AB by using __shared__ variables.
The __shared__ memory is shared by all threads within a single thread block.
Similar to the "naive kernel" described in the course slides, each thread also computes a single entry C[i][j].
However, instead of each thread independently fetching A[i][k] and B[k][j] from global memory, you'll use __shared__ memory as a staging area to load the needed entries so that all threads in the block can access them quickly.
-
Under the assumption that accessing shared memory is faster than accessing global memory, can you mathematically show why this implementation is more efficient than the original one in the course slides? (A sketch of the counting argument is given at the end of this section.)
-
Compared with the naive implementation shown in the course slides, do the empirical results support your theoretical analysis?
Possible questions:
-
How does the block_size affect the performance? Do you observe a sweet spot?
-
What is the difference between the above implementation (called tiled matrix multiplication) and the block algorithm discussed in class? Do you think it's possible to implement the block algorithm to further accelerate the runtime? Why or why not?
Note: While you can find many tutorials on "tiled matrix multiplication", this project requires you to do more than just copy and paste. Try to imagine that you are going to teach your classmates in the progress presentation.
-
Compare the running time of your implementation with one (or more) existing GPU BLAS libraries (e.g., cuBLAS).
Are the results similar? If not, can you investigate what other optimizations they may have done?
If possible, provide a simple implementation to illustrate the idea.
-
How are GPU BLAS libraries used by deep learning tools?
For example, does PyTorch utilize cuBLAS for matrix multiplication?
Try to trace the code to gain some understanding (you may need to build PyTorch from source to get more information from the code); a small profiling experiment, sketched at the end of this section, can serve as a starting point.
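As a hint for the counting argument mentioned above, here is a sketch under simplifying assumptions (square n-by-n matrices, b-by-b thread blocks and tiles with b dividing n, and counting only global-memory reads of matrix entries):

\[
\underbrace{n^2}_{\#\text{threads}} \cdot \underbrace{2n}_{\text{reads per thread}} \;=\; 2n^3
\quad\text{(naive kernel)}, \qquad
\underbrace{\frac{n^2}{b^2}}_{\#\text{blocks}} \cdot \underbrace{\frac{n}{b}}_{\text{tile steps}} \cdot \underbrace{2b^2}_{\text{reads per step}} \;=\; \frac{2n^3}{b}
\quad\text{(tiled kernel)}.
\]

Both kernels perform the same 2n^3 floating-point operations, so the tiled version replaces roughly a factor of b of slow global-memory traffic with fast shared-memory accesses.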
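For the last question, a quick experiment with PyTorch's built-in profiler can hint at whether a matrix multiplication is dispatched to cuBLAS before you dive into the C++ sources. This is only a sketch: it assumes a CUDA-enabled PyTorch build, and the kernel names you will see depend on your PyTorch/CUDA versions (names containing "gemm", "cublas", or "cutlass" are the usual hint).

import torch
from torch.profiler import profile, ProfilerActivity

# Profile a single large GEMM on the GPU and inspect which kernels were launched.
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    c = a @ b
    torch.cuda.synchronize()

# The kernel names in the table suggest which library implemented the GEMM.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Tracing where these kernels are launched in the PyTorch sources is still up to you; the profiler only tells you what ended up running on the device.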
Understanding and Improving GPT-2 Implementations in NanoGPT
In this project, you will explore the implementation of GPT-2 through the lightweight framework NanoGPT, with the goal of connecting mathematical definitions to practical code and improving the implementation.
- Your first task is to connect the mathematical definition of multi-head attention (see Section Transformer blocks and other details in the slides) with its implementation in NanoGPT (see model.py in the NanoGPT repository).
-
Locating the computation of Query, Key, and Value: Find where queries, keys, and values are generated in the NanoGPT code, and compare this process with how they are introduced in the slides. What differences do you notice? What efficiency or memory benefits do these differences bring? (A small sketch of a fused QKV projection and head splitting is given at the end of this section for comparison.)
-
Head Splitting and Tensor Shapes: Identify where the code separates the attention heads and reshapes tensors. How does the reshaping in code relate to the mathematical definition in the slides? What reasons might explain the reshaping?
-
Your second task is to implement and evaluate KV caching in NanoGPT, with a focus on inference efficiency.
-
KV Caching: We introduce KV caching in the slides (see Section Prediction).
Implement KV caching for the autoregressive prediction in NanoGPT (see sample.py in the NanoGPT repository) and compare inference time with the original NanoGPT; a minimal caching sketch is given at the end of this section.
For reference, you can check the implementation of KV caching in the official GPT-2 repository.
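As a reading aid for the first task, here is a minimal, self-contained sketch of a fused QKV projection followed by head splitting. It is not NanoGPT's actual code; the tensor names and sizes are our own choices, but comparing it against model.py should make the corresponding lines easier to spot.

import torch
import torch.nn as nn

B, T, n_embd, n_head = 2, 8, 64, 4        # batch, sequence length, model width, heads
head_dim = n_embd // n_head

# One fused linear layer produces Q, K, and V together (one GEMM instead of three).
c_attn = nn.Linear(n_embd, 3 * n_embd)

x = torch.randn(B, T, n_embd)
q, k, v = c_attn(x).split(n_embd, dim=2)              # each of shape (B, T, n_embd)

# Reshape so that each head becomes its own batch dimension: (B, n_head, T, head_dim).
q = q.view(B, T, n_head, head_dim).transpose(1, 2)
k = k.view(B, T, n_head, head_dim).transpose(1, 2)
v = v.view(B, T, n_head, head_dim).transpose(1, 2)

# Scaled dot-product attention then becomes one batched matmul over B * n_head heads.
att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5     # (B, n_head, T, T)
att = att.softmax(dim=-1)
y = (att @ v).transpose(1, 2).contiguous().view(B, T, n_embd)   # merge heads back
print(y.shape)                                        # torch.Size([2, 8, 64])

(The sketch omits the causal mask, dropout, and the output projection.)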
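For the second task, the following hypothetical sketch (again not sample.py) isolates the idea of KV caching for a single attention head: at each decoding step, only the newest token's query, key, and value are computed, while the keys and values of previous steps are reused from a cache.

import torch

torch.manual_seed(0)
n_embd, head_dim = 64, 64                  # one head, for simplicity
Wq = torch.randn(n_embd, head_dim)
Wk = torch.randn(n_embd, head_dim)
Wv = torch.randn(n_embd, head_dim)

k_cache, v_cache = [], []                  # grows by one row per generated token

def decode_step(x_t):
    """x_t: embedding of the newest token, shape (1, n_embd)."""
    q = x_t @ Wq                           # only the new query is needed
    k_cache.append(x_t @ Wk)               # cache the new key ...
    v_cache.append(x_t @ Wv)               # ... and the new value
    K = torch.cat(k_cache, dim=0)          # (t, head_dim), reused across steps
    V = torch.cat(v_cache, dim=0)
    att = (q @ K.T / head_dim ** 0.5).softmax(dim=-1)
    return att @ V                         # attention output for the new token only

# Without the cache, step t recomputes keys and values for all t previous tokens,
# so the projection cost over T steps grows as O(T^2) instead of O(T).
for t in range(5):
    out = decode_step(torch.randn(1, n_embd))
print(out.shape)                           # torch.Size([1, 64])

In NanoGPT you will instead keep the cache as tensors inside the attention module and pass it through the autoregressive generation loop; measuring the speedup as the generated sequence grows is the point of the comparison.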
Understand FlashAttention (I)
In this project, you will study the paper
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".
Your goal is to understand the key contributions of this work by implementing the algorithms it presents.
-
Task 1: Understand and Compare
-
Algorithms 0 and 1 in the paper correspond to the standard and optimized implementations of the forward pass of Attention.
Please read our slides on these two algorithms and explain why FlashAttention (Algorithm 1) achieves higher efficiency, focusing on memory access and computation complexity.
Submit a written explanation.
-
Task 2: Implement and Benchmark
-
Implement Algorithms 0 and 1 in C++.
Compare their execution time and memory usage under identical conditions.
Use your experimental results to validate your explanation from Task 1. (A NumPy reference sketch that you can check your C++ implementations against is given at the end of this section.)
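Since debugging numerics in C++ can be tedious, a small reference to check against may help. Below is a NumPy sketch (our own block sizes and variable names, not the paper's notation, and without masking or dropout) of the standard forward pass and of a tiled forward pass with online softmax in the spirit of Algorithm 1; the two should agree up to floating-point error, and your C++ implementations can be validated on the same inputs.

import numpy as np

def attention_standard(Q, K, V):
    """Algorithm-0-style forward: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V

def attention_tiled(Q, K, V, Br=16, Bc=16):
    """Tiled forward with online softmax: only Br x Bc score blocks are formed."""
    N, d = Q.shape
    O = np.zeros((N, d))
    for i in range(0, N, Br):
        Qi = Q[i:i + Br]
        m = np.full(Qi.shape[0], -np.inf)      # running row-wise maxima
        l = np.zeros(Qi.shape[0])              # running softmax denominators
        Oi = np.zeros((Qi.shape[0], d))        # running (unnormalized) outputs
        for j in range(0, N, Bc):
            Sij = Qi @ K[j:j + Bc].T / np.sqrt(d)
            m_new = np.maximum(m, Sij.max(axis=1))
            P = np.exp(Sij - m_new[:, None])
            scale = np.exp(m - m_new)          # rescale previously accumulated blocks
            l = scale * l + P.sum(axis=1)
            Oi = scale[:, None] * Oi + P @ V[j:j + Bc]
            m = m_new
        O[i:i + Br] = Oi / l[:, None]
    return O

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
print(np.allclose(attention_standard(Q, K, V), attention_tiled(Q, K, V)))   # True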
Understand FlashAttention (II)
In this project, you will study the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". Your goal is to understand the key contributions of this work by deriving the corresponding backward algorithms.
-
Review the first task in Understand FlashAttention (I).
-
Task: Derive Backward Algorithms
-
Derive the backward algorithms corresponding to Algorithm 0 and Algorithm 1 from the FlashAttention paper.
Please follow our slides for Algorithms 0 and 1 as a guide; the gradient formulas for the standard forward pass are sketched at the end of this section as a starting point.
In addition, compute the memory access and computation complexity of the two backward algorithms, and explain why the backward algorithm of FlashAttention is more efficient.
Add this explanation to your report from the first task.
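As a starting point for the derivation mentioned above, and in our own notation for the standard forward pass (S = QK^{\top}/\sqrt{d}, P = \mathrm{softmax}(S) applied row-wise, O = PV, without masking or dropout), the gradients given dO = \partial L / \partial O are

\begin{align*}
dV &= P^{\top}\, dO, \\
dP &= dO\, V^{\top}, \\
dS_{ij} &= P_{ij}\Big(dP_{ij} - \sum_{k} dP_{ik}\,P_{ik}\Big) \quad \text{(row-wise softmax Jacobian)}, \\
dQ &= \tfrac{1}{\sqrt{d}}\, dS\, K, \qquad dK = \tfrac{1}{\sqrt{d}}\, dS^{\top} Q.
\end{align*}

Roughly speaking, the FlashAttention backward evaluates these block by block, recomputing the needed blocks of P from Q, K, and the saved softmax statistics rather than storing the N x N matrices; your job is to make this precise and to count its memory accesses.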
Redo experiments from Liu et al. (2023) that exploit sparsity to accelerate LLM inference
Liu et al. (2023) introduce the hypothesis of contextual sparsity, where only a small subset of weights is needed for computation.
In this way, they leverage sparsity to accelerate inference without sacrificing model performance.
To examine this idea, please investigate their implementation and complete the following two tasks:
-
First, we begin with a simple multi-class classification task.
Please use our provided code to verify the hypothesis of contextual sparsity in this setting.
Specifically, you need to perform two forward passes: the first using all weights, and the second using only the top-k weights with the largest outputs (a minimal sketch of this two-pass comparison is given at the end of this section).
Then, evaluate how the proportion of these k weights affects accuracy.
-
Next, please extend this verification process to LLMs.
For example, you can use GPT-2 (or other models your device can handle) and examine contextual sparsity in the MLP or attention layers.
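To make the two-pass comparison concrete, here is a minimal, hypothetical sketch for a single MLP layer (it is neither the provided course code nor Liu et al.'s implementation): the second pass keeps only the top-k hidden units with the largest activations and zeroes out the rest. With a trained model, the hypothesis predicts that the two passes mostly agree.

import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, n_classes = 32, 256, 10
mlp = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, n_classes))
# (In the real task, load the trained weights from the provided code instead.)

x = torch.randn(128, d_in)                     # a batch of inputs

# Pass 1: dense forward, all hidden units participate.
h = torch.relu(mlp[0](x))
logits_dense = mlp[2](h)

# Pass 2: keep only the top-k hidden activations per example (contextual sparsity).
k = int(0.1 * d_hidden)                        # e.g. 10% of the hidden units
topk = h.topk(k, dim=1)
h_sparse = torch.zeros_like(h).scatter_(1, topk.indices, topk.values)
logits_sparse = mlp[2](h_sparse)

agree = (logits_dense.argmax(1) == logits_sparse.argmax(1)).float().mean()
print(f"prediction agreement with top-{k} hidden units: {agree:.2%}")

Sweeping k and plotting accuracy (or agreement with the dense pass) against the kept proportion is exactly the evaluation asked for above.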
A study on GPU for sparse matrix multiplication
In the deep learning era, model predictions rely heavily on matrix operations.
However, modern models are usually large, which makes these operations expensive.
To reduce this computational cost, many studies explore sparsity as a way to skip unnecessary computations and speed up the inference process.
When matrices become sparse, how to perform sparse matrix multiplication efficiently is a key aspect of improving computational performance.
Therefore, in this project, you will benchmark different libraries for sparse matrix multiplication.
-
cuSPARSE is a GPU library that provides optimized APIs for efficient sparse matrix operations.
In Python, CuPy offers access to cuSPARSE, making it possible to run sparse matrix multiplications on GPUs.
On the other hand, SciPy provides sparse matrix operations on CPUs.
Can you use CuPy to perform sparse matrix multiplication and compare its runtime with SciPy's CPU-based implementation to evaluate the benefit of GPU acceleration?
Please make sure to include all necessary setup costs in the timing when benchmarking, such as format conversions or initialization, to ensure a fair comparison between CPU and GPU.
In addition, run the experiments in a clean environment (i.e., no other processes occupying the same machine).
You may also use a profiler to inspect the running time of each statement in detail.
-
Sparse matrix multiplication mainly involves two common operations:
- Sparse matrix x Dense vector
- Sparse matrix x Sparse matrix
In this project, you are expected to benchmark the performance of both settings separately, and analyze how the results vary with different matrix densities; a possible benchmarking skeleton is sketched below.
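A possible benchmarking skeleton follows. It assumes a working CuPy/CUDA installation; the matrix size and density are placeholders you should vary, the GPU timing deliberately includes the host-to-device conversion, and a synchronization is required before stopping the clock.

import time
import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cusp

n, density = 10000, 0.001                 # vary these to study the density effect
A = sp.random(n, n, density=density, format="csr", dtype=np.float32)
B = sp.random(n, n, density=density, format="csr", dtype=np.float32)
x = np.random.rand(n).astype(np.float32)

# --- CPU (SciPy) ---
t0 = time.perf_counter()
y_cpu = A @ x                             # sparse matrix x dense vector
C_cpu = A @ B                             # sparse matrix x sparse matrix
t_cpu = time.perf_counter() - t0

# --- GPU (CuPy / cuSPARSE), counting conversion cost as part of the measurement ---
t0 = time.perf_counter()
A_gpu = cusp.csr_matrix(A)                # host-to-device transfer + format setup
B_gpu = cusp.csr_matrix(B)
x_gpu = cp.asarray(x)
y_gpu = A_gpu @ x_gpu
C_gpu = A_gpu @ B_gpu
cp.cuda.Device().synchronize()            # make sure all kernels have finished
t_gpu = time.perf_counter() - t0

print(f"CPU: {t_cpu:.3f} s   GPU (incl. conversion): {t_gpu:.3f} s")
print("SpMV results match:", np.allclose(y_cpu, cp.asnumpy(y_gpu), atol=1e-3))

For the report, time the two operations separately and repeat each measurement several times, rather than timing them in one block as in this sketch.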