A study on Automatic Differentiation
In our lecture, we have covered the basic concept of automatic differentiation.
You might be curious how a real-world library computes derivatives of various operators during neural network training.
Please complete the following steps:
- We have implemented the forward mode of automatic differentiation in a previous course project, simpleautodiff.
Please implement the reverse mode in simpleautodiff (a minimal tape-based sketch is given at the end of this section for reference).
- Expand the implementation document to give a similar illustration for the reverse mode (ask the TA for the LaTeX source).
- Study the implementation of Autograd, a NumPy-based automatic differentiation library.
Run and trace the code starting from autograd/core.py::make_vjp(fun, x).
Describe in your own words how the reverse mode of automatic differentiation is implemented in Autograd.
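For reference, the following is a minimal tape-based sketch of reverse mode for scalar operations. It is not the simpleautodiff or Autograd API; all class and function names are our own, and the code only illustrates the two ingredients you will need: recording local derivatives during the forward pass and accumulating adjoints in reverse topological order.

import math

class Node:
    """A scalar value that records how it was produced."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value                # forward value
        self.parents = parents            # Nodes this value depends on
        self.local_grads = local_grads    # d(self)/d(parent) for each parent
        self.grad = 0.0                   # adjoint, filled in by the reverse sweep

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def log(a):
    return Node(math.log(a.value), (a,), (1.0 / a.value,))

def backward(output):
    """Reverse sweep: propagate adjoints from the output back to all inputs."""
    order, visited = [], set()
    def visit(node):                      # topological order by depth-first search
        if id(node) not in visited:
            visited.add(id(node))
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += local * node.grad   # chain rule, accumulated over all uses

# Example: f(x1, x2) = log(x1) + x1 * x2
x1, x2 = Node(2.0), Node(5.0)
y = add(log(x1), mul(x1, x2))
backward(y)
print(y.value, x1.grad, x2.grad)   # 10.693..., 1/2 + 5 = 5.5, 2.0

Autograd follows the same idea, but its primitives record vector-Jacobian products (VJPs) for array-valued operations: roughly speaking, make_vjp(fun, x) traces fun at x to build such a graph and returns a function that replays it backward.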
A study on GPU for matrix-matrix products
In our lecture, we discussed why GPUs are very effective for massively parallel operations.
We also discussed block algorithms on CPUs for reducing the cost of memory access in matrix-matrix products.
In this project, we want to study techniques for accelerating matrix products on GPUs in more depth.
-
The first task is to implement a memory-efficient CUDA kernel for matrix multiplication C = AB by using __shared__ variables.
The __shared__ memory is shared by all threads within a single thread block.
Similar to the "naive kernel" described in the course slides, each thread also computes a single entry C[i][j].
However, instead of each thread independently fetching A[i][k] and B[k][j] from global memory, you'll use __shared__ memory as a staging area to load the needed entries so that all threads in the block can access them quickly.
-
Under the assumption that accessing shared memory is faster than accessing global memory, can you mathematically show why this implementation is more efficient than the original one in the course slides? (A sketch of the counting argument is given at the end of this section.)
-
Compared with the naive implementation shown in the course slides, do the empirical results support your theoretical analysis?
Possible questions:
-
How does the block_size affect the performance? Do you observe a sweet spot?
-
What is the difference between the above implementation (called tiled matrix multiplication) and the block algorithm discussed in class? Do you think it's possible to implement the block algorithm to further accelerate the runtime? Why or why not?
Note: While you can find many tutorials on "tiled matrix multiplication", this project requires you to do more than just copy and paste. Try to imagine that you are going to teach your classmates in the progress presentation.
-
Compare the running time of your implementation with one (or more) existing GPU BLAS libraries (e.g., cuBLAS).
Are the results similar? If not, can you investigate what other optimizations they may have done?
If possible, provide a simple implementation to illustrate the idea.
-
How are GPU BLAS libraries used by deep learning tools?
For example, does PyTorch utilize cuBLAS for matrix multiplication?
Try to trace the code to gain some understanding (you may need to build PyTorch from source to get more information from the code); a small profiling experiment, sketched at the end of this section, can serve as a starting point.
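As a hint for the counting argument mentioned above, here is a sketch under simplifying assumptions (square n-by-n matrices, b-by-b thread blocks and tiles with b dividing n, and counting only global-memory reads of matrix entries):

\[
\underbrace{n^2}_{\#\text{threads}} \cdot \underbrace{2n}_{\text{reads per thread}} \;=\; 2n^3
\quad\text{(naive kernel)}, \qquad
\underbrace{\frac{n^2}{b^2}}_{\#\text{blocks}} \cdot \underbrace{\frac{n}{b}}_{\text{tile steps}} \cdot \underbrace{2b^2}_{\text{reads per step}} \;=\; \frac{2n^3}{b}
\quad\text{(tiled kernel)}.
\]

Both kernels perform the same 2n^3 floating-point operations, so the tiled version replaces roughly a factor of b of slow global-memory traffic with fast shared-memory accesses.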
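For the last question, a quick experiment with PyTorch's built-in profiler can hint at whether a matrix multiplication is dispatched to cuBLAS before you dive into the C++ sources. This is only a sketch: it assumes a CUDA-enabled PyTorch build, and the kernel names you will see depend on your PyTorch/CUDA versions (names containing "gemm", "cublas", or "cutlass" are the usual hint).

import torch
from torch.profiler import profile, ProfilerActivity

# Profile a single large GEMM on the GPU and inspect which kernels were launched.
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    c = a @ b
    torch.cuda.synchronize()

# The kernel names in the table suggest which library implemented the GEMM.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Tracing where these kernels are launched in the PyTorch sources is still up to you; the profiler only tells you what ended up running on the device.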
Understanding and Improving GPT-2 Implementations in NanoGPT
In this project, you will explore the implementation of GPT-2 through the lightweight framework NanoGPT, with the goal of connecting mathematical definitions to practical code and improving the implementation.
- Your first task is to connect the mathematical definition of multi-head attention (see Section Transformer blocks and other details in the slides) with its implementation in NanoGPT (see model.py in the NanoGPT repository).
-
Locating the computation of Query, Key, and Value: Find where queries, keys, and values are generated in the NanoGPT code, and compare this process with how they are introduced in the slides. What differences do you notice? What efficiency or memory benefits do these differences bring? (A small sketch of a fused QKV projection and head splitting is given at the end of this section for comparison.)
-
Head Splitting and Tensor Shapes: Identify where the code separates the attention heads and reshapes tensors. How does the reshaping in code relate to the mathematical definition in the slides? What reasons might explain the reshaping?
-
Your second task is to implement and evaluate KV caching in NanoGPT, with a focus on inference efficiency.
-
KV Caching: We introduce KV caching in the slides (see Section Prediction).
Implement KV caching for the autoregressive prediction in NanoGPT (see sample.py in the NanoGPT repository) and compare inference time with the original NanoGPT; a minimal caching sketch is given at the end of this section.
For reference, you can check the implementation of KV caching in the official GPT-2 repository.
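As a reading aid for the first task, here is a minimal, self-contained sketch of a fused QKV projection followed by head splitting. It is not NanoGPT's actual code; the tensor names and sizes are our own choices, but comparing it against model.py should make the corresponding lines easier to spot.

import torch
import torch.nn as nn

B, T, n_embd, n_head = 2, 8, 64, 4        # batch, sequence length, model width, heads
head_dim = n_embd // n_head

# One fused linear layer produces Q, K, and V together (one GEMM instead of three).
c_attn = nn.Linear(n_embd, 3 * n_embd)

x = torch.randn(B, T, n_embd)
q, k, v = c_attn(x).split(n_embd, dim=2)              # each of shape (B, T, n_embd)

# Reshape so that each head becomes its own batch dimension: (B, n_head, T, head_dim).
q = q.view(B, T, n_head, head_dim).transpose(1, 2)
k = k.view(B, T, n_head, head_dim).transpose(1, 2)
v = v.view(B, T, n_head, head_dim).transpose(1, 2)

# Scaled dot-product attention then becomes one batched matmul over B * n_head heads.
att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5     # (B, n_head, T, T)
att = att.softmax(dim=-1)
y = (att @ v).transpose(1, 2).contiguous().view(B, T, n_embd)   # merge heads back
print(y.shape)                                        # torch.Size([2, 8, 64])

(The sketch omits the causal mask, dropout, and the output projection.)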
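For the second task, the following hypothetical sketch (again not sample.py) isolates the idea of KV caching for a single attention head: at each decoding step, only the newest token's query, key, and value are computed, while the keys and values of previous steps are reused from a cache.

import torch

torch.manual_seed(0)
n_embd, head_dim = 64, 64                  # one head, for simplicity
Wq = torch.randn(n_embd, head_dim)
Wk = torch.randn(n_embd, head_dim)
Wv = torch.randn(n_embd, head_dim)

k_cache, v_cache = [], []                  # grows by one row per generated token

def decode_step(x_t):
    """x_t: embedding of the newest token, shape (1, n_embd)."""
    q = x_t @ Wq                           # only the new query is needed
    k_cache.append(x_t @ Wk)               # cache the new key ...
    v_cache.append(x_t @ Wv)               # ... and the new value
    K = torch.cat(k_cache, dim=0)          # (t, head_dim), reused across steps
    V = torch.cat(v_cache, dim=0)
    att = (q @ K.T / head_dim ** 0.5).softmax(dim=-1)
    return att @ V                         # attention output for the new token only

# Without the cache, step t recomputes keys and values for all t previous tokens,
# so the projection cost over T steps grows as O(T^2) instead of O(T).
for t in range(5):
    out = decode_step(torch.randn(1, n_embd))
print(out.shape)                           # torch.Size([1, 64])

In NanoGPT you will instead keep the cache as tensors inside the attention module and pass it through the autoregressive generation loop; measuring the speedup as the generated sequence grows is the point of the comparison.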
Understand FlashAttention (I)
In this project, you will study the paper
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".
Your goal is to understand the key contributions of this work by implementing the algorithms it presents.
-
Task 1: Understand and Compare
-
Algorithms 0 and 1 in the paper correspond to the standard and optimized implementations of the forward pass of Attention.
Please read our slides on these two algorithms and explain why FlashAttention (Algorithm 1) achieves higher efficiency, focusing on memory access and computation complexity.
Submit a written explanation.
-
Task 2: Implement and Benchmark
-
Implement Algorithms 0 and 1 in C++.
Compare their execution time and memory usage under identical conditions.
Use your experimental results to validate your explanation from Task 1. (A NumPy reference sketch that you can check your C++ implementations against is given at the end of this section.)
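Since debugging numerics in C++ can be tedious, a small reference to check against may help. Below is a NumPy sketch (our own block sizes and variable names, not the paper's notation, and without masking or dropout) of the standard forward pass and of a tiled forward pass with online softmax in the spirit of Algorithm 1; the two should agree up to floating-point error, and your C++ implementations can be validated on the same inputs.

import numpy as np

def attention_standard(Q, K, V):
    """Algorithm-0-style forward: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V

def attention_tiled(Q, K, V, Br=16, Bc=16):
    """Tiled forward with online softmax: only Br x Bc score blocks are formed."""
    N, d = Q.shape
    O = np.zeros((N, d))
    for i in range(0, N, Br):
        Qi = Q[i:i + Br]
        m = np.full(Qi.shape[0], -np.inf)      # running row-wise maxima
        l = np.zeros(Qi.shape[0])              # running softmax denominators
        Oi = np.zeros((Qi.shape[0], d))        # running (unnormalized) outputs
        for j in range(0, N, Bc):
            Sij = Qi @ K[j:j + Bc].T / np.sqrt(d)
            m_new = np.maximum(m, Sij.max(axis=1))
            P = np.exp(Sij - m_new[:, None])
            scale = np.exp(m - m_new)          # rescale previously accumulated blocks
            l = scale * l + P.sum(axis=1)
            Oi = scale[:, None] * Oi + P @ V[j:j + Bc]
            m = m_new
        O[i:i + Br] = Oi / l[:, None]
    return O

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
print(np.allclose(attention_standard(Q, K, V), attention_tiled(Q, K, V)))   # True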
Understand FlashAttention (II)
In this project, you will study the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". Your goal is to understand the key contributions of this work by deriving the corresponding backward algorithms.
-
Review the first task in Understand FlashAttention (I).
-
Task: Derive Backward Algorithms
-
Derive the backward algorithms corresponding to Algorithm 0 and Algorithm 1 from the FlashAttention paper.
Please follow our slides for Algorithms 0 and 1 as a guide; the gradient formulas for the standard forward pass are sketched at the end of this section as a starting point.
In addition, compute the memory access and computation complexity of the two backward algorithms, and explain why the backward algorithm of FlashAttention is more efficient.
Add this explanation to your report from the first task.
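As a starting point for the derivation mentioned above, and in our own notation for the standard forward pass (S = QK^{\top}/\sqrt{d}, P = \mathrm{softmax}(S) applied row-wise, O = PV, without masking or dropout), the gradients given dO = \partial L / \partial O are

\begin{align*}
dV &= P^{\top}\, dO, \\
dP &= dO\, V^{\top}, \\
dS_{ij} &= P_{ij}\Big(dP_{ij} - \sum_{k} dP_{ik}\,P_{ik}\Big) \quad \text{(row-wise softmax Jacobian)}, \\
dQ &= \tfrac{1}{\sqrt{d}}\, dS\, K, \qquad dK = \tfrac{1}{\sqrt{d}}\, dS^{\top} Q.
\end{align*}

Roughly speaking, the FlashAttention backward evaluates these block by block, recomputing the needed blocks of P from Q, K, and the saved softmax statistics rather than storing the N x N matrices; your job is to make this precise and to count its memory accesses.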
Redo experiments from Liu et al. (2023) that exploit sparsity to accelerate LLM inference
Liu et al. (2023) introduce the hypothesis of contextual sparsity, where only a small subset of weights is needed for computation.
In this way, they leverage sparsity to accelerate inference without sacrificing model performance.
To examine this idea, please investigate their implementation and complete the following two tasks:
-
First, we begin with a simple multi-class classification task.
Please use our provided code to verify the hypothesis of contextual sparsity in this setting.
Specifically, you need to perform two forward passes: the first using all weights, and the second using only the top-k weights with the largest outputs (a minimal sketch of this two-pass comparison is given at the end of this section).
Then, evaluate how the proportion of these k weights affects accuracy.
-
Next, please extend this verification process to LLMs.
For example, you can use GPT-2 (or other models your device can handle) and examine contextual sparsity in the MLP or attention layers.
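To make the two-pass comparison concrete, here is a minimal, hypothetical sketch for a single MLP layer (it is neither the provided course code nor Liu et al.'s implementation): the second pass keeps only the top-k hidden units with the largest activations and zeroes out the rest. With a trained model, the hypothesis predicts that the two passes mostly agree.

import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, n_classes = 32, 256, 10
mlp = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, n_classes))
# (In the real task, load the trained weights from the provided code instead.)

x = torch.randn(128, d_in)                     # a batch of inputs

# Pass 1: dense forward, all hidden units participate.
h = torch.relu(mlp[0](x))
logits_dense = mlp[2](h)

# Pass 2: keep only the top-k hidden activations per example (contextual sparsity).
k = int(0.1 * d_hidden)                        # e.g. 10% of the hidden units
topk = h.topk(k, dim=1)
h_sparse = torch.zeros_like(h).scatter_(1, topk.indices, topk.values)
logits_sparse = mlp[2](h_sparse)

agree = (logits_dense.argmax(1) == logits_sparse.argmax(1)).float().mean()
print(f"prediction agreement with top-{k} hidden units: {agree:.2%}")

Sweeping k and plotting accuracy (or agreement with the dense pass) against the kept proportion is exactly the evaluation asked for above.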
A study on GPU for sparse matrix multiplication
In the deep learning era, model predictions rely heavily on matrix operations.
However, modern models are usually large, which makes these operations expensive.
To reduce this computational cost, many studies explore sparsity as a way to skip unnecessary computations and speed up the inference process.
When matrices become sparse, how to perform sparse matrix multiplication efficiently is a key aspect of improving computational performance.
Therefore, in this project, you will benchmark different libraries for sparse matrix multiplication.
-
cuSPARSE is a GPU library that provides optimized APIs for efficient sparse matrix operations.
In Python, CuPy offers access to cuSPARSE, making it possible to run sparse matrix multiplications on GPUs.
On the other hand, SciPy provides sparse matrix operations on CPUs.
Can you use CuPy to perform sparse matrix multiplication and compare its runtime with SciPy's CPU-based implementation to evaluate the benefit of GPU acceleration?
Please make sure to include all necessary setup costs in the timing when benchmarking, such as format conversions or initialization, to ensure a fair comparison between CPU and GPU.
In addition, run the experiments in a clean environment (i.e., no other processes occupying the same machine).
You may also use a profiler to inspect the running time of each statement in detail.
-
Sparse matrix multiplication mainly involves two common operations:
- Sparse matrix x Dense vector
- Sparse matrix x Sparse matrix
In this project, you are expected to benchmark the performance of both settings separately, and analyze how the results vary with different matrix densities; a possible benchmarking skeleton is sketched below.
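A possible benchmarking skeleton follows. It assumes a working CuPy/CUDA installation; the matrix size and density are placeholders you should vary, the GPU timing deliberately includes the host-to-device conversion, and a synchronization is required before stopping the clock.

import time
import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cusp

n, density = 10000, 0.001                 # vary these to study the density effect
A = sp.random(n, n, density=density, format="csr", dtype=np.float32)
B = sp.random(n, n, density=density, format="csr", dtype=np.float32)
x = np.random.rand(n).astype(np.float32)

# --- CPU (SciPy) ---
t0 = time.perf_counter()
y_cpu = A @ x                             # sparse matrix x dense vector
C_cpu = A @ B                             # sparse matrix x sparse matrix
t_cpu = time.perf_counter() - t0

# --- GPU (CuPy / cuSPARSE), counting conversion cost as part of the measurement ---
t0 = time.perf_counter()
A_gpu = cusp.csr_matrix(A)                # host-to-device transfer + format setup
B_gpu = cusp.csr_matrix(B)
x_gpu = cp.asarray(x)
y_gpu = A_gpu @ x_gpu
C_gpu = A_gpu @ B_gpu
cp.cuda.Device().synchronize()            # make sure all kernels have finished
t_gpu = time.perf_counter() - t0

print(f"CPU: {t_cpu:.3f} s   GPU (incl. conversion): {t_gpu:.3f} s")
print("SpMV results match:", np.allclose(y_cpu, cp.asnumpy(y_gpu), atol=1e-3))

For the report, time the two operations separately and repeat each measurement several times, rather than timing them in one block as in this sketch.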