A study on Automatic Differentiation
In our lecture, we have covered the basic concept of automatic differentiation.
You might be curious how a real-world library computes derivatives of various operators during neural network training.
Please complete the following steps:
- We have implemented the forward mode of automatic differentiation in a previous course project,
simpleautodiff.
Please implement the reverse mode in simpleautodiff.
- Expand the implementation document to give a similar illustration for the reverse mode (ask the TA for the LaTeX source).
- Study the implementation of Autograd, a numpy-based automatic differentiation library.
Run and trace the code starting from autograd/core.py::make_vjp(fun, x).
Describe in your own words how reverse mode of automatic differentiation is implemented in Autograd.
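Before tracing Autograd, it may help to see the whole idea in a few lines. The following is a minimal, self-contained sketch of graph-based reverse mode in Python; the names Node, topo_order, and backward are illustrative and do not correspond to simpleautodiff or Autograd.

    # Minimal reverse-mode sketch (illustrative only; not simpleautodiff or Autograd).
    import math

    class Node:
        def __init__(self, value, parents=(), local_grads=()):
            self.value = value              # primal value from the forward pass
            self.parents = parents          # nodes this node depends on
            self.local_grads = local_grads  # d(self)/d(parent) for each parent
            self.grad = 0.0                 # accumulated adjoint

    def add(a, b):
        return Node(a.value + b.value, (a, b), (1.0, 1.0))

    def mul(a, b):
        return Node(a.value * b.value, (a, b), (b.value, a.value))

    def log(a):
        return Node(math.log(a.value), (a,), (1.0 / a.value,))

    def topo_order(output):
        # Depth-first post-order gives a topological order of the graph.
        order, visited = [], set()
        def visit(node):
            if id(node) not in visited:
                visited.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)
        visit(output)
        return order

    def backward(output):
        # Reverse sweep: propagate adjoints in reverse topological order, so each
        # node's adjoint is complete before it is passed on to its parents.
        output.grad = 1.0
        for node in reversed(topo_order(output)):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local   # chain rule accumulation

    # Example: f(x1, x2) = log(x1 * x2) + x2
    x1, x2 = Node(2.0), Node(5.0)
    f = add(log(mul(x1, x2)), x2)
    backward(f)
    print(x1.grad, x2.grad)   # df/dx1 = 1/x1 = 0.5, df/dx2 = 1/x2 + 1 = 1.2

Roughly speaking, Autograd's make_vjp follows the same pattern at a larger scale: the forward call records the computation graph while evaluating fun(x), and the returned function replays that graph backward, accumulating vector-Jacobian products for each recorded operation.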
A study on GPU for matrix-matrix products (Last update on Oct. 15)
In our lecture, we discussed why GPUs are very effective for massively parallel operations.
We also discussed block algorithms on CPUs for reducing the cost of memory access in matrix-matrix products.
In this project, we study techniques for accelerating matrix products on GPUs in more depth.
- First, please implement three CUDA kernels for the matrix multiplication A = BC, following the instructions in the slides.
- Kernel 1: Explore how the mapping of matrix elements to threads affects performance.
- Kernel 2: Explore how tiled matrix multiplication speeds up matrix products. (Note: while you can find many tutorials on "tiled matrix multiplication", this project requires you to do more than just copy-and-paste. Imagine you are going to teach your classmates in the progress presentation.)
- Kernel 3: Investigate additional strategies for acceleration.
- Compare the running time of your implementation with one (or more) existing GPU BLAS libraries (e.g., cuBLAS), or even a library that utilizes the GPU's tensor cores.
Are the results for your implementation and the optimized library similar? If not, can you investigate what other optimizations they may have done?
If possible, provide a simple implementation to illustrate the idea. (A minimal timing sketch is given after this list.)
- Explore how GPU BLAS libraries are used by deep learning tools.
For example, does PyTorch utilize cuBLAS for matrix multiplication?
Try to trace the code to gain some understanding. (You may need to build PyTorch from source to get more information from the code.)
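For the comparison with an existing GPU BLAS, the rough timing sketch below may serve as a starting point (Python; it assumes CuPy is installed with a CUDA GPU and uses CuPy's dense matmul, which is backed by cuBLAS; the matrix size is an arbitrary choice). Your own kernels can be timed the same way with CUDA events.

    # Rough timing sketch for a GPU BLAS baseline (cuBLAS via CuPy).
    import cupy as cp

    n = 4096
    B = cp.random.rand(n, n, dtype=cp.float32)
    C = cp.random.rand(n, n, dtype=cp.float32)

    # Warm up so one-time initialization is not counted in the measurement.
    A = B @ C
    cp.cuda.Stream.null.synchronize()

    start, stop = cp.cuda.Event(), cp.cuda.Event()
    start.record()
    A = B @ C                                    # dense matmul, backed by cuBLAS
    stop.record()
    stop.synchronize()
    ms = cp.cuda.get_elapsed_time(start, stop)   # elapsed time in milliseconds
    print(f"{ms:.2f} ms, {2 * n**3 / (ms * 1e-3) / 1e12:.2f} TFLOP/s")

Counting 2n^3 floating-point operations gives a rough throughput number to compare against your three kernels and against the GPU's peak.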
Understanding and Improving GPT-2 Implementations in NanoGPT
In this project, you will explore the implementation of GPT-2 through the lightweight framework NanoGPT, with the goal of connecting mathematical definitions to practical code and improving the implementation.
- Your first task is to connect the mathematical definition of multi-head attention (see Section Transformer blocks and other details in the slides) with its implementation in NanoGPT (see model.py in the NanoGPT repository).
- Locating the computation of Query, Key, and Value: Find where queries, keys, and values are generated in the NanoGPT code, and compare this process with how they are introduced in the slides. What differences do you notice? What efficiency or memory benefits do these differences bring?
- Head Splitting and Tensor Shapes: Identify where the code separates the attention heads and reshapes tensors. How does the reshaping in the code relate to the mathematical definition in the slides? What reasons might explain the reshaping? (A short reshape example is given after this project description.)
- Your second task is to implement and evaluate KV caching in NanoGPT, with a focus on inference efficiency.
- KV Caching: We introduce KV caching in the slides (see Section Prediction in the slides). Implement KV caching for the autoregressive prediction in NanoGPT (see sample.py in the NanoGPT repository) and compare inference time with the original NanoGPT. For reference, you can check the implementation of KV caching in the official GPT-2 repository.
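For the head-splitting question, the relation between the code's reshaping and the per-head formulas in the slides can be seen on a toy example (illustrative shapes only; NanoGPT additionally carries a batch dimension):

    # Illustrative head splitting (not NanoGPT code): one packed (T, d_model) tensor
    # is viewed as n_head matrices of shape (T, d_head), with d_model = n_head * d_head.
    import torch
    T, n_head, d_head = 10, 4, 16
    q = torch.randn(T, n_head * d_head)                  # all heads in one matrix
    q_heads = q.view(T, n_head, d_head).transpose(0, 1)  # (n_head, T, d_head): one slice per head

For the KV-caching task, the sketch below shows the core idea on a toy single-head attention layer (illustrative PyTorch; it is not NanoGPT's CausalSelfAttention, and it ignores batching, multiple heads, and the mask needed when several prompt tokens are fed at once):

    # Toy single-head attention step with a KV cache (illustrative; not NanoGPT code).
    import torch

    class CachedSelfAttention(torch.nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.qkv = torch.nn.Linear(d_model, 3 * d_model)
            self.k_cache = torch.empty(0, d_model)   # keys of all tokens seen so far
            self.v_cache = torch.empty(0, d_model)   # values of all tokens seen so far

        def forward(self, x_new):
            # x_new: (t_new, d_model), only the tokens not processed before.
            q, k, v = self.qkv(x_new).chunk(3, dim=-1)
            # Append new keys/values instead of recomputing them for the whole prefix.
            self.k_cache = torch.cat([self.k_cache, k], dim=0)
            self.v_cache = torch.cat([self.v_cache, v], dim=0)
            att = (q @ self.k_cache.T) / (q.shape[-1] ** 0.5)
            # With one token per step, no causal mask is needed: the cache holds
            # only past and current positions.
            att = torch.softmax(att, dim=-1)
            return att @ self.v_cache

    # During generation, feed only the newest token at each step.
    attn = CachedSelfAttention(d_model=8)
    with torch.no_grad():
        for step in range(5):
            x_new = torch.randn(1, 8)   # embedding of one new token
            y = attn(x_new)             # attends over all cached positions
            print(step, y.shape)        # torch.Size([1, 8])

When comparing inference time with the original sample.py, make sure both versions generate the same number of tokens and that the timing excludes model loading.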
Understand FlashAttention and Implement Its Forward Algorithms
In this project, you will study the paper
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".
Your goal is to understand the key contributions of this work by implementing the algorithms it presents.
- Task 1: Understand and Compare
- Please read our slides and explain why FlashAttention-2, as presented there, achieves higher efficiency than NaiveAttention. Focus on memory access and computational complexity, and submit a written explanation.
- Task 2: Implement and Benchmark with Different Sequence Lengths (T)
- Implement the following two algorithms, NaiveAttention and FlashAttention-2, in C++ (for intuition, a short NumPy sketch of the tiled forward pass is given after this project description):
- Find your computer's actual cache size (M) and configure FlashAttention-2 with the M you found.
On Linux systems, you can find the cache size with the following shell command:
lscpu | grep "cache"
Run FlashAttention-2 and verify its correctness by comparing its output with NaiveAttention's using the allclose function (provided by the TA).
- Compare the runtime of FlashAttention-2 against NaiveAttention under different sequence lengths (T).
Check what happens when the T-by-d matrices Q, K, and V (see their definitions in our slides) cannot fit into your computer's cache, and analyze why.
- Utility functions provided by the TA: We provide the utility functions row_max, row_sum, random_matrix, and allclose for your convenience. Their implementations are in this sample code.
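The deliverables above are in C++, but the tiled, online-softmax idea behind the FlashAttention forward pass is easy to prototype first. Below is a short NumPy sketch that tiles only over K and V (the full algorithm also tiles Q and chooses the block size from the cache size M); the allclose used here is NumPy's, not the TA-provided one.

    # NumPy sketch of naive attention vs. a tiled, online-softmax forward pass.
    import numpy as np

    def naive_attention(Q, K, V):
        # Standard attention: softmax(Q K^T / sqrt(d)) V, storing the full T x T matrix.
        d = Q.shape[1]
        S = Q @ K.T / np.sqrt(d)
        P = np.exp(S - S.max(axis=1, keepdims=True))
        return (P / P.sum(axis=1, keepdims=True)) @ V

    def tiled_attention(Q, K, V, block=64):
        # Process K and V in blocks, keeping a running row max (m), normalizer (l),
        # and unnormalized output (O); the T x T score matrix is never materialized.
        T, d = Q.shape
        O = np.zeros((T, d))
        m = np.full((T, 1), -np.inf)
        l = np.zeros((T, 1))
        for start in range(0, T, block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            S = Q @ Kb.T / np.sqrt(d)                     # scores for this block only
            m_new = np.maximum(m, S.max(axis=1, keepdims=True))
            P = np.exp(S - m_new)
            scale = np.exp(m - m_new)                     # rescale earlier partial results
            l = scale * l + P.sum(axis=1, keepdims=True)
            O = scale * O + P @ Vb
            m = m_new
        return O / l                                      # single normalization at the end

    rng = np.random.default_rng(0)
    T, d = 256, 32
    Q, K, V = rng.standard_normal((3, T, d))
    print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))

The point to observe is that each block updates m, l, and O using only O(T x block) extra memory, and the division by l happens once at the end, as in FlashAttention-2.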
Understand FlashAttention and Derive Corresponding Backward Algorithms
In this project, you will study the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". Your goal is to understand the key contributions of this work by deriving the corresponding backward algorithms.
- Task 1: The same as Task 1 in "Understand FlashAttention and Implement Its Forward Algorithms".
- Task 2: Derive Backward Algorithms
- Derive the backward algorithms corresponding to FlashAttention-2 in our slides.
In addition, compute the memory access and computational complexity of the two backward algorithms, and explain why the backward algorithm of FlashAttention is more efficient.
Add this explanation to your report from the first task.
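As a starting point for the derivation, recall the backward pass of standard attention, written with O = P V, P = softmax(S), and S = Q K^T / sqrt(d) as in our slides, where dX denotes the gradient of the loss with respect to X:

    dV = P^T dO
    dP = dO V^T
    dS_ij = P_ij (dP_ij - sum_k dP_ik P_ik)     (row-wise softmax Jacobian)
    dQ = dS K / sqrt(d)
    dK = dS^T Q / sqrt(d)

The FlashAttention backward algorithms evaluate these block by block, recomputing P from Q and K instead of storing it. A useful identity for the derivation is sum_k dP_ik P_ik = sum_j dO_ij O_ij, so this row-wise term can be computed from O and dO alone.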
Redo experiments from Liu et al. (2023) that exploit sparsity to accelerate LLM inference
Liu et al. (2023) introduce the hypothesis of contextual sparsity, where only a small subset of weights is needed for computation.
In this way, they leverage sparsity to accelerate inference without sacrificing model performance.
To examine this idea, please investigate their implementation and complete the following two tasks:
- First, we begin with a simple multi-class classification task.
Please use our provided code to verify the hypothesis of contextual sparsity in this setting.
Specifically, you need to perform two forward passes: the first using all weights, and the second using only the top-k weights with the largest outputs.
Then, evaluate how the proportion of these k weights affects accuracy. (A small sketch of the two passes is given after this project description.)
- Next, please extend this verification process to LLMs.
For example, you can use GPT-2 (or other models your device can handle) and examine contextual sparsity in the MLP or attention layers.
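For the first task, the two forward passes might look roughly like the sketch below (illustrative PyTorch for a one-hidden-layer classifier on random data; the provided course code and the exact top-k criterion in Liu et al. (2023) may differ in details):

    # Sketch of the two forward passes: full vs. top-k hidden units (illustrative).
    import torch

    def full_forward(x, W1, b1, W2, b2):
        h = torch.relu(x @ W1 + b1)          # all hidden units participate
        return h @ W2 + b2

    def topk_forward(x, W1, b1, W2, b2, k):
        h = torch.relu(x @ W1 + b1)
        # Keep only the k hidden units with the largest outputs per example;
        # the others are zeroed, so their rows of W2 are effectively skipped.
        idx = h.topk(k, dim=-1).indices
        mask = torch.zeros_like(h).scatter_(-1, idx, 1.0)
        return (h * mask) @ W2 + b2

    torch.manual_seed(0)
    n, d_in, d_hidden, n_cls = 512, 20, 256, 5
    x = torch.randn(n, d_in)
    W1, b1 = torch.randn(d_in, d_hidden), torch.zeros(d_hidden)
    W2, b2 = torch.randn(d_hidden, n_cls), torch.zeros(n_cls)
    y_full = full_forward(x, W1, b1, W2, b2).argmax(-1)
    for k in (16, 64, 128, 256):
        y_k = topk_forward(x, W1, b1, W2, b2, k).argmax(-1)
        print(k, (y_k == y_full).float().mean().item())   # agreement with the full pass

In the project you would replace the random data and weights with the provided classifier and dataset, and report accuracy against the labels for several ratios k / hidden_dim rather than agreement with the full pass.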
A study on GPU for sparse matrix multiplication
In the deep learning era, model predictions rely heavily on matrix operations.
However, modern models are usually large, which makes these operations expensive.
To reduce this computational cost, many studies explore sparsity as a way to skip unnecessary computations and speed up the inference process.
When matrices become sparse, how to perform sparse matrix multiplication efficiently is a key aspect of improving computational performance.
Therefore, in this project, you will benchmark different libraries for sparse matrix multiplication.
- cuSPARSE is a GPU library that provides optimized APIs for efficient sparse matrix operations.
In Python, CuPy offers access to cuSPARSE, making it possible to run sparse matrix multiplications on GPUs.
On the other hand, SciPy provides sparse matrix operations on CPUs.
Can you use CuPy to perform sparse matrix multiplication and compare its runtime with SciPy's CPU-based implementation to evaluate the benefit of GPU acceleration? (A minimal benchmarking sketch is given after this project description.)
Please make sure to include all necessary setup costs in the timing when benchmarking, such as format conversions or initialization, to ensure a fair comparison between CPU and GPU.
In addition, run the experiments in a clean environment (i.e., no other processes occupying the same machine).
You may also use a profiler to inspect the running time of each statement in detail.
- Sparse matrix multiplication mainly involves two common operations:
- Sparse matrix x Dense vector
- Sparse matrix x Sparse matrix
In this project, you are expected to benchmark the performance of both settings separately, and analyze how the results vary with different matrix densities.
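A minimal benchmarking sketch for the sparse matrix x dense vector case is given below (Python; it assumes CuPy with cuSPARSE support and a CUDA GPU, and the size, density, and CSR format are illustrative choices). The sparse matrix x sparse matrix case can be timed the same way by replacing the dense vector with a second sparse matrix.

    # Sketch: sparse matrix x dense vector on CPU (SciPy) vs. GPU (CuPy / cuSPARSE).
    import time
    import numpy as np
    import scipy.sparse as sp
    import cupy as cp
    import cupyx.scipy.sparse as cusp

    n, density = 100_000, 1e-3
    A_cpu = sp.random(n, n, density=density, format="csr", dtype=np.float32)
    x_cpu = np.random.rand(n).astype(np.float32)

    # CPU timing (SciPy).
    t0 = time.perf_counter()
    y_cpu = A_cpu @ x_cpu
    t_cpu = time.perf_counter() - t0

    # GPU timing (CuPy). Setup costs such as the host-to-device copy are included,
    # and the device is synchronized before reading the clock.
    t0 = time.perf_counter()
    A_gpu = cusp.csr_matrix(A_cpu)        # copies the CSR matrix to the GPU
    x_gpu = cp.asarray(x_cpu)
    y_gpu = A_gpu @ x_gpu
    cp.cuda.Stream.null.synchronize()
    t_gpu = time.perf_counter() - t0

    print(f"SciPy (CPU): {t_cpu:.4f} s   CuPy (GPU, incl. transfer): {t_gpu:.4f} s")
    print("results match:", np.allclose(y_cpu, cp.asnumpy(y_gpu), atol=1e-3))

Repeating the measurement for several densities, and separating the transfer/conversion time from the multiplication itself with a profiler, should make the CPU/GPU trade-off visible.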