MiniTorch Notes

Posted by 雪溯 on 2024-03-11

https://github.com/mukobi/Minitorch-Self-Study-Guide-SAIA/blob/main/README.md#what-is-this
https://minitorch.github.io/
https://github.com/srush/GPU-Puzzles (Solve puzzles. Learn CUDA.)

Code style and submission

black . # Format all files
flake8 . # check docstrings, style, etc.
mypy . # check type

Visualization

streamlit run app.py -- 0

Testing

pytest, hypothesis (property-based testing)

Overall layout

minitorch: the computational main body of the project (operators, tensors, autodiff)
- operators.py: scalar functions
1. Scalar-level math functions on plain floats, such as relu, add, less
2. The simplest float-based derivative helpers such as log_back, inv_back
3. Float-based map, reduce and zip, useful mainly for debugging
- tensor_ops.py, fast_ops.py, cuda_ops.py: tensor operators. fast_ops is built on numba.jit and mainly optimizes via multi-threaded parallelism; cuda_ops is built on numba.cuda and optimizes via SIMT. They wrap the scalar functions from operators.py into tensor functions.
1. Subclass TensorOps and implement the basic functions from operators.py through map, zip and reduce. On call, a Tensor is lowered to raw numpy arrays (storage, strides, shape), e.g. SimpleOps.log_back(tensor) = SimpleOps.map(operators.log_back)
2. Tensor-level map, reduce and zip
3. SimpleBackend = TensorBackend(SimpleOps); FastTensorBackend = minitorch.TensorBackend(minitorch.FastOps); GPUBackend = minitorch.TensorBackend(minitorch.CudaOps)

- `fast_conv`, `cuda_conv`
- `tensor_functions.py`: class Function and its subclasses; they record History (building up the computation graph) and implement forward and backward
	1. Function: defines the forward and backward interfaces and calls the operators of the backend that the arguments belong to. apply is the interface actually invoked: it creates a Context, records History (adding the node to the computation graph) if backward is needed, calls forward, and ensures the output is a Tensor
	2. Neg, Add, Inv, MatMul, ... subclasses of Function. For arguments such as dim that are not tensors and need no gradient, backward returns 0, which is later ignored during backpropagation. forward saves whatever values backward will need into the context, and backward reads them back. Also, operators.py only defines the primitive functions, so for example Sigmoid's backward logic is `mul_zip(grad_output, mul_zip(sigmoid(input), add_zip(1.0, neg_map(sigmoid(input)))))`
	3. Copy, View, Permute and other shape-related subclasses. Permute only swaps strides and shape instead of moving the actual data. Some operations require a contiguous tensor, i.e. one laid out in memory order with `strides[i] = prod(shape[i+1:])`
	4. zeros, rand and other tensor-creation functions
	5. `grad_central_difference`, `grad_check` compute gradients numerically; this is approximate, and `grad_check` typically allows about 10% relative error
- `tensor_data.py`
	1. `index_to_position`, `to_index`, `broadcast_index`: helpers for strided access into TensorData, converting between a position (the absolute 1-D offset, i.e. storage[position]) and an index (the multi-dimensional strided, logical address). broadcast_index additionally maps an index into an already-broadcast larger tensor back to the corresponding index into the original, smaller tensor. Note that `to_index` and `index_to_position` are not exact inverses: after `to_index(pos, shape, index)` followed by `pos2 = index_to_position(index, strides)`, pos equals pos2 in the vast majority of cases, but combining strided access with broadcasting can make them differ (still to be investigated). It is safest to first convert the absolute position into a logical index and then convert back. (See the sketch after this list.)
	2. TensorData: storage, strides, shape, plus basic transformations such as permute and copy
- tensor.py
	1. class Tensor: `_tensor: TensorData`, backend, history, name, unique_id. It overloads the basic operators, adds elementary functions such as sigmoid and log (which call the Function subclasses such as Sigmoid), and wraps the TensorData transformations
	2. If tensor.history is None, the tensor is treated as a constant and needs no gradient. If tensor.history exists but is empty (tensor.history.fn and tensor.history.inputs are both empty), the tensor is a leaf (a starting point of the computation graph) and its gradient is stored in grad. For non-leaf tensors that require gradients, the gradient is only passed through and is not stored after propagation.
- autodiff.py
	1. Topological sort
	2. Backpropagation
- optim.py
	1. SGD: `zero_grad` clears the gradients already stored on the leaf nodes; step uses the gradients to update the associated parameters
- module.py
	1. class Parameter
	2. class Module: methods train, eval, named_parameters. `__init__` automatically adds attributes of type Parameter to `self._parameters` and attributes of type Module to `self._modules`. The parameters in `self._parameters` and those belonging to the sub-Modules in `self._modules` are automatically picked up by self.optim and updated in step
- nn.py: avgpool2d, softmax, logsoftmax, maxpool2d, dropout: layer-like functions built from Function subclasses and Tensor methods that do not need to store parameters
- project: visualization and the concrete neural-network layers
	project/run_tensor.py implements layers such as Linear and Conv that do need to store parameters such as weights
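To make the index helpers above concrete, here is a minimal, self-contained sketch of `index_to_position`, `to_index` and `broadcast_index` (my own simplified versions written against the descriptions above; the real minitorch signatures and iteration order may differ slightly):

import numpy as np

def index_to_position(index, strides):
    # Dot product of the multi-dimensional index with the strides
    # gives the absolute offset into the 1-D storage.
    return int(np.sum(np.array(index) * np.array(strides)))

def to_index(pos, shape, out_index):
    # Convert an ordinal position (0 .. prod(shape)-1) into a
    # multi-dimensional index, filling out_index in place.
    for dim in range(len(shape) - 1, -1, -1):
        out_index[dim] = pos % shape[dim]
        pos = pos // shape[dim]

def broadcast_index(big_index, big_shape, shape, out_index):
    # Map an index into the broadcast (larger) shape back to an index
    # into the original smaller shape: extra leading dims are dropped,
    # and dims of size 1 are pinned to 0.
    offset = len(big_shape) - len(shape)
    for dim in range(len(shape)):
        out_index[dim] = big_index[dim + offset] if shape[dim] > 1 else 0

# Example: a contiguous (2, 3) tensor has strides (3, 1).
idx = np.zeros(2, dtype=np.int32)
to_index(4, (2, 3), idx)                 # idx == [1, 1]
print(index_to_position(idx, (3, 1)))    # 4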

Lesson 0: Setup & ML Primer

  1. sigmoid is defined as 1/(1+exp(-x)), but since exp easily overflows float precision, it is actually computed as 1/(1+exp(-x)) for x >= 0 and as exp(x)/(exp(x) + 1) for x < 0 (see the sketch at the end of this lesson)
  2. log adds a small delta so that it never produces inf
  3. The derivative of relu at 0 is taken to be 0, not 1
  4. Most math operations can be classified as map (1-to-1), zip (1+1), or reduce (n-to-1). After implementing the most basic float versions, optimized map, zip and reduce are used to speed up the same computations over whole tensors (this is how minitorch does it)
  5. sigmoid is symmetric about the point (0, 0.5), i.e. about the output value 0.5 rather than about 0. In float arithmetic, sigmoid is exactly 1 for roughly x >= 37 and exactly 0 for x <= -800
  6.  probs = (out * y) + (out - 1.0) * (y - 1.0)
           loss = -probs.log().sum()
           # Update
           loss.view(1).backward()
    

Main reasons I got stuck:
sigmoid's limited floating-point precision
Iterable is not the same as List: it has no len(); you can only call iter(x) and then use for .. in or next(it)
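A minimal sketch of the numerically stable sigmoid from point 1 above, on plain floats (the branching is the standard trick; treat it as illustrative rather than the exact operators.py code):

import math

def sigmoid(x: float) -> float:
    # For x >= 0, exp(-x) <= 1, so 1/(1 + exp(-x)) cannot overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For x < 0, use the equivalent form exp(x)/(exp(x) + 1);
    # exp(x) <= 1 here, so it cannot overflow either.
    return math.exp(x) / (math.exp(x) + 1.0)

print(sigmoid(800.0))   # 1.0, no overflow
print(sigmoid(-800.0))  # 0.0, exp(-800) underflows to 0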

Lesson 1: Autodiff

Methods of computing derivatives: symbolic derivatives, numerical derivatives, auto differentiation.
Auto differentiation sits between symbolic and numerical derivatives, balancing accuracy against ease of implementation and speed.

Forward mode: set x1' = 1 and x2' = 0, then compute v4' from v4 = x1 + 2*x2 + cos(x2), and keep propagating forward to assemble df/dx.
Forward-mode AD is implemented by a nonstandard interpretation of the program in which real numbers are replaced by dual numbers, constants are lifted to dual numbers with a zero epsilon coefficient, and the numeric primitives are lifted to operate on dual numbers. This nonstandard interpretation is generally implemented using one of two strategies: source code transformation or operator overloading.
Backward mode: set f' = 1, then from f = v4 + v5 obtain v4' = ∂f/∂v4, and propagate all the way back to x1'.
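A tiny illustration of the dual-number idea behind forward mode, applied to the v4 = x1 + 2*x2 + cos(x2) example above (my own sketch, not minitorch code): every value carries an epsilon coefficient holding its derivative with respect to the seeded input.

import math
from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # the value
    eps: float   # derivative w.r.t. the seeded input

    def __add__(self, other):
        return Dual(self.val + other.val, self.eps + other.eps)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.eps * other.val + self.val * other.eps)

def cos(d: Dual) -> Dual:
    return Dual(math.cos(d.val), -math.sin(d.val) * d.eps)

# f(x1, x2) = x1 + 2*x2 + cos(x2); seed x1' = 1, x2' = 0 to get df/dx1.
x1 = Dual(1.5, 1.0)
x2 = Dual(0.5, 0.0)
v4 = x1 + Dual(2.0, 0.0) * x2 + cos(x2)
print(v4.eps)  # 1.0 == df/dx1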

Chain Rule:
h(x) = f(g(x)): let z = g(x) and write dfz = df/dz; then ∂h/∂x = dfz * ∂z/∂x
h(x,y) = f(g(x,y)): let z = g(x,y); ∂h/∂x = dfz * ∂z/∂x, ∂h/∂y = dfz * ∂z/∂y
h(x) = f(g(x), g(x)): let z = g(x); ∂h/∂x = 2 * dfz * ∂z/∂x

Backpropagation:
Propagation can start from the top or from the bottom; this tutorial uses BFS
topological sort of a directed acyclic graph
Steps:

  1. Call topological sort
  2. create a dictionary to record derivatives
  3. If the scalar is a leaf, record the derivative to tensor.grad; otherwise call the chain rule and accumulate derivatives
  4. Q: Why is grad only recorded on leaf tensors? A: simply to follow PyTorch's convention
  5. sigmoid backward
    t = operators.sigmoid(a)
    return d_output * (t - t * t)
    
  6. Python Protocol:
    This is called duck typing in Python. In duck typing, the behaviors and properties of an object determine the object type, not the explicit type of the object.
    To make the calculate_total() more dynamic while leveraging type hints, you can use the Protocol from the typing module.
    Only with @runtime_checkable can isinstance be used to check whether an object conforms to a Protocol (see the sketch below)
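A short example of the Protocol + @runtime_checkable pattern mentioned in point 6 (the HasForward protocol here is made up for illustration):

from typing import Protocol, runtime_checkable

@runtime_checkable
class HasForward(Protocol):
    def forward(self, x: float) -> float: ...

class Square:
    def forward(self, x: float) -> float:
        return x * x

# Duck typing: Square never inherits from HasForward, but it satisfies it.
sq = Square()
print(isinstance(sq, HasForward))  # True, thanks to @runtime_checkable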

Lesson 2: Tensors

strided access, contiguous strides, non-contiguous strides
• Storage is where the core data of the tensor is kept. It is always a 1-D array of numbers of length size, no matter the dimensionality or shape of the tensor. Keeping a 1-D storage allows us to have tensors with different shapes point to the same type of underlying data.
• Strides is a tuple that provides the mapping from user indexing to the position in the 1-D storage storage[s1 * index1 + s2 * index2 + s3 * index3 ... ]

broadcasting: for example, given just the constant 1, we can element-wise add 1 to a tensor of shape (2,2,2) without first creating a (2,2,2) tensor filled with 1 and then adding.

Broadcasting is a protocol that allows us to automatically interpret the first expression as implying the second one. Inside zip, we pretend that 10 is a vector of shape (3,) when zipping it with a vector of shape (3,). Again, this is just an interpretation: we never actually create this vector.

Rule 1: Any dimension of size 1 can be zipped with dimensions of size n > 1 by assuming the dimension is copied n times.

Rule 2: Extra dimensions of shape 1 can be added to a tensor to ensure the same number of dimensions with another tensor.

Rule 3: Any extra dimension of size 1 can only be implicitly added on the left side of the shape.
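The three rules can be folded into a small shape-broadcasting routine; this sketch mirrors how I read minitorch's shape_broadcast, with my own error handling:

def shape_broadcast(shape_a, shape_b):
    # Rules 2/3: right-align the shapes, implicitly padding the shorter
    # one with 1s on the left.
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    out = []
    for da, db in zip(a, b):
        # Rule 1: a dimension of size 1 is treated as copied n times.
        if da == 1 or db == 1 or da == db:
            out.append(max(da, db))
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(out)

print(shape_broadcast((2, 3, 1), (7, 2, 1, 5)))  # (7, 2, 3, 5)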

Matrix multiplication can be written in this style, here applied to B and Aᵀ where A is 3 x 2 and B is 2 x 2 (and you will need to use this for the assignment). However, note this is a memory-inefficient way.

From https://minitorch.github.io/module2/broadcasting/

out = minitorch.zeros((2, 3, 1)) + minitorch.zeros((7, 2, 1, 5))
out.shape
(7, 2, 3, 5)

From https://minitorch.github.io/module2/broadcasting/

• Operation a / map: These operations just touch each of the positions in the tensor individually. They don't need to deal with other positions or know anything about the shape or size of the tensor. We can view these operations as applying the following transformation:
• Operation b / zip: These operations only need to pair operations between input tensors. If we assume the tensors have the same size and shape, this type of operation simply aligns these two tensors and applies an operator to each pair of elements:
• Operation c / reduce: These operations need to group together cells within a single tensor. We can think of there being an implied reduce shape that is eliminated in the process of the output construction. For instance, in the example below, we start with an input of shape (3, 2) and create an output of shape (1, 2). Implicitly, we reduce over a tensor of shape (3, 1) for each element in the output.

From https://minitorch.github.io/module2/tensorops/

Note:

  1. Pay attention to the direction in which to_index walks the dimensions
  2. In backward, pay close attention to the shape of the gradient; my implementation forces grad to be either of size 1 or exactly the same shape as the corresponding input
  3. Note that expand and reshape may not actually rearrange the indexing; sometimes permute is what you really want (see the numpy sketch after this list)
  4. Note that division is implemented as multiplication by the reciprocal
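For note 3, numpy shows the same distinction: transposing (permute) only swaps shape and strides, while a reshape of a non-contiguous view would have to copy. A quick numpy-only analogy:

import numpy as np

x = np.arange(6).reshape(2, 3)   # contiguous, element strides (3, 1)
print(x.strides)                 # e.g. (24, 8) bytes for int64

y = x.transpose(1, 0)            # permute: only shape and strides swap
print(y.shape, y.strides)        # (3, 2) (8, 24) -- no data moved
print(y.flags['C_CONTIGUOUS'])   # False: reshaping y would require a copy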

Lesson 3

Covered in the tutorial: 1. parallelism (numba.jit(parallel=True)) 2. CUDA (numba.cuda)

numba.jit(parallel=True)
Use the map and zip structures for easy parallelization; operations like matrix multiplication are implemented separately.
Guidelines required by the tutorial: 1. main loop in parallel (prange); 2. all indices use numpy buffers; 3. the inner function should not call any functions or write any non-local variables. numba parallelizes prange automatically, but it only supports a subset of numpy functions and array operations; if something does not compile, try rewriting it with more primitive, simpler numpy operations, which often compiles without changing the logic. In addition, numba may automatically fuse some operations (combining several primitive operations into one optimized one, e.g. pointwise add + reduce -> matmul).
Note: although declarations inside the loop should be hoisted out automatically (saving allocation time), writing the hoisted version by hand may result in multiple threads writing to the same array.

JIT compilation adds some startup overhead, but it replaces inefficient Python code with fast low-level code.
Combine operations to eliminate unnecessary intermediate tensors.
Backward of matmul, with d the upstream gradient of f(M, N) = M @ N:

grad_M = d @ Nᵀ
grad_N = Mᵀ @ d
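A quick numpy check of these two formulas against a central difference on L = sum(d * (M @ N)) (shapes chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
M, N = rng.normal(size=(3, 4)), rng.normal(size=(4, 5))
d = rng.normal(size=(3, 5))           # upstream gradient dL/d(M @ N)

grad_M = d @ N.T                      # dL/dM
grad_N = M.T @ d                      # dL/dN

# Compare grad_M against central differences on L = sum(d * (M @ N)).
eps = 1e-6
num = np.zeros_like(M)
for i in range(M.shape[0]):
    for j in range(M.shape[1]):
        Mp, Mm = M.copy(), M.copy()
        Mp[i, j] += eps
        Mm[i, j] -= eps
        num[i, j] = (np.sum(d * (Mp @ N)) - np.sum(d * (Mm @ N))) / (2 * eps)
print(np.allclose(grad_M, num))       # True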

numba.jit

Supported Operations with parallel semantics

  1. Numba array operations:
    • unary operators: + - ~
    • binary operators: + - * / /? % | >> ^ << & ** //
    • comparison operators: == != < <= > >=
    • Numpy ufuncs that are supported in nopython mode.
    • User defined DUFunc through vectorize().
  2. Numpy reduction functions sum, prod, min, max, argmin, and argmax. Also, array math functions mean, var, and std.
  3. Numpy array creation functions zeros, ones, arange,...
  4. Numpy dot function
  5. The full semantics of Numpy broadcast is not supported
  6. Array assignment using a slice
  7. The reduce operator of functools on 1D Numpy arrays

Explicit Parallel Loops

  1. prange can replace range to mark a loop as parallelizable; numba generally only parallelizes the outermost prange and treats the inner ones as ordinary range
  2. The variable holding a reduction result should be initialized before the parallel loop starts, so that all threads begin from the same initial state
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def prange_wrong_result(x):
    n = x.shape[0]
    y = np.zeros(4)
    for i in prange(n):
        # accumulating into the same element of `y` from different
        # parallel iterations of the loop results in a race condition
        y[i % 4] += x[i] # ERROR

    return y

whereas performing a whole array reduction is fine:

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def prange_ok_result_whole_arr(x):
    n = x.shape[0]
    y = np.zeros(4)
    for i in prange(n):
        y += x[i]#OK
    return y

Basic pattern

def tensor_map(
    fn: Callable[[float], float]
) -> Callable[[Storage, Shape, Strides, Storage, Shape, Strides], None]:
    """
    NUMBA low_level tensor_map function. See `tensor_ops.py` for description.

    Optimizations:

    * Main loop in parallel
    * All indices use numpy buffers
    * When `out` and `in` are stride-aligned, avoid indexing

    Args:
        fn: function mappings floats-to-floats to apply.

    Returns:
        Tensor map function.
    """

    def _map(
        out: Storage,
        out_shape: Shape,
        out_strides: Strides,
        in_storage: Storage,
        in_shape: Shape,
        in_strides: Strides,
    ) -> None:
        if np.array_equal(out_shape, in_shape) and np.array_equal(
            out_strides, in_strides
        ):
            for out_pos in prange(len(out)):
                out[out_pos] = fn(in_storage[out_pos])
            return

        for outi in prange(len(out)):
            out_index = np.zeros(
                len(out_shape), dtype=np.int32
            )  # will be automatically hoisted
            in_index = np.zeros(len(in_shape), dtype=np.int32)
            to_index(outi, out_shape, out_index)
            out_pos = index_to_position(out_index, out_strides)
            if out_pos != outi:
                print("ERROR!", out_pos, outi)
            # assert out_pos == i
            broadcast_index(out_index, out_shape, in_shape, in_index)
            in_pos = index_to_position(in_index, in_strides)
            out[out_pos] = fn(in_storage[in_pos])

    return njit(parallel=True)(_map)  # type: ignore

to_index = njit(inline="always")(to_index)
index_to_position = njit(inline="always")(index_to_position)
broadcast_index = njit(inline="always")(broadcast_index)

CUDA

grid -> blocks -> threads
a bundle of threads = a block
a bundle of blocks = a grid
The number of threads per block has a device-dependent upper limit and should preferably be a multiple of 32. Both threads and grids can be 1-3 dimensional; using more dimensions does not change the actual memory accesses or operations, it is just a convenience for the programmer. Each thread can only use a limited number of registers and a limited amount of memory; exceeding these limits shows up as a core dump under numba.

            # Instantiate and run the cuda kernel.
            threadsperblock = THREADS_PER_BLOCK
            blockspergrid = (out.size + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
            # print("map:", out.shape, out.strides, a.shape, a.strides)
            f[blockspergrid, threadsperblock](*out.tuple(), out.size, *a.tuple())  # type: ignore

There are multiple limits. All must be satisfied.

The maximum number of threads in the block is limited to 1024. This is the product of whatever your threadblock dimensions are (x*y*z). For example (32,32,1) creates a block of 1024 threads. (33,32,1) is not legal, since 33*32*1 > 1024.

The maximum x-dimension is 1024. (1024,1,1) is legal. (1025,1,1) is not legal.

The maximum y-dimension is 1024. (1,1024,1) is legal. (1,1025,1) is not legal.

The maximum z-dimension is 64. (1,1,64) is legal. (2,2,64) is also legal. (1,1,65) is not legal.

Basic pattern with numba

    def map(fn: Callable[[float], float]) -> MapProto:
        "See `tensor_ops.py`"
        f = tensor_map(cuda.jit(device=True)(fn))

        def ret(a: Tensor, out: Optional[Tensor] = None) -> Tensor:
            if out is None:
                out = a.zeros(a.shape)

            # Instantiate and run the cuda kernel.
            threadsperblock = THREADS_PER_BLOCK
            blockspergrid = (out.size + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
            # print("map:", out.shape, out.strides, a.shape, a.strides)
            f[blockspergrid, threadsperblock](*out.tuple(), out.size, *a.tuple())  # type: ignore
            return out

        return ret


def tensor_map(
    fn: Callable[[float], float]
) -> Callable[[Storage, Shape, Strides, Storage, Shape, Strides], None]:
    """
    CUDA higher-order tensor map function. ::

      fn_map = tensor_map(fn)
      fn_map(out, ... )

    Args:
        fn: function mappings floats-to-floats to apply.

    Returns:
        Tensor map function.
    """

    def _map(
        out: Storage,
        out_shape: Shape,
        out_strides: Strides,
        out_size: int,
        in_storage: Storage,
        in_shape: Shape,
        in_strides: Strides,
    ) -> None:
        out_index = cuda.local.array(MAX_DIMS, numba.int32)
        in_index = cuda.local.array(MAX_DIMS, numba.int32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        if i >= out_size:
            return
        if is_array_eq(out_shape, in_shape) and is_array_eq(out_strides, in_strides):
            out[i] = fn(in_storage[i])
            # print(i, "->", i, "value:", out[i], "named, ", fn(in_storage[i]), "=", fn_name, "(", in_storage[i], ")")
        else:
            to_index(i, out_shape, out_index)
            broadcast_index(out_index, out_shape, in_shape, in_index)
            in_pos = index_to_position(in_index, in_strides)
            out_pos = index_to_position(out_index, out_strides)
            out[out_pos] = fn(in_storage[in_pos])
            # print(i, "(", out_pos, ")", "->", in_pos, "value:", out[out_pos], "named, ", fn(in_storage[in_pos]), "=", fn_name, "(", in_storage[in_pos], ")")

    return cuda.jit()(_map)  # type: ignore

shared memory:
All threads within a block can access the shared memory; it is independent between blocks and faster to access than global memory.

  1. Each block has exactly the same size and shape.
  2. Each thread knows its position in the global grid.
  3. x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x identifies the thread's global address along the first dimension

CUDA + numba, missing features: 1. dynamic parallelism (creating and managing parallel workloads automatically without CPU intervention); 2. texture memory (memory for storing textures, optimized for 2D spatial locality, e.g. bilinear and trilinear filters)

Writing CUDA kernels: CUDA's execution model differs from the familiar sequential CPU programming model: thousands of threads run simultaneously
3 kinds of GPU memory:

  1. global device memory: the large, relatively slow off-chip memory attached to the GPU itself
  2. shared memory: on-chip shared memory
  3. local memory: private to each thread

Kernel declaration
kernel function: a GPU function meant to be called from CPU code; it has no return value and writes its results back into its arguments
device function: declared with cuda.jit(device=True); it may return a value and is called from inside the device, by kernel functions and other device functions
A function is compiled only once and can then be invoked with different grid configurations

Kernel invocation:

blockspergrid = (16,16,40)
threadsperblock = (16,16,1)
kernel_func[blockspergrid, threadsperblock](inputs_array)

How to choose the block size? It needs to be large enough for full occupation of the execution units.
Thread positioning: x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
or numba.cuda.grid()
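A minimal numba CUDA kernel showing the two equivalent ways to compute the global thread index (the element-wise add is just an illustration and needs a CUDA-capable GPU):

from numba import cuda
import numpy as np

@cuda.jit
def add_one(x):
    # Equivalent ways to get this thread's global index:
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    # i = cuda.grid(1)
    if i < x.size:          # guard: the last block may be partially filled
        x[i] += 1.0

x = np.zeros(1000)
threadsperblock = 128
blockspergrid = (x.size + threadsperblock - 1) // threadsperblock
add_one[blockspergrid, threadsperblock](x)   # numba transfers x to and from the device
print(x[:3])  # [1. 1. 1.]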

Memory Management:
Data Transfer:

  1. Numba can transfer numpy arrays to the device and then transfer device memory back to the host when a kernel finishes
  2. numba.cuda.device_array: allocate an empty device ndarray; numba.cuda.device_array_like
  3. numba.cuda.to_device: allocate and transfer a numpy ndarray or structured scalar to the device, returning a numba.cuda.DeviceNDArray
    e.g.:
    ary = numpy.arange(w)
    d_ary = cuda.to_device(ary)
    d_ary.copy_to_host(ary)
    
  4. Q: in numba.cuda.DeviceNDArray, what is the difference between is_c_contiguous and is_f_contiguous? Between the ravel and reshape methods?
  5. Pinned memory: cannot be paged out, enables direct memory access transfers, avoids page faults. numba.cuda.pinned
  6. numba.cuda.stream
  7. numba.cuda.shared.array
  8. numba.cuda.syncthreads: synchronizes all threads in the same thread block, i.e. acts as a barrier
  9. numba.cuda.local.array: private to each thread; commonly used as extra scratchpad area when there are not enough registers for local variables. It is allocated once when the kernel starts and cannot be allocated during execution.
  10. constant memory: read-only, cached, off-chip, accessible by all threads, allocated by the host rather than the device
  11. Deallocation behavior:
    1. External memory management plugins can change the deallocation behavior
    2. Deallocation of all CUDA resources is tracked per context; the underlying memory is added to the deallocation queue once the last reference is dropped
    3. Deallocation does not happen immediately: it is added to a queue, which defers the synchronization cost and also prevents a persistent deallocation error from escalating into a fatal one
  12. CUDA Python: numba does not support any numpy array-level operations, only simple element-wise ones; no try/except; print is supported but asynchronous; simple arithmetic and zip are fine, array creation is not. assert causes strange failures and should not be used (the program may core dump, or some threads may exit early even when the assert does not fire, leaving cache loads or computation only partially done)
  13. Atomic operations: numba.cuda.atomic: add, max, min, compare_and_swap (see the sketch after this list)
  14. Random numbers: numba.cuda.random
  15. The CUDA simulator can be used to debug CUDA Python, but only for simple, single-thread cases.
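A small sketch for item 13: using cuda.atomic.add to accumulate a histogram so that concurrent updates to the same bin do not race (illustrative only, needs a GPU):

from numba import cuda
import numpy as np

@cuda.jit
def histogram(values, bins):
    i = cuda.grid(1)
    if i < values.size:
        b = values[i] % bins.size
        # Plain bins[b] += 1 would be a race; atomic.add serializes the update.
        cuda.atomic.add(bins, b, 1)

values = np.random.randint(0, 16, size=4096).astype(np.int32)
bins = np.zeros(16, dtype=np.int32)
histogram[(values.size + 255) // 256, 256](values, bins)
print(bins.sum())  # 4096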

Tips for writing CUDA programs:
Let multiple threads load data into the cache (shared memory), then do the computation.
For example, for matmul A@B, load A[block_start:block_end, k_start:k_end] and B[k_start:k_end, block_start_y:block_end_y] and compute block by block; in principle this is faster than reading the inputs directly from off-chip memory.
Note: syncthreads is also required after the computation, not only when reading/writing the cache!

e.g.:

def _tensor_matrix_multiply(
    out: Storage,
    out_shape: Shape,
    out_strides: Strides,
    out_size: int,
    a_storage: Storage,
    a_shape: Shape,
    a_strides: Strides,
    b_storage: Storage,
    b_shape: Shape,
    b_strides: Strides,
) -> None:
    """
    CUDA tensor matrix multiply function.
    Requirements:
    * All data must be first moved to shared memory.
    * Only read each cell in `a` and `b` once.
    * Only write to global memory once per kernel.
    Should work for any tensor shapes that broadcast as long as ::
    ```python
    assert a_shape[-1] == b_shape[-2]
    ```
    Returns:
        None : Fills in `out`
    """
    out_size = 1
    for sz in out_shape:
        out_size *= sz
    out_bbatch_stride = out_strides[-4] if len(out_strides) >= 4 else out_size
    out_bbatch_size = out_size // out_bbatch_stride
    # Batch dimension - fixed, c[batch, i, j]
    batch = cuda.blockIdx.z
    out_dim = len(out_shape)
    a_dim = len(a_shape)
    b_dim = len(b_shape)
    BLOCK_DIM = 16
    a_shared = cuda.shared.array((BLOCK_DIM, BLOCK_DIM), numba.float64)
    b_shared = cuda.shared.array((BLOCK_DIM, BLOCK_DIM), numba.float64)
    out_index = cuda.local.array(MAX_DIMS, numba.int32)
    a_index = cuda.local.array(MAX_DIMS, numba.int32)
    b_index = cuda.local.array(MAX_DIMS, numba.int32)
    # The final position c[i, j]
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    debug = False  # (cuda.blockIdx.x, cuda.blockIdx.y, cuda.blockIdx.z) == (2,2,0)
    # The local position in the block.
    pi = cuda.threadIdx.x
    pj = cuda.threadIdx.y
    # Code Plan:
    # 1) Move across shared dimension by block dim.
    #    a) Copy into shared memory for a matrix.
    #    b) Copy into shared memory for b matrix
    #    c) Compute the dot produce for position c[i, j]
    dim_m = a_shape[-2]
    dim_n = a_shape[-1]
    dim_d = b_shape[-1]
    if i >= dim_m and j >= dim_d:
        return
    out_pos_now = batch * out_strides[-3] + i * out_strides[-2] + j
    for bbatch_i in range(out_bbatch_size):
        to_index(
            batch * out_strides[-3] + bbatch_i * out_bbatch_stride, out_shape, out_index
        )
        out_index[out_dim - 2] = i
        out_index[out_dim - 1] = j
        broadcast_index(out_index, out_shape, a_shape, a_index)
        broadcast_index(out_index, out_shape, b_shape, b_index)
        tmp = 0.0
        for base_n in range(0, dim_n, BLOCK_DIM):
            a_index[a_dim - 1] = base_n + pj
            a_c = b_c = 0
            if a_index[a_dim - 1] < dim_n and i < dim_m:
                a_c = 344
                a_pos = index_to_position(a_index, a_strides)
                a_shared[pi, pj] = a_storage[a_pos]
            b_index[b_dim - 2] = base_n + pi
            if b_index[b_dim - 2] < dim_n and j < dim_d:
                b_c = 234
                b_pos = index_to_position(b_index, b_strides)
                b_shared[pi, pj] = b_storage[b_pos]
            numba.cuda.syncthreads()
            if i < dim_m and j < dim_d:
                k_lim = min(BLOCK_DIM, dim_n - base_n)
                for k in range(k_lim):
                    tmp += a_shared[pi, k] * b_shared[k, pj]
            numba.cuda.syncthreads()  # Note:!!!!!!!!!!!!!
        if i < dim_m and j < dim_d and out_pos_now < out_size:
            out[out_pos_now] = tmp
        out_pos_now += out_bbatch_stride


def _sum_practice(out: Storage, a: Storage, size: int) -> None:
    """
    This is a practice sum kernel to prepare for reduce.
    Given an array of length $n$ and out of size $n // \text{blockDIM}$
    it should sum up each blockDim values into an out cell.
    $[a_1, a_2, ..., a_{100}]$
    |
    $[a_1 +...+ a_{31}, a_{32} + ... + a_{64}, ... ,]$
    Note: Each block must do the sum using shared memory!
    Args:
        out (Storage): storage for `out` tensor.
        a (Storage): storage for `a` tensor.
        size (int):  length of a.
    """
    BLOCK_DIM = 32
    cache = cuda.shared.array(BLOCK_DIM, numba.float64)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    pos = cuda.threadIdx.x
    if i >= size:
        return
    cache[pos] = a[i]
    numba.cuda.syncthreads()
    s = 1
    while s < BLOCK_DIM:
        if pos % (2 * s) == 0 and pos + s < BLOCK_DIM and i + s < size:
            cache[pos] += cache[pos + s]
        s *= 2
        cuda.syncthreads()
    if pos == 0:
        out[cuda.blockIdx.x] = cache[pos]

Note:

  1. Interestingly, the sum is computed by pairwise (tree) reduction rather than by direct accumulation (log depth)
  2. Bugs tend to appear once the input is larger than 2 * blockDim.x
  3. Be careful with return or continue: even if a thread does not take part in the computation, it may still take part in loading the cache, and an early continue can leave just one or two cache entries unloaded. Also, bounds must be checked when loading the cache, because in CUDA, especially with index_to_position, it is easy to load the wrong cache entry and corrupt the result
  4. Do not put assert inside CUDA functions; even if the assert never fires it can still cause other problems

Lesson 4

1-D convolution

If the sliding window runs past the edge of the input, pad with 0.
Under that assumption, output = unrolled_input.view(batch, width, in_channels * kernel_width) @ weight.view(out_channels, in_channels * kernel_width) + self.bias,
where weight.shape = (out_channels, in_channels, kernel_width), bias.shape = (1, out_channels, 1), input.shape = (batch, in_channels, width). In the simple case output.shape = (batch, out_channels, width) and unrolled_input.shape = (batch, width, in_channels, kernel_width); the matrix multiply (@) contracts the last two dimensions of unrolled_input against weight (see the sketch below).
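A numpy sketch of this unroll-then-matmul view of 1-D convolution (using sliding_window_view and einsum instead of minitorch's view/permute; it only mirrors the shapes, not tensor_conv1d itself; sliding_window_view needs numpy >= 1.20):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv1d_unrolled(inp, weight):
    # inp:    (batch, in_channels, width)
    # weight: (out_channels, in_channels, kernel_width)
    batch, in_c, width = inp.shape
    out_c, _, kw = weight.shape
    # Pad on the right so windows that slide past the edge read zeros.
    padded = np.pad(inp, ((0, 0), (0, 0), (0, kw - 1)))
    # unrolled: (batch, in_channels, width, kernel_width)
    unrolled = sliding_window_view(padded, kw, axis=2)
    # Contract in_channels and kernel_width against the weight.
    return np.einsum('bcwk,ock->bow', unrolled, weight)

inp = np.random.rand(2, 3, 10)
weight = np.random.rand(4, 3, 5)
print(conv1d_unrolled(inp, weight).shape)  # (2, 4, 10)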
Note:

  1. output's shape can be used to control how far the convolution is computed along each dimension (for example, only up to where no padding is needed)
  2. conv can be used to compute its own derivative
    
       def backward(ctx: Context, grad_output: Tensor) -> Tuple[Tensor, Tensor]:
            input, weight = ctx.saved_values
            batch, in_channels, w = input.shape
            out_channels, in_channels, kw = weight.shape
            grad_weight = grad_output.zeros((in_channels, out_channels, kw))
            new_input = input.permute(1, 0, 2)
            new_grad_output = grad_output.permute(1, 0, 2)
            tensor_conv1d(
                *grad_weight.tuple(),
                grad_weight.size,
                *new_input.tuple(),
                *new_grad_output.tuple(),
                False,
            )
            grad_weight = grad_weight.permute(1, 0, 2)
            grad_input = input.zeros((batch, in_channels, w))
            new_weight = weight.permute(1, 0, 2)
            tensor_conv1d(
                *grad_input.tuple(),
                grad_input.size,
                *grad_output.tuple(),
                *new_weight.tuple(),
                True,
            )
            return grad_input, grad_weight
    
    
  3. The last dimension is used for the kernel, not for the output dimension
  4. Looking at the PyTorch code, specifically aten/src/ATen/vulkan/ops/convolutions.cpp, tiling (unrolling) usually also requires copying the numbers one by one
  5. Note that exp is quite unstable, which makes softmax unstable too: once an output reaches about 700, exp(x) = inf, and the corresponding weights get updated to -inf, i.e. nan. Losses therefore mostly use logsoftmax, and it has to be computed carefully:
    
    
    def logsoftmax(input: Tensor, dim: int) -> Tensor:
        r"""
        Compute the log of the softmax as a tensor.
        $z_i = x_i - \log \sum_i e^{x_i}$
        See https://en.wikipedia.org/wiki/LogSumExp#log-sum-exp_trick_for_log-domain_calculations
        Args:
            input : input tensor
            dim : dimension to apply log-softmax
        Returns:
             log of softmax tensor
        """
        max_input = max(input, dim)
        mid_input = input - max_input
        mid_input = mid_input.exp()
        mid_input = mid_input.sum(dim)
        mid_input = mid_input.log()
        lse_input = max_input + mid_input
        return input - lse_input
    
    

2-D convolution

max

In practice, to avoid implementing the pooling operation as a loop, we can manipulate the shape and strides of the input tensor in order to pool it directly.

From https://minitorch.github.io/module4/pooling/

To avoid turning pooling into a loop, combine it with tile: the tile function pads if the kernel does not divide the size evenly and reshapes the last dimensions of the array into new_shape * kernel_size (see the sketch below).
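A numpy sketch of the tile idea for pooling, assuming the kernel divides the input evenly so no padding is needed (illustrative, not minitorch's tile/avgpool2d):

import numpy as np

def avgpool2d(inp, kernel):
    # inp: (batch, channel, height, width); kernel: (kh, kw) dividing evenly
    b, c, h, w = inp.shape
    kh, kw = kernel
    # "tile": expose the kernel window as its own trailing dimension
    tiled = (inp.reshape(b, c, h // kh, kh, w // kw, kw)
                .transpose(0, 1, 2, 4, 3, 5)
                .reshape(b, c, h // kh, w // kw, kh * kw))
    return tiled.mean(axis=-1)        # swap mean for max to get maxpool2d

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
print(avgpool2d(x, (2, 2))[0, 0])
# [[ 2.5  4.5]
#  [10.5 12.5]]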

Training

  1. In the basic CNN sentiment training, using maxpool2d instead of avgpool2d may still converge during the first 3 epochs, but afterwards the loss settles around 11, i.e. most outputs are -1 and get cut off by relu.
  2. The relu between linear layers is essential; at least in my experiments, training never converged without it.
  3. Making the network units too large causes core dumps.
  4. logsoftmax indeed returns negative values everywhere except (close to 0) for the dimension with the largest probability.
  5. A badly designed network structure easily leads to the loss not decreasing, or to the loss intermittently exploding until it reaches inf and turns into nan.
  6. When writing the caching part and the computation part of a CUDA kernel, think about them separately: blockspergrid partitions the data and should follow the output data; threadsperblock should follow the data actually used (think in terms of what needs to be cached).