ECE 498/598 Associative Recall Problem

Posted by OneDay3 on 2024-11-13

Original URL: https://www.cnblogs.com/comp9321/p/18543632

ECE 498/598 Fall 2024, Homeworks 3 and 4

Remarks:

HW3&4: You can reduce the context length to 32 if you are having trouble with the training time.

HW3&4: During test evaluation, note that positional encodings for unseen/long context are not trained. You are supposed to evaluate it as is. It is OK if it doesn’t work well.

HW3&4: Comments are an important component of the HW grade. You are expected to explain the experimental findings. If you don’t provide technically meaningful comments, you might receive a lower score even if your code and experiments are accurate.

The deadline for HW3 is November 11th at 11:59 PM, and the deadline for HW4 is the 18th at 11:59 PM. For each assignment, please submit both your code and a PDF report that includes your results (figures) for each question. You can generate the PDF report from a Jupyter Notebook (.ipynb file) by adding comments in markdown cells.

The objective of this assignment is to compare the transformer architecture and SSM-type architectures (specifically Mamba [1]) on the associative recall problem. We provide example code, recall.ipynb, which contains an example implementation using a 2-layer transformer. You will adapt this code to incorporate different positional encodings, use Mamba layers, or modify dataset generation.

Background: As you recall from class, associative recall (AR) assesses two abilities of the model: the ability to locate relevant information and the ability to retrieve the context around that information. The AR task can be understood via the following question: given the input prompt X = [a 1 b 2 c 3 b], we wish the model to locate where the last token b occurs earlier and output the associated value Y = 2. This is crucial for memory-related tasks or bigram retrieval (e.g. ‘Baggins’ should follow ‘Bilbo’).

To proceed, let us formally define the associative recall task we will study in the HW.

Definition 1 (Associative Recall Problem) Let Q be the set of target queries with cardinality |Q| = k. Consider a discrete input sequence X of the form X = [... q v ... q] where the query q appears exactly twice in the sequence and the value v follows the first appearance of q. We say the model f solves AR(k) if f(X) = v for all sequences X with q ∈ Q.

The induction head is a special case of the definition above where the query q is fixed (i.e. Q is a singleton). The induction head is visualized in Figure 1. At the other extreme, we can ask the model to solve AR for all queries in the vocabulary.

Problem Setting

Vocabulary: Let [K] = {1, ..., K} be the token vocabulary. Obtain the embedding of the vocabulary by randomly generating a K × d matrix V with IID N(0, 1) entries, then normalizing its rows to unit length. Here d is the embedding dimension. The embedding of the i-th token is V[i]. Use numpy.random.seed(0) to ensure reproducibility.
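The embedding construction above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the provided recall.ipynb code: the function name make_embeddings is hypothetical, and the mapping of token i (1-based in the text) to row i - 1 of a 0-based array is an assumption.

```python
import numpy as np

def make_embeddings(K: int, d: int, seed: int = 0) -> np.ndarray:
    """Return a K x d matrix whose rows are unit-norm Gaussian embeddings."""
    np.random.seed(seed)                           # seed named in the assignment text
    V = np.random.randn(K, d)                      # IID N(0, 1) entries
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # normalize each row to unit length
    return V

V = make_embeddings(K=16, d=8)
# Assumption: tokens are 1..K in the text, so token i maps to row V[i - 1] here.
```

Because the rows have unit length, the inner product between two token embeddings is their cosine similarity, which is worth keeping in mind when comparing d = 8 against d = 64 later on.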

Experimental variables: Finally, for the AR task, Q will simply be the first M elements of the vocabulary. During experiments, K, d, M are under our control. Besides these, we will also play with two other variables:

Context length: We will train these models up to context length L. However, we will evaluate with up to 3L. This is to test the generalization of the model to unseen lengths.

Delay: In the basic AR problem, the value v immediately follows q. Instead, we will introduce a delay variable where v will appear τ tokens after q. τ = 1 is the standard.

Figure 1: We will work on the associative recall (AR) problem. The AR problem requires the model to retrieve the value associated with all queries, whereas the induction head requires the same for a specific query. Thus, the latter is an easier problem. The figure above is directly taken from the Mamba paper [1]. The yellow-shaded regions highlight the focus of this homework.

Models: The motivation behind this HW is reproducing the results in the Mamba paper. However, we will also go beyond their evaluations and identify weaknesses of both transformer and Mamba architectures. Specifically, we will consider the following models in our evaluations:

Transformer: We will use the transformer architecture with 2 attention layers (no MLP). We will try the following positional encodings: (i) learned PE (provided code), (ii) Rotary PE (RoPE), (iii) NoPE (no positional encoding).

Mamba: We will use the Mamba architecture with 2 layers.
Hybrid Model: We will use an initial Mamba layer followed by an attention layer. No positional encoding is used.

[...]nd implementations (e.g. RoPE encoding or Mamba layer). As a suggestion, you can use this GitHub Repo for the Mamba model.

Generating training dataset: During training, you train with minibatch SGD (e.g. with batch size 64) until satisfactory convergence. Given (K, d, M, L, τ), you can generate the training sequences for AR as follows:

Training sequence length is equal to L.

Sample a query q ∈ Q and a value v ∈ [K] uniformly at random, independently. Recall that the size of Q is |Q| = M.

Place q at the end of the sequence and place another q at an index i chosen uniformly at random from 1 to L − τ.

Place the value token at the index i + τ.

Sample other tokens IID from [K] − {q}, i.e. other tokens are drawn uniformly at random but are not equal to q.

Set label token Y = v.

Test evaluation: The test dataset is generated in the same way as above. However, we will evaluate on all sequence lengths from τ + 1 to 3L. Note that τ + 2 is the shortest possible sequence.
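The generation steps and the length sweep described above can be sketched as follows. This is one possible reading, not the provided recall.ipynb code, and it rests on several assumptions: the names sample_ar_batch, oracle and length_sweep are hypothetical; data is drawn with numpy.random.default_rng; the first query index is restricted so that the value never lands on the final query slot (slightly narrower than the range 1 to L − τ stated above); and the sweep starts at τ + 2, the shortest sequence that fits q, v and the final q. The oracle is a reference solver standing in for a trained model.

```python
import numpy as np

def sample_ar_batch(rng, batch, n, K, M, tau):
    """Sample `batch` AR sequences of total length n with token ids in 1..K."""
    if n < tau + 2:
        raise ValueError("need n >= tau + 2 to fit q, v and the final q")
    q = rng.integers(1, M + 1, size=batch)        # query from Q = {1..M}
    v = rng.integers(1, K + 1, size=batch)        # value from [K], independent of q
    # Fillers uniform over [K] minus {q}: draw from 1..K-1, then shift past q.
    X = rng.integers(1, K, size=(batch, n))
    X += (X >= q[:, None])
    # Assumption: 0-based first-query index, chosen so that i + tau <= n - 2.
    i = rng.integers(0, n - tau - 1, size=batch)
    rows = np.arange(batch)
    X[rows, i] = q                                # first occurrence of the query
    X[rows, i + tau] = v                          # value appears tau tokens later
    X[:, -1] = q                                  # query repeated at the end
    return X, v                                   # label Y = v

def oracle(X, tau):
    """Reference solver: find the first copy of the final token, read tau ahead."""
    q = X[:, -1]
    first = np.argmax(X == q[:, None], axis=1)    # first matching index per row
    return X[np.arange(len(X)), first + tau]

def length_sweep(predict, L, K, M, tau, batch=256, seed=0):
    """Accuracy of `predict` at every sequence length from tau + 2 to 3L."""
    rng = np.random.default_rng(seed)
    acc = {}
    for n in range(tau + 2, 3 * L + 1):
        X, y = sample_ar_batch(rng, batch, n, K, M, tau)
        acc[n] = float(np.mean(predict(X) == y))
    return acc

# Induction-head setting (M = 1, tau = 1) with the oracle in place of a model.
acc = length_sweep(lambda X: oracle(X, 1), L=32, K=16, M=1, tau=1)
```

In practice, predict would wrap a trained model's forward pass followed by an argmax over the vocabulary, and the resulting dictionary maps each evaluated length to an accuracy that can be plotted directly.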

Empirical Evidence from Mamba Paper: Table 2 of [1] demonstrates that Mamba can do a good job on the induction head problem, i.e. AR with a single query. Additionally, Mamba is the only model that exhibits length generalization; that is, even if you train it up to context length L, it can still solve AR for context lengths beyond L. On the other hand, since Mamba is inherently a recurrent model, it may not solve the AR problem in its full generality. This motivates the question: What are the tradeoffs between Mamba and transformer, and can hybrid models help improve performance over both?

Your assignments are as follows. For each problem, make sure to return the associated code. The code can be in separate cells (clearly commented) in a single Jupyter/Python file.

Grading structure:

Problem 1 will count as your HW3 grade. This only involves Induction Head experiments (i.e. M = 1).

Problems 2 and 3 will count as your HW4 grade.
You will make a single submission.

Problem 1 (50=25+15+10pts). Set K = 16, d = 8, L = 32 or L = 64.

Train all models on the induction heads problem (M = 1, τ = 1). After training, evaluate the test performance and plot the accuracy of all models as a function of the context length (similar to Table 2 of [1]). In total, you will be plotting 5 curves (3 Transformers, 1 Mamba, 1 Hybrid). Comment on the findings and compare the performance of the models, including length generalization ability.

Repeat the experiment above with delay τ = 5. Comment on the impact of delay.

Which models converge faster during training? Provide a plot of the convergence rate where the x-axis is the number of iterations and the y-axis is the AR accuracy over a test batch. Make sure to specify the batch size you are using (ideally use 32 or 64).

Problem 2 (30pts). Set K = 16, d = 8, L = 32 or L = 64. We will train Mamba, Transformer with RoPE, and Hybrid. Set τ = 1 (standard AR).

Train Mamba models for M = 4, 8, 16. Note that M = 16 is the full AR (retrieve any query). Comment on the results.

Train Transformer models for M = 4, 8, 16. Comment on the results and compare them against Mamba’s behavior.

Train the Hybrid model for M = 4, 8, 16. Comment and compare.

Problem 3 (20=15+5pts). Set K = 16, d = 64, L = 32 or L = 64. We will only train Mamba models.

Set τ = 1 (standard AR). Train Mamba models for M = 4, 8, 16. Compare against the corresponding results of Problem 2. How does the embedding dimension d impact the results?

Train a Mamba model for M = 16 with τ = 10. Comment on any differences.
