Lecture 2
-
Supervised learning: regression, classification, ...
-
Unsupervised learning: clustering, dimensionality reduction, ...
-
The canonical machine learning problem: Given a set of training data \(\{(x_i,y_i)\}_{i=1}^m\) and a loss function \(\ell\), find the parameters \(\theta\) that minimize the sum of losses \(f(\theta)=\sum\limits_{i=1}^m\ell(h_{\theta}(x_i),y_i)\).
-
Linear least squares problem: linear hypothesis function \(h_{\theta}(x)=\theta^Tx\), and \(\ell(h_{\theta}(x),y)=(h_{\theta}(x)-y)^2\).
Solutions for linear least squares problem:
- Gradient descent: Repeat \(\theta\leftarrow\theta-\alpha\nabla_{\theta}f(\theta)\).
- Analytical method: Let \(X=\begin{bmatrix}x_1^T\\x_2^T\\\vdots\\x_m^T\end{bmatrix},y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_m\end{bmatrix}\), then \(f(\theta)=\lVert X\theta-y\rVert^2\), \(\nabla_{\theta}f=2X^T(X\theta-y)\), solve the equation \(2X^T(X\theta-y)=0\), we get \(\theta=(X^TX)^{-1}X^Ty\).
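A minimal sketch of both methods in NumPy (the synthetic data, step size \(\alpha\), and iteration count are assumptions, not from the lecture):

```python
import numpy as np

# Synthetic data (assumption: 100 examples, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Analytical solution: theta = (X^T X)^{-1} X^T y
# (solving the linear system is preferred over forming the explicit inverse).
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on f(theta) = ||X theta - y||^2 with gradient 2 X^T (X theta - y).
theta = np.zeros(3)
alpha = 1e-3  # step size (assumed)
for _ in range(5000):
    theta -= alpha * 2 * X.T @ (X @ theta - y)

print(theta_exact, theta)  # both should be close to theta_true
```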
-
Linear regression: the hypothesis function used to fit the training data is linear in the parameters, given a particular choice of input features.
-
Linear classification (for \(2\) classes): the hypothesis function is linear in the parameters, given a particular choice of input features, and the predicted class is the sign of the hypothesis function (\(\hat{y}=\text{sign}(h_{\theta}(x))\)).
-
Loss functions for classification:
- \(\ell_{0/1}:\ell(h_{\theta}(x),y)=[\text{sign}(h_{\theta}(x))\neq y]\) (NP-hard to minimize exactly)
- \(\ell_{\text{logistic}}:\ell(h_{\theta}(x),y)=\log(1+\exp(-y·h_{\theta}(x)))\)
- \(\ell_{\text{hinge}}:\ell(h_{\theta}(x),y)=\max(1-y·h_{\theta}(x),0)\)
- \(\ell_{\text{exp}}:\ell(h_{\theta}(x),y)=\exp(-y·h_{\theta}(x))\).
Typically there is no closed-form solution; these losses are minimized by gradient descent.
-
Support vector machine: solves the canonical machine learning optimization problem using hinge loss and linear hypothesis, i.e. \(f(\theta)=\sum\limits_{i=1}^m\max(1-y_i·\theta^Tx_i,0)\).
-
Logistic regression: solves the canonical machine learning optimization problem using logistic loss and linear hypothesis, i.e. \(f(\theta)=\sum\limits_{i=1}^m\log(1+\exp(-y_i·\theta^Tx_i))\).
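Since there is no closed-form minimizer, a hedged gradient-descent sketch for logistic regression (labels in \(\{-1,+1\}\); the step size and iteration count are assumptions):

```python
import numpy as np

def logistic_regression(X, y, alpha=0.1, iters=1000):
    """Minimize sum_i log(1 + exp(-y_i * theta^T x_i)) by gradient descent.

    X: (m, n) feature matrix; y: (m,) labels in {-1, +1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        margins = y * (X @ theta)                     # y_i * theta^T x_i
        # d/dtheta of log(1 + exp(-z_i)) is -y_i x_i / (1 + exp(z_i))
        grad = -(X.T @ (y / (1.0 + np.exp(margins))))
        theta -= (alpha / m) * grad
    return theta
```

The SVM objective above can be handled the same way by swapping in a (sub)gradient of the hinge loss.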
-
Multiclass classification: Build \(k\) different classifiers \(h_{\theta_i}\) and output the prediction \(\operatorname{argmax}_ih_{\theta_i}(x)\). The loss function is defined as \(\ell(h_{\theta}(x),y)=-\log \dfrac{e^{h_{\theta_y}(x)}}{\sum_{i=1}^ke^{h_{\theta_i}(x)}}=\log \sum_{i=1}^ke^{h_{\theta_i}(x)}-h_{\theta_y}(x)\) (called the softmax loss or cross-entropy loss).
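A small sketch of this loss for one example (the max-shift is a standard numerical-stability trick, not part of the lecture formula):

```python
import numpy as np

def softmax_loss(scores, y):
    """Cross-entropy loss log(sum_i exp(s_i)) - s_y for one example.

    scores: (k,) array of class scores h_{theta_i}(x); y: true class index.
    """
    shifted = scores - scores.max()          # shifting by a constant leaves the loss unchanged
    return np.log(np.exp(shifted).sum()) - shifted[y]

print(softmax_loss(np.array([2.0, 1.0, -1.0]), y=0))  # small loss: class 0 has the largest score
```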
-
Overfitting: As the model becomes more complex, training loss always decreases, generalization loss decreases to a point then starts to increase.
-
Cross-validation: Divide the data set into a training set and a holdout set; use the training set to determine the parameters and the holdout/validation set to determine the hyperparameters (degree of polynomial, \(\lambda\) in regularization, ...).
-
Regularization: add a term \(\dfrac{\lambda}{2}\lVert\theta\rVert_2^2\) to the loss function \(f(\theta)\), where \(\lambda\) is a hyperparameter (when the model becomes complex, the parameters tend to be large in order to overfit the training data).
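For the least-squares case the regularized objective \(\lVert X\theta-y\rVert^2+\frac{\lambda}{2}\lVert\theta\rVert_2^2\) still has a closed form; a minimal sketch (the \(\lambda\) value is an assumption):

```python
import numpy as np

def ridge_solution(X, y, lam=1.0):
    """Minimize ||X theta - y||^2 + (lam/2) ||theta||^2 in closed form.

    Setting the gradient 2 X^T (X theta - y) + lam * theta to zero gives
    theta = (X^T X + (lam/2) I)^{-1} X^T y.
    """
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + (lam / 2.0) * np.eye(n), X.T @ y)
```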
-
Even though we used a training/holdout split to fit the parameters, we are still effectively fitting the hyperparameters to the holdout set, so use a separate test set to evaluate performance. The best solutions are: evaluate your system “in the wild” as often as possible, recollect data if you suspect overfitting to the present data, ...
Lecture 3
- Neural network: a composition of non-linear functions.
- Elements of a neural network: weights \(W\), biases \(b\), activation functions \(f\) (\(z_i=f_i(W_iz_{i-1}+b_i)\)).
- Common activation functions:
- Sigmoid: \(f(x)=\dfrac{1}{1+e^{-x}}\).
- Rectified linear unit (ReLU): \(f(x)=\max(x,0)\).
- Hyperbolic tangent: \(f(x)=\tanh(x)=\dfrac{e^{2x}-1}{e^{2x}+1}\).
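For reference, the three activations in a few lines of NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def tanh(x):                     # equivalent to np.tanh
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)
```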
- Stochastic Gradient Descent (SGD): adjust parameters based upon just one random sample (or a small random collection of samples, called batch), i.e. \(\theta\gets \theta-\alpha\nabla_{\theta}\ell(h_\theta(x_i),y_i)\) for some random \(i\).
- Backpropagation algorithm: Use the chain rule to recursively calculate the partial derivative of the loss function to every parameter, from the last layer to the first.
- Momentum: \(v_t=\beta v_{t-1}+(1-\beta)\nabla_{\theta}f(\theta)\) and \(\theta\gets \theta-\eta v_t\). Usually, \(\beta=0.9\).
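A hedged sketch of an SGD-with-momentum training loop; the gradient function, data format, and hyperparameters are placeholders, not the lecture's code (the lecture writes the learning rate as \(\eta\); here it is `alpha`):

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, data, alpha=0.01, beta=0.9,
                 epochs=10, batch_size=32, seed=0):
    """Generic SGD with momentum.

    grad_fn(theta, batch) must return the gradient of the loss on `batch`
    with respect to theta; `data` is a sequence of training examples.
    """
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    v = np.zeros_like(theta)
    m = len(data)
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            g = grad_fn(theta, batch)
            v = beta * v + (1 - beta) * g      # exponential moving average of gradients
            theta -= alpha * v                 # momentum step
    return theta
```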
Lecture 4
-
Problems with fully connected networks: they need a very large number of parameters and are very likely to overfit the data; a generic deep network also does not capture the “natural” invariances we expect in images (translation, scale).
-
Convolutional Neural Networks has 4 types of layers:
- Convolution: require that activations between layers occur only in a “local” manner; require that all activations share the same weights.
- Non-linearity: Rectified Linear Unit (ReLU). Advantages: 1. fast to compute; 2. no cancellation problem; 3. sparser activation volume; 4. mitigates the vanishing gradient problem.
- Pooling (or downsampling) (a max-pooling sketch follows this list):
- Pick a window size.
- Pick a stride.
- Walk the window across the image.
- For each window, take the maximum value.
- Fully Connected Layer
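A minimal max-pooling sketch for a single-channel activation map (the window/stride defaults are assumptions):

```python
import numpy as np

def max_pool2d(image, window=2, stride=2):
    """Walk a window x window region across `image` with the given stride
    and keep the maximum value in each window."""
    h, w = image.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + window,
                          j * stride:j * stride + window]
            out[i, j] = patch.max()
    return out

print(max_pool2d(np.arange(16).reshape(4, 4)))  # -> [[ 5.  7.] [13. 15.]]
```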
-
Recurrent Neural Network (RNN): a type of neural network with memory; the outputs of the hidden layer are stored in the memory, and the memory can be treated as another input. Used to deal with sequential data.
-
Problems with Vanilla RNNs:
- exploding gradient \(\to\) gradient clipping
- vanishing gradient \(\to\) change RNN architecture \(\to\) LSTM
Lecture 5
-
A search problem consists of:
- State space \(S\)
- Start state \(s_{start}\in S\)
- Possible actions \(\text{Actions}(s)\)
- Successor: \(\text{Succ}(s,a)\)
- Action cost \(\text{Cost}(s,a)\geq 0\)
- Goal test \(\text{IsEnd}(s)\).
A solution is a sequence of actions (a plan) which transforms the start state to a goal state.
-
Search heuristic: A function \(h(x)\) that estimates how close a state is to a goal.
-
Uniform cost search: \(h(x)=0\). Informed search: can tell whether one non-goal state is more promising than another.
-
Admissible heuristic: \(h(x)\leq \text{the true cost to a nearest goal}\). If \(h\) is admissible, A* is optimal (finds a min-cost solution) and also optimally efficient (it expands the minimum number of nodes: any algorithm that expands fewer nodes than A* cannot guarantee that its solution is optimal).
-
A* tree search: Expand the available nodes in increasing order of \(g(x)+h(x)\), where \(g(x)\) is the cost of the path from the start state to \(x\). The algorithm is optimal if the heuristic is admissible, but the time complexity of A* tree search can be exponential.
-
Consistent heuristic: \(h(x)-h(y)\leq \text{cost}(x\to y)\) for every successor \(y\) of \(x\). If \(h\) is consistent, then the first time a state is expanded we have already found a shortest path to it, so every state needs to be expanded at most once.
-
A* graph search: Similar to A* tree search, but guarantees that each node is expanded once. The algorithm is optimal if the heuristic is consistent.
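A hedged sketch of A* graph search with a priority queue; the problem interface (`successors`, `is_goal`, `heuristic`) is an assumption about how the search problem might be represented:

```python
import heapq
import itertools

def a_star_graph_search(start, is_goal, successors, heuristic):
    """A* graph search.

    successors(s) yields (action, next_state, cost) triples;
    heuristic(s) estimates the remaining cost (should be consistent for the
    expanded-once / optimality guarantees). Returns (total_cost, plan) or None.
    """
    counter = itertools.count()          # tie-breaker so heapq never compares states
    frontier = [(heuristic(start), next(counter), 0.0, start, [])]
    expanded = set()
    while frontier:
        _, _, g, state, plan = heapq.heappop(frontier)
        if state in expanded:
            continue                     # each state is expanded at most once
        expanded.add(state)
        if is_goal(state):
            return g, plan
        for action, nxt, cost in successors(state):
            if nxt not in expanded:
                heapq.heappush(frontier, (g + cost + heuristic(nxt),
                                          next(counter), g + cost, nxt, plan + [action]))
    return None
```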
Lecture 6
-
A Markov Decision Process (MDP) consists of: state space \(S\), start state, actions \(a\), transition function \(T(s,a,s')=P(s'|s,a)\), reward function \(R(s)\) (sometimes \(R(s,a)\) or \(R(s,a,s')\)), and possibly a terminal state.
-
A solution to an MDP is a policy:
- Non-stationary policy: function from states and times to actions.
- Stationary policy: mapping from states to actions.
- Both non-stationary and stationary policies satisfy the following properties:
- Full observability of the state
- History-independence
- Deterministic action choice
-
Infinite Horizon Discounted Value: \(V^{\pi}(s)=E\left[\sum_{t=0}^{\infty}\gamma^tR(s_t)\right]\). It follows Bellman Equation: \(V^{\pi}(s)=R(s)+\gamma\sum_{s'\in S}P(s'|s,\pi(s))V^{\pi}(s')\).
-
Value iteration: repeatedly apply the Bellman backup operator \(B:\mathbb R^{|S|}\to\mathbb R^{|S|}\), \((BV)(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V(s')\). The operator is a contraction: for any \(V_1,V_2\),
\[\begin{aligned} \lVert BV_1-BV_2\rVert_\infty&=\gamma\max_{s\in S}\left|\left(\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V_1(s')\right)-\left(\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V_2(s')\right)\right|\\ &\leq\gamma\max_{s\in S}\max_{a\in A}\left|\sum_{s'\in S}P(s'|s,a)\left(V_1(s')-V_2(s')\right)\right|\\ &\leq\gamma\max_{s\in S}\max_{a\in A}\sum_{s'\in S}P(s'|s,a)\left|V_1(s')-V_2(s')\right|\\ &\leq\gamma\max_{s\in S}\max_{a\in A}\sum_{s'\in S}P(s'|s,a)\max_{s''\in S}\left|V_1(s'')-V_2(s'')\right|\\ &=\gamma\lVert V_1-V_2\rVert_\infty \end{aligned}\]
So there is a unique fixed point \(V^*\) with \(V^*=BV^*\), i.e. \(V^*(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V^*(s')\). The fixed point is optimal: for any policy \(\pi\), \((BV^{\pi})(s)\geq V^{\pi}(s)\) because the max over actions is at least the value of \(\pi\)'s own action, so (using that \(B\) is monotone) \(B^kV^{\pi}\geq V^{\pi}\) for all \(k\), and letting \(k\to\infty\) gives \(V^*\geq V^{\pi}\).
-
The convergence rate of value iteration: Assume rewards in \([0,R_{\max}]\); then \(V^*(s)\le\sum\limits_{t=0}^{\infty}\gamma^tR_{\max}=\dfrac{R_{\max}}{1-\gamma}\), so we can prove by induction that \(\max\limits_{s\in S}|V^k(s)-V^*(s)|\le\dfrac{\gamma^k R_{\max}}{1-\gamma}\). In other words, we have linear convergence to the optimal value function.
-
Stopping Condition: Since \(\lVert V^k-V^{k-1}\rVert_{\infty}\le\epsilon\Rightarrow\lVert V^k-V^*\rVert_{\infty}\le\epsilon\dfrac{\gamma}{1-\gamma}\), we can continue iteration until \(\lVert V^k-V^{k-1}\rVert_{\infty}\le\epsilon\) for some small constant \(\epsilon\).
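A hedged sketch of value iteration with this stopping condition, assuming the MDP is given as arrays `R` (rewards) and `P` (transition probabilities indexed `[s, a, s']`):

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, eps=1e-6):
    """Repeat V <- BV until ||V^k - V^{k-1}||_inf <= eps.

    R: (|S|,) rewards; P: (|S|, |A|, |S|) transition probabilities.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R[:, None] + gamma * (P @ V)   # Q[s, a] = R(s) + gamma * sum_s' P(s'|s,a) V(s')
        V_new = Q.max(axis=1)              # Bellman backup (BV)(s)
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new, Q.argmax(axis=1)
        V = V_new
```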
-
Policy iteration: \(\pi^{k+1}(s)\gets \operatorname{argmax}_a\sum_{s'\in S}P(s'|s,a)V^{\pi^k}(s')\), where \(V^{\pi^k}\) can be computed by solving a linear system.
-
Modified Policy Iteration: using Bellman update with repeating \(k\) times for policy evaluation (instead of solving a linear system).
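A hedged sketch of policy iteration with exact evaluation via the linear system \((I-\gamma P_\pi)V^\pi=R\) (same array conventions as the value-iteration sketch above; modified policy iteration would replace the solve with \(k\) Bellman updates):

```python
import numpy as np

def policy_iteration(R, P, gamma=0.9):
    """R: (|S|,) rewards; P: (|S|, |A|, |S|) transition probabilities."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: V^pi = R + gamma * P_pi V^pi  =>  (I - gamma * P_pi) V^pi = R
        P_pi = P[np.arange(n_states), pi]                 # (|S|, |S|) transitions under pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: greedy with respect to V^pi
        pi_new = (P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```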
Lecture 7
-
Reinforcement Learning: the transition function and the reward function of the MDP are unknown.
-
Model-based approach to RL: learn the MDP model (or an approximation of it) and use it for policy evaluation or to find the optimal policy.
Model-free approach to RL: derive the optimal policy without explicitly learning the model.
-
Passive learning: The agent has a fixed policy and tries to learn the utilities of states by observing the world go by.
- Approach 1 (model-free): Monte-Carlo direct estimation: Directly estimate \(V^{\pi}(s)\) as average total reward of episodes containing \(s\).
- Approach 2 (model-based): Adaptive Dynamic Programming (ADP): Estimate \(P(s'|s,a)\) by sampling \(s'\), then use estimated model to compute utility of policy.
- Approach 3 (model-free): Temporal Difference Learning (TD). After we take an action from \(s\) to \(s'\), \(V^\pi(s)\gets V^\pi(s)+\alpha(R(s)+\gamma V^{\pi}(s')-V^\pi(s))\). Intuition: \(\frac{\sum_{i=1}^{n+1}x_i}{n+1}=\frac{\sum_{i=1}^nx_i}{n}+\frac{1}{n+1}\left(x_{n+1}-\frac{\sum_{i=1}^nx_i}{n}\right)\). If we use decreasing learning rate (e.g. \(\dfrac{1}{n}\)) the estimation will converge to true value.
-
Active learning: The agent no longer follows a fixed policy; it must choose its own actions, trading off exploration and exploitation.
- ADP-based RL: Start with an initial model, solve for the optimal policy given the current model (using value or policy iteration), take an action according to an exploration/exploitation policy, and update the estimated model based on an observed transition.
- TD-based RL: Start with initial value function, take action from an exploration/exploitation policy giving new state \(s'\) (should converge to greedy policy), update estimated model, perform TD update \(V(s)\gets V(s)+\alpha(R(s)+\gamma V(s')-V(s))\).
- Q-learning (\(Q(s, a)\) is the expected value of taking action \(a\) in state \(s\) and then following the optimal policy thereafter): Start with an initial Q-function, take action from an exploration/exploitation policy giving new state \(s'\), perform the TD update \(Q(s,a)\gets Q(s,a)+\alpha(R(s)+\gamma\max_{a'}Q(s',a')-Q(s,a))\).
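A hedged sketch of tabular Q-learning with \(\epsilon\)-greedy exploration; the environment interface (`env.reset()` returning a state, `env.step(a)` returning `(s_next, reward, done)`) and all hyperparameters are assumptions:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy exploration policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Exploration/exploitation: random action with probability eps, else greedy.
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])         # TD update
            s = s_next
    return Q
```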
-
Exploration policy: we would like an exploration policy that is greedy in the limit of infinite exploration (GLIE).
-
GLIE Policy 1: choose between acting randomly and acting greedily with a probability that depends on \(t\) (e.g. act randomly with probability \(1/t\), greedily otherwise).
-
GLIE Policy 2: Boltzmann Exploration. \(\text{Pr}(a|s)=\frac{\exp(Q(s,a)/T)}{\sum_{a'\in A}\exp(Q(s,a')/T)}\). \(T\) is the temperature. Large \(T\) means that each action has about the same probability. Small \(T\) leads to more greedy behavior. Typically start with large \(T\) and decrease with time.
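A small sketch of sampling one action with Boltzmann exploration (the shift by the maximum Q-value is a numerical-stability trick that leaves the probabilities unchanged):

```python
import numpy as np

def boltzmann_action(Q_s, T, rng=None):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = Q_s / T
    logits = logits - logits.max()           # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(Q_s), p=probs)
```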
-