Lecture 2
-
Supervised learning: regression, classification, ...
-
Unsupervised learning: clustering, dimensionality reduction, ...
-
The canonical machine learning problem: Given a set of training data \(\{(x_i,y_i)\}_{i=1}^m\) and a loss function \(\ell\), find the parameters \(\theta\) that minimize the sum of losses \(f(\theta)=\sum\limits_{i=1}^m\ell(h_{\theta}(x_i),y_i)\).
-
Linear least squares problem: linear hypothesis function \(h_{\theta}(x)=\theta^Tx\), and \(\ell(h_{\theta}(x),y)=(h_{\theta}(x)-y)^2\).
Solutions for linear least squares problem:
- Gradient descent: Repeat \(\theta\leftarrow\theta-\alpha\nabla_{\theta}f(\theta)\).
- Analytical method: Let \(X=\begin{bmatrix}x_1^T\\x_2^T\\\vdots\\x_m^T\end{bmatrix},y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_m\end{bmatrix}\), then \(f(\theta)=\lVert X\theta-y\rVert^2\), \(\nabla_{\theta}f=2X^T(X\theta-y)\), solve the equation \(2X^T(X\theta-y)=0\), we get \(\theta=(X^TX)^{-1}X^Ty\).
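A minimal sketch of both methods in NumPy (the synthetic data, step size \(\alpha\), and iteration count are assumptions, not from the lecture):

```python
import numpy as np

# Synthetic data (assumption: 100 examples, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Analytical solution: theta = (X^T X)^{-1} X^T y
# (solving the linear system is preferred over forming the explicit inverse).
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on f(theta) = ||X theta - y||^2 with gradient 2 X^T (X theta - y).
theta = np.zeros(3)
alpha = 1e-3  # step size (assumed)
for _ in range(5000):
    theta -= alpha * 2 * X.T @ (X @ theta - y)

print(theta_exact, theta)  # both should be close to theta_true
```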
-
Linear regression: the hypothesis function used to fit the training data is linear in the parameters, given a particular choice of input features.
-
Linear classification (for \(2\) classes): the hypothesis function is linear in the parameters, given a particular choice of input features, and the predicted class is the sign of the hypothesis function (\(\hat{y}=\text{sign}(h_{\theta}(x))\)).
-
Loss functions for classification:
- \(\ell_{0/1}:\ell(h_{\theta}(x),y)=[\text{sign}(h_{\theta}(x))\neq y]\) (NP-hard to minimize exactly)
- \(\ell_{\text{logistic}}:\ell(h_{\theta}(x),y)=\log(1+\exp(-y·h_{\theta}(x)))\)
- \(\ell_{\text{hinge}}:\ell(h_{\theta}(x),y)=\max(1-y·h_{\theta}(x),0)\)
- \(\ell_{\text{exp}}:\ell(h_{\theta}(x),y)=\exp(-y·h_{\theta}(x))\).
Typically there is no closed-form solution; these losses are minimized by gradient descent.
-
Support vector machine: solves the canonical machine learning optimization problem using hinge loss and linear hypothesis, i.e. \(f(\theta)=\sum\limits_{i=1}^m\max(1-y_i·\theta^Tx_i,0)\).
-
Logistic regression: solves the canonical machine learning optimization problem using logistic loss and linear hypothesis, i.e. \(f(\theta)=\sum\limits_{i=1}^m\log(1+\exp(-y_i·\theta^Tx_i))\).
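Since there is no closed-form minimizer, a hedged gradient-descent sketch for logistic regression (labels in \(\{-1,+1\}\); the step size and iteration count are assumptions):

```python
import numpy as np

def logistic_regression(X, y, alpha=0.1, iters=1000):
    """Minimize sum_i log(1 + exp(-y_i * theta^T x_i)) by gradient descent.

    X: (m, n) feature matrix; y: (m,) labels in {-1, +1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        margins = y * (X @ theta)                     # y_i * theta^T x_i
        # d/dtheta of log(1 + exp(-z_i)) is -y_i x_i / (1 + exp(z_i))
        grad = -(X.T @ (y / (1.0 + np.exp(margins))))
        theta -= (alpha / m) * grad
    return theta
```

The SVM objective above can be handled the same way by swapping in a (sub)gradient of the hinge loss.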
-
Multiclass classification: Build \(k\) different classifiers \(h_{\theta_i}\) and output the prediction \(\operatorname{argmax}_ih_{\theta_i}(x)\). The loss function is defined as \(\ell(h_{\theta}(x),y)=-\log \dfrac{e^{h_{\theta_y}(x)}}{\sum_{i=1}^ke^{h_{\theta_i}(x)}}=\log \sum_{i=1}^ke^{h_{\theta_i}(x)}-h_{\theta_y}(x)\) (called the softmax loss or cross-entropy loss).
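A small sketch of this loss for one example (the max-shift is a standard numerical-stability trick, not part of the lecture formula):

```python
import numpy as np

def softmax_loss(scores, y):
    """Cross-entropy loss log(sum_i exp(s_i)) - s_y for one example.

    scores: (k,) array of class scores h_{theta_i}(x); y: true class index.
    """
    shifted = scores - scores.max()          # shifting by a constant leaves the loss unchanged
    return np.log(np.exp(shifted).sum()) - shifted[y]

print(softmax_loss(np.array([2.0, 1.0, -1.0]), y=0))  # small loss: class 0 has the largest score
```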
-
Overfitting: As the model becomes more complex, training loss always decreases, generalization loss decreases to a point then starts to increase.
-
Cross-validation: Divide the data set into a training set and a holdout set; use the training set to determine the parameters and the holdout/validation set to determine the hyperparameters (degree of polynomial, \(\lambda\) in regularization, ...).
-
Regularization: add a term \(\dfrac{\lambda}{2}\lVert\theta\rVert_2^2\) to the loss function \(f(\theta)\), where \(\lambda\) is a hyperparameter (when the model becomes complex, the parameters tend to be large in order to overfit the training data).
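For the least-squares case the regularized objective \(\lVert X\theta-y\rVert^2+\frac{\lambda}{2}\lVert\theta\rVert_2^2\) still has a closed form; a minimal sketch (the \(\lambda\) value is an assumption):

```python
import numpy as np

def ridge_solution(X, y, lam=1.0):
    """Minimize ||X theta - y||^2 + (lam/2) ||theta||^2 in closed form.

    Setting the gradient 2 X^T (X theta - y) + lam * theta to zero gives
    theta = (X^T X + (lam/2) I)^{-1} X^T y.
    """
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + (lam / 2.0) * np.eye(n), X.T @ y)
```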
-
Even though we used a training/holdout split to fit the parameters, we are still effectively fitting the hyperparameters to the holdout set, so use a separate test set to evaluate performance. The best solutions are: evaluate your system “in the wild” as often as possible, recollect data if you suspect overfitting to the present data, ...
Lecture 3
- Neural network: a composition of non-linear functions.
- Elements of a neural network: weights \(W\), biases \(b\), activation functions \(f\) (\(z_i=f_i(W_iz_{i-1}+b_i)\)).
- Common activation functions:
- Sigmoid: \(f(x)=\dfrac{1}{1+e^{-x}}\).
- Rectified linear unit (ReLU): \(f(x)=\max(x,0)\).
- Hyperbolic tangent: \(f(x)=\tanh(x)=\dfrac{e^{2x}-1}{e^{2x}+1}\).
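For reference, the three activations in a few lines of NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def tanh(x):                     # equivalent to np.tanh
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)
```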
- Stochastic Gradient Descent (SGD): adjust parameters based upon just one random sample (or a small random collection of samples, called batch), i.e. \(\theta\gets \theta-\alpha\nabla_{\theta}\ell(h_\theta(x_i),y_i)\) for some random \(i\).
- Backpropagation algorithm: Use the chain rule to recursively calculate the partial derivative of the loss function to every parameter, from the last layer to the first.
- Momentum: \(v_t=\beta v_{t-1}+(1-\beta)\nabla_{\theta}f(\theta)\) and \(\theta\gets \theta-\eta v_t\). Usually, \(\beta=0.9\).
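A hedged sketch of an SGD-with-momentum training loop; the gradient function, data format, and hyperparameters are placeholders, not the lecture's code (the lecture writes the learning rate as \(\eta\); here it is `alpha`):

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, data, alpha=0.01, beta=0.9,
                 epochs=10, batch_size=32, seed=0):
    """Generic SGD with momentum.

    grad_fn(theta, batch) must return the gradient of the loss on `batch`
    with respect to theta; `data` is a sequence of training examples.
    """
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    v = np.zeros_like(theta)
    m = len(data)
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            g = grad_fn(theta, batch)
            v = beta * v + (1 - beta) * g      # exponential moving average of gradients
            theta -= alpha * v                 # momentum step
    return theta
```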
Lecture 4
-
Problems with fully connected networks: they need a very large number of parameters and are very likely to overfit the data; a generic deep network also does not capture the “natural” invariances we expect in images (translation, scale).
-
Convolutional Neural Networks has 4 types of layers:
- Convolution: require that activations between layers occur only in a “local” manner; require that all activations share the same weights.
- Non-linearity: Rectified Linear Unit (ReLU). Advantages: 1. fast to compute; 2. no cancellation problem; 3. sparser activation volume; 4. mitigates the vanishing gradient problem.
- Pooling (or downsampling) (a max-pooling sketch follows this list):
- Pick a window size.
- Pick a stride.
- Walk the window across the image.
- For each window, take the maximum value.
- Fully Connected Layer
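A minimal max-pooling sketch for a single-channel activation map (the window/stride defaults are assumptions):

```python
import numpy as np

def max_pool2d(image, window=2, stride=2):
    """Walk a window x window region across `image` with the given stride
    and keep the maximum value in each window."""
    h, w = image.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + window,
                          j * stride:j * stride + window]
            out[i, j] = patch.max()
    return out

print(max_pool2d(np.arange(16).reshape(4, 4)))  # -> [[ 5.  7.] [13. 15.]]
```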
-
Recurrent Neural Network (RNN): a type of neural network with memory; the outputs of the hidden layer are stored in the memory, and the memory can be treated as another input. Used to deal with sequential data.
-
Problems with Vanilla RNNs:
- exploding gradient \(\to\) gradient clipping
- vanishing gradient \(\to\) change RNN architecture \(\to\) LSTM
Lecture 5
-
A search problem consists of:
- State space \(S\)
- Start state \(s_{start}\in S\)
- Possible actions \(\text{Actions}(s)\)
- Successor: \(\text{Succ}(s,a)\)
- Action cost \(\text{Cost}(s,a)\geq 0\)
- Goal test \(\text{IsEnd}(s)\).
A solution is a sequence of actions (a plan) which transforms the start state to a goal state.
-
Search heuristic: A function \(h(x)\) that estimates how close a state is to a goal.
-
Uniform cost search: \(h(x)=0\). Informed search: can tell whether one non-goal state is more promising than another.
-
Admissible heuristic: \(h(x)\leq \text{the true cost to a nearest goal}\). If \(h\) is admissible, A* is optimal (finds a min-cost solution) and also optimally efficient (it expands the minimum number of nodes: any algorithm that expands fewer nodes than A* cannot guarantee that its solution is optimal).
-
A* tree search: Expand the available nodes in increasing order of \(g(x)+h(x)\), where \(g(x)\) is the cost of the path from the start state to \(x\). The algorithm is optimal if the heuristic is admissible, but the time complexity of A* tree search can be exponential.
-
Consistent heuristic: \(h(x)-h(y)\leq \text{cost}(x\to y)\) for every successor \(y\) of \(x\). If \(h\) is consistent, then the first time a state is expanded we have already found a shortest path to it, so every state needs to be expanded at most once.
-
A* graph search: Similar to A* tree search, but guarantees that each node is expanded once. The algorithm is optimal if the heuristic is consistent.
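A hedged sketch of A* graph search with a priority queue; the problem interface (`successors`, `is_goal`, `heuristic`) is an assumption about how the search problem might be represented:

```python
import heapq
import itertools

def a_star_graph_search(start, is_goal, successors, heuristic):
    """A* graph search.

    successors(s) yields (action, next_state, cost) triples;
    heuristic(s) estimates the remaining cost (should be consistent for the
    expanded-once / optimality guarantees). Returns (total_cost, plan) or None.
    """
    counter = itertools.count()          # tie-breaker so heapq never compares states
    frontier = [(heuristic(start), next(counter), 0.0, start, [])]
    expanded = set()
    while frontier:
        _, _, g, state, plan = heapq.heappop(frontier)
        if state in expanded:
            continue                     # each state is expanded at most once
        expanded.add(state)
        if is_goal(state):
            return g, plan
        for action, nxt, cost in successors(state):
            if nxt not in expanded:
                heapq.heappush(frontier, (g + cost + heuristic(nxt),
                                          next(counter), g + cost, nxt, plan + [action]))
    return None
```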
Lecture 6
-
A Markov Decision Process (MDP) consists of: state space \(S\), start state, actions \(a\), transition function \(T(s,a,s')=P(s'|s,a)\), reward function \(R(s)\) (sometimes \(R(s,a)\) or \(R(s,a,s')\)), and possibly a terminal state.
-
A solution to an MDP is a policy:
- Non-stationary policy: function from states and times to actions.
- Stationary policy: mapping from states to actions.
- Both non-stationary and stationary policies satisfy the following properties:
- Full observability of the state
- History-independence
- Deterministic action choice
-
Infinite Horizon Discounted Value: \(V^{\pi}(s)=E\left[\sum_{t=0}^{\infty}\gamma^tR(s_t)\right]\). It follows Bellman Equation: \(V^{\pi}(s)=R(s)+\gamma\sum_{s'\in S}P(s'|s,\pi(s))V^{\pi}(s')\).
-
Value iteration: repeatedly apply the Bellman backup operator \(B:\mathbb R^{|S|}\to\mathbb R^{|S|}\), \((BV)(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V(s')\). The operator is a contraction: for any \(V_1,V_2\),
\[\begin{aligned} \lVert BV_1-BV_2\rVert_\infty&=\gamma\max_{s\in S}\left|\left(\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V_1(s')\right)-\left(\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V_2(s')\right)\right|\\ &\leq\gamma\max_{s\in S}\max_{a\in A}\left|\sum_{s'\in S}P(s'|s,a)\left(V_1(s')-V_2(s')\right)\right|\\ &\leq\gamma\max_{s\in S}\max_{a\in A}\sum_{s'\in S}P(s'|s,a)\left|V_1(s')-V_2(s')\right|\\ &\leq\gamma\max_{s\in S}\max_{a\in A}\sum_{s'\in S}P(s'|s,a)\max_{s''\in S}\left|V_1(s'')-V_2(s'')\right|\\ &=\gamma\lVert V_1-V_2\rVert_\infty \end{aligned}\]
So there is a unique fixed point \(V^*\) with \(V^*=BV^*\), i.e. \(V^*(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V^*(s')\). The fixed point is optimal: for any policy \(\pi\), \((BV^{\pi})(s)\geq V^{\pi}(s)\) because the max over actions is at least the value of \(\pi\)'s own action, so (using that \(B\) is monotone) \(B^kV^{\pi}\geq V^{\pi}\) for all \(k\), and letting \(k\to\infty\) gives \(V^*\geq V^{\pi}\).
-
The convergence rate of value iteration: Assume rewards in \([0,R_{\max}]\); then \(V^*(s)\le\sum\limits_{t=0}^{\infty}\gamma^tR_{\max}=\dfrac{R_{\max}}{1-\gamma}\), so we can prove by induction that \(\max\limits_{s\in S}|V^k(s)-V^*(s)|\le\dfrac{\gamma^k R_{\max}}{1-\gamma}\). In other words, we have linear convergence to the optimal value function.
-
Stopping Condition: Since \(\lVert V^k-V^{k-1}\rVert_{\infty}\le\epsilon\Rightarrow\lVert V^k-V^*\rVert_{\infty}\le\epsilon\dfrac{\gamma}{1-\gamma}\), we can continue iteration until \(\lVert V^k-V^{k-1}\rVert_{\infty}\le\epsilon\) for some small constant \(\epsilon\).
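A hedged sketch of value iteration with this stopping condition, assuming the MDP is given as arrays `R` (rewards) and `P` (transition probabilities indexed `[s, a, s']`):

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, eps=1e-6):
    """Repeat V <- BV until ||V^k - V^{k-1}||_inf <= eps.

    R: (|S|,) rewards; P: (|S|, |A|, |S|) transition probabilities.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R[:, None] + gamma * (P @ V)   # Q[s, a] = R(s) + gamma * sum_s' P(s'|s,a) V(s')
        V_new = Q.max(axis=1)              # Bellman backup (BV)(s)
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new, Q.argmax(axis=1)
        V = V_new
```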
-
Policy iteration: \(\pi^{k+1}(s)\gets \operatorname{argmax}_a\sum_{s'\in S}P(s'|s,a)V^{\pi^k}(s')\), where \(V^{\pi^k}\) can be computed by solving a linear system.
-
Modified Policy Iteration: using Bellman update with repeating \(k\) times for policy evaluation (instead of solving a linear system).
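A hedged sketch of policy iteration with exact evaluation via the linear system \((I-\gamma P_\pi)V^\pi=R\) (same array conventions as the value-iteration sketch above; modified policy iteration would replace the solve with \(k\) Bellman updates):

```python
import numpy as np

def policy_iteration(R, P, gamma=0.9):
    """R: (|S|,) rewards; P: (|S|, |A|, |S|) transition probabilities."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: V^pi = R + gamma * P_pi V^pi  =>  (I - gamma * P_pi) V^pi = R
        P_pi = P[np.arange(n_states), pi]                 # (|S|, |S|) transitions under pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: greedy with respect to V^pi
        pi_new = (P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```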
Lecture 7
-
Reinforcement Learning: the transition function and the reward function of the MDP are unknown.
-
Model-based approach to RL: learn the MDP model (or an approximation of it) and use it for policy evaluation or to find the optimal policy.
Model-free approach to RL: derive the optimal policy without explicitly learning the model.
-
Passive learning: The agent has a fixed policy and tries to learn the utilities of states by observing the world go by.
- Approach 1 (model-free): Monte-Carlo direct estimation: Directly estimate \(V^{\pi}(s)\) as average total reward of episodes containing \(s\).
- Approach 2 (model-based): Adaptive Dynamic Programming (ADP): Estimate \(P(s'|s,a)\) by sampling \(s'\), then use estimated model to compute utility of policy.
- Approach 3 (model-free): Temporal Difference Learning (TD). After we take an action from \(s\) to \(s'\), \(V^\pi(s)\gets V^\pi(s)+\alpha(R(s)+\gamma V^{\pi}(s')-V^\pi(s))\). Intuition: \(\frac{\sum_{i=1}^{n+1}x_i}{n+1}=\frac{\sum_{i=1}^nx_i}{n}+\frac{1}{n+1}\left(x_{n+1}-\frac{\sum_{i=1}^nx_i}{n}\right)\). If we use decreasing learning rate (e.g. \(\dfrac{1}{n}\)) the estimation will converge to true value.
-
Active learning: The agent no longer follows a fixed policy; it must choose its own actions, trading off exploration and exploitation.
- ADP-based RL: Start with an initial model, solve for the optimal policy given the current model (using value or policy iteration), take an action according to an exploration/exploitation policy, and update the estimated model based on an observed transition.
- TD-based RL: Start with initial value function, take action from an exploration/exploitation policy giving new state \(s'\) (should converge to greedy policy), update estimated model, perform TD update \(V(s)\gets V(s)+\alpha(R(s)+\gamma V(s')-V(s))\).
- Q-learning (\(Q(s, a)\) is the expected value of taking action \(a\) in state \(s\) and then following the optimal policy thereafter): Start with an initial Q-function, take action from an exploration/exploitation policy giving new state \(s'\), perform the TD update \(Q(s,a)\gets Q(s,a)+\alpha(R(s)+\gamma\max_{a'}Q(s',a')-Q(s,a))\).
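A hedged sketch of tabular Q-learning with \(\epsilon\)-greedy exploration; the environment interface (`env.reset()` returning a state, `env.step(a)` returning `(s_next, reward, done)`) and all hyperparameters are assumptions:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy exploration policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Exploration/exploitation: random action with probability eps, else greedy.
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])         # TD update
            s = s_next
    return Q
```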
-
Exploration policy: we would like an exploration policy that is greedy in the limit of infinite exploration (GLIE).
-
GLIE Policy 1: choose between acting randomly and acting greedily with a probability that depends on \(t\) (e.g. act randomly with probability \(1/t\), greedily otherwise).
-
GLIE Policy 2: Boltzmann Exploration. \(\text{Pr}(a|s)=\frac{\exp(Q(s,a)/T)}{\sum_{a'\in A}\exp(Q(s,a')/T)}\). \(T\) is the temperature. Large \(T\) means that each action has about the same probability. Small \(T\) leads to more greedy behavior. Typically start with large \(T\) and decrease with time.
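A small sketch of sampling one action with Boltzmann exploration (the shift by the maximum Q-value is a numerical-stability trick that leaves the probabilities unchanged):

```python
import numpy as np

def boltzmann_action(Q_s, T, rng=None):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = Q_s / T
    logits = logits - logits.max()           # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(Q_s), p=probs)
```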
-