Reinforcement Learning Theory - Lesson 4 - Value Iteration and Policy Iteration

Posted by penuel on 2024-11-13

1. value iteration algorithm:

Value iteration was already introduced in the previous lesson:

1.1 policy update:

1.2 Value update:

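At this point, both \(\pi_{k+1}\) and \(v_k\) are known, so the value update is a direct substitution rather than an equation to solve.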
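The policy update and value update can be sketched in Python as follows. This is a minimal sketch rather than the author's code. It assumes the usual tabular form of value iteration: the policy update picks, in each state, the action that is greedy with respect to \(q_k\) computed from \(v_k\), and the value update plugs \(\pi_{k+1}\) and \(v_k\) into the right-hand side of the Bellman equation. The toy MDP `P` and the names `q_value`, `value_iteration`, `GAMMA`, `tol`, and `max_iters` are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy MDP (not from the article):
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
GAMMA = 0.9  # discount rate (assumed value)


def q_value(P, v, s, a, gamma):
    """q_k(s, a): expected reward plus discounted v_k of the next state."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-8, max_iters=10_000):
    v = {s: 0.0 for s in P}  # initial guess v_0
    policy = {}
    for _ in range(max_iters):
        # Policy update: pi_{k+1}(s) is the greedy action under q_k(s, .)
        policy = {
            s: max(P[s], key=lambda a: q_value(P, v, s, a, gamma)) for s in P
        }
        # Value update: pi_{k+1} and v_k are both known, so v_{k+1} is a
        # direct substitution, not the solution of an equation.
        v_new = {s: q_value(P, v, s, policy[s], gamma) for s in P}
        if max(abs(v_new[s] - v[s]) for s in P) < tol:
            return v_new, policy
        v = v_new
    return v, policy
```

Calling `value_iteration(P, GAMMA)` returns the approximate state values and a greedy policy for the toy MDP. Note that the intermediate `v` is generally not the state value of any policy; it is only an iterate that converges toward the optimal value.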

1.3 procedure summary:

1.4 example:



2. policy iteration algorithm:



Q1:

Q2:

Q3:

2.1 Policy evaluation:

2.2 Policy improvement:


3. truncated policy iteration algorithm

3.1 compare value iteration and policy iteration:




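Running only one step of the policy-evaluation iteration gives value iteration; running infinitely many steps gives policy iteration. Stopping at some finite number of steps in between is called truncated policy iteration.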
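The relationship among the three algorithms can be illustrated with a single routine whose only knob is the number of evaluation sweeps. The following is a hedged sketch, not the author's pseudocode. The toy MDP `P`, the parameter name `eval_sweeps`, and the stopping rule based on `tol` are assumptions made for illustration. Since infinitely many sweeps cannot actually be run, policy iteration is approximated here by a large sweep count.

```python
# Hypothetical toy MDP (not from the article):
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount rate (assumed value)


def q_value(P, v, s, a, gamma):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def truncated_policy_iteration(P, gamma, eval_sweeps, tol=1e-8, max_iters=10_000):
    """eval_sweeps=1 behaves like value iteration; a very large eval_sweeps
    approximates policy iteration; anything in between is truncated."""
    v = {s: 0.0 for s in P}
    policy = {}
    for k in range(1, max_iters + 1):
        v_old = dict(v)
        # Policy improvement: greedy with respect to the current values
        policy = {
            s: max(P[s], key=lambda a: q_value(P, v, s, a, gamma)) for s in P
        }
        # Truncated policy evaluation: a fixed number of synchronous sweeps,
        # warm-started from the previous value estimate
        for _ in range(eval_sweeps):
            v = {s: q_value(P, v, s, policy[s], gamma) for s in P}
        if max(abs(v[s] - v_old[s]) for s in P) < tol:
            return v, policy, k  # k = number of outer iterations used
    return v, policy, max_iters
```

For example, `truncated_policy_iteration(P, GAMMA, eval_sweeps=1)`, `eval_sweeps=5`, and `eval_sweeps=1000` should all arrive at the same policy on this toy MDP. The more sweeps each outer iteration runs, the fewer outer iterations are needed, at the cost of more work per iteration.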

3.2 pseudocode:


4. summary:
