From Bandit Problems to MDPs

Background

In bandit problems, we estimated the value $q_*(a)$ of each action $a$. In MDPs, we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections.

  • $q_*(s, a)$ evaluates how good it is to take action $a$ in state $s$.
  • $v_*(s)$ evaluates how good it is to be in state $s$.
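These two quantities are related by a standard identity: under optimal action selection, the value of a state equals the value of the best action available in it.

$$
v_*(s) = \max_{a \in A} q_*(s, a)
$$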

Markov property

The Markov property means that the future of the process depends only on the current observation, so the agent gains nothing by looking back at the full history.
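Written with the notation used later in this post, the Markov property says that conditioning on the full history adds nothing beyond the most recent state and action:

$$
\Pr\{S_t = s', R_t = r \mid S_{t-1}, A_{t-1}, S_{t-2}, A_{t-2}, \dots, S_0, A_0\} = \Pr\{S_t = s', R_t = r \mid S_{t-1}, A_{t-1}\}
$$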

Markov Decision Process (MDP)

MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. The agent-environment interaction in a Markov decision process is as follows:
[Figure: the agent-environment interaction loop in an MDP]
**Definition:** An MDP is a 5-tuple $(S, A, T, R, \gamma)$, where:
  • $S$ is the set of states (the state space),
  • $A$ is the set of actions (the action space),
  • $T$ is the state-transition function, giving the probability of reaching state $s'$ after taking action $a$ in state $s$,
  • $R$ is the reward function,
  • $\gamma \in [0, 1]$ is the discount factor.
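As a concrete (hypothetical) way to hold these five ingredients in code, one can store $S$, $A$, and $\gamma$ directly and fold $T$ and $R$ into the four-argument dynamics function $p(s', r \mid s, a)$ introduced next; all names below are illustrative, not from the original post.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class FiniteMDP:
    states: List[State]      # S: finite state space
    actions: List[Action]    # A: finite action space
    gamma: float             # discount factor in [0, 1]
    # T and R folded into the joint dynamics p(s', r | s, a):
    # dynamics[(s, a)][(s_next, reward)] = probability
    dynamics: Dict[Tuple[State, Action], Dict[Tuple[State, float], float]]
```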
In a finite MDP, there is a well-defined probability of those values ($s', r$) occurring at time $t$, given particular values of the preceding state $s$ and action $a$:
$$
\begin{aligned}
p(s', r \mid s, a) &\doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \\
p(s' \mid s, a) &\doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in R} p(s', r \mid s, a)
\end{aligned}
$$
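Below is a minimal sketch, assuming the nested-dict dynamics format from the dataclass above, of how $p(s' \mid s, a)$ falls out of the four-argument $p$ by summing over rewards; the two-state example numbers are made up.

```python
# dynamics[(s, a)][(s_next, r)] = p(s_next, r | s, a)  -- toy, made-up numbers
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 2.0): 1.0},
}

def transition_prob(s_next, s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sp, r), prob in dynamics[(s, a)].items() if sp == s_next)

print(transition_prob("s1", "s0", "a0"))  # 0.5
```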

From the four-argument dynamics function $p$, we can compute other quantities, such as the expected reward for a state-action pair and for a state-action-next-state triple:
$$
\begin{aligned}
r &: S \times A \rightarrow \mathbb{R}, \\
r(s, a) &\doteq E[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in R} r \sum_{s' \in S} p(s', r \mid s, a) \\
r &: S \times A \times S \rightarrow \mathbb{R}, \\
r(s, a, s') &= E[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] \\
&= \sum_{r \in R} r \cdot p(r \mid s, a, s') \\
&= \sum_{r \in R} r \cdot \frac{p(r, s, a, s')}{p(s, a, s')} \\
&= \sum_{r \in R} r \cdot \frac{p(s', r \mid s, a) \cdot p(s, a)}{p(s' \mid s, a) \cdot p(s, a)} \\
&= \sum_{r \in R} r \cdot \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}
\end{aligned}
$$
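Continuing in the same spirit, here is a sketch of the two expected-reward functions computed from such a dynamics table (again with made-up numbers, and guarding against $p(s' \mid s, a) = 0$):

```python
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 2.0): 1.0},
}

def expected_reward(s, a):
    """r(s, a) = sum_r r * sum_{s'} p(s', r | s, a)."""
    return sum(r * prob for (sp, r), prob in dynamics[(s, a)].items())

def expected_reward_given_next(s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    p_next = sum(prob for (sp, r), prob in dynamics[(s, a)].items() if sp == s_next)
    if p_next == 0.0:
        raise ValueError("s' is unreachable from (s, a)")
    return sum(r * prob for (sp, r), prob in dynamics[(s, a)].items() if sp == s_next) / p_next

print(expected_reward("s0", "a0"))                    # 0.5
print(expected_reward_given_next("s0", "a0", "s1"))   # 1.0
```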

At each step, the agent takes an action; the environment then transitions to a new state and provides a reward.
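A rough sketch of this loop, with `env` and `policy` as hypothetical stand-ins (assuming `env.reset()` returns an initial state and `env.step(action)` returns a `(next_state, reward, done)` triple):

```python
def run_episode(env, policy, max_steps=100):
    """Run one episode of the agent-environment loop and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent picks A_t from S_t
        state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward
```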

Reposted from blog.csdn.net/lun55423/article/details/112009579