【RL-Notes】Constraint Decoupling by Constraint Relaxation


Example

Constraint relaxation is a technique whereby the constraint set is replaced by another constraint set that does not involve coupling.

The multiarmed bandit problem involves $n$ projects, of which only one can be worked on at any time period. Each project $i$ is characterized at time $k$ by its state $x_k^i$. If project $i$ is worked on at time $k$, one receives an expected reward $R^i(x_k^i)$, and the state $x_k^i$ evolves according to the equation

$$x_{k+1}^i = f^i(x_k^i, w_k^i),$$

where $w_k^i$ is a random disturbance with probability distribution depending on $x_k^i$ but not on prior disturbances. If project $i$ is not worked on, its state changes according to

$$x_{k+1}^i = \bar{f}^i(x_k^i, \bar{w}_k^i).$$
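To make the setup concrete, here is a minimal Python sketch of these dynamics. The specific transition functions, the reward, and the Gaussian disturbances are hypothetical placeholders chosen for illustration, not taken from the book:

```python
import numpy as np

# Minimal sketch of the multiarmed bandit dynamics described above.
# f_work, f_idle, reward, and the disturbance distributions are
# hypothetical placeholders, not from the reference.

rng = np.random.default_rng(0)

def f_work(x, w):
    # f^i: transition when project i IS worked on
    return x + 1 + w

def f_idle(x, w_bar):
    # \bar{f}^i: transition when project i is NOT worked on
    return x + w_bar

def reward(x):
    # R^i: expected reward for working on a project in state x
    return 1.0 / (1.0 + x)

def step(states, i):
    """Advance all projects one period, working only on project i."""
    next_states = []
    for j, x in enumerate(states):
        if j == i:
            next_states.append(f_work(x, rng.normal(0.0, 0.1)))   # w_k^i
        else:
            next_states.append(f_idle(x, rng.normal(0.0, 0.05)))  # \bar{w}_k^j
    return next_states, reward(states[i])
```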

In particular, suppose that the optimal reward function $J_k^*(x^1, \dots, x^n)$ is approximated by a separable function of the form $\sum_{i=1}^n \tilde{J}_k^i(x^i)$, where each $\tilde{J}_k^i$ is a function that quantifies the contribution of the $i$th project to the total reward. The corresponding one-step lookahead policy selects at time $k$ the project $i$ that maximizes
$$R^i(x^i) + \sum_{j\neq i}\bar{R}^j(x^j) + \mathbb{E}\bigl\{\tilde{J}_{k+1}^i\bigl(f^i(x^i, w^i)\bigr)\bigr\} + \sum_{j\neq i}\mathbb{E}\bigl\{\tilde{J}_{k+1}^j\bigl(\bar{f}^j(x^j, \bar{w}^j)\bigr)\bigr\},$$

where $\bar{R}^j(x^j)$ denotes the expected reward of project $j$ when it is not worked on,
which can also be written as:
$$R^i(x^i) - \bar{R}^i(x^i) + \mathbb{E}\bigl\{\tilde{J}_{k+1}^i\bigl(f^i(x^i, w^i)\bigr) - \tilde{J}_{k+1}^i\bigl(\bar{f}^i(x^i, \bar{w}^i)\bigr)\bigr\} + \sum_{j=1}^n\Bigl(\bar{R}^j(x^j) + \mathbb{E}\bigl\{\tilde{J}_{k+1}^j\bigl(\bar{f}^j(x^j, \bar{w}^j)\bigr)\bigr\}\Bigr).$$
Note that the last term in this expression does not depend on $i$, so it can be dropped: the policy simply works on the project $i$ that maximizes the index

$$R^i(x^i) - \bar{R}^i(x^i) + \mathbb{E}\bigl\{\tilde{J}_{k+1}^i\bigl(f^i(x^i, w^i)\bigr) - \tilde{J}_{k+1}^i\bigl(\bar{f}^i(x^i, \bar{w}^i)\bigr)\bigr\},$$

which is computed separately for each project.
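As a sketch of how this index policy might be implemented, the snippet below estimates the expectation by Monte Carlo sampling and picks the maximizing project. The model functions, disturbance distributions, and the separable approximations `J_tilde` are hypothetical stand-ins:

```python
import numpy as np

# Sketch of the one-step lookahead index derived above (the common last
# term is dropped since it does not depend on i):
#   R^i(x) - Rbar^i(x) + E{ Jt^i(f^i(x, w)) - Jt^i(fbar^i(x, wbar)) }
# All model functions below are hypothetical placeholders.

rng = np.random.default_rng(0)

def f_work(x, w): return x + 1 + w         # f^i
def f_idle(x, w): return x + w             # \bar{f}^i
def R_work(x):    return 1.0 / (1.0 + x)   # R^i
def R_idle(x):    return 0.0               # \bar{R}^i

def J_tilde(i, x):
    # Placeholder separable approximation \tilde{J}_{k+1}^i(x)
    return -0.1 * (i + 1) * x

def lookahead_index(i, x, n_samples=1000):
    """Monte Carlo estimate of project i's lookahead index at state x."""
    w = rng.normal(0.0, 0.1, n_samples)        # samples of w^i
    w_bar = rng.normal(0.0, 0.05, n_samples)   # samples of \bar{w}^i
    diff = J_tilde(i, f_work(x, w)) - J_tilde(i, f_idle(x, w_bar))
    return R_work(x) - R_idle(x) + diff.mean()

def select_project(states):
    """Work on the project whose index is largest."""
    return max(range(len(states)), key=lambda i: lookahead_index(i, states[i]))

print(select_project([0.5, 1.0, 2.0]))  # index of the chosen project
```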

An alternative possibility is to use a separable parametric approximation of the form
$$\sum_{i=1}^n \tilde{J}_{k+1}^i(x_{k+1}^i, r_{k+1}^i),$$

where the $r_{k+1}^i$ are vectors of tunable parameters, whose values can be obtained by some training algorithm.
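As an illustration, one common parametric choice (assumed here; the text does not prescribe a specific architecture at this point) is a linear architecture $\tilde{J}^i(x, r^i) = (r^i)^\top \phi(x)$ with a fixed feature map $\phi$, where each $r^i$ is fit by least squares to sampled reward-to-go targets:

```python
import numpy as np

# Sketch of a separable parametric approximation J~^i(x, r) = r . phi(x),
# trained by least squares. The feature map and the sample targets are
# hypothetical placeholders.

def phi(x):
    # Fixed feature vector for a scalar project state (assumed)
    return np.array([1.0, x, x * x])

def fit_parameters(samples):
    """Least-squares fit of r from (state, reward-to-go target) pairs."""
    X = np.stack([phi(x) for x, _ in samples])
    y = np.array([t for _, t in samples])
    r, *_ = np.linalg.lstsq(X, y, rcond=None)
    return r

def J_tilde(x, r):
    return phi(x) @ r

# Example: fit the parameter vector of one project from simulated pairs.
samples = [(0.0, 1.0), (1.0, 0.6), (2.0, 0.4), (3.0, 0.3)]
r_i = fit_parameters(samples)
print(J_tilde(1.5, r_i))
```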

Reference

Bertsekas, D. P., Reinforcement Learning and Optimal Control, Athena Scientific, 2019.


Reposted from blog.csdn.net/qq_18822147/article/details/121847481