9.4.1 Implement of DiCE

DiCE의 처음부터 실용적인 논문임을 강조했는데, 이는 단순한 구현방법에 있습니다. MagicBox는 두가지 특성을 만족하면 됐었는데, 이는 다음과 같이 정의함으로써 두가지 성질을 다 가져갈 수 있습니다. 아래선 위를 확인하도록 합니다.

$\square (\mathcal{W}) = \exp(\tau - \bot(\tau))$

$\tau = \sum_{w \in \mathcal{W}} \log(p(w;\theta))$

$\bot$ 은 $\nabla_x \bot(x) = 0$ 이 되도록 하는 gradient를 안흐르도록 하는 operator로 pytorch의 detach같은 역할을 합니다. $\bot(x) \rightarrow x$ 이므로, $\square (\mathcal{W}) \rightarrow 1$ 임이 자명합니다. 이로써 첫번째 성질이 증명되었습니다. 바로 두번째 성질을 증명하면 다음과 같습니다.

$\nabla_\theta \square (\mathcal{W}) = \nabla_\theta\exp(\tau-\bot(\tau))$

$= \exp(\tau-\bot(\tau))\nabla_\theta(\tau-\bot(\tau))$

$=\square(\mathcal{W})(\nabla_\theta\tau-0)$

$=\square(\mathcal{W})\sum_{w \in \mathcal{W}}\nabla_\theta \log(p(w;\theta))$

그리고 magicbox operator를 구현하게되면, 주로 objective와 바로 연관지어 구현하는게 가장 간단한데, 일반적인 RL에선 $J = \mathbb{E}[\sum r_t]$ 로 나타낼 때, DiCE의 objective는 $J_\square= \sum_t \square(\{a_{t'},t'\leq t\})r_t$ 로 나타내야합니다. (이는 이전의 action에 따라 reward에 stochastic하게 영향을 주기 때문입니다.

Previous9.4 Correct Gradient Estimators with DiCE Next9.4.2 Casuality

Last updated 4 years ago

Was this helpful?