8.3.1 Naive Learner

각 agent의 policy를 $\theta^a$ 에 의해 parameterized된 $\pi^a$ 라 할 때, 두 agent에 의한 expected total discounted return은 $J^a(\theta^1,\theta^2)$ 로 정의할 수 있습니다. Naive Learner는 다음과 같이 반복적으로 expected total discounted return을 최대화 하기 위해 optimize됩니다.

$\theta^1_{i+1} = \mathrm{argmax}_{\theta^1}J^1(\theta^1,\theta^2_i)$

$\theta^2_{i+1} = \mathrm{argmax}_{\theta^1}J^1(\theta^1_i,\theta^2)$

이는 고정된 이전의 다른 agent parameter로 부터 expected total discounted return을 maximize하는 $\theta$ 를 찾게 되는데, 실제로 expected total discounted return에 대한 접근을 할 수 없으니, 다음과 같이 gradient ascent를 통해 학습을 진행합니다.

$\theta^1_{i+1} = \theta^1_i + \bm{f}^1_{\mathrm{nl}}(\theta^1_i,\theta^2_i)$

$\bm{f}^1_{\mathrm{nl}} = \nabla_{\theta^1_i}J^1(\theta^1_i,\theta^2_i) \delta$

delta는 learning rate로 다른 agent도 이와 같이 학습하게 됩니다.

Previous8.3 Methods Next8.3.2 Learning with Opponent Learning Awareness

Last updated 4 years ago

Was this helpful?