9.3.3. Simple Failing Example


In this section we look at a simple example with $x \sim \mathrm{Ber}(\theta)$.

With $f(x,\theta) = x(1-\theta) + (1-x)(1+\theta)$, the objective $\mathcal{L}$ is defined as follows.

$$
\begin{aligned}
\mathcal{L} &= \mathbb{E}_x\left[f(x;\theta)\right] \\
&= \sum_x p(x;\theta)\, f(x;\theta) \\
&= \sum_x p(x;\theta)\bigl(x(1-\theta)+(1-x)(1+\theta)\bigr) \\
&= \theta\bigl(1\cdot(1-\theta)+0\cdot(1+\theta)\bigr) + (1-\theta)\bigl(0\cdot(1-\theta)+1\cdot(1+\theta)\bigr) \\
&= -2\theta^2+\theta+1
\end{aligned}
$$

Differentiating this closed form directly up to second order gives:

$$
\nabla_{\theta}\mathcal{L} = -4\theta+1
$$

$$
\nabla^2_{\theta}\mathcal{L} = -4
$$

The derivatives obtained with the SL approach, in contrast, are:

$$
(\nabla_{\theta}\mathcal{L})_{\mathrm{SL}} = -4\theta+1
$$

$$
(\nabla^2_{\theta}\mathcal{L})_{\mathrm{SL}} = -2
$$

These values follow from the SL construction of Section 9.3.2, in which the sampled cost is detached from $\theta$: $\mathrm{SL}(\mathcal{L}) = \log p(x;\theta)\,\hat{f}(x) + f(x;\theta)$, where $\hat{f}$ carries no $\theta$-dependence. The resulting first-order estimator, $g_{\mathrm{SL}}(x) = \hat{f}(x)\,\nabla_\theta \log p(x;\theta) + \nabla_\theta f(x;\theta)$, still has the correct expectation $-4\theta+1$. Applying the SL construction once more to $g_{\mathrm{SL}}$, however, yields a second-order estimator whose expectation is $\mathbb{E}_x\bigl[\hat{g}_{\mathrm{SL}}(x)\,\nabla_\theta \log p(x;\theta) + \hat{f}(x)\,\nabla^2_\theta \log p(x;\theta)\bigr] = -2$: because $\hat{f}$ is treated as a constant, every term arising from the direct dependence of $f$ on $\theta$ is dropped.
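As a quick numerical sanity check, here is a minimal PyTorch sketch (not from the thesis; the exact-enumeration setup, helper names, and the evaluation point $\theta = 0.6$ are my own) that reproduces these numbers by enumerating both outcomes of $x$ instead of sampling, detaching the cost exactly as the SL prescribes, and then applying the SL construction a second time:

```python
import torch

theta = torch.tensor(0.6, requires_grad=True)

def p(x, th):
    """Bernoulli pmf p(x; theta)."""
    return th if x == 1 else 1 - th

def f(x, th):
    """Cost f(x, theta) = x(1-theta) + (1-x)(1+theta); it depends on theta directly."""
    return x * (1 - th) + (1 - x) * (1 + th)

def expected(values):
    """Exact expectation over x in {0, 1}, weighting by p(x; theta)."""
    return sum(p(x, theta).detach() * v for x, v in values.items())

grads, hessians = {}, {}
for x in (0, 1):
    log_p = torch.log(p(x, theta))
    # SL approach: the sampled cost is detached, so its theta-dependence is lost.
    sl = log_p * f(x, theta).detach() + f(x, theta)
    g, = torch.autograd.grad(sl, theta, create_graph=True)
    # Apply the SL construction again, now to the first-order estimator g.
    sl2 = log_p * g.detach() + g
    h, = torch.autograd.grad(sl2, theta, create_graph=True)
    grads[x], hessians[x] = g, h

print(expected(grads).item())     # -1.4  = -4*theta + 1  (correct first derivative)
print(expected(hessians).item())  # -2.0  (biased; the true second derivative is -4)
```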

No matter how many samples are drawn, the SL estimator therefore returns a wrong second derivative. If such an estimate is plugged into an optimization method that relies on second derivatives, such as Newton-Raphson, $\theta$ will never converge to the correct value, whereas with the exact derivatives the method converges in a single step, since the objective is quadratic in $\theta$.
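A minimal sketch of that claim, assuming the standard Newton update $\theta \leftarrow \theta - \nabla_\theta\mathcal{L} / \nabla^2_\theta\mathcal{L}$, an arbitrary starting point, and ignoring the $[0,1]$ constraint on $\theta$: with the true Hessian $-4$ a single step reaches the optimum $\theta^\ast = 1/4$, while with the biased value $-2$ the iterate just bounces between $\theta_0$ and $1/2-\theta_0$.

```python
def grad_L(theta):
    """Exact first derivative of L(theta) = -2*theta**2 + theta + 1."""
    return -4 * theta + 1

theta_exact, theta_sl = 0.9, 0.9               # same arbitrary starting point for both
for step in range(4):
    theta_exact -= grad_L(theta_exact) / -4.0  # Newton step with the true Hessian
    theta_sl    -= grad_L(theta_sl)    / -2.0  # Newton step with the biased SL Hessian
    print(step, round(theta_exact, 3), round(theta_sl, 3))

# theta_exact jumps to the optimum 0.25 after the first step and stays there,
# while theta_sl oscillates between -0.4 and 0.9 and never converges.
```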

This example illustrates an issue that arises whenever $\theta$ regularizes the objective through the stochastic samples. For example, in soft Q-learning the policy is regularized by adding an entropy penalty to the reward; this penalty depends on the policy parameters $\theta$, and also on the states, which are in turn influenced by the stochastically sampled actions. Consequently, any RL objective with entropy regularization suffers from the same SL failure as above. Moreover, the failure is not confined to second-order gradients; where it appears depends on how $\theta$ enters the objective. Methods that need second-order gradients of such regularized objectives therefore have to look for an alternative approach.