7.4.4 Beliefs in Hanabi

Hanabi의 belief를 계산하는 가장 간단한 방법은 다음과 같습니다. $f^{\mathrm{pub}}_{t}$ 는 덱에 남아있는 카드에 대한 정보 candidates vector $C$ 와, hint mask $HM$ 은 $nN_h \times (N_{\mathrm{color}}N_{\mathrm{rank}}+1)$ 크기의 binary matrix로 나타낼 수 있습니다.(패가 없다는 정보 +1까지)이는 주어진 슬롯에서 플레이어가 지금까지 힌트에 따라 특정 카드를 들고 있을 수 있는 경우 1, 그렇지 않으면 0인 matrix입니다. 슬롯은 public state에 대한 private state space $f[i]$ 로 부터 계산됩니다. basic belief $B^0$ 는 다음과 같이 계산됩니다.

$B^0(f[i]) = P(f[i]|f^{\mathrm{pub}})\propto C(f) \times HM(f[i])$

이를 "V0 belief" 라고 부를건데, 이는 그 카드에 대한 public available information으로만 이루어져있습니다. 이번 실험에서 baseline agent는 basic belief 를 받는 것을 중점으로 진행하였습니다. 하지만 이 정보는 실제 사람이 이용하기에는 모든 힌트를 기억하는 것과 같기 때문에 문제가 있는 방법입니다. 그리고 basic belief는 다른 정보와의 상호작용을 통한 어떤 결과물을 내놓은 것에 대한 고려를 하지 않은 채로 이루어집니다. 이를 BAD에서는 self-consistent belief와 (7.3.11) 수식에서 이루어졌던 noisy sampling을 통해 해결합니다.

간단한 각 카드에 대한 belief는 다음처럼 나타낼 수 있습니다.

$B^0(f[i]) \propto C(f) \times HM(f[i]) \times \mathcal{L}(f[i])$

$B^0(f[i]) = \frac{C(f) \times HM(f[i]) \times \mathcal{L}(f[i])}{\sum_gC(g) \times HM(f[i]) \times \mathcal{L}(f[i])}$

$=\beta_i {C(f) \times HM(f[i]) \times \mathcal{L}(f[i])}$

아래 두 term은 probability처럼 사용할 수 있게 normalization을 진행합니다.

다음으로는 iterative belief update를 진행해 보겠습니다.

$\mathcal{B}^{k+1}(f[i]) = \sum_{f[-i]}\mathcal{B}^k(f[-i])P(f[i]|f[-i],f^\mathrm{pub}_{\leq t},u^a_{\leq t}, \hat{\pi}_{\leq t})$

$=\sum_{g[-i]}\mathcal{B}^k(g[-i])\beta_i(C(f)- \sum_{j \neq i} \bm{1}(g[j] = f)) M(f[i])$

이 때 $M(f[i]) = HM(f[i]) \times \mathcal{L}(f[i])$ 입니다. 이때, $\mathcal{B}^k$ 에 대해 factorized 하게 근사하면 다음과 같습니다.

$B^{k+1}(f[i]) = \sum_{g[-i]}\mathcal{B}^k(g[-i])\beta_i(C(f))- \sum_{j \neq i} \bm{1}(g[j] = f)) M(f[i])$

$= \sum_{g[-i]}\prod_{j \neq i}\mathcal{B}^k(g[-i])\beta_i(C(f))- \sum_{j \neq i} \bm{1}(g[j] = f)) M(f[i])$

$\simeq \beta_i\sum_{g[-i]}\prod_{j \neq i}\mathcal{B}^k(g[-i])\beta_i(C(f))- \sum_{j \neq i} \bm{1}(g[j] = f)) M(f[i])$

마지막 term은 sample을 평균한 후 normalizing을 해 결과값과는 차이가 생기지만, 이는 sample마다 normalizing을하고, 평균을 내는 것보다 좀 더 다루기 쉬운점을 이용했습니다.

이는 product-sum을 통해 다음과 같이 나타낼 수 있습니다.

$\mathcal{B}^{k+1}(f[i]) \simeq \beta_i(C(f) - \sum_{g[-i]}\prod_{j\neq i} \mathcal{B}^k(g[j]) \sum_{j \neq i } \bm{1}(g[j] = f)) M(f[i])$

$= \beta_i(C(f) - \sum_{{j\neq i} }\sum_g\mathcal{B}^k(g[j]) \bm{1}(g[j] = f)) M(f[i])$

$= \beta_i(C(f) - \sum_{{j\neq i} }\mathcal{B}^k(g[j]) ) M(f[i])$

$\propto(C(f) - \sum_{{j\neq i} }\mathcal{B}^k(g[j]) ) M(f[i])$

그러므로 self consistent belief를 sampling 없이 다음처럼 구할 수 있습니다.

$\mathcal{B}^{k+1}(f[i]) \propto(C(f) - \sum_{{j\neq i} }\mathcal{B}^k(g[j]) ) \times HM(f[i])$

그리고 belief가 수렴하거나 iteration을 최대로 했을 때, 이를 V1 belief라고 부릅니다. 이는 Bayesian probability에 의하여 만들어졌진 않지만, 다른카드와 힌트와의 상호작용을 통해 만들어졌습니다. 결과적으로 iteration을 할때마다 HM의 slot은 다른 slot들에 의해 들어갈 candidate가 줄어드는 추론이 이뤄집니다.

같은 algorithm이지만 $\mathcal{L}$ 을 포함하는 algorithm을 BAD라고 합니다.

$BB^0(f[i]) \propto C(f) \times HM(f[i]) \times \mathcal{L}(f[i])$

$BB^1(f[i]) \propto (C(f)-\sum_{j \neq i }B^k(f[j])) \times HM(f[i]) \times \mathcal{L}(f[i])$

실제로 V2 belief는 안정성을 위해 Bayesian Belief와 V1 belief의 보간법을 통해 구합니다.

$V2 = (1-\alpha) BB+ \alpha V1 \ \ , \alpha = 0.01 \ \ \mathrm{or} \ \ 0.1$

Previous7.4.3 Observations and Actions Next7.4.5 Architecture Details for Baselines and Method

Last updated 4 years ago

Was this helpful?