7.4.1 Matrix Game

(TODO: a precise description of the matrix game is needed.)

The first experiment uses a two-player, partially observable, matrix-like game that ends after two steps. Each agent's state is a single random bit (its card), and the action space consists of 3 discrete actions. Each agent observes its own card, and must encode information about its hand into its action (because the game has only two steps, this applies to agent 1 only). The reward can be maximized only when the two agents have settled on a good convention. In a two-step game this means that agent 1 picks an action that conveys its information well, and agent 2 interprets that action correctly and acts on it. The graph below shows the resulting performance.
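The structure of such a game can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `payoff` function below is hypothetical (it is not the payoff table used in the experiment), and the names `best_response_return`, `signalling`, and `silent` are invented for this example. It shows why a convention matters: agent 2 never sees agent 1's card, so it can only recover it from agent 1's action.

```python
N_CARDS = 2    # each agent's private card is one random bit
N_ACTIONS = 3  # three discrete actions per agent


def payoff(card1, card2, action1, action2):
    """Hypothetical reward (NOT the payoff from the experiment):
    the team scores 1 only if agent 2's action equals card1 + card2."""
    return 1.0 if action2 == card1 + card2 else 0.0


def best_response_return(policy1):
    """Expected return when agent 2 best-responds to a deterministic policy1.

    Agent 2 sees its own card and agent 1's action, but never card1, so it
    must infer card1 from the convention encoded in policy1.
    """
    total = 0.0
    for card2 in range(N_CARDS):
        for action1 in range(N_ACTIONS):
            # Cards agent 1 could be holding, given the action agent 2 observed.
            consistent = [c for c in range(N_CARDS) if policy1(c) == action1]
            if not consistent:
                continue  # policy1 never plays this action
            best = max(
                sum(payoff(c, card2, action1, a2) for c in consistent)
                for a2 in range(N_ACTIONS)
            )
            total += best / (N_CARDS * N_CARDS)  # each card pair has prob. 1/4
    return total


def signalling(card):
    return card  # the action reveals the card


def silent(card):
    return 0  # the same action for both cards, so nothing is conveyed


print(best_response_return(signalling))  # 1.0
print(best_response_return(silent))      # 0.5
```

With this toy payoff, the signalling convention lets agent 2 always act correctly, while the silent policy leaves agent 2 guessing agent 1's card.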

In this experiment, BAD performs far better than the baseline, vanilla PG. In addition, the CF policy gradient replaces $\log\pi^a(u^a_t|\tau^a)$ with $\log(\hat{\pi}|\mathcal{B}_t,f^{\mathrm{pub}})$, so it can take into account not only the action that was actually selected but also the actions that were not executed. Here, however, this makes only a very small difference.
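The difference between the two log-probability terms can be illustrated with a small sketch. It rests on assumptions that are not stated in this section: $\hat{\pi}$ is treated as a deterministic map from each possible private card to an action, and its probability is assumed to factorize across cards. The probabilities in `PROBS` and both function names are hypothetical.

```python
import math

# Hypothetical action probabilities pi(u | card) for the acting agent.
PROBS = {
    0: [0.7, 0.2, 0.1],
    1: [0.1, 0.3, 0.6],
}


def logp_sampled_action(probs, card, action):
    """Vanilla term: log-probability of the one action that was executed."""
    return math.log(probs[card][action])


def logp_partial_policy(probs, partial_policy):
    """Counterfactual-style term: log-probability of a full card -> action map.

    Assumes the distribution over maps factorizes across cards, so the term
    also covers the action that would have been taken for the card NOT held.
    """
    return sum(
        math.log(probs[card][action]) for card, action in partial_policy.items()
    )


# The agent holds card 0 and plays action 0; for card 1 it would have played 2.
vanilla_term = logp_sampled_action(PROBS, card=0, action=0)
cf_term = logp_partial_policy(PROBS, {0: 0, 1: 2})
print(vanilla_term, cf_term)
```

In this sketch the second term changes when the action assigned to the unheld card changes, whereas the first term depends only on the executed action.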
