8.4.3 Training Details

Gradient-based NL๊ณผ LOLA์„ ํ•™์Šตํ•  ๋•Œ์—”, actor-critic method๋ฅผ ์ ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค.

  1. ํ•™์Šต ์ค‘๊ฐ„์—๋Š” learning rate ฮด\delta๋ฅผ actorsํ•™์Šตํ•  ๋• 0.005, critic์—” 1์„ ์‚ฌ์šฉํ–ˆ๋‹ค๊ณ  ํ•˜๋Š”๋ฐ, td error๋กœ updateํ•˜๋ฏ€๋กœ ๋‹น์—ฐํžˆ 1์„ ์ฃผ๋Š”๊ฒŒ ๋งž๋‹ค๊ณ  ์ƒ๊ฐํ–ˆ์Šต๋‹ˆ๋‹ค.

  2. discount rate ฮณ\gamma ๋Š” IPD์™€ Coin Game์—์„œ 0.96, matching pennies์—์„œ 0.9๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

    1. ์ด ๋•Œ, IPD์™€ Coin Game์—์„œ ์‚ฌ์šฉํ•œ ๋†’์€ ฮณ\gamma ๊ฐ’์€ ๊ธด time-step์— ๋Œ€ํ•œ reward signal์„ ๋ฐ›์„ ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•จ์ž…๋‹ˆ๋‹ค.

    2. IMP์—์„  ๋‚ฎ์€ ฮณ\gamma ๊ฐ’์ด ํ•™์Šต์„ ์ข€ ๋” ์•ˆ์ •์„ฑ์žˆ๊ฒŒ ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค.

๋ฆฌ๊ทธ์ „์„ ์œ„ํ•ด baseline algorithm๊ณผ parameter๋“ค์€ ์ž์„ธํ•˜๊ฒŒ๋Š” ๋ณธ๋ฌธ์„ ํ™•์ธํ•˜์‹œ๋ฉด ํŽธํ•ฉ๋‹ˆ๋‹ค. ๋ฆฌ๊ทธ์ „์€ ๋ชจ๋“  agent๊ฐ€ 1000 ์—ํ”ผ์†Œ๋“œ์”ฉ 200step์„ ์ง„ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Last updated