8.6 Conclusion & Future Work

์ด๋ฒˆ section์—์„œ๋Š” Opponent-Learning Awareness(LOLA)์— ๋Œ€ํ•ด ์•Œ์•„ ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด๋Š” MARL์ƒํ™ฉ์—์„œ ๋‹ค๋ฅธ agent์˜ ํ•™์Šต์„ ๊ณ ๋ คํ•ด ์ž์‹ ์˜ ํ•™์Šต์„ ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” value function์— ๋Œ€ํ•ด ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์„ ๋•Œ IPD์—์„œ NL์€ defact ์ „๋žต์— ์ˆ˜๋ ดํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์˜€์ง€๋งŒ LOLA๋Š” tit-for-tat ์ „๋žต์ด ์šฐ์œ„๋ฅผ ์ ํ•˜๋Š” ๊ฒƒ์„ ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, LOLA๊ฐ€ IMP์—์„œ๋„ ๋‚ด์‰ฌ ๊ท ํ˜•์„ ์ด๋ฃจ๋Š” ๊ฒƒ์„ ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๋‹ค๋ฅธ multi-agent learning algorithm๊ณผ๋„ IPD์™€ IMP์—์„œ ์‹คํ—˜์ ์œผ๋กœ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ด๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค.

value function์— ์ง์ ‘ ์ ‘๊ทผํ•˜์ง€ ๋ชปํ•  ๋•Œ์— ๋Œ€ํ•ด gradient-based version LOLA๋ฅผ ์†Œ๊ฐœํ•˜๊ณ , Coin Game์—์„œ ์ด์˜ ํ•™์Šต์„ ๋ณด์ž…๋‹ˆ๋‹ค. ์ด ๋•Œ, recurrent layer๊ฐ€ ํ•„์š”ํ•จ์„ ๋ณด์•˜๊ณ , LOLA๋Š” coordinationํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋„๋Š” ๊ฒƒ์„ ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด๋•Œ, opponent์˜ parameter ์ •๋ณด๊ฐ€ ์—†์–ด๋„ ์ด๋ฅผ ์–ด๋Š์ •๋„ ํ•ด๊ฒฐํ•  ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์„ค๋ช…ํ–ˆ๊ณ , LOLA์˜ high-order approximation์— ๋Œ€ํ•ด ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด๋Š” IPD์—์„œ agent ๋ชจ๋‘ LOLA๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๊ณ  high order approximation๋กœ ์–ป๋Š” ์ถ”๊ฐ€์ ์ธ ์†Œ๋“์€ ์—†์—ˆ์Šต๋‹ˆ๋‹ค.

์ €์ž๋Š” ์ดํ›„ future work๋กœ ์ ๋Œ€์ ์ธ agent๊ฐ€ gradient-based method๊ฐ€ ์•„๋‹Œ, global search method๋ฅผ ํ†ตํ•ด LOLA๋ฅผ ์ด์šฉํ•˜๋ ค ๋“ค ๋•Œ์— ์–ด๋–ป๊ฒŒ LOLA์˜ ์ทจ์•ฝ์ ์„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์„์ง€์— ๋Œ€ํ•ด ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์„ ์—ฐ๊ตฌํ•˜๊ฒ ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. LOLA๊ฐ€ naive learner๋ฅผ ์ด์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ธ ๋งŒํผ LOLA learner๋ฅผ ์ด์šฉํ•  ์ˆ˜๋‹จ์ด ์žˆ์„ ๊ฒƒ์ด ํƒ€๋‹น์„ฑ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
