3.2 Related Work

Earlier MARL research began with experiments on very simple environments. The introduction of IQL, discussed earlier, and its application to two-player Pong then laid much of the groundwork for DMARL.
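To make the IQL idea concrete, here is a minimal tabular sketch in which each agent keeps its own Q-table and treats the other agent as part of the environment. It is an illustration only: the class name, the hyperparameters, and the one-state coordination game (shared reward 1 only when both agents pick action 1) are assumptions, not the deep-network Pong setup referred to above.

```python
import random
from collections import defaultdict


class IndependentQLearner:
    """One agent's tabular Q-learner; other agents are just part of the environment."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, epsilon=0.3, seed=0):
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate (assumed value)
        self.gamma = gamma      # discount factor (assumed value)
        self.epsilon = epsilon  # exploration rate (assumed value)
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def act(self, obs):
        # Epsilon-greedy over this agent's own Q-values only.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        values = self.q[obs]
        return values.index(max(values))

    def update(self, obs, action, reward, next_obs, done):
        # Standard Q-learning target; no information about other agents is used.
        target = reward if done else reward + self.gamma * max(self.q[next_obs])
        self.q[obs][action] += self.alpha * (target - self.q[obs][action])


def train_coordination_game(episodes=2000):
    """Hypothetical toy game: a single state, shared reward 1 only if both pick action 1."""
    agents = [IndependentQLearner(n_actions=2, seed=i) for i in range(2)]
    obs = 0
    for _ in range(episodes):
        actions = [agent.act(obs) for agent in agents]
        reward = 1.0 if actions == [1, 1] else 0.0
        for agent, action in zip(agents, actions):
            agent.update(obs, action, reward, obs, done=True)
    return agents
```

Calling `train_coordination_game()` returns the two learners. Each one only ever sees its own action and the shared reward, which is why the other agent's exploration shows up as noise in its Q-values.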

Researchers also recognized the need for communication between agents and studied it, mainly through two approaches: passing gradients between agents, and sharing parameters across agents. However, these approaches are limited in that they do not use additional state information during training (for example, the global state that a centralized critic receives) and do not address the Credit Assignment Problem.
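Parameter sharing, the second approach above, can be sketched as a single set of weights that every agent evaluates and updates. The linear scorer, the one-hot agent id appended to the observation, and all names below are assumptions chosen for brevity, not the architecture of any cited paper.

```python
class SharedLinearPolicy:
    """A single linear scorer whose weights are used by every agent."""

    def __init__(self, obs_dim, n_agents, n_actions):
        self.n_agents = n_agents
        in_dim = obs_dim + n_agents  # observation plus one-hot agent id (assumption)
        self.weights = [[0.0] * in_dim for _ in range(n_actions)]

    def features(self, obs, agent_id):
        # The agent id lets the shared weights still produce agent-specific outputs.
        one_hot = [1.0 if i == agent_id else 0.0 for i in range(self.n_agents)]
        return list(obs) + one_hot

    def scores(self, obs, agent_id):
        x = self.features(obs, agent_id)
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]

    def update(self, obs, agent_id, action, target, lr=0.1):
        """One squared-error gradient step on the shared weights."""
        x = self.features(obs, agent_id)
        error = target - self.scores(obs, agent_id)[action]
        row = self.weights[action]
        for i, xi in enumerate(x):
            row[i] += lr * error * xi
```

Because the weights are shared, an update computed from agent 0's experience also changes what agent 1 would score for the same observation, while the one-hot id keeps the two outputs from being identical.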

Gupta, Egorov, and Kochenderfer studied actor-critic methods with centralized training and decentralized execution. However, every agent's critic conditions only on local observations, and the credit assignment problem is handled only by constructing local rewards, which can be seen as a limitation.

Applications of RL to StarCraft micromanagement have mostly used a centralized controller with access to the full state, even though their architectures exploit the multi-agent nature of the problem. Usunier's work uses a greedy MDP, in which at each timestep an agent selects its action given all of the actions already chosen by the other agents. The Zero-order (ZO) backpropagation algorithm in that paper makes this easier to understand. Peng's work uses an RNN so that information is exchanged between agents. Usunier's work also defines experiments similar to the ones used here and builds a DQN baseline. Omidshafiei's work addresses the stability of experience replay under decentralized training.
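The sequential selection in the greedy MDP can be sketched as a loop in which each agent picks its action after seeing the actions already chosen in the same timestep. This covers only that selection step, not the zero-order optimization from the paper. The scoring function, the toy state (a list of enemy health values), and the fixed damage per attack are hypothetical.

```python
def greedy_sequential_actions(state, n_agents, actions, score):
    """Pick actions agent by agent, each conditioned on the earlier choices."""
    chosen = []
    for agent_id in range(n_agents):
        previous = tuple(chosen)
        best = max(actions, key=lambda a: score(state, agent_id, previous, a))
        chosen.append(best)
    return chosen


def avoid_overkill_score(state, agent_id, previous, action, damage=2):
    """Hypothetical score: attack the weakest enemy that earlier agents have not already finished."""
    remaining = state[action] - damage * previous.count(action)
    if remaining <= 0:
        return float("-inf")  # earlier agents already deal enough damage to this target
    return -remaining         # otherwise prefer the target with the least remaining health
```

With the toy state `[5, 3, 8]`, four agents, and targets `[0, 1, 2]`, `greedy_sequential_actions([5, 3, 8], 4, [0, 1, 2], avoid_overkill_score)` returns `[1, 1, 0, 0]`: the first two agents finish the weakest enemy, and the later agents, seeing those choices, move on to the next target.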

The work of Rashid and Sunehag proposed a centralized critic for each agent, and Lowe's work proposed a centralized critic (the text calls it a single critic, but MADDPG itself has multiple Q networks) and used it to train decentralized actors. This resembles COMA, and in fact that work was carried out almost concurrently with the idea presented here. However, it makes no attempt to address the Credit Assignment Problem.
