7.4.7 Results on Hanabi

The BAD agent achieves state-of-the-art (SOTA) performance on two-player Hanabi. Consider panel (a) of the graph below.

Panel (a) plots the training curves of the BAD agent and two LSTM agents. At test time, the LSTM agents use the best version among the learned policies, which yields slightly better performance. To select an agent, each candidate is evaluated over 10,000 games, and the best one is then evaluated over 100,000 games. The BAD agent is selected in a similar way, with additional hyperparameter choices: α, which controls how much of V1 is mixed in, and the number of cards held in hand.
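The two-stage selection described above can be sketched as follows. This is only an illustration under stated assumptions: `play_game`, `evaluate`, `select_and_report`, and the toy agents are hypothetical names introduced here, and the scores are random stand-ins rather than real Hanabi results.

```python
import random
from statistics import mean


def evaluate(play_game, n_games, seed):
    """Mean score of one agent over n_games independent games.

    play_game is a hypothetical callable: it takes a random.Random and
    returns the final score of a single game.
    """
    rng = random.Random(seed)
    return mean(play_game(rng) for _ in range(n_games))


def select_and_report(candidates, n_select=10_000, n_final=100_000):
    """Pick the best candidate on n_select games, then re-test it on n_final."""
    # Stage 1: every candidate gets the same, smaller evaluation budget.
    selection_scores = {
        name: evaluate(play_game, n_select, seed=1)
        for name, play_game in candidates.items()
    }
    best = max(selection_scores, key=selection_scores.get)
    # Stage 2: only the winner is re-evaluated, on fresh games, so the
    # reported number is not inflated by having been used for selection.
    final_score = evaluate(candidates[best], n_final, seed=2)
    return best, final_score


# Toy stand-ins for trained agents: scores are random, not real Hanabi results.
candidates = {
    "agent_a": lambda rng: rng.randint(15, 23),
    "agent_b": lambda rng: rng.randint(18, 25),
}
best, final_score = select_and_report(candidates, n_select=1_000, n_final=5_000)
print(best, round(final_score, 2))
```

Re-evaluating the winner on a separate, larger set of games is the point of the second stage: the score that picked the agent is biased upward by the selection itself, so it should not be the reported score.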

Other methods that score below 20 points are omitted for readability. Even under the variant of the Hanabi rules in which three failed plays give a score of 0, BAD scores about 23.9 points, still outperforming the heuristic, rule-based approaches.

The actual gameplay of the BAD agent is not always easy to follow, but analyzing its games reveals several conventions. One convention shared by all high-scoring agents is that a "red" or "yellow" hint about a newly drawn card means the card can be played. In addition, in 25% of cases, a "white" or "blue" hint about a new card is used as a convention meaning the card should be discarded.

Panel (c) of the figure above shows the cross entropy of V0, V1, and V2 over training iterations. When the belief update is applied iteratively, the cross entropy drops substantially compared with the original belief. This indicates that learning conventions is important for successful gameplay.
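To make the cross-entropy comparison concrete, here is a minimal sketch of how a belief over a hidden hand could be scored against the true cards. It assumes a simplified per-slot belief (one probability table per hand slot) and ignores how many copies of each card exist; `belief_cross_entropy`, `peaked`, and the example hands are hypothetical and are not the paper's implementation.

```python
import math


def belief_cross_entropy(belief, true_hand, eps=1e-12):
    """Average negative log-probability the belief assigns to the true cards.

    belief: one dict per hand slot mapping card -> probability.
    true_hand: the card actually held in each slot.
    """
    total = 0.0
    for slot_belief, card in zip(belief, true_hand):
        # Clamp so a card given zero probability yields a large finite penalty.
        total -= math.log(max(slot_belief.get(card, 0.0), eps))
    return total / len(true_hand)


# 25 card types (5 colours x 5 ranks); counts per type are ignored here.
cards = [(colour, rank) for colour in "RYGWB" for rank in range(1, 6)]
true_hand = [("R", 1), ("B", 3)]

# A belief with no information: every card type is equally likely.
uniform = [{c: 1 / len(cards) for c in cards} for _ in true_hand]


def peaked(card, p=0.8):
    """Belief that puts probability p on one card and spreads the rest evenly."""
    rest = (1 - p) / (len(cards) - 1)
    return {c: (p if c == card else rest) for c in cards}


# A sharper belief, as one might hold after interpreting a partner's hints.
sharp = [peaked(card) for card in true_hand]

ce_uniform = belief_cross_entropy(uniform, true_hand)  # equals log(25)
ce_sharp = belief_cross_entropy(sharp, true_hand)      # equals -log(0.8)
print(ce_uniform, ce_sharp)
```

A lower value means the belief concentrates more probability on the cards actually held, which is the sense in which a belief that accounts for the partner's conventions is "better" in panel (c).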
