8.3.5 Higher-Order LOLA

LOLA์˜ objective์—์„œ opponent์— ๋ฏธ์†Œ๋ณ€ํ™”๋Ÿ‰์— ๋”ฐ๋ฅธ expected discounted return ์€ 1์ฐจ ํ…Œ์ผ๋Ÿฌ ๊ทผ์‚ฌ๋ฅผ ํ†ตํ•ด ์ด๋ฃจ์–ด์กŒ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋” ๋†’์€ ์ฐจ์ˆ˜์˜ ๊ทผ์‚ฌ๋„ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋งŽ์€ ๊ณ„์‚ฐ๋Ÿ‰๊ณผ ๋†’์€ variance๋ฅผ ๊ฐ€์ง€๊ฒ ์ง€๋งŒ ์ข€ ๋” ์ •ํ™•ํ•œ ๊ทผ์‚ฌ๊ฐ’์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Last updated