5.5.2 Fingerprints

This experiment uses the exploration rate ε and the episode index e as the fingerprint. The graph shows that XP+FP performs clearly better, which indicates that the fingerprint conveys a useful signal about the other agents' policies. The network still sees a wide range of input states, but the fingerprint tells it which stage of training each sample came from, so it can map those inputs accordingly.
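The idea of tagging each stored observation with (ε, e) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the helper `add_fingerprint`, the `ReplayBuffer` class, and the choice to normalise the episode index by a hypothetical `max_episodes` are all assumptions made for the example.

```python
from collections import deque
import random


def add_fingerprint(observation, epsilon, episode, max_episodes):
    """Append the fingerprint (epsilon, normalised episode index) to an observation.

    Normalising the episode index by max_episodes is an assumption made here
    to keep the extra inputs on a similar scale; the source does not specify it.
    """
    return list(observation) + [float(epsilon), episode / max_episodes]


class ReplayBuffer:
    """Hypothetical fixed-size replay buffer holding fingerprinted transitions."""

    def __init__(self, capacity):
        # Oldest transitions are dropped automatically once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs):
        self.storage.append((obs, action, reward, next_obs))

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)


# The same raw observation, recorded at two different points in training.
raw_obs = [0.2, -1.0, 0.5]
early = add_fingerprint(raw_obs, epsilon=1.0, episode=0, max_episodes=1000)
late = add_fingerprint(raw_obs, epsilon=0.05, episode=900, max_episodes=1000)

buffer = ReplayBuffer(capacity=2)
buffer.add(early, action=0, reward=0.0, next_obs=early)
buffer.add(late, action=1, reward=1.0, next_obs=late)
```

Because `early` and `late` differ only in their last two entries, a Q-network trained on them can assign different values to the same raw observation depending on when in training it was collected.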

XP+IS+FP does not improve on this any further. Importance sampling and fingerprints both try to solve the same problem, so the lack of additional gain from combining them suggests that the two play a similar role.

The next figure shows how ε decays over episodes and how the value function changes along with it.

As training progresses, the same state is assigned a higher value when ε is smaller. This can be read as the agent having learned the best response to the other agents' policies at that stage of training.
