4.4 Common Knowledge


An important property of the binary mask $\mu^a$ is that it depends only on whether agent $a$ can see entity $e$. Therefore, if all of agent $a$'s masks $\mu^a$ are common knowledge, then another agent $b$ that can see $a$ knows, for $a$ and for the entities $e$ it can see, that $a, e \in \mathcal{M}^b_s$.

Writing the mutual knowledge of a group $\mathcal{G} \subseteq \mathcal{A}$ as $\mathcal{M}^\mathcal{G}_s$, it can be expressed as follows.

$$\mathcal{M}^\mathcal{G}_s = \bigcap_{a \in \mathcal{G}} \mathcal{M}^a_s$$
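As a small illustration (not from the paper), here is a minimal Python sketch of this intersection, assuming each agent's knowledge $\mathcal{M}^a_s$ is represented as a plain set of hypothetical entity identifiers:

```python
# Mutual knowledge of a group as the intersection of per-agent knowledge sets.
# Entity identifiers and the knowledge sets below are hypothetical.

def mutual_knowledge(knowledge_by_agent, group):
    """M^G_s = intersection over a in G of M^a_s."""
    sets = [set(knowledge_by_agent[a]) for a in group]
    return set.intersection(*sets) if sets else set()

knowledge = {
    "a": {"e1", "e2", "e3"},
    "b": {"e2", "e3", "e4"},
    "c": {"e2", "e3"},
}
print(mutual_knowledge(knowledge, ["a", "b", "c"]))  # {'e2', 'e3'} (order may vary)
```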

However, this mutual knowledge does not imply common knowledge; it is merely the set of entities that all agents happen to know. The common knowledge of a group is instead written $\mathcal{L}^\mathcal{G}_s$, which by definition is information that the group $\mathcal{G}$ knows, that everyone in $\mathcal{G}$ knows that everyone in $\mathcal{G}$ knows, and so on.

For agent $a$ to know that another agent $b$ also sees $e \in \xi$, agent $a$ must itself see $b$ and must know that $b$ sees $e$, which can be written as:

$$\mu^a(s^a,s^b) \wedge \mu^b(s^b,s^e) = \top$$

$\mathcal{L}^\mathcal{G}_s$ can also be expressed recursively. For any agent $a$ belonging to the group $\mathcal{G}$:

$$\mathcal{L}^{\mathcal{G}}_s = \lim_{m\rightarrow \infty}\mathcal{L}^{a,m}_s, \qquad \mathcal{L}^{a,0}_s = \mathcal{M}^a_s$$

$$\mathcal{L}^{a,m}_s = \bigcap_{b\in\mathcal{G}}\{e\in \mathcal{L}^{b,m-1}_s \mid \mu^a(s^a,s^b)\} \qquad (4.4.2)$$
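To make the recursion concrete, the following is a minimal sketch of (4.4.2), assuming set-valued knowledge and a hypothetical visibility predicate `mu(a, b)` standing in for $\mu^a(s^a, s^b)$; it iterates until a fixed point instead of literally taking $m \rightarrow \infty$:

```python
# Sketch of the recursion (4.4.2):
#   L^{a,0}_s = M^a_s
#   L^{a,m}_s = intersection over b in G of { e in L^{b,m-1}_s | mu^a(s^a, s^b) }
# If agent a cannot see some b in the group, that conditioned set is empty,
# so the whole intersection collapses to the empty set.

def ck_step(prev_levels, group, mu):
    """One iteration of (4.4.2); prev_levels maps agent -> L^{b,m-1}_s."""
    new_levels = {}
    for a in group:
        per_b = [set(prev_levels[b]) if mu(a, b) else set() for b in group]
        new_levels[a] = set.intersection(*per_b)
    return new_levels

def common_knowledge(knowledge_by_agent, group, mu, max_iters=10):
    """Iterate (4.4.2) from L^{a,0}_s = M^a_s until a fixed point is reached."""
    levels = {a: set(knowledge_by_agent[a]) for a in group}
    for _ in range(max_iters):
        nxt = ck_step(levels, group, mu)
        if nxt == levels:
            break
        levels = nxt
    return levels

# Hypothetical usage: two agents that see each other (and themselves).
knowledge = {"a": {"e1", "e2"}, "b": {"e2", "e3"}}
visible = {("a", "b"), ("b", "a")}
mu = lambda x, y: x == y or (x, y) in visible
print(common_knowledge(knowledge, ["a", "b"], mu))  # {'a': {'e2'}, 'b': {'e2'}}
```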

At iteration $m = 0$, $\mathcal{L}^{a,0}_s$ is simply agent $a$'s own knowledge $\mathcal{M}^a_s$: what an agent knows by itself is trivially common knowledge with itself.

Let us carry out the iteration for $m = 1$.

$$\mathcal{L}^{a,1}_s = \bigcap_{b\in\mathcal{G}}\{e\in \mathcal{L}^{b,0}_s \mid \mu^a(s^a,s^b)\} = \bigcap_{b\in\mathcal{G}}\{e\in \mathcal{M}^{b}_s \mid \mu^a(s^a,s^b)\}$$

Interpreting this: for every agent $b$ that $a$ can see, $a$ collects the entities contained in $b$'s knowledge $\mathcal{M}^b_s$; taking the intersection over the whole group, $\mathcal{L}^{a,1}_s$ is the set of entities that, from $a$'s point of view, everyone in the group knows.

Let us take the iteration one step further, to $m = 2$.

$$\mathcal{L}^{a,2}_s = \bigcap_{b\in\mathcal{G}}\{e\in \mathcal{L}^{b,1}_s \mid \mu^a(s^a,s^b)\}$$

In other words, from $a$'s point of view, every agent it can see knows $\mathcal{L}^{b,1}_s$, so each agent now knows that the others know. Repeating this process and taking the limit $m \rightarrow \infty$, the mutual knowledge turns into the common knowledge $\mathcal{L}^\mathcal{G}_s$.

Lemma 4.4.1

์œ„์˜ ์žฌ๊ท€์ ์ธ ํ‘œํ˜„์—์„œ๋Š” ๊ณต์ง‘ํ•ฉ์—๋Œ€ํ•ด roughํ•˜๊ฒŒ ๋ณด์—ฌ์คฌ์ง€๋งŒ ์—ฌ๊ธฐ์„œ๋Š” ์ข€ ๋” ์—„๊ฒฉํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚ด์—ˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Thus, no matter which agent's knowledge within the group the recursion starts from, the result is the same, as long as all agents can see one another.

Common knowledge can only be obtained by computation from mutual knowledge that is visible to every agent in the group. An action selected by the policy is itself common knowledge, because it depends only on the common knowledge and on a commonly known rule and random seed used to select it.
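Since such an action depends only on commonly known inputs, every group member can reproduce the same draw locally. A minimal sketch of that idea, assuming a shared integer seed and a dummy uniform policy (all names below are hypothetical):

```python
import random

def sample_group_action(common_obs, joint_actions, shared_seed):
    # Seed a local RNG with quantities that are common knowledge only, so every
    # agent in the group reproduces exactly the same draw.
    rng = random.Random(f"{shared_seed}:{sorted(common_obs)}")
    return rng.choice(joint_actions)

ck = {"e2", "e3"}                        # common knowledge of the group
actions = ["attack", "retreat", "hold"]  # hypothetical joint actions
# Two agents running the same computation locally obtain the identical action.
assert sample_group_action(ck, actions, 42) == sample_group_action(ck, actions, 42)
```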

If all masks $\mu$ are known to every agent, the common knowledge $\mathcal{L}^\mathcal{G}_s$ can be defined as follows.

$$\mathcal{L}^\mathcal{G}_s = \begin{cases} \mathcal{M}^\mathcal{G}_s, & \text{if } \bigwedge_{a,b\in \mathcal{G}}\mu^a(s^a,s^b) \\ \emptyset, & \text{otherwise} \end{cases}$$
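A minimal sketch of this closed form, under the same hypothetical set-based representation as above:

```python
# Closed form: if every ordered pair of agents in G can see each other, the
# common knowledge equals the group's mutual knowledge M^G_s; otherwise it is
# empty. The visibility predicates below are hypothetical.

def common_knowledge_closed_form(knowledge_by_agent, group, mu):
    if not all(mu(a, b) for a in group for b in group if a != b):
        return set()
    return set.intersection(*(set(knowledge_by_agent[a]) for a in group))

knowledge = {"a": {"e1", "e2"}, "b": {"e2", "e3"}}
see_all = lambda a, b: True
see_none = lambda a, b: a == b
print(common_knowledge_closed_form(knowledge, ["a", "b"], see_all))   # {'e2'}
print(common_knowledge_closed_form(knowledge, ["a", "b"], see_none))  # set()
```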

Interpreting this: when every agent observes every other agent, the mutual knowledge $\mathcal{M}^\mathcal{G}_s$ that they share is exactly the common knowledge $\mathcal{L}^\mathcal{G}_s$.

To derive this, consider (4.4.2). Starting the recursion from $\mathcal{L}^{a,0}_s = \mathcal{M}^a_s$, we obtain:

$$\mathcal{L}^{a,1}_s = \begin{cases} \mathcal{M}^\mathcal{G}_s, & \text{if } \bigwedge_{b\in \mathcal{G}}\mu^a(s^a,s^b) \\ \emptyset, & \text{otherwise} \end{cases}$$

Now suppose inductively that after $m$ iterations $\mathcal{L}^{c,m}_s = \mathcal{M}^{\mathcal{G}}_s$ for every agent $c$ in the group; as we saw above, it takes two iterations for mutual knowledge to become common knowledge. This can be formalized as follows.

$$\begin{aligned}\mathcal{L}^{a,m+2}_s &= \{e\in\xi \mid \bigwedge_{b\in\mathcal{G}}(\mu^a(s^a,s^b)\wedge \bigwedge_{c\in\mathcal{G}}(\mu^b(s^b,s^c)\wedge e\in \mathcal{L}^{c,m}_s))\} \\ &= \{ e \in \mathcal{M}^\mathcal{G}_s \mid \bigwedge_{b,c\in\mathcal{G}}\mu^b(s^b,s^c)\} = \mathcal{L}^\mathcal{G}_s\end{aligned}$$
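As a quick sanity check (not part of the paper), one can also verify on small random instances that iterating (4.4.2) reaches the same result as the closed form above; a self-contained sketch with hypothetical knowledge sets and visibility relations:

```python
import itertools, random

def closed_form(M, group, mu):
    # L^G_s = M^G_s if every ordered pair sees each other, else the empty set.
    if not all(mu(a, b) for a, b in itertools.permutations(group, 2)):
        return set()
    return set.intersection(*(set(M[a]) for a in group))

def recursion(M, group, mu, iters=4):
    # Iterate (4.4.2) starting from L^{a,0}_s = M^a_s.
    L = {a: set(M[a]) for a in group}
    for _ in range(iters):
        L = {a: set.intersection(*[set(L[b]) if mu(a, b) else set()
                                   for b in group]) for a in group}
    return L

rng = random.Random(0)
group = ["a", "b", "c"]
for _ in range(100):
    # Random per-agent knowledge sets and a random visibility relation.
    M = {a: {e for e in "e1 e2 e3 e4".split() if rng.random() < 0.7} for a in group}
    vis = {p for p in itertools.permutations(group, 2) if rng.random() < 0.5}
    mu = lambda x, y, vis=vis: x == y or (x, y) in vis
    target = closed_form(M, group, mu)
    assert all(recursion(M, group, mu)[a] == target for a in group)
print("recursion matches the closed form on all sampled instances")
```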

Over time, the common knowledge of a group consists of the initial trajectory $\tau_0$ together with the subsequently observed group trajectory $\tau^\mathcal{G}_t = (\tau_0, o^\mathcal{G}_1, \mathbf{u}^\mathcal{G}_1, \ldots, o^\mathcal{G}_t, \mathbf{u}^\mathcal{G}_t)$, where $o^\mathcal{G}_k = \{s^e_k \mid e \in \mathcal{L}^{\mathcal{G}}_{s_k}\}$. Knowing all masks $\mu^a$ means that each agent can derive $\tau^{\mathcal{G}}_t = \mathcal{L}^\mathcal{G}(\tau^a_t)$ from its own trajectory $\tau^a_t = (\tau_0, o^a_1, \ldots, o^a_t)$. Consequently, any function conditioned on $\tau^{\mathcal{G}}$ can be computed independently by every agent in the group.
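A minimal sketch of this last point, assuming each observation is a hypothetical dict from entity id to features and that the commonly known entity set $\mathcal{L}^\mathcal{G}_{s_k}$ at each step is already available (joint actions are omitted for brevity):

```python
def common_observation(own_observation, commonly_known_entities):
    """o^G_k = { s^e_k : e in L^G_{s_k} }, filtered from the agent's own observation."""
    return {e: feats for e, feats in own_observation.items()
            if e in commonly_known_entities}

def group_trajectory(own_trajectory, commonly_known_per_step):
    """A simplified tau^G_t derived locally from tau^a_t = (tau_0, o^a_1, ..., o^a_t)."""
    tau_0, own_obs = own_trajectory[0], own_trajectory[1:]
    return [tau_0] + [common_observation(o, ck)
                      for o, ck in zip(own_obs, commonly_known_per_step)]

# Hypothetical example: two timesteps, entity features are just position tuples.
tau_a = ["tau_0", {"e1": (0, 0), "e2": (1, 1)}, {"e2": (1, 2), "e3": (3, 3)}]
ck_per_step = [{"e2"}, {"e2", "e3"}]
print(group_trajectory(tau_a, ck_per_step))
# ['tau_0', {'e2': (1, 1)}, {'e2': (1, 2), 'e3': (3, 3)}]
```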