5.3.1 Multi-Agent Importance Sampling

This section shows how the non-stationarity problem of IQL can be addressed with importance sampling. In standard RL, importance sampling is used for off-policy learning: when the distribution induced by the target policy differs from the distribution of the collected data, the data is re-weighted to correct the mismatch. The basic idea of off-environment learning is to apply the same correction to the mismatch between the distribution induced by the current environment and the distribution of data collected in a different environment. From the perspective of a single agent, the environment changes every time the other agents update their policies, so this off-environment correction can be used to address the replay problem.
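
As a toy illustration of the importance-sampling idea itself (not part of the original derivation; the distributions below are arbitrary), this sketch estimates an expectation under a target distribution p from samples drawn under a different distribution q by re-weighting each sample with p(x)/q(x):

```python
import numpy as np

# Toy importance-sampling sketch: estimate E_p[x^2] under p = N(1, 1)
# using samples drawn from q = N(0, 1), re-weighted by p(x) / q(x).
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # samples from q

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

w = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 1.0)  # importance weights p/q
print((w * x ** 2).mean())                             # β‰ˆ E_p[x^2] = 1^2 + 1 = 2
```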

In a Fully Observable MARL Environment

If the Q-function has access to the true state s, then, given the other agents' policies, the Bellman optimality equation for a single agent a can be written as:

$$
Q^*_a(s, u^a \mid \boldsymbol{\pi}^{-a}) = \sum_{\mathbf{u}^{-a}} \boldsymbol{\pi}^{-a}(\mathbf{u}^{-a} \mid s) \left[ r(s, u^a, \mathbf{u}^{-a}) + \gamma \sum_{s'} P(s' \mid s, u^a, \mathbf{u}^{-a}) \max_{u'^a} Q^*_a(s', u'^a) \right]
$$
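
As a minimal numerical sketch of what this fixed-point equation expresses (a hypothetical 2-state, 2-action example with randomly generated r, P, and a fixed Ο€^{-a}, none of which come from the original text), one can iterate the Bellman operator for agent a with the other agent's actions marginalised out:

```python
import numpy as np

# Hypothetical tabular example: 2 states, 2 actions for agent a, 2 actions
# for the other agent. r, P, and pi_other are made up for illustration.
n_s, n_ua, n_uo = 2, 2, 2
gamma = 0.9
rng = np.random.default_rng(0)

r = rng.uniform(size=(n_s, n_ua, n_uo))              # r(s, u^a, u^{-a})
P = rng.uniform(size=(n_s, n_ua, n_uo, n_s))         # P(s' | s, u^a, u^{-a})
P /= P.sum(axis=-1, keepdims=True)
pi_other = np.array([[0.7, 0.3], [0.4, 0.6]])        # pi^{-a}(u^{-a} | s), fixed

Q = np.zeros((n_s, n_ua))
for _ in range(500):                                 # fixed-point iteration
    v_next = Q.max(axis=1)                           # max_{u'^a} Q*_a(s', u'^a)
    target = r + gamma * np.einsum('suop,p->suo', P, v_next)
    Q = np.einsum('so,suo->su', pi_other, target)    # marginalise u^{-a} under pi^{-a}

print(Q)                                             # approximates Q*_a(s, u^a | pi^{-a})
```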

이 λ•Œ, μ‹œκ°„μ΄ 지남에 따라, agent의 policyκ°€ λ³€ν•˜λ―€λ‘œ 이λ₯Ό κΈ°λ‘ν•˜κΈ° μœ„ν•œ μ‹œκ°„μ„ 넣은 tuple을 λ§Œλ“€λ©΄ λ‹€μŒκ³Ό 같이 ν‘œκΈ° κ°€λŠ₯ν•©λ‹ˆλ‹€.

$$
\left\langle s, u^a, r, \boldsymbol{\pi}(\mathbf{u}^{-a} \mid s), s' \right\rangle^{(t_c)}
$$
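
A minimal sketch of such an augmented replay memory (the names Transition and ReplayMemory are my own, not from the original):

```python
from collections import namedtuple, deque

# Each stored transition also records the other agents' action probabilities
# pi(u^{-a}|s) and the collection time t_c, so importance weights can be
# recomputed when the transition is replayed later.
Transition = namedtuple("Transition", ["s", "u_a", "r", "pi_other", "s_next", "t_c"])

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *fields):
        self.buffer.append(Transition(*fields))
```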

The loss at replay time $t_r$ can then be written as:

$$
\mathcal{L}(\theta) = \sum_{i=1}^{b} \frac{\boldsymbol{\pi}^{-a}_{t_r}(\mathbf{u}^{-a} \mid s)}{\boldsymbol{\pi}^{-a}_{t_i}(\mathbf{u}^{-a} \mid s)} \left[ \left( y_i^{\mathrm{DQN}} - Q(s, u; \theta) \right)^2 \right]
$$
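
A rough PyTorch sketch of this importance-weighted loss, under the assumption that each sampled transition carries both the probability the other agents assigned to their recorded joint action at collection time (pi_other_then) and the probability under their current policies (pi_other_now); both names are introduced here for illustration:

```python
import torch

def importance_weighted_td_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD errors scaled by pi^{-a}_{t_r}(u^{-a}|s) / pi^{-a}_{t_i}(u^{-a}|s)."""
    s, u_a, r, s_next, pi_other_then, pi_other_now = batch     # tensors of shape (b, ...)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values   # y_i^DQN (target network)
        w = pi_other_now / pi_other_then.clamp(min=1e-8)       # importance weights
    q = q_net(s).gather(1, u_a.unsqueeze(1)).squeeze(1)        # Q(s, u^a; theta)
    return (w * (y - q) ** 2).mean()
```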

In a Partially Observable MARL Environment

In the partially observable setting, the equations become more involved because the action-observation histories are coupled not only to the agents' policies but also to the transition and observation functions. To handle this, we extend the earlier definitions.

The augmented state space is $\hat{s} = \{s, \boldsymbol{\tau}^{-a}\} \in \hat{S} = S \times T^{n-1}$, which includes the other agents' action-observation histories up to the current step. The corresponding observation function is $\hat{O}(\hat{s}, a) = O(s, a)$. The reward function is defined over the joint action as $\hat{r}(\hat{s}, u) = \sum_{\mathbf{u}^{-a}} \boldsymbol{\pi}^{-a}(\mathbf{u}^{-a} \mid \boldsymbol{\tau}^{-a})\, r(s, \mathbf{u})$. Finally, the transition probability function $\hat{P}$ is defined as

$$
\hat{P}(\hat{s}' \mid \hat{s}, u) = P(s', \boldsymbol{\tau}' \mid s, \boldsymbol{\tau}, u) = \sum_{\mathbf{u}^{-a}} \boldsymbol{\pi}^{-a}(\mathbf{u}^{-a} \mid \boldsymbol{\tau}^{-a})\, P(s' \mid s, \mathbf{u})\, p(\boldsymbol{\tau}'^{-a} \mid \boldsymbol{\tau}^{-a}, \mathbf{u}^{-a}, s')
$$

둜 μ •μ˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€. 바뀐 μ •μ˜λ₯Ό κ°€μ§€κ³  λ‹€μ‹œ Bellman Equationλ₯Ό λ‚˜νƒ€λ‚΄λ³΄κ² μŠ΅λ‹ˆλ‹€.

$$
Q(\tau, u) = \sum_{\hat{s}} p(\hat{s} \mid \tau) \left[ \hat{r}(\hat{s}, u) + \gamma \sum_{\tau', \hat{s}', u'} \hat{P}(\hat{s}' \mid \hat{s}, u)\, \pi(u' \mid \tau')\, p(\tau' \mid \tau, \hat{s}', u)\, Q(\tau', u') \right]
$$

λ‹€μŒκ³Ό 같이 action-observation histories Ο„\tauτ에 λ”°λ₯Έ state s^\hat{s}s^둜 κ°ˆν™•λ₯ μ— λ”°λ₯Έ κ°’μœΌλ‘œ λ‚˜νƒ€λ‚©λ‹ˆλ‹€. μ΄λ•Œ 양변에 βˆ‘uβˆ’aΟ€βˆ’a(uβˆ’aβˆ£Ο„βˆ’a)\sum_{\bold{u}^{-a}}{\bm{\pi}^{-a}(\bold{u}^{-a}|\bm{\tau}^{-a})}βˆ‘uβˆ’aβ€‹Ο€βˆ’a(uβˆ’aβˆ£Ο„βˆ’a) λ₯Ό κ³±ν•΄μ£Όλ©΄, μ •μ˜λ“€μ— μ˜ν•΄ λ‹€μŒ 처럼 μ •μ˜ κ°€λŠ₯ν•©λ‹ˆλ‹€.

$$
Q(\tau, u) = \sum_{\hat{s}} p(\hat{s} \mid \tau) \sum_{\mathbf{u}^{-a}} \boldsymbol{\pi}^{-a}(\mathbf{u}^{-a} \mid \boldsymbol{\tau}^{-a}) \left[ r(s, \mathbf{u}) + \gamma \sum_{\tau', \hat{s}', u'} P(s' \mid s, \mathbf{u})\, p(\boldsymbol{\tau}'^{-a} \mid \boldsymbol{\tau}^{-a}, \mathbf{u}^{-a}, s')\, \pi(u' \mid \tau')\, p(\tau' \mid \tau, \hat{s}', u)\, Q(\tau', u') \right]
$$

Whereas in the fully observable case the correction involved $\boldsymbol{\pi}^{-a}(\mathbf{u}^{-a} \mid s)$, the other agents' policies now depend on $\boldsymbol{\pi}^{-a}(\mathbf{u}^{-a} \mid \boldsymbol{\tau}^{-a})$, so the importance weights $\frac{\boldsymbol{\pi}^{-a}_{t_r}(\mathbf{u}^{-a} \mid s)}{\boldsymbol{\pi}^{-a}_{t_i}(\mathbf{u}^{-a} \mid s)}$ used in the loss above provide only an approximate correction in the partially observable setting.
