7.3.2 Public Belief MDP

Let us look at how the public belief is updated. The public belief can be written as follows.

$$P(f^a_t \mid u^a_t, \mathcal{B}_t, f^{\mathrm{pub}}_t, \hat{\pi})$$

Since $\hat{\pi}$ and $u^a_t$ are both public information, the distribution over the possible private state features $f^{\mathrm{pri}}_t$ implied by the action $u^a_t$ observed from an agent is exactly the public belief. It can be rewritten with Bayes' rule as follows.

$$P(f^a_t \mid u^a_t, \mathcal{B}_t, f^{\mathrm{pub}}_t, \hat{\pi}) = \frac{P(u^a_t \mid f^a_t, \hat{\pi})\, P(f^a_t \mid \mathcal{B}_t, f^{\mathrm{pub}}_t)}{P(u^a_t \mid \mathcal{B}_t, f^{\mathrm{pub}}_t, \hat{\pi})}$$

In other words, this is just Bayes' rule: given $\mathcal{B}_t$, $f^{\mathrm{pub}}_t$, and the sampled $\hat{\pi}$, the posterior over $f^a_t$ after observing that $u^a_t$ was selected equals the probability of selecting $u^a_t$ given $f^a_t$ and $\hat{\pi}$, times the prior probability of $f^a_t$ given $\mathcal{B}_t$ and $f^{\mathrm{pub}}_t$, normalized by the marginal probability of $u^a_t$ given $\mathcal{B}_t$, $f^{\mathrm{pub}}_t$, and $\hat{\pi}$.

And since $\hat{\pi}$ is a deterministic partial policy, this is naturally proportional to $\bm{1}\big(\hat{\pi}(f^a_t) = u^a_t\big)\, P(f^a_t \mid \mathcal{B}_t, f^{\mathrm{pub}}_t)$, where $\bm{1}(\cdot)$ is the indicator function.
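To make the indicator form concrete, here is a minimal sketch of the update in Python. The function name, the tabular indexing of private feature configurations, and the array representation of the partial policy are illustrative assumptions, not the BAD implementation.

```python
import numpy as np

def update_public_belief(prior, partial_policy, observed_action):
    """Bayesian public-belief update for one agent's turn (illustrative sketch).

    prior           : np.ndarray, prior probability B_t(f) over each candidate
                      private feature configuration f (indexed 0..F-1).
    partial_policy  : np.ndarray of ints, pi_hat[f] = action the sampled
                      deterministic partial policy takes under f.
    observed_action : int, the publicly observed action u^a_t.

    Returns the posterior P(f | u^a_t, B_t, pi_hat).
    """
    # For a deterministic partial policy, the likelihood P(u | f, pi_hat) is an
    # indicator: 1 if pi_hat would have chosen the observed action under f.
    likelihood = (partial_policy == observed_action).astype(float)

    unnormalized = likelihood * prior   # 1(pi_hat(f) = u) * B_t(f)
    evidence = unnormalized.sum()       # P(u | B_t, f_pub, pi_hat)
    if evidence == 0.0:
        raise ValueError("Observed action is inconsistent with the prior belief.")
    return unnormalized / evidence


# Example: 4 candidate private configurations, 3 possible actions.
prior = np.array([0.4, 0.3, 0.2, 0.1])
pi_hat = np.array([0, 2, 0, 1])                       # hypothetical partial policy
print(update_public_belief(prior, pi_hat, observed_action=0))
# -> [0.6667, 0.0, 0.3333, 0.0]
```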

This update lets us define a new MDP, the public belief MDP (PuB-MDP). Let us look at part (b) of the figure below.

PuB-MDP์˜ state ์— ๋Œ€ํ•ด sBADโˆˆSBADs_{\mathrm{BAD}} \in S_{\mathrm{BAD}}sBADโ€‹โˆˆSBADโ€‹๋Š” public observation๊ณผ public belief๋กœ ์ด๋ฃจ์–ด์ ธ์žˆ์Šต๋‹ˆ๋‹ค. deterministic partial policies๋Š” private observation์„ ํ†ตํ•ด action ์œผ๋กœ mappingํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ transition probability๋กœ ๋‚˜ํƒ€๋‚ด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

$$P(s'_{\mathrm{BAD}} \mid s_{\mathrm{BAD}}, \hat{\pi})$$

๋‹ค์Œ state๋Š” ์ƒˆ๋กœ์šด public belief๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋‹ˆ, ์ด๋ฅผ public belief update์‹์„ ํ†ตํ•ด ๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ, ์ผ๋ฐ˜ MDP์—์„œ๋Š” action์— ์˜ํ•ด transition probability๊ฐ€ ์ •์˜๋˜์ง€๋งŒ ์—ฌ๊ธฐ์„œ๋Š” ฯ€^ \hat{\pi}ฯ€^์— ์˜ํ•ด(private observation์— ๋”ฐ๋ฅธ ์‹คํ–‰๋˜์ง€ ์•Š์€ action๋ชจ๋‘๊ฐ€ transition probability์— ๊ด€์—ฌํ•ฉ๋‹ˆ๋‹ค.) ์ •์˜๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

The reward function marginalizes over the private state features and is defined as follows:

$$r_{\mathrm{BAD}}(s_{\mathrm{BAD}}, \hat{\pi}) = \sum_{f^{\mathrm{pri}}} \mathcal{B}(f^{\mathrm{pri}})\, r\big(s, \hat{\pi}(f^{\mathrm{pri}})\big)$$

That is, the reward is computed by weighting each $r(s, \hat{\pi}(f^{\mathrm{pri}}))$ by the public belief $\mathcal{B}_t$ over the private state features.
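As a small illustrative sketch (the `reward_fn` interface and the tabular representation are hypothetical), the belief-weighted reward can be computed as:

```python
def bad_reward(belief, partial_policy, reward_fn, state):
    """Belief-weighted reward r_BAD(s_BAD, pi_hat) -- a minimal sketch.

    belief         : sequence of floats, B(f) over private feature configurations.
    partial_policy : sequence of ints, pi_hat[f] = action chosen under f.
    reward_fn      : callable (state, action) -> float, the underlying reward r(s, u).
    state          : the underlying environment state s.
    """
    # Sum over private features: weight each r(s, pi_hat(f)) by the belief B(f).
    return sum(b * reward_fn(state, partial_policy[f])
               for f, b in enumerate(belief))
```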