7.4.7 Results on Hanabi

The BAD agent achieves state-of-the-art performance in two-player Hanabi. Let's start with panel (a) of the figure below.

It shows the training curves of the BAD agent and two LSTM agents. At test time, the LSTM agents use the best of their trained policies, which yields slightly better performance. To pick an agent, each candidate was first evaluated over 10,000 games, and the best one was then tested over 100,000 games. The BAD agent was selected in a similar way, with additional hyperparameter choices: the mixing weight α that controls how much of V1 is blended in, and the number of cards held in hand.
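
The selection protocol above is simple enough to sketch. Below is a minimal, assumption-based illustration, not the authors' evaluation code: `play_game`, `env_factory`, and `candidate_agents` are hypothetical placeholders, and only the 10,000-game screening followed by a 100,000-game final evaluation comes from the text.

```python
import random
import statistics

def play_game(agent, env):
    """Hypothetical stand-in: run one self-play game and return its score.
    A real implementation would roll out `agent` in a Hanabi environment;
    a random score keeps the sketch runnable."""
    return random.randint(0, 25)

def mean_score(agent, env_factory, n_games):
    """Average score of `agent` over `n_games` games."""
    return statistics.mean(play_game(agent, env_factory()) for _ in range(n_games))

def select_and_evaluate(candidate_agents, env_factory,
                        screen_games=10_000, final_games=100_000):
    # Screen every candidate with the cheaper 10,000-game evaluation.
    screening = {name: mean_score(agent, env_factory, screen_games)
                 for name, agent in candidate_agents.items()}
    best = max(screening, key=screening.get)
    # Re-evaluate only the winner with the larger 100,000-game budget.
    return best, mean_score(candidate_agents[best], env_factory, final_games)
```

Here `candidate_agents` would map checkpoint names to trained policies and `env_factory` would build a two-player Hanabi environment.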

Other methods scoring below 20 points are omitted for readability. Even under the stricter rule variant where three failed plays end the game with a score of 0, BAD still reaches about 23.9 points, outperforming the heuristic rule-based baselines.
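
To make that scoring variant concrete, here is a tiny sketch; the function and argument names are assumptions for illustration, not code from the paper.

```python
def hanabi_score(fireworks, failed_plays, zero_on_third_failure=True):
    """Final score under the rule variant described above (illustrative only).

    `fireworks` maps each color to the height of its stack (0-5) and
    `failed_plays` counts misplayed cards; both are hypothetical inputs.
    """
    # In the stricter variant, a third failed play ends the game with 0 points
    # instead of the sum of the firework stacks reached so far.
    if zero_on_third_failure and failed_plays >= 3:
        return 0
    return sum(fireworks.values())
```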

BAD์˜ agent์˜ ์‹ค์ œ ๊ฒŒ์ž„ํ”Œ๋ ˆ์ด๋Š” ๋ชจ๋‘ ๋”ฐ๋ฅด๊ธฐ ์‰ฌ์šด๊ฑด ์•„๋‹ˆ์ง€๋งŒ ๊ฒŒ์ž„์„ ๋ถ„์„ํ•ด๋ณด์•˜์„ ๋•Œ, ๋ช‡๋ช‡ convention์„ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋“  ๋†’์€ ์ ์ˆ˜๋ฅผ ๋ฐ›์€ agent๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” convention์œผ๋กœ ์ƒˆ ์นด๋“œ์— ๋Œ€ํ•ด ๋นจ๊ฐ„์ƒ‰์ด๋‚˜ ๋…ธ๋ž€์ƒ‰์ด๋ž€ ํžŒํŠธ๋ฅผ ์ฃผ๋ฉด ์ด๋Š” ๋“ฑ๋กํ•ด๋„ ๋œ๋‹ค๋Š” convention์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ 25%์˜ ์ผ€์ด์Šค์—์„œ๋Š” ์ƒˆ๋กœ์šด ์นด๋“œ์— ๋Œ€ํ•ด ํฐ์ƒ‰์ด๋‚˜ ํŒŒ๋ž€์ƒ‰์„ ๊ฐ€๋ฆฌํ‚ค๋Š” ๊ฒƒ์ด ๋ฒ„๋ฆฌ๋ผ๋Š” convention์œผ๋กœ ์‚ฌ์šฉ๋˜๊ธฐ๋„ ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Panel (c) of the figure above shows the cross entropy of V0, V1, and V2 over the course of training. Iterating the belief update substantially reduces the cross entropy compared to the initial belief, which indicates that learning conventions is important for successful play.
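
As a rough, assumption-based illustration of what that cross entropy measures (this is not the paper's code), one can score a per-slot belief over a hand against the true cards and average the negative log-probabilities; each additional belief-update iteration should drive this number down.

```python
import numpy as np

def hand_cross_entropy(belief, true_card_ids, eps=1e-12):
    """Average negative log-probability the belief assigns to the true cards.

    `belief` has shape (num_slots, num_card_types): one distribution per slot
    of the hand. `true_card_ids` gives the index of the card actually held in
    each slot. Both are hypothetical inputs for illustration.
    """
    probs = belief[np.arange(len(true_card_ids)), true_card_ids]
    return float(-np.mean(np.log(probs + eps)))

# A sharper belief (more mass on the true cards) gives a lower cross entropy,
# which is the quantity plotted for V0, V1, and V2 over training.
uniform = np.full((4, 25), 1 / 25)                # e.g. 4 card slots, 25 card types
sharper = 0.5 * uniform + 0.5 * np.eye(25)[[3, 7, 12, 20]]
true_ids = np.array([3, 7, 12, 20])
print(hand_cross_entropy(uniform, true_ids))      # higher
print(hand_cross_entropy(sharper, true_ids))      # lower
```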