4.7 Experiments and Results

MACKRL is evaluated here on two tasks.

The matrix-game experiment is described below.

The first of these is a partially observable matrix game. The state consists of two randomly chosen bits, sampled i.i.d. The first bit is the information state and is observable to both agents; the second bit selects which of two normal-form games the agents play and is sampled with 50% probability each. If the first bit indicates common knowledge, which happens with probability P(common knowledge), the matrix bit is always revealed to both agents and thus becomes common knowledge for everyone. Otherwise, each agent observes the matrix bit independently with 50% probability.
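
As a concrete illustration of this setup, here is a minimal sampling sketch (my own, not code from the thesis); `p_ck` stands for P(common knowledge), and the function and variable names are assumptions:

```python
import random

def sample_matrix_game_state(p_ck: float):
    """Sample one state of the partially observable matrix game.

    Returns the common-knowledge flag, the matrix bit, and what each of the
    two agents actually observes about the matrix bit (None = unobserved).
    """
    ck_bit = random.random() < p_ck     # first bit: is the matrix bit common knowledge?
    matrix_bit = random.random() < 0.5  # second bit: which of the two normal-form games is played
    if ck_bit:
        # Common knowledge: both agents see the matrix bit and know the other sees it too.
        obs = [matrix_bit, matrix_bit]
    else:
        # No common knowledge: each agent sees the matrix bit independently with probability 0.5.
        obs = [matrix_bit if random.random() < 0.5 else None for _ in range(2)]
    return ck_bit, matrix_bit, obs

# With p_ck = 0.7, the matrix bit is common knowledge in roughly 70% of sampled states.
print(sample_matrix_game_state(0.7))
```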

When common knowledge was always available, MACKRL performed on par with joint-action learning (JAL). At intermediate common-knowledge probabilities, MACKRL outperformed both IAC and JAL.

StarCraft II

The second experiment is a MARL environment built on StarCraft II micromanagement. The scenarios resemble standard StarCraft micromanagement setups: 3 Marines vs. 3 Marines, and 2 Stalkers & 3 Zealots. Previous work showed that independent learners fail on these tasks, so by succeeding here MACKRL demonstrates its effectiveness.
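
For orientation, a rough setup sketch follows. It uses the SMAC package and its map names "3m" and "2s3z" as stand-ins, which is an assumption on my part rather than the exact environment used in the thesis:

```python
from smac.env import StarCraft2Env

# "3m" = 3 Marines vs 3 Marines; "2s3z" = 2 Stalkers & 3 Zealots per side.
env = StarCraft2Env(map_name="3m")
info = env.get_env_info()
n_agents, n_actions = info["n_agents"], info["n_actions"]

env.reset()
obs = env.get_obs()      # one partial observation per agent (decentralized execution)
state = env.get_state()  # global state, usable only during centralized training
env.close()
```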

The policy's neural network architecture is then described: the second- and third-level controllers in the hierarchy share parameters, so information about the agent index (or the index of the agent pair) must be fed to each controller as part of its input.
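
To make the parameter-sharing point concrete, here is a small PyTorch-style sketch (my own simplification, not the thesis architecture): one network is shared across all controllers at a given level, and a one-hot agent or pair index is concatenated to the observation so that the shared weights can still act differently for each agent or pair:

```python
import torch
import torch.nn as nn

class SharedController(nn.Module):
    """One set of weights shared by all controllers at a given hierarchy level.

    Because the weights are shared, the controller is told which agent
    (or agent pair) it is acting for via a one-hot index appended to the input.
    """
    def __init__(self, obs_dim: int, n_indices: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.n_indices = n_indices
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_indices, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor, index: int) -> torch.distributions.Categorical:
        one_hot = torch.zeros(obs.shape[0], self.n_indices)
        one_hot[:, index] = 1.0
        logits = self.net(torch.cat([obs, one_hot], dim=-1))
        return torch.distributions.Categorical(logits=logits)

# A shared pair controller for 3 agents (3 possible pairs); assuming 5 individual
# actions per agent, its output space is the 5*5 joint actions plus one delegate action.
pair_controller = SharedController(obs_dim=32, n_indices=3, n_actions=5 * 5 + 1)
dist = pair_controller(torch.randn(1, 32), index=0)
action = dist.sample()
```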

In the comparison with Central-V, Central-V used about three times fewer parameters, but MACKRL ultimately achieved the better performance.

Is the Pair Controller Really Well Trained to Delegate?

To show that the pair controller learns to delegate strategically, the figure below is presented.

It shows the percentage of delegation actions u_d chosen by the pair controller as a function of the number of enemies within that pair's common knowledge. Early in training, the pair controller rarely delegated to the decentralized controllers, but as training progressed it learned to delegate more often when an appropriate number of enemies was in common knowledge. This shows that delegation lets each agent exploit its individual observations, while the agents still learn to cooperate when common knowledge is available.
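
The quantity plotted there is straightforward to recompute from logged decisions. Below is a minimal sketch (my own; the log format and the id chosen for u_d are assumptions) that groups pair-controller decisions by the number of enemies in the pair's common knowledge and reports the percentage of delegate actions:

```python
from collections import defaultdict

U_DELEGATE = 0  # assumed id of the delegation action u_d in the pair controller's action space

def delegation_rate_by_ck_enemies(log):
    """log: iterable of (n_enemies_in_common_knowledge, pair_controller_action) tuples."""
    counts = defaultdict(lambda: [0, 0])  # n_enemies -> [delegations, total decisions]
    for n_enemies, action in log:
        counts[n_enemies][0] += int(action == U_DELEGATE)
        counts[n_enemies][1] += 1
    return {n: 100.0 * d / t for n, (d, t) in sorted(counts.items())}

# Toy example: two decisions with 0 enemies in common knowledge, one of which delegates.
print(delegation_rate_by_ck_enemies([(0, U_DELEGATE), (0, 3), (2, U_DELEGATE)]))
# {0: 50.0, 2: 100.0}
```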