8.3.2 Learning with Opponent Learning Awareness

A LOLA agent maximizes its expected total discounted return while taking into account a small change in the opponent's parameters. The effect of this small change is approximated with a first-order Taylor expansion:

J^1(\theta^1,\theta^2+\Delta\theta^2) \approx J^1(\theta^1,\theta^2) + (\Delta\theta^2)^T \nabla_{\theta^2} J^1(\theta^1,\theta^2)

In this way, LOLA agents actively try to influence each other's policy updates. The opponent's parameter change \Delta\theta^2 is then assumed to be a naive learning step, as in 8.3.1, i.e. a step in the direction that maximizes the opponent's own expected discounted return:

\Delta\theta^2 = \nabla_{\theta^2} J^2(\theta^1,\theta^2)\,\eta

Rewriting agent 1's (i+1)-th parameter update from 8.3.1 through the first-order Taylor approximation, with this \Delta\theta^2 substituted in, gives:

\bm{f}^1_{\mathrm{lola}}(\theta^1,\theta^2) = \nabla_{\theta^1} J^1(\theta^1,\theta^2)\,\delta + (\nabla_{\theta^2} J^1(\theta^1,\theta^2))^T \nabla_{\theta^1}\nabla_{\theta^2} J^2(\theta^1,\theta^2)\,\delta\eta

Here, \eta and \delta are learning rates.
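As a concrete illustration of the exact-gradient update above, below is a minimal sketch of one LOLA step for agent 1, assuming differentiable returns. The sigmoid-parameterized one-shot prisoner's dilemma and the names (`J1`, `J2`, `lola_step_agent1`) are illustrative assumptions, not the thesis's implementation; the cross term is computed as a vector-Jacobian product so the mixed Hessian \nabla_{\theta^1}\nabla_{\theta^2}J^2 is never formed explicitly.

```python
import jax
import jax.numpy as jnp

# Payoff matrices for a one-shot prisoner's dilemma (illustrative stand-in;
# the thesis evaluates LOLA on iterated games). Rows/cols: [cooperate, defect].
R1 = jnp.array([[-1., -3.],
                [ 0., -2.]])   # agent 1's payoffs
R2 = R1.T                      # symmetric game: agent 2's payoffs

def policy(theta):
    """Single-parameter policy: P(cooperate) = sigmoid(theta)."""
    p = jax.nn.sigmoid(theta)
    return jnp.stack([p, 1.0 - p])

def J1(theta1, theta2):
    """Agent 1's expected return J^1(theta^1, theta^2)."""
    return policy(theta1) @ R1 @ policy(theta2)

def J2(theta1, theta2):
    """Agent 2's expected return J^2(theta^1, theta^2)."""
    return policy(theta1) @ R2 @ policy(theta2)

def lola_step_agent1(theta1, theta2, delta=0.1, eta=0.1):
    """One exact-gradient LOLA update for agent 1."""
    # Naive term: nabla_{theta1} J^1(theta1, theta2), scaled by delta.
    naive = jax.grad(J1, argnums=0)(theta1, theta2)
    # Cross term: (nabla_{theta2} J^1)^T nabla_{theta1} nabla_{theta2} J^2,
    # scaled by delta * eta, written as a vector-Jacobian product.
    v = jax.grad(J1, argnums=1)(theta1, theta2)                # nabla_{theta2} J^1
    grad2_J2 = lambda t1: jax.grad(J2, argnums=1)(t1, theta2)  # t1 -> nabla_{theta2} J^2
    _, vjp_fn = jax.vjp(grad2_J2, theta1)
    (cross,) = vjp_fn(v)
    return theta1 + delta * naive + delta * eta * cross

# Usage: simultaneous LOLA updates for both agents (the game is symmetric,
# so swapping the arguments yields agent 2's update).
theta1, theta2 = jnp.array(0.0), jnp.array(0.0)
for _ in range(100):
    theta1, theta2 = lola_step_agent1(theta1, theta2), lola_step_agent1(theta2, theta1)
```

Using a vector-Jacobian product here keeps the cost of the second-order term close to that of an ordinary gradient, which matters once \theta are neural-network parameters rather than scalars.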

However, this form assumes exact access to the gradients and Hessians. We next construct the same update from quantities that can be estimated from observations available in a standard RL setting.
