9.3.2 Higher Order Surrogate Losses

Schulman et al. focused on first-order derivatives and only suggested that the surrogate loss (SL) approach can also be applied to higher-order derivatives. In addition, because the SL severs the dependency between the cost and the parameters, the higher-order derivatives it produces come out simpler than the true ones.

Consider the following setting: for a single parameter $\theta$, the sampling distribution is $p(x;\theta)$ and the objective is $f(x,\theta)$.

$$SL(\mathcal{L}) = \log p(x;\theta)\,\hat{f}(x) + f(x;\theta)$$

$$(\nabla_\theta \mathcal{L})_{\mathrm{SL}} = \mathbb{E}_x[\nabla_\theta SL(\mathcal{L})]$$

$$= \mathbb{E}_x[\hat{f}(x)\nabla_\theta \log p(x;\theta) + \nabla_\theta f(x;\theta)]$$

$$= \mathbb{E}_x[g_{\mathrm{SL}}(x;\theta)]$$

Note that the dependency in the first term differs from that of the equation we saw in 9.3.1, reproduced below: $\hat{f}(x)$ is a fixed sample with no dependence on $\theta$, whereas $f(x;\theta)$ retains that dependence.

$$\mathbb{E}[f(x;\theta)\nabla_\theta \log p(x;\theta) + \nabla_\theta f(x;\theta)] \quad \cdots \ (9.3.1)$$

Thus, even though both expressions estimate the same first-order gradient, the missing dependency between the functions can create a discrepancy with the exact second-order derivative.
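To make the first half of that statement concrete, here is a minimal numerical sketch. The Bernoulli sampling distribution $p(x{=}1;\theta)=\sigma(\theta)$, the objective $f(x,\theta)=(x+\theta)^2$ and the use of PyTorch are illustrative assumptions, not part of the text; expectations are computed exactly by enumerating $x\in\{0,1\}$, and the expectation of $g_{\mathrm{SL}}$ indeed matches the true gradient of $\mathcal{L}(\theta)=\mathbb{E}_x[f(x;\theta)]$.

```python
# Minimal sketch (assumed setup): the SL gradient g_SL is exact at first order.
import torch

theta = torch.tensor(0.3, requires_grad=True)

def logp(x, th):
    # log p(x; theta) for a Bernoulli with p(x = 1) = sigmoid(theta)
    p1 = torch.sigmoid(th)
    return x * torch.log(p1) + (1 - x) * torch.log(1 - p1)

def f(x, th):
    # an objective that depends on both the sample x and the parameter theta
    return (x + th) ** 2

xs = [torch.tensor(0.0), torch.tensor(1.0)]

# True gradient: differentiate L(theta) = sum_x p(x; theta) f(x; theta) directly.
L = sum(torch.exp(logp(x, theta)) * f(x, theta) for x in xs)
grad_true = torch.autograd.grad(L, theta)[0]

# Expected SL gradient: weight the per-sample gradient of
# SL = log p(x; theta) * f_hat(x) + f(x; theta) by p(x; theta).
grad_sl = torch.tensor(0.0)
for x in xs:
    w = torch.exp(logp(x, theta)).detach()              # exact expectation weight
    sl = logp(x, theta) * f(x, theta).detach() + f(x, theta)
    grad_sl = grad_sl + w * torch.autograd.grad(sl, theta)[0]

print(float(grad_true), float(grad_sl))  # agree (up to float error): unbiased at first order
```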

$$SL(g_{\mathrm{SL}}(x;\theta)) = \log p(x;\theta)\,\hat{g}_{\mathrm{SL}}(x) + g_{\mathrm{SL}}(x;\theta)$$

$$(\nabla^2_\theta \mathcal{L})_{\mathrm{SL}} = \mathbb{E}_x[\nabla_\theta SL(g_{\mathrm{SL}})]$$

$$= \mathbb{E}_x[\hat{g}_{\mathrm{SL}}(x)\nabla_\theta \log p(x;\theta) + \nabla_\theta g_{\mathrm{SL}}(x;\theta)]$$

$g_{\mathrm{SL}}(x;\theta)$ differs from $g(x;\theta)$ only in a small way, namely in its dependence on $\theta$, so at this point their values are identical. When we differentiate once more, however, a large difference emerges.

$$\nabla_\theta g(x;\theta) = \nabla_\theta f(x;\theta)\nabla_\theta \log p(x;\theta) + f(x;\theta)\nabla^2_\theta \log p(x;\theta) + \nabla^2_\theta f(x;\theta)$$

$$\nabla_\theta g_{\mathrm{SL}}(x;\theta) = \hat{f}(x)\nabla^2_\theta \log p(x;\theta) + \nabla^2_\theta f(x;\theta)$$

The second expression, the one for $g_{\mathrm{SL}}$, has lost the $\nabla_\theta f(x;\theta)\nabla_\theta \log p(x;\theta)$ term, because the dependence of $\hat{f}(x)$ on $\theta$ has been discarded. However, as mentioned earlier, Finn's work showed that $g$ likewise does not converge.
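To see the consequence numerically, the sketch below reuses the same illustrative setup (Bernoulli $p(x{=}1;\theta)=\sigma(\theta)$, $f(x,\theta)=(x+\theta)^2$ and PyTorch are all assumptions made for the example) and compares the exact $\nabla^2_\theta\mathcal{L}$ with the expectation of the SL-based estimator $(\nabla^2_\theta\mathcal{L})_{\mathrm{SL}}$, again by exact enumeration over $x$.

```python
# Minimal sketch (assumed setup): the SL second derivative is biased because
# differentiating g_SL drops the grad_f * grad_logp term discussed above.
import torch

theta = torch.tensor(0.3, requires_grad=True)

def logp(x, th):
    p1 = torch.sigmoid(th)                   # Bernoulli(sigmoid(theta))
    return x * torch.log(p1) + (1 - x) * torch.log(1 - p1)

def f(x, th):
    return (x + th) ** 2                     # depends on both x and theta

xs = [torch.tensor(0.0), torch.tensor(1.0)]

# Exact second derivative of L(theta) = sum_x p(x; theta) f(x; theta).
L = sum(torch.exp(logp(x, theta)) * f(x, theta) for x in xs)
dL = torch.autograd.grad(L, theta, create_graph=True)[0]
d2L_exact = torch.autograd.grad(dL, theta)[0]

# Expectation of the SL-based second-derivative estimator.
d2L_sl = torch.tensor(0.0)
for x in xs:
    w = torch.exp(logp(x, theta)).detach()   # exact expectation weight
    # First SL: log p * f_hat + f, with f_hat a fixed (detached) sample.
    sl1 = logp(x, theta) * f(x, theta).detach() + f(x, theta)
    g_sl = torch.autograd.grad(sl1, theta, create_graph=True)[0]
    # Second SL, applied to g_SL: log p * g_hat + g_SL, with g_hat detached again.
    sl2 = logp(x, theta) * g_sl.detach() + g_sl
    d2L_sl = d2L_sl + w * torch.autograd.grad(sl2, theta)[0]

print(float(d2L_exact), float(d2L_sl))       # the two values differ
```

The gap between the two printed values is exactly $\mathbb{E}_x[\nabla_\theta f(x;\theta)\nabla_\theta \log p(x;\theta)]$, i.e. the term that $\nabla_\theta g_{\mathrm{SL}}$ is missing.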
