Deep Multi-Agent Reinforcement Learning
  • Abstract & Contents
    • Abstract
  • 1. Introduction
      • 1.1 The Industrial Revolution, Cognition, and Computers
      • 1.2 Deep Multi-Agent Reinforcement Learning
      • 1.3 Overall Structure
  • 2. Background
      • 2.1 Reinforcement Learning
      • 2.2 Multi-Agent Settings
      • 2.3 Centralized vs Decentralized Control
      • 2.4 Cooperative, Zero-sum, and General-Sum
      • 2.5 Partial Observability
      • 2.6 Centralized Training, Decentralized Execution
      • 2.7 Value Functions
      • 2.8 Nash Equilibria
      • 2.9 Deep Learning for MARL
      • 2.10 Q-Learning and DQN
      • 2.11 REINFORCE and Actor-Critic
  • I Learning to Collaborate
    • 3. Counterfactual Multi-Agent Policy Gradients
      • 3.1 Introduction
      • 3.2 Related Work
      • 3.3 Multi-Agent StarCraft Micromanagement
      • 3.4 Methods
        • 3.4.1 Independent Actor-Critic
        • 3.4.2 Counterfactual Multi-Agent Policy Gradients
        • 3.4.2.1 Baseline Lemma
        • 3.4.2.2 COMA Algorithm
      • 3.5 Results
      • 3.6 Conclusions & Future Work
    • 4. Multi-Agent Common Knowledge Reinforcement Learning
      • 4.1 Introduction
      • 4.2 Related Work
      • 4.3 Dec-POMDP and Features
      • 4.4 Common Knowledge
      • 4.5 Multi-Agent Common Knowledge Reinforcement Learning
      • 4.6 Pairwise MACKRL
      • 4.7 Experiments and Results
      • 4.8 Conclusion & Future Work
    • 5. Stabilizing Experience Replay
      • 5.1 Introduction
      • 5.2 Related Work
      • 5.3 Methods
        • 5.3.1 Multi-Agent Importance Sampling
        • 5.3.2 Multi-Agent Fingerprints
      • 5.4 Experiments
        • 5.4.1 Architecture
      • 5.5 Results
        • 5.5.1 Importance Sampling
        • 5.5.2 Fingerprints
        • 5.5.3 Informative Trajectories
      • 5.6 Conclusion & Future Work
  • II Learning to Communicate
    • 6. Learning to Communicate with Deep Multi-Agent Reinforcement Learning
      • 6.1 Introduction
      • 6.2 Related Work
      • 6.3 Setting
      • 6.4 Methods
        • 6.4.1 Reinforced Inter-Agent Learning
        • 6.4.2 Differentiable Inter-Agent Learning
      • 6.5 DIAL Details
      • 6.6 Experiments
        • 6.6.1 Model Architecture
        • 6.6.2 Switch Riddle
        • 6.6.3 MNIST Games
        • 6.6.4 Effect of Channel Noise
      • 6.7 Conclusion & Future Work
    • 7. Bayesian Action Decoder
      • 7.1 Introduction
      • 7.2 Setting
      • 7.3 Method
        • 7.3.1 Public belief
        • 7.3.2 Public Belief MDP
        • 7.3.3 Sampling Deterministic Partial Policies
        • 7.3.4 Factorized Belief Updates
        • 7.3.5 Self-Consistent Beliefs
      • 7.4 Experiments and Results
        • 7.4.1 Matrix Game
        • 7.4.2 Hanabi
        • 7.4.3 Observations and Actions
        • 7.4.4 Beliefs in Hanabi
        • 7.4.5 Architecture Details for Baselines and Method
        • 7.4.6 Hyperparameters
        • 7.4.7 Results on Hanabi
      • 7.5 Related Work
        • 7.5.1 Learning to Communicate
        • 7.5.2 Research on Hanabi
        • 7.5.3 Belief State Methods
      • 7.6 Conclusion & Future Work
  • III Learning to Reciprocate
    • 8. Learning with Opponent-Learning Awareness
      • 8.1 Introduction
      • 8.2 Related Work
      • 8.3 Methods
        • 8.3.1 Naive Learner
        • 8.3.2 Learning with Opponent Learning Awareness
        • 8.3.3 Learning via Policy Gradient
        • 8.3.4 LOLA with Opponent modeling
        • 8.3.5 Higher-Order LOLA
      • 8.4 Experimental Setup
        • 8.4.1 Iterated Games
        • 8.4.2 Coin Game
        • 8.4.3 Training Details
      • 8.5 Results
        • 8.5.1 Iterated Games
        • 8.5.2 Coin Game
        • 8.5.3 Exploitability of LOLA
      • 8.6 Conclusion & Future Work
    • 9. DiCE: The Infinitely Differentiable Monte Carlo Estimator
      • 9.1 Introduction
      • 9.2 Background
        • 9.2.1 Stochastic Computation Graphs
        • 9.2.2 Surrogate Losses
      • 9.3 Higher Order Gradients
        • 9.3.1 Higher Order Gradient Estimators
        • 9.3.2 Higher Order Surrogate Losses
        • 9.3.3 Simple Failing Example
      • 9.4 Correct Gradient Estimators with DiCE
        • 9.4.1 Implementation of DiCE
        • 9.4.2 Causality
        • 9.4.3 First Order Variance Reduction
        • 9.4.4 Hessian-Vector Product
      • 9.5 Case Studies
        • 9.5.1 Empirical Verification
        • 9.5.2 DiCE for Multi-Agent RL
      • 9.6 Related Work
      • 9.7 Conclusion & Future Work
  • References
  • Afterword
    • Supplement
    • Translator's Note

References

[1] John Ruggles. Locomotive steam-engine for rail and other roads. US Patent 1. July 1836.

[2] Jeremy Rifkin. The end of work: The decline of the global labor force and the dawn of the post-market era. ERIC, 1995.

[3] William M Siebert. “Frequency discrimination in the auditory system: Place or periodicity mechanisms?” In: Proceedings of the IEEE 58.5 (1970), pp. 723–730.

[4] Donald Waterman. “A guide to expert systems”. In: (1986).

[5] Marti A. Hearst et al. “Support vector machines”. In: IEEE Intelligent Systems and their applications 13.4 (1998), pp. 18–28.

[6] Carl Edward Rasmussen. “Gaussian processes in machine learning”. In: Advanced lectures on machine learning. Springer, 2004, pp. 63–71.

[7] Yann LeCun et al. “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.

[8] Geoffrey Hinton et al. “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups”. In: IEEE Signal processing magazine 29.6 (2012), pp. 82–97.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105.

[10] Brendan Shillingford et al. “Large-scale visual speech recognition”. In: arXiv preprint arXiv:1807.05162 (2018).

[11] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In: Advances in neural information processing systems. 2014, pp. 3104–3112.

[12] Richard S Sutton. “Learning to predict by the methods of temporal differences”. In: Machine learning 3.1 (1988), pp. 9–44.

[13] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), pp. 529–533.

[14] David Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489.

[15] Robin IM Dunbar. “Neocortex size as a constraint on group size in primates”. In: Journal of human evolution 22.6 (1992), pp. 469–493.

[16] Robert M Axelrod. The evolution of cooperation: revised edition. Basic books, 2006.

[17] Erik Zawadzki, Asher Lipson, and Kevin Leyton-Brown. “Empirically evaluating multiagent learning algorithms”. In: arXiv preprint arXiv:1401.8074 (2014).

[18] Kagan Tumer and Adrian Agogino. “Distributed agent-based air traffic flow management”. In: Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems. ACM. 2007, p. 255.

[19] Lloyd S Shapley. “Stochastic games”. In: Proceedings of the national academy of sciences 39.10 (1953), pp. 1095–1100.

[20] Frans A. Oliehoek, Matthijs T. J. Spaan, and Nikos Vlassis. “Optimal and Approximate Q-value Functions for Decentralized POMDPs”. In: Journal of Artificial Intelligence Research 32 (2008), pp. 289–353.

[21] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. Vol. 1. 1. MIT press Cambridge, 1998.

[22] Yoav Shoham, Rob Powers, Trond Grenager, et al. “If multi-agent learning is the answer, what is the question?” In: Artificial Intelligence 171.7 (2007), pp. 365–377.

[23] John F Nash et al. “Equilibrium points in n-person games”. In: Proceedings of the national academy of sciences 36.1 (1950), pp. 48–49.

[24] Ian Goodfellow et al. Deep learning. Vol. 1. MIT press Cambridge, 2016.

[25] Ming Tan. “Multi-agent reinforcement learning: Independent vs. cooperative agents”. In: Proceedings of the tenth international conference on machine learning. 1993, pp. 330–337.

[26] Ardi Tampuu et al. “Multiagent cooperation and competition with deep reinforcement learning”. In: arXiv preprint arXiv:1511.08779 (2015).

[27] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. New York: Cambridge University Press, 2009.

[28] Matthew Hausknecht and Peter Stone. “Deep recurrent q-learning for partially observable mdps”. In: arXiv preprint arXiv:1507.06527 (2015).

[29] Richard S Sutton et al. “Policy gradient methods for reinforcement learning with function approximation.” In: NIPS. Vol. 99. 1999, pp. 1057–1063.

[30] Ronald J Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine learning 8.3-4 (1992), pp. 229–256.

[31] John Schulman et al. “Gradient Estimation Using Stochastic Computation Graphs”. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. 2015, pp. 3528–3536.

[32] Hajime Kimura, Shigenobu Kobayashi, et al. “An analysis of actor-critic algorithms using eligibility traces: reinforcement learning with imperfect value functions”. In: Journal of Japanese Society for Artificial Intelligence 15.2 (2000), pp. 267–275.

[33] John Schulman et al. “High-Dimensional Continuous Control Using Generalized Advantage Estimation”. In: CoRR abs/1506.02438 (2015). url: http://arxiv.org/abs/1506.02438.

[34] Ziyu Wang et al. “Sample Efficient Actor-Critic with Experience Replay”. In: arXiv preprint arXiv:1611.01224 (2016).

[35] Roland Hafner and Martin Riedmiller. “Reinforcement learning in feedback control”. In: Machine learning 84.1 (2011), pp. 137–169.

[36] Lex Weaver and Nigel Tao. “The optimal reward baseline for gradient-based reinforcement learning”. In: Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc. 2001, pp. 538–545.

[37] Vijay R Konda and John N Tsitsiklis. “Actor-Critic Algorithms.” In: NIPS. Vol. 13. 2000, pp. 1008–1014.

[38] Kyunghyun Cho et al. “On the properties of neural machine translation: Encoder-decoder approaches”. In: arXiv preprint arXiv:1409.1259 (2014).

[39] Yu-Han Chang, Tracey Ho, and Leslie Pack Kaelbling. “All learning is Local: Multi-agent Learning in Global Reward Games.” In: NIPS. 2003, pp. 807–814.

[40] Nicolas Usunier et al. “Episodic Exploration for Deep Deterministic Policies: An Application to StarCraft Micromanagement Tasks”. In: arXiv preprint arXiv:1609.02993 (2016).

[41] Peng Peng et al. “Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games”. In: arXiv preprint arXiv:1703.10069 (2017).

[42] Lucian Busoniu, Robert Babuska, and Bart De Schutter. “A comprehensive survey of multiagent reinforcement learning”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38.2 (2008), p. 156.

[43] Erfu Yang and Dongbing Gu. Multiagent reinforcement learning for multi-robot systems: A survey. Tech. rep. 2004.

[44] Joel Z Leibo et al. “Multi-agent Reinforcement Learning in Sequential Social Dilemmas”. In: arXiv preprint arXiv:1702.03037 (2017).

[45] Abhishek Das et al. “Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning”. In: arXiv preprint arXiv:1703.06585 (2017).

[46] Igor Mordatch and Pieter Abbeel. “Emergence of Grounded Compositional Language in Multi-Agent Populations”. In: arXiv preprint arXiv:1703.04908 (2017).

[47] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. “Multi-agent cooperation and the emergence of (natural) language”. In: arXiv preprint arXiv:1612.07182 (2016).

[48] Sainbayar Sukhbaatar, Rob Fergus, et al. “Learning multiagent communication with backpropagation”. In: Advances in Neural Information Processing Systems. 2016, pp. 2244–2252.

[49] Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. “Cooperative Multi-Agent Control Using Deep Reinforcement Learning”. In: (2017).

[50] Shayegan Omidshafiei et al. “Deep Decentralized Multi-task Multi-Agent RL under Partial Observability”. In: arXiv preprint arXiv:1703.06182 (2017).

[51] Tabish Rashid et al. “QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning”. In: Proceedings of The 35th International Conference on Machine Learning. 2018.

[52] Peter Sunehag et al. “Value-Decomposition Networks For Cooperative Multi-Agent Learning”. In: arXiv preprint arXiv:1706.05296 (2017).

[53] Ryan Lowe et al. “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments”. In: arXiv preprint arXiv:1706.02275 (2017).

[54] Danny Weyns, Alexander Helleboogh, and Tom Holvoet. “The packet-world: A test bed for investigating situated multi-agent systems”. In: Software Agent-Based Applications, Platforms and Development Kits. Springer, 2005, pp. 383–408.

[55] David H Wolpert and Kagan Tumer. “Optimal payoff functions for members of collectives”. In: Modeling complexity in economic and social systems. World Scientific, 2002, pp. 355–369.

[56] Scott Proper and Kagan Tumer. “Modeling difference rewards for multiagent learning”. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 3. International Foundation for Autonomous Agents and Multiagent Systems. 2012, pp. 1397–1398.

[57] Mitchell K Colby, William Curran, and Kagan Tumer. “Approximating difference evaluations with local information”. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems. 2015, pp. 1659–1660.

[58] Gabriel Synnaeve et al. “TorchCraft: a Library for Machine Learning Research on Real-Time Strategy Games”. In: arXiv preprint arXiv:1611.00625 (2016).

[59] R. Collobert, K. Kavukcuoglu, and C. Farabet. “Torch7: A Matlab-like Environment for Machine Learning”. In: BigLearn, NIPS Workshop. 2011.

[60] Landon Kraemer and Bikramjit Banerjee. “Multi-agent reinforcement learning as a rehearsal for decentralized planning”. In: Neurocomputing 190 (2016), pp. 82–94.

[61] Emilio Jorge, Mikael Kageback, and Emil Gustavsson. “Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence”. In: arXiv preprint arXiv:1611.03218 (2016).

[62] Martin J Osborne and Ariel Rubinstein. A course in game theory. MIT press, 1994.

[63] Katie Genter, Tim Laue, and Peter Stone. “Three years of the RoboCup standard platform league drop-in player competition”. In: Autonomous Agents and Multi-Agent Systems 31.4 (2017), pp. 790–820.

[64] Carlos Guestrin, Daphne Koller, and Ronald Parr. “Multiagent planning with factored MDPs”. In: Advances in neural information processing systems. 2002, pp. 1523–1530.

[65] Jelle R Kok and Nikos Vlassis. “Sparse cooperative Q-learning”. In: Proceedings of the twenty-first international conference on Machine learning. ACM. 2004, p. 61.

[66] Katie Genter, Noa Agmon, and Peter Stone. “Ad hoc teamwork for leading a flock”. In: Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems. 2013, pp. 531–538.

[67] Samuel Barrett, Peter Stone, and Sarit Kraus. “Empirical evaluation of ad hoc teamwork in the pursuit domain”. In: The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2. International Foundation for Autonomous Agents and Multiagent Systems. 2011, pp. 567–574.

[68] Stefano V Albrecht and Peter Stone. “Reasoning about hypothetical agent behaviours and their parameters”. In: Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems. 2017, pp. 547–555.

[69] Alessandro Panella and Piotr Gmytrasiewicz. “Interactive POMDPs with finite-state models of other agents”. In: Autonomous Agents and Multi-Agent Systems 31.4 (2017), pp. 861–904.

[70] Takaki Makino and Kazuyuki Aihara. “Multi-agent reinforcement learning algorithm to handle beliefs of other agents’ policies and embedded beliefs”. In: Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems. ACM. 2006, pp. 789–791.

[71] Kyle A Thomas et al. “The psychology of coordination and common knowledge.” In: Journal of personality and social psychology 107.4 (2014), p. 657.

[72] Ariel Rubinstein. “The Electronic Mail Game: Strategic Behavior Under ‘Almost Common Knowledge’”. In: The American Economic Review (1989), pp. 385–391.

[73] Gizem Korkmaz et al. “Collective action through common knowledge using a facebook model”. In: Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems. 2014, pp. 253–260.

[74] Ronen I. Brafman and Moshe Tennenholtz. “Learning to Coordinate Efficiently: A Model-based Approach”. In: Journal of Artificial Intelligence Research. Vol. 19. 2003, pp. 11–23.

[75] Robert J Aumann et al. “Subjectivity and correlation in randomized strategies”. In: Journal of mathematical Economics 1.1 (1974), pp. 67–96.

[76] Ludek Cigler and Boi Faltings. “Decentralized anti-coordination through multi-agent learning”. In: Journal of Artificial Intelligence Research 47 (2013), pp. 441–473.

[77] Craig Boutilier. “Sequential optimality and coordination in multiagent systems”. In: IJCAI. Vol. 99. 1999, pp. 478–485.

[78] Christopher Amato, George D Konidaris, and Leslie P Kaelbling. “Planning with macro-actions in decentralized POMDPs”. In: Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems. 2014, pp. 1273–1280.

[79] Miao Liu et al. “Learning for Multi-robot Cooperation in Partially Observable Stochastic Environments with Macro-actions”. In: arXiv preprint arXiv:1707.07399 (2017).

[80] Rajbala Makar, Sridhar Mahadevan, and Mohammad Ghavamzadeh. “Hierarchical multi-agent reinforcement learning”. In: Proceedings of the fifth international conference on Autonomous agents. ACM. 2001, pp. 246–253.

[81] Thomas G. Dietterich. “Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition”. In: J. Artif. Int. Res. 13.1 (Nov. 2000), pp. 227–303.

[82] Saurabh Kumar et al. “Federated Control with Hierarchical Multi-Agent Deep Reinforcement Learning”. In: arXiv preprint arXiv:1712.08266 (2017).

[83] Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. “Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems”. In: The Knowledge Engineering Review 27.01 (2012), pp. 1–31.

[84] Kamil Ciosek and Shimon Whiteson. “OFFER: Off-Environment Reinforcement Learning”. In: (2017).

[85] Gerald Tesauro. “Extending Q-Learning to General Adaptive Multi-Agent Systems.” In: NIPS. Vol. 4. 2003.

[86] Tom Schaul et al. “Prioritized Experience Replay”. In: CoRR abs/1511.05952 (2015).

[87] Vincent Conitzer and Tuomas Sandholm. “AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents”. In: Machine Learning 67.1-2 (2007), pp. 23–43.

[88] Bruno C Da Silva et al. “Dealing with non-stationary environments using context detection”. In: Proceedings of the 23rd international conference on Machine learning. ACM. 2006, pp. 217–224.

[89] Jelle R Kok and Nikos Vlassis. “Collaborative multiagent reinforcement learning by payoff propagation”. In: Journal of Machine Learning Research 7.Sep (2006), pp. 1789–1828.

[90] Martin Lauer and Martin Riedmiller. “An algorithm for distributed reinforcement learning in cooperative multi-agent systems”. In: In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer. 2000.

[91] Maja J Mataric. “Using communication to reduce locality in distributed multiagent learning”. In: Journal of experimental & theoretical artificial intelligence 10.3 (1998), pp. 357–369.

[92] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, New York, 2004.

[93] F. S. Melo, M. Spaan, and S. J. Witwicki. “QueryPOMDP: POMDP-based communication in multiagent systems”. In: Multi-Agent Systems. 2011, pp. 189–204.

[94] L. Panait and S. Luke. “Cooperative multi-agent learning: The state of the art”. In: Autonomous Agents and Multi-Agent Systems 11.3 (2005), pp. 387–434.

[95] C. Zhang and V. Lesser. “Coordinating multi-agent reinforcement learning with limited communication”. In: vol. 2. 2013, pp. 1101–1108.

[96] T. Kasai, H. Tenmoto, and A. Kamiya. “Learning of communication codes in multi-agent reinforcement learning problem”. In: IEEE Soft Computing in Industrial Applications. 2008, pp. 1–6.

[97] C. L. Giles and K. C. Jim. “Learning communication for multi-agent systems”. In: Innovative Concepts for Agent-Based Systems. Springer, 2002, pp. 377–390.

[98] Karol Gregor et al. “DRAW: A recurrent neural network for image generation”. In: arXiv preprint arXiv:1502.04623 (2015).

[99] Matthieu Courbariaux and Yoshua Bengio. “BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1”. In: arXiv preprint arXiv:1602.02830 (2016).

[100] Geoffrey Hinton and Ruslan Salakhutdinov. “Discovering binary codes for documents by learning deep generative models”. In: Topics in Cognitive Science 3.1 (2011), pp. 74–91.

[101] Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. “Language understanding for text-based games using deep reinforcement learning”. In: arXiv preprint arXiv:1506.08941 (2015).

[102] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8 (1997), pp. 1735–1780.

[103] Junyoung Chung et al. “Empirical evaluation of gated recurrent neural networks on sequence modeling”. In: arXiv preprint arXiv:1412.3555 (2014).

[104] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. “An empirical exploration of recurrent network architectures”. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015, pp. 2342–2350.

[105] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167 (2015).

[106] W. Wu. 100 prisoners and a lightbulb. Tech. rep. OCF, UC Berkeley, 2002.

[107] Michael Studdert-Kennedy. “How Did Language go Discrete?” In: Language Origins: Perspectives on Evolution. Ed. by Maggie Tallerman. Oxford University Press, 2005. Chap. 3.

[108] H. P. Grice. “Logic and Conversation”. In: Syntax and Semantics: Vol. 3: Speech Acts. Ed. by Peter Cole and Jerry L. Morgan. New York: Academic Press, 1975, pp. 41–58. url: http://www.ucl.ac.uk/ls/studypacks/Grice-Logic.pdf.

[109] Michael C. Frank and Noah D. Goodman. “Predicting pragmatic reasoning in language games”. In: Science 336.6084 (2012), p. 998.

[110] Ashutosh Nayyar, Aditya Mahajan, and Demosthenis Teneketzis. “Decentralized stochastic control with partial history sharing: A common information approach”. In: IEEE Trans. Automat. Contr. 58.7 (2013), pp. 1644–1658. arXiv: 1209.1695. url: https://arxiv.org/abs/1209.1695.

[111] Chris L. Baker et al. “Rational quantitative attribution of beliefs, desires and percepts in human mentalizing”. In: Nat. Hum. Behav. 1.4 (2017), pp. 1–10. url: http://dx.doi.org/10.1038/s41562-017-0064.

[112] L P Kaelbling, M L Littman, and A R Cassandra. “Planning and acting in partially observable stochastic domains”. In: Artif. Intell. 101.1-2 (1998), pp. 99–134. url: http://dx.doi.org/10.1016/S0004-3702(98)00023-X.

[113] Piotr J. Gmytrasiewicz and Prashant Doshi. “A framework for sequential planning in multi-agent settings”. In: J. Artif. Intell. Res. 24 (2005), pp. 49–79. arXiv: 1109.2135.

[114] Thomas P Minka. “Expectation Propagation for Approximate Bayesian Inference”. In: Uncertain. Artif. Intell. 17.2 (2001), pp. 362–369. arXiv: 1301.2294. url: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.1319&rep=rep1&type=pdf.

[115] Lasse Espeholt et al. “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures”. In: arXiv preprint arXiv:1802.01561 (2018). url: http://arxiv.org/abs/1802.01561.

[116] Max Jaderberg et al. “Human-level performance in first-person multiplayer games with population-based deep reinforcement learning”. In: arXiv preprint arXiv:1807.01281 (2018). url: http://arxiv.org/abs/1807.01281.

[117] Max Jaderberg et al. “Population Based Training of Neural Networks”. In: arXiv preprint arXiv:1711.09846 (2017). url: http://arxiv.org/abs/1711.09846.

[118] Jakob N Foerster et al. “Learning to communicate to solve riddles with deep distributed recurrent q-networks”. In: arXiv preprint arXiv:1602.02672 (2016).

[119] Jean-Francois Baffier et al. “Hanabi is NP-complete, even for cheaters who look at their cards”. In: (2016).

[120] Christopher Cox et al. “How to Make the Perfect Fireworks Display: Two Strategies for Hanabi”. In: Math. Mag. 88 (2015), p. 323. url: http://www.jstor.org/stable/10.4169/math.mag.88.5.323.

[121] Bruno Bouzy. “Playing Hanabi Near-Optimally”. In: Advances in Computer Games. Springer. 2017, pp. 51–62.

[122] Joseph Walton-Rivers et al. “Evaluating and modelling Hanabi-playing agents”. In: Evolutionary Computation (CEC), 2017 IEEE Congress on. IEEE. 2017, pp. 1382–1389.

[123] Hirotaka Osawa. “Solving Hanabi: Estimating Hands by Opponent’s Actions in Cooperative Game with Incomplete Information.” In: AAAI workshop: Computer Poker and Imperfect Information. 2015, pp. 37–43.

[124] Markus Eger, Chris Martens, and Marcela Alfaro Cordoba. “An intentional AI for hanabi”. In: 2017 IEEE Conf. Comput. Intell. Games, CIG 2017 (2017), pp. 68–75.

[125] Matej Moravcik et al. “DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker”. In: arXiv preprint arXiv:1701.01724 (2017).

[126] Noam Brown and Tuomas Sandholm. “Superhuman AI for heads-up no-limit poker: Libratus beats top professionals”. In: Science 359.6374 (2018), pp. 418–424.

[127] Pablo Hernandez-Leal et al. “A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity”. In: arXiv preprint arXiv:1707.09183 (2017).

[128] Tuomas W Sandholm and Robert H Crites. “Multiagent reinforcement learning in the iterated prisoner’s dilemma”. In: Biosystems 37.1-2 (1996), pp. 147–166.

[129] Michael Bowling and Manuela Veloso. “Multiagent learning using a variable learning rate”. In: Artificial Intelligence 136.2 (2002), pp. 215–250.

[130] William Uther and Manuela Veloso. Adversarial reinforcement learning. Tech. rep. Carnegie Mellon University, 1997. Unpublished.

[131] C. Claus and C. Boutilier. “The Dynamics of Reinforcement Learning Cooperative Multiagent Systems”. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence. June 1998, pp. 746–752.

[132] Michael Wunder, Michael L Littman, and Monica Babes. “Classes of multiagent q-learning dynamics with epsilon-greedy exploration”. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010, pp. 1167–1174.

[133] Martin Zinkevich, Amy Greenwald, and Michael L Littman. “Cyclic equilibria in Markov games”. In: Advances in Neural Information Processing Systems. 2006, pp. 1641–1648.

[134] Michael L Littman. “Friend-or-foe Q-learning in general-sum games”. In: ICML. Vol. 1. 2001, pp. 322–328.

[135] Doran Chakraborty and Peter Stone. “Multiagent learning in the presence of memory-bounded agents”. In: Autonomous agents and multi-agent systems 28.2 (2014), pp. 182–213.

[136] Ronen I. Brafman and Moshe Tennenholtz. “Efficient Learning Equilibrium”. In: Advances in Neural Information Processing Systems. Vol. 9. 2003, pp. 1635–1643.

[137] Marc Lanctot et al. “A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning”. In: Advances in Neural Information Processing Systems (NIPS). 2017.

[138] Johannes Heinrich and David Silver. “Deep reinforcement learning from self-play in imperfect-information games”. In: arXiv preprint arXiv:1603.01121 (2016).

[139] Adam Lerer and Alexander Peysakhovich. “Maintaining cooperation in complex social dilemmas using deep reinforcement learning”. In: arXiv preprint arXiv:1707.01068 (2017).

[140] Enrique Munoz de Cote and Michael L. Littman. “A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games”. In: 24th Conference on Uncertainty in Artificial Intelligence (UAI’08). 2008. url: http://uai2008.cs.helsinki.fi/UAI_camera_ready/munoz.pdf.

[141] Jacob W Crandall and Michael A Goodrich. “Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning”. In: Machine Learning 82.3 (2011), pp. 281–314.

[142] George W Brown. “Iterative solution of games by fictitious play”. In: (1951).

[143] Richard Mealing and Jonathan Shapiro. “Opponent Modelling by Expectation-Maximisation and Sequence Prediction in Simplified Poker”. In: IEEE Transactions on Computational Intelligence and AI in Games (2015).

[144] Neil C Rabinowitz et al. “Machine Theory of Mind”. In: arXiv preprint arXiv:1802.07740 (2018).

[145] Richard Mealing and Jonathan L Shapiro. “Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games.” In: ICAISC (2). 2013, pp. 385–396.

[146] Pablo Hernandez-Leal and Michael Kaisers. “Learning against sequential opponents in repeated stochastic games”. In: (2017).

[147] Chongjie Zhang and Victor R Lesser. “Multi-Agent Learning with Policy Prediction.” In: AAAI. 2010.

[148] Luke Metz et al. “Unrolled generative adversarial networks”. In: arXiv preprint arXiv:1611.02163 (2016).

[149] Max Kleiman-Weiner et al. “Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction”. In: COGSCI. 2016.

[150] Stephane Ross, Geoffrey J Gordon, and J Andrew Bagnell. “No-regret reductions for imitation learning and structured prediction”. In: In AISTATS. Citeseer. 2011.

[151] Mariusz Bojarski et al. “End to end learning for self-driving cars”. In: arXiv preprint arXiv:1604.07316 (2016).

[152] R Duncan Luce and Howard Raiffa. “Games and Decisions: Introduction and Critical Survey”. In: (1957).

[153] King Lee and K Louis. “The Application of Decision Theory and Dynamic Programming to Adaptive Control Systems”. PhD thesis. 1967.

[154] Drew Fudenberg and Jean Tirole. Game Theory. Cambridge, Massachusetts: MIT Press, 1991.

[155] Roger B. Myerson. Game theory: analysis of conflict. Harvard University Press, 1991.

[156] Robert Gibbons. Game theory for applied economists. Princeton University Press, 1992.

[157] William H Press and Freeman J Dyson. “Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent”. In: Proceedings of the National Academy of Sciences 109.26 (2012), pp. 10409–10413.

[158] John E Dennis Jr and Jorge J More. “Quasi-Newton methods, motivation and theory”. In: SIAM review 19.1 (1977), pp. 46–89.

[159] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. 2017, pp. 1126–1135.

[160] Maruan Al-Shedivat et al. “Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments”. In: CoRR abs/1710.03641 (2017). arXiv: 1710.03641.

[161] Zhenguo Li et al. “Meta-SGD: Learning to Learn Quickly for Few Shot Learning”. In: CoRR abs/1707.09835 (2017). arXiv: 1707.09835.

[162] Martin Abadi et al. “TensorFlow: A System for Large-Scale Machine Learning”. In: 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016. 2016, pp. 265–283.

[163] Adam Paszke et al. “Automatic differentiation in PyTorch”. In: (2017).

[164] John Schulman, Pieter Abbeel, and Xi Chen. “Equivalence Between Policy Gradients and Soft Q-Learning”. In: CoRR abs/1704.06440 (2017). arXiv: 1704.06440.

[165] John Schulman et al. “Trust region policy optimization”. In: International Conference on Machine Learning. 2015, pp. 1889–1897.

[166] Barak A Pearlmutter. “Fast exact multiplication by the Hessian”. In: Neural computation 6.1 (1994), pp. 147–160.

[167] Bradly Stadie et al. Some Considerations on Learning to Explore via Meta-Reinforcement Learning. 2018. url: https://openreview.net/forum?id=Skk3Jm96W.

[168] Michael C Fu. “Gradient estimation”. In: Handbooks in operations research and management science 13 (2006), pp. 575–616.

[169] Ivo Grondman et al. “A survey of actor-critic reinforcement learning: Standard and natural policy gradients”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42.6 (2012), pp. 1291–1307.

[170] Peter W Glynn. “Likelihood ratio gradient estimation for stochastic systems”. In: Communications of the ACM 33.10 (1990), pp. 75–84.

[171] David Wingate and Theophane Weber. “Automated Variational Inference in Probabilistic Programming”. In: CoRR abs/1301.1299 (2013). arXiv: 1301.1299.

[172] Rajesh Ranganath, Sean Gerrish, and David M. Blei. “Black Box Variational Inference”. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014. 2014, pp. 814–822.

[173] Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: CoRR abs/1312.6114 (2013). arXiv: 1312.6114.

[174] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. “Stochastic Backpropagation and Approximate Inference in Deep Generative Models”. In: Proceedings of the 31st International Conference on Machine Learning. 2014, pp. 1278–1286.

[175] Atilim Gunes Baydin, Barak A. Pearlmutter, and Alexey Andreyevich Radul. “Automatic differentiation in machine learning: a survey”. In: CoRR abs/1502.05767 (2015). arXiv: 1502.05767.
