Improve agents without retraining | Proceedings of the 35th International Conference on Neural Information Processing Systems


  • Authors:
  • Assaf Hallak, NVIDIA Research
  • Gal Dalal, NVIDIA Research
  • Steven Dalton, NVIDIA Research
  • Iuri Frosio, NVIDIA Research
  • Shie Mannor, NVIDIA Research
  • Gal Chechik, NVIDIA Research

Published: 10 June 2024

Metrics

Total Citations: 0 · Total Downloads: 0 (last 12 months: 0; last 6 weeks: 0)


NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems

Improve agents without retraining: parallel tree search with off-policy correction

Pages 5518–5530


ABSTRACT

Tree Search (TS) is crucial to some of the most influential successes in reinforcement learning. Here, we tackle two major challenges with TS that limit its usability: distribution shift and scalability. We first discover and analyze a counter-intuitive phenomenon: action selection through TS and a pre-trained value function often leads to lower performance compared to the original pre-trained agent, even when given access to the exact state and reward in future steps. We show this is due to a distribution shift to areas where value estimates are highly inaccurate and analyze this effect using Extreme Value theory. To overcome this problem, we introduce a novel off-policy correction term that accounts for the mismatch between the pre-trained value and its corresponding TS policy by penalizing under-sampled trajectories. We prove that our correction eliminates the above mismatch and bound the probability of sub-optimal action selection. Our correction significantly improves pre-trained Rainbow agents without any further training, often more than doubling their scores on Atari games. Next, we address the scalability issue caused by the computational complexity of exhaustive TS, which scales exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first search that advances all nodes in each depth of the tree simultaneously. Batch-BFS reduces runtime by two orders of magnitude and, beyond inference, also enables training with TS of depths that were not feasible before. We train DQN agents from scratch using TS and show improvement in several Atari games compared to both the original DQN and the more advanced Rainbow.
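To make the two ideas in the abstract more concrete, the sketch below illustrates a depth-limited, batched breadth-first expansion of the action tree followed by a leaf backup that combines discounted rewards, a pre-trained value estimate, and a penalty on trajectories the pre-trained policy would rarely sample. This is a minimal illustrative sketch, not the paper's implementation: `step_batch`, `value_fn`, `policy_prob`, and the penalty weight `LAMBDA_PENALTY` are toy placeholders, and the exact off-policy correction term used in the paper differs.

```python
# Illustrative sketch only -- NOT the paper's implementation.
# Assumptions (hypothetical): a deterministic batched model `step_batch`,
# a pre-trained value function `value_fn`, and a pre-trained policy
# `policy_prob` returning action probabilities pi(a|s).
import numpy as np

N_ACTIONS = 4
GAMMA = 0.99
LAMBDA_PENALTY = 1.0  # weight of the under-sampling penalty (assumed form)

def step_batch(states, actions):
    """Toy stand-in for a batched simulator: next states and rewards."""
    next_states = states + actions[:, None] * 0.1       # fake dynamics
    rewards = -np.abs(next_states).sum(axis=1) * 0.01   # fake reward
    return next_states, rewards

def value_fn(states):
    """Toy stand-in for a pre-trained value network V(s)."""
    return -np.abs(states).sum(axis=1)

def policy_prob(states):
    """Toy stand-in for the pre-trained policy's probabilities pi(a|s)."""
    logits = np.stack([-(states.sum(axis=1) - a) ** 2 for a in range(N_ACTIONS)], axis=1)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def batch_bfs_action(root_state, depth):
    """Expand all actions at every depth simultaneously (breadth-first),
    then back up discounted rewards plus penalized leaf values."""
    states = root_state[None, :]   # frontier: (n_nodes, state_dim)
    returns = np.zeros(1)          # discounted reward accumulated along each path
    log_pi = np.zeros(1)           # log-prob of each path under the pre-trained policy
    for d in range(depth):
        n = states.shape[0]
        probs = policy_prob(states)                          # (n, A)
        # replicate every frontier node once per action: one batched call per depth
        states = np.repeat(states, N_ACTIONS, axis=0)        # (n*A, state_dim)
        actions = np.tile(np.arange(N_ACTIONS), n)           # (n*A,)
        returns = np.repeat(returns, N_ACTIONS)
        log_pi = np.repeat(log_pi, N_ACTIONS) + np.log(probs.reshape(-1) + 1e-12)
        states, rewards = step_batch(states, actions)
        returns = returns + (GAMMA ** d) * rewards
    # leaf backup: discounted return plus pre-trained value, minus a penalty on
    # paths the pre-trained policy would rarely sample (hedged, assumed form)
    leaf_scores = returns + (GAMMA ** depth) * value_fn(states) + LAMBDA_PENALTY * log_pi
    best_leaf = int(np.argmax(leaf_scores))
    # leaves are ordered so the root-level action is the most significant "digit"
    return best_leaf // (N_ACTIONS ** (depth - 1))

if __name__ == "__main__":
    s0 = np.array([0.5, -0.3, 0.1])
    print("selected root action:", batch_bfs_action(s0, depth=3))
```

Replicating the entire frontier once per action keeps each tree depth as a single batched call, which is what makes a GPU breadth-first sweep attractive compared with sequential node-by-node expansion.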


Supplemental Material

Available for download: 3540261.3540683_supp.pdf (461 KB), supplemental material (PDF).





Published in

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems

December 2021, 30517 pages

ISBN: 9781713845393

Editors: M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan

Copyright © 2021 Neural Information Processing Systems Foundation, Inc.

Publisher: Curran Associates Inc., Red Hook, NY, United States

Qualifiers: research-article, Research, Refereed limited
