Improve agents without retraining | Proceedings of the 35th International Conference on Neural Information Processing Systems (2024)

Advanced Search
Browse
About
- Sign in
- Register

Advanced Search

nips

research-article

Free Access

Authors:
Assaf Hallak NVIDIA Research

NVIDIA Research
Search about this author

,
Gal Dalal NVIDIA Research

NVIDIA Research
Search about this author

,
Steven Dalton NVIDIA Research

NVIDIA Research
Search about this author

,
Iuri Frosio NVIDIA Research

NVIDIA Research
Search about this author

,
Shie Mannor NVIDIA Research

NVIDIA Research
Search about this author

,
Gal Chechik NVIDIA Research

NVIDIA Research
Search about this author

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing SystemsDecember 2021Article No.: 422Pages 5518–5530

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems

Improve agents without retraining: parallel tree search with off-policy correction

Pages 5518–5530

PreviousChapterNextChapter

ABSTRACT

Tree Search (TS) is crucial to some of the most influential successes in reinforcement learning. Here, we tackle two major challenges with TS that limit its usability: distribution shift and scalability. We first discover and analyze a counter-intuitive phenomenon: action selection through TS and a pre-trained value function often leads to lower performance compared to the original pre-trained agent, even when having access to the exact state and reward in future steps. We show this is due to a distribution shift to areas where value estimates are highly inaccurate and analyze this effect using Extreme Value theory. To overcome this problem, we introduce a novel off-policy correction term that accounts for the mismatch between the pre-trained value and its corresponding TS policy by penalizing under-sampled trajectories. We prove that our correction eliminates the above mismatch and bound the probability of sub-optimal action selection. Our correction significantly improves pre-trained Rainbow agents without any further training, often more than doubling their scores on Atari games. Next, we address the scalability issue given by the computational complexity of exhaustive TS that scales exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first search that advances all nodes in each depth of the tree simultaneously. Batch-BFS reduces runtime by two orders of magnitude and, beyond inference, enables also training with TS of depths that were not feasible before. We train DQN agents from scratch using TS and show improvement in several Atari games compared to both the original DQN and the more advanced Rainbow.

Skip Supplemental Material Section

Supplemental Material

Available for Download

pdf

3540261.3540683_supp.pdf (461 KB)

Supplemental material.

References

How much did alphago zero cost? https://www.yuzeh.com/data/agz-cost.html. Accessed: 2021-05-20.Google Scholar
Paul Serban Agachi, Mircea Vasile Cristea, Alexandra Ana Csavdari, and Botond Szilagyi. 2. Model predictive control. De Gruyter, 2016.Google ScholarCross Ref
JM Blair, CA Edwards, and JH Johnson. Rational chebyshev approximations for the inverse of the error function. Mathematics of Computation, 30(136):827–830, 1976.Google ScholarCross Ref
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.Google Scholar
Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.Google ScholarCross Ref
Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. arXiv preprint arXiv:1807.01675, 2018.Google Scholar
Francesco Paolo Cantelli. Sui confini della probabilita. In Atti del Congresso Internazionale dei Matematici: Bologna del 3 al 10 de settembre di 1928, pages 47–60, 1929.Google Scholar
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning, 2014. cite arxiv:1410.0759.Google Scholar
Stuart Coles, Joanna Bawa, Lesley Trenner, and Pat Dorazio. An introduction to statistical modeling of extreme values, volume 208. Springer, 2001.Google ScholarCross Ref
Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006.Google ScholarDigital Library
Steven Dalton, Iuri Frosio, and Michael Garland. Accelerating reinforcement learning through gpu atari emulation. arXiv preprint arXiv:1907.08467, 2019, BSD 3-Clause "New" or "Revised" License.Google Scholar
Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Beyond the one-step greedy approach in reinforcement learning. In International Conference on Machine Learning, pages 1387–1396. PMLR, 2018.Google Scholar
Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Multiple-step greedy policies in approximate and online reinforcement learning. In NeurIPS, 2018.Google Scholar
Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. How to combine tree-search methods in reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2019), 2019.Google ScholarDigital Library
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pages 1407–1416. PMLR, 2018.Google Scholar
Jesse Farebrother, Marlos C Machado, and Michael Bowling. Generalization and regularization in dqn. arXiv preprint arXiv:1810.00123, 2018.Google Scholar
Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.Google Scholar
Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.Google Scholar
David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.Google Scholar
Peter Hart, Nils Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.Google ScholarCross Ref
Hado Hasselt. Double q-learning. Advances in neural information processing systems, 23:2613–2621, 2010.Google Scholar
Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.Google ScholarCross Ref
Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.Google Scholar
Kaixhin. Rainbow Pre-trained Agents v1.3, 2019, MIT License.Google Scholar
Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195–206. PMLR, 2017.Google Scholar
Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to simulate dynamic environments with gamegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1231–1240, 2020.Google ScholarCross Ref
Rahul G Krishnan, Uri Shalit, and David Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.Google Scholar
Steven M LaValle et al. Rapidly-exploring random trees: A new tool for path planning. Technical Report. Computer Science Department, Iowa State University, 1998.Google Scholar
Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, Joelle Pineau, and Kee-Eung Kim. Optidice: Offline policy optimization via stationary distribution correction estimation. arXiv preprint arXiv:2106.10783, 2021.Google Scholar
Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nuttapong Chentanez, Miles Macklin, and Dieter Fox. Gpu-accelerated robotic simulation for distributed reinforcement learning. In Conference on Robot Learning, pages 270–282. PMLR, 2018.Google Scholar
Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.Google ScholarCross Ref
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.Google Scholar
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.Google Scholar
Thomas M Moerland, Anna Deichler, Simone Baldi, Joost Broekens, and Catholijn M Jonker. Think too fast nor too slow: The computational trade-off between planning and reinforcement learning. arXiv preprint arXiv:2005.07404, 2020.Google Scholar
Remi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.Google ScholarDigital Library
Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.Google Scholar
NVIDIA. Isaac Gym preview release, 2021.Google Scholar
Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Actionconditional video prediction using deep networks in atari games. arXiv preprint arXiv:1507.08750, 2015.Google Scholar
Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.Google Scholar
John Quan and Georg Ostrovski. DQN Zoo: Reference implementations of DQN-based agents, 2020, Apache License 2.0.Google Scholar
Sébastien Racanière, Théophane Weber, David P Reichert, Lars Buesing, Arthur Guez, Danilo Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5694–5705, 2017.Google Scholar
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.Google ScholarCross Ref
Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, and David Silver. Online and offline reinforcement learning by planning with a learned model. arXiv preprint arXiv:2104.06294, 2021.Google Scholar
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.Google Scholar
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.Google Scholar
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.Google ScholarDigital Library
Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and Evangelos A Theodorou. Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721. IEEE, 2017.Google Scholar
Mario Zanon and Sébastien Gros. Safe reinforcement learning using robust mpc. IEEE Transactions on Automatic Control, 2020.Google Scholar

Cited By

View all

Recommendations

Team formation with learning agents that improve coordination
AAMAS '14: Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems
Learning agents increase their team's performance by learning to coordinate better with their teammates, and we are interested in forming teams that contain such learning agents. In particular, we consider finite training instances for learning agents ...
Read More
Reinforcement learning without rewards
Read More
Agents Teaching Agents: A Survey on Inter-agent Transfer Learning
AAMAS '20: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems
While reinforcement learning (RL) has helped artificial agents solve challenging tasks, high sample complexity is still a major concern. Inter-agent teaching -- endowing agents with the ability to respond to instructions from others -- has been ...
Read More

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Information
Contributors

Published in
NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems
December 2021
30517 pages
ISBN:9781713845393
Editors:
M. Ranzato,
A. Beygelzimer,
Y. Dauphin,
P.S. Liang,
J. Wortman Vaughan
Copyright © 2021 Neural Information Processing Systems Foundation, Inc.
Sponsors
In-Cooperation
Publisher
Curran Associates Inc.
Red Hook, NY, United States
Publication History
- Published: 10 June 2024
Qualifiers
- research-article
- Research
- Refereed limited
Conference
Funding Sources
Other Metrics
View Article Metrics

Bibliometrics
Citations0

Article Metrics
- Total Citations
  View Citations
- Total Downloads
- Downloads (Last 12 months)0
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

Improve agents without retraining | Proceedings of the 35th International Conference on Neural Information Processing Systems (2024)

New Citation Alert added!

New Citation Alert!

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems

ABSTRACT

Supplemental Material

Available for Download

References

Cited By

Recommendations

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

Digital Edition

Caption

Export Citations