End-to-end reinforcement learning

  1. History

  2. Function emergence

  3. References

In end-to-end reinforcement learning, the entire process from sensors to motors in a robot or agent is handled by a single, layered or recurrent neural network without modularization{{citation needed|date=February 2019}}. The network is trained by reinforcement learning (RL).[1] The approach had been proposed long before,[2][3] but was reenergized by the successful results of Google DeepMind in learning to play Atari video games (2013–15)[4][5][6][7] and with AlphaGo (2016).[8] Unlike supervised learning, the approach does not require sample data that are labeled, usually manually{{clarify|date=February 2019}}.

RL traditionally required an explicit design of the state space and action space, while only the mapping from states to actions was learned.[9] RL was therefore limited to learning the action policy: before learning, human designers had to specify how the state space is constructed from sensor signals and how motion commands are generated for each action. Neural networks have often been used in RL to provide non-linear function approximation and avoid the curse of dimensionality.[9] Recurrent neural networks have also been employed, mainly to cope with perceptual aliasing in partially observable Markov decision processes (POMDPs).[10][11][12][13][14]
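The classic setting described above, with a hand-designed state space and action space, can be sketched with a minimal tabular Q-learning loop. This is an illustrative example, not code from any of the cited systems; the grid world, reward, and hyperparameters are all invented for the sketch. End-to-end RL replaces the hand-designed state (the cell index here) with raw sensor signals fed to a neural network.

```python
import random

# Illustrative sketch (not from the cited systems): tabular Q-learning on a
# 1-D grid world, showing the classic setting in which both the state space
# (cell index) and the action space (move left / right) are hand-designed.
N_STATES = 5                      # cells 0..4, goal at cell 4
ACTIONS = [-1, +1]                # -1 = left, +1 = right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.3

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0   # reward only at the goal
    return s2, reward

random.seed(0)
for _ in range(500):
    s = 0
    while s != N_STATES - 1:
        if random.random() < EPS:                 # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, x)] for x in ACTIONS) - Q[(s, a)])
        s = s2

# The learned greedy policy moves right toward the goal from every cell.
print(all(Q[(s, +1)] > Q[(s, -1)] for s in range(N_STATES - 1)))  # → True
```

Swapping the dictionary `Q` for a neural network that maps raw observations to action values is what turns this into the function-approximation setting discussed above.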

End-to-end RL extends RL from learning only actions to learning the entire process from sensors to motors, including higher-level functions that are difficult to develop independently of other functions. Higher-level functions connect directly to neither sensors nor motors, so even specifying their inputs and outputs is difficult.

History

The approach originated in TD-Gammon (1992).[15] In backgammon, the evaluation of the game situation during self-play was learned through TD(λ) using a layered neural network. Four inputs were used to encode the number of pieces of a given color at a given board location, for a total of 198 input signals. With zero knowledge built in, the network learned to play the game at an intermediate level.
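The TD(λ) rule behind TD-Gammon can be sketched in a simplified, tabular form. This sketch uses an invented 5-state random-walk task rather than backgammon, and a value table rather than TD-Gammon's layered neural network; only the learning rule itself (TD error plus eligibility traces) is the point.

```python
import random

# Simplified tabular TD(lambda) with accumulating eligibility traces.
# Illustrative only: TD-Gammon applied this rule with a layered neural
# network as the value function and backgammon positions as states.
# Task here: 5-state random walk; episodes end off the left edge
# (reward 0) or off the right edge (reward 1).
N, ALPHA, GAMMA, LAM = 5, 0.05, 1.0, 0.8
V = [0.5] * N                     # value estimate per state

random.seed(1)
for _ in range(3000):
    e = [0.0] * N                 # eligibility traces, reset each episode
    s = N // 2                    # start in the middle
    while True:
        s2 = s + random.choice([-1, 1])
        done = s2 < 0 or s2 >= N
        r = 1.0 if s2 >= N else 0.0
        # TD error: delta = r + gamma * V(s') - V(s)
        delta = r + (0.0 if done else GAMMA * V[s2]) - V[s]
        e[s] += 1.0               # mark the visited state as eligible
        for i in range(N):        # credit all recently visited states
            V[i] += ALPHA * delta * e[i]
            e[i] *= GAMMA * LAM   # traces decay by gamma * lambda
        if done:
            break
        s = s2

# True values are (s+1)/6, so estimates should rise from left to right.
print(V[0] < V[2] < V[4])
```

With λ = 0, only the current state is updated (one-step TD); with λ = 1, credit spreads to the whole episode, as in Monte Carlo methods.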

Shibata's group began working with this framework in 1997.[16][3] They employed Q-learning and actor-critic methods for continuous-motion tasks,[17] and used a recurrent neural network for tasks requiring memory.[18] They applied the framework to several real-robot tasks[17][19] and demonstrated the learning of various functions.

Beginning around 2013, Google DeepMind showed impressive learning results in video games[4][5] and the game of Go (AlphaGo).[8] They used a deep convolutional neural network, an architecture that had shown superior results in image recognition, taking four frames of almost raw RGB pixels (84×84) as input. The network was trained by RL, with the reward representing the sign of the change in the game score. All 49 games were learned with the same network architecture and Q-learning with minimal prior knowledge; the system outperformed competing methods on almost all the games and performed at a level comparable or superior to a professional human game tester.[5] It is sometimes called the Deep Q-Network (DQN). In AlphaGo, deep neural networks were trained not only by reinforcement learning but also by supervised learning, and were combined with Monte Carlo tree search.[8]
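Two of the preprocessing ideas described above can be sketched in a few lines: clipping the reward to the sign of the score change, and stacking the last four 84×84 frames as the network input. This is an illustrative sketch, not DeepMind's code; the frame values and padding strategy are invented for the example.

```python
from collections import deque

def clip_reward(score_change):
    """Map any score change to -1, 0, or +1, as in the DQN Atari setup."""
    return (score_change > 0) - (score_change < 0)

frames = deque(maxlen=4)          # the 4 most recent preprocessed frames

def observe(frame):
    """Return a 4-frame stack as the network's state input."""
    frames.append(frame)
    while len(frames) < 4:        # pad a fresh episode by repeating the frame
        frames.append(frame)
    return list(frames)

# A blank 84x84 "frame" stands in for a preprocessed Atari screen.
state = observe([[0.0] * 84 for _ in range(84)])
print(len(state), len(state[0]), len(state[0][0]), clip_reward(-300))  # → 4 84 84 -1
```

Clipping rewards to {-1, 0, +1} is what let one set of hyperparameters work across all 49 games despite their very different score scales; frame stacking gives the feedforward network short-term motion information it could not otherwise see.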

Function emergence

Shibata's group showed that various functions emerge in this framework, including:[3]

  • Image recognition
  • Color constancy (optical illusion)
  • Sensor motion (active recognition)
  • Hand-eye coordination and hand reaching movement
  • Explanation of brain activities
  • Knowledge transfer
  • Memory
  • Selective attention
  • Prediction
  • Exploration

Communication between agents has also been established in this framework. Modes include:[20]

  • Dynamic communication (negotiation)
  • Binarization of signals
  • Grounded communication using a real robot and camera

References

1. ^{{cite speech |last1=Hassabis |first1=Demis | date=March 11, 2016 |title= Artificial Intelligence and the Future |url= https://www.youtube.com/watch?v=8Z2eLTSCuBk}}
2. ^{{cite book |last=Shibata |first=Katsunari |editor-last=Mellouk |editor-first=Abdelhamid |title=Advances in Reinforcement Learning |publisher=Intech |date=January 14, 2011 |pages=99–120 |chapter=Chapter 6: Emergence of Intelligence through Reinforcement Learning with a Neural Network |url= http://www.intechopen.com/books/advances-in-reinforcement-learning |chapterurl= http://www.intechopen.com/books/advances-in-reinforcement-learning/emergence-of-intelligence-through-reinforcement-learning-with-a-neural-network |isbn=978-953-307-369-9}}
3. ^{{cite arXiv |last=Shibata |first=Katsunari |title=Functions that Emerge through End-to-End Reinforcement Learning | date=March 7, 2017 |eprint=1703.02239 }}
4. ^{{cite conference |first= Volodymyr|display-authors=etal|last= Mnih |date=December 2013 |title= Playing Atari with Deep Reinforcement Learning |url= https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf |conference= NIPS Deep Learning Workshop 2013}}
5. ^{{cite journal |first= Volodymyr|display-authors=etal|last= Mnih |year=2015 |title= Human-level control through deep reinforcement learning |url= http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html |journal=Nature|volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236|bibcode=2015Natur.518..529M }}
6. ^{{cite video |people= V. Mnih|display-authors=etal| date=26 February 2015 |title= Performance of DQN in the Game Space Invaders |url= http://www.nature.com/nature/journal/v518/n7540/extref/nature14236-sv1.mov}}
7. ^{{cite video |people= V. Mnih|display-authors=etal| date=26 February 2015 |title= Demonstration of Learning Progress in the Game Breakout |url= http://www.nature.com/nature/journal/v518/n7540/extref/nature14236-sv2.mov}}
8. ^{{Cite journal|title = Mastering the game of Go with deep neural networks and tree search|url = https://www.nature.com/nature/journal/v529/n7587/full/nature16961.html|journal = Nature| issn= 0028-0836|pages = 484–489|volume = 529|issue = 7587|doi = 10.1038/nature16961|pmid = 26819042|first1 = David|last1 = Silver|author-link1=David Silver (programmer)|first2 = Aja|last2 = Huang|author-link2=Aja Huang|first3 = Chris J.|last3 = Maddison|first4 = Arthur|last4 = Guez|first5 = Laurent|last5 = Sifre|first6 = George van den|last6 = Driessche|first7 = Julian|last7 = Schrittwieser|first8 = Ioannis|last8 = Antonoglou|first9 = Veda|last9 = Panneershelvam|first10= Marc|last10= Lanctot|first11= Sander|last11= Dieleman|first12=Dominik|last12= Grewe|first13= John|last13= Nham|first14= Nal|last14= Kalchbrenner|first15= Ilya|last15= Sutskever|author-link15=Ilya Sutskever|first16= Timothy|last16= Lillicrap|first17= Madeleine|last17= Leach|first18= Koray|last18= Kavukcuoglu|first19= Thore|last19= Graepel|first20= Demis |last20=Hassabis|author-link20=Demis Hassabis|date= 28 January 2016|bibcode = 2016Natur.529..484S|accessdate=10 December 2017}}{{closed access}}
9. ^{{cite book | last1 = Sutton | first1 = Richard S. |last2=Barto |first2=Andrew G. | title = Reinforcement Learning: An Introduction | publisher = MIT Press | year = 1998 | isbn = 978-0262193986}}
10. ^{{cite conference |first1= Long-Ji |last1= Lin |first2= Tom M. |last2= Mitchell |year=1993 |title= Reinforcement Learning with Hidden States | journal=From Animals to Animats |volume=2 | pages=271–280 }}
11. ^{{cite conference |first1= Ahmet |last1= Onat |first2= Hajime|display-authors=etal|last2= Kita |year=1998 |title= Q-learning with Recurrent Neural Networks as a Controller for the Inverted Pendulum Problem | conference=The 5th International Conference on Neural Information Processing (ICONIP) |pages=837–840}}
12. ^{{cite conference |first1= Ahmet |last1= Onat |first2= Hajime|display-authors=etal|last2= Kita |year=1998 |title= Recurrent Neural Networks for Reinforcement Learning: Architecture, Learning Algorithms and Internal Representation |url=http://ieeexplore.ieee.org/document/687168/ |conference=International Joint Conference on Neural Networks (IJCNN) |pages=2010–2015}}
13. ^{{cite conference |first1= Bram |last1= Bakker |first2= Fredrik|display-authors=etal|last2= Linaker |year=2002 |title= Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction |url=ftp://ftp.idsia.ch/pub/juergen/bakkeriros2002.pdf| conference= 2002 IEEE/RSJ International Conference on. Intelligent Robots and Systems (IROS) |pages=938–943}}
14. ^{{cite conference |first1= Bram |last1= Bakker |first2= Viktor|display-authors=etal|last2= Zhumatiy |year=2003 |title= A Robot that Reinforcement-Learns to Identify and Memorize Important Previous Observation |url=ftp://ftp.idsia.ch/pub/juergen/bakkeriros2003.pdf| conference= 2003 IEEE/RSJ International Conference on. Intelligent Robots and Systems (IROS) |pages=430–435}}
15. ^{{cite journal | url=http://www.bkgm.com/articles/tesauro/tdl.html | title=Temporal Difference Learning and TD-Gammon | date=March 1995 | last=Tesauro | first=Gerald | journal=Communications of the ACM | volume=38 | issue=3 | doi = 10.1145/203330.203343 | pages=58–68}}
16. ^{{cite conference |first1= Katsunari |last1= Shibata |first2= Yoichi |last2= Okabe |year=1997 |title= Reinforcement Learning When Visual Sensory Signals are Directly Given as Inputs |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/ICNN97.pdf |conference= International Conference on Neural Networks (ICNN) 1997}}
17. ^{{cite conference |first1= Katsunari |last1= Shibata |first2= Masaru |last2= Iida |year=2003 |title= Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/SICE03.pdf |conference= SICE Annual Conference 2003}}
18. ^{{cite conference |first1= Hiroki |last1= Utsunomiya |first2= Katsunari |last2= Shibata |year=2008 |title= Contextual Behavior and Internal Representations Acquired by Reinforcement Learning with a Recurrent Neural Network in a Continuous State and Action Space Task |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP98Utsunomiya.pdf |conference= International Conference on Neural Information Processing (ICONIP) '08}}
19. ^{{cite conference |first1= Katsunari |last1= Shibata |first2= Tomohiko |last2= Kawano |year=2008 |title= Learning of Action Generation from Raw Camera Images in a Real-World-like Environment by Simple Coupling of Reinforcement Learning and a Neural Network |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP98.pdf |conference= International Conference on Neural Information Processing (ICONIP) '08}}
20. ^{{cite arXiv|eprint=1703.03543|first=Katsunari|last=Shibata|title=Communications that Emerge through Reinforcement Learning Using a (Recurrent) Neural Network|date=March 9, 2017}}

Category: Machine learning
