Q-learning

  1. Reinforcement learning

  2. Algorithm

  3. Influence of variables

      Explore vs exploit    Discount factor    Initial conditions (Q0)  

  4. Implementation

      Function approximation    Quantization  

  5. History

  6. Variants

      Deep Q-learning    Double Q-learning    Others  

  7. See also

  8. References

  9. External links

{{Machine learning bar}}

Q-learning is a model-free reinforcement learning algorithm. The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances. It does not require a model (hence the connotation "model-free") of the environment, and it can handle problems with stochastic transitions and rewards, without requiring adaptations.

For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state.[1] Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy.[1] "Q" names the function that returns the reward used to provide the reinforcement and can be said to stand for the "quality" of an action taken in a given state.[2]

Reinforcement learning

Reinforcement learning involves an agent, a set of states {{tmath|S}}, and a set {{tmath|A}} of actions per state. By performing an action {{tmath|a \in A}}, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).

The goal of the agent is to maximize its total (future) reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state.

As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train doors as soon as they open, minimizing your initial wait time. If the train is crowded, however, you will then have a slow entry after the initial action of entering the doors, as people fight past you to leave the train while you attempt to board. The total boarding time, or cost, is then:

  • 0 seconds wait time + 15 seconds fight time

On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is then spent fighting departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now:

  • 5 seconds wait time + 0 seconds fight time.

Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy.

Algorithm

The weight for a step from a state {{tmath|\Delta t}} steps into the future is calculated as {{tmath|\gamma^{\Delta t} }}. {{tmath|\gamma}} (the discount factor) is a number between 0 and 1 ({{tmath|0 \le \gamma \le 1}}) and has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). {{tmath|\gamma}} may also be interpreted as the probability to succeed (or survive) at every step {{tmath|\Delta t}}.

The algorithm, therefore, has a function that calculates the quality of a state–action combination:

:<math>Q: S \times A \to \mathbb{R}.</math>

Before learning begins, {{tmath|Q}} is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time {{tmath|t}} the agent selects an action {{tmath|a_t}}, observes a reward {{tmath|r_t}}, enters a new state {{tmath|s_{t+1} }} (that may depend on both the previous state {{tmath|s_t}} and the selected action), and {{tmath|Q}} is updated. The core of the algorithm is a simple value iteration update, using the weighted average of the old value and the new information:

:<math>Q^{new}(s_t, a_t) \leftarrow (1 - \alpha) \cdot Q(s_t, a_t) + \alpha \cdot \left( r_t + \gamma \cdot \max_{a} Q(s_{t+1}, a) \right),</math>

where {{tmath|r_t}} is the reward received when moving from the state {{tmath|s_t}} to the state {{tmath|s_{t+1} }}, and {{tmath|\alpha}} is the learning rate ({{tmath|0 < \alpha \le 1}}).

An episode of the algorithm ends when state {{tmath|s_{t+1} }} is a final or terminal state. However, Q-learning can also learn in non-episodic tasks.{{citation needed|date=December 2017}} If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

For all final states {{tmath|s_f}}, {{tmath|Q(s_f, a)}} is never updated, but is set to the reward value {{tmath|r}} observed for state {{tmath|s_f}}. In most cases, {{tmath|Q(s_f, a)}} can be taken to equal zero.
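
The update rule above maps directly onto a short table-based procedure. The following is a minimal sketch in Python, not an implementation from the cited sources: the environment object with its `reset()` and `step(action)` methods, the integer state/action encoding, and the hyperparameter values are all illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np


def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning sketch.

    Assumes a hypothetical environment with ``reset() -> state`` and
    ``step(action) -> (next_state, reward, done)``, where states are
    integers in [0, n_states) and actions are integers in [0, n_actions).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))  # arbitrary fixed initial values

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (occasional exploration)
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, done = env.step(action)

            # Weighted average of the old value and the new information
            # r_t + gamma * max_a Q(s_{t+1}, a); terminal states contribute
            # only their reward, matching the convention described above.
            target = reward if done else reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (target - Q[state, action])

            state = next_state
    return Q
</syntaxhighlight>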

Influence of variables

Explore vs exploit

The learning rate or step size {{tmath|\alpha}} determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of {{tmath|\alpha_t {{=}} 1}} is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as {{tmath|\alpha_t {{=}} 0.1}} for all {{tmath|t}}.[3]
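
One way to meet the decreasing-learning-rate condition is to decay {{tmath|\alpha}} with the number of times each state–action pair has been visited. The sketch below uses {{tmath|\alpha_t {{=}} 1/(1+n)}}, where {{tmath|n}} counts visits to the pair; this is a common illustrative choice, not one prescribed by the cited sources.

<syntaxhighlight lang="python">
from collections import defaultdict

# Visit-count-based learning-rate schedule (a sketch).  The choice
# alpha = 1 / (1 + visits) decays to zero while its sum over time diverges
# and the sum of its squares converges, the usual technical requirement
# for convergence in stochastic problems.
visit_count = defaultdict(int)


def learning_rate(state, action):
    visit_count[(state, action)] += 1
    return 1.0 / (1.0 + visit_count[(state, action)])
</syntaxhighlight>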

Discount factor

The discount factor {{tmath|\gamma}} determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. {{tmath|r_t}} (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For {{tmath|\gamma {{=}} 1}}, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite.[4] Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network.[5] In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.[6]
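
For context, the quantity being discounted is the return, the weighted sum of future rewards mentioned in the introduction (a standard definition, restated here for clarity):

:<math>G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}.</math>

For rewards bounded in absolute value by {{tmath|r_{\max} }} and {{tmath|\gamma < 1}}, this sum is at most {{tmath|r_{\max}/(1-\gamma)}}; at {{tmath|\gamma {{=}} 1}} the bound disappears, which is why the undiscounted values can diverge when no terminal state is reached.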

Initial conditions (Q0)

Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions",[7] can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternatives, thus increasing their choice probability. The first reward {{tmath|r}} can be used to reset the initial conditions.[8] According to this idea, the first time an action is taken the reward is used to set the value of {{tmath|Q}}. This allows immediate learning in case of fixed deterministic rewards. A model that incorporates reset of initial conditions (RIC) is expected to predict participants' behavior better than a model that assumes any arbitrary initial condition (AIC).[8] RIC seems to be consistent with human behaviour in repeated binary choice experiments.[8]
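
A minimal sketch of optimistic initialization, under the assumption that rewards are known to lie in {{tmath|[0, r_{\max}]}} and that {{tmath|\gamma < 1}} (both assumptions made for illustration): every entry starts above the best attainable return, so untried actions look at least as good as tried ones, and a greedy agent keeps exploring until the optimism is washed out by real updates.

<syntaxhighlight lang="python">
import numpy as np


def optimistic_q_table(n_states, n_actions, r_max=1.0, gamma=0.9):
    # Upper bound on any discounted return when rewards lie in [0, r_max]
    # and gamma < 1: r_max / (1 - gamma).  Initializing every entry at this
    # bound makes untried actions appear no worse than any tried one.
    q_init = r_max / (1.0 - gamma)
    return np.full((n_states, n_actions), q_init)
</syntaxhighlight>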

Implementation

Q-learning at its simplest stores data in tables. This approach falters as the number of states and actions grows, since the likelihood of the agent visiting a particular state and performing a particular action there becomes increasingly small.

Function approximation

Q-learning can be combined with function approximation.[9] This makes it possible to apply the algorithm to larger problems, even when the state space is continuous.

One solution is to use an (adapted) artificial neural network as a function approximator.[10] Function approximation may speed up learning in finite problems, because the algorithm can generalize earlier experiences to previously unseen states.
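
As one concrete illustration (a semi-gradient sketch under assumed interfaces, not a method prescribed by the cited sources), the table can be replaced by a linear model over a user-supplied feature function `phi(state, action)` that returns a fixed-length NumPy vector:

<syntaxhighlight lang="python">
import numpy as np


def linear_q_update(w, phi, state, action, reward, next_state, done,
                    actions, alpha=0.01, gamma=0.9):
    """One semi-gradient Q-learning update with a linear approximator.

    Q(s, a) is approximated as the dot product w . phi(s, a); ``w`` is a
    NumPy weight vector and ``phi`` a feature function (both assumed).
    """
    q_sa = w @ phi(state, action)
    if done:
        target = reward
    else:
        target = reward + gamma * max(w @ phi(next_state, a) for a in actions)
    # The gradient of the linear approximation with respect to w is simply
    # phi(s, a), so the update moves w toward the bootstrapped target.
    w += alpha * (target - q_sa) * phi(state, action)
    return w
</syntaxhighlight>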

Quantization

Another technique to decrease the state/action space quantizes possible values. Consider the example of learning to balance a stick on a finger. Describing a state at a certain point in time involves the position of the finger in space, its velocity, the angle of the stick and the angular velocity of the stick. This yields a four-element vector that describes one state, i.e. a snapshot of one state encoded into four values. The problem is that infinitely many possible states are present. To shrink the possible space of valid states and actions, multiple values can be assigned to a bucket: the exact distance of the finger from its starting position (−Infinity to Infinity) is not recorded, but only whether it is far away or not (Near, Far).
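
A minimal sketch of such bucketing for the stick-balancing example; the bucket edges below are illustrative assumptions chosen only to show the mechanism:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative bucket edges for (finger position, finger velocity,
# stick angle, stick angular velocity).
BUCKET_EDGES = [
    np.array([-1.0, 0.0, 1.0]),   # position
    np.array([-0.5, 0.0, 0.5]),   # velocity
    np.array([-0.2, 0.0, 0.2]),   # angle
    np.array([-1.0, 0.0, 1.0]),   # angular velocity
]


def discretize(observation):
    """Map a continuous 4-element observation to a tuple of bucket indices."""
    return tuple(int(np.digitize(x, edges))
                 for x, edges in zip(observation, BUCKET_EDGES))


# A state such as "near the start, slow, slightly tilted, turning slowly"
# becomes a small hashable key usable as an index into a tabular Q function.
state_key = discretize([0.1, -0.05, 0.03, 0.2])
</syntaxhighlight>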

History

Q-learning was introduced by Watkins[11] in 1989. A convergence proof was presented by Watkins and Dayan[12] in 1992. A more detailed mathematical proof was given by Tsitsiklis[13] in 1994, and by Bertsekas and Tsitsiklis in their 1996 Neuro-Dynamic Programming book.[14]

Watkins was addressing “learning from delayed rewards”, the title of his PhD thesis. Eight years earlier, in 1981, the same problem, under the name of “delayed reinforcement learning”, was solved by Bozinovski's Crossbar Adaptive Array (CAA).[15][16] The memory matrix W = ||w(a,s)|| was the same as the Q-table of Q-learning introduced eight years later. The architecture introduced the term “state evaluation” in reinforcement learning. The crossbar learning algorithm, written in mathematical pseudocode in the paper, performs the following computation in each iteration:

  • In state s perform action a;
  • Receive consequence state s’;
  • Compute state evaluation v(s’);
  • Update crossbar value w’(a,s) = w(a,s) + v(s’).

The term “secondary reinforcement” is borrowed from animal learning theory, to model state values via backpropagation: the state value v(s’) of the consequence situation is backpropagated to the previously encountered situations. CAA computes state values vertically and actions horizontally (the "crossbar"). Demonstration graphs showing delayed reinforcement learning contained states (desirable, undesirable, and neutral states), which were computed by the state evaluation function. This learning system was a forerunner of the Q-learning algorithm.[17]

In 2014 Google DeepMind patented[18] an application of Q-learning to deep learning, titled "deep reinforcement learning" or "deep Q-learning" that can play Atari 2600 games at expert human levels.

Variants

Deep Q-learning

The DeepMind system used a deep convolutional neural network, with layers of tiled convolutional filters to mimic the effects of receptive fields. Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and the data distribution, and the correlations between Q and the target values.

The technique used experience replay, a biologically inspired mechanism that uses a random sample of prior actions instead of the most recent action to proceed.[2] This removes correlations in the observation sequence and smooths changes in the data distribution. Iterative update adjusts Q towards target values that are only periodically updated, further reducing correlations with the target.[19]
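
A minimal sketch of the replay buffer behind experience replay (the capacity and interface are assumptions made for illustration; the DQN-specific details are in the cited paper): transitions are stored as they occur, and training batches are drawn uniformly at random, which breaks the correlations between consecutive observations.

<syntaxhighlight lang="python">
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity=10000, seed=0):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates consecutive observations.
        return self.rng.sample(list(self.buffer), batch_size)
</syntaxhighlight>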

Double Q-learning

Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as the current action-selection policy, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A variant called double Q-learning was proposed to correct this. Double Q-learning[20] is an off-policy reinforcement learning algorithm, where a different policy is used for value evaluation than the one used to select the next action.

In practice, two separate value functions, {{tmath|Q^A}} and {{tmath|Q^B}}, are trained in a mutually symmetric fashion using separate experiences. The double Q-learning update step is then as follows:

:<math>Q^A_{t+1}(s_t, a_t) = Q^A_t(s_t, a_t) + \alpha_t(s_t, a_t) \left( r_t + \gamma \, Q^B_t\!\left(s_{t+1}, \arg\max_{a} Q^A_t(s_{t+1}, a)\right) - Q^A_t(s_t, a_t) \right),</math> and

:<math>Q^B_{t+1}(s_t, a_t) = Q^B_t(s_t, a_t) + \alpha_t(s_t, a_t) \left( r_t + \gamma \, Q^A_t\!\left(s_{t+1}, \arg\max_{a} Q^B_t(s_{t+1}, a)\right) - Q^B_t(s_t, a_t) \right).</math>

Now the estimated value of the discounted future is evaluated using a different policy, which solves the overestimation issue.
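
A tabular sketch of the update pair above (the hyperparameters and the integer state/action encoding are assumptions, as before): a coin flip decides which table is updated, the updated table selects the next-state action, and the other table evaluates it.

<syntaxhighlight lang="python">
import numpy as np


def double_q_update(QA, QB, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.9, rng=None):
    """One tabular double Q-learning step on two NumPy Q tables."""
    rng = rng if rng is not None else np.random.default_rng()
    # Flip a coin to decide which table is updated; the other table
    # evaluates the action selected by the updated table, decoupling
    # selection from evaluation and reducing overestimation.
    if rng.random() < 0.5:
        update, evaluate = QA, QB
    else:
        update, evaluate = QB, QA

    if done:
        target = reward
    else:
        best = int(np.argmax(update[next_state]))              # selection
        target = reward + gamma * evaluate[next_state, best]   # evaluation
    update[state, action] += alpha * (target - update[state, action])
</syntaxhighlight>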

This algorithm was later combined with deep learning, as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm.[21]

Others

Delayed Q-learning is an alternative implementation of the online Q-learning algorithm, with probably approximately correct (PAC) learning.[22]

Greedy GQ is a variant of Q-learning to use in combination with (linear) function approximation.[23] The advantage of Greedy GQ is that convergence is guaranteed even when function approximation is used to estimate the action values.

See also

  • Reinforcement learning
  • Temporal difference learning
  • SARSA
  • Iterated prisoner's dilemma
  • Game theory

References

1. ^{{Cite paper |last=Melo |first=Francisco S. |title=Convergence of Q-learning: a simple proof |url=http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf}}
2. ^{{Cite web |url=http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/ |title=Demystifying Deep Reinforcement Learning |last=Matiisen |first=Tambet |date=December 19, 2015 |website=neuro.cs.ut.ee |publisher=Computational Neuroscience Lab |language=en-US |access-date=2018-04-06}}
3. ^{{Cite book |url=http://incompleteideas.net/sutton/book/ebook/the-book.html |title=Reinforcement Learning: An Introduction |last=Sutton |first=Richard |last2=Barto |first2=Andrew |date=1998 |publisher=MIT Press}}
4. ^{{Cite book |title=Artificial Intelligence: A Modern Approach |last=Russell |first=Stuart J. |last2=Norvig |first2=Peter |date=2010 |publisher=Prentice Hall |isbn=978-0136042594 |edition=Third |page=649 |author-link=Stuart J. Russell |author-link2=Peter Norvig}}
5. ^{{cite journal|first=Leemon |last=Baird |title=Residual algorithms: Reinforcement learning with function approximation |url=http://www.leemon.com/papers/1995b.pdf |journal=ICML |pp= 30–37 |year=1995}}
6. ^{{cite arxiv|last=François-Lavet|first=Vincent|last2=Fonteneau|first2=Raphael|last3=Ernst|first3=Damien|date=2015-12-07|title=How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies|eprint=1512.02011 |class=cs.LG}}
7. ^{{Cite book |chapter-url=http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html |title=Reinforcement Learning: An Introduction |last=Sutton |first=Richard S. |last2=Barto |first2=Andrew G. |archive-url=https://web.archive.org/web/20130908031737/http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html |archive-date=2013-09-08 |dead-url=yes |access-date=2013-07-18 |chapter=2.7 Optimistic Initial Values}}
8. ^{{Cite journal |last=Shteingart |first=Hanan |last2=Neiman |first2=Tal |last3=Loewenstein |first3=Yonatan |date=May 2013 |title=The role of first impression in operant learning. |url=http://ratio.huji.ac.il/sites/default/files/publications/dp626.pdf |journal=Journal of Experimental Psychology: General |language=en |volume=142 |issue=2 |pages=476–488 |doi=10.1037/a0029550 |issn=1939-2222 |pmid=22924882}}
9. ^{{cite book|chapter-url={{google books |plainurl=y |id=YPjNuvrJR0MC|page=|pp= 207-251}}|title=Reinforcement Learning: State-of-the-Art|editor-last1=Wiering|editor-first1=Marco|editor-last2=Otterlo|editor-first2=Martijn van|date=5 March 2012|publisher=Springer Science & Business Media |first=Hado van |last=Hasselt |chapter=Reinforcement Learning in Continuous State and Action Spaces |pp= 207–251 |isbn=978-3-642-27645-3}}
10. ^{{cite journal|last=Tesauro|first=Gerald|date=March 1995|title=Temporal Difference Learning and TD-Gammon|url=http://www.bkgm.com/articles/tesauro/tdl.html|journal=Communications of the ACM|volume=38|issue=3|pages=58–68|doi=10.1145/203330.203343|id=|accessdate=2010-02-08}}
11. ^{{citation |type=Ph.D. thesis|last=Watkins |first=C.J.C.H. |year=1989 |title=Learning from Delayed Rewards |publisher=Cambridge University|url=http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf}}
12. ^{{cite journal |last=Watkins |first=C.J.C.H. |last2=Dayan |first2=Peter |year=1992 |title=Q-learning |journal=Machine Learning}}
13. ^{{cite journal |last=Tsitsiklis |first=John N. |year=1994 |title=Asynchronous Stochastic Approximation and Q-learning |journal=Machine Learning}}
14. ^{{cite book |last=Bertsekas |first=Dimitri P. |last2=Tsitsiklis |first2=John N. |year=1996 |title=Neuro-Dynamic Programming |publisher=Athena Scientific}}
15. ^{{cite book|editor-last1=Dobnikar|editor-first1=Andrej|editor-last2=Steele|editor-first2=Nigel C.|editor-last3=Pearson|editor-first3=David W.|editor-first4=Rudolf F. |editor-last4=Albrecht|title=Artificial Neural Nets and Genetic Algorithms: Proceedings of the International Conference in Portorož, Slovenia, 1999|chapter-url={{google books |plainurl=y |id=clKwynlfZYkC|page=320-325}}|date=15 July 1999|publisher=Springer Science & Business Media|isbn=978-3-211-83364-3 |first=S. |last=Bozinovski |chapter=Crossbar Adaptive Array: The first connectionist network that solved the delayed reinforcement learning problem|pp=320–325}}
16. ^{{cite book|editor-last=Trappl|editor-first=Robert|title=Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research|chapter-url={{google books |plainurl=y |id=mGtQAAAAMAAJ|page=397}}|year=1982|publisher=North Holland|isbn=978-0-444-86488-8|first=S. |last=Bozinovski |chapter=A self learning system using secondary reinforcement|pp=397–402}}
17. ^{{cite book|editor-last1=Omidvar|editor-first1=Omid|editor-last2=Elliott|editor-first2=David L.|title=Neural Systems for Control|chapter-url={{google books |plainurl=y |id=oLcAiySCow0C}}|date=24 February 1997|publisher=Elsevier|isbn=978-0-08-053739-9|first=A. |last=Barto |chapter=Reinforcement learning}}
18. ^{{cite web|url=https://patentimages.storage.googleapis.com/71/91/4a/c5cf4ffa56f705/US20150100530A1.pdf|title=Methods and Apparatus for Reinforcement Learning, US Patent #20150100530A1|publisher=US Patent Office|date=9 April 2015|accessdate=28 July 2018}}
19. ^{{Cite journal |last=Mnih |first=Volodymyr |last2=Kavukcuoglu |first2=Koray |last3=Silver |first3=David |last4=Rusu |first4=Andrei A. |last5=Veness |first5=Joel |last6=Bellemare |first6=Marc G. |last7=Graves |first7=Alex |last8=Riedmiller |first8=Martin |last9=Fidjeland |first9=Andreas K. |date=Feb 2015 |title=Human-level control through deep reinforcement learning |url=http://www.nature.com/articles/nature14236 |journal=Nature |language=en |volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236 |pmid=25719670 |issn=0028-0836}}
20. ^{{Cite journal |last=van Hasselt |first=Hado |year=2011 |title=Double Q-learning |url=http://papers.nips.cc/paper/3964-double-q-learning |format=PDF |journal=Advances in Neural Information Processing Systems |volume=23 |pages=2613–2622}}
21. ^{{Cite journal |last=van Hasselt |first=Hado |last2=Guez |first2=Arthur |last3=Silver |first3=David |date=2015 |title=Deep reinforcement learning with double Q-learning |url=https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12389/11847 |format=PDF |journal=AAAI Conference on Artificial Intelligence |pages=2094–2100}}
22. ^{{Cite journal |last=Strehl |first=Alexander L. |last2=Li |first2=Lihong |last3=Wiewiora |first3=Eric |last4=Langford |first4=John |last5=Littman |first5=Michael L. |year=2006 |title=Pac model-free reinforcement learning |url=http://research.microsoft.com/pubs/178886/published.pdf |journal=Proc. 22nd ICML |pp=881–888}}
23. ^{{cite web |first1=Hamid |last1=Maei |first2=Csaba |last2=Szepesvári |first3=Shalabh |last3=Bhatnagar |first4=Richard |last4=Sutton |url=https://webdocs.cs.ualberta.ca/~sutton/papers/MSBS-10.pdf |title=Toward off-policy learning control with function approximation in Proceedings of the 27th International Conference on Machine Learning |pp= 719–726|year= 2010}}

External links

  • Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.
  • Strehl, Li, Wiewiora, Langford, Littman (2006). PAC model-free reinforcement learning
  • [https://web.archive.org/web/20050806080008/http://www.cs.ualberta.ca/~sutton/book/the-book.html Reinforcement Learning: An Introduction] by Richard Sutton and Andrew S. Barto, an online textbook. See [https://web.archive.org/web/20081202105235/http://www.cs.ualberta.ca/~sutton/book/ebook/node65.html "6.5 Q-Learning: Off-Policy TD Control"].
  • Piqle: a Generic Java Platform for Reinforcement Learning
  • Reinforcement Learning Maze, a demonstration of guiding an ant through a maze using Q-learning.
  • Q-learning work by Gerald Tesauro
  • [https://gammastorm.github.io/SinglePages/Brain.html JavaScript Example with Reward Driven RNN learning]
  • [https://github.com/gammastorm/gammastorm.github.io/blob/master/myjs/Brain.js A Brain Library]
  • [https://github.com/gammastorm/gammastorm.github.io/blob/master/myjs/SelfGenetics.js A Genetics Library used by the Brain]

Category: Machine learning algorithms
