slides - Cambridge Machine Learning Group
Introduction to Reinforcement Learning
Part 5: Learning (Near)-Optimal Policies
Rowan McAllister
Reinforcement Learning Reading Group, 13 May 2015

Note
I've created these slides whilst following, and using figures from, the "Algorithms for Reinforcement Learning" lectures by Csaba Szepesvári, specifically sections 4.1 - 4.3. The lectures themselves are available on Professor Szepesvári's homepage: http://www.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf
Any errors, please email me: rtm26@cam.ac.uk

Context
P and R known?
- Evaluate a policy π: week 1
- Compute the optimal policy π*: week 1
P and R unknown?
- Evaluate a policy π: weeks 2, 3, 4
- Learn the optimal policy π*: today!

Goal: cumulative reward, the summation of all rewards received whilst learning.
The 'dilemma': explore vs exploit?
- pure exploitation (greedy actions) → might fail to learn better policies
- pure exploration → continually receive mediocre rewards
- need to balance the two...

Simple Exploration Heuristics
ε-greedy:
    π(x, a) = ε · 1/|A| + (1 − ε) · I[a = argmax_{a'} Q_t(x, a')]
Boltzmann:
    π(x, a) = exp(β Q_t(x, a)) / Σ_{a'} exp(β Q_t(x, a'))

Bandits
Bandits are single-state MDPs: |X| = 1.

UCB1
Optimism in the face of uncertainty:
    U_t(a) = Q_t(a) + R · sqrt(2 log t / n_t(a))
    π = argmax_a U_t(a)
where:
- n_t(a): the number of times action a has been selected so far
- Q_t(a): the sample mean of the rewards observed for action a, bounded in [−R, R]

UCB - simple reward
Now assume the goal is simple reward: to optimise the reward obtained after T interactions, i.e. the rewards received up until then do not matter.
Idea: start eliminating actions once you are sufficiently certain they are worse than others.
    U_t(a) = Q_t(a) + R · sqrt(log(2|A|T/δ) / (2t))
    L_t(a) = Q_t(a) − R · sqrt(log(2|A|T/δ) / (2t))
where 0 < δ < 1 is a user-specified probability of failure.
We eliminate action a if U_t(a) < max_{a' ∈ A} L_t(a').
This is a pretty much unbeatable algorithm, apart from constant scaling factors.

Q(uality)-learning
Algorithm: Tabular Q-learning (called after each transition)
1: Input: X (last state), A (last action), R (reward), Y (next state), Q (current action-value estimates)
2: δ ← R + γ · max_a Q[Y, a] − Q[X, A]
3: Q[X, A] ← Q[X, A] + α · δ
4: return Q
Pros:
- Simple!
- An "off-policy" algorithm (meaning: it can use an arbitrary sampling policy, provided every state-action pair is visited infinitely often).
- Developed here in the Cambridge Engineering Dept!

Q-learning with Linear Approximation
Algorithm: Q-learning with Linear Approximation (called after each transition)
1: Input: X (last state), A (last action), R (reward), Y (next state), θ (parameters)
2: δ ← R + γ · max_a θᵀφ[Y, a] − θᵀφ[X, A]
3: θ ← θ + α · δ · φ[X, A], where φ[X, A] = ∇_θ Q_θ(X, A)
4: return θ
Note this is TD(0) when there is only one action!
where
    Q_θ(x, a) = θᵀφ(x, a), x ∈ X, a ∈ A
    θ ∈ R^d (weights)
    φ : X × A → R^d (basis functions)
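
To make the exploration formulas above concrete, here is a minimal Python sketch of the ε-greedy and Boltzmann policies. The function names and the representation of Q_t(x, ·) as a 1-D NumPy array are my own illustration, not code from the slides.

    import numpy as np

    # Illustrative sketch, not code from the slides.

    def epsilon_greedy_probs(q_row, epsilon):
        """pi(x, .) for epsilon-greedy, given Q_t(x, .) as a 1-D array over actions."""
        n_actions = len(q_row)
        probs = np.full(n_actions, epsilon / n_actions)   # spread epsilon mass uniformly: epsilon * 1/|A|
        probs[np.argmax(q_row)] += 1.0 - epsilon          # remaining (1 - epsilon) mass on the greedy action
        return probs

    def boltzmann_probs(q_row, beta):
        """pi(x, .) for the Boltzmann policy, proportional to exp(beta * Q_t(x, a))."""
        z = beta * np.asarray(q_row, dtype=float)
        z -= z.max()                                      # shift for numerical stability; ratios are unchanged
        expz = np.exp(z)
        return expz / expz.sum()

An action is then sampled with, e.g., np.random.choice(len(q_row), p=probs).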
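
In the same spirit, a sketch of the UCB1 index and of the elimination test used in the simple-reward setting. The function names, and the assumption that every still-active action has been pulled t times each, are mine rather than the slides'.

    import numpy as np

    # Illustrative sketch, not code from the slides.

    def ucb1_index(q_hat, n_pulls, t, reward_bound):
        """U_t(a) = Q_t(a) + R * sqrt(2 log t / n_t(a)); the agent plays the argmax of this index."""
        return q_hat + reward_bound * np.sqrt(2.0 * np.log(t) / n_pulls)

    def eliminate(q_hat, active, t, horizon, delta, reward_bound):
        """Drop every active action whose upper bound U_t(a) falls below the best lower
        bound L_t(a') among the active actions. q_hat holds the sample means, active is
        a boolean mask over actions, horizon is T, and delta is the failure probability."""
        width = reward_bound * np.sqrt(np.log(2.0 * len(q_hat) * horizon / delta) / (2.0 * t))
        upper = q_hat + width
        lower = q_hat - width
        best_lower = lower[active].max()
        return active & (upper >= best_lower)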
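
Finally, the two Q-learning updates written as plain Python functions, called once per observed transition (X, A, R, Y) just as the pseudocode above is. The tabular Q as a 2-D array and the feature-function interface phi(state, action) are my own choices for the sketch, not the slides'.

    import numpy as np

    # Illustrative sketch, not code from the slides.

    def q_learning_update(Q, x, a, r, y, alpha, gamma):
        """Tabular update; Q is a 2-D NumPy array indexed [state, action]."""
        td_error = r + gamma * Q[y].max() - Q[x, a]   # delta = R + gamma * max_a Q[Y, a] - Q[X, A]
        Q[x, a] += alpha * td_error                   # Q[X, A] <- Q[X, A] + alpha * delta
        return Q

    def linear_q_learning_update(theta, phi, actions, x, a, r, y, alpha, gamma):
        """Linear approximation Q_theta(x, a) = theta . phi(x, a); note that phi(x, a)
        is also the gradient of Q_theta(x, a) with respect to theta."""
        q_next = max(theta @ phi(y, b) for b in actions)
        td_error = r + gamma * q_next - theta @ phi(x, a)
        theta = theta + alpha * td_error * phi(x, a)
        return theta

Behaving with, say, the ε-greedy policy sketched earlier while applying q_learning_update to every transition gives the off-policy learner the slide describes: the sampling policy is arbitrary, as long as it keeps visiting every state-action pair.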