Advanced Artificial Intelligence, Chapter 10: Reinforcement Learning
Shi Zhongzhi, Institute of Computing Technology, Chinese Academy of Sciences

Outline
Introduction; the reinforcement learning model; dynamic programming; Monte Carlo methods; temporal-difference learning; Q-learning; function approximation in reinforcement learning; applications.

Introduction
Humans typically learn by interacting with the external environment. Reinforcement learning is the learning of a mapping from environment states to actions, with the aim of maximizing the cumulative reward the system receives from the environment. In reinforcement learning we design algorithms that convert the situation presented by the environment into actions that maximize the amount of reward. The agent is not told directly what to do or which action to take; it discovers for itself which actions yield the most reward by trying them. An action influences not only the immediate reward but also the situations that follow and, through them, all later rewards. Trial-and-error search and delayed reinforcement are the two most important distinguishing features of reinforcement learning.

Reinforcement learning techniques grew out of control theory, statistics, psychology, and related disciplines, and can be traced back to Pavlov's conditioned-reflex experiments. Only in the late 1980s and early 1990s, however, did reinforcement learning become widely studied and applied in artificial intelligence, machine learning, and automatic control, where it is now regarded as one of the core techniques for designing intelligent systems. After breakthroughs in its mathematical foundations in particular, research on and applications of reinforcement learning expanded rapidly, and it is now one of the most active topics in machine learning.

The idea of reinforcement originated in psychology. In 1911 Thorndike formulated the Law of Effect: in a given situation, a behavior that makes an animal feel satisfied becomes more strongly associated with that situation (is reinforced), so that when the situation recurs the behavior is more likely to recur; conversely, a behavior that makes the animal feel uncomfortable becomes more weakly associated with the situation and is less likely to recur when it reappears. In other words, which behavior is "remembered" and linked to a stimulus depends on the effect the behavior produces. Animal trial-and-error learning thus has two aspects, selectional and associative, corresponding computationally to search and memory. In 1954 Minsky implemented computational trial-and-error learning in his doctoral dissertation, and in the same year Farley and Clark also studied it computationally. The term "reinforcement learning" first appeared in the technical literature in Minsky's 1961 paper "Steps Toward Artificial Intelligence" and has been in wide use ever since. In 1969 Minsky received the Turing Award for his contributions to artificial intelligence.

From 1953 to 1957 Bellman developed dynamic programming, an effective method for solving optimal control problems. In 1957 Bellman also introduced the stochastic, discrete-time version of the optimal control problem, the well-known Markov decision process (MDP), and in 1960 Howard proposed the policy iteration method for MDPs; together these form the theoretical foundation of modern reinforcement learning. In 1972 Klopf combined trial-and-error learning with temporal differences, and from 1978 onward Sutton, Barto, Moore, and Klopf himself studied this combination in depth. In 1989 Watkins proposed Q-learning [Watkins 1989], which tied the three main threads of reinforcement learning together. In 1992 Tesauro successfully applied reinforcement learning to backgammon, producing TD-Gammon.
The Reinforcement Learning Model
[Figure: the agent-environment interaction loop — at each step the agent receives state si and reward ri from the environment and emits action ai; the environment returns the next state si+1 and reward ri+1. Legend: i = input, r = reward, s = state, a = action.]

Describing an Environment (Problem)
Accessible vs. inaccessible; deterministic vs. non-deterministic; episodic vs. non-episodic; static vs. dynamic; discrete vs. continuous. The most complex general class of environments is inaccessible, non-deterministic, non-episodic, dynamic, and continuous.
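To make the interaction loop in the figure concrete, here is a minimal Python sketch. The Environment interface (reset/step) and the random agent are illustrative assumptions, not part of the original slides.

    import random

    class Environment:
        """A tiny two-state environment used only to illustrate the loop."""
        def reset(self):
            self.state = 0
            return self.state

        def step(self, action):
            # Invented dynamics: action 1 moves toward the goal state 1.
            self.state = 1 if action == 1 else 0
            reward = 1.0 if self.state == 1 else 0.0
            done = self.state == 1
            return self.state, reward, done

    def run_episode(env, policy, max_steps=10):
        s = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):
            a = policy(s)                # agent: state -> action
            s, r, done = env.step(a)     # environment: action -> next state, reward
            total_reward += r
            if done:
                break
        return total_reward

    random_policy = lambda s: random.choice([0, 1])
    print(run_episode(Environment(), random_policy))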
The Reinforcement Learning Problem
Agent-environment interaction: states, actions, rewards. To define a finite MDP we need the state and action sets S and A, the one-step dynamics given by the transition probabilities (the Markov property)
    P(s′|s,a) = Pr{ s_{t+1} = s′ | s_t = s, a_t = a },
and the expected one-step rewards
    R(s,a,s′) = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }.

Comparison with Supervised Learning
Reinforcement learning learns from interaction: the system learns from its own experience, and the objective is to obtain as much reward as possible. The learner is not told which actions to take; it must instead discover which actions yield the most reward by trying them. The training information available to the RL system is an evaluation ("reward" / "penalty") of the outputs ("actions") it produces for its inputs. Supervised learning, in contrast, learns from examples provided by a knowledgeable external supervisor.

Elements of Reinforcement Learning
Policy: a stochastic rule for selecting actions (is unknown). Return/Reward: the function of future rewards the agent tries to maximize (is my goal). Value: what is good in the long run because it predicts reward (is what I can get). Model: what follows what (is my method).

The Bellman Equation under a Policy π
The basic idea is the discounted return R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …, where γ is the discount rate, so that R_t = r_{t+1} + γ R_{t+1}. Taking expectations under π gives
    V^π(s) = E_π[ R_t | s_t = s ] = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ],
or, written without the expectation operator,
    V^π(s) = Σ_a π(a|s) Σ_{s′} P(s′|s,a) [ R(s,a,s′) + γ V^π(s′) ].
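As a quick numerical check of the recursion R_t = r_{t+1} + γ R_{t+1} used above, the following short Python sketch (reward sequence and γ chosen arbitrarily here) computes a discounted return both as a direct sum and recursively and verifies that the two agree.

    def discounted_return(rewards, gamma):
        """Direct sum: R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    def discounted_return_recursive(rewards, gamma):
        """Recursive form: R_t = r_{t+1} + gamma * R_{t+1}."""
        if not rewards:
            return 0.0
        return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

    rewards = [1.0, 0.0, 2.0, 3.0]   # r_{t+1}, r_{t+2}, ... (invented)
    gamma = 0.9
    assert abs(discounted_return(rewards, gamma)
               - discounted_return_recursive(rewards, gamma)) < 1e-12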
The Bellman Optimality Equation
    V*(s) = max_a Σ_{s′} P(s,a,s′) [ R(s,a,s′) + γ V*(s′) ],
where V* is the optimal state-value function, S the set of environment states, R the reward function, P the state-transition probability function, and γ the discount factor.

Markov Decision Process (MDP)
An MDP is defined by a four-tuple ?S, A, R, P?: the set of environment states S; the set of system actions A; the reward function R: S×A→?; and the state-transition function P: S×A→PD(S). R(s,a,s′) denotes the immediate reward obtained when the system takes action a in state s and the environment moves to state s′; P(s,a,s′) denotes the probability that taking action a in state s moves the environment to state s′.

The essence of a Markov decision process is that the probability of moving from the current state to the next state, and the reward received, depend only on the current state and the chosen action, not on the history of states and actions.
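A concrete way to hold the four-tuple ?S, A, R, P? is as NumPy arrays indexed by (state, action, next state). The little two-state, two-action MDP below is an invented example for illustration only; its numbers are not taken from the slides.

    import numpy as np

    n_states, n_actions = 2, 2
    # P[s, a, s'] = probability of moving from s to s' under action a
    P = np.zeros((n_states, n_actions, n_states))
    P[0, 0] = [0.9, 0.1]    # action 0 in state 0 mostly stays
    P[0, 1] = [0.2, 0.8]    # action 1 in state 0 mostly moves to state 1
    P[1, 0] = [1.0, 0.0]
    P[1, 1] = [0.0, 1.0]
    # R[s, a, s'] = immediate reward for the transition (s, a, s')
    R = np.zeros((n_states, n_actions, n_states))
    R[:, :, 1] = 1.0        # reaching state 1 pays +1
    gamma = 0.95

    assert np.allclose(P.sum(axis=2), 1.0)   # each (s, a) row is a distribution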
Given full knowledge of the environment model — the transition-probability function P and the reward function R — dynamic programming can be used to solve for the optimal policy. Reinforcement learning, in contrast, studies how the system can learn an optimal policy when P and R are unknown.

Characteristics of an MDP: a set of states S; a set of actions A; a reward function R: S×A→?; and a state-transition function T: S×A→Π(S), where T(s,a,s′) is the probability of transitioning from s to s′ using action a.

MDP Example
[Figure: a small MDP showing the transition function, the states and rewards, and the Bellman equation with greedy policy selection.]

MDP Graphical Representation
[Figure: state-transition graph with edge labels β, α: T(s, action, s′); note the similarity to hidden Markov models (HMMs).]
Reinforcement Learning …
Transitions may be deterministic or stochastic; in the stochastic case M^a_{ij} is the probability of reaching state j when taking action a in state i. A simple grid-world environment presents the agent with a sequential decision problem, with a move cost of 0.04 per step. This raises the (temporal) credit-assignment problem — a sparse-reinforcement problem. An offline algorithm determines the action sequence ex ante; an online algorithm conditions the action sequence on the observations made along the way, which is important in stochastic environments (e.g., flying a jet).

Reinforcement Learning …
The transition model M gives probability 0.8 of moving in the direction you want to go and 0.2 of slipping perpendicular to it. A policy is a mapping from states to actions. [Figure: an optimal policy for the stochastic grid-world environment and the corresponding utilities of the states.] Markov property: the transition probabilities depend only on the current state, not on the path taken to reach it; this gives a Markov decision problem (MDP). In a partially observable MDP (POMDP), the percepts do not carry enough information to identify the transition probabilities.
Dynamic Programming
Dynamic-programming methods compute the value function by backing up values from successor states to predecessor states; they update states one after another based on a model of the distribution of next states. Dynamic programming in reinforcement learning rests on the fact that, for any policy π and any state s, the following consistency equation holds iteratively:
    V^π(s) = Σ_a π(a|s) Σ_{s′} P(s→s′|a) [ R(s,a,s′) + γ V^π(s′) ],
where π(a|s) is the probability of action a in state s under the stochastic policy π, and P(s→s′|a) is the probability that action a moves state s to state s′. This is the Bellman (1957) equation for V^π.
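A standard way to solve this consistency equation numerically is iterative policy evaluation: sweep over the states, replacing V(s) by the right-hand side until the values stop changing. The sketch below assumes the arrays P, R, gamma from the earlier example and a policy given as probabilities pi[s, a]; these names are illustrative.

    import numpy as np

    def policy_evaluation(P, R, pi, gamma, theta=1e-8):
        """Iteratively apply the Bellman equation for V^pi until convergence."""
        n_states, n_actions, _ = P.shape
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(pi[s, a] * P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                        for a in range(n_actions) for s2 in range(n_states))
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                return V

    # Example: evaluate the uniformly random policy on the MDP defined earlier.
    # pi = np.full((2, 2), 0.5); V = policy_evaluation(P, R, pi, gamma)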
Dynamic Programming — Problem
A discrete-time dynamic system with states {1, …, n} plus a termination state 0; controls U(i); transition probabilities p_ij(u); an accumulative cost structure; and policies. One distinguishes the finite-horizon problem from the infinite-horizon problem, which can be solved by value iteration.

Policy Iteration / Value Iteration in Dynamic Programming
[Figure: the policy-iteration and value-iteration update schemes.]

Adaptive Dynamic Programming (ADP)
Idea: use the constraints (state-transition probabilities) between states to speed learning. Solve
    V^π(s) = Σ_{s′} P(s,π(s),s′) [ R(s,π(s),s′) + γ V^π(s′) ]
using dynamic programming — this is value determination. There is no maximization over actions because the agent is passive, unlike in value iteration. The drawback is the large state space: backgammon, for example, gives about 10^50 equations in 10^50 variables.

Value Iteration Algorithm
Iterate
    V(s) ← max_a Σ_{s′} P(s,a,s′) [ R(s,a,s′) + γ V(s′) ].
An alternative iteration works directly on action values (Singh, 1993), which is important for model-free learning:
    Q(s,a) ← Σ_{s′} P(s,a,s′) [ R(s,a,s′) + γ max_{a′} Q(s′,a′) ].
Stop the iteration when V(s) changes by less than ε; the policy difference ratio is then ≤ 2εγ / (1 − γ) (Williams & Baird, 1993b).
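A minimal tabular value-iteration sketch over the (P, R, gamma) arrays introduced earlier; the stopping threshold epsilon and the greedy-policy extraction are standard choices written here as assumptions, not transcriptions of the slide.

    import numpy as np

    def value_iteration(P, R, gamma, epsilon=1e-6):
        """Sweeps of V(s) <- max_a sum_s' P[s,a,s'] * (R[s,a,s'] + gamma*V[s'])."""
        n_states = P.shape[0]
        V = np.zeros(n_states)
        while True:
            Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)   # Q[s, a]
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < epsilon:
                greedy_policy = Q.argmax(axis=1)                   # greedy selection
                return V_new, greedy_policy
            V = V_new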
Policy Iteration Algorithm
Alternate policy evaluation with greedy policy improvement. Policies converge faster than values. Why the faster convergence? The greedy policy typically stops changing well before the value estimates themselves have converged.
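A compact policy-iteration sketch, reusing the policy_evaluation routine above (here specialized to a deterministic policy). The structure — evaluate, then greedify, until the policy is stable — is the standard scheme; the helper names are mine.

    import numpy as np

    def policy_iteration(P, R, gamma):
        n_states, n_actions, _ = P.shape
        policy = np.zeros(n_states, dtype=int)        # start with an arbitrary policy
        while True:
            # Policy evaluation for the current deterministic policy.
            pi = np.zeros((n_states, n_actions))
            pi[np.arange(n_states), policy] = 1.0
            V = policy_evaluation(P, R, pi, gamma)
            # Greedy policy improvement.
            Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
            new_policy = Q.argmax(axis=1)
            if np.array_equal(new_policy, policy):
                return policy, V
            policy = new_policy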
Dynamic Programming
Classical dynamic-programming models are of limited use, because for many problems it is hard to supply a complete model of the environment. Simulated robot soccer is such a problem; it can be handled with real-time dynamic programming, which does not require an environment model in advance but instead builds one by continually testing in the real environment. A back-propagation neural network can be used to generalize over states: the network's input units encode the environment state s, and its output is the evaluation V(s) of that state.

Model-Free Methods
The models of the environment are T: S×A→Π(S) and R: S×A→?. Do we know them? Do we have to know them? Model-free approaches include Monte Carlo methods, the adaptive heuristic critic, and Q-learning.

Monte Carlo Methods
Monte Carlo methods do not require a complete model. Instead they sample entire trajectories of states and update the value function based on the final outcome of each sample. Monte Carlo methods require only experience — sampled sequences of states, actions, and rewards from online or simulated interaction with an environment. Learning from online experience is attractive because it requires no prior knowledge of the environment yet can still reach optimal behavior. Learning from simulated experience is also powerful: it needs a model, but the model only has to generate trajectories, not compute explicit probabilities — so, unlike dynamic programming, Monte Carlo does not need the complete probability distribution over all possible transitions.
Monte Carlo Methods
[Figure: Monte Carlo method (diagram).]
Idea: keep statistics of the returns observed for each state and take their average — this is V(s). The method is based only on experience and assumes episodic tasks (experience is divided into episodes, and every episode terminates regardless of the actions selected). It is incremental in an episode-by-episode sense, not step by step.
Monte Carlo Policy Evaluation
Goal: learn V^π(s) when P and R are unknown in advance. Given: some number of episodes generated under π that contain s. Idea: average the returns observed after visits to s. Every-visit MC averages the returns following every visit to s in an episode; first-visit MC averages the returns following only the first visit to s in each episode. Both converge asymptotically.
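A first-visit Monte Carlo evaluation sketch; episodes are assumed to be given as lists of (state, reward) pairs, and the variable names are illustrative assumptions.

    from collections import defaultdict

    def first_visit_mc(episodes, gamma):
        """episodes: iterable of trajectories [(s_t, r_{t+1}), ...] generated under pi."""
        returns_sum = defaultdict(float)
        returns_count = defaultdict(int)
        for episode in episodes:
            # Return G[t] following each time step, computed backwards.
            G = [0.0] * (len(episode) + 1)
            for t in reversed(range(len(episode))):
                s, r = episode[t]
                G[t] = r + gamma * G[t + 1]
            seen = set()
            for t, (s, _) in enumerate(episode):
                if s not in seen:                # first visit to s only
                    seen.add(s)
                    returns_sum[s] += G[t]
                    returns_count[s] += 1
        return {s: returns_sum[s] / returns_count[s] for s in returns_sum}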
Monte Carlo Methods — Maintaining Exploration
Problem: unvisited state-action pairs (the problem of maintaining exploration). For every state-action pair (s,a), make sure that P((s,a) is selected as the starting state and action) > 0 — the assumption of exploring starts.

Monte Carlo Control
How to select policies (similar to policy evaluation): MC policy iteration alternates policy evaluation using MC methods with policy improvement, where the improvement step greedifies with respect to the value (or action-value) function.
Temporal-Difference Learning
Temporal-difference (TD) learning uses no model of the environment; it learns from experience and updates at every step, without waiting for the task to finish. As a control algorithm based on a predictive model, it uses past information to predict future inputs and outputs, emphasizing the function of the model rather than its structure. Like Monte Carlo methods, TD methods sample the immediate rewards obtained during a learning episode; but, like dynamic programming, they bootstrap, using current estimates of the state-value function in the update. Repeated iterations then drive the estimates toward the true state-value function.

Temporal-Difference Learning
Constant-α Monte Carlo uses the actual return after time t as its target:
    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ],
whereas the simplest TD method, TD(0), uses an estimate of the return as its target:
    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ].

Temporal-Difference Learning (TD)
Idea: do ADP-style backups on a per-move basis, not for the whole state space. Theorem: the average value of U(i) converges to the correct value. Theorem: if α is decreased appropriately as a function of the number of times a state is visited (α = α[N[i]]), then U(i) itself converges to the correct value.

TD(λ) — A Forward View
TD(λ) averages all n-step backups, weighting the n-step backup by λ^{n−1} (time since visitation). The λ-return is
    R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)},
and the backup uses the λ-return as its target:
    V(s_t) ← V(s_t) + α [ R_t^λ − V(s_t) ].

TD(λ) Learning Algorithm
Idea: update from the whole epoch, not just on one state transition. Special cases: λ = 1 gives least-mean-square (LMS), i.e., Monte Carlo; λ = 0 gives TD(0). An intermediate choice of λ (between 0 and 1) is best; there is an interplay with α.

Algorithm 10.1: the TD(0) learning algorithm
    Initialize V(s) arbitrarily, and let π be the policy to be evaluated
    Repeat (for each episode)
        Initialize s
        Repeat (for each step of the episode)
            Choose a from s using the policy π derived from V (e.g., ε-greedy)
            Take action a, observe r, s′
            V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]
            s ← s′
        until s is terminal
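A tabular TD(0) sketch following Algorithm 10.1; the env.reset()/env.step(a) interface and the fixed policy function are assumptions used to make the loop runnable in outline.

    from collections import defaultdict

    def td0_evaluation(env, policy, n_episodes, alpha=0.1, gamma=0.95):
        """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s))."""
        V = defaultdict(float)
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])
                s = s_next
        return V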
Convergence of TD(λ)
Theorem: TD(λ) converges with probability 1 under certain boundary conditions, provided the step size α_i(t) is decreased so that Σ_t α_i(t) = ∞ and Σ_t α_i(t)² < ∞. In practice a fixed α is often used for all i and t.

Q-learning
Watkins, 1989. In Q-learning the backup starts from an action node and maximizes over all possible actions of the next state together with their rewards. In the fully recursive definition of Q-learning, the leaves of the backup tree are all the terminal nodes reachable by a sequence of actions that starts from the root action and follows their successor actions and rewards. Online Q-learning expands forward from the possible actions and does not require building a complete world model; Q-learning can also be run offline. As we shall see, Q-learning is a temporal-difference method.

Q-learning
In Q-learning, Q is a function from state-action pairs to learned values; for all states and actions,
    Q: (state × action) → value.
One step of Q-learning updates
    Q(s_t, a_t) ← Q(s_t, a_t) + c [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ],    (10.15)
where c and γ are both ≤ 1 and r_{t+1} is the reward received on reaching state s_{t+1}.

Q-Learning with Function Approximation
Estimate the Q-function with some approximator (for example, linear regression, neural networks, or decision trees). Derive the estimated policy as an argmax of the estimated Q-function. Allow different parameter vectors at different time points. The algorithm is illustrated with linear regression as the approximator and, naturally, squared error as the loss function.

Exploration
Seek a trade-off between exploitation (control) and exploration (identification). The extremes are greedy versus random acting (the n-armed bandit models). Q-learning converges to the optimal Q-values if (1) every state can be visited infinitely often (thanks to exploration), (2) the action selection becomes greedy as time approaches infinity, and (3) the learning rate α is decreased quickly enough, but not too quickly (as discussed for TD learning).

Common Exploration Methods
In value iteration in an ADP agent: use an optimistic estimate of utility, U+(i). The ε-greedy method: take a non-greedy (exploratory) action with probability ε and the greedy action otherwise. Boltzmann exploration: choose actions with probabilities proportional to exp(Q(s,a)/T) for a temperature T.
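Sketches of the two action-selection rules just mentioned; Q is assumed to be a 2-D array of action values, and the epsilon/temperature defaults are arbitrary illustrative values.

    import numpy as np

    def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
        """With probability epsilon pick a random action, otherwise the greedy one."""
        n_actions = Q.shape[1]
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    def boltzmann(Q, s, temperature=1.0, rng=np.random.default_rng()):
        """Sample an action with probability proportional to exp(Q(s,a)/T)."""
        prefs = Q[s] / temperature
        prefs = prefs - prefs.max()               # numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(probs), p=probs))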
Q-Learning Algorithm
    Initialize Q(s,a) arbitrarily
    Repeat (for each episode)
        Initialize s
        Repeat (for each step of the episode)
            Choose a from s using a policy derived from Q (e.g., ε-greedy)
            Take action a, observe r, s′
            Q(s,a) ← Q(s,a) + α [ r + γ max_{a′} Q(s′,a′) − Q(s,a) ]
            s ← s′
        until s is terminal

Q-Learning with Function Approximation — Algorithm
[Slide of formulas: at each time point the parameter vector of the Q-approximator is fitted by least squares, and the estimated policy satisfies the argmax of the estimated Q-function.]
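A tabular Q-learning sketch matching the boxed algorithm above; it reuses the assumed env.reset()/env.step() interface and the epsilon_greedy helper from the exploration section.

    import numpy as np

    def q_learning(env, n_states, n_actions, n_episodes,
                   alpha=0.1, gamma=0.95, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))
        rng = np.random.default_rng()
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a = epsilon_greedy(Q, s, epsilon, rng)
                s_next, r, done = env.step(a)
                target = r + (0.0 if done else gamma * Q[s_next].max())
                Q[s, a] += alpha * (target - Q[s, a])      # off-policy TD update
                s = s_next
        return Q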
What Is the Intuition?
The Bellman equation gives Q*(s,a) = E[ r_{t+1} + γ max_{a′} Q*(s_{t+1},a′) | s_t = s, a_t = a ]. If the approximator were flexible enough and the training set were infinite, then Q-learning would minimize the mean squared Bellman residual E[ ( r + γ max_{a′} Q(s′,a′) − Q(s,a) )² ], which is equivalent to minimizing the error with respect to Q*.

A-Learning
Murphy (2003) and Robins (2004). Estimate the A-function (the advantages) with some approximator, as in Q-learning. Derive the estimated policy as an argmax of the estimated A-function. Allow different parameter vectors at different time points. The algorithm is again illustrated with linear regression as the approximator and squared error as the loss function.

A-Learning Algorithm (Inefficient Version)
[Slide of formulas: for each time point the advantage regression is fitted, and the estimated policy satisfies the argmax of the estimated A-function.]

Differences between Q-Learning and A-Learning
Q-learning: at time t we model the main effects of the history (S_t, A_{t−1}) and of the action A_t and their interaction; our Y_{t−1} is affected by how we modeled the main effect of the history at time t, (S_t, A_{t−1}). A-learning: at time t we model only the effects of A_t and its interaction with (S_t, A_{t−1}); our Y_{t−1} does not depend on a model of the main effect of the history at time t, (S_t, A_{t−1}).
Q-Learning vs. A-Learning
Their relative advantages and disadvantages are not yet fully understood. Q-learning has lower variance but higher bias; A-learning has higher variance but lower bias. Comparing Q-learning with A-learning therefore involves a variance-bias trade-off.

High-Level Robot Behavior Control Using Partially Observable Markov Decision Processes
[Figure: the user, world, and robot together form the underlying state; the robot receives observations, maintains a belief state, and issues actions.]

POMDP: Partially Observable Markov Decision Processes
Rather than observing the state itself, we observe some function of the state: Ob, the observable function, is a random variable for each state. Problem: different states may look similar, so the optimal strategy might need to take the history into account.
Framework of POMDP
A POMDP is defined by a six-tuple ?S, A, R, P, Ω, О?, where ?S, A, R, P? defines the underlying Markov decision model of the environment, Ω is the set of observations, i.e., the world states the system can perceive[1], and О: S×A→PD(Ω) is the observation function. When the system takes action a and the environment moves to state s′, the observation function О determines a probability distribution over the possible observations, written О(s′, a, o).
[1] Ω may be a subset of S, or it may be unrelated to S.

POMDPs
What if the state information coming from the sensors is noisy? This is usually the case! MDP techniques are then suboptimal: two hallways that look alike are not the same state.

POMDPs — A Solution Strategy
SE: a belief-state estimator (which can be based on an HMM); Π: MDP techniques applied to the belief state.

POMDP: The Belief-State Approach
Idea: given the history of actions and observed values, compute a posterior distribution over the states we might be in — the belief state — and solve the resulting belief-state MDP, whose states are distributions over S (the states of the POMDP), whose actions are those of the POMDP, and whose transitions are given by the posterior distribution after each observation. Open problem: how to deal with the continuous distribution?

The Learning Process of the Belief MDP
[Figure: the cycle of belief-state estimation and policy execution.]
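The belief state mentioned above is maintained by Bayesian filtering. Below is a small sketch of the update b′(s′) ∝ О(s′,a,o) Σ_s P(s,a,s′) b(s), using the same array conventions as before; the observation-model array O[s′, a, o] is an assumption introduced here for illustration.

    import numpy as np

    def belief_update(b, a, o, P, O):
        """One step of the belief-state estimator SE.

        b: current belief over states, shape (n_states,)
        P[s, a, s']: transition probabilities; O[s', a, o]: observation probabilities.
        """
        predicted = b @ P[:, a, :]              # sum_s P(s,a,s') b(s)
        new_belief = O[:, a, o] * predicted     # weight by observation likelihood
        return new_belief / new_belief.sum()    # normalize to a distribution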
Major Methods for Solving POMDPs
[Table: overview of POMDP solution methods.]

Function Approximation in Reinforcement Learning
Generalize the value function to the entire state space. The update composes two operators: T, the TD operator, and M, the function-approximation operator, so that successive estimates have the form V_{k+1} = M T V_k.

Two Parallel Iterative Processes
The value-function iteration process and the value-function approximation process run in parallel. How should the M function be constructed — using state clustering, interpolation, a decision tree, or a neural network?

Function approximator: V(s) = f(s, w). Update by gradient-descent Sarsa:
    w ← w + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_w Q(s_t, a_t).
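A gradient-descent (semi-gradient) Sarsa sketch with a linear approximator Q(s,a) = w·φ(s,a), following the update rule above; the feature function phi, the environment interface, and all parameter values are assumptions for illustration.

    import numpy as np

    def gradient_sarsa(env, phi, n_features, n_actions, n_episodes,
                       alpha=0.01, gamma=0.95, epsilon=0.1):
        """Linear Q(s,a) = w . phi(s,a); for a linear approximator the gradient is phi(s,a)."""
        w = np.zeros(n_features)
        rng = np.random.default_rng()
        q = lambda s, a: w @ phi(s, a)

        def select(s):                                 # epsilon-greedy action selection
            if rng.random() < epsilon:
                return int(rng.integers(n_actions))
            return int(np.argmax([q(s, a) for a in range(n_actions)]))

        for _ in range(n_episodes):
            s = env.reset()
            a = select(s)
            done = False
            while not done:
                s2, r, done = env.step(a)
                if done:
                    td_error = r - q(s, a)
                else:
                    a2 = select(s2)
                    td_error = r + gamma * q(s2, a2) - q(s, a)
                w += alpha * td_error * phi(s, a)      # gradient of linear Q is phi(s,a)
                if not done:
                    s, a = s2, a2
        return w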