Q-Learning By Examples

In this tutorial, you will discover step by step how an agent learns through training without a teacher (unsupervised) in an unknown environment. You will learn about a part of the reinforcement learning family of algorithms called Q-learning. Reinforcement learning algorithms have been widely used in applications such as robotics, multi-agent systems, and games.

Instead of covering the theory of reinforcement learning, which you can read in many books and on other web sites (see Resources for more references), this tutorial introduces the concept through a simple but comprehensive numerical example. You may also download the Matlab code or MS Excel spreadsheet for free.

Suppose we have 5 rooms in a building connected by certain doors, as shown in the figure below. We name the rooms A to E. We can consider the outside of the building as one big room that surrounds the building, and name it F. Notice that two doors lead into the building from F: one through room B and one through room E.

We can represent the rooms by a graph, with each room as a vertex (or node) and each door as an edge (or link). Refer to my other tutorial on Graph if you are not sure what a graph is.

We want to set the target room. If we put an agent in any room, we want the agent to go outside the building. In other words, the goal room is node F. To set this kind of goal, we give a reward value to each door (i.e. each edge of the graph). The doors that lead immediately to the goal have an instant reward of 100 (see the diagram below, where they have red arrows). Other doors that do not connect directly to the target room have zero reward. Because each door is two-way (from A you can go to E, and from E you can go back to A), we assign two arrows to each room of the previous graph. Each arrow carries an instant reward value. The graph then becomes the state diagram shown below.

An additional loop with the highest reward (100) is given to the goal room (F back to F) so that if the agent arrives at the goal, it will remain there forever. This type of goal is called an absorbing goal because once the agent reaches the goal state, it stays in the goal state.

Ladies and gentlemen, now is the time to introduce our superstar agent…

Imagine our agent as a dumb virtual robot that can learn through experience. The agent can pass from one room to another but has no knowledge of the environment. It does not know which sequence of doors it must pass through to get outside the building.

Suppose we want to model some kind of simple evacuation of an agent from any room in the building. Now suppose we have an agent in room C and we want the agent to learn to reach the outside of the house (F). (See the diagram below.)

How do we make our agent learn from experience?

Before we discuss how the agent will learn (using Q-learning) in the next section, let us discuss the terminology of state and action.

We call each room (including the outside of the building) a state. The agent's movement from one room to another is called an action. Let us redraw our state diagram. A state is depicted as a node in the state diagram, while an action is represented by an arrow.

Suppose now the agent is in state C.
From state C, the agent can go to state D because state C is connected to D. From state C, however, the agent cannot go directly to state B because there is no door directly connecting rooms B and C (thus, no arrow). From state D, the agent can go to state B, to state E, or back to state C (look at the arrows out of state D). If the agent is in state E, the three possible actions are to go to state A, state F, or state D. If the agent is in state B, it can go either to state D or to state F. From state A, it can only go back to state E.

We can put the state diagram and the instant reward values into the following reward table, or matrix R.

The minus sign in the table means that the row state has no action leading to the column state. For example, state A cannot go to state B (because there is no door connecting rooms A and B, remember?).

In the previous sections of this tutorial, we modeled the environment and the reward system for our agent. This section describes a learning algorithm called Q-learning (a simplified form of reinforcement learning).

We have modeled the environment's reward system as matrix R.

Now we need to put a similar matrix, named Q, in the brain of our agent to represent the memory of what the agent has learned through many experiences. The rows of matrix Q represent the current state of the agent, and the columns of matrix Q point to the action of going to the next state.

In the beginning, we say that the agent knows nothing, so we set Q to the zero matrix. In this example, for simplicity of explanation, we assume the number of states is known (six). In the more general case, you can start with a zero matrix of a single cell. It is a simple task to add more columns and rows to the Q matrix when a new state is found.

The transition rule of Q-learning is a very simple formula:

Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]

The formula above means that an entry in matrix Q (where the row represents the state and the column represents the action) is equal to the corresponding entry of matrix R plus the learning parameter γ multiplied by the maximum value of Q over all actions in the next state.
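
As a rough illustration of the reward table and the transition rule above, the following Python sketch builds R from the doors described in the text and applies a single update. The names (`DOORS`, `NO_DOOR`, `available_actions`, `q_update`), the use of -1 to stand in for the minus sign, and the value 0.8 for the learning parameter are assumptions made for this example; they are not taken from the tutorial's Matlab code.

```python
# States in row/column order; F is the outside of the building (the goal).
STATES = ["A", "B", "C", "D", "E", "F"]
GOAL = "F"
NO_DOOR = -1  # assumption: stands in for the minus sign in the reward table

# Two-way doors as described in the text.
DOORS = [("A", "E"), ("B", "D"), ("B", "F"), ("C", "D"), ("D", "E"), ("E", "F")]


def build_reward_matrix():
    """Return R as nested dicts, so that R[state][next_state] is a reward."""
    R = {s: {t: NO_DOOR for t in STATES} for s in STATES}
    for a, b in DOORS:
        for src, dst in ((a, b), (b, a)):
            # A door leading directly to the goal earns 100, any other door 0.
            R[src][dst] = 100 if dst == GOAL else 0
    R[GOAL][GOAL] = 100  # absorbing loop F -> F
    return R


def available_actions(R, state):
    """List the states reachable from `state` through one door."""
    return [t for t in STATES if R[state][t] != NO_DOOR]


def q_update(Q, R, state, action, gamma):
    """Apply Q(state, action) = R(state, action) + gamma * max Q(next state, .)."""
    next_state = action  # the action "go to X" moves the agent into X
    best_next = max(Q[next_state][a] for a in available_actions(R, next_state))
    Q[state][action] = R[state][action] + gamma * best_next
    return Q[state][action]


R = build_reward_matrix()
Q = {s: {t: 0.0 for t in STATES} for s in STATES}  # the agent knows nothing yet
value = q_update(Q, R, "B", "F", gamma=0.8)  # gamma = 0.8 is an assumed value
print(available_actions(R, "D"))
print(value)
```

Here the update for moving from B to F yields 100, because Q is still all zeros and the door from B to F carries the instant reward of 100. The maximum is taken over the doors that exist in the next state; since every entry of Q starts at zero and no reward is negative, this matches taking the maximum over all actions.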

Our virtual agent will learn through experience without a teacher (this is called unsupervised learning). The agent explores from state to state until it reaches the goal. We call each exploration an episode. In one episode the agent moves from the initial state to the goal state. Once the agent arrives at the goal state, the program goes on to the next episode. The algorithm below has been proved to converge (see References for the proof).
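
As a hedged illustration of the episode loop just described (not the author's original listing), the sketch below trains Q on the six-room example and then follows the largest Q values from room C. The random choice of initial state and action, the episode count, the seed, the value 0.8 for the learning parameter, and all function names are assumptions made for this example.

```python
import random

STATES = ["A", "B", "C", "D", "E", "F"]
GOAL = "F"

# Rooms reachable from each room, following the doors described in the text.
NEIGHBORS = {
    "A": ["E"],
    "B": ["D", "F"],
    "C": ["D"],
    "D": ["B", "C", "E"],
    "E": ["A", "D", "F"],
    "F": ["B", "E", "F"],
}


def reward(next_state):
    """Instant reward: 100 for a door into the goal, 0 for any other door."""
    return 100 if next_state == GOAL else 0


def train(episodes=500, gamma=0.8, seed=0):
    """Run the episodes and return the learned Q as nested dicts."""
    rng = random.Random(seed)
    Q = {s: {t: 0.0 for t in NEIGHBORS[s]} for s in STATES}
    for _ in range(episodes):
        state = rng.choice(STATES)  # assumed: random initial state
        while state != GOAL:
            action = rng.choice(NEIGHBORS[state])  # assumed: random exploration
            best_next = max(Q[action].values())
            Q[state][action] = reward(action) + gamma * best_next
            state = action  # the action "go to X" moves the agent into X
    return Q


def greedy_path(Q, start, max_steps=10):
    """Follow the highest Q value from `start` until the goal is reached."""
    path = [start]
    while path[-1] != GOAL and len(path) <= max_steps:
        options = Q[path[-1]]
        path.append(max(options, key=options.get))
    return path


Q = train()
print(greedy_path(Q, "C"))
```

Because each episode in this sketch stops as soon as F is reached, the row of Q for state F is never updated; whether to keep updating at the goal is a design choice that changes the magnitude of the learned values but not the route the agent prefers here.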