POMDP

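License View license
Development language
Category Application tools, scientific computing tools
Software type Open source
Region Unknown
Submitted by 赫连睿
Operating system Cross-platform
Open-source organization
Intended audience Unknown
 Software overview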

POMDP

Implementing a reinforcement learning algorithm based upon a partially observable Markov decision process.

The task

Here the agent will be presented with a two-alternative forced decision task. Over a number of trials the agent will be able to choose and then perform an action based upon a given stimulus. The stimulus values range from -0.5 to 0.5. When the stimulus value is less than 0, the agent should choose Left to make a correct decision, and when the stimulus value is greater than 0 the agent should choose Right to make a correct decision. If the stimulus value is 0, the correct decision is randomly assigned to be either left or right for the given trial, and the agent will be rewarded accordingly.

The agent is rewarded asymmetrically. In some trials it receives an additional reward for a correct left action; in the remaining trials it receives an additional reward for a correct right action. Trials are presented to the agent in blocks, and the current block determines which side carries the additional reward.
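The task rules above can be sketched in a few lines. This is an illustrative Python sketch, not the repository's MATLAB code: the function names, the base reward of 1.0 for any correct action, the reward of 0.0 for an incorrect action, and the default additional reward of 1.0 are all assumptions made for the example.

```python
import random

def correct_side(stimulus, rng=random):
    """Return the rewarded side for a stimulus in [-0.5, 0.5]."""
    if stimulus < 0:
        return "left"
    if stimulus > 0:
        return "right"
    # A zero stimulus carries no information: the side is assigned at random.
    return rng.choice(["left", "right"])

def trial_reward(action, side, block, base_reward=1.0, extra_reward=1.0):
    """Reward for one trial (base_reward and extra_reward values are assumed).

    block is 'left', 'right' or 'none' and names the side that currently
    earns the additional reward for a correct action.
    """
    if action != side:
        return 0.0  # assumed: incorrect actions earn nothing
    bonus = extra_reward if block == side else 0.0
    return base_reward + bonus
```

For example, under these assumed values a correct left action earns 2.0 in a 'left' block but only 1.0 in a 'right' or 'none' block.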

Task parameters

This code allows the user to choose some of the parameters of the task. For instance,

  • the number of trials
  • the number of reward blocks
  • options for reward blocks ('right','left' or 'none', where 'none' is optional)
  • stimulus values
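One way to turn these task parameters into a trial schedule is sketched below. It is a hypothetical Python helper, not the repository's code: the name `make_schedule`, the equal-length contiguous blocks, and the uniform random draws for stimuli and block types are assumptions made for illustration.

```python
import random

def make_schedule(n_trials, n_blocks, block_options, stimulus_values, seed=0):
    """Return per-trial stimuli and per-trial reward-block labels.

    Assumed layout: trials are split into n_blocks contiguous blocks of
    near-equal length, and each block's type is drawn at random.
    """
    rng = random.Random(seed)
    # One stimulus per trial, drawn uniformly from the allowed values.
    stimuli = [rng.choice(stimulus_values) for _ in range(n_trials)]
    # One reward-block type per block.
    block_types = [rng.choice(block_options) for _ in range(n_blocks)]
    # Map trial index i to its block index, then to that block's type.
    blocks = [block_types[i * n_blocks // n_trials] for i in range(n_trials)]
    return stimuli, blocks
```

A call such as `make_schedule(100, 4, ["left", "right"], [-0.5, -0.25, 0.0, 0.25, 0.5])` would give 100 stimuli and four blocks of 25 trials each.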

The model

This model implements a POMDP with Q-values. A Q-value quantifies how much the agent currently values choosing a particular action: the higher the Q-value, the more the agent values that action. Q-values are updated after every trial based upon the reward received.

  1. At the beginning of each trial, the agent receives some stimulus, s. The larger the absolute value of the stimulus, the clearer the stimulus appears to the agent.

  2. In order to model the agent having an imperfect perception of the stimulus, noise is added to the stimulus value. The perceived stimulus value is sampled from a normal distribution with mean s (the stimulus value) and standard deviation, sigma. The value of sigma is a parameter of the model.

  3. Using its perceived, noisy value of the stimulus, the agent then forms a belief as to the correct side of the stimulus. The agent calculates the probability of the stimulus being on a given side by calculating the cumulative probability of a normally distributed random variable (with mean noisy-stimulus-value and standard deviation sigma, as above) at zero.

  4. The agent then combines its belief as to the current side of the stimulus with its stored Q-values.

  5. The agent chooses either a left or right action, and receives the appropriate reward. This reward depends firstly on whether the agent has chosen the correct side, and secondly which is the current reward block. The current reward block will dictate whether the agent receives an additional reward for a correct action. The value of this additional reward is a second parameter of the model.

  6. The agent calculates the error in its prediction. This is equivalent to the reward minus the Q-value of the action taken.

  7. The prediction error, the agent's belief and the agent's learning rate (a third parameter of the model) are then used to update the Q-values for the next iteration.
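The seven steps above can be sketched as a single trial function. This is a minimal Python sketch under stated assumptions, not the repository's 'RunPOMDP' implementation. The text does not say how belief and Q-values are combined or how the action is selected, so the belief-weighted sum, the greedy choice with random tie-breaking, the base reward of 1.0, and the layout `q[side][action]` are all assumptions, as are the function names.

```python
import random
from statistics import NormalDist

SIDES = ("left", "right")

def belief_from_percept(percept, sigma):
    """Step 3: probability that the stimulus is on each side."""
    # Cumulative probability at zero of N(percept, sigma^2).
    p_left = NormalDist(mu=percept, sigma=sigma).cdf(0.0)
    return {"left": p_left, "right": 1.0 - p_left}

def action_values(q, belief):
    """Step 4 (assumed form): belief-weighted sum of stored Q-values."""
    return {a: sum(belief[s] * q[s][a] for s in SIDES) for a in SIDES}

def run_trial(q, stimulus, side, block, sigma, extra_reward, learning_rate, rng):
    """Run one trial and update q in place; q[side][action] is assumed."""
    percept = rng.gauss(stimulus, sigma)          # step 2: noisy perception
    belief = belief_from_percept(percept, sigma)  # step 3: belief over sides
    values = action_values(q, belief)             # step 4: combine with Q
    # Step 5 (assumed rule): greedy choice, ties broken at random.
    if values["left"] == values["right"]:
        action = rng.choice(SIDES)
    else:
        action = max(SIDES, key=lambda a: values[a])
    reward = 0.0
    if action == side:
        # Assumed base reward of 1.0 plus the block-dependent extra reward.
        reward = 1.0 + (extra_reward if block == side else 0.0)
    delta = reward - values[action]               # step 6: prediction error
    for s in SIDES:                               # step 7: belief-weighted update
        q[s][action] += learning_rate * belief[s] * delta
    return action, reward, delta
```

Weighting the update by the belief means that a trial the agent was unsure about changes the Q-values for both possible sides a little, while a clearly perceived trial changes mostly the Q-values for one side.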

Model parameters

  • sigma, the standard deviation of both the noise added to the agent's perception of the stimulus and the agent's belief distribution.
  • the value of the additional reward.
  • the learning rate.
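To show where each parameter enters, the model can be written out in equations. This is a sketch consistent with the description above rather than notation taken from the repository: the symbols $\hat{s}$, $p_L$, $p_R$, $r_{\text{extra}}$ and $\alpha$ are chosen here for illustration, and the belief-weighted sum for $Q_a$ and the base reward of 1 for a correct action are assumptions.

```latex
\begin{align*}
\hat{s} &\sim \mathcal{N}(s, \sigma^2)
  && \text{perceived stimulus (sigma)} \\
p_L &= \Phi\!\left(\frac{0 - \hat{s}}{\sigma}\right), \quad p_R = 1 - p_L
  && \text{belief (sigma)} \\
Q_a &= p_L\, Q_{L,a} + p_R\, Q_{R,a}
  && \text{assumed combination of belief and Q-values} \\
r &= \begin{cases}
  1 + r_{\text{extra}} & \text{correct action on the block's rewarded side} \\
  1 & \text{other correct action} \\
  0 & \text{incorrect action}
\end{cases}
  && \text{reward (additional reward)} \\
\delta &= r - Q_a
  && \text{prediction error} \\
Q_{k,a} &\leftarrow Q_{k,a} + \alpha\, p_k\, \delta, \quad k \in \{L, R\}
  && \text{update (learning rate)}
\end{align*}
```

Here $\Phi$ is the standard normal cumulative distribution function and $a$ is the action taken.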

Running the code

The file 'Main.m' runs the model. The code runs as is and plots the results.

The first two sections of the file allow the user to alter both the task parameters and the model parameters. The third section generates random stimulus values and reward blocks to be fed to the agent. The fourth section implements the POMDP with the function 'RunPOMDP'. The final section plots the results.

References

The ideas used to build the model implemented here are largely drawn from

Terminology and the majority of the notation are also taken from these sources.

The task implemented is based upon
