Reinforcement learning (RL) is a paradigm of machine learning in which an agent learns how to act in an environment so as to maximize its cumulative reward.
Markov Decision Process
A Markov Decision Process is a discrete-time stochastic control process. At each time step, the process is in a state s, and the decision maker chooses an action to perform. The process responds at the next time step by randomly moving into a new state s’ and giving the decision maker a reward. Importantly, the transition to the new state s’ depends only on the current state s and the action taken, not on any earlier states or actions (the Markov property).
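As a rough illustration of the description above, a small MDP can be written down as a transition table plus a sampling function. The state names, action names, probabilities, and rewards here are all made up for the example, and the (probability, next state, reward) layout is just one convenient convention.

```python
import random

# Hypothetical two-state MDP (all names and numbers are illustrative).
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "warm", -10.0)],
    },
}


def step(state, action, rng=random):
    """Sample (next_state, reward) for one time step.

    The outcome distribution is looked up from the current state and the
    chosen action only, which is exactly the Markov property.
    """
    outcomes = TRANSITIONS[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

Calling step("cool", "fast") repeatedly returns "cool" about half the time and "warm" the other half, always with reward 2.0 under these made-up numbers.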
Value Function
The value function of a state s under a given policy is the expected utility over all possible state sequences that start in s and are produced by following that policy, where the utility of a sequence is the sum of its discounted rewards. If the policy is optimal, the value function gives the utility of the state s.
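One common way to write this definition is sketched below. The notation is an assumption, not taken from the text: π is the policy, γ with 0 ≤ γ < 1 is the discount factor, R(s, a, s') is the reward for moving from s to s' under action a, and U^π(s) is the value of s under π.

```latex
U^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, R\big(s_t, \pi(s_t), s_{t+1}\big) \,\middle|\, s_0 = s \right]
```

The expectation is taken over the random state sequences s_0, s_1, s_2, ... generated by the transition model when actions are chosen by π.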
Bellman Equation
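The Bellman equation links the utility of a state s to the utilities of its successor states s’: the utility of s is the expected immediate reward plus the discounted utility of the successor state, under the best available action.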
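In symbols, one standard form of this relationship is given below. The notation is assumed for illustration: P(s' | s, a) is the transition probability, R(s, a, s') the reward on that transition, γ the discount factor, and U(s) the utility of s.

```latex
U(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \, U(s') \right]
```

The sum over s' is the expectation over the random next state, and the max over a reflects that the utility of a state assumes the best action is chosen there.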
Value Iteration
Value iteration is an algorithm that iteratively updates the estimates of state utilities using the Bellman equation. It assumes a known transition model.
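A minimal sketch of the idea, assuming the transition model is supplied as a table of (probability, next state, reward) triples. The function name, parameters, stopping rule, and the small example MDP are all illustrative choices rather than a definitive implementation.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Estimate state utilities by repeated Bellman updates.

    transitions[state][action] is a list of (probability, next_state, reward)
    triples; this known transition model is what the algorithm assumes.
    """
    values = {s: 0.0 for s in transitions}  # initial estimates
    for _ in range(max_iters):
        new_values = {}
        for s, actions in transitions.items():
            # Bellman update: best action's expected reward plus the
            # discounted utility estimate of the successor state.
            new_values[s] = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:  # stop once the estimates have stabilized
            break
    return values


# Hypothetical example: "goal" is absorbing with zero reward, and "go"
# reaches it from "start" with probability 0.8 for a reward of 1.
example = {
    "start": {
        "stay": [(1.0, "start", 0.0)],
        "go": [(0.8, "goal", 1.0), (0.2, "start", 0.0)],
    },
    "goal": {"stay": [(1.0, "goal", 0.0)]},
}
values = value_iteration(example)
```

Each sweep builds a fresh table from the previous one instead of updating in place, which keeps the update a direct application of the Bellman equation to the previous estimates. Updating in place is a common variant.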