REINFORCEMENT LEARNING

Last updated on: May 26, 2024



Definition of reinforcement learning


Key aspects of reinforcement learning

  1. Optimization
  2. Delayed consequences
  3. Exploration
  4. How do we explore?

  5. Generalization
  6. What are decision policies? A policy is a mapping from experiences to decisions; why does this mapping need to be learned?


AI Planning (vs RL)

AI planning involves optimization, generalization, and delayed consequences, but not exploration: the agent is given a model of how the world works, so it does not need to learn from interaction.

Supervised Machine Learning (vs RL)


Unsupervised Machine Learning (vs RL)


Imitation Learning

Imitation Learning, also known as Learning from Demonstration (LfD), is a method of machine learning where the learning agent aims to mimic human behavior. In traditional machine learning approaches, an agent learns from trial and error within an environment, guided by a reward function. However, in imitation learning, the agent learns from a dataset of demonstrations by an expert, typically a human. The goal is to replicate the expert's behavior in similar, if not the same, situations.
Imitation learning represents a powerful paradigm in machine learning, enabling agents to learn complex behaviors without the need for explicit reward functions. Its application spans numerous domains, offering the potential to automate tasks that have traditionally required human intuition and expertise. As research in this field continues to advance, we can expect imitation learning to play an increasingly significant role in the development of intelligent systems.
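The simplest instance of this idea is behavioural cloning: treat the expert demonstrations as a supervised dataset of (state, action) pairs and fit a classifier that predicts the expert's action. The sketch below is a minimal illustration only; the demonstration data and the logistic-regression policy are made-up assumptions, not part of any particular LfD system.

    # Minimal behavioural-cloning sketch (assumes scikit-learn is available).
    # expert_states / expert_actions are a hypothetical demonstration dataset:
    # one row of state features per row, one expert action label per row.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    expert_states = rng.normal(size=(500, 4))                # 500 demo states, 4 features each
    expert_actions = (expert_states[:, 0] > 0).astype(int)   # toy "expert" rule

    policy = LogisticRegression().fit(expert_states, expert_actions)

    new_state = rng.normal(size=(1, 4))
    imitated_action = policy.predict(new_state)              # mimic the expert in a new state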

Features of Imitation learning:

Imitation Learning vs RL


How to proceed with RL?


Issues with RL


Sequential Decision Making Under Uncertainty

Sequential Decision Making Under Uncertainty
  • Goal: Select actions to maximize total expected future reward (a minimal sketch of this interaction loop follows this list).
  • May require balancing immediate & long-term rewards.
  • May require strategic behaviour to achieve high rewards.
  • The world can also be an agent.
  • Examples: web advertisement, robot unloading a dishwasher, blood pressure control.
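To make the setting concrete, here is a minimal sketch of the discrete-time interaction loop: at each step the agent picks an action, the world returns a next state and a reward, and the agent's goal is to maximize total reward. The toy world dynamics and the random placeholder policy are assumptions for illustration only.

    # Toy sketch of the discrete-time agent/world interaction loop.
    # The world dynamics and the random policy here are illustrative assumptions.
    import random

    class World:
        def __init__(self):
            self.state = 0
        def step(self, action):
            # toy dynamics: the action nudges the state; reward favours staying near 5
            self.state = max(0, min(10, self.state + action))
            reward = -abs(self.state - 5)
            return self.state, reward

    world = World()
    state, total_reward = 0, 0.0
    for t in range(100):                      # agent observes, acts, receives reward
        action = random.choice([-1, 0, 1])    # placeholder policy
        state, reward = world.step(action)
        total_reward += reward                # goal: maximize total expected reward
    print(total_reward)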
Sequential Decision Making Under Uncertainty - Discrete Time
History
World State
Agent State

Markov Assumption


Types of Sequential Decision Making Process

Full Observability
Partial Observability
Partial Observability Example
Types Of Sequential Decision Process
MDPs and POMDPs
How the world changes
Mars Rover Example

RL Algorithm Components


Types of RL Agents


RL Agents

RL Agents

Key Challenges in Learning to Make Sequences of Good Decisions

Key Challenges In Learning To Make Sequences Of Good Decisions

Planning Example

Planning Example

Reinforcement Learning Example

Reinforcement Learning Example

Exploration and Exploitation


Exploration and Exploitation Examples

Exploration And Exploitation Examples

Evaluation and Control

Evaluation: Estimate/predict the expected rewards from following a given policy
Control: optimization - find the best policy
Example Evaluation
Example Control

Markov Process or Markov Chain

Markov Process Markov chain
When we think of a Markov process or Markov chain, we are not yet thinking about control. There are no actions. The idea is simply that there is a stochastic process evolving over time.
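As a small illustration, the sketch below simulates a Markov chain: a stochastic process in which the next state depends only on the current state, with no actions and no control. The three-state transition matrix is a made-up example.

    # Simulating a 3-state Markov chain (no actions, no rewards yet).
    # The transition matrix P is an illustrative assumption.
    import numpy as np

    P = np.array([[0.7, 0.2, 0.1],    # P[i, j] = probability of moving from state i to state j
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])

    rng = np.random.default_rng(0)
    state = 0
    trajectory = [state]
    for t in range(10):
        state = rng.choice(3, p=P[state])   # next state depends only on the current state
        trajectory.append(state)
    print(trajectory)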

Return & Value Function

Return-and-value-function
  • When we think about returns, we distinguish between returns and expected returns.
  • The horizon is how long the agent is acting for, i.e., how long the process goes on; it can be infinite.
  • The return is the sum of rewards from a time step onward, up to the horizon (which can be infinite). The definitions are written out below this list.
  • The value function is the expected return. If the process is deterministic the two are identical; if the process is stochastic they differ in general, because in a deterministic process you always go to the same next state, whereas in a stochastic process you do not.
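In symbols (standard definitions, with discount factor γ and a possibly infinite horizon):

    G_t  = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...      (return starting from time step t)
    V(s) = E[ G_t | s_t = s ]                      (value function: the expected return from state s)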

Discount factor

Discount-factor
  • The discount factor is used partly for mathematical convenience.
  • We are assured that the value function is bounded as long as the reward function is bounded (see the bound written out below this list).
  • Empirically, people often act as if there is a discount factor: we typically weigh future rewards lower than immediate rewards.
  • If you are using the discount factor only for mathematical convenience and your horizon is always guaranteed to be finite, it is fine to use γ (gamma) = 1.
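This is the standard boundedness argument: if every reward satisfies |r| ≤ R_max and γ < 1, then

    |V(s)| ≤ R_max + γ·R_max + γ²·R_max + ... = R_max / (1 - γ),

so a bounded reward function together with γ < 1 guarantees a bounded value function.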

Computing the Value of a Markov Reward Process

computing-the-value-of-a-markov-reward-process
  • The accuracy scales as 1/√n, where n is the number of roll-outs (simulations). A roll-out sketch follows this list.
  • It does not require the process to be Markov; it is just a way to estimate sums of rewards, i.e., returns.
  • Exploiting the Markov structure can give better estimates of the process, i.e., computationally cheaper ways of estimating its value.
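A minimal sketch of the simulation approach, assuming the toy three-state transition matrix from the Markov-chain example above plus a made-up reward vector: generate n roll-outs, compute each (truncated) discounted return, and average. The standard error of the estimate shrinks roughly as 1/√n.

    # Monte Carlo estimate of an MRP value via roll-outs (toy P and R, assumed for illustration).
    import numpy as np

    P = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
    R = np.array([1.0, 0.0, 10.0])          # reward received in each state
    gamma, horizon, n = 0.9, 50, 1000
    rng = np.random.default_rng(0)

    def rollout_return(start_state):
        s, ret = start_state, 0.0
        for t in range(horizon):
            ret += (gamma ** t) * R[s]      # discounted reward collected at step t
            s = rng.choice(3, p=P[s])
        return ret

    returns = [rollout_return(0) for _ in range(n)]
    print(np.mean(returns), np.std(returns) / np.sqrt(n))   # estimate and ~1/sqrt(n) standard error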
computing-the-value-of-a-markov-reward-process-2
This means the MRP value is the immediate reward plus the discounted sum of future rewards: V(s) = R(s) + γ Σ_{s'} P(s'|s) V(s').
matrix-form-of-MRP
The figure shows the matrix form of this equation, V = R + γPV, which can be solved directly for V as V = (I - γP)^{-1} R.
iterative-algorithm-for-computing-value
The iterative algorithm's computational complexity (roughly O(|S|²) per sweep) is lower than that of the direct matrix inversion (roughly O(|S|³)). A sketch of both is below.
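A sketch of both analytic approaches, reusing the same toy P, R, and γ from the roll-out example above (illustrative assumptions only):

    # Value of an MRP: direct matrix solve vs. iterative Bellman backups.
    import numpy as np

    P = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
    R = np.array([1.0, 0.0, 10.0])
    gamma = 0.9

    # Matrix form: V = R + gamma * P V  =>  V = (I - gamma * P)^{-1} R   (~O(|S|^3))
    V_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

    # Iterative form: repeatedly apply V <- R + gamma * P V             (~O(|S|^2) per sweep)
    V = np.zeros(3)
    for _ in range(1000):
        V = R + gamma * P @ V

    print(V_direct, V)   # the two estimates agree (up to convergence tolerance)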

Markov Decision Process

Markov-Decision-Process
MRP (Markov Reward Process) + actions
An MDP is a tuple (S, A, P, R, γ): the state space, the action space, the dynamics model P(s'|s, a), the reward model R(s, a), and the discount factor γ.

MDP Policies

MDP-Policies
MDP-plus-policy

MDP Control

MDP-Control

In general, if there are |A| actions and |S| states, how many deterministic policies are there?
--There are |A|^|S| (A to the power S) deterministic policies.
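For example, with |S| = 3 states and |A| = 2 actions there are 2^3 = 8 deterministic policies, since each state is independently assigned one of the two actions.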

Is the optimal policy for a MDP always unique?
--NO

But there is always an optimal policy which gives maximum return.


Policy Search

Policy-search

If there were unlimited compute power, we could evaluate each and every policy and then take the one with the maximum value.

But that is not the case in practice.


MDP Policy Iteration (PI)

MDP-Policy-Iteration-(PI)

In practice, we keep track of a guess of what the optimal policy might be. We evaluate its value and then we try to improve it.

If we can't improve it anymore then we halt.

So we start by initializing the policy randomly. The subscript i indexes which policy we are at: π_i is always our current guess of what the optimal policy might be. While the policy is still changing (we will discuss shortly whether it can cycle back to a previous one), we alternate two steps. First, policy evaluation: we evaluate the policy using the same techniques we just discussed, because with a fixed policy we are effectively in a Markov reward process. Then, policy improvement. The new ingredient here is the policy improvement step; a sketch of the full loop follows.
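A minimal sketch of this loop in the tabular, known-model setting; the small random dynamics model P[s, a, s'] and reward model R[s, a] below are assumptions made up for illustration.

    # Policy iteration sketch for a tabular MDP with known dynamics P[s, a, s'] and rewards R[s, a].
    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics model
    R = rng.normal(size=(n_states, n_actions))                        # toy reward model

    policy = np.zeros(n_states, dtype=int)      # initial (arbitrary) deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V for the fixed policy (an MRP)
        P_pi = P[np.arange(n_states), policy]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

        # Policy improvement: act greedily with respect to Q_pi
        Q = R + gamma * (P @ V)                  # Q[s, a]
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # halt when the policy stops changing
            break
        policy = new_policy
    print(policy, V)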


New Definition: State-Action Value Q

NewDefinition-State-ActionValueQ
  • In order to define how we can improve a policy, we define something new: the state-action value.
  • State values are denoted by V: V_π(s) is the expected discounted sum of rewards if we start in state s and follow policy π.
  • The state-action value says: I will follow policy π, but not right away. I first take an action a, which might be different from what my policy tells me to do, and then from the next time step onward I follow policy π.
  • So I get my immediate reward from taking this action a that I am choosing, then I transition to a new state (which depends on my current state and the action I just took), and from then on I follow policy π. That defines the Q function, written out below this list.
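Written out, with a known reward model R and dynamics model P:

    Q_π(s, a) = R(s, a) + γ · Σ_{s'} P(s' | s, a) · V_π(s')

i.e., take action a now, then follow policy π from the next time step onward.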

Policy Improvement

Policy-Improvement
  • Policy improvement takes a policy whose value we have just computed and asks whether we can do better. Policy evaluation only told us the value of that policy; now we want to see if we can improve it.
  • Remember that right now we are in the case where we know the dynamics model and the reward model. We compute Q_π(s, a) for all actions a and all states s: take a (possibly different) first action a, then follow the old policy afterwards. The improvement step then computes the new policy by maximizing this Q.
  • So we do this computation for all a and all s and then take the max over actions.
  • By definition, this max has to be greater than or equal to Q_π(s, π(s)), which equals V_π(s). (See the symbols below this list.)
  • Are we finding a local maximum or a global maximum? --Global.
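In symbols, the improvement step is

    π_{i+1}(s) = argmax_a Q_{π_i}(s, a)    for all states s,

and by definition

    max_a Q_{π_i}(s, a) ≥ Q_{π_i}(s, π_i(s)) = V_{π_i}(s),

so the greedy policy can only match or improve on the old one.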

Delving Deeper Into Policy Improvement Step

Delving-Deeper-Into-Policy-Improvement-Step
  • First the Q function is calculated, then the new policy is evaluated. For the new policy to be accepted, its value has to be greater than the old value. To do this we compute the Q function: we first take an action and then follow our old policy from then onwards, and we pick whichever action maximizes that quantity for each state.
  • We do this process for each state. But then we do not follow the old policy from that point onwards; we follow the new policy for all time, if the new policy is better.
Monotonic-Improvement-in-policy
  • The improvement is monotonic: the value of the new policy is greater than or equal to the value of the old policy in every state. So it is either the same value or better.
  • The inequality is strict if the old policy was suboptimal.
proof-monotonic-improvement-in-policy
proof-monotonic-improvement-in-policy-1
policy-iteration-1
policy-iteration-2
optimal-policy-1
Bellman-Eqn
value-iteration
policy-iteration-as-bellman-operation-1
contraction-operator
Will-value-iteration-converge
proof-bellman-backup-is-a-contraction

Dynamic Programming for Policy Evaluation

Dynamic-programming-for-policy-evaluation
Dynamic-programming-for-policy-evaluation-1
Dynamic-programming-for-policy-evaluation-2
  • You are in a state s (denoted by the white circle at the top), and then you can take an action. What dynamic programming does is compute an estimate of V_π at the top by taking the expectation under π of r_t + γ V_{k-1}, where the expectation is over the next-state probabilities P(s' | s, π(s)).
  • We start with a state, take an action, and then think about the next states we could reach.
  • We are assuming a stochastic process, so we may end up in a different next state each time.
  • After we reach that state, we can take some other action.
  • So we can think of drawing the tree of trajectories we might reach if we started in a state and followed our policy.
  • Whenever we get to make a choice, there is a single action we take, because we are doing policy evaluation.
  • In dynamic programming we take an expectation over next states (a sketch of this backup follows this list).
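A minimal sketch of this backup for a fixed policy, reusing the toy tabular setup from the policy-iteration sketch above (the dynamics P[s, a, s'], rewards R[s, a], and the fixed policy are assumptions for illustration):

    # Dynamic-programming policy evaluation:
    # V_k(s) = R(s, pi(s)) + gamma * sum_{s'} P(s' | s, pi(s)) * V_{k-1}(s')
    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
    R = rng.normal(size=(n_states, n_actions))                        # toy rewards
    policy = np.array([0, 1, 0])                                      # fixed policy to evaluate

    P_pi = P[np.arange(n_states), policy]      # transition matrix under the fixed policy
    R_pi = R[np.arange(n_states), policy]
    V = np.zeros(n_states)
    for k in range(1000):
        V = R_pi + gamma * P_pi @ V            # Bellman expectation backup (expectation over next states)
    print(V)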
Dynamic-programming-for-policy-evaluation-3
Dynamic-programming-for-policy-evaluation-4
Dynamic-programming-for-policy-evaluation-5
Dynamic-programming-for-policy-evaluation-6

Policy Evaluation

Policy-evaluation-DP

Monte Carlo (MC) Policy Evaluation

Monte-Carlo-Policy-Evaluation
Monte-Carlo-Policy-Evaluation-1
Monte-Carlo-Policy-Evaluation-2
First-visit-Monte-Carlo-On-Policy-Evaluation

Bias, Variance and MSE

How do we evaluate whether or not the estimate is good? How are we going to compare all of the estimators?
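The standard quantities used for this comparison, writing V_est for the estimator and V_π(s) for the true value:

    Bias     = E[ V_est(s) ] - V_π(s)
    Variance = E[ ( V_est(s) - E[ V_est(s) ] )² ]
    MSE      = Variance + Bias²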

Policy-Evaluation-Bias-Variance-MSE
Properties-Every-Visit-MC

Incremental Monte Carlo On Policy Evaluation

Incremental-MC-On-Policy-Evaluation
Incremental-MC-On-Policy-Evaluation-Running-Mean
MC-On-Policy-Evaluation
MC-On-Policy-Evaluation-1
MC-On-Policy-Evaluation-2
MC-On-Policy-Evaluation-3
MC-Policy-Evaluation-Summary

Temporal Difference Learning

Temporal-Difference-Learning
Temporal-Difference-Learning-1
Temporal-Difference-Learning-2
Temporal-Difference-Learning-3
TD-Learning
Temporal-Difference-Policy-Evaluation
DP-MC-TD-Understanding
DP-MC-TD-Understanding-1

Some Important Properties to evaluate Model-free Policy Evaluation Algorithms

Some-Important-Properties-to-evaluate-Model-free-Policy-Evaluation-Algorithms
Bias-Variance-Of-Model-Free-Policy-Evaluation-Algorithm
In-The-Example-MC-TD-Differences

The differences between MC and TD for the Mars rover example


Batch MC and TD

Batch-MC-and-TD

What if we want to go over the data more than once? If we are willing to spend a bit more computation, we can get better estimates and be more sample efficient, meaning that we use our data more efficiently to get a better estimate. This is often called batch or offline policy evaluation.

Let's imagine we have a set of K episodes, and we can repeatedly sample an episode from that set. We then apply either Monte Carlo or TD to the sampled episode (a sketch follows).
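A minimal sketch of the batch setting for TD(0), using a made-up toy batch of episodes stored as (state, reward, next_state) transitions; batch Monte Carlo would instead average, for each state, the returns observed after visiting it.

    # Batch TD(0): repeatedly sample stored episodes and replay their transitions.
    # The episodes below are an illustrative toy dataset, not real data.
    import random

    episodes = [
        [(0, 1.0, 1), (1, 0.0, 2)],
        [(0, 1.0, 2)],
        [(1, 0.0, 2), (2, 10.0, 2)],
    ]
    gamma, alpha = 0.9, 0.1
    V = {0: 0.0, 1: 0.0, 2: 0.0}

    for sweep in range(1000):
        episode = random.choice(episodes)              # re-use the fixed batch of data
        for s, r, s_next in episode:
            td_target = r + gamma * V[s_next]
            V[s] = V[s] + alpha * (td_target - V[s])   # TD(0) update
    print(V)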

Example-MC-TD
Example-MC-TD-1
Batch-MC-TD-Converges
Some-Important-Properties-to-evaluate-Model-free-Policy-Evaluation-Algorithms-1
Alternative-certainity-equivalence-V-pi-MLE-MDP-Model-Estimates



References:

Course content: https://www.youtube.com/playlist?list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u

Content- Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 1 - Introduction - Emma Brunskill - https://www.youtube.com/watch?v=FgzM3zpZ55o

Content- Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 2 - Given a Model of the World - https://www.youtube.com/watch?v=E3f2Camj0Is&t=1560s

Content- Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 3 - Model-Free Policy Evaluation - https://www.youtube.com/watch?v=dRIhrn8cc9w&t=78s

Imitation learning: https://deepai.org/machine-learning-glossary-and-terms/imitation-learning

Meaning of Montezuma's revenge: https://healthcenter.indiana.edu/health-answers/travel/travelers-diarrhea.html