· 3 min read · Shukraditya

RL Mastery Notes — Week 1

#reinforcement-learning

Day 0

Terminology

  1. Policy:

    Rule used by agent to decide what action to take

    • stochastic

      • $a_t \sim \pi(\cdot|s_t)$
      • at any time step $t$, the agent observes the current state $s_t$. Instead of one fixed move, it has a distribution over moves, each with its own probability, and samples $a_t$ from that distribution.
      • two types:
        1. Categorical
          • discrete action spaces. classifier over discrete actions.
          • Input: observation -> some layers -> logits for each action -> softmax to get probs
          • Sampling: given the prob for each action, sample.
          • log likelihood: denote the last layer's probabilities as $P_{\theta}(s)$, a vector with one entry per action. The log likelihood of action $a$ is obtained by indexing into that vector: $\log \pi_{\theta}(a|s)=\log[P_{\theta}(s)]_{a}$
        2. Diagonal Gaussian Policies
          • a multivariate Gaussian distribution is described by a mean vector $\mu$ and a covariance matrix $\Sigma$.
          • a diagonal Gaussian is the special case where the covariance matrix has only diagonal entries, $\therefore$ it can be represented by a vector.
          • 2 ways to represent the standard deviations as a vector:
            1. a single standalone vector of log standard deviations $\log \sigma$, not a function of state
            2. a neural net that maps from states to log standard deviations, $\log \sigma_{\theta}(s)$
    • deterministic

      • $a_{t}=\mu(s_{t})$
      • action exactly determined by state, no randomness
  2. Returns

    • finite horizon undiscounted return: sum of rewards obtained in a fixed window of steps $$R(\tau)=\sum_{t=0}^Tr_{t}$$
    • infinite horizon discounted return: sum of all rewards ever obtained, discounted by how far in the future they’re obtained. With the indexing below, a reward received $k$ time steps in the future is worth only $\gamma^{k}$ times what it would be worth immediately, with $\gamma \in (0,1)$. $$R(\tau)=\sum_{t=0}^\infty \gamma^tr_{t}$$
  3. Trajectory

    • $\tau$: sequence of states and actions, $\tau=(s_{0},a_{0},s_{1},a_{1},\dots)$
    • $s_0$ is sampled randomly from the start state distribution $\rho_{0}$: $s_{0}\sim \rho_{0}(\cdot)$
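
The two stochastic policy types above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the logits, mean and log standard deviations are made-up numbers standing in for the output of a network, and all function names are hypothetical.

```python
import math
import random

def softmax(logits):
    """Turn raw logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_sample(probs, rng):
    """Draw an action index a ~ pi(.|s) from the probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def categorical_log_prob(probs, action):
    """log pi(a|s) = log [P(s)]_a: index the vector, then take the log."""
    return math.log(probs[action])

def gaussian_sample(mu, log_std, rng):
    """a = mu + sigma * z with z ~ N(0, I), one noise draw per dimension."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_std)]

def gaussian_log_prob(action, mu, log_std):
    """Log-density of a diagonal Gaussian: a sum of per-dimension terms."""
    total = 0.0
    for a, m, ls in zip(action, mu, log_std):
        z = (a - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total

rng = random.Random(0)          # fixed seed, only so runs are repeatable
logits = [2.0, 1.0, 0.1]        # hypothetical network output for one state
probs = softmax(logits)
a = categorical_sample(probs, rng)
print(a, categorical_log_prob(probs, a))

mu, log_std = [0.0, 1.0], [0.0, -1.0]   # hypothetical mean and log std
u = gaussian_sample(mu, log_std, rng)
print(u, gaussian_log_prob(u, mu, log_std))
```

Working with $\log \sigma$ rather than $\sigma$ keeps the parameter unconstrained, since exponentiating always gives a positive standard deviation.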

RL Problem

maximisation of expected return over a given horizon.
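
As a small sketch of the quantity being maximised, the two return definitions from the terminology list can be computed like this. The reward list and function names are hypothetical, and the discounted version is truncated to whatever rewards are passed in.

```python
def undiscounted_return(rewards):
    """Finite-horizon undiscounted return: plain sum over the window."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return sum_t gamma^t * r_t, truncated to the rewards given."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 0.0, 2.0]       # hypothetical rewards r_0, r_1, r_2
print(undiscounted_return(rewards))
print(discounted_return(rewards, gamma=0.5))
```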

for stochastic environment transitions and a stochastic policy, the probability of a T-step trajectory is $$P(\tau|\pi)=\rho_{0}(s_{0})\prod_{t=0}^{T-1}\pi(a_{t}|s_{t})P(s_{t+1}|s_{t},a_{t})$$

the expected return is $$J(\pi)=\int_{\tau}P(\tau|\pi)R(\tau)=\mathbb{E}_{\tau \sim \pi}[R(\tau)]$$ ie the expected return of a trajectory $\tau$ sampled by following policy $\pi$.

the central optimisation problem in RL can then be expressed as $\pi^*=\arg\max_{\pi}J(\pi)$, where $\pi^*$ is the optimal policy.
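
To make the trajectory probability and $J(\pi)$ concrete, here is a rough sketch on a made-up two-state MDP, small enough to enumerate every trajectory. The states, actions, transition probabilities and reward function are all assumptions for illustration (the notes do not define a reward function), and the return is the undiscounted sum over the $T$ steps.

```python
import itertools

STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

rho0 = {"A": 1.0, "B": 0.0}                 # start-state distribution
pi = {"A": {"stay": 0.5, "go": 0.5},        # pi(a|s), made up for illustration
      "B": {"stay": 1.0, "go": 0.0}}

def transition(s_next, s, a):
    """P(s'|s,a): 'go' flips the state with prob. 0.9, 'stay' never moves."""
    if a == "stay":
        return 1.0 if s_next == s else 0.0
    return 0.9 if s_next != s else 0.1

def reward(s, a, s_next):
    """Hypothetical reward: 1 for every step that ends in state B."""
    return 1.0 if s_next == "B" else 0.0

def trajectory_prob(states, actions):
    """rho0(s_0) * prod_t pi(a_t|s_t) * P(s_{t+1}|s_t,a_t)."""
    p = rho0[states[0]]
    for t, a in enumerate(actions):
        p *= pi[states[t]][a] * transition(states[t + 1], states[t], a)
    return p

def expected_return(T):
    """J(pi) as the sum over all T-step trajectories of P(tau|pi) * R(tau)."""
    J = 0.0
    for states in itertools.product(STATES, repeat=T + 1):
        for actions in itertools.product(ACTIONS, repeat=T):
            R = sum(reward(states[t], actions[t], states[t + 1]) for t in range(T))
            J += trajectory_prob(states, actions) * R
    return J

print(trajectory_prob(("A", "A", "B"), ("stay", "go")))
print(expected_return(T=2))
```

Enumeration only works for tiny problems like this one; in practice the expectation is usually estimated from sampled trajectories instead.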

Value Functions

value of a state or state-action pair, ie the expected return if you START in that state or state-action pair and then act according to the policy.

types:

  • on-policy value function: expected return if you start in state $s$ and always act according to policy $\pi$ $$V^\pi(s)=\mathbb{E}_{\tau\sim \pi}[R(\tau)|s_{0}=s]$$
  • on-policy action-value function: expected return if you start in state $s$, take an arbitrary action $a$ (which may not come from the policy), and THEN forever act according to policy $\pi$ $$Q^\pi(s,a)=\mathbb{E}_{\tau\sim \pi}[R(\tau)|s_{0}=s,a_{0}=a]$$
  • optimal value function: expected return if you start in state $s$ and always act according to the optimal policy $$V^*(s)=\max_{\pi}V^\pi(s)$$
  • optimal action-value function: expected return if you start in state $s$, take an arbitrary action $a$, and then forever act according to the optimal policy $$Q^*(s,a)=\max_{\pi}Q^\pi(s,a)$$
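
The four value functions can be compared on a toy problem. The sketch below is only illustrative: the two-state MDP, the policy and the discount factor are made up, and the expectations are computed by expanding them one step at a time and truncating after a fixed number of steps, which approximates the infinite-horizon discounted values.

```python
from functools import lru_cache

GAMMA = 0.5                     # hypothetical discount factor
ACTIONS = ["stay", "go"]
pi = {"A": {"stay": 0.5, "go": 0.5},    # pi(a|s), made up for illustration
      "B": {"stay": 1.0, "go": 0.0}}

def transition(s, a):
    """Return {s': P(s'|s,a)}; 'go' flips the state with probability 0.9."""
    other = "B" if s == "A" else "A"
    return {s: 1.0} if a == "stay" else {other: 0.9, s: 0.1}

def reward(s_next):
    """Hypothetical reward: 1 for every step that ends in state B."""
    return 1.0 if s_next == "B" else 0.0

@lru_cache(maxsize=None)
def q_pi(s, a, steps):
    """Q^pi(s,a): take action a first, then follow pi for the remaining steps."""
    if steps == 0:
        return 0.0
    return sum(p * (reward(s2) + GAMMA * v_pi(s2, steps - 1))
               for s2, p in transition(s, a).items())

def v_pi(s, steps):
    """V^pi(s): the first action is drawn from pi as well."""
    return sum(pi[s][a] * q_pi(s, a, steps) for a in ACTIONS)

@lru_cache(maxsize=None)
def q_star(s, a, steps):
    """Q*(s,a): take action a first, then act optimally afterwards."""
    if steps == 0:
        return 0.0
    return sum(p * (reward(s2) + GAMMA * v_star(s2, steps - 1))
               for s2, p in transition(s, a).items())

def v_star(s, steps):
    """V*(s): pick the best action at every step instead of following pi."""
    return max(q_star(s, a, steps) for a in ACTIONS)

STEPS = 40                      # truncation depth; GAMMA**40 is negligible
for s in ("A", "B"):
    print(s, v_pi(s, STEPS), v_star(s, STEPS))
```

In state A the policy wastes half its moves on "stay", so $V^\pi$ comes out below $V^*$ there, while in state B the policy already does the best thing and the two agree.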
Question

When we talk about value functions, if we do not make reference to time-dependence, we only mean expected infinite-horizon discounted return. Value functions for finite-horizon undiscounted return would need to accept time as an argument. Can you think about why? Hint: what happens when time’s up?

Solution

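When time’s up at the final step $T$, $v_{T}(s)=0$ for all states, since there is no more reward to be gained. Before that, the value depends on how many steps remain: the same state is generally worth more with many steps left than with few, so in general $v_1(s)\ne v_{10}(s)$. That’s why the value function needs time as an argument.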
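
A tiny sketch of that argument, under made-up assumptions: a single-state task that pays a reward of 1 on every step, with the episode ending at $t=T=10$. The values are filled in backwards from the end, where nothing more can be earned.

```python
T = 10                          # hypothetical horizon: the episode ends at t = T

def finite_horizon_values(step_reward, horizon):
    """v_t for a single-state task that pays step_reward on every step."""
    v = [0.0] * (horizon + 1)   # v[horizon] = 0: no reward once time is up
    for t in range(horizon - 1, -1, -1):
        v[t] = step_reward + v[t + 1]   # one more step of reward than at t + 1
    return v

v = finite_horizon_values(1.0, T)
print(v[1], v[10])
```

The state never changes here, yet its value shrinks as $t$ approaches $T$, which is exactly why $t$ has to be an input.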

Bellman Function

RL Mastery-2

Advantage function

how much better taking action $a$ in state $s$ is than acting according to $\pi$ on average <- useful for policy gradient methods

$$A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)$$
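
A short sketch of the advantage for a single state, using made-up $Q^\pi$ values and action probabilities. It relies on the identity $V^\pi(s)=\sum_a \pi(a|s)Q^\pi(s,a)$, which is not written out in the notes above; the function names are hypothetical.

```python
def state_value(q_row, pi_row):
    """V^pi(s) = sum_a pi(a|s) * Q^pi(s,a)."""
    return sum(pi_row[a] * q_row[a] for a in q_row)

def advantages(q_row, pi_row):
    """A^pi(s,a) = Q^pi(s,a) - V^pi(s) for every action in one state."""
    v = state_value(q_row, pi_row)
    return {a: q - v for a, q in q_row.items()}

q_row = {"left": 1.0, "right": 3.0}     # hypothetical Q^pi(s, .) values
pi_row = {"left": 0.25, "right": 0.75}  # hypothetical pi(.|s)
adv = advantages(q_row, pi_row)
print(adv)

# Averaged under the policy, the advantage is zero by construction.
print(sum(pi_row[a] * adv[a] for a in adv))
```

A positive advantage marks an action that beats the policy's average in that state, a negative one marks an action that does worse.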

Sources