December 24, 2022

## Transition $T_a$

Having discussed states and actions, we now turn to the transition function $T_a\left(s, s^{\prime}\right)$. The transition function $T_a$ determines how the state changes after an action has been selected. In model-free reinforcement learning the transition function is implicit in the solution algorithm: the environment has access to the transition function and uses it to compute the next state $s^{\prime}$, but the agent does not. (In Chap. 5 we will discuss model-based reinforcement learning. There the agent has its own transition function, an approximation of the environment’s transition function, which is learned from the environment feedback.)
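
The model-free setting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `TinyEnv`, its two states, its action names, and its probabilities are all hypothetical and do not come from the text. The point is that the transition table lives inside the environment, and the agent sees nothing but sampled next states.

```python
import random


class TinyEnv:
    """Hypothetical two-state environment that keeps T_a(s, s') private."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        # _T[a][s] maps each next state s' to its probability T_a(s, s').
        # All numbers here are made up for illustration.
        self._T = {
            "left": {0: {0: 1.0}, 1: {0: 0.9, 1: 0.1}},
            "right": {0: {1: 0.8, 0: 0.2}, 1: {1: 1.0}},
        }

    def step(self, s, a):
        """Sample s' from T_a(s, .); the caller only ever sees the sample."""
        next_states = list(self._T[a][s])
        probs = [self._T[a][s][n] for n in next_states]
        return self._rng.choices(next_states, weights=probs, k=1)[0]


# The agent interacts through step() and never reads the table itself.
env = TinyEnv()
s = 0
trajectory = [s]
for a in ["right", "right", "left"]:
    s = env.step(s, a)
    trajectory.append(s)
print(trajectory)
```

A model-based agent, by contrast, would try to estimate something like `_T` from the samples that `step` returns.
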

### Graph View of the State Space

We have discussed states, actions, and transitions. The dynamics of the MDP are modelled by transition function $T_a(\cdot)$ and reward function $R_a(\cdot)$. The imaginary space of all possible states is called the state space. The state space is typically large. The two functions define a two-step transition from state $s$ to $s^{\prime}$ via action $a$: $s \rightarrow a \rightarrow s^{\prime}$.

To help our understanding of the transitions between states, we can use a graphical depiction, as in Fig. 2.5.

In the figure, states and actions are depicted as nodes (vertices), and transitions are links (edges) between the nodes. States are drawn as open circles, and actions as smaller black circles. In a certain state $s$, the agent can choose which action $a$ to perform, which is then acted out in the environment. The environment returns the new state $s^{\prime}$ and the reward $r^{\prime}$.

Figure 2.5 shows a transition graph of the elements of the MDP tuple $s, a, t_a, r_a$, together with $s^{\prime}$ and policy $\pi$, and how the value can be calculated. The root node at the top is state $s$, where policy $\pi$ lets the agent choose between three actions $a$. Following distribution $\mathrm{Pr}$, each action can transition to two possible states $s^{\prime}$, each with its reward $r^{\prime}$. In the figure, only a single transition is shown; the other transitions follow the same pattern as the graph extends downward.

In the left panel of the figure the environment can choose which new state it returns in response to the action (stochastic environment); in the middle panel there is only one state for each action (deterministic environment); the tree can then be simplified, showing only the states, as in the right panel.
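
To make the value calculation over such a graph concrete, here is a minimal sketch of a one-step backup at the root node. It assumes the standard expected-value form $V(s)=\sum_a \pi(a \mid s) \sum_{s^{\prime}} T_a(s, s^{\prime})\left[r^{\prime}+\gamma V(s^{\prime})\right]$; the discount factor `gamma`, the function name `backup_value`, and all numbers are assumptions made for illustration, not values from the figure.

```python
def backup_value(policy, transitions, next_values, gamma):
    """One-step backup at a root state s.

    policy:      {action: pi(a|s)}
    transitions: {action: [(probability, next_state, reward), ...]}
    next_values: {next_state: V(s')}
    """
    value = 0.0
    for a, pi_a in policy.items():
        for prob, s_next, reward in transitions[a]:
            value += pi_a * prob * (reward + gamma * next_values[s_next])
    return value


policy = {"a1": 0.5, "a2": 0.5}
next_values = {"s1": 10.0, "s2": 0.0}

# Stochastic environment: an action may lead to several next states.
stochastic = {
    "a1": [(0.5, "s1", 1.0), (0.5, "s2", 0.0)],
    "a2": [(1.0, "s2", 2.0)],
}
# Deterministic environment: exactly one next state per action.
deterministic = {
    "a1": [(1.0, "s1", 1.0)],
    "a2": [(1.0, "s2", 2.0)],
}

v_stochastic = backup_value(policy, stochastic, next_values, gamma=0.5)
v_deterministic = backup_value(policy, deterministic, next_values, gamma=0.5)
print(v_stochastic, v_deterministic)  # 2.5 4.0
```

In the deterministic case the inner sum has a single term per action, which is why the tree can be drawn with states only.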

## Reinforcement Learning Objective

We now have the ingredients to formally state the objective $J(\cdot)$ of reinforcement learning. The objective is to achieve the highest possible average return from the start state:
$$J(\pi)=V^\pi\left(s_0\right)=\mathbb{E}_{\tau_0 \sim p\left(\tau_0 \mid \pi\right)}\left[R\left(\tau_0\right)\right]$$
for $p\left(\tau_0\right)$ given in Eq. 2.1. There is one optimal value function, which achieves higher or equal value than all other value functions. We search for a policy that achieves this optimal value function, which we call the optimal policy $\pi^{\star}$:
$$\pi^{\star}(a \mid s)=\underset{\pi}{\arg \max } V^\pi\left(s_0\right) .$$

This function $\pi^{\star}$ is the optimal policy. Here the arg max ranges over policies: it selects the policy whose value at the start state is highest. The goal in reinforcement learning is to find this optimal policy for start state $s_0$.
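
Because $J(\pi)$ is an expectation over trajectories, it can be approximated by averaging sampled returns. The sketch below assumes a hypothetical one-state environment (`bandit_step`), a finite `horizon`, and a discount factor `gamma`; none of these are specified in the text, and the return is taken to be the discounted sum of rewards.

```python
import random


def bandit_step(s, a, rng):
    """Hypothetical one-state environment: returns (next_state, reward)."""
    if a == "safe":
        return s, 0.2  # small but certain reward
    return s, (1.0 if rng.random() < 0.5 else 0.0)  # risky reward


def sample_return(policy, s0, horizon, gamma, rng):
    """Roll out one trajectory from s0 and return its discounted return."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r = bandit_step(s, a, rng)
        total += discount * r
        discount *= gamma
    return total


def estimate_objective(policy, s0=0, episodes=2000, horizon=3, gamma=1.0, seed=0):
    """Monte Carlo estimate of J(pi): the average return from the start state."""
    rng = random.Random(seed)
    returns = [sample_return(policy, s0, horizon, gamma, rng) for _ in range(episodes)]
    return sum(returns) / len(returns)


j_safe = estimate_objective(lambda s: "safe")
j_risky = estimate_objective(lambda s: "risky")
print(j_safe, j_risky)
```

Comparing such estimates across candidate policies is a brute-force reading of the arg max over $\pi$; practical algorithms search the policy space far more efficiently.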

A potential benefit of state-action values $Q$ over state values $V$ is that state-action values directly tell what every action is worth. This may be useful for action selection, since, for discrete action spaces,
$$a^{\star}=\underset{a \in A}{\arg \max } Q^{\star}(s, a),$$
the $Q$ function directly identifies the best action. Equivalently, the optimal policy can be obtained directly from the optimal $Q$ function:
$$\pi^{\star}(s)=\underset{a \in A}{\arg \max } Q^{\star}(s, a) .$$
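
For a discrete action space, this arg max amounts to a table lookup. The following sketch assumes the optimal values are already known and stored in a dictionary keyed by `(state, action)`; the table contents and the function names are hypothetical.

```python
def greedy_action(Q, s, actions):
    """Return the action with the highest Q-value in state s."""
    return max(actions, key=lambda a: Q[(s, a)])


def greedy_policy(Q, states, actions):
    """Build the deterministic policy pi(s) = argmax_a Q(s, a)."""
    return {s: greedy_action(Q, s, actions) for s in states}


# Made-up Q-values for two states and two actions.
Q = {
    (0, "left"): 0.1, (0, "right"): 0.7,
    (1, "left"): 0.4, (1, "right"): 0.3,
}
pi = greedy_policy(Q, states=[0, 1], actions=["left", "right"])
print(pi)  # {0: 'right', 1: 'left'}
```

With state values $V$ alone, the same choice would require looking one step ahead through the transition function, which a model-free agent does not have.
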
We now turn to constructing algorithms that compute the value function and the policy function.


