## Case Study

Here we consider TD learning on a random walk. Given a policy $\mu$, an MDP can be viewed as a Markov cost process (MCP). In this MCP, we have $n=20$ states. The transition diagram of the MCP is given in Figure 19.2. At state $i_k$, $k=2,3, \ldots, n-1$, the process moves either left to $i_{k-1}$ or right to $i_{k+1}$ with equal probability, so the transitions at states $i_2$ through $i_{n-1}$ resemble a symmetric one-dimensional random walk. At state $i_1$, the process moves to $i_2$ with probability $\frac{1}{2}$ or stays at $i_1$ with probability $\frac{1}{2}$. At state $i_n$, the probabilities of moving to $i_{n-1}$ and of staying at $i_n$ are both $\frac{1}{2}$. That is, we have $p_{i_k i_{k+1}}\left(\mu\left(i_k\right)\right)=p_{i_k i_{k-1}}\left(\mu\left(i_k\right)\right)=\frac{1}{2}$ for $k=2,3, \ldots, n-1$, $p_{i_1 i_1}=p_{i_1 i_2}=\frac{1}{2}$, and $p_{i_n i_n}=p_{i_n i_{n-1}}=\frac{1}{2}$. The cost at state $i_k$ is set to $k$ if $k \leq 10$ and $21-k$ if $k>10$. That is,
$$g\left(i_k, \mu\left(i_k\right)\right)=\left\{\begin{array}{ll} k & \text { if } k \leq 10, \\ 21-k & \text { otherwise. } \end{array}\right.$$
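The transition probabilities and costs described above can be written out as a matrix and a vector. The following is a minimal sketch, assuming that state $i_k$ is stored at zero-based index $k-1$; the function name `random_walk_mcp` is illustrative and does not come from the text.

```python
import numpy as np


def random_walk_mcp(n=20):
    """Build the transition matrix P and cost vector g of the random-walk MCP.

    Assumption: state i_k is stored at zero-based index k - 1.
    """
    P = np.zeros((n, n))
    for k in range(n):
        # Move left with probability 1/2; at the left end this becomes "stay".
        P[k, max(k - 1, 0)] += 0.5
        # Move right with probability 1/2; at the right end this becomes "stay".
        P[k, min(k + 1, n - 1)] += 0.5

    # Cost g(i_k) = k for k <= 10 and 21 - k otherwise (k is one-based).
    ks = np.arange(1, n + 1)
    g = np.where(ks <= 10, ks, 21 - ks).astype(float)
    return P, g


P, g = random_walk_mcp()
print(P[:3, :4])
print(g)
```

Each row of `P` sums to one, and the cost vector rises from 1 to 10 and then falls back to 1, which matches the piecewise definition of $g$.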
We consider the discount factor $\alpha=0.9$. The task is to use the approximate $\operatorname{TD}(\lambda)$ learning algorithm to estimate the cost-to-go function $J^\mu$ of this MCP. We consider a linear parametrization of the form
$$J(i, r)=r(3) i^2+r(2) i+r(1)$$
and $r=(r(1), r(2), r(3)) \in \mathbb{R}^3$. Suppose the learning agent updates $r_t$ using the $\operatorname{TD}(\lambda)$ learning algorithm (19.3) and (19.4) and tries to find an estimate of $J^\mu$. We simulate the MCP to obtain a sufficiently long trajectory and its associated cost signals. Ideally the trajectory would be infinitely long; here, we set its length to $10^5$. We run $\operatorname{TD}(1)$ and $\operatorname{TD}(0)$ on the same simulated trajectory, following the rules given in (19.3) and (19.4). In the resulting plot, the black line indicates the cost-to-go function of the MCP, and the blue markers are the approximations of the cost-to-go function obtained from the $\operatorname{TD}(\lambda)$ algorithm (19.3) and (19.4) with $\lambda=1$ and $\lambda=0$. Both $J_{\mathrm{TD}(1)}$ and $J_{\mathrm{TD}(0)}$ are quadratic functions of $i$, as specified in (19.21), and both serve as fairly good approximations of $J^\mu$. The number of parameters that must be updated drops from $n=20$ in the $\operatorname{TD}(\lambda)$ algorithm (19.2) to $K=3$ in its approximate counterpart (19.3), which makes the approximate version computationally more efficient.
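The experiment can be reproduced approximately with a short simulation. The sketch below rests on several assumptions, because equations (19.3) and (19.4) are not reproduced in this section: it uses the standard eligibility-trace form of $\operatorname{TD}(\lambda)$ with linear features, namely $z_t=\alpha \lambda z_{t-1}+\phi(i_t)$, $d_t=g(i_t)+\alpha \phi(i_{t+1})^{\top} r_t-\phi(i_t)^{\top} r_t$, and $r_{t+1}=r_t+\gamma_t d_t z_t$. The diminishing step size $\gamma_t$, the random seed, and the rescaled basis $(1, x, x^2)$ with $x \in[-1,1]$ are illustrative choices. The rescaled basis spans the same quadratic family as $J(i, r)$ and is used only for numerical conditioning. The helper `to_index_coeffs` is hypothetical; it converts the learned weights back to $(r(1), r(2), r(3))$.

```python
import numpy as np

n, alpha, T = 20, 0.9, 100_000
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Random-walk MCP from the text (state i_k stored at zero-based index k - 1).
idx = np.arange(n)
P = np.zeros((n, n))
P[idx, np.maximum(idx - 1, 0)] += 0.5
P[idx, np.minimum(idx + 1, n - 1)] += 0.5
g = np.where(idx + 1 <= 10, idx + 1, 21 - (idx + 1)).astype(float)

# Exact cost-to-go for reference: J = g + alpha * P J, so (I - alpha P) J = g.
J_exact = np.linalg.solve(np.eye(n) - alpha * P, g)

# Quadratic features in a rescaled coordinate x = (i - c) / s in [-1, 1].
c, s = (n + 1) / 2, (n - 1) / 2
x = (idx + 1 - c) / s
Phi = np.stack([np.ones(n), x, x**2], axis=1)

# Simulate one trajectory; clamping at the ends implements "stay" w.p. 1/2.
steps = rng.integers(0, 2, size=T) * 2 - 1
states = np.empty(T + 1, dtype=int)
states[0] = rng.integers(n)
for t in range(T):
    states[t + 1] = min(max(states[t] + steps[t], 0), n - 1)


def td_lambda(states, lam, gamma0=0.05, t0=10_000.0):
    """TD(lambda) with linear features (assumed standard form, see lead-in)."""
    w = np.zeros(3)  # weights in the rescaled basis
    z = np.zeros(3)  # eligibility trace
    for t in range(len(states) - 1):
        i, j = states[t], states[t + 1]
        z = alpha * lam * z + Phi[i]
        d = g[i] + alpha * (Phi[j] @ w) - Phi[i] @ w  # temporal difference
        w = w + gamma0 / (1.0 + t / t0) * d * z       # diminishing step size
    return w


def to_index_coeffs(w):
    """Map weights on (1, x, x^2) to (r(1), r(2), r(3)) on (1, i, i^2)."""
    r3 = w[2] / s**2
    r2 = w[1] / s - 2 * c * w[2] / s**2
    r1 = w[0] - w[1] * c / s + w[2] * c**2 / s**2
    return np.array([r1, r2, r3])


w_td1 = td_lambda(states, lam=1.0)
w_td0 = td_lambda(states, lam=0.0)

for name, w in [("TD(1)", w_td1), ("TD(0)", w_td0)]:
    rmse = np.sqrt(np.mean((Phi @ w - J_exact) ** 2))
    print(name, "r =", to_index_coeffs(w), "RMS error vs. exact J:", rmse)
```

Because the trajectory is finite and the updates are stochastic, the printed coefficients vary with the seed and the step-size schedule. Comparing `Phi @ w` with `J_exact` plays the role of comparing the blue markers with the black line.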

## Motivation and Challenges

Military tactical networks often suffer from severe resource constraints (e.g., battery, computing power, bandwidth, and/or storage). These constraints make it highly challenging to design a network (or system) that is robust against adversaries in highly dynamic tactical environments. When a network consists of highly heterogeneous entities, such as Internet-of-Things (IoT) devices in a large-scale network, deploying multiple defense mechanisms to protect the system requires considerable intelligence to balance the conflicting goals of security and performance. When a node fails because it has been compromised or has a functional fault, several strategies can be considered to deal with the failure, such as destruction, repair, or replacement. The choice of strategy is vital both for completing a given mission and for defending against adversaries, because the nodes themselves are the network resources that provide the intended services and defense/security. If a node detected as compromised is not useful to the system, it can be discarded (i.e., disconnected or destroyed) based on the concept of “disposable security” (Kott et al. 2016) or “self-destruction” (Brueckner et al. 2014; Curiac et al. 2009; Zeng et al. 2010). However, if the node is a highly critical asset that holds highly confidential information or provides critical services (e.g., web servers or databases), its removal will cause critical damage to service provision and/or may introduce a security breach. To obtain optimal intrusion response strategies for nodes detected as compromised or failed, we propose a bio-inspired multilayer network structure that consists of three layers: the core layer, the middle layer, and the outer layer. Nodes are placed in a layer according to their importance: the most important nodes in the core layer, nodes of medium importance in the middle layer, and the least important nodes in the outer layer. This network design is bio-inspired: it mimics the defense system of the human body under pathogen attack and the way the body deals with an infection that aims to reach the inner organs (Janeway et al. 1996).

