# Regression and Curve Fitting

Doug I. Jones

## Regression and Curve Fitting

Regression and curve fitting are also important computational tasks in statistics. A common approach is to use least squares to fit a set of data $\left(\boldsymbol{x}_i, y_i\right)$, $i=1,2, \ldots, N$, with a function $\boldsymbol{x} \mapsto \sum_{j=1}^m c_j \varphi_j(\boldsymbol{x})$. We assume a statistical model $y_i=\sum_{j=1}^m c_j \varphi_j\left(\boldsymbol{x}_i\right)+\epsilon_i$ where each $\epsilon_i$ follows a $\operatorname{Normal}\left(0, \sigma^2\right)$ distribution and the $\epsilon_i$'s are mutually independent. The likelihood function is
$$\begin{aligned} L(\boldsymbol{c}) & =\prod_{i=1}^N\left(2 \pi \sigma^2\right)^{-1 / 2} \exp \left(-\epsilon_i^2 /\left(2 \sigma^2\right)\right) \\ & =\left(2 \pi \sigma^2\right)^{-N / 2} \exp \left(-\boldsymbol{\epsilon}^T \boldsymbol{\epsilon} /\left(2 \sigma^2\right)\right), \quad \text { so } \\ \ln L(\boldsymbol{c}) & =-\frac{N}{2} \ln \left(2 \pi \sigma^2\right)-\frac{1}{2 \sigma^2} \boldsymbol{\epsilon}^T \boldsymbol{\epsilon}=-\frac{N}{2} \ln \left(2 \pi \sigma^2\right)-\frac{1}{2 \sigma^2}(\Phi \boldsymbol{c}-\boldsymbol{y})^T(\Phi \boldsymbol{c}-\boldsymbol{y}) \end{aligned}$$
where $\Phi_{i j}=\varphi_j\left(\boldsymbol{x}_i\right)$. So maximizing $L(\boldsymbol{c})$ over $\boldsymbol{c}$ is equivalent to minimizing $(\Phi \boldsymbol{c}-\boldsymbol{y})^T(\Phi \boldsymbol{c}-\boldsymbol{y})$. That is, we are minimizing the sum of the squares of the errors. This can be done using either the normal equations or the $\mathrm{QR}$ factorization. The optimal $\widehat{\boldsymbol{c}}$ satisfies the normal equations $\Phi^T \Phi \widehat{\boldsymbol{c}}=\Phi^T \boldsymbol{y}$. If $\boldsymbol{c}$ is the true set of coefficients, then $\boldsymbol{y}=\Phi \boldsymbol{c}+\boldsymbol{\epsilon}$, so $\Phi^T \Phi \widehat{\boldsymbol{c}}=\Phi^T \Phi \boldsymbol{c}+\Phi^T \boldsymbol{\epsilon}$ and therefore, provided $\Phi^T \Phi$ is invertible, $\widehat{\boldsymbol{c}}=\boldsymbol{c}+\left(\Phi^T \Phi\right)^{-1} \Phi^T \boldsymbol{\epsilon}$. The error in the computed coefficients $\widehat{\boldsymbol{c}}-\boldsymbol{c}=\left(\Phi^T \Phi\right)^{-1} \Phi^T \boldsymbol{\epsilon}$ is distributed according to the $\operatorname{Normal}\left(\mathbf{0}, \sigma^2\left(\Phi^T \Phi\right)^{-1}\right)$ distribution. This estimate is unbiased: $\mathbb{E}[\widehat{\boldsymbol{c}}]=\boldsymbol{c}$. We might want to estimate the variance $\sigma^2$ of the $\epsilon_i$'s by using the residual $\widehat{\boldsymbol{\epsilon}}=\boldsymbol{y}-\Phi \widehat{\boldsymbol{c}}$ in place of $\boldsymbol{\epsilon}$. However, this leads to a biased estimate of $\sigma^2$:
$$\begin{aligned} \widehat{\boldsymbol{\epsilon}} & =\boldsymbol{y}-\Phi \widehat{\boldsymbol{c}}=\boldsymbol{y}-\Phi\left(\Phi^T \Phi\right)^{-1} \Phi^T \boldsymbol{y} \\ & =\left[I-\Phi\left(\Phi^T \Phi\right)^{-1} \Phi^T\right] \boldsymbol{y} . \end{aligned}$$
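The two solution routes mentioned above, the normal equations and the QR factorization, can be compared on a small synthetic problem. The following is a minimal sketch rather than a definitive implementation: the monomial basis $\varphi_j(x)=x^j$, the sample size, the noise level, and the helper names `design_matrix`, `fit_normal_equations`, and `fit_qr` are all assumptions made for illustration.

```python
import numpy as np

def design_matrix(x, m):
    """Phi[i, j] = phi_j(x_i), using the assumed monomial basis phi_j(x) = x**j."""
    return np.vander(x, m, increasing=True)

def fit_normal_equations(Phi, y):
    """Solve the normal equations Phi^T Phi c = Phi^T y directly."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def fit_qr(Phi, y):
    """Solve the same least squares problem via the reduced QR factorization."""
    Q, R = np.linalg.qr(Phi)            # Q is N x m, R is m x m upper triangular
    return np.linalg.solve(R, Q.T @ y)  # R c = Q^T y

# Synthetic data from the model y = Phi c + eps (all values are illustrative).
rng = np.random.default_rng(0)
N, m, sigma = 50, 3, 0.1
x = np.linspace(0.0, 1.0, N)
c_true = np.array([1.0, -2.0, 0.5])
Phi = design_matrix(x, m)
y = Phi @ c_true + sigma * rng.standard_normal(N)

c_ne = fit_normal_equations(Phi, y)
c_qr = fit_qr(Phi, y)
residual = y - Phi @ c_qr               # the residual vector, epsilon-hat

print("normal equations:", c_ne)
print("QR factorization:", c_qr)
print("max |Phi^T residual|:", np.abs(Phi.T @ residual).max())
```

Both routes should agree closely on a well-conditioned problem like this one. Forming $\Phi^T \Phi$ squares the condition number of $\Phi$, so the QR route is usually preferred when the basis functions are nearly linearly dependent.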
The matrix $P:=I-\Phi\left(\Phi^T \Phi\right)^{-1} \Phi^T$ is the orthogonal projection onto the orthogonal complement of $\operatorname{range} \Phi$. Since $\boldsymbol{y}=\Phi \boldsymbol{c}+\boldsymbol{\epsilon}$ and $P \Phi=0$,
$$\widehat{\boldsymbol{\epsilon}}=P(\Phi \boldsymbol{c}+\boldsymbol{\epsilon})=P \boldsymbol{\epsilon}$$
so $\widehat{\boldsymbol{\epsilon}}$ is in $(\operatorname{range} \Phi)^{\perp}$. We can consider $\widehat{\boldsymbol{\epsilon}}$ to be distributed according to the $\operatorname{Normal}\left(\mathbf{0}, \sigma^2 P\right)$ distribution, since $P^T=P=P^2=P^T P$ gives the covariance $\sigma^2 P P^T=\sigma^2 P$. Because $P$ is singular, this distribution is understood as the limit of $\operatorname{Normal}\left(\mathbf{0}, \sigma^2 P+\alpha I\right)$ as $\alpha \downarrow 0$. Also
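$$\mathbb{E}\left[\widehat{\boldsymbol{\epsilon}}^T \widehat{\boldsymbol{\epsilon}}\right]=\mathbb{E}\left[\boldsymbol{\epsilon}^T P^T P \boldsymbol{\epsilon}\right]=\mathbb{E}\left[\boldsymbol{\epsilon}^T P \boldsymbol{\epsilon}\right]=\operatorname{trace}(P) \sigma^2$$
When $\Phi$ has full column rank, $P$ projects onto a subspace of dimension $N-m$, so $\operatorname{trace}(P)=N-m$. Dividing $\widehat{\boldsymbol{\epsilon}}^T \widehat{\boldsymbol{\epsilon}}$ by $N$ therefore underestimates $\sigma^2$ on average, while dividing by $N-m$ gives an unbiased estimate.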
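The expectation formula above can be checked numerically. The sketch below draws many noise vectors, projects them with $P$, and compares the average residual sum of squares with $\operatorname{trace}(P)\,\sigma^2$. The random design matrix, the dimensions, and the number of trials are assumptions chosen only for illustration, and the matrix is assumed to have full column rank so that $\operatorname{trace}(P)=N-m$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, sigma = 20, 4, 0.5
Phi = rng.standard_normal((N, m))   # illustrative design matrix, assumed full column rank

# P = I - Phi (Phi^T Phi)^{-1} Phi^T: orthogonal projection onto (range Phi)^perp
P = np.eye(N) - Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)

trials = 20000
eps = sigma * rng.standard_normal((trials, N))  # each row is one noise vector epsilon
eps_hat = eps @ P                               # rows are P @ epsilon, because P is symmetric
rss = np.sum(eps_hat**2, axis=1)                # epsilon-hat^T epsilon-hat for each trial

mean_rss = rss.mean()
naive = mean_rss / N                # divides by N: biased low
corrected = mean_rss / (N - m)      # divides by trace(P) = N - m

print("trace(P):", np.trace(P))
print("sigma^2:", sigma**2)
print("naive estimate:", naive)
print("corrected estimate:", corrected)
```

With these settings the naive estimate should settle near $\sigma^2 (N-m)/N$ rather than $\sigma^2$, which is the bias the derivation predicts.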

## Bayesian Inference

Given a data set $D$, how likely is a hypothesis $H$? This is the conditional probability $\operatorname{Pr}[H \mid D]$, which is usually very difficult to compute directly. It is typically much easier to compute $\operatorname{Pr}[D \mid H]$, since the hypothesis $H$ is a statement about the nature of the data $D$. From Bayes’ theorem,
$$\operatorname{Pr}[H \mid D]=\frac{\operatorname{Pr}[D \& H]}{\operatorname{Pr}[D]}=\frac{\operatorname{Pr}[D \mid H] \operatorname{Pr}[H]}{\sum_{H^{\prime}} \operatorname{Pr}\left[D \mid H^{\prime}\right] \operatorname{Pr}\left[H^{\prime}\right]}$$
where $H^{\prime}$ ranges over all plausible hypotheses. The value $\operatorname{Pr}[H]$ is the probability that hypothesis $H$ is true before we have any data about it. This probability $\operatorname{Pr}[H]$ is the a priori probability, while $\operatorname{Pr}[H \mid D]$ is the a posteriori probability of hypothesis $H$. Estimating $\operatorname{Pr}[H]$ is often a subjective matter. Consider the question, “What is the probability that the sun will rise every 24 hours?” We might have personally observed this tens of thousands of times, and historical records cover hundreds of thousands of sunrises before that. But what probability should we assign to this hypothesis before we have any evidence, or before we know anything about the sun? Without evidence we have very little basis for computing $\operatorname{Pr}[H]$. We can make an arbitrary assignment $\operatorname{Pr}[H]=\frac{1}{2}$. Centuries of observing the sun rise every day would then give $\operatorname{Pr}[H \mid D]$ very close to one.
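The sunrise example can be turned into a small calculation with Bayes' theorem over a finite set of hypotheses. This is only a sketch under stated assumptions: the single alternative hypothesis (the sun rises each day independently with probability `p_alt`), the value of `p_alt`, the number of observed days, and the helper name `posterior` are all hypothetical choices made for illustration, not part of the original discussion.

```python
def posterior(priors, likelihoods):
    """Return Pr[H | D] for every hypothesis H, given Pr[H] and Pr[D | H]."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())   # Pr[D] = sum over H' of Pr[D | H'] Pr[H']
    return {h: joint[h] / evidence for h in joint}

n_days = 1000    # number of consecutive observed sunrises (illustrative)
p_alt = 0.99     # assumed daily sunrise probability under the alternative

# Arbitrary a priori assignment Pr[H] = 1/2 for each hypothesis.
priors = {"always_rises": 0.5, "rises_with_prob_p": 0.5}

# Pr[D | H]: the data D is "the sun rose on every one of n_days days".
likelihoods = {
    "always_rises": 1.0,
    "rises_with_prob_p": p_alt ** n_days,
}

post = posterior(priors, likelihoods)
for hypothesis, prob in post.items():
    print(hypothesis, prob)
```

Even with the arbitrary prior of $\frac{1}{2}$, the a posteriori probability of the first hypothesis moves close to one as `n_days` grows, because the likelihood of an unbroken run under the alternative shrinks geometrically.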


