## Complements

The Least Squares Central Limit Theorem 2.8 is often a good approximation if $n \geq 10p$ and the error distribution has “light tails,” i.e., the probability of an outlier is nearly 0 and the tails go to zero at an exponential rate or faster. For error distributions with heavier tails, much larger samples are needed, and the assumption that the variance $\sigma^2$ exists is crucial; e.g., Cauchy errors are not allowed. Norman and Streiner (1986, p. 63) recommend the less demanding rule $n \geq 5p$.
The classical MLR prediction interval does not work well and should be replaced by the Olive (2007) asymptotically optimal PI (2.20). Lei and Wasserman (2014) provide an alternative: use the Lei et al. (2013) PI $\left[\tilde{r}_L, \tilde{r}_U\right]$ on the residuals, then the PI for $Y_f$ is
$$\left[\hat{Y}_f+\tilde{r}_L, \hat{Y}_f+\tilde{r}_U\right]$$
Bootstrap PIs need more theory, and instead of using $B=1000$ bootstrap samples, one should use $B=\max(1000, n)$. See Olive (2014, pp. 279-285).
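The idea of shifting an interval for the residuals by the fitted value can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it uses plain empirical quantiles of the OLS residuals as a stand-in for $\left[\tilde{r}_L, \tilde{r}_U\right]$, so it is neither the Lei et al. (2013) construction nor the Olive (2007) PI (2.20). The function name `residual_quantile_pi` and the simulated data are hypothetical.

```python
import numpy as np

def residual_quantile_pi(X, y, x_f, delta=0.05):
    """Shift empirical residual quantiles by the fitted value at x_f.

    Assumption: plain sample quantiles of the OLS residuals, with no
    finite-sample correction factor.
    """
    X1 = np.column_stack([np.ones(len(y)), X])          # add the constant
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)   # OLS fit
    resid = y - X1 @ beta_hat
    r_lo, r_hi = np.quantile(resid, [delta / 2, 1 - delta / 2])
    y_f_hat = np.concatenate([[1.0], np.atleast_1d(x_f)]) @ beta_hat
    return y_f_hat + r_lo, y_f_hat + r_hi

# Simulated example (hypothetical data): p = 3 coefficients, n = 200 >= 10p.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)
lo, hi = residual_quantile_pi(X, y, x_f=np.array([0.5, 0.5]))
print(f"95% PI for Y_f: [{lo:.2f}, {hi:.2f}]")
```

With iid $N(0,1)$ errors the interval length should be roughly $2(1.96)$, which gives a quick sanity check on the output.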

For the additive error regression model $Y=m(\boldsymbol{x})+e$, the response plot of $\hat{Y}=\hat{m}(\boldsymbol{x})$ vs. $Y$, with the identity line added as a visual aid, is used like the MLR response plot. We want $n \geq 10\, df$ where $df$ is the degrees of freedom from fitting $\hat{m}$. Olive (2013a) provides PIs for this model, including the location model. These PIs are large sample PIs provided that the sample quantiles of the residuals are consistent estimators of the population quantiles of the errors. The response plot and PIs could also be used for methods described in James et al. (2013) such as ridge regression, lasso, principal components regression, and partial least squares. See Pelawa Watagoda and Olive (2017) if $n$ is not large compared to $p$.

## Lack of Fit Tests

Then $MSPE = SSPE/(n-c)$ is an unbiased estimator of $\sigma^2$ when model (2.29) holds, regardless of the form of $m$. The PE in SSPE stands for “pure error.”

Now $SSLF = SSE - SSPE = \sum_{j=1}^c n_j\left(\bar{Y}_j-\hat{Y}_j\right)^2$. Notice that $\bar{Y}_j$ is an unbiased estimator of $m\left(\boldsymbol{x}_j\right)$, while $\hat{Y}_j$ is an estimator of $m\left(\boldsymbol{x}_j\right)$ if the MLR model is appropriate: $m\left(\boldsymbol{x}_j\right)=\boldsymbol{x}_j^T \boldsymbol{\beta}$. Hence SSLF and MSLF can be very large if the MLR model is not appropriate.
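The identity for SSLF can be checked directly. As an assumption about notation (the defining text is not reproduced here), write $Y_{ij}$ for the $i$th of the $n_j$ responses observed at the $j$th distinct predictor vector $\boldsymbol{x}_j$, and take the usual pure error sum of squares $SSPE=\sum_{j=1}^c \sum_{i=1}^{n_j}\left(Y_{ij}-\bar{Y}_j\right)^2$. Adding and subtracting $\bar{Y}_j$ inside each squared residual gives

$$
\begin{aligned}
SSE &= \sum_{j=1}^c \sum_{i=1}^{n_j}\left(Y_{ij}-\hat{Y}_j\right)^2 \\
&= \sum_{j=1}^c \sum_{i=1}^{n_j}\left(Y_{ij}-\bar{Y}_j\right)^2
+ \sum_{j=1}^c n_j\left(\bar{Y}_j-\hat{Y}_j\right)^2
+ 2 \sum_{j=1}^c\left(\bar{Y}_j-\hat{Y}_j\right) \sum_{i=1}^{n_j}\left(Y_{ij}-\bar{Y}_j\right) \\
&= SSPE + \sum_{j=1}^c n_j\left(\bar{Y}_j-\hat{Y}_j\right)^2 ,
\end{aligned}
$$

since $\sum_{i=1}^{n_j}\left(Y_{ij}-\bar{Y}_j\right)=0$ for each $j$, so the cross term vanishes.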

The four-step lack of fit test is
i) $H_0$: no evidence of MLR lack of fit, $H_A$: there is lack of fit for the MLR model.
ii) $F_{LF}=MSLF/MSPE$.
iii) The pval $=P\left(F_{c-p, n-c}>F_{LF}\right)$.
iv) Reject $H_0$ if pval $\leq \delta$ and state the $H_A$ claim that there is lack of fit. Otherwise, fail to reject $H_0$ and state that there is not enough evidence to conclude that there is MLR lack of fit.
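The four steps can be sketched numerically as follows. This is a minimal illustration, not a reference implementation: the function name `lack_of_fit_test` is hypothetical, the model is assumed to contain a constant, replicates are taken to be cases whose predictor rows are exactly equal, and the data must satisfy $c>p$ and $n>c$. The upper tail probability uses `scipy.stats.f.sf`.

```python
import numpy as np
from scipy import stats

def lack_of_fit_test(X, y):
    """Return (F_LF, pval, c) for an OLS fit with a constant.

    Assumptions: replicates are rows of X that are exactly equal,
    and c > p, n > c so both mean squares are defined.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    X1 = np.column_stack([np.ones(n), X])
    p = X1.shape[1]
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    sse = np.sum((y - X1 @ beta_hat) ** 2)

    # Label each case by its distinct predictor vector x_j.
    _, group = np.unique(X, axis=0, return_inverse=True)
    group = group.ravel()
    c = int(group.max()) + 1
    n_j = np.bincount(group)
    ybar_j = np.bincount(group, weights=y) / n_j

    sspe = np.sum((y - ybar_j[group]) ** 2)   # pure error
    sslf = sse - sspe                         # lack of fit
    f_lf = (sslf / (c - p)) / (sspe / (n - c))
    pval = stats.f.sf(f_lf, c - p, n - c)     # P(F_{c-p, n-c} > F_LF)
    return f_lf, pval, c

# Hypothetical data: c = 6 predictor levels, 4 replicates each, p = 2.
rng = np.random.default_rng(1)
x = np.repeat(np.arange(1.0, 7.0), 4)
y_lin = 1.0 + 2.0 * x + rng.normal(size=x.size)   # MLR model holds
y_quad = y_lin + x ** 2                           # MLR model fails
f_lin, p_lin, c_lin = lack_of_fit_test(x, y_lin)
f_quad, p_quad, c_quad = lack_of_fit_test(x, y_quad)
print(f"linear mean:    F_LF = {f_lin:.2f}, pval = {p_lin:.3f}")
print(f"quadratic mean: F_LF = {f_quad:.2f}, pval = {p_quad:.2e}")
```

In the second data set the mean function is quadratic in $x$, so the $\bar{Y}_j$ sit far from the fitted line and $F_{LF}$ is large.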

Although the lack of fit test seems clever, examining the response plot and residual plot is a much more effective method for examining whether or not the MLR model fits the data well, provided that $n \geq 10p$. A graphical version of the lack of fit test would compute the $\bar{Y}_j$ and see whether they scatter about the identity line in the response plot. When there are no replicates, the range of $\hat{Y}$ could be divided into several narrow nonoverlapping intervals called slices. Then the mean $\bar{Y}_j$ of each slice could be computed and a step function with step height $\bar{Y}_j$ at the $j$th slice could be plotted. If the step function follows the identity line, then there is no evidence of lack of fit. However, it is easier to check whether the $Y_i$ are scattered about the identity line. Examining the residual plot is useful because it magnifies deviations from the identity line that may be difficult to see until the linear trend is removed. The lack of fit test may be sensitive to the assumption that the errors are iid $N\left(0, \sigma^2\right)$.
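The slice-based step function described above can be sketched as follows. The choices here are assumptions made for illustration: equal-width slices over the range of $\hat{Y}$, five slices, and the hypothetical helper name `slice_means`. Each slice mean $\bar{Y}_j$ is compared with the mean fitted value in the same slice, which is where the identity line sits on average over that slice.

```python
import numpy as np

def slice_means(y_hat, y, n_slices=5):
    """Slice the range of the fitted values and average Y in each slice.

    Returns a list of (left_edge, right_edge, mean_Y, mean_Y_hat),
    skipping empty slices. Equal-width slices are an assumption.
    """
    edges = np.linspace(y_hat.min(), y_hat.max(), n_slices + 1)
    # Use interior edges only, so the largest fitted value lands in the last slice.
    idx = np.digitize(y_hat, edges[1:-1])
    steps = []
    for j in range(n_slices):
        in_slice = idx == j
        if in_slice.any():
            steps.append((edges[j], edges[j + 1],
                          y[in_slice].mean(), y_hat[in_slice].mean()))
    return steps

# Hypothetical data with no replicates where the MLR model holds.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=300)
y = 1.0 + 2.0 * x + rng.normal(size=300)
X1 = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta_hat

steps = slice_means(y_hat, y, n_slices=5)
max_gap = max(abs(ybar - fbar) for _, _, ybar, fbar in steps)
for left, right, ybar, fbar in steps:
    print(f"[{left:6.2f}, {right:6.2f}]  mean Y = {ybar:6.2f}  mean fit = {fbar:6.2f}")
```

Small gaps between the two means in every slice correspond to a step function that follows the identity line. A systematic pattern in the gaps, such as positive at the ends and negative in the middle, would suggest lack of fit.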

